tklingenberg at lastflood dot net

Hi bukka,

Thank you for taking the time to look into it.

But I'm very sorry to point out that the information you've provided in your comment is at best only remotely related to this issue and does not touch the root cause of the flaw reported here. You've perhaps been misled by the internals mailings (I haven't read those); the part you quote is about binary string data.

But the report I created is *not* about binary data. As you can see, the string presented is US-ASCII without any control characters. It's about the strings _represented_ by JSON (not in binary), and more specifically the option to use a \uXXXX (six characters) escape sequence for any character in the Basic Multilingual Plane (U+0000 through U+FFFF). U+D834 is not a character in the Basic Multilingual Plane (see Unicode).

If a string containing the related binary sequence had been passed to json_decode - which, as you say, would be allowed by the JSON spec - PHP handles it correctly according to the documented contract: the binary sequence would qualify as *not* being a UTF-8 string and therefore the result of the function is unexpected. That's covered by the specs you quote, but it is not the flaw I reported here. As you can see, this is a different example and I can't see that PHP violates the spec or its own contract here.

bukka

I think you don't understand the RFC and the string ABNF:

    string = quotation-mark *char quotation-mark

This line specifically shows that:

    %x75 4HEXDIG

It doesn't say anything about prohibiting a surrogate sequence in a unicode escape. Actually the note about that is in section 8.2, which I quoted before:

    However, the ABNF in this specification allows member names and
    string values to contain bit sequences that cannot encode Unicode
    characters; for example, "\uDEAD" (a single unpaired UTF-16
    surrogate).

That is all about escaped sequences (see "\uDEAD"). Of course, binary strings have to be correctly encoded in UTF-8 (the only supported input encoding for the PHP json parser) as stated in section 8.1.

I understand that such unicode escapes might be inconvenient and that's why I emailed internals about introducing a new constant for it that will address your issue. I plan to merge it to master next week but it will be just a non-default option, as we have to have an RFC-compliant parser. I actually had it already implemented as a default when I was rewriting json for PHP 7. Then I noticed that the RFC says this and I removed it. I changed it in jsond as well, but in this case I added the constant, which is what I plan to add to json now.

tklingenberg at lastflood dot net

Hi Bukka,

I can perfectly see that the ABNF allows such character sequences. However, to make "\u" mark a valid escape sequence, section 7 clearly documents when it qualifies as an escape sequence:

    If the character is in the Basic Multilingual Plane (U+0000 through
    U+FFFF), then it may be represented as a six-character sequence: a
    reverse solidus, followed by the lowercase letter u, followed by
    four hexadecimal digits that encode the character's code point.

So if what follows that "\u" does not represent such a character, it was not this escape sequence.

Section 8.2 only says that a user providing such data should not expect it to work in a decoder ("[...] suffer fatal runtime exceptions."); it does not forbid a decoder from refusing to process this wrongly encoded character data.

In the case of this report, the only last chance for PHP to escape from this madness so far was to document the return value of json_decode as string (without the requirement that it be valid UTF-8). I think this is the actual flaw: the contract of json_decode is too broad. It should not just return some binary string; all strings returned from the decoding should be Unicode in the UTF-8 encoding.
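The disagreement above can be made concrete with a small sketch. It uses Python's standard json module purely as a neutral illustration; it says nothing about how PHP's json_decode behaves, and the helper names is_surrogate and decodes_to_valid_utf8 are hypothetical, introduced only for this example. It shows that the JSON text "\ud834" is plain ASCII and matches the ABNF, yet the code point it names is a surrogate and therefore has no valid UTF-8 encoding, while a complete surrogate pair does.

```python
import json


def is_surrogate(code_point: int) -> bool:
    """True if the code point lies in the UTF-16 surrogate range."""
    return 0xD800 <= code_point <= 0xDFFF


def decodes_to_valid_utf8(json_text: str) -> bool:
    """Decode a JSON string literal and check whether the result
    can be encoded as well-formed UTF-8 (hypothetical helper)."""
    value = json.loads(json_text)
    try:
        value.encode("utf-8")
    except UnicodeEncodeError:
        # Python refuses to encode an unpaired surrogate as UTF-8.
        return False
    return True


# The JSON text from the report: six ASCII characters inside quotes.
lone = '"\\ud834"'
# The same high surrogate followed by a matching low surrogate.
pair = '"\\ud834\\udd1e"'

print(lone.isascii())               # the input itself is pure US-ASCII
print(is_surrogate(0xD834))         # the escaped code point is a surrogate
print(decodes_to_valid_utf8(lone))  # unpaired: no valid UTF-8 form
print(decodes_to_valid_utf8(pair))  # paired: decodes to a single character
```

Python's decoder happens to accept the unpaired escape and hands back a string that cannot be encoded as UTF-8, which is one of the "unpredictable" outcomes section 8.2 warns about. A decoder that rejects the input instead would be making the other choice discussed in the thread.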