
[Help-smalltalk] Re: [bug] UnicodeString conversion truncation


From: Paolo Bonzini
Subject: [Help-smalltalk] Re: [bug] UnicodeString conversion truncation
Date: Mon, 22 Oct 2007 10:57:10 +0200
User-agent: Thunderbird 2.0.0.6 (Macintosh/20070728)


   String>>#jsonPrintOn:
      (self anySatisfy: [ :ch | ch value between: 128 and: 255 ])
             ifTrue: [ self asUnicodeString jsonPrintOn: aStream ]
             ifFalse: [ super jsonPrintOn: aStream ]

Why print strings that have non-ASCII chars differently?

Because, say, a UTF-8-encoded string containing the characters 195 and 160 should print as "\u00E0", not as "à" (that's a lowercase accented 'a'). The easiest way to convert the two bytes to a single character is #asUnicodeString, since in GNU Smalltalk Strings are sequences of bytes and UnicodeStrings are sequences of characters.
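For instance, a quick sketch at a GNU Smalltalk prompt (assuming the default encoding on the system is UTF-8; #codePoint is the accessor used later in this message):

    | s u |
    s := String with: (Character value: 195) with: (Character value: 160).
    u := s asUnicodeString.
    u size                "should be 1: the two bytes decode to one character"
    (u at: 1) codePoint   "should be 224, i.e. 16r00E0"

Printing u character by character then naturally produces the single escape \u00E0.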

Actually, to support ISO-2022-JP and similar encodings (which use a sequence introduced by ESC to switch between latin and double-byte characters), one of us should probably change jsonPrintOn: to use

    (self allSatisfy: [ :ch | ch value between: 32 and: 126 ])
        ifFalse: [ self asUnicodeString jsonPrintOn: aStream ]
        ifTrue: [ super jsonPrintOn: aStream ]

(Note that even this, unfortunately, would not cater for UTF-7. You can safely ignore that case, because UTF-7 is terminally broken, and all you should do with UTF-7 is convert it to a saner encoding as soon as you read something in it.)

And this in the string parsing code:

            c = $u ifTrue: [
                c := (Integer readFrom: (stream next: 4) readStream radix: 16)
                        asCharacter.
                (c class == UnicodeCharacter and: [ str species == String ])
                    ifTrue: [ str := (UnicodeString new writeStream
                        nextPutAll: str contents; yourself) ] ].
            str nextPut: c.

What the code does now is operate on UnicodeStrings only when it considers it necessary: if there are no \uXXXX escapes, it uses String, because valid JSON only has 7-bit characters in strings.
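In other words (the parse selector below is made up for illustration and is not necessarily the real JSONReader API):

    "Hypothetical usage, sketching the intended behaviour."
    (JSONReader parse: '"abc"') class      "String: no \u escape was seen"
    (JSONReader parse: '"\u00E0"') class   "UnicodeString: a \u escape forced the switch"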

Would you object if I change the json code to operate on UnicodeStrings only?

I would like to understand why you need this, but no, I would not object, especially because I consider JSON your code, not mine. I just helped a bit. :-)

I think you wouldn't be able to operate on UnicodeStrings only, unless I fix the bug with String/UnicodeString hashes (see below).

I don't know if after the explanation above you still want JSON to operate on UnicodeStrings only.

Strictly and semantically, the JSON implementation should operate only on UnicodeStrings, as JSON is only parseable as Unicode. (I wonder what happens when the current JSON reader encounters a UTF-16-encoded String; as far as my test went, it just didn't work, because it doesn't expect multibyte encodings in String.)

JSON is not supposed to include non-Latin-1 characters. Everything that's not 7-bit encodable should be escaped using \uXXXX.

What puzzles me is the question what JSONReader>>#nextJSONString should
return. Should it be a String or a UnicodeString?

Strictly speaking it should return a UnicodeString, but it's easier to use it, and faster, if (when it's possible) we let it return a String. Switching to UnicodeStrings as soon as we find a \uXXXX is a conservative approximation of "when it's possible".

Probably, what is missing from GNU Smalltalk's Iconv package is an "Encoding" object that can answer queries like "is this string pure ASCII?", the default very slow implementation being something like this:

    str := self asString.
    uniStr := self asUnicodeString.
    str size = uniStr size ifFalse: [ ^false ].
    str with: uniStr do: [ :ch :uni |
        ch value = uni codePoint ifFalse: [ ^false ] ].
    ^true

This snippet would provide a more rigorous definition of "when it's possible".

If it returns UnicodeString, no literal-string access on a Dictionary returned by the JSON parser will work, as the lookup would use a String object, which has a different hash function than UnicodeString.

Hmmm, this has to be fixed.
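A sketch of the failure mode, assuming String and UnicodeString currently answer different hashes for equal contents:

    | d |
    d := Dictionary new.
    d at: 'key' asUnicodeString put: 42.
    "The literal String 'key' hashes differently, so this lookup misses:"
    d at: 'key' ifAbsent: [ nil ]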

Paolo




