
[Help-smalltalk] Re: [bug] UnicodeString conversion truncation


From: Paolo Bonzini
Subject: [Help-smalltalk] Re: [bug] UnicodeString conversion truncation
Date: Mon, 22 Oct 2007 10:57:10 +0200
User-agent: Thunderbird 2.0.0.6 (Macintosh/20070728)


   String>>#jsonPrintOn:
      (self anySatisfy: [ :ch | ch value between: 128 and: 255 ])
             ifTrue: [ self asUnicodeString jsonPrintOn: aStream ]
             ifFalse: [ super jsonPrintOn: aStream ]

Why print strings that have non-ASCII chars differently?

Because, say, a UTF-8-encoded string containing the characters 195 and 160 should print as "\u00E0", not as "à" (that's a lowercase accented 'a'). The easiest way to convert the two bytes to a single character is #asUnicodeString, since in GNU Smalltalk Strings are sequences of bytes and UnicodeStrings are sequences of characters.
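For instance, a quick sketch at a GNU Smalltalk prompt (assuming the default encoding on the system is UTF-8; #codePoint is the accessor used later in this message):

    | s u |
    s := String with: (Character value: 195) with: (Character value: 160).
    u := s asUnicodeString.
    u size                "should be 1: the two bytes decode to one character"
    (u at: 1) codePoint   "should be 224, i.e. 16r00E0"

Printing u character by character then naturally produces the single escape \u00E0.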

Actually, to support ISO-2022-JP and similar encodings (which use a sequence introduced by ESC to switch between latin and double-byte characters), one of us should probably change jsonPrintOn: to use

    (self allSatisfy: [ :ch | ch value between: 32 and: 126 ])
        ifFalse: [ self asUnicodeString jsonPrintOn: aStream ]
        ifTrue: [ super jsonPrintOn: aStream ]

(Note that even this, unfortunately, would not cater for UTF-7. You can safely ignore that case, because UTF-7 is terminally broken, and all you should do with UTF-7 is convert it to a saner encoding as soon as you read something in it.)

And this in the string parsing code:

            c = $u ifTrue: [
                c := (Integer readFrom: (stream next: 4) readStream radix: 16)
                        asCharacter.
                (c class == UnicodeCharacter and: [ str species == String ])
                    ifTrue: [ str := (UnicodeString new writeStream
                        nextPutAll: str contents; yourself) ] ].
            str nextPut: c.

What the code does now is operate on UnicodeStrings only when it considers it necessary: if there are no \uXXXX escapes, it uses String, because valid JSON only has 7-bit characters in strings.
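In other words (the parse selector below is made up for illustration and is not necessarily the real JSONReader API):

    "Hypothetical usage, sketching the intended behaviour."
    (JSONReader parse: '"abc"') class      "String: no \u escape was seen"
    (JSONReader parse: '"\u00E0"') class   "UnicodeString: a \u escape forced the switch"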

Would you object if I change the json code to operate on UnicodeStrings only?

I would like to understand why you need this, but no, I would not object, especially because I consider JSON your code, not mine. I just helped a bit. :-)

I think you wouldn't be able to operate on UnicodeStrings only, unless I fix the bug with String/UnicodeString hashes (see below).

I don't know if after the explanation above you still want JSON to operate on UnicodeStrings only.

Strictly and semantically, the JSON implementation should operate only on UnicodeStrings, as JSON is only parseable as Unicode. (I wonder what happens when the current JSON reader encounters a UTF-16-encoded String; as far as my test went, it just didn't work, because it doesn't expect multibyte encodings in String.)

JSON is not supposed to include non-Latin-1 characters. Everything that's not 7-bit encodable should be escaped using \uXXXX.

What puzzles me is the question what JSONReader>>#nextJSONString should
return. Should it be a String or a UnicodeString?

Strictly speaking it should return a UnicodeString, but it's easier to use it, and faster, if (when it's possible) we let it return a String. Switching to UnicodeStrings as soon as we find a \uXXXX is a conservative approximation of "when it's possible".

Probably, what is missing from GNU Smalltalk's Iconv package is an "Encoding" object that can answer queries like "is this string pure ASCII?", the default very slow implementation being something like this:

    str := self asString.
    uniStr := self asUnicodeString.
    str size = uniStr size ifFalse: [ ^false ].
    str with: uniStr do: [ :ch :uni |
        ch value = uni codePoint ifFalse: [ ^false ] ].
    ^true

This snippet would provide a more rigorous definition of "when it's possible".

If it returns UnicodeString, no literal-string access on a Dictionary returned by the JSON parser will work, as the lookup would use a String object, which has a different hash function than UnicodeString.

Hmmm, this has to be fixed.
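A sketch of the failure mode, assuming String and UnicodeString currently answer different hashes for equal contents:

    | d |
    d := Dictionary new.
    d at: 'key' asUnicodeString put: 42.
    "The literal String 'key' hashes differently, so this lookup misses:"
    d at: 'key' ifAbsent: [ nil ]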

Paolo




