
Re: [Help-smalltalk] Re: [bug] UnicodeString conversion truncation


From: Paolo Bonzini
Subject: Re: [Help-smalltalk] Re: [bug] UnicodeString conversion truncation
Date: Mon, 22 Oct 2007 12:03:10 +0200
User-agent: Thunderbird 2.0.0.6 (Macintosh/20070728)


I don't know if after the explanation above you still want JSON to operate on UnicodeStrings only.

Because a JSON parser can only process characters, not the bytes of some
multibyte encoding. As far as I understand, '(ReadStream on: String)
next' returns a Character in the range 0 to: 255, which represents
one byte of the string's multibyte encoding.

Or am I wrong, and will String>>#next sometimes return a UnicodeCharacter?

No, you're right. However, note that there are no UnicodeCharacters below 128. There, the two spaces overlap. So, if the characters are all 7-bit, a ReadStream on a String and one on a UnicodeString will be indistinguishable.
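To make the overlap concrete, here is a minimal sketch in Python (not GNU Smalltalk; the helper names `byte_values` and `code_points` are made up for illustration, and UTF-8 is assumed as the byte encoding). It compares what a byte-oriented stream and a character-oriented stream would yield for the same text.

```python
def byte_values(text, encoding="utf-8"):
    """Values a byte-oriented stream would yield, one per encoded byte."""
    return list(text.encode(encoding))


def code_points(text):
    """Values a character-oriented stream would yield, one per character."""
    return [ord(ch) for ch in text]


ascii_text = "JSON"
accented = "caf\u00e9"  # ends with U+00E9, which is outside the 7-bit range

# For 7-bit text the two views are identical.
print(byte_values(ascii_text) == code_points(ascii_text))

# Beyond 7 bits they diverge: U+00E9 becomes two UTF-8 bytes.
print(byte_values(accented))
print(code_points(accented))
```

With 7-bit input the two lists are equal; as soon as a character above 127 appears, the byte view grows longer than the character view.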

   "JSON text SHALL be encoded in Unicode. The default encoding is UTF-8."

The JSON parser has no choice but to operate on Unicode characters.
Parsing a UTF-16 encoded JSON text byte-wise will just not work :)
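As a rough illustration of why byte-wise parsing breaks down, the following Python sketch (Python is used only for illustration; the variable names are arbitrary) encodes a tiny JSON text as UTF-16 little-endian and looks at the first few bytes.

```python
import json

text = '{"a": 1}'
utf16 = text.encode("utf-16-le")

# Each ASCII character is followed by a zero byte, so a parser that reads
# one byte at a time sees a NUL right after the opening brace.
first_bytes = list(utf16[:4])
print(first_bytes)

# Decoding to characters first gives the parser what it expects.
parsed = json.loads(utf16.decode("utf-16-le"))
print(parsed)
```

The parser only makes sense once the bytes have been decoded into characters.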

Oops, my fault.  And can you specify different encodings?

   "Any character may be escaped."

Sure, you can write "\u0041". In this case the JSON reader will return a UnicodeString. That's why I wrote "switching to UnicodeStrings as soon as we find a \uXXXX is a conservative approximation".
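A small Python sketch of what "conservative" means here (the function `may_need_unicode` is hypothetical and only mimics the rule described above, it is not the actual JSON reader): the check fires on any \u escape, even when the decoded character turns out to be plain ASCII.

```python
import json


def may_need_unicode(raw_json):
    """Conservative rule: any backslash-u escape selects the wide string type.

    This can report True for text that decodes to pure ASCII, but it never
    reports False for text that actually contains an escaped non-ASCII
    character.
    """
    return "\\u" in raw_json


escaped_ascii = r'"\u0041"'      # decodes to "A"
escaped_latin = r'"\u00e9"'      # decodes to U+00E9
plain = '"A"'

for raw in (escaped_ascii, escaped_latin, plain):
    decoded = json.loads(raw)
    print(raw, may_need_unicode(raw), decoded.isascii())
```

The first case is the false positive that makes the rule an approximation: the escape is present, yet the result would have fit in a byte string.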

Regarding the internal encoding of Unicode strings, Perl has an interesting concept:
it stores Unicode strings internally either as iso-8859 or as UTF-X (some extended
form of UTF-8 encoding which can encode arbitrary integer values), and this isn't
visible on the language level. On the language level a String is a sequence of
integers interpreted as Unicode characters.

Unfortunately, compatibility with pre-Unicode Smalltalk was a mess to achieve, and actually it still has some problems (mostly the hashing problem I refer to below). So, I really have to thank you for working out the bugs before 3.0.

Probably, what is missing from GNU Smalltalk's Iconv package is an "Encoding" object that can answer queries like "is this string pure ASCII?", the default very slow implementation being something like this:

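    "Convert both ways; a pure-ASCII string has one byte per character."
    str := self asString.
    uniStr := self asUnicodeString.
    str size = uniStr size ifFalse: [ ^false ].
    "Each byte value must equal the corresponding code point."
    str with: uniStr do: [ :ch :uni |
        ch value = uni codePoint ifFalse: [ ^false ] ].
    ^true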

This snippet would provide a more rigorous definition of "when it's possible".

If it returns a UnicodeString, no lookup with a literal string on a Dictionary returned by the JSON parser will work, because the literal is only a String object, which has a different
hash function than UnicodeString.
Hmmm, this has to be fixed.

Is it fixable?

To some extent, it should be. For example, in the case of UnicodeStrings I can define the hash to be computed by translating (internally) to UTF-8 and hashing the result. Then, we can cross our fingers and hope that Strings are also UTF-8 (good bet nowadays), and just implement EncodedString>>hash as "^self asUnicodeString hash".
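The idea can be sketched in Python as follows. The classes `UnicodeText` and `ByteText` are hypothetical stand-ins for UnicodeString and String (they are not GNU Smalltalk classes), and the byte strings are assumed to be UTF-8, which is exactly the bet described above.

```python
class UnicodeText:
    """Stand-in for UnicodeString: holds a sequence of code points."""

    def __init__(self, text):
        self.text = text

    def utf8(self):
        # Translate internally to UTF-8; this is what gets hashed.
        return self.text.encode("utf-8")

    def __eq__(self, other):
        return hasattr(other, "utf8") and self.utf8() == other.utf8()

    def __hash__(self):
        return hash(self.utf8())


class ByteText:
    """Stand-in for String: holds raw bytes, assumed to be UTF-8."""

    def __init__(self, data):
        self.data = data

    def utf8(self):
        return self.data

    def __eq__(self, other):
        return hasattr(other, "utf8") and self.utf8() == other.utf8()

    def __hash__(self):
        return hash(self.utf8())


# A dictionary keyed by the wide type can now be queried with the byte type.
table = {UnicodeText("stra\u00dfe"): "street"}
found = table[ByteText("stra\u00dfe".encode("utf-8"))]
print(found)
```

If the byte string were in some other encoding (Latin-1, say), the hashes would no longer agree, which is the "cross our fingers" part.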

As a note to myself, this would mean also skipping the first 3 bytes of a String to be hashed, if they are the BOM.
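A minimal Python sketch of that note, assuming the bytes are UTF-8 (the helper name `hashable_bytes` is made up for illustration): the three-byte UTF-8 BOM is dropped before hashing, so a string with a BOM and the same string without one hash alike.

```python
import codecs


def hashable_bytes(data):
    """Return the bytes to hash, skipping a leading UTF-8 BOM if present."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data


with_bom = codecs.BOM_UTF8 + b"key"
without_bom = b"key"

print(len(codecs.BOM_UTF8))
print(hash(hashable_bytes(with_bom)) == hash(hashable_bytes(without_bom)))
```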

Of course it would already be helpful if it worked for ASCII characters
and Latin-1.

ASCII characters and UTF-8 please. :-) I'm also from a Latin-1 country, but I try to think as internationally as possible. :-)

(btw. is Smalltalk case-insensitive, or do class names have to start with
an uppercase character? Or is that just a convention?)

Just a convention.

Paolo



