Re: [Help-smalltalk] Re: [bug] UnicodeString conversion truncation

From:	Paolo Bonzini
Subject:	Re: [Help-smalltalk] Re: [bug] UnicodeString conversion truncation
Date:	Mon, 22 Oct 2007 12:03:10 +0200
User-agent:	Thunderbird 2.0.0.6 (Macintosh/20070728)

I don't know if after the explanation above you still want JSON to operate on UnicodeStrings only.


Because a JSON Parser can only process characters and not bytes of some
multibyte encoding. As far as I understood a '(ReadStream on: String)
next' will return me a Character in the range 0 to: 255 which represents
a byte of the multibyte encoding of the string.

Or am I wrong and String>>#next will return me an UnicodeCharacter sometimes?

No, you're right. However, note that there are no UnicodeCharacters below 128. There, the two spaces overlap. So, if the characters are all 7-bit, a ReadStream on a String and one on a UnicodeString will be indistinguishable.
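
The overlap can be illustrated outside Smalltalk with a small Python sketch. Python's UTF-8 bytes and `str` are used here only as stand-ins for String and UnicodeString (an assumption made for illustration, not a description of GNU Smalltalk's internals). For 7-bit text, reading byte by byte and reading character by character yield the same integers; a single non-ASCII character is enough to make them differ.

```python
def byte_values(text):
    """Integers seen when reading the UTF-8 encoding byte by byte."""
    return list(text.encode("utf-8"))


def code_points(text):
    """Integers seen when reading the text character by character."""
    return [ord(ch) for ch in text]


ascii_text = "JSON"
latin_text = "caf\u00e9"  # U+00E9 takes two bytes in UTF-8

# True only when every character fits in 7 bits.
same_for_ascii = byte_values(ascii_text) == code_points(ascii_text)
same_for_latin = byte_values(latin_text) == code_points(latin_text)

print(same_for_ascii, same_for_latin)
```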

   "JSON text SHALL be encoded in Unicode. The default encoding is UTF-8."

The JSON parser has no choice but to operate on Unicode characters.
Parsing a UTF-16 encoded JSON text byte-wise will just not work :)


Oops, my fault.  And can you specify different encodings?

   "Any character may be escaped."

Sure, you can write "\u0041". In this case the JSON reader will return a UnicodeString. That's why I wrote "switching to UnicodeStrings as soon as we find a \uXXXX is a conservative approximation".
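
Why the approximation is conservative can be seen with Python's standard `json` module, used here purely as an illustration (it is not the GNU Smalltalk JSON reader, and `is_ascii` is a hypothetical helper). An escape such as \u0041 decodes to a plain ASCII letter, so the presence of a \uXXXX escape does not by itself mean the result needs a Unicode string.

```python
import json


def is_ascii(text):
    """True when every character of the decoded string is below 128."""
    return all(ord(ch) < 128 for ch in text)


# The JSON texts are "\u0041" and "\u00e9"; the backslash is doubled
# only because it sits inside a Python string literal.
escaped_ascii = json.loads('"\\u0041"')
escaped_latin = json.loads('"\\u00e9"')

print(escaped_ascii, is_ascii(escaped_ascii), is_ascii(escaped_latin))
```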

w.r.t. internal encoding of Unicode strings: Perl has an interesting concept:
It stores Unicode strings internally either as iso-8859 or UTF-X (some extended
form of UTF-8 encoding which can encode arbitrary integer values), which isn't
visible on the language level. On the language level a String is a sequence of
integers interpreted as Unicode characters.

Unfortunately, compatibility with pre-Unicode Smalltalk was a mess to achieve, and actually it still has some problems (mostly the hashing problem I refer to below). So, I really have to thank you for working out the bugs before 3.0.

Probably, what is missing from GNU Smalltalk's Iconv package is an "Encoding" object that can answer queries like "is this string pure ASCII?", the default very slow implementation being something like this:
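    "Compare the byte-oriented and the Unicode views of the receiver."
    str := self asString.
    uniStr := self asUnicodeString.
    "Different sizes mean some character took more than one byte."
    str size = uniStr size ifFalse: [ ^false ].
    str with: uniStr do: [ :ch :uni |
        ch value = uni codePoint ifFalse: [ ^false ] ].
    ^true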
This snippet would provide a more rigorous definition of "when it's possible".

If it returns UnicodeString no literal string access on a Dictionary returned by
the JSON parser will work as it would get only a String object which has a different
hash function than UnicodeString.
Hmmm, this has to be fixed.
Is it fixable?

To some extent, it should be. For example, in the case of UnicodeStrings I can define the hash to be computed by translating (internally) to UTF-8 and hashing the result. Then, we can cross our fingers and hope that Strings are also UTF-8 (good bet nowadays), and just implement EncodedString>>hash as "^self asUnicodeString hash".

As a note to myself, this would mean also skipping the first 3 bytes of a String to be hashed, if they are the BOM.
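
A rough Python sketch of that hashing idea, under stated assumptions: `str` stands in for UnicodeString, `bytes` for String, and `text_hash` is a hypothetical helper, not anything in GNU Smalltalk. Text is hashed through its UTF-8 form, byte strings are assumed to be UTF-8 already, and a leading UTF-8 BOM is skipped.

```python
import codecs
import hashlib


def text_hash(value):
    """Hash text via its UTF-8 bytes so str and UTF-8 bytes agree."""
    if isinstance(value, str):
        data = value.encode("utf-8")
    else:
        data = bytes(value)
        # Skip the 3-byte UTF-8 BOM if the byte string starts with it.
        if data.startswith(codecs.BOM_UTF8):
            data = data[len(codecs.BOM_UTF8):]
    return hashlib.sha1(data).hexdigest()


word = "caf\u00e9"
utf8_matches = text_hash(word) == text_hash(word.encode("utf-8"))
bom_ignored = text_hash(codecs.BOM_UTF8 + b"abc") == text_hash("abc")
# The bet on UTF-8 fails for other encodings: Latin-1 bytes hash differently.
latin1_matches = text_hash(word) == text_hash(word.encode("latin-1"))

print(utf8_matches, bom_ignored, latin1_matches)
```

The last comparison shows the limit of the "cross our fingers" step: a byte string in another encoding hashes differently from the same text as Unicode.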

Of course it would already be helpful if it would work for ASCII characters
and Latin-1

ASCII characters and UTF-8 please. :-) I'm also from a Latin-1 country, but I try to think as internationally as possible. :-)

(btw. does Smalltalk operate case-insensitive or have classnames to be
upper case for the first character? Or is that just a convention?)


Just a convention.

Paolo
