Re: [Help-smalltalk] Re: [bug] UnicodeString conversion truncation
From: Paolo Bonzini
Subject: Re: [Help-smalltalk] Re: [bug] UnicodeString conversion truncation
Date: Mon, 22 Oct 2007 12:03:10 +0200
User-agent: Thunderbird 2.0.0.6 (Macintosh/20070728)
I don't know if after the explanation above you still want JSON to
operate on UnicodeStrings only.
Because a JSON parser can only process characters, not the bytes of some
multibyte encoding. As far as I understand, '(ReadStream on: String)
next' will return a Character in the range 0 to: 255, which represents
one byte of the string's multibyte encoding.
Or am I wrong, and will String>>#next sometimes return a UnicodeCharacter?
No, you're right. However, note that there are no UnicodeCharacters
below 128. There, the two spaces overlap. So, if the characters are
all 7-bit, a ReadStream on a String and one on a UnicodeString will be
indistinguishable.
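The overlap for 7-bit text can be checked outside Smalltalk as well. The sketch below uses Python purely as a neutral illustration (none of it is GNU Smalltalk API, and the helper names `byte_values` and `code_points` are made up for this example). It assumes the byte-oriented string holds UTF-8 and compares what a byte-wise reader sees with what a character-wise reader sees.

```python
from typing import List


def byte_values(text: str) -> List[int]:
    """Values seen when reading the UTF-8 encoding one byte at a time."""
    return list(text.encode("utf-8"))


def code_points(text: str) -> List[int]:
    """Values seen when reading the text one Unicode character at a time."""
    return [ord(ch) for ch in text]


ascii_text = "JSON"
latin_text = "caf\u00e9"  # 'cafe' with an acute accent on the last letter

# 7-bit text: both views yield exactly the same sequence of integers.
same_for_ascii = byte_values(ascii_text) == code_points(ascii_text)

# Non-ASCII text: U+00E9 becomes the two bytes 0xC3 0xA9, so the views differ.
same_for_latin = byte_values(latin_text) == code_points(latin_text)
```

For the ASCII input the two sequences are equal; for the accented input the byte view is one element longer than the character view.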
"JSON text SHALL be encoded in Unicode. The default encoding is UTF-8."
The JSON parser has no choice but to operate on Unicode characters.
Parsing a UTF-16 encoded JSON text byte-wise will just not work :)
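Why byte-wise parsing of UTF-16 cannot work is easy to see from the bytes themselves. The following Python sketch is only an illustration under stated assumptions (little-endian UTF-16 without a BOM, and Python's standard `json` module standing in for a JSON reader); it is not how the Smalltalk JSON package is implemented.

```python
import json

text = '{"key": "value"}'
utf16_bytes = text.encode("utf-16-le")

# Every ASCII character is followed by a zero byte in UTF-16LE, so a parser
# that treats each byte as a character sees NULs between the JSON tokens.
has_nul_bytes = 0 in utf16_bytes

# Decoding the bytes to characters first gives the parser what it expects.
parsed = json.loads(utf16_bytes.decode("utf-16-le"))
```

The encoded form is twice as long as the text, and only the decoded form is something a character-based parser can consume directly.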
Oops, my fault. And can you specify different encodings?
"Any character may be escaped."
Sure, you can write "\u0041". In this case the JSON reader will return
a UnicodeString. That's why I wrote "switching to UnicodeStrings as
soon as we find a \uXXXX is a conservative approximation".
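To see why the rule is only a conservative approximation, here is a small Python sketch. The function `may_need_unicode` is a hypothetical stand-in for the "switch as soon as we find a \uXXXX" rule, and Python's `json` module stands in for the JSON reader; neither is taken from the GNU Smalltalk sources.

```python
import json


def may_need_unicode(raw_json: str) -> bool:
    # Conservative test: any backslash-u escape MIGHT denote a non-ASCII
    # character, so its mere presence triggers the wider string type.
    return "\\u" in raw_json


raw_ascii_escape = r'"\u0041"'  # escape that decodes to plain 'A'
raw_wide_escape = r'"\u20ac"'   # escape that decodes to the euro sign

decoded_ascii = json.loads(raw_ascii_escape)
decoded_wide = json.loads(raw_wide_escape)

# The rule fires for both inputs, although only the second one really
# contains a character outside the 7-bit range.
flag_ascii = may_need_unicode(raw_ascii_escape)
flag_wide = may_need_unicode(raw_wide_escape)
ascii_result_is_7bit = all(ord(ch) < 128 for ch in decoded_ascii)
wide_result_is_7bit = all(ord(ch) < 128 for ch in decoded_wide)
```

The rule never misses a string that needs Unicode, but it sometimes widens a string that would have fit in 7 bits.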
W.r.t. the internal encoding of Unicode strings, Perl has an interesting concept:
it stores Unicode strings internally either as ISO-8859-1 or as UTF-X (an extended
form of the UTF-8 encoding which can encode arbitrary integer values), and this
isn't visible at the language level. At the language level a String is a sequence
of integers interpreted as Unicode characters.
Unfortunately, compatibility with pre-Unicode Smalltalk was a mess to
achieve, and actually it still has some problems (mostly the hashing
problem I refer to below). So, I really have to thank you for working
out the bugs before 3.0.
Probably, what is missing from GNU Smalltalk's Iconv package is an
"Encoding" object that can answer queries like "is this string pure
ASCII?", the default very slow implementation being something like this:
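    | str uniStr |
    str := self asString.
    uniStr := self asUnicodeString.
    "Different sizes: some character needed more than one byte."
    str size = uniStr size ifFalse: [ ^false ].
    "Same size: every byte must equal the corresponding code point."
    str with: uniStr do: [ :ch :uni |
        ch value = uni codePoint ifFalse: [ ^false ] ].
    ^true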
This snippet would provide a more rigorous definition of "when it's
possible".
If it returns a UnicodeString, no lookup with a literal string on a Dictionary
returned by the JSON parser will work, because the literal is only a String
object, which has a different hash function than UnicodeString.
Hmmm, this has to be fixed.
Is it fixable?
To some extent, it should be. For example, in the case of
UnicodeStrings I can define the hash to be computed by translating
(internally) to UTF-8 and hashing the result. Then, we can cross our
fingers and hope that Strings are also UTF-8 (good bet nowadays), and
just implement EncodedString>>hash as "^self asUnicodeString hash".
As a note to myself, this would mean also skipping the first 3 bytes of
a String to be hashed, if they are the BOM.
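The idea can be modelled in a few lines of Python. This is only a sketch of the proposal, not GNU Smalltalk code: `ByteString` and `UnicodeText` are hypothetical stand-ins for String and UnicodeString, the byte string is assumed to hold UTF-8, and Python's built-in `hash` replaces whatever hash function the VM would use.

```python
import codecs


def strip_bom(data: bytes) -> bytes:
    """Drop a leading UTF-8 byte order mark (the three bytes EF BB BF)."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data


class _Utf8Comparable:
    """Hash and compare through a common UTF-8 representation."""

    def utf8(self) -> bytes:
        raise NotImplementedError

    def __hash__(self) -> int:
        return hash(self.utf8())

    def __eq__(self, other) -> bool:
        if not isinstance(other, _Utf8Comparable):
            return NotImplemented
        return self.utf8() == other.utf8()


class ByteString(_Utf8Comparable):
    """Stand-in for a byte-oriented String, ASSUMED to contain UTF-8."""

    def __init__(self, data: bytes) -> None:
        self.data = data

    def utf8(self) -> bytes:
        return strip_bom(self.data)


class UnicodeText(_Utf8Comparable):
    """Stand-in for a UnicodeString holding code points."""

    def __init__(self, text: str) -> None:
        self.text = text

    def utf8(self) -> bytes:
        return self.text.encode("utf-8")


# A table keyed by the Unicode type can now be queried with the byte type.
table = {UnicodeText("name"): 1}
found = table[ByteString(b"name")]
found_with_bom = table[ByteString(codecs.BOM_UTF8 + b"name")]
```

The lookup succeeds with or without a BOM on the byte string. The bet on UTF-8 is the weak point: a byte string in another encoding would hash differently from its Unicode counterpart.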
Of course it would already be helpful if it would work for ASCII characters
and Latin-1.
ASCII characters and UTF-8 please. :-) I'm also from a Latin-1 country,
but I try to think as internationally as possible. :-)
(By the way, is Smalltalk case-insensitive, or do class names have to
start with an upper-case letter? Or is that just a convention?)
Just a convention.
Paolo