Re: [Help-smalltalk] Re: [bug] UnicodeString conversion truncation


From: Robin Redeker
Subject: Re: [Help-smalltalk] Re: [bug] UnicodeString conversion truncation
Date: Mon, 22 Oct 2007 11:41:09 +0200
User-agent: Mutt/1.5.11+cvs20060403

On Mon, Oct 22, 2007 at 10:57:10AM +0200, Paolo Bonzini wrote:
> 
[.snip.]
> >Would you object if I change the json code to operate on UnicodeStrings 
> >only?
> 
> I would like to understand why you need this, but no, I would not object 
> especially because I consider JSON your code, not mine.  I just helped a 
> bit.  :-)

Heh, ok. I just want to hear other people's thoughts about this :)

> I think you wouldn't be able to operate on UnicodeStrings only, unless I 
> fix the bug with String/UnicodeString hashes (see below).
> 
> I don't know if after the explanation above you still want JSON to 
> operate on UnicodeStrings only.

Because a JSON parser can only process characters, not the bytes of some
multibyte encoding. As far as I understood, '(ReadStream on: aString) next'
will return me a Character in the range 0 to: 255, i.e. one byte of the
multibyte encoding of the string.

Or am I wrong, and #next on a stream over a String will sometimes return me
a UnicodeCharacter?
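
For example, roughly (just a sketch of what I mean; the second part assumes
the Iconv package is loaded so that #asUnicodeString is available):

    | s |
    s := ReadStream on: 'hello'.
    s next.    "a Character, i.e. one byte in the range 0..255"

    s := ReadStream on: 'hello' asUnicodeString.
    s next.    "a full Unicode character / code point"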

> >Strictly and semantically the JSON implementation should only operate
> >on UnicodeStrings, as JSON is only parseable in Unicode. (I wonder what
> >happens with the current JSON reader when it encounters a UTF-16 encoded
> >String; as far as my test went, it just didn't work, because it doesn't
> >expect multibyte encodings in String.)
> 
> JSON is not supposed to include non-Latin-1 characters.  Everything 
> that's not 7-bit encodable should be escaped using \uXXXX.

I must object; the JSON RFC ( http://www.ietf.org/rfc/rfc4627.txt ) says:

   "JavaScript Object Notation (JSON) is a text format for the serialization
   of structured data."
And:
   "A string is a sequence of zero or more Unicode characters [UNICODE]."
And:
   "JSON text SHALL be encoded in Unicode. The default encoding is UTF-8."

Also, the whole grammar/BNF is defined in terms of Unicode characters.
The \uXXXX escape is allowed in strings merely for convenience:

   "Any character may be escaped."

The JSON parser has no choice but to operate on Unicode characters.
Parsing a UTF-16 encoded JSON text byte-wise will just not work :)
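
To make that concrete: in UTF-16BE the character $a is encoded as the two
bytes 16r00 16r61, so a byte-oriented reader sees a NUL before every ASCII
character (I spell the bytes out by hand here, nothing Iconv-specific is
assumed):

    | bytes |
    bytes := ReadStream on: (String with: (Character value: 0) with: $a).
    bytes next value.    "0 -- not $a; a byte-wise parser chokes here"
    bytes next.          "$a"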

> >What puzzles me is the question what JSONReader>>#nextJSONString should
> >return. Should it be a String or a UnicodeString?
> 
> Strictly speaking it should return a UnicodeString, but it's easier to 
> use it, and faster, if (when it's possible) we let it return a String. 
> Switching to UnicodeStrings as soon as we find a \uXXXX is a 
> conservative approximation of "when it's possible".

I guess it depends on how compatible String and UnicodeString are. I can't
assume anything about the strings returned by the implementation, so usually
I would have to convert the result to a UnicodeString anyway.

If you are concerned about the memory footprint of UnicodeStrings, I would
suggest making it possible to tell the JSON implementation to always return
encoded Strings instead.
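
Interface-wise I imagine something like this (the selector names here are
purely hypothetical, just to illustrate what I mean):

    | reader result |
    "hypothetical selectors -- only a sketch of the suggested option"
    reader := JSONReader on: jsonText readStream.
    reader resultEncoding: 'UTF-8'.    "hand back encoded Strings, not UnicodeStrings"
    result := reader nextJSONValue.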

W.r.t. the internal encoding of Unicode strings: Perl has an interesting
concept. It stores Unicode strings internally either as ISO-8859-1 or as
"UTF-X" (an extended form of UTF-8 which can encode arbitrary integer
values), and the choice isn't visible at the language level. At the language
level a string is simply a sequence of integers interpreted as Unicode
characters.

> Probably, what is missing from GNU Smalltalk's Iconv package is an 
> "Encoding" object that can answer queries like "is this string pure 
> ASCII?", the default very slow implementation being something like this:
> 
>     str := self asString.
>     uniStr := self asUnicodeString.
>     str size = uniStr size ifFalse: [ ^false ].
>     str with: uniStr do: [ :ch :uni |
>         ch value = uni codePoint ifFalse: [ ^false ] ].
>     ^true
> 
> This snippet would provide a more rigorous definition of "when it's 
> possible".
> 
> >If it returns UnicodeString, no literal string access on a Dictionary
> >returned by the JSON parser will work, as it would get only a String
> >object, which has a different hash function than UnicodeString.
> 
> Hmmm, this has to be fixed.

Is it fixable? If I have a UnicodeString, the original encoding is lost and
the hash has to operate on the characters. If I have e.g. the UTF-16 encoded
form in a String, then the hash method has to operate on the bytes, which
will lead to a different hash.

Of course it would already be helpful if it worked for ASCII and Latin-1
characters, because I often access those Dictionaries with literal strings,
and in my case those literals are usually plain ASCII.
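
Concretely, what bites me is essentially this (a minimal sketch, assuming
the parser hands back UnicodeString keys):

    | d |
    d := Dictionary new.
    d at: 'name' asUnicodeString put: 42.
    d at: 'name' ifAbsent: [ nil ].    "nil -- the String literal hashes differently"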

But it would also be nice if there were a way to have UnicodeString
literals :)
Of course the Smalltalk source would then need a defined encoding, and the
Smalltalk parser would have to understand Unicode.
(I don't need this, it's just a random thought :)
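
In the meantime I suppose sending #asUnicodeString to an ordinary literal is
the closest approximation (assuming the Iconv package is loaded), though for
anything beyond ASCII it still depends on the encoding of the source file:

    name := 'Paolo' asUnicodeString.
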
(Btw., is Smalltalk case-insensitive, or do class names have to start with
an upper-case letter? Or is that just a convention?)



Robin



