[Help-smalltalk] Re: [bug] UnicodeString conversion truncation
From: Paolo Bonzini
Subject: [Help-smalltalk] Re: [bug] UnicodeString conversion truncation
Date: Mon, 22 Oct 2007 10:57:10 +0200
User-agent: Thunderbird 2.0.0.6 (Macintosh/20070728)
String>>#jsonPrintOn:
    (self anySatisfy: [ :ch | ch value between: 128 and: 255 ])
        ifTrue: [ self asUnicodeString jsonPrintOn: aStream ]
        ifFalse: [ super jsonPrintOn: aStream ]
Why print strings that have non-ASCII chars differently?
Because, say, a UTF-8-encoded string containing the characters 195 and
160 should print as "\u00E0", not as "à" (that's a lowercase accented
'a'). The easiest way to convert the two bytes to a single character is
with #asUnicodeString: in GNU Smalltalk, Strings are bytes and
UnicodeStrings are characters.
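The byte-versus-character distinction can be tried out in a few lines. This is only a sketch in Python (chosen so it runs standalone; it is not the GNU Smalltalk code), and `json_escape` is a hypothetical helper written for this example, not a function from any JSON library:

```python
def json_escape(text):
    """Return a JSON string literal that contains only 7-bit characters."""
    out = ['"']
    for ch in text:
        cp = ord(ch)
        if ch in '"\\':
            out.append('\\' + ch)            # quote and backslash get a backslash
        elif 32 <= cp <= 126:
            out.append(ch)                   # printable ASCII passes through
        elif cp <= 0xFFFF:
            out.append('\\u%04X' % cp)       # everything else becomes \uXXXX
        else:
            cp -= 0x10000                    # beyond the BMP: UTF-16 surrogate pair
            out.append('\\u%04X\\u%04X' % (0xD800 + (cp >> 10),
                                           0xDC00 + (cp & 0x3FF)))
    out.append('"')
    return ''.join(out)


raw = bytes([195, 160])        # two bytes, as a byte-oriented String would hold them
text = raw.decode('utf-8')     # one character, as a UnicodeString would hold it

print(len(raw), len(text))     # 2 1
print(json_escape(text))       # "\u00E0"
```

Decoding first is what turns the two bytes into the single code point U+00E0, which is the step #asUnicodeString performs in the method above.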
Actually, to support ISO-2022-JP and similar encodings (which use a
sequence introduced by ESC to switch between latin and double-byte
characters), one of us should probably change jsonPrintOn: to use
    (self allSatisfy: [ :ch | ch value between: 32 and: 126 ])
        ifFalse: [ self asUnicodeString jsonPrintOn: aStream ]
        ifTrue: [ super jsonPrintOn: aStream ]
(A note that you can safely skip: even this, unfortunately, would not
cater for UTF-7. You can skip it because UTF-7 is terminally broken,
and the only thing you should do with UTF-7 input is convert it to a
saner encoding as soon as you read it.)
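To see why the range matters, here is a small Python sketch (again a stand-in, not GNU Smalltalk code) that applies both tests to raw bytes. The function names are made up for this example; the codecs `iso2022_jp` and `utf-7` are the ones shipped with Python's standard library:

```python
def needs_unicode_old(data):
    """First test: true if any byte lies in 128..255."""
    return any(128 <= b <= 255 for b in data)


def needs_unicode_new(data):
    """Proposed test: true if any byte lies outside 32..126."""
    return not all(32 <= b <= 126 for b in data)


samples = {
    'utf-8':       bytes([195, 160]),                    # a with grave accent
    'iso-2022-jp': '\u65e5\u672c'.encode('iso2022_jp'),  # two kanji
    'utf-7':       '\u00e0'.encode('utf-7'),             # a with grave accent
}

for name, data in samples.items():
    print(name, list(data), needs_unicode_old(data), needs_unicode_new(data))
```

The ISO-2022-JP bytes are all below 128, so the first test lets them through untouched; the ESC byte (27) is what makes the second test catch them. The UTF-7 bytes are all printable ASCII, so neither test notices them, which is the limitation mentioned in the note above.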
And this in the string parsing code:
    c = $u
        ifTrue: [
            c := (Integer readFrom: (stream next: 4) readStream radix: 16)
                asCharacter.
            (c class == UnicodeCharacter and: [ str species == String ])
                ifTrue: [ str := (UnicodeString new writeStream
                    nextPutAll: str contents; yourself) ] ].
    str nextPut: c.
What it does now is to operate on UnicodeStrings if it considers it
necessary; if there are no \uXXXX escapes, it uses String because valid
JSON only has 7-bit characters in strings.
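The same "start narrow, widen on demand" idea can be sketched in Python, with `bytes` standing in for String and `str` for UnicodeString. The function name, the escape table, and the choice of 127 as the widening threshold are assumptions of this sketch; surrogate pairs and error handling are left out on purpose:

```python
SIMPLE_ESCAPES = {'n': '\n', 't': '\t', 'r': '\r', 'b': '\b', 'f': '\f',
                  '/': '/', '"': '"', '\\': '\\'}


def read_json_string_body(body):
    """Decode the inside of a JSON string literal (without the quotes).

    Answers bytes while every character fits in 7 bits, and str as soon
    as one character does not.
    """
    narrow = bytearray()   # plays the role of the String write stream
    wide = None            # plays the role of the UnicodeString, made lazily
    i = 0
    while i < len(body):
        ch = body[i]
        i += 1
        if ch == '\\':
            esc = body[i]
            i += 1
            if esc == 'u':
                ch = chr(int(body[i:i + 4], 16))   # four hex digits
                i += 4
            else:
                ch = SIMPLE_ESCAPES.get(esc, esc)
        if wide is None and ord(ch) > 127:
            # First wide character: copy what was collected so far.
            wide = list(narrow.decode('ascii'))
        if wide is None:
            narrow.append(ord(ch))
        else:
            wide.append(ch)
    return bytes(narrow) if wide is None else ''.join(wide)


print(read_json_string_body('abc'))          # b'abc'
print(read_json_string_body('caf\\u00E9'))   # café
```

Note that this sketch widens only when an escape actually yields a character above 127, which is slightly less conservative than widening on every \uXXXX.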
Would you object if I change the json code to operate on UnicodeStrings only?
I would like to understand why you need this, but no, I would not object,
especially because I consider JSON your code, not mine. I just helped a
bit. :-)
I think you wouldn't be able to operate on UnicodeStrings only, unless I
fix the bug with String/UnicodeString hashes (see below).
I don't know if after the explanation above you still want JSON to
operate on UnicodeStrings only.
Strictly and semantically, the JSON implementation should only operate on
UnicodeStrings, as JSON is only parseable in Unicode. (I wonder what happens
with the current JSON reader when it encounters a UTF-16 encoded String; as
far as my test went, it just didn't work because it doesn't expect multibyte
encodings in String.)
JSON is not supposed to include non-Latin-1 characters. Everything
that's not 7-bit encodable should be escaped using \uXXXX.
What puzzles me is the question what JSONReader>>#nextJSONString should
return. Should it be a String or a UnicodeString?
Strictly speaking it should return a UnicodeString, but it's easier to
use it, and faster, if (when it's possible) we let it return a String.
Switching to UnicodeStrings as soon as we find a \uXXXX is a
conservative approximation of "when it's possible".
Probably, what is missing from GNU Smalltalk's Iconv package is an
"Encoding" object that can answer queries like "is this string pure
ASCII?", the default very slow implementation being something like this:
    str := self asString.
    uniStr := self asUnicodeString.
    str size = uniStr size ifFalse: [ ^false ].
    str with: uniStr do: [ :ch :uni |
        ch value = uni codePoint ifFalse: [ ^false ] ].
    ^true
This snippet would provide a more rigorous definition of "when it's
possible".
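For experimenting outside the image, the same check can be written in Python over raw bytes plus an encoding name. This is a sketch only; `bytes_match_code_points` is a hypothetical name, and the encoding argument takes the place of whatever the "Encoding" object would know:

```python
def bytes_match_code_points(data, encoding):
    """True if decoding changes nothing: same length, same values."""
    try:
        text = data.decode(encoding)
    except UnicodeDecodeError:
        return False
    if len(data) != len(text):
        return False                       # some character took several bytes
    return all(byte == ord(ch) for byte, ch in zip(data, text))


print(bytes_match_code_points(b'hello', 'utf-8'))                       # True
print(bytes_match_code_points(bytes([195, 160]), 'utf-8'))              # False
print(bytes_match_code_points('\u00e0'.encode('utf-7'), 'utf-7'))       # False
print(bytes_match_code_points('\u65e5'.encode('iso2022_jp'), 'iso2022_jp'))  # False
```

Because it compares against the decoded text rather than looking at byte ranges, this version also rejects the UTF-7 and ISO-2022-JP cases. It answers true for single-byte Latin-1 input as well, since those byte values coincide with the code points.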
If it returns UnicodeString, no literal string access on a Dictionary returned
by the JSON parser will work, as the lookup would only get a String object,
which has a different hash function than UnicodeString.
Hmmm, this has to be fixed.
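A loose analogy in Python shows the shape of the problem: `str` and `bytes` play the roles of UnicodeString and String here, and the dictionary is a made-up example, not parser output. Keys of one type are not found by lookups of the other, even when the text is the same:

```python
parsed = {'name': 'value'}           # keys as the "wide" string type

print('name' in parsed)              # True
print(b'name' in parsed)             # False: different type, no match


def lookup(table, key):
    """Hypothetical workaround: normalise the key type before looking up."""
    if isinstance(key, bytes):
        key = key.decode('ascii')
    return table[key]


print(lookup(parsed, b'name'))       # value
```

Normalising at every call site is exactly the kind of burden a consistent hash and equality between the two string classes would remove.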
Paolo