help-smalltalk
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [Help-smalltalk] Re: [bug] UnicodeString conversion truncation


From: Paolo Bonzini
Subject: Re: [Help-smalltalk] Re: [bug] UnicodeString conversion truncation
Date: Mon, 22 Oct 2007 15:40:53 +0200
User-agent: Thunderbird 2.0.0.6 (Macintosh/20070728)


Hm, but then I would have to do that for any character in a String and not
only for \uXXXX, if I understand you right, as this is valid JSON
(encoding UTF-8):

   {"test":"にほんじん\u306b"}

I couldn't refute or confirm the above, so I started implementing it, and this led to another approach: do everything in Unicode inside the JSON reader, as you suggested, but accept *and return* encoded strings. The attached patch and the attached JSON.st will show more or less what I have in mind. There is still something to be gained from a Stream>>#encoding method, but it already clarifies things a lot, at least for me.

With this patch (which is straightforward apart from a stupid String/Symbol mismatch in i18n/Locale.st), I can do things like these:

PackageLoader fileInPackage: #I18N.
FileStream fileIn: '../examples/JSON.st'.
JSONReader fromJSON: '{"test":"にほんじん\u306b"}'
=> 'test'->'にほんじんに'
JSONReader fromJSON: '{"test":"にほんじん\u306b"}' encoding: 'UTF-16'
=> UTF-16BE['test']->UTF-16BE['にほんじんに']

(there are problems with hashing, which make the second result unusable in practice, but the JSON reader does its part of the job right, at least).

Nope, Perl has strings of "integers" which can either represent octets
or Unicode characters. The interpretation is up to the programmer.

Hm, so that's as bad as what we do. Ours has more potential for confusion, but it's a little more typesafe. This fits the general Perl philosophy, I would say.

Paolo
* looking for address@hidden/smalltalk--devo--2.2--patch-612 to compare with
* comparing to address@hidden/smalltalk--devo--2.2--patch-612
M  kernel/UniString.st
M  examples/JSON.st
M  packages/i18n/Locale.st
M  packages/iconv/Sets.st
M  kernel/CharArray.st
M  kernel/Stream.st
M  kernel/String.st

* modified files

--- orig/examples/JSON.st
+++ mod/examples/JSON.st
@@ -32,7 +32,7 @@
 
 
 Stream subclass: #JSONReader
-    instanceVariableNames: 'stream'
+    instanceVariableNames: 'stream encoding'
     classVariableNames: ''
     poolDictionaries: ''
     category: nil !
@@ -51,15 +51,39 @@ toJSON: anObject
 
 fromJSON: string
    "I'm responsible for decoding the JSON string to objects."
-   ^(self on: string readStream) nextJSONObject
+   ^self fromJSON: string encoding: string encoding
+!
+
+fromJSON: string encoding: encString
+   "I'm responsible for decoding the JSON string to objects."
+   | stream |
+   stream := string readStream.
+   string isUnicode ifFalse: [
+       stream := I18N.EncodedStream unicodeOn: stream encoding: string 
encoding ].
+   ^(self on: stream encoding: encString) nextJSONObject
 !
 
 on: aStream
-    ^self new stream: aStream
+    ^self on: aStream encoding: 'UTF-8'
+!
+
+on: aStream encoding: encString
+    "TODO: if we had an #encoding method on Streams, we could use it and do
+     the re-encoding here.  Now instead we assume encString is also the input
+     encoding."
+    ^self new stream: aStream; encoding: encString; yourself
 ! !
 
 !JSONReader methodsFor: 'json'!
 
+encoding
+    ^encoding
+!
+
+encoding: aString
+    encoding := aString
+!
+
 stream: aStream
     stream := aStream
 !
@@ -165,15 +189,11 @@ nextJSONString
             c = $t ifTrue: [ c := Character tab ].
             c = $u
                ifTrue: [
-                 c := (Integer readFrom: (stream next: 4) readStream radix: 
16) asCharacter.
-                 (c class == UnicodeCharacter and: [ str species == String ])
-                   ifTrue: [ str := (UnicodeString new writeStream
-                                       nextPutAll: str contents; yourself) ] ].
+                 c := (Integer readFrom: (stream next: 4) readStream radix: 
16) asCharacter ].
          ].
-      str nextPut: c.
+        str nextPut: c.
    ].
-   "Undo the conversion to UnicodeString done above."
-   ^str contents asString.
+   ^str contents asString: self encoding
 !
 
 nextJSONNumber


--- orig/kernel/CharArray.st
+++ mod/kernel/CharArray.st
@@ -60,6 +60,14 @@ accessing and manipulation methods for s
        ^self with: Character nl
     ]
 
+    CharacterArray class >> isUnicode [
+       "Answer whether the receiver stores bytes (i.e. an encoded
+        form) or characters (if true is returned)."
+
+       <category: 'multibyte encodings'>
+       self subclassResponsibility
+    ]
+
     = aString [
        "Answer whether the receiver's items match those in aCollection"
 
@@ -188,6 +196,14 @@ accessing and manipulation methods for s
        ^nil
     ]
 
+    isUnicode [
+       "Answer whether the receiver stores bytes (i.e. an encoded
+        form) or characters (if true is returned)."
+
+       <category: 'multibyte encodings'>
+       ^self class isUnicode
+    ]
+
     encoding [
        "Answer the encoding used by the receiver."
 


--- orig/kernel/Stream.st
+++ mod/kernel/Stream.st
@@ -312,7 +312,7 @@ provide for writing collections sequenti
         whose value is above 127."
 
        <category: 'character writing'>
-       ^self species shape ~~ #character
+       ^self species isUnicode
     ]
 
     cr [


--- orig/kernel/String.st
+++ mod/kernel/String.st
@@ -64,6 +64,14 @@ or assumed to be the system default.'>
        ^SystemExceptions.WrongClass signalOn: anInteger mustBe: SmallInteger
     ]
 
+    String class >> isUnicode [
+       "Answer false; the receiver stores bytes (i.e. an encoded
+        form), not characters."
+
+       <category: 'multibyte encodings'>
+       ^false
+    ]
+
     = aString [
        "Answer whether the receiver's items match those in aCollection"
 


--- orig/kernel/UniString.st
+++ mod/kernel/UniString.st
@@ -55,6 +55,13 @@ as 4-byte UTF-32 characters'>
        ^'Unicode'
     ]
 
+    UnicodeString class >> isUnicode [
+       "Answer true; the receiver stores characters."
+
+       <category: 'multibyte encodings'>
+       ^true
+    ]
+
     asString [
        "Returns the string corresponding to the receiver.  Without the
         Iconv package, unrecognized Unicode characters become $?


--- orig/packages/i18n/Locale.st
+++ mod/packages/i18n/Locale.st
@@ -69,7 +69,7 @@ information.'>
        "Set the default charset used when nothing is specified."
 
        <category: 'database'>
-       DefaultCharsets at: 'POSIX' put: aString asSymbol
+       DefaultCharsets at: 'POSIX' put: aString asString
     ]
 
     LocaleData class >> defaults [
@@ -77,10 +77,136 @@ information.'>
         associations."
 
        <category: 'database'>
-       ^#(#('POSIX' '' #'UTF-8') #('af' 'ZA' #'ISO-8859-1') #('am' 'ET' 
#'UTF-8') #('ar' 'SA' #'ISO-8859-6') #('as' 'IN' #'UTF-8') #('az' 'AZ' 
#'UTF-8') #('be' 'BY' #CP1251) #('ber' 'MA' #'UTF-8') #('bg' 'BG' #CP1251) 
#('bin' 'NG' #'ISO-8859-1') #('bn' 'IN' #'UTF-8') #('bnt' 'TZ' #'ISO-8859-1') 
#('bo' 'CN' #'UTF-8') #('br' 'FR' #'ISO-8859-1') #('bs' 'BA' #'ISO-8859-2') 
#('ca' 'ES' #'ISO-8859-1') #('chr' 'US' #'ISO-8859-1') #('cpe' 'US' 
#'ISO-8859-1') #('cs' 'CZ' #'ISO-8859-2') #('cy' 'GB' #'ISO-8859-14') #('da' 
'DK' #'ISO-8859-1') #('de' 'DE' #'ISO-8859-1') #('div' 'MV' #'ISO-8859-1') 
#('el' 'GR' #'ISO-8859-7') #('en' 'US' #'ISO-8859-1') #('eo' 'XX' 
#'ISO-8859-3') #('es' 'ES' #'ISO-8859-1') #('et' 'EE' #'ISO-8859-4') #('eu' 
'ES' #'ISO-8859-1') #('fa' 'IR' #'UTF-8') #('fi' 'FI' #'ISO-8859-1') #('fo' 
'FO' #'ISO-8859-1') #('fr' 'FR' #'ISO-8859-1') #('ful' 'NG' #'ISO-8859-1') 
#('fy' 'NL' #'ISO-8859-1') #('ga' 'IE' #'ISO-8859-1') #('gd' 'GB' 
#'ISO-8859-1') #('gl' 'ES' #'ISO-8859-1') #('gn' 'PY' #'ISO-8859-1') #('gu' 
'IN' #'UTF-8') #('gv' 'GB' #'ISO-8859-1') #('ha' 'NG' #'ISO-8859-1') #('he' 
'IL' #'ISO-8859-8') #('hi' 'IN' #'UTF-8') #('hr' 'HR' #'ISO-8859-2') #('hu' 
'HU' #'ISO-8859-2') #('ibo' 'NG' #'ISO-8859-1') #('id' 'ID' #'ISO-8859-1') 
#('is' 'IS' #'ISO-8859-1') #('it' 'IT' #'ISO-8859-1') #('iu' 'CA' #'UTF-8') 
#('ja' 'JP' #'EUC-JP') #('ka' 'GE' #'GEORGIAN-PS') #('kau' 'NG' #'ISO-8859-1') 
#('kk' 'KZ' #'UTF-8') #('kl' 'GL' #'ISO-8859-1') #('km' 'KH' #'UTF-8') #('kn' 
'IN' #'UTF-8') #('ko' 'KR' #'EUC-KR') #('kok' 'IN' #'UTF-8') #('ks' 'PK' 
#'UTF-8') #('kw' 'GB' #'ISO-8859-1') #('ky' 'KG' #'UTF-8') #('la' 'VA' #ASCII) 
#('lt' 'LT' #'ISO-8859-13') #('lv' 'LV' #'ISO-8859-13') #('mi' 'NZ' 
#'ISO-8859-13') #('mk' 'MK' #'ISO-8859-5') #('ml' 'IN' #'UTF-8') #('mn' 'MN' 
#'KOI8-R') #('mni' 'IN' #'UTF-8') #('mr' 'IN' #'UTF-8') #('ms' 'MY' 
#'ISO-8859-1') #('mt' 'MT' #'ISO-8859-3') #('my' 'MM' #'UTF-8') #('ne' 'NP' 
#'UTF-8') #('nic' 'NG' #'ISO-8859-1') #('nl' 'NL' #'ISO-8859-1') #('nn' 'NO' 
#'ISO-8859-1') #('no' 'NO' #'ISO-8859-1') #('oc' 'FR' #'ISO-8859-1') #('om' 
'ET' #'UTF-8') #('or' 'IN' #'UTF-8') #('pa' 'IN' #'UTF-8') #('pap' 'AN' 
#'UTF-8') #('pl' 'PL' #'ISO-8859-2') #('ps' 'PK' #'UTF-8') #('pt' 'PT' 
#'ISO-8859-1') #('rm' 'CH' #'ISO-8859-1') #('ro' 'RO' #'ISO-8859-2') #('ru' 
'RU' #'KOI8-R') #('sa' 'IN' #'UTF-8') #('se' 'NO' #'UTF-8') #('sh' 'YU' 
#'ISO-8859-2') #('si' 'LK' #'UTF-8') #('sit' 'CN' #'UTF-8') #('sk' 'SK' 
#'ISO-8859-2') #('sl' 'SI' #'ISO-8859-2') #('so' 'SO' #'UTF-8') #('sp' 'YU' 
#'ISO-8859-5') #('sq' 'AL' #'ISO-8859-1') #('sr' 'YU' #'ISO-8859-2') #('sv' 
'SE' #'ISO-8859-1') #('sw' 'KE' #'ISO-8859-1') #('syr' 'TR' #'UTF-8') #('ta' 
'IN' #'UTF-8') #('te' 'IN' #'UTF-8') #('tg' 'TJ' #'UTF-8') #('th' 'TH' 
#'TIS-620') #('ti' 'ET' #'UTF-8') #('tk' 'TM' #'UTF-8') #('tl' 'PH' 
#'ISO-8859-1') #('tr' 'TR' #'ISO-8859-9') #('ts' 'ZA' #'ISO-8859-1') #('tt' 
'RU' #'UTF-8') #('uk' 'UA' #'KOI8-U') #('ur' 'PK' #'UTF-8') #('uz' 'UZ' 
#'ISO-8859-1') #('ven' 'ZA' #'ISO-8859-1') #('vi' 'VN' #'UTF-8') #('wa' 'BE' 
#'ISO-8859-1') #('wen' 'DE' #'ISO-8859-1') #('xh' 'ZA' #'ISO-8859-1') #('yi' 
'US' #CP1255) #('yo' 'NG' #'ISO-8859-1') #('zh' 'CN' #GB2312) #('zu' 'ZA' 
#'ISO-8859-1'))    "not yet seen on Unix"  "not yet seen on Unix"  "not yet 
seen on Unix"  "not yet seen on Unix"  "not yet seen on Unix"  "not yet seen on 
Unix"  "not yet seen on Unix"  "not yet seen on Unix"  "not yet seen on Unix"  
"not yet seen on Unix"  "not yet seen on Unix"  "not yet seen on Unix"  "not 
yet seen on Unix"  "not yet seen on Unix"  "not yet seen on Unix"  "not yet 
seen on Unix"  "not yet seen on Unix"  "not yet seen on Unix"  "not yet seen on 
Unix"
-       "('hy' 'AM'     #'ARMSCII-8')"  "not yet seen on Unix"  "not yet seen 
on Unix"  "not yet seen on Unix"  "not yet seen on Unix"  "not yet seen on 
Unix"  "not yet seen on Unix"  "not yet seen on Unix"  "not yet seen on Unix"  
"not yet seen on Unix"  "not yet seen on Unix"  "not yet seen on Unix"
-       "('lo' 'LA'     #'MULELAO-1')"  "not yet seen on Unix"  "not yet seen 
on Unix"  "not yet seen on Unix"  "not yet seen on Unix"  "not yet seen on 
Unix"  "not yet seen on Unix"  "not yet seen on Unix"  "not yet seen on Unix"  
"not yet seen on Unix"  "not yet seen on Unix"  "not yet seen on Unix"  "not 
yet seen on Unix"  "not yet seen on Unix"  "not yet seen on Unix"
-       "('sd' ?        ?)"     "not yet seen on Unix"  "obsolete"      "not 
yet seen on Unix"  "not yet seen on Unix"  "not yet seen on Unix"  "obsolete"   
   "not yet seen on Unix"  "not yet seen on Unix"  "not yet seen on Unix"  "not 
yet seen on Unix"  "not yet seen on Unix"  "not yet seen on Unix"  "not yet 
seen on Unix"  "not yet seen on Unix"  "not yet seen on Unix"  "not yet seen on 
Unix"  "not yet seen on Unix"
+       ^#(#('POSIX' '' 'UTF-8')
+        #('af' 'ZA' 'ISO-8859-1')
+        #('am' 'ET' 'UTF-8')
+        #('ar' 'SA' 'ISO-8859-6')
+        #('as' 'IN' 'UTF-8')
+        #('az' 'AZ' 'UTF-8')
+        #('be' 'BY' 'CP1251')
+        #('ber' 'MA' 'UTF-8')
+        #('bg' 'BG' 'CP1251')
+        #('bin' 'NG' 'ISO-8859-1')
+        #('bn' 'IN' 'UTF-8')
+        #('bnt' 'TZ' 'ISO-8859-1')
+        #('bo' 'CN' 'UTF-8')
+        #('br' 'FR' 'ISO-8859-1')
+        #('bs' 'BA' 'ISO-8859-2')
+        #('ca' 'ES' 'ISO-8859-1')
+        #('chr' 'US' 'ISO-8859-1')
+        #('cpe' 'US' 'ISO-8859-1')
+        #('cs' 'CZ' 'ISO-8859-2')
+        #('cy' 'GB' 'ISO-8859-14')
+        #('da' 'DK' 'ISO-8859-1')
+        #('de' 'DE' 'ISO-8859-1')
+        #('div' 'MV' 'ISO-8859-1')
+        #('el' 'GR' 'ISO-8859-7')
+        #('en' 'US' 'ISO-8859-1')
+        #('eo' 'XX' 'ISO-8859-3')
+        #('es' 'ES' 'ISO-8859-1')
+        #('et' 'EE' 'ISO-8859-4')
+        #('eu' 'ES' 'ISO-8859-1')
+        #('fa' 'IR' 'UTF-8')
+        #('fi' 'FI' 'ISO-8859-1')
+        #('fo' 'FO' 'ISO-8859-1')
+        #('fr' 'FR' 'ISO-8859-1')
+        #('ful' 'NG' 'ISO-8859-1')
+        #('fy' 'NL' 'ISO-8859-1')
+        #('ga' 'IE' 'ISO-8859-1')
+        #('gd' 'GB' 'ISO-8859-1')
+        #('gl' 'ES' 'ISO-8859-1')
+        #('gn' 'PY' 'ISO-8859-1')
+        #('gu' 'IN' 'UTF-8')
+        #('gv' 'GB' 'ISO-8859-1')
+        #('ha' 'NG' 'ISO-8859-1')
+        #('he' 'IL' 'ISO-8859-8')
+        #('hi' 'IN' 'UTF-8')
+        #('hr' 'HR' 'ISO-8859-2')
+        #('hu' 'HU' 'ISO-8859-2')
+        #('ibo' 'NG' 'ISO-8859-1')
+        #('id' 'ID' 'ISO-8859-1')
+        #('is' 'IS' 'ISO-8859-1')
+        #('it' 'IT' 'ISO-8859-1')
+        #('iu' 'CA' 'UTF-8')
+        #('ja' 'JP' 'EUC-JP')
+        #('ka' 'GE' 'GEORGIAN-PS')
+        #('kau' 'NG' 'ISO-8859-1')
+        #('kk' 'KZ' 'UTF-8')
+        #('kl' 'GL' 'ISO-8859-1')
+        #('km' 'KH' 'UTF-8')
+        #('kn' 'IN' 'UTF-8')
+        #('ko' 'KR' 'EUC-KR')
+        #('kok' 'IN' 'UTF-8')
+        #('ks' 'PK' 'UTF-8')
+        #('kw' 'GB' 'ISO-8859-1')
+        #('ky' 'KG' 'UTF-8')
+        #('la' 'VA' 'ASCII')
+        #('lt' 'LT' 'ISO-8859-13')
+        #('lv' 'LV' 'ISO-8859-13')
+        #('mi' 'NZ' 'ISO-8859-13')
+        #('mk' 'MK' 'ISO-8859-5')
+        #('ml' 'IN' 'UTF-8')
+        #('mn' 'MN' 'KOI8-R')
+        #('mni' 'IN' 'UTF-8')
+        #('mr' 'IN' 'UTF-8')
+        #('ms' 'MY' 'ISO-8859-1')
+        #('mt' 'MT' 'ISO-8859-3')
+        #('my' 'MM' 'UTF-8')
+        #('ne' 'NP' 'UTF-8')
+        #('nic' 'NG' 'ISO-8859-1')
+        #('nl' 'NL' 'ISO-8859-1')
+        #('nn' 'NO' 'ISO-8859-1')
+        #('no' 'NO' 'ISO-8859-1')
+        #('oc' 'FR' 'ISO-8859-1')
+        #('om' 'ET' 'UTF-8')
+        #('or' 'IN' 'UTF-8')
+        #('pa' 'IN' 'UTF-8')
+        #('pap' 'AN' 'UTF-8')
+        #('pl' 'PL' 'ISO-8859-2')
+        #('ps' 'PK' 'UTF-8')
+        #('pt' 'PT' 'ISO-8859-1')
+        #('rm' 'CH' 'ISO-8859-1')
+        #('ro' 'RO' 'ISO-8859-2')
+        #('ru' 'RU' 'KOI8-R')
+        #('sa' 'IN' 'UTF-8')
+        #('se' 'NO' 'UTF-8')
+        #('sh' 'YU' 'ISO-8859-2')
+        #('si' 'LK' 'UTF-8')
+        #('sit' 'CN' 'UTF-8')
+        #('sk' 'SK' 'ISO-8859-2')
+        #('sl' 'SI' 'ISO-8859-2')
+        #('so' 'SO' 'UTF-8')
+        #('sp' 'YU' 'ISO-8859-5')
+        #('sq' 'AL' 'ISO-8859-1')
+        #('sr' 'YU' 'ISO-8859-2')
+        #('sv' 'SE' 'ISO-8859-1')
+        #('sw' 'KE' 'ISO-8859-1')
+        #('syr' 'TR' 'UTF-8')
+        #('ta' 'IN' 'UTF-8')
+        #('te' 'IN' 'UTF-8')
+        #('tg' 'TJ' 'UTF-8')
+        #('th' 'TH' 'TIS-620')
+        #('ti' 'ET' 'UTF-8')
+        #('tk' 'TM' 'UTF-8')
+        #('tl' 'PH' 'ISO-8859-1')
+        #('tr' 'TR' 'ISO-8859-9')
+        #('ts' 'ZA' 'ISO-8859-1')
+        #('tt' 'RU' 'UTF-8')
+        #('uk' 'UA' 'KOI8-U')
+        #('ur' 'PK' 'UTF-8')
+        #('uz' 'UZ' 'ISO-8859-1')
+        #('ven' 'ZA' 'ISO-8859-1')
+        #('vi' 'VN' 'UTF-8')
+        #('wa' 'BE' 'ISO-8859-1')
+        #('wen' 'DE' 'ISO-8859-1')
+        #('xh' 'ZA' 'ISO-8859-1')
+        #('yi' 'US' 'CP1255')
+        #('yo' 'NG' 'ISO-8859-1')
+        #('zh' 'CN' 'GB2312')
+        #('zu' 'ZA' 'ISO-8859-1'))
+       "('hy' 'AM'     #'ARMSCII-8')"
+       "('lo' 'LA'     #'MULELAO-1')"
+       "('sd' ?        ?)"
     ]
 
     LocaleData class >> initialize [


--- orig/packages/iconv/Sets.st
+++ mod/packages/iconv/Sets.st
@@ -100,6 +100,7 @@ assumed to be the system default.'>
        <category: 'instance creation'>
        | str |
        str := aString asString.
+       str encoding = str class defaultEncoding ifTrue: [ ^str ].
        ^self fromString: str encoding: str encoding
     ]
 
@@ -109,6 +110,7 @@ assumed to be the system default.'>
        str := aString isString 
                    ifTrue: [aString]
                    ifFalse: [aString asString: encoding].
+       encoding = str class defaultEncoding ifTrue: [ ^str ].
        ^(self basicNew)
            setString: aString;
            encoding: encoding
@@ -124,6 +126,14 @@ assumed to be the system default.'>
        self shouldNotImplement
     ]
 
+    EncodedString class >> isUnicode [
+       "Answer false; the receiver stores bytes (i.e. an encoded
+        form), not characters."
+
+       <category: 'accessing'>
+       ^false
+    ]
+
     asString [
        <category: 'accessing'>
        ^string
@@ -279,6 +289,14 @@ Encoders can return EncodedString object
        ^EncodedString fromString: (String new: size) encoding: self encoding
     ]
 
+    isUnicode [
+       "Answer false; the receiver stores bytes (i.e. an encoded
+        form), not characters."
+
+       <category: 'accessing'>
+       ^false
+    ]
+
     encoding [
        "Answer the encoding used for the created Strings."
 



"======================================================================
|
|   JSON reader/writer example
|
|
 ======================================================================"


"======================================================================
|
| Copyright 2007 Free Software Foundation, Inc.
| Written by Robin Redeker.
|
| This file is part of the GNU Smalltalk class library.
|
| The GNU Smalltalk class library is free software; you can redistribute it
| and/or modify it under the terms of the GNU Lesser General Public License
| as published by the Free Software Foundation; either version 2.1, or (at
| your option) any later version.
| 
| The GNU Smalltalk class library is distributed in the hope that it will be
| useful, but WITHOUT ANY WARRANTY; without even the implied warranty of
| MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU Lesser
| General Public License for more details.
| 
| You should have received a copy of the GNU Lesser General Public License
| along with the GNU Smalltalk class library; see the file COPYING.LIB.
| If not, write to the Free Software Foundation, 59 Temple Place - Suite
| 330, Boston, MA 02110-1301, USA.  
|
 ======================================================================"


Stream subclass: #JSONReader
    instanceVariableNames: 'stream encoding'
    classVariableNames: ''
    poolDictionaries: ''
    category: nil !

JSONReader comment:
'I read data structures (currently build of OrderedCollection and Dictionary)
from and to JSON (Java Script Object Notation). Writing is done with the
#toJSON method (note: it will behave badly with circular data structures).' !

!JSONReader class methodsFor: 'json'!

toJSON: anObject
   "I'm returning a JSON string which represents the object."
   ^anObject toJSON
!

fromJSON: string
   "I'm responsible for decoding the JSON string to objects."
   ^self fromJSON: string encoding: string encoding
!

fromJSON: string encoding: encString
   "I'm responsible for decoding the JSON string to objects."
   | stream |
   stream := string readStream.
   string isUnicode ifFalse: [
       stream := I18N.EncodedStream unicodeOn: stream encoding: string encoding 
].
   ^(self on: stream encoding: encString) nextJSONObject
!

on: aStream
    ^self on: aStream encoding: 'UTF-8'
!

on: aStream encoding: encString
    "TODO: if we had an #encoding method on Streams, we could use it and do
     the re-encoding here.  Now instead we assume encString is also the input
     encoding."
    ^self new stream: aStream; encoding: encString; yourself
! !

!JSONReader methodsFor: 'json'!

encoding
    ^encoding
!

encoding: aString
    encoding := aString
!

stream: aStream
    stream := aStream
!

peek
   "I'm peeking for the next non-whitespace character and will drop all 
whitespace in front of it"
   | c |
   [
     c := stream peek.
     c = (Character space)
         or: [ c = (Character tab)
         or: [ c = (Character lf)
         or: [ c = (Character cr)]]]
   ] whileTrue: [
     stream next
   ].
   ^c
!

next
   "I'm returning the next non-whitespace character"
   | c |
   c := self peek.
   c isNil ifTrue: [ ^self error: 'expected character but found end of stream' 
].
   stream next.
   ^c
! !

!JSONReader methodsFor: 'private'!

nextJSONObject
   "I decode a json self to a value, which will be one of: nil,
true, false, OrderedCollection, Dictionary, String or Number
(i will return Integer or Float depending on the input)."
   | c |
   c := self peek.
   (c = $n) ifTrue: [ self next: 4. ^nil   ].
   (c = $t) ifTrue: [ self next: 4. ^true  ].
   (c = $f) ifTrue: [ self next: 5. ^false ].
   (c = ${) ifTrue: [ ^self nextJSONDict ].
   (c = $[) ifTrue: [ ^self nextJSONArray  ].
   (c = $") ifTrue: [ ^self nextJSONString ].
   ^self nextJSONNumber
!

nextJSONArray
   "I decode JSON arrays from self and will return a OrderedCollection for 
them."
   | c obj value |
   obj := OrderedCollection new.
   self next.
   [ c := self peek.
     (c = $]) ] whileFalse: [
      (c = $,) ifTrue: [ self next. ].
      value := self nextJSONObject.
      obj add: value.
   ].
   self next.
   ^obj
!

nextJSONDict
   "I decode JSON objects from self and will return a Dictionary containing all 
the key/value pairs."
   | c obj key value |
   obj := Dictionary new.
   self next.
   [ c := self peek.
     c = $} ] whileFalse: [
      (c = $,) ifTrue: [ self next ].

      key := self nextJSONString.

      c := self next.
      c = $: ifFalse: [
         self error: ('unexpected character found where name-seperator '':'' 
expected, found: %1' bindWith: c)
      ].

      value := self nextJSONObject.

      obj at: key put: value.
   ].
   self next.
   ^obj
!

nextJSONString
   "I'm extracting a JSON string from self and return them as String."
   | c obj str |
   str := ReadWriteStream on: UnicodeString new.
   self next.
   [
        c := stream next.
        c = $"
   ] whileFalse: [
      c = $\
         ifTrue: [
            c := stream next.
            c isNil ifTrue:
               [ ^self error: 'expected character, found end of self' ].
            c = $b ifTrue: [ c := 8 asCharacter ].
            c = $f ifTrue: [ c := 12 asCharacter ].
            c = $n ifTrue: [ c := Character nl ].
            c = $r ifTrue: [ c := Character cr ].
            c = $t ifTrue: [ c := Character tab ].
            c = $u
               ifTrue: [
                  c := (Integer readFrom: (stream next: 4) readStream radix: 
16) asCharacter ].
         ].
         str nextPut: c.
   ].

   "Same as 'str contents asString: self encoding', a little more efficient."
   "str reset. ^(I18N.EncodedStream encoding: str as: self encoding) contents"
   ^str contents asString: self encoding
!

nextJSONNumber
   "I'm extracting a number in JSON format from self and return Integer or 
Float depending on the input."
   | c num sgn int intexp frac exp isfloat |
   num := WriteStream on: (String new).

   isfloat := false.
   sgn     := 1.
   int     := 0.
   intexp  := 1.

   c := stream peek.
   (c isNil) ifTrue: [ ^self error: 'expected number or -sign, but found end of 
self' ].
   c = $- ifTrue: [ sgn := -1. stream next. ].

   c := stream peek.
   (c isNil) ifTrue: [ ^self error: 'expected number, but found end of self' ].
   (c isDigit or: [ c = $. ]) ifFalse: [ ^self error: 'invalid JSON input' ].

   [ c notNil and: [ c isDigit ] ] whileTrue: [
      stream next.
      int := sgn * (c digitValue) + (int * 10).
      c := stream peek
   ].
   (c isNil) ifTrue: [ ^int ].

   c = $. ifTrue: [
      stream next.
      isfloat := true.
      [ c := stream peek. c notNil and: [ c isDigit ] ] whileTrue: [
         sgn := sgn / 10.
         int := sgn * (c digitValue) + int.
         stream next
      ]
   ].

   exp := 0.
   ((c = $e) or: [ c = $E ]) ifFalse: [
        ^isfloat ifTrue: [ int asFloat ] ifFalse: [ int ] ].

   stream next.
   c := stream peek.
   (c isNil) ifTrue: [ ^int ].
   sgn := 1.
   c = $+ ifTrue: [ sgn :=  1. self next ].
   c = $- ifTrue: [ sgn := -1. self next ].

   [ c := stream peek. c notNil and: [ c isDigit ] ] whileTrue: [
      exp := (c digitValue) + (exp * 10).
      stream next
   ].

   int := int * (10 raisedToInteger: exp * sgn).
   ^int asFloat
! !

!Number methodsFor: 'json'!

jsonPrintOn: aStream
   "I return the Number in a JSON compatible format as String."
   self asFloat printOn: aStream
! !

!Float methodsFor: 'json'!

jsonPrintOn: aStream
   "I return the Number in a JSON compatible format as String."
   aStream nextPutAll:
        (self printString copyReplacing: self exponentLetter withObject: $e)
! !

!Integer methodsFor: 'json'!

jsonPrintOn: aStream
   "I return the Integer in a JSON compatible format as String."
   self printOn: aStream
! !

!Dictionary methodsFor: 'json'!

jsonPrintOn: ws
   "I encode my contents (key/value pairs) to a JSON object and return it as 
String."
   | f |
   ws nextPut: ${.
   f := true.
   self keysAndValuesDo: [ :key :val |
      f ifFalse: [ ws nextPut: $, ].
      key jsonPrintOn: ws.
      ws nextPut: $:.
      val jsonPrintOn: ws.
      f := false
   ].
   ws nextPut: $}.
! !

!CharacterArray methodsFor: 'json'!

jsonPrintOn: ws
   "I will encode me as JSON String and return a String containing my encoded 
version."
   ws nextPut: $".
   self do: [ :c || i |
      i := c asInteger.
      (((i = 16r20
         or: [ i = 16r21 ])
         or: [ i >= 16r23 and: [ i <= 16r5B ] ])
         or: [ i >= 16r5D ])
            ifTrue: [ ws nextPut: c ];
            ifFalse: [ | f |
               f := false.
               ws nextPut: $\.
               i = 16r22 ifTrue: [ f := true. ws nextPut: c ].
               i = 16r5C ifTrue: [ f := true. ws nextPut: c ].
               i = 16r2F ifTrue: [ f := true. ws nextPut: c ].
               i = 16r08 ifTrue: [ f := true. ws nextPut: $b ].
               i = 16r0C ifTrue: [ f := true. ws nextPut: $f ].
               i = 16r0A ifTrue: [ f := true. ws nextPut: $n ].
               i = 16r0D ifTrue: [ f := true. ws nextPut: $r ].
               i = 16r09 ifTrue: [ f := true. ws nextPut: $t ].
               f ifFalse: [
                  ws nextPut: $u.
                  ws nextPutAll: ('0000', i printString: 16) last: 4 ].
            ]
   ].
   ws nextPut: $".
!

!String methodsFor: 'json'!

jsonPrintOn: aStream
   "I will encode me as JSON String and return a String containing my encoded 
version."
   (self anySatisfy: [ :ch | ch value between: 128 and: 255 ])
        ifTrue: [ self asUnicodeString jsonPrintOn: aStream ]
        ifFalse: [ super jsonPrintOn: aStream ]! !

!SequenceableCollection methodsFor: 'json'!

jsonPrintOn: ws
   "I'm returning a JSON encoding of my contents as String."
   | f |
   ws nextPut: $[.
   f := true.
   self do: [ :val |
      f ifFalse: [ ws nextPut: $, ].
      val jsonPrintOn: ws.
      f := false
   ].
   ws nextPut: $].
! !

!UndefinedObject methodsFor: 'json'!

jsonPrintOn: aStream
   "I'm returning my corresponding value as JSON String."
   aStream nextPutAll: 'null'
! !

!Boolean methodsFor: 'json'!

jsonPrintOn: aStream
   "I'm returning the JSON String for truth or lie."
   self printOn: aStream
! !

!Object methodsFor: 'json'!

jsonPrintOn: aStream
    self subclassResponsibility
!

toJSON
    ^String streamContents: [ :aStream | self jsonPrintOn: aStream ]
! !


reply via email to

[Prev in Thread] Current Thread [Next in Thread]