[Help-smalltalk] Re: Unicode problem on parsing XML
From: Paolo Bonzini
Subject: [Help-smalltalk] Re: Unicode problem on parsing XML
Date: Wed, 05 May 2010 18:43:05 +0200
User-agent: Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.9.1.9) Gecko/20100330 Fedora/3.0.4-1.fc12 Lightning/1.0b2pre Thunderbird/3.0.4
On 05/05/2010 05:36 AM, Bèrto ëd Sèra wrote:
> Now I'm using gst 3.2 and iliad 0.8. What I get from the following code:
> content := 'taxonomy.xml' asFile.
> parser := XML.XMLParser new.
> parser validate: false.
> parser parse: content readStream.
Not related, but it's best to use "XML.SAXParser defaultParserClass
new". Unlike XMLParser, other parsers may not construct the DOM by
default, so you end up with:
PackageLoader fileInPackage: 'XML-XMLParser'.
content := 'taxonomy.xml' asFile.
parser := XML.SAXParser defaultParserClass new.
parser validate: false.
parser saxDriver: (driver := XML.DOM_SAXDriver new).
parser parse: content readStream.
driver document
This won't fix the bug but will make you a good citizen (see NEWS file
in GST 3.2).
> the breakers are, for example:
> 1)...the æ and œ ligatures, ...
> 2) Devanāgarī script for Hindi
> 3) Japanese Rōmaji script
The breaker is _entities_, not characters.
> I was wondering what changed... or, most probably, what kind of silly
> mistake I'm making...
Nothing, it's a bug. The easiest way to fix it is to use the XML-Expat
package. You just have to replace the first line above with these two:
PackageLoader fileInPackage: 'XML-Expat'.
PackageLoader fileInPackage: 'XML-DOM'.
It's _thousands_ of times faster too.
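Spelled out in full, that would be roughly the following. This is only the earlier snippet with its first line swapped for the two above, not something tested here, and it assumes that loading XML-Expat is what makes "XML.SAXParser defaultParserClass" answer the Expat-based parser:

```smalltalk
"Load the Expat-based parser and the DOM classes instead of XML-XMLParser."
PackageLoader fileInPackage: 'XML-Expat'.
PackageLoader fileInPackage: 'XML-DOM'.

content := 'taxonomy.xml' asFile.
parser := XML.SAXParser defaultParserClass new.
parser validate: false.

"Ask for a DOM explicitly, since not every parser builds one by default."
parser saxDriver: (driver := XML.DOM_SAXDriver new).
parser parse: content readStream.
driver document
```

The last expression answers the parsed document, the same as in the XML-XMLParser version.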
But if you insist, this patch fixes it:
diff --git a/packages/xml/parser/XML.st b/packages/xml/parser/XML.st
index 309cf36..a9ebb7f 100644
--- a/packages/xml/parser/XML.st
+++ b/packages/xml/parser/XML.st
@@ -2950,7 +2950,7 @@ Instance Variables:
ifTrue:
[sax fatalError: (BadCharacterSignal new
messageText: 'A character with Unicode value %1 is not legal' % {n})].
- data nextPut: (Character value: n).
+ data display: (Character codePoint: n).
self getNextChar
]
diff --git a/packages/xml/parser/package.xml b/packages/xml/parser/package.xml
index 2e0bcce..fc72811 100644
--- a/packages/xml/parser/package.xml
+++ b/packages/xml/parser/package.xml
@@ -13,6 +13,7 @@
<prereq>XML-SAXParser</prereq>
<prereq>XML-DOM</prereq>
+ <prereq>Iconv</prereq>
<filein>XML.st</filein>
<file>XML.st</file>
Paolo