discuss-gnustep

Re: Internationalisation and fonts - a suggestion


From: Richard Frith-Macdonald
Subject: Re: Internationalisation and fonts - a suggestion
Date: Tue, 5 Aug 2003 14:03:15 +0100


On Tuesday, August 5, 2003, at 12:26 PM, Pete French wrote:

and typed `ls`. I tracked it down to a single file, which had an
LATIN SMALL LETTER A WITH ACUTE in seemingly malformed UTF-8. Once
GWorkspace hit this, it went crazy and started missing some of the

Now that might explain things. I certainly have a lot of interesting
garbage in my home directory. I was also looking at the internal
implementation of the UTF8 translation code last night and it doesn't
seem that robust (see the bug I posted for starters). It should be
simple enough to recode that to skip garbage UTF8 sequences rather than
barfing over the whole string and returning nil - which might help this
problem, maybe?

But that would be *very* wrong. Conversion to/from character sets needs
to fail if the conversion is not possible, rather than trying to guess
what the correct results are. If we skip unintelligible rubbish while
converting, the application has no way of telling that there is a
problem... we have to fail when there is rubbish in the string, so the
application can do something about the problem.
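
To illustrate the "fail rather than guess" behaviour, here is a minimal
sketch in plain C of a strict UTF-8 check. It is not the GNUstep
conversion code; the function names utf8_decode_one and utf8_is_valid
are made up for this example. The point is only that any malformed
sequence makes the whole check fail, so the caller finds out about it.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not a GNUstep function.
 * Decode one UTF-8 sequence from s (len bytes available).
 * Returns the number of bytes consumed (1..4) and stores the code
 * point in *out, or returns 0 if the sequence is malformed. */
static size_t utf8_decode_one(const unsigned char *s, size_t len,
                              uint32_t *out)
{
    size_t need, i;
    uint32_t cp, min;

    if (len == 0) return 0;
    if (s[0] < 0x80) { *out = s[0]; return 1; }
    else if ((s[0] & 0xE0) == 0xC0) { need = 2; cp = s[0] & 0x1F; min = 0x80; }
    else if ((s[0] & 0xF0) == 0xE0) { need = 3; cp = s[0] & 0x0F; min = 0x800; }
    else if ((s[0] & 0xF8) == 0xF0) { need = 4; cp = s[0] & 0x07; min = 0x10000; }
    else return 0;                      /* stray continuation or bad lead byte */

    if (len < need) return 0;           /* truncated sequence */
    for (i = 1; i < need; i++) {
        if ((s[i] & 0xC0) != 0x80) return 0;   /* not a continuation byte */
        cp = (cp << 6) | (uint32_t)(s[i] & 0x3F);
    }
    if (cp < min) return 0;                     /* overlong encoding */
    if (cp >= 0xD800 && cp <= 0xDFFF) return 0; /* surrogate code point */
    if (cp > 0x10FFFF) return 0;                /* beyond the Unicode range */
    *out = cp;
    return need;
}

/* Returns 1 if the whole buffer is well-formed UTF-8, 0 otherwise.
 * Nothing is skipped: the first bad sequence fails the whole buffer. */
int utf8_is_valid(const unsigned char *s, size_t len)
{
    size_t pos = 0;
    while (pos < len) {
        uint32_t cp;
        size_t used = utf8_decode_one(s + pos, len - pos, &cp);
        if (used == 0) return 0;
        pos += used;
    }
    return 1;
}
```

A file name containing a lone 0xE1 byte (LATIN SMALL LETTER A WITH ACUTE
in Latin-1) fails this check, and the application can then decide
whether to try another encoding or report the problem.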

For instance, with regard to your bug report ... it is certainly true
that GNUstep only supports the Unicode base plane (until someone wants
to change that) ... but it's not a bug in the UTF-8 conversion. Rather,
the Unicode strings in GNUstep are UCS-2, so the conversion code is
written to reject UTF-8 data containing characters which can't be
represented in UCS-2.
i.e. we need to do an audit of all the Unicode support and update it to
work with a UTF-16 internal format rather than a UCS-2 format (and do so
in a way that is compatible with MacOS-X) before we can change the
character set conversion code. To do it the other way round would
merely introduce a lot of subtle bugs in place of a simple limitation.
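
As a rough sketch of that limitation (again plain C, with a hypothetical
function name, not the actual GNUstep code): a conversion into a UCS-2
buffer has exactly one 16-bit unit per character, so anything above
U+FFFF has nowhere to go and the conversion must be refused.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not a GNUstep function.
 * Copy n already-decoded code points into a UCS-2 buffer.
 * Returns 1 on success, 0 if any code point cannot be stored as a
 * single 16-bit unit. out must have room for n units. */
int code_points_to_ucs2(const uint32_t *in, size_t n, uint16_t *out)
{
    size_t i;
    for (i = 0; i < n; i++) {
        uint32_t cp = in[i];
        if (cp > 0xFFFF) return 0;                  /* outside the base plane */
        if (cp >= 0xD800 && cp <= 0xDFFF) return 0; /* surrogate, not a character */
        out[i] = (uint16_t)cp;
    }
    return 1;
}
```

With this rule, U+00E1 converts and U+1D11E (a character outside the
base plane) is rejected, which is the simple limitation described above.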

I think, when the original OpenStep spec was written, UTF-16 did not
exist and Unicode had a 16-bit representation for *all* characters... so
the NSString API presumes that a Unicode character is a single 16-bit
value. With Apple using a variable length (UTF-16) representation for
modern Unicode, we want to make sure that a future GNUstep maps that API
to a UTF-16 internal representation in the same way that Apple does.
e.g. if we have a string containing a single UTF-16 character occupying
4 bytes, and we use the -length and -characterAtIndex: methods, will the
string appear to contain two characters or one?
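
The two possible answers can be made concrete with a small C sketch (the
function names are invented for this example). It encodes one character
from outside the base plane as a UTF-16 surrogate pair, then counts it
both ways: as 16-bit units and as whole characters. As far as I know,
Apple documents -length as the number of UTF-16 code units, which would
correspond to the first count, but that should be checked as part of the
audit.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: encode one code point as UTF-16.
 * Returns the number of 16-bit units written (1 or 2),
 * or 0 if cp is not a valid Unicode scalar value. */
size_t utf16_encode(uint32_t cp, uint16_t out[2])
{
    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) return 0;
    if (cp < 0x10000) {
        out[0] = (uint16_t)cp;          /* base plane: one unit */
        return 1;
    }
    cp -= 0x10000;                      /* 20 bits remain */
    out[0] = (uint16_t)(0xD800 + (cp >> 10));    /* high surrogate */
    out[1] = (uint16_t)(0xDC00 + (cp & 0x3FF));  /* low surrogate */
    return 2;
}

/* Hypothetical helper: count characters (code points) in a UTF-16
 * buffer of the given number of units. A well-formed surrogate pair
 * counts once; an unpaired surrogate is counted as one on its own. */
size_t utf16_count_code_points(const uint16_t *s, size_t units)
{
    size_t count = 0, i = 0;
    while (i < units) {
        int is_pair = s[i] >= 0xD800 && s[i] <= 0xDBFF
                   && i + 1 < units
                   && s[i + 1] >= 0xDC00 && s[i + 1] <= 0xDFFF;
        i += is_pair ? 2 : 1;
        count++;
    }
    return count;
}
```

For U+1D11E, utf16_encode writes two units (0xD834, 0xDD1E), so a
unit-based -length would report 2 while utf16_count_code_points reports
1. Whichever convention is chosen, -characterAtIndex: has to agree with
it.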

I'd welcome anyone volunteering to do the coding for that move from
UCS-2 to UTF-16 (and enhancing NSCharacterSet to support UCS-4 instead
of UCS-2).




