freecats-dev

RE: [Freecats-Dev] About Unicode


From: Thierry Sourbier
Subject: RE: [Freecats-Dev] About Unicode
Date: Thu, 13 Feb 2003 12:43:24 +0100

Answers about Unicode:

- Unicode maps each character to a unique code. There is only one Unicode. The 
different versions are backward compatible; most software will support up to 
version 3.0 (version 3.1 introduced some characters that require more than 
16 bits to encode and that are rarely dealt with properly).

- There are indeed several encodings (ways to write the codes to disk), each 
with its own advantages and disadvantages. It is very easy to go from one 
encoding to another (it is just a mathematical formula), and software will 
often use several encodings (e.g. UTF-16 for the internal representation of 
strings, UTF-8 to exchange information via a socket). Most of the time this 
will be transparent to the developer.

- My guess is that Python will use UTF-16 internally (like most languages), so 
the encoding question only comes up during input/output. Decide that all 
communication happens in UTF-8 (if you have to choose, of course) and nobody 
will complain.
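
To make the three points above a bit more concrete, here is a minimal sketch.
It assumes a recent Python 3 interpreter, and the sample strings and the
receive/send helper names are purely illustrative, not part of any Free CATS
code. It shows one code per character, the same codes written in three
encodings, a character beyond 16 bits needing two UTF-16 units, and UTF-8
applied only at the input/output boundary.

```python
# Sketch only: the sample strings and helper names are illustrative assumptions.

text = "caf\u00e9 \u20ac"   # "café €": every character fits in 16 bits
clef = "\U0001D11E"          # MUSICAL SYMBOL G CLEF, a code above 16 bits

# Each character has exactly one code, whatever the encoding used later.
codes = [hex(ord(ch)) for ch in text]

# The same codes written out in three encodings (explicit byte order, no BOM).
as_utf8 = text.encode("utf-8")
as_utf16 = text.encode("utf-16-le")
as_utf32 = text.encode("utf-32-le")

# Going from one encoding to another is lossless: decode, then re-encode.
utf16_from_utf8 = as_utf8.decode("utf-8").encode("utf-16-le")

# A character beyond 16 bits takes two 16-bit units (a surrogate pair) in UTF-16.
clef_utf16_units = len(clef.encode("utf-16-le")) // 2


def receive(raw: bytes) -> str:
    """Input boundary: bytes coming in are decoded once, as UTF-8."""
    return raw.decode("utf-8")


def send(message: str) -> bytes:
    """Output boundary: strings going out are encoded once, as UTF-8."""
    return message.encode("utf-8")


if __name__ == "__main__":
    print(codes)
    print(len(as_utf8), len(as_utf16), len(as_utf32))
    print(utf16_from_utf8 == as_utf16)
    print(clef_utf16_units)
    print(receive(send(text)) == text)
```

Between receive() and send() the program only handles strings, so the choice
of encoding never leaks into the rest of the code.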

> Anyway, if it's too difficult to master, we may begin with a Windows ANSI
> version.

XML is based on Unicode. For a translation tool, I don't see Unicode support as 
being optional.

T.


-----Original Message-----
From: address@hidden
[mailto:address@hidden] On Behalf Of
Henri Chorand
Sent: Thursday, February 13, 2003 9:55 AM
To: Free CATS Dev List
Subject: [Freecats-Dev] About Unicode


Hi all,

Sooner or later, we'll have to learn more (well, more than what I actually
know) about Unicode.

A brief look at http://www.unicode.org/ convinced me brief is not enough.
The two-level FAQ (at http://www.unicode.org/faq/utf_bom.html) seems very
interesting.

For those with some spare time still, the reference book is freely available
online at:
http://www.unicode.org/uni2book/u2.html

A possible source of concern with Unicode is, there are just so many
flavors, as seen in the FAQ:
> Which do I need to be able to use from:
> UTF8, UTF16, UTF16LE, UTF16BE, UTF32,
> UTF32LE, UTF32BE?

Things seem to get worse when you read the answer:
> Hard to say. UTF-8 will be most common on the web.
> UTF16, UTF16LE, UTF16BE are used by Java and
> Windows.
> UTF32, UTF32LE, UTF32BE are used by various Unix
> systems.
> Luckily, the conversions between all of them are
> algorithmically based and fast.

And for the curious folks who want to experiment, you can use Notepad on
Windows 2000 / XP to save text files with one of the following options:
- ANSI
- Unicode
- Unicode big endian
- UTF-8

Well, as usual, if somebody happens to know Unicode well enough to provide a
few directions, please <shout mode on>DO SO !</shout mode off>

In a nutshell, what we need to know is:
- little endian/big endian issues between Macs, Windows PC & Unix boxes
(Linux/BSD PC for a start)
- how Python defaults on these (it would be handy if the language knows how
to manage these issues)
- "preferred" encodings within the above (I guess, one in which character
length does not vary)

A "typically optimistic" extract:
> Hybrid systems in which UTF-16 is used as a disk storage
> format but expanding to UTF-32 in memory is also a
> popular solution combining small long term storage space
> with ease of processing.

Had this stuff been designed with ease of use in mind... ;-)

Anyway, if it's too difficult to master, we may begin with a Windows ANSI
version.

Let me know your thoughts.


Regards,

Henri



_______________________________________________
Freecats-dev mailing list
address@hidden
http://mail.nongnu.org/mailman/listinfo/freecats-dev




