Re: new modules for Unicode normalization
From: Bruno Haible
Subject: Re: new modules for Unicode normalization
Date: Sun, 22 Feb 2009 03:31:28 +0100
User-agent: KMail/1.9.9
Hi Pádraig,
> So I'm wondering now why normalization functionality isn't in iconv?
> Seems like a big omission to me.
1) Not every filter-like piece of functionality belongs in iconv.
Unicode normalization forms? Removal of accents? Case conversions?
Transliteration from one script to another (e.g. the recode-sr-latin
program)? Logical-to-visual reordering for bidi? Where do you draw
the line?
2) The specification of the iconv() function assumes that one character on
input corresponds to one character on output. This leads to contortions
in converters like EUC-JISX0213 or CP1258, where the notion of "one
character" is context dependent. When you deal with decomposition and
canonical reordering, these specification problems are aggravated.
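A small illustration of how normalization breaks the one-character-in,
one-character-out model (Python's unicodedata module is used here purely for
demonstration; iconv itself offers no such API):

```python
import unicodedata

# One precomposed character decomposes into two code points under NFD,
# and two code points compose into one under NFC -- neither direction
# preserves a 1:1 character correspondence.
s = "\u00e9"                        # LATIN SMALL LETTER E WITH ACUTE
nfd = unicodedata.normalize("NFD", s)
print(len(s), len(nfd))             # 1 2

# Canonical reordering makes the result context dependent: which mark
# comes out first depends on the combining classes of all the marks.
t = "q\u0307\u0323"                 # dot above, then dot below
print(unicodedata.normalize("NFC", t))  # q + dot below + dot above: reordered
```

Note that in the second example the output for U+0307 depends on the mark that
follows it, which is exactly the kind of context dependence the iconv()
specification is ill-suited for.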
3) iconv() cannot be extended portably. Some vendor iconv implementations
can only be extended through tables, GNU libiconv only by changing the
source code, and GNU libc only by creating specially crafted shared objects
(I don't know whether anyone has ever done this). Whereas when you write a
filter as a separate program, it is immediately available on all systems.
4) In glibc, IIRC, the conversion state inside an iconv_t has a fixed,
limited size. But Unicode normalization, as well as bidi reordering,
requires an unbounded amount of temporary space.
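To see why the state is unbounded: a base character followed by arbitrarily
many combining marks forms a single combining sequence, and canonical
reordering may move the last mark of that sequence in front of the first one,
so a streaming implementation must buffer the entire sequence before emitting
anything. A sketch (again using Python's unicodedata only for illustration):

```python
import unicodedata

# Alternating "dot below" (ccc 220) and "dot above" (ccc 230) marks:
# canonical ordering must move every below-mark in front of every
# above-mark, so no output is possible until the whole run has been read.
for n in (10, 1000):
    seq = "a" + "\u0323\u0307" * n
    out = unicodedata.normalize("NFD", seq)
    # stable sort by combining class: all 220s first, then all 230s
    assert out == "a" + "\u0323" * n + "\u0307" * n
```

Since n can be made as large as the input allows, no fixed-size conversion
state can hold the pending marks.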
> There is a mention of it here:
> http://www.archivum.info/address@hidden/2006-08/msg00004.html
This page mentions that some vendor iconv implementations don't even get
iconv_open ("UTF-8", "UTF-8") right. You see how little you can portably
expect from iconv (unless you consider installing GNU libiconv).
> Then I also noticed `uconv` which is in the "icu" package of fedora at least.
> To normalize text the following worked for me:
> uconv -x NFC < test.utf8
>
> So ... uconv already has it.
> Do we really need another util in coreutils for this?
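For comparison, the NFC step that the quoted uconv command performs can be
reproduced in a few lines without ICU at all, e.g. with Python's standard
unicodedata module (nfc_filter is a hypothetical name, not an existing tool):

```python
import sys
import unicodedata

def nfc_filter(data: bytes) -> bytes:
    """NFC-normalize UTF-8 bytes, analogous to `uconv -x NFC`."""
    return unicodedata.normalize("NFC", data.decode("utf-8")).encode("utf-8")

if __name__ == "__main__":
    # Filter stdin to stdout, e.g.: python nfc_filter.py < test.utf8
    sys.stdout.buffer.write(nfc_filter(sys.stdin.buffer.read()))
```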
ICU is certainly seminal, because it served as a testbed for the development
of Unicode. But I shudder when I see these library sizes (ICU 3.6 on x86):
$ size libicu*.so.*.0
text data bss dec hex filename
10152037 116 0 10152153 9ae8d9 libicudata.so.36.0
1215645 21760 1396 1238801 12e711 libicui18n.so.36.0
34402 2524 36 36962 9062 libicuio.so.36.0
245797 4644 88 250529 3d2a1 libicule.so.36.0
34011 1232 4 35247 89af libiculx.so.36.0
101228 1264 8 102500 19064 libicutu.so.36.0
1093450 28360 6364 1128174 1136ee libicuuc.so.36.0
I cannot estimate how much of these 10 MB actually gets loaded into a
process's working set. 10 MB - that is 11 times the size of GNU libiconv
with all its conversion tables!
The benefits of a reimplementation are:
- It implements only the required specifications, without the historical
baggage of 10 years of ICU, hence smaller code and table sizes.
- When you find a bug or limitation, you have a higher chance of getting
it fixed.
Bruno