Re: new modules for Unicode normalization
From: Pádraig Brady
Subject: Re: new modules for Unicode normalization
Date: Sun, 22 Feb 2009 11:22:48 +0000
User-agent: Thunderbird 2.0.0.6 (X11/20071008)
Bruno Haible wrote:
> Hi Pádraig,
>
>> So I'm wondering now why normalization functionality isn't in iconv?
>> Seems like a big omission to me.
>
[snip valid points on iconv limitations]
>> There is a mention of it here:
>> http://www.archivum.info/address@hidden/2006-08/msg00004.html
>
> This page mentions that some vendor iconv don't even get
> iconv_open ("UTF-8", "UTF-8") implemented right. You see how little you
> can portably expect from iconv (unless you consider installing GNU libiconv).
>
>> Then I also noticed `uconv` which is in the "icu" package of fedora at least.
>> To normalize text the following worked for me:
>> uconv -x NFC < test.utf8
>>
>> So ... uconv already has it.
>> Do we really need another util in coreutils for this?
>
> ICU is certainly seminal, because it served as a testbed for the development
> of Unicode. But I shudder when I see these library sizes (ICU 3.6 on x86):
>
> $ size libicu*.so.*.0
>     text    data     bss      dec     hex filename
> 10152037     116       0 10152153  9ae8d9 libicudata.so.36.0
>  1215645   21760    1396  1238801  12e711 libicui18n.so.36.0
>    34402    2524      36    36962    9062 libicuio.so.36.0
>   245797    4644      88   250529   3d2a1 libicule.so.36.0
>    34011    1232       4    35247    89af libiculx.so.36.0
>   101228    1264       8   102500   19064 libicutu.so.36.0
>  1093450   28360    6364  1128174  1136ee libicuuc.so.36.0
>
> I cannot estimate how much of these 10 MB get actually loaded into a
> process' working set. 10 MB - this is 11 times the size of GNU libiconv
> with all its conversion tables!
$ uconv -x NFC&
$ sudo bin/ps_mem.py | grep uconv
Private + Shared = RAM used Program
1.9 MiB + 788.0 KiB = 2.7 MiB uconv
$ uconv -x NFC&
$ sudo bin/ps_mem.py | grep uconv
912.0 KiB + 2.2 MiB = 3.1 MiB uconv (2)
> The benefit of a reimplementation is that
> - It implements only the required specifications, does not carry the
> historical baggage of 10 years of ICU, hence smaller code and table
> sizes.
> - When you find a bug or limitation, you have higher chances of getting it
> fixed.
I don't doubt the usefulness of libiconv, though I'm still not sure
another "normalization util" is required when uconv is available.
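
For readers unfamiliar with what `uconv -x NFC` does to its input, here is a
minimal sketch of the same NFC (canonical composition) step using Python's
standard `unicodedata` module. It is not how uconv or the proposed gnulib
modules are implemented; the helper name `normalize_nfc` and the sample
string are assumptions made up for illustration.

```python
import unicodedata


def normalize_nfc(text: str) -> str:
    """Return text in Unicode Normalization Form C (hypothetical helper)."""
    return unicodedata.normalize("NFC", text)


# 'a' followed by U+0301 COMBINING ACUTE ACCENT: the decomposed spelling.
decomposed = "Pa\u0301draig"
composed = normalize_nfc(decomposed)

# NFC merges the base letter and the combining accent into U+00E1.
print(len(decomposed), len(composed))   # prints "8 7"
print(composed == "P\u00e1draig")       # prints "True"
```

Both strings render identically, but only the normalized form compares equal
byte for byte, which is the reason to normalize before comparing or sorting.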
thanks again for all the info,
Pádraig.