bug-gnu-libiconv

Re: [bug-gnu-libiconv] Fwd: Supporting Combining Diacritical Marks


From: Bruno Haible
Subject: Re: [bug-gnu-libiconv] Fwd: Supporting Combining Diacritical Marks
Date: Thu, 30 Jun 2022 02:08:22 +0200

Christian PERNOT wrote:

> We are using gnu libiconv in our alpine-based container, and we face some
> difficulties with some special characters.
> 
> These characters are using Unicode Combining Diacritical Marks like this
> one : https://www.fileformat.info/info/unicode/char/301/index.htm
> 
> I didn't know this behavior in Unicode, but it is a diacritical mark that
> is put after the base character, and they should be combined on display.
> 
> For example :  "é" exists in UTF8 as one character (0xC3 0xA9) :
> https://www.fileformat.info/info/unicode/char/e9/index.htm
> But it may be displayed the same way by having a "e" without accent (0x65),
> followed by the accent character (0xCC 0x81)

Yes, the first form is called NFC, the second one is called NFD. [1]

Find attached your sample in NFC and NFD, respectively.

Generally, text is exchanged between systems and between applications
in the NFC form [2]. This means that the NFD form is mostly used for
internal processing (e.g. searching, sorting) only.
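
To make the two forms concrete, here is a minimal sketch, assuming Python 3.8
or later and its standard unicodedata module (neither is part of libiconv).
It normalizes "é" both ways and prints the UTF-8 bytes, which match the byte
sequences quoted above.

```python
import unicodedata

# "e" followed by U+0301 COMBINING ACUTE ACCENT, composed into one code point
nfc = unicodedata.normalize("NFC", "e\u0301")
# U+00E9 LATIN SMALL LETTER E WITH ACUTE, decomposed into two code points
nfd = unicodedata.normalize("NFD", "\u00e9")

nfc_bytes = nfc.encode("utf-8")
nfd_bytes = nfd.encode("utf-8")

print(len(nfc), nfc_bytes.hex(" "))  # 1 c3 a9
print(len(nfd), nfd_bytes.hex(" "))  # 2 65 cc 81
```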

> There is no difference on display, but iconv won't accept to convert to
> ascii with or without transliteration
> 
> here is my attempt :
> 
> ~/local/bin/iconv -f UTF-8 -t ASCII//TRANSLIT ~/src/iconv.txt
> 
> Capture d'e
> /home/cpernot/local/bin/iconv: /home/cpernot/src/iconv.txt:1:11: ne peut
> convertir

Different iconv implementations produce different results. Let's see with
glibc first:

$ iconv -f UTF-8 -t ASCII//TRANSLIT < iconv_NFC.txt
Capture ecran 2020-03-24 a 10.51.25.png
$ iconv -f UTF-8 -t ASCII//TRANSLIT < iconv_NFD.txt
Capture ecran 2020-03-24 a 10.51.25.png

The output is the same.

Whereas libiconv produces:

$ iconv -f UTF-8 -t ASCII//TRANSLIT < iconv_NFC.txt
Capture 'ecran 2020-03-24 `a 10.51.25.png
$ iconv -f UTF-8 -t ASCII//TRANSLIT < iconv_NFD.txt
Capture e
/.../bin/iconv: (stdin):1:9: cannot convert

As you can see, GNU libiconv attempts to represent, not lose, the accent.
But in NFD form, since the accent comes after the letter, this would
require more complex processing. Since the advice is to pass only NFC
text to programs, it is not really worth it.
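
One possible way to follow that advice is to normalize the input to NFC before
handing it to iconv. The following is only a sketch, assuming Python 3 and its
standard unicodedata module; the helper name to_nfc and the sample string are
hypothetical, chosen to resemble the file name in the transcripts above.

```python
import unicodedata


def to_nfc(data: bytes) -> bytes:
    """Decode UTF-8 bytes, compose them to NFC, and re-encode as UTF-8.

    Hypothetical helper for illustration; it is not part of any iconv.
    """
    text = data.decode("utf-8")
    return unicodedata.normalize("NFC", text).encode("utf-8")


# Assumed sample in NFD form: "e" followed by U+0301 COMBINING ACUTE ACCENT.
sample_nfd = "Capture e\u0301cran".encode("utf-8")
sample_nfc = to_nfc(sample_nfd)

# The decomposed pair 65 cc 81 has become the single character c3 a9.
print(sample_nfc)  # b'Capture \xc3\xa9cran'
```

The resulting bytes are in the same form as iconv_NFC.txt, which both
implementations shown above convert without error.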

Bruno

[1] https://en.wikipedia.org/wiki/Unicode_equivalence#Normalization
[2] https://www.w3.org/International/questions/qa-html-css-normalization.en.html

Attachment: iconv_NFC.txt
Description: Text document

Attachment: iconv_NFD.txt
Description: Text document

