Re: [Nano-devel] nano and mixed encodings


From: Mike Frysinger
Subject: Re: [Nano-devel] nano and mixed encodings
Date: Mon, 20 Jul 2015 23:08:12 -0400

On 20 Jul 2015 22:02, Benno Schulenberg wrote:
> A little while ago, Mike reported some surprising search behaviour [1]
> of nano when a file contains a mix of both UTF-8 and ISO-8859-1 encoded
> characters: in a UTF-8 locale (which should be the default for most people
> nowadays), nano will find both the "misencoded" ISO-8859-1 character and
> the proper UTF-8 one.
> 
> [1] https://lists.gnu.org/archive/html/nano-devel/2009-02/msg00018.html

man that was a while ago ... i'd completely forgotten about it :).  i've
also got a bit more experience with (and opinions about) encodings now ...

> For example, do:
> 
>   echo "0000000: 2020 c3ba 2020 c3bc 2020 fa20 20fc" | xxd -r >foo
> 
> and then open 'nano foo' and see how it shows:
> 
>   ú  ü  �  �
> 
> If you then search for ú or ü, nano will find each of them twice.
> Which is strange, because you can't even see what these "misencoded"
> characters are.
> 
> In a reply, Chris suggests that nano should do what vim and emacs
> do in this case.  Well..., running 'vim foo' shows this:
> 
>   ú  ü  ú  ü
> 
> and searching for ú or ü will only find the misencoded second one.
> Vim apparently autoconverts the file when it finds bytes in there
> that are not valid UTF-8 and then assumes it to be ISO-8859-x.

i wouldn't mind nano doing charset conversion if it was controllable, but
i wonder how useful it is in practice anymore.  as time moves on, i feel
like non-UTF-8 encodings are getting more and more uncommon (both as the
file format and as the user's active locale settings).  thus adding calls
to iconv and such in the background would create rarely exercised code
paths and might not be worth the effort ?
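
for reference, the kind of fallback conversion being discussed can be
sketched in a few lines.  this is only an illustration in Python, not
what vim or nano actually do internally: the function name and the
choice of ISO-8859-1 as the fallback are assumptions made for the example.

```python
from typing import Tuple

# The 14 bytes produced by the xxd command quoted above.
SAMPLE = bytes.fromhex("2020c3ba2020c3bc2020fa2020fc")


def decode_with_fallback(raw: bytes, fallback: str = "iso-8859-1") -> Tuple[str, str]:
    """Decode as UTF-8; if any byte is invalid, reinterpret the WHOLE
    buffer in the fallback charset (hypothetical helper)."""
    try:
        return raw.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        # ISO-8859-1 maps every byte to a character, so this cannot fail.
        return raw.decode(fallback), fallback


text, used = decode_with_fallback(SAMPLE)
print(used)
print(text)
```

note that the fallback applies to the whole buffer, so the valid UTF-8
sequences get reinterpreted too (c3 ba becomes two characters), which is
one reason such conversion would need to be controllable.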

> Running 'emacs foo' shows this:
> 
>   \303ş  \303ĵ  ú  ü
> 
> (That is in my Esperanto locale; in other UTF-8 locales it will show
> the same as vim.)  Searching finds, of course, just one ú or ü.

esperanto isn't an encoding (jokes aside ;]).  what did you have locale
set to exactly there ?  LANG=xx_yy.ZZZ ?

> Pico will show this:
> 
>   ú  ü  ?
> 
> (Yes, just one question mark.)  And again, searching will find,
> of course, just one ú or ü.
> 
> So... when nano wants to be like Pico, it should find only the
> validly encoded ú and ü.  The patch attached to the following
> rereported bug (https://savannah.gnu.org/bugs/?45579) does this.

the replacement character depends somewhat on the terminal encoding.
unicode has a dedicated replacement character for exactly this (U+FFFD) [1],
which would be shown for each invalid incoming byte.  so the display
(when the terminal is using UTF-8) ideally would be:
  ú  ü  �  �

for other encodings that lack such a dedicated character, i guess
a plain question mark is the best we can do ?
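
a minimal sketch of that display rule, in Python rather than nano's C.
the function name and the terminal_encoding parameter are made up for
the example.  one caveat: Python's "replace" handler substitutes one
U+FFFD per malformed sequence, which matches "one per invalid byte" for
the isolated bytes in the sample but not necessarily for a truncated
multi-byte sequence.

```python
# The 14 bytes produced by the xxd command quoted above.
SAMPLE = bytes.fromhex("2020c3ba2020c3bc2020fa2020fc")


def render_for_terminal(raw: bytes, terminal_encoding: str = "utf-8") -> str:
    """Return the text to display for raw file bytes (hypothetical helper)."""
    # Invalid UTF-8 is shown as U+FFFD REPLACEMENT CHARACTER.
    text = raw.decode("utf-8", errors="replace")
    try:
        "\ufffd".encode(terminal_encoding)
    except UnicodeEncodeError:
        # The terminal charset has no replacement character: use "?".
        text = text.replace("\ufffd", "?")
    return text


print(render_for_terminal(SAMPLE))                # UTF-8 terminal
print(render_for_terminal(SAMPLE, "iso-8859-1"))  # Latin-1 terminal
```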

wrt matching, i think what you're going for with that patch is what
we should be doing -- only match validly encoded characters.  whether
that patch is the best way to accomplish it in the nano codebase i have
no idea, as i'm not familiar enough with it to say.
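
to illustrate the matching rule (this is not the patch from the bug
report, just a sketch in Python with a made-up function name): if the
invalid bytes are kept distinct from real characters while decoding,
a search for ú or ü can never hit them.  here that is done with
Python's "surrogateescape" handler, which maps each undecodable byte
to a lone surrogate code point.

```python
from typing import List

# The 14 bytes produced by the xxd command quoted above.
SAMPLE = bytes.fromhex("2020c3ba2020c3bc2020fa2020fc")


def find_valid(raw: bytes, needle: str) -> List[int]:
    """Return character positions where needle occurs as validly
    encoded UTF-8 (hypothetical helper)."""
    # Stray bytes become U+DC80..U+DCFF, which never equal real text.
    text = raw.decode("utf-8", errors="surrogateescape")
    positions = []
    start = text.find(needle)
    while start != -1:
        positions.append(start)
        start = text.find(needle, start + 1)
    return positions


print(find_valid(SAMPLE, "ú"))  # only the c3 ba sequence matches
print(find_valid(SAMPLE, "ü"))  # only the c3 bc sequence matches
```

the positions are character offsets in the decoded text, not byte
offsets in the file.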

[1] https://en.wikipedia.org/wiki/Specials_(Unicode_block)#Replacement_character
-mike
