Re: [Nano-devel] [patch] properly show invalid byte sequences in UTF-8


From: Benno Schulenberg
Subject: Re: [Nano-devel] [patch] properly show invalid byte sequences in UTF-8
Date: Sat, 18 Apr 2015 18:38:26 +0200

On Mon, Apr 13, 2015, at 21:49, Benno Schulenberg wrote:
> When doing for example:
> 
>     echo "0000000: 20c2 bb6f 6f6f 20c2 7878 78" | xxd -r >botched
> 
> and then opening the file 'botched' in nano (in a UTF-8 locale),
> it will show:
> 
>  »ooo »xxx
> 
> But the second guillemet isn't really there (if you search for it, the
> first one will be the only occurrence), it is just a ghost.

A better example, without the distracting o's and x's, would be:

  echo "0000000: c2bb 2020 c220 " | xxd -r >botched

The second c2 is followed by an invalid byte: 20.  Such a successor
byte should be in the range 80 - bf.  One might expect a space (20)
to be displayed, but what happens is that nano picks up the bb of
the preceding multibyte sequence and so displays another guillemet
(UTF-8 code 0xc2 0xbb, code point U+00BB, "»") and then goes on
to display also the space (20).
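
To make the rule about successor bytes concrete, here is a minimal Python sketch (not nano's code; the helper names is_continuation and find_bad_leads are made up for illustration). It only handles two-byte lead bytes (c2 - df), flags each one that is not followed by a byte in the range 80 - bf, and then shows what Python's own decoder does with the same six bytes:

```python
def is_continuation(byte):
    """True if byte has the form 10xxxxxx, i.e. lies in 0x80..0xBF."""
    return (byte & 0xC0) == 0x80


def find_bad_leads(buf):
    """Return offsets of two-byte lead bytes (0xC2..0xDF) that are not
    followed by a valid continuation byte. Hypothetical helper; longer
    sequences (lead bytes 0xE0 and up) are deliberately ignored here."""
    bad = []
    i = 0
    while i < len(buf):
        b = buf[i]
        if 0xC2 <= b <= 0xDF:
            if i + 1 < len(buf) and is_continuation(buf[i + 1]):
                i += 2          # well-formed pair, skip both bytes
                continue
            bad.append(i)       # lead byte without a valid successor
        i += 1
    return bad


# The same bytes that the xxd command above writes to 'botched'.
data = bytes.fromhex("c2bb2020c220")

bad_offsets = find_bad_leads(data)

# A strict decoder substitutes U+FFFD for the stray c2 and keeps the space.
decoded = data.decode("utf-8", errors="replace")

print(bad_offsets)      # [4]
print(repr(decoded))    # '»  \ufffd '
```

So the stray c2 at offset 4 yields a replacement character followed by the space, and no second guillemet appears.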

If instead of the second c2 one puts a c3, UTF-8 code 0xc3 0xbb will
get displayed: code point U+00FB, a small u with a circumflex, "û".
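
The two code points mentioned above follow from the standard two-byte UTF-8 layout (110xxxxx 10yyyyyy). As a small Python sketch of that arithmetic, with a hypothetical helper name and no claim about how nano computes it:

```python
def two_byte_code_point(lead, cont):
    """Combine a two-byte UTF-8 sequence into a code point:
    the low 5 bits of the lead byte, then the low 6 bits of the
    continuation byte. No validation is done here."""
    return ((lead & 0x1F) << 6) | (cont & 0x3F)


guillemet = two_byte_code_point(0xC2, 0xBB)   # 0x02 << 6 | 0x3B
u_circumflex = two_byte_code_point(0xC3, 0xBB)  # 0x03 << 6 | 0x3B

print(hex(guillemet), chr(guillemet))         # 0xbb »
print(hex(u_circumflex), chr(u_circumflex))   # 0xfb û
```

The continuation byte bb contributes the same low six bits in both cases; only the lead byte changes the result from U+00BB to U+00FB.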

Benno
