bug#42857: sed: handling utf8 non-breaking space 0xA0
From: Assaf Gordon
Subject: bug#42857: sed: handling utf8 non-breaking space 0xA0
Date: Fri, 14 Aug 2020 20:46:08 -0600
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:68.0) Gecko/20100101 Thunderbird/68.10.0
tags 42857 notabug
close 42857
stop
Hello,
Thank you for sending a detailed bug report; it makes troubleshooting
much easier.
On 2020-08-13 8:22 p.m., Dennis Nezic wrote:
> I'm not sure if this is a bug. It has to do with the weird utf8(?)
> character with hex code 0xa0.
There's an important issue here:
The unicode character "NO-BREAK SPACE" has a code-point value of 0xA0
(usually written as "U+00A0").
"UTF-8", however, is an encoding of unicode (just as UTF-16 is a
different encoding of unicode). It is a way to represent unicode
code-points as sequences of one to four bytes; only ASCII code-points
are encoded as a single byte.
In "UTF-8" the unicode character "NO-BREAK SPACE" (U+00A0) is encoded as
two bytes: 0xC2 0xA0.
See more details here: https://codepoints.net/U+00A0
See some Q&A regarding unicode-vs-utf8 here:
https://stackoverflow.com/q/643694
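To make the code-point-vs-encoding difference concrete, here is a minimal
sketch of the two-byte UTF-8 rule done with shell arithmetic. The function
name "utf8_two_bytes" is made up for this illustration, and it only
handles code-points in the range U+0080..U+07FF (which includes U+00A0):

```shell
# Two-byte UTF-8 layout: 110xxxxx 10xxxxxx
# The lead byte carries the upper bits of the code-point,
# the continuation byte carries the low six bits.
# (hypothetical helper, valid only for U+0080..U+07FF)
utf8_two_bytes() {
  printf '%02x %02x\n' "$(( 0xC0 | ($1 >> 6) ))" "$(( 0x80 | ($1 & 0x3F) ))"
}

utf8_two_bytes 0xA0    # prints: c2 a0
```

Note that the second byte happens to be 0xA0 again, which is probably why
the code-point and the encoded byte are so easy to confuse here.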
The byte "0xA0" by itself is not a valid UTF-8 sequence: it is a
continuation byte, which may only follow a lead byte such as 0xC2.
This means that if your current locale is UTF-8,
and you have a string with 0xA0 in it by itself, it is considered an
invalid string (or at least not a valid text string, but valid binary
data).
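One way to check whether some input is valid UTF-8 is to let "iconv(1)"
convert it from UTF-8 to UTF-8 and look only at the exit status. This is a
rough sketch; the function name "is_valid_utf8" is made up for this example:

```shell
# Succeeds (exit 0) only if stdin decodes cleanly as UTF-8.
# The converted output and any error message are thrown away.
is_valid_utf8() {
  iconv -f UTF-8 -t UTF-8 >/dev/null 2>&1
}

# 0xC2 0xA0 (octal 302 240) is the proper encoding of U+00A0
printf 'te\302\240st\n' | is_valid_utf8 && echo "valid"
# a lone 0xA0 (octal 240) is not
printf 'te\240st\n' | is_valid_utf8 || echo "invalid"
```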
Many programs (GNU sed included) do not match invalid UTF-8 bytes
in their regular expressions when running in a UTF-8 locale.
That is, the following simple regex of "." (any character) will NEVER
match an invalid UTF-8 byte:
printf "\xA0" | LC_ALL=en_CA.utf8 sed 's/./x/'
It will be matched if you force a C/POSIX-locale, in which every single
byte is valid:
printf "\xA0" | LC_ALL=C sed 's/./x/'
[...]
> But it can't do a proper substitution/regex with it, for example:
> echo $'hello\nte\xA0st\nworld' | sed 2s,^t.*,x,
> it seems to interpret 0xa0 as the end of the line.
With the above explanation (i.e. "0xA0" is not a valid character in a
UTF-8 locale), it becomes clear why the 'sed' command isn't working as you
expected: it's not that "0xA0" is treated as an "end of line";
rather, "^t.*" only matches "te", because the invalid byte "0xA0"
causes the regex engine to stop matching.
If you want to treat any byte value as a valid character, you can force
C/POSIX locale:
$ echo $'hello\nte\xA0st\nworld' | LC_ALL=C sed 2s,^t.*,x,
hello
x
world
But of course then you'd lose the ability to handle multibyte UTF-8
characters as a single character.
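A small sketch of that trade-off: in the C locale "." matches one byte at
a time, so a valid two-byte character is seen as two separate characters.
The helper name "mask_bytes" is made up for this example:

```shell
# Replace every "character" with an x while forcing the C locale,
# where each byte counts as one character.
mask_bytes() {
  LC_ALL=C sed 's/./x/g'
}

# U+00E9 (e with acute accent) is the two bytes C3 A9 (octal 303 251)
printf '\303\251\n' | mask_bytes    # prints: xx
```

In a working UTF-8 locale the same substitution would be expected to give
a single "x" for that input.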
---
If you want to discard invalid byte values but keep valid UTF-8
characters, the "iconv(1)" program can help to some extent:
$ echo $'he\xE2\x98\xBAllo\nte\xA0st\nworld' \
| iconv -f utf8 -t utf8//IGNORE
he☺llo
test
world
iconv: illegal input sequence at position 21
In the above example the bytes "E2 98 BA" are the valid UTF8 encoding
of unicode codepoint "U+263A WHITE SMILING FACE"
https://codepoints.net/U+263A
They are kept in the output stream, while the invalid "0xA0" is
discarded.
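The same idea can be wrapped in a small filter. This sketch uses iconv's
"-c" option (omit invalid characters from the output) rather than the
"//IGNORE" suffix; I'm assuming your iconv implements "-c" the same way
for this input. The function name "strip_invalid_utf8" is made up here:

```shell
# Drop bytes that are not valid UTF-8, keep everything else.
# iconv may still complain and exit non-zero after dropping input,
# so the message is hidden and the status is ignored on purpose.
strip_invalid_utf8() {
  iconv -c -f UTF-8 -t UTF-8 2>/dev/null || :
}

printf 'he\342\230\272llo\nte\240st\nworld\n' | strip_invalid_utf8
```

Hiding the exit status is convenient in a pipeline, but it also means you
will not be told that anything was dropped.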
---
You are using the "echo" command with $'' to explicitly add
hex values into a string. Note that bash's $'' quoting understands
unicode escapes directly (not just raw byte values), so using something
like this:
$ echo $'te\u00A0st'
te st
allows you to specify unicode code-points (e.g. "U+00A0") instead of their
UTF-8 encoding, and bash will generate the character in the current
locale's encoding:
$ echo $'te\u00A0st' | od -tx1c -An
74 65 c2 a0 73 74 0a
t e 302 240 s t \n
See:
https://www.gnu.org/software/bash/manual/html_node/ANSI_002dC-Quoting.html#ANSI_002dC-Quoting
And lastly,
echo with $'' is not portable (although very convenient when using bash
interactively). Using "printf" instead will work similarly, and be more
portable:
$ printf 'te\u00A0st\n'
te st
$ printf 'te\xC2\xA0st\n'
te st
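If portability matters a lot, note that POSIX only specifies octal escapes
("\ddd") for the printf format string; "\x" and "\u" are extensions that
bash and GNU printf happen to support. A sketch of the same string with
octal escapes only (the function name "nbsp_line" is made up here):

```shell
# 0xC2 0xA0 written as octal 302 240: the UTF-8 encoding of U+00A0
nbsp_line() {
  printf 'te\302\240st\n'
}

# dump the bytes to confirm what was generated
nbsp_line | od -An -tx1
```

The dump should show the bytes 74 65 c2 a0 73 74 0a, the same as the
bash $'te\u00A0st' example above.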
I hope this helps.
I'm closing this as "not a bug", but discussion can continue
by replying to this thread.
regards,
- assaf