bug#22237: sed no longer removes high-ascii characters as it did formerly.
From: Jim Meyering
Subject: bug#22237: sed no longer removes high-ascii characters as it did formerly.
Date: Sat, 26 Dec 2015 13:19:07 -0800
On Fri, Dec 25, 2015 at 4:21 AM, Brian Tew <address@hidden> wrote:
> Well, sometimes it do and sometimes it don't.
>
> Script started on Fri 25 Dec 2015 05:53:04 AM CS
> ~$ed sample
> 50
> l
> subject now that thanksgiving has come and gone\342\246$
> q
> ~$
> ~$sed -i 's/[^a-z 0-9]//g' sample
To remove every byte other than the ones listed in the bracket
expression, you probably want something like this instead:
LC_ALL=C sed -i 's/[^[:alnum:] ]//g' sample
Note I've done two things: used LC_ALL=C to override your default
locale (probably a UTF-8 one), and used [:alnum:] in place of the
nonportable a-z and 0-9 ranges. Unlike your a-z, [:alnum:] also keeps
uppercase letters.
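As a quick check of the command above, here is a minimal sketch that needs no file: it pipes a made-up line ending in the same two stray bytes through sed in the C locale. The sample text and the variable name `cleaned` are only illustrative, and the `g` flag is included so that every offending byte is removed, not just the first one on the line.

```shell
# Build a line like the one in the report: ASCII text followed by the
# two stray bytes \342\246 (octal), then filter it in the C locale,
# where every byte is treated as a single character.
cleaned=$(printf 'come and gone\342\246\n' | LC_ALL=C sed 's/[^[:alnum:] ]//g')

# Print the result; only ASCII letters, digits and spaces should remain.
printf '%s\n' "$cleaned"
```

Running the same pipeline without LC_ALL=C is a simple way to see whether your default locale is what makes the difference.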
In general, in UTF-8 locales, a byte sequence like your \342\246 will
match no regular expression, since it is not a valid UTF-8 character:
\342 begins a three-byte sequence, but only one continuation byte
follows.
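One way to confirm that those two bytes are not a complete UTF-8 character is to ask iconv to decode them. This is only a sketch and assumes an iconv program is installed (as with glibc or libiconv); the variable names `truncated` and `complete`, and the third byte \200 used for comparison, are illustrative.

```shell
# \342 (0xE2) announces a three-byte UTF-8 sequence, but only one
# continuation byte, \246 (0xA6), follows, so decoding should fail.
if printf '\342\246' | iconv -f UTF-8 -t UTF-8 >/dev/null 2>&1; then
    truncated=valid
else
    truncated=invalid
fi

# For comparison, appending a second continuation byte (\200) yields a
# complete three-byte sequence that iconv accepts.
if printf '\342\246\200' | iconv -f UTF-8 -t UTF-8 >/dev/null 2>&1; then
    complete=valid
else
    complete=invalid
fi

printf 'two bytes: %s, three bytes: %s\n' "$truncated" "$complete"
```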
What probably changed is that older versions of sed did not properly
handle multi-byte locales, or your other experience was using a
single-byte locale.
If you still think there is a problem with sed-4.2.2, please provide
more detail and I'll reopen this issue.