bug-sed

bug#22237: sed no longer removes high-ascii characters as it did formerly.


From: Jim Meyering
Subject: bug#22237: sed no longer removes high-ascii characters as it did formerly.
Date: Sat, 26 Dec 2015 13:19:07 -0800

On Fri, Dec 25, 2015 at 4:21 AM, Brian Tew <address@hidden> wrote:
> Well, sometimes it do and sometimes it don't.
>
> Script started on Fri 25 Dec 2015 05:53:04 AM CS
> ~$ed sample
> 50
> l
> subject now that thanksgiving has come and gone\342\246$
> q
> ~$
> ~$sed -i 's/[^a-z 0-9]//g' sample

To remove every byte except the ones listed in the bracket expression,
you probably want something like this instead:

  LC_ALL=C sed -i 's/[^[:alnum:] ]//g' sample

Note that I've done two things: used LC_ALL=C to override your default
locale (probably a UTF-8 one), and used [:alnum:] in place of the
nonportable a-z and 0-9 ranges. Unlike a-z, [:alnum:] also keeps
uppercase letters.
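
As a quick check of the command above, here is a minimal sketch that
recreates the reported line with printf and filters it through a pipe
instead of editing a file in place. The variable name `cleaned` is only
for illustration.

```shell
# Recreate the reported line, ending in the two stray bytes (octal 342 246),
# and filter it in the C locale, where every byte is a single character.
cleaned=$(printf 'subject now that thanksgiving has come and gone\342\246\n' |
  LC_ALL=C sed 's/[^[:alnum:] ]//g')

# Only ASCII letters, digits and spaces should be left.
printf '%s\n' "$cleaned"
```

With the locale forced to C, the bracket expression is applied byte by
byte, so both high bytes are deleted.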

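In general, in a UTF-8 locale, a byte sequence like your \342\246 is
matched by no regular expression, not even a negated bracket expression
like yours, since it is not a valid UTF-8 character: \342 starts a
three-byte sequence, but only one continuation byte (\246) follows
before the newline.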
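
One rough way to confirm that those two bytes are not well-formed UTF-8,
assuming an iconv utility is installed, is to ask iconv to convert the
input from UTF-8 to UTF-8 and look at its exit status. The helper name
`is_valid_utf8` is made up for this sketch, and the third byte \201 is
added only to show what a complete sequence looks like.

```shell
# Succeeds only if stdin is well-formed UTF-8; iconv fails on bad input.
is_valid_utf8() {
  iconv -f UTF-8 -t UTF-8 >/dev/null 2>&1
}

# The two bytes from the report: a three-byte lead byte with one continuation.
if printf 'gone\342\246\n' | is_valid_utf8; then
  truncated=valid
else
  truncated=invalid
fi

# The same bytes followed by a second continuation byte form one character.
if printf 'gone\342\246\201\n' | is_valid_utf8; then
  complete=valid
else
  complete=invalid
fi

printf 'truncated: %s, complete: %s\n' "$truncated" "$complete"
```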

What probably changed is that older versions of sed did not properly
handle multi-byte locales, or that your earlier runs used a
single-byte locale.

If you still think there is a problem with sed-4.2.2, please provide
more detail and I'll reopen this issue.




