From: Assaf Gordon
Subject: bracket expansions and "rational range" (was: bug#25048: --with-included-regex vs. e-acute...)
Date: Tue, 29 Nov 2016 23:32:10 -0500
Hello Eric, Jim, Arnold,
[changing mailing list to sed-devel@ from discussion in
https://debbugs.gnu.org/25048 ]
Regarding this:
> On Nov 28, 2016, at 11:53, Eric Blake <address@hidden> wrote:
>
> On 11/27/2016 10:57 PM, Jim Meyering wrote:
>> When grep is configured --with-included-regex, the following command
>> fails to print the expected match:
>>
>> printf '\351\n' |LC_ALL=fr_FR.iso88591 src/grep '[d-f]'
[...]
> We SHOULD be adjusting more and more GNU tools to honor rational range
> behavior, at least as an option, even if that means that e-acute can
> never be matched to [d-f].
I'm working on improving the sed manual,
and just copied some parts from the grep manual.
Specifically, the section on bracket expressions:
https://www.gnu.org/software/grep/manual/grep.html#Character-Classes-and-Bracket-Expressions
> In other locales, the sorting sequence is not specified, and ‘[a-d]’ might be
> equivalent to ‘[abcd]’ or to ‘[aBbCcDd]’, or it might fail to match any
> character, or the set of characters that it matches might even be erratic.
> To obtain the traditional interpretation of bracket expressions, you can use
> the ‘C’ locale by setting the LC_ALL environment variable to the value ‘C’.
Would you recommend rephrasing it, perhaps mentioning "Rational
Range Interpretation"?
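For reference, a minimal sketch of the "traditional interpretation" that the quoted text describes. The helper name and the sample input letters are my own choices for illustration, not something from the grep manual:

```shell
# Hypothetical helper: print the input lines that match [d-f] when the
# locale is forced to C, so the range is evaluated by byte value.
match_d_to_f() {
    LC_ALL=C sed -n '/[d-f]/p'
}

# 'd' and 'e' fall inside the range; 'E' and 'g' do not.
printf 'd\ne\nE\ng\n' | match_d_to_f
# prints:
# d
# e
```

In the C locale the range covers only the bytes from 'd' through 'f', so the uppercase 'E' is not matched.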
I should probably compile a list of the os/libc/locale/gnulib combinations under
which sed does not follow rational range interpretation. The addition of the DFA
engine (with fallback to the previous engine) makes things even more confusing
(for me, at least).
For example, I see the following on Debian (latest sed from git):
$ printf '\351\n' | LC_ALL=fr_FR.iso88591 sed -n '/[d-f]/p' | od -tx1
0000000 e9 0a
0000002
$ printf '\u00e9\n' | LC_ALL=en_US.utf8 sed -n '/[d-f]/p' | od -tx1
0000000 c3 a9 0a
0000003
While the same sed from git on Mac OS X does not match:
$ gprintf '\351\n' | LC_ALL=fr_FR.ISO8859-1 ./sed/sed -n '/[d-f]/p' | od -tx1
0000000
$ gprintf '\u00e9\n' | LC_ALL=fr_FR.utf8 ./sed/sed -n '/[d-f]/p' | od -tx1
0000000
IIUC, that's because on Debian it uses glibc's "re_search", while on Mac OS it
uses gnulib's "_rpl_re_search".
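To make such comparisons easier to repeat across systems, something along these lines could be used. This is only a sketch: the function name `probe_range` and its output wording are made up here, and the locale passed in is assumed to be installed (an unknown locale name silently falls back to C behavior on most systems):

```shell
# Hypothetical helper: report whether [d-f] matches e-acute (byte 0xE9)
# for a given single-byte locale and sed binary.
#   $1 = locale name (assumed to be installed, see `locale -a`)
#   $2 = sed binary to test (defaults to "sed")
probe_range() {
    # od -An prints nothing for empty input, so a non-empty result
    # means sed printed the line, i.e. the bracket expression matched.
    out=$(printf '\351\n' | LC_ALL="$1" "${2:-sed}" -n '/[d-f]/p' | od -An -tx1)
    if [ -n "$out" ]; then
        echo "$1: [d-f] matches e-acute"
    else
        echo "$1: [d-f] does not match e-acute"
    fi
}

# In the C locale the range is evaluated by byte value, so 0xE9 is outside it.
probe_range C
```

Running it with a locale such as fr_FR.iso88591 and with different sed binaries (e.g. `probe_range fr_FR.iso88591 ./sed/sed`) would give one line per combination to collect.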
Should we perhaps change it to always use gnulib's, and have "rational range",
at the cost of backward incompatibility?
comments welcomed,
thanks,
- assaf