Re: character ranges in regular expressions
From: Paolo Bonzini
Subject: Re: character ranges in regular expressions
Date: Fri, 24 Sep 2010 14:31:20 +0200
User-agent: Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.9.2.9) Gecko/20100907 Fedora/3.1.3-1.fc13 Lightning/1.0b3pre Mnenhy/0.8.3 Thunderbird/3.1.3
On 09/24/2010 01:27 PM, Bruno Haible wrote:
> But what is the "correct" result in the first place?
> On my glibc-2.8 system I have a number of locales installed, and grep from
> versions 2.4.2 to 2.7.
> When I create a file that has every printable ASCII character, one per line,
> and do a "grep '[A-Z]'" of this file, which ASCII characters should I get?
> 26 A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
> or 51 AaBbCcDdEeFfGgHhIiJjKkLlMmNnOoPpQqRrSsTtUuVvWwXxYyZ ?
Either, but I think consistently across all locales.
> Find attached the input files and the results of the command
> for l in `locale -a`; do
> echo -n "$l "; LC_ALL=$l grep '[A-Z]' ascii1 | wc -l;
> done | expand -t 20
> - In grep 2.5.3 the result was 51 for most UTF-8 locales
> but 26 for most unibyte locales.
This happened because the unibyte code was using regexec, and the
multibyte code was using strcoll. It's what I meant in the NEWS file by
"bug present since multi-byte character set support was introduced in
2.5.2, though the steps needed to reproduce it changed in grep-2.6".
> - In grep 2.4.2 the result was 51 for nearly all locales.
> - In grep 2.6.3 the result was again like in 2.4.2.
And they were inconsistent with sed, which is also a bug as you
correctly guessed.
> - In grep 2.7 the result is mixed, I cannot see a pattern.
> For en_US the result is 51, for en_US.utf8 it's 26 -
> this definitely is a bug, since the locale definition for
> en_US and en_US.utf8 is the same.
This particular bug has the same root cause as the 2.5.3 behavior (en_US is
unibyte, en_US.UTF-8 is multibyte). It happens in reverse because in 2.7
the unibyte code was using strcoll, and the multibyte code was punting
to glibc (and so using regexec). After my latest patches, it's finally
consistent.
This shows how, up to and including 2.7, results were confused because of
the layering of dfa and regex.
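To make the strcoll side of this concrete, here is a minimal Python sketch of a range test built on pairwise strcoll comparisons. It is only an illustration, not grep's actual code: the helper names `in_range_strcoll` and `count_matches` are made up for this example, and only the "C" locale is exercised because its ordering is known everywhere.

```python
import locale

def in_range_strcoll(ch, lo, hi):
    """True if lo <= ch <= hi according to the current LC_COLLATE setting."""
    return locale.strcoll(lo, ch) <= 0 and locale.strcoll(ch, hi) <= 0

def count_matches(lo, hi, locale_name="C"):
    """Count printable ASCII characters inside [lo-hi] under locale_name."""
    saved = locale.setlocale(locale.LC_COLLATE)  # query the current setting
    try:
        locale.setlocale(locale.LC_COLLATE, locale_name)
        printable = [chr(c) for c in range(0x20, 0x7F)]
        return sum(in_range_strcoll(c, lo, hi) for c in printable)
    finally:
        locale.setlocale(locale.LC_COLLATE, saved)  # restore the old setting

if __name__ == "__main__":
    # In the "C" locale strcoll follows code point order, so only A..Z match.
    print(count_matches("A", "Z", "C"))
```

Passing another installed locale name to `count_matches` shows what a strcoll-based test yields there; the answer depends on the system's locale data, which is exactly why it can disagree with a matcher that takes a different route.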
I retried your approach with sed, and here are the "51" locales:
ar_SA cs_CZ hr_HR hsb_DE is_IS km_KH lo_LA lt_LT lv_LV or_IN pl_PL sk_SK
sl_SI th_TH tr_CY tr_TR
These return 51 for both $l and $l.utf8. Every other locale returns 26
for both unibyte and multibyte variants.
Locales using glibc's localedata/locales/iso14651_t1_common template
return 26. This template defines the collation like this:
<U0061> <a>;<BAS>;<MIN>;IGNORE # 198 a start lowercase
<U00AA> <a>;<PCL>;<EMI>;IGNORE # 199 ª
<U00E1> <a>;<ACA>;<MIN>;IGNORE # 200 á
...
<U007A> <z>;<BAS>;<MIN>;IGNORE # 507 z
...
<U00FE> <th>;<BAS>;<MIN>;IGNORE # 516 þ end lowercase
<U0041> <a>;<BAS>;<CAP>;IGNORE # 517 A start uppercase
<U00C1> <a>;<ACA>;<CAP>;IGNORE # 518 Á
...
<U005A> <z>;<BAS>;<CAP>;IGNORE # 813 Z
...
<U00DE> <th>;<BAS>;<CAP>;IGNORE # 824 Þ end uppercase
(There's no end to surprises: [a-z] comes _before_ [A-Z], which is why
[A-z] fails but [a-Z] works).
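A small Python sketch of why the endpoints behave that way, using a simplified stand-in for the template's order (all lowercase letters, then all uppercase; the accented letters of the real table are left out). `SEQUENCE` and `expand_range` are names invented for this illustration.

```python
import string

# Simplified stand-in for the template: lowercase block first, then uppercase.
# The real table also contains accented letters, omitted here.
SEQUENCE = string.ascii_lowercase + string.ascii_uppercase

def expand_range(lo, hi, sequence=SEQUENCE):
    """Return the characters from lo to hi in sequence order.

    Raises ValueError when lo comes after hi, i.e. the range is reversed.
    """
    start, end = sequence.index(lo), sequence.index(hi)
    if start > end:
        raise ValueError(f"invalid range [{lo}-{hi}]: start collates after end")
    return sequence[start:end + 1]

if __name__ == "__main__":
    print(len(expand_range("a", "Z")))  # spans both blocks
    print(len(expand_range("A", "Z")))  # uppercase block only
    try:
        expand_range("A", "z")
    except ValueError as err:
        print(err)
```

With this order [a-Z] covers all 52 letters, [A-Z] covers only the uppercase block, and [A-z] is rejected because its start point comes after its end point.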
Instead, the "special" locales above use a different sequence, for example
in cs_CZ:
<U0041> <U0041>;<NONE>;<CAPITAL>;<U0041> # A
<U0061> <U0041>;<NONE>;<SMALL>;<U0041> # a
<U00AA> <U0041>;<NONE>;<U00AA>;<U0041> # ª
<U00C1> <U0041>;<ACUTE>;<CAPITAL>;<U0041> # Á
<U00E1> <U0041>;<ACUTE>;<SMALL>;<U0041> # á
...
<U005A> <U005A>;<NONE>;<CAPITAL>;<U005A> # Z
<U007A> <U005A>;<NONE>;<SMALL>;<U005A> # z
So, it looks like __collseq_table_lookup is what the POSIX rationale
document calls "CEO".
> - An additional bug is that in the vi_VN.tcvn locale,
> grep 2.7 gives an error 'unbalanced ['.
This is a glibc bug, as I can reproduce it with sed too.
> What is the correct result for 'grep' and for regex? (I assume it's the
> same for both, since both are specified by POSIX.)
Unfortunately POSIX only (implicitly) specifies that the two have to be
consistent, but the exact result is unspecified. The sensible results
are of course three: 51 omitting "a" (aAbB...zZ), 51 omitting "z"
(AaBb...Zz), 26.
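The three outcomes can be reproduced with a toy model in Python, where a range is simply a slice of an explicit collation sequence. The sequences below are hand-built for illustration and restricted to the ASCII letters; `matched` is a hypothetical helper, not anything from grep or glibc.

```python
import string

def matched(sequence, lo="A", hi="Z"):
    """Return the slice of sequence from lo to hi inclusive."""
    start, end = sequence.index(lo), sequence.index(hi)
    return sequence[start:end + 1]

upper, lower = string.ascii_uppercase, string.ascii_lowercase

# Three candidate orders over the ASCII letters only.
codepoint_order = upper + lower                                  # A..Z a..z
lower_first = "".join(l + u for u, l in zip(upper, lower))       # aAbB...zZ
upper_first = "".join(u + l for u, l in zip(upper, lower))       # AaBb...Zz

if __name__ == "__main__":
    for name, seq in [("codepoint", codepoint_order),
                      ("aAbB...zZ", lower_first),
                      ("AaBb...Zz", upper_first)]:
        hits = matched(seq)
        missing = sorted(set(upper + lower) - set(hits))
        print(name, len(hits), "missing:", "".join(missing))
```

In the aAbB...zZ order the only letter outside [A-Z] is "a", in the AaBb...Zz order it is "z", and in code point order all 26 lowercase letters fall outside.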
Paolo