bug#16232: [PATCH] grep: make --ignore-case (-i) faster (sometimes 10x)
From: Eric Blake
Subject: bug#16232: [PATCH] grep: make --ignore-case (-i) faster (sometimes 10x) in multibyte locales
Date: Mon, 23 Dec 2013 16:30:19 -0700
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:24.0) Gecko/20100101 Thunderbird/24.2.0
On 12/23/2013 04:12 PM, Jim Meyering wrote:
> Did you miss the "isascii" check in the new trivial_case_convert function?
No. But even with that check in place:
> If you can describe circumstances in which the new patch malfunctions,
> please do,
> but everything you wrote seems to rely on a false assumption.
No, it's a quite real complaint - your patch is broken for tr_TR.
> E.g., your turkish-I example works fine with my patch.
isascii('i') is true, but converting 'i' to '[iI]' is incorrect in the
tr_TR locale. Rather, the conversion must be to '[iİ]'; similarly, 'I'
must be converted to '[Iı]'. Neither of those conversions fits in 4
bytes (since dotted capital I and dotless lowercase i are both multibyte
characters).
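The size claim can be checked by counting UTF-8 bytes. The following is only a sketch in Python, not code from grep: the 4-byte budget is taken from the paragraph above, and the names LIMIT, candidates, utf8_len and sizes are made up for illustration.

```python
# Count the UTF-8 size of each bracket expression discussed above.
# LIMIT is the 4-byte budget mentioned in the message; it is not read
# from grep's source.
LIMIT = 4

candidates = {
    "[iI]": "ASCII-only expansion",
    "[i\u0130]": "tr_TR expansion of 'i' (U+0130, dotted capital I)",
    "[I\u0131]": "tr_TR expansion of 'I' (U+0131, dotless lowercase i)",
}


def utf8_len(s: str) -> int:
    """Return the number of bytes s occupies when encoded as UTF-8."""
    return len(s.encode("utf-8"))


sizes = {expr: utf8_len(expr) for expr in candidates}

for expr, note in candidates.items():
    n = sizes[expr]
    verdict = "fits" if n <= LIMIT else "does not fit"
    print(f"{expr}: {n} bytes, {verdict} in {LIMIT} bytes ({note})")
```

Both Turkish expansions need one byte more than the ASCII-only one, because U+0130 and U+0131 each take two bytes in UTF-8.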
Need help easily finding those characters on a non-Turkish keyboard? I
used:
$ echo iI | LC_ALL=tr_TR.UTF-8 sed 's/\(.\)\(.\)/\U\1\L\2/'
At any rate, prior to your patch, lower dotless i in the buffer gives an
insensitive match to upper dotless I in the pattern:
$ echo ı | LC_ALL=tr_TR.UTF-8 grep -i I || echo no match
ı
After your patch:
$ echo ı | LC_ALL=tr_TR.UTF-8 src/grep -i I || echo no match
no match
Oops, you failed to match lower dotless i insensitively against upper
dotless I: upper dotless I is ASCII, so it passes the isascii check, but
it was then converted into the wrong pattern.
--
Eric Blake eblake redhat com +1-919-301-3266
Libvirt virtualization library http://libvirt.org