From: Paul Eggert
Subject: bug#16919: [PATCH] fix mismatch between dfa and regex for treatment of titlecase
Date: Mon, 03 Mar 2014 23:01:35 -0800
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:24.0) Gecko/20100101 Thunderbird/24.3.0
Norihiro Tanaka wrote:
> Final sigma is neither uppercase nor lowercase of
> sigma, as you said at first. In addition, it isn't even titlecase
> of sigma.
That's what I thought, too. Certainly an implementation should be free
to say that final sigma and sigma do not match when ignoring case. That is
how Solaris 11 /usr/xpg4/bin/grep -i behaves, for example. But it's not
clear that POSIX requires such an implementation.
> Even if that behavior of regex is right, it won't work correctly when a
> single lowercase character corresponds to two uppercase characters.
> (Though I don't know of such a character.)
Here's an example: 'İ' (U+0130 LATIN CAPITAL LETTER I WITH DOT ABOVE)
along with ASCII 'i' and 'I'. On GNU and Solaris platforms in the
en_US.UTF-8 locale, towlower (L'İ') == L'i', so 'i' has two uppercase
counterparts. Solaris 11 /usr/xpg4/bin/grep -i says that 'i' and 'I'
match each other but do not match 'İ', and reports a syntax error when
attempting to match 'İ' to itself:
$ echo 'İ' | /usr/xpg4/bin/grep -i 'İ'
grep: input file "(standard input)": line 1: syntax error
so Solaris 'grep' is clearly busted here. The GNU regex code says that
'i' and 'I' match only each other, and that 'İ' matches only itself.
But the dfa code in the current trunk behaves differently: it says that
the patterns 'i' and 'I' match each other, whereas the pattern 'İ'
matches itself and 'i'. So the current grep master (5aa1413) behaves
inconsistently:
$ printf 'İ\ni\nI\n' | grep -i 'İ' # Use the DFA code.
İ
i
$ printf 'İ\ni\nI\n' | grep -i '\(\)\1İ' # Use the regex code.
İ
Here the DFA behavior is more plausible; but the DFA code is supposed to
accelerate the regex code without changing behavior, so there is a problem.
After writing the above, I discovered a web page that talks about this
issue, written about 10 years ago and committed by Charles Levert:
http://www.gnu.org/software/grep/devel.html
Look for the string "POSIX and --ignore-case". The regex code takes the
"uc(input_wchar) == uc(pattern_wchar)" approach, whereas the DFA code
takes an approach not mentioned in that web page, namely "input_wchar ==
pattern_wchar || input_wchar == lc(pattern_wchar) || input_wchar ==
uc(pattern_wchar)".
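To make the difference concrete, here is a minimal Python sketch of the two rules. It is not the grep source. The tables LOWER and UPPER and the helpers lc, uc, regex_style_match, dfa_style_match and matching_inputs are hypothetical names introduced for illustration, and the tables simply hard-code the towlower/towupper results described above for en_US.UTF-8 instead of consulting any locale (Python's own str.lower() maps 'İ' to two code points, so it is not used here).

```python
# Hypothetical single-character case tables. They hard-code the
# towlower/towupper results described in this message for en_US.UTF-8
# and are an assumption of this sketch, not data read from a locale.
LOWER = {"I": "i", "\u0130": "i"}  # towlower: 'I' -> 'i', U+0130 -> 'i'
UPPER = {"i": "I"}                 # towupper: 'i' -> 'I'; U+0130 maps to itself


def lc(ch):
    """Lowercase one character using the assumed table."""
    return LOWER.get(ch, ch)


def uc(ch):
    """Uppercase one character using the assumed table."""
    return UPPER.get(ch, ch)


def regex_style_match(input_ch, pattern_ch):
    """The uc(input_wchar) == uc(pattern_wchar) rule."""
    return uc(input_ch) == uc(pattern_ch)


def dfa_style_match(input_ch, pattern_ch):
    """The rule that accepts the pattern character, its lc, or its uc."""
    return input_ch in (pattern_ch, lc(pattern_ch), uc(pattern_ch))


def matching_inputs(match, pattern_ch, inputs=("\u0130", "i", "I")):
    """Return the inputs accepted by `match` for `pattern_ch`, in order."""
    return [ch for ch in inputs if match(ch, pattern_ch)]


if __name__ == "__main__":
    for name, match in (("dfa", dfa_style_match), ("regex", regex_style_match)):
        for pattern in ("\u0130", "i", "I"):
            print(name, repr(pattern), matching_inputs(match, pattern))
```

Under these assumed tables, the DFA-style rule with pattern 'İ' accepts 'İ' and 'i', while the regex-style rule accepts only 'İ', which agrees with the two grep transcripts above. For the patterns 'i' and 'I' the two rules agree.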
I'm inclined to fix the DFA code so that it behaves like the regex code,
even if the regex code isn't that nice. Arguably the regex code does
conform to POSIX, and its behavior is longstanding. I doubt whether people
care all that much about the details so long as the two are consistent.