From: Jim Meyering
Subject: Re: grep -i in UTF-8: newline not printed after matching line if it contains I WITH DOT (U+0130)
Date: Tue, 14 Dec 2010 13:57:41 +0100

Ilya Basin wrote:
> $ grep -i . greptest.txt
> aIabIbcIcdId$
>
> This doesn't happen without -i or with LANG=C
>
>
> $ grep --version
> grep (GNU grep) 2.7
> $ echo $LANG
> en_US.UTF-8
>
> pcre 8.10
>
> Linux IL 2.6.36-ARCH #1 SMP PREEMPT Wed Nov 24 06:44:11 UTC 2010 i686
> Intel(R) Core(TM)2 Duo CPU E6550 @ 2.33GHz GenuineIntel GNU/Linux
Thanks for the report. That is indeed a bug.
It affects even the very latest in git.
Here's another variant of it:
[note how it fails to print the matched "."]
$ i='\xC4\xB0'; printf "$i$i$i.$i$i$i$i\n" \
| LC_ALL=en_US.UTF-8 ./grep -oi '.\.'|od -a -tx1
0000000 D 0 nl
c4 b0 0a
0000003
-----------------------------
Closer to your example, this shows that with -i grep searches
a different, down-cased string, in which each two-byte U+0130
(C4 B0 in UTF-8) has become a one-byte lower-case "i".
The end-of-line offset is calculated in that all-lower-case
string, yet that offset is not valid in the original, longer string,
so grep fails to print the entire line:
$ i='\xC4\xB0'; printf "$i$i$i$i$i$i$i\n" \
| LC_ALL=en_US.UTF-8 ./grep -i ....
İİİİ
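
To make the offset mismatch concrete, here is a rough Python sketch of the effect. It is not grep's code. It is only a model under these assumptions: -i down-cases the buffer before searching, U+0130 becomes a plain one-byte "i", and the byte offsets found in the down-cased buffer are then applied to the original bytes. The helper names (downcase, line_with_stale_eol, only_matching_with_stale_offsets) are made up for illustration.

```python
import re

I_WITH_DOT = "\u0130"  # LATIN CAPITAL LETTER I WITH DOT ABOVE, C4 B0 in UTF-8


def downcase(text: str) -> str:
    """Hypothetical stand-in for the down-casing done under -i.

    Assumption: U+0130 maps to plain "i", as described in this message.
    Python's own str.lower() maps it to "i" + U+0307, so it is not used
    for that character.
    """
    return "".join("i" if ch == I_WITH_DOT else ch.lower() for ch in text)


def line_with_stale_eol(line: str) -> bytes:
    """Model of the truncated line: EOL offset taken from the lowered buffer."""
    orig = (line + "\n").encode("utf-8")
    lowered = (downcase(line) + "\n").encode("utf-8")
    eol = lowered.index(b"\n") + 1  # valid in `lowered`, too small for `orig`
    return orig[:eol]


def only_matching_with_stale_offsets(pattern: bytes, line: str) -> bytes:
    """Model of the -o variant: match offsets taken from the lowered buffer."""
    orig = (line + "\n").encode("utf-8")
    lowered = (downcase(line) + "\n").encode("utf-8")
    m = re.search(pattern, lowered)
    if m is None:
        return b""
    return orig[m.start():m.end()]  # same offsets, wrong buffer


if __name__ == "__main__":
    # Seven U+0130 characters: 14 bytes originally, 7 bytes once lowered.
    print(line_with_stale_eol(I_WITH_DOT * 7))
    # Three U+0130, a literal ".", then four more, searched for '.\.'.
    print(only_matching_with_stale_offsets(rb".\.", I_WITH_DOT * 3 + "." + I_WITH_DOT * 4))
```

In this model the first call yields the bytes of only four U+0130 characters and no trailing newline, and the second yields C4 B0 rather than the matched "i." pair, which lines up with the two transcripts above.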
One of us should find time to fix it before too long.