bug-grep

From: Jim Meyering
Subject: Re: grep -i in UTF-8: newline not printed after matching line if it contains I WITH DOT (U+0130)
Date: Wed, 19 Jan 2011 22:16:28 +0100

Jim Meyering wrote:
> Ilya Basin wrote:
>> $ grep -i . greptest.txt
>> aIabIbcIcdId$
>>
>> This doesn't happen without -i or with LANG=C
>>
>>
>> $ grep --version
>> grep (GNU grep) 2.7
>> $ echo $LANG
>> en_US.UTF-8
>>
>> pcre 8.10
>>
>> Linux IL 2.6.36-ARCH #1 SMP PREEMPT Wed Nov 24 06:44:11 UTC 2010 i686
>> Intel(R) Core(TM)2 Duo CPU E6550 @ 2.33GHz GenuineIntel GNU/Linux
>
> Thanks for the report.  That is indeed a bug.
> It affects even the very latest in git.
>
> Here's another variant of it:
> [note how it fails to print the matched "."]
>
>     $ i='\xC4\xB0'; printf "$i$i$i.$i$i$i$i\n" \
>       | LC_ALL=en_US.UTF-8 ./grep -oi '.\.'|od -a -tx1
>     0000000   D   0  nl
>              c4  b0  0a
>     0000003
>
> -----------------------------
> More like your example, this shows how, with -i,
> grep is searching a different string (down-cased)
> and the width of the lower-case "i" is just one byte.
> The end-of-line offset is calculated using the all-lower-case
> string, yet that offset is not valid in the original, longer string,
> so grep fails to print the entire line:
>
>     i='\xC4\xB0'; printf "$i$i$i$i$i$i$i\n" |LC_ALL=en_US.UTF-8 ./grep -i ....
>     İİİİ
>
> One of us should find time to fix it before too long.
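
The byte arithmetic behind both truncated outputs quoted above can be checked without grep. The following is only a sketch of the explanation in the quoted message, not grep's code: it assumes the down-cased copy of each U+0130 (two bytes in UTF-8) is a one-byte "i", and it uses octal escapes instead of `\x` because not every printf accepts `\x`. The variable names are made up for illustration.

```shell
#!/bin/sh
# Illustration only: simulates the offset mismatch; it does not run grep.
i='\304\260'    # U+0130 in UTF-8 (bytes c4 b0), written as octal escapes

# Whole-line case: 7 x U+0130 plus newline, versus the assumed down-cased copy.
orig_len=$(( $(printf "$i$i$i$i$i$i$i\n" | wc -c) ))   # 7*2 + 1 bytes
lower_len=$(( $(printf 'iiiiiii\n' | wc -c) ))         # 7*1 + 1 bytes

# Apply the end-of-line offset taken from the shorter buffer to the original:
# only the first $lower_len bytes of the original line would be printed.
cut_hex=$(printf "$i$i$i$i$i$i$i\n" \
  | dd bs=1 count="$lower_len" 2>/dev/null | od -An -tx1 | tr -d ' \n')

# -o case: in the assumed down-cased line "iii.iiii", '.\.' first matches
# "i." at byte offset 2, length 2.  Apply that offset to the original bytes.
match_hex=$(printf "$i$i$i.$i$i$i$i\n" \
  | dd bs=1 skip=2 count=2 2>/dev/null | od -An -tx1 | tr -d ' \n')

echo "original line:   $orig_len bytes"
echo "down-cased line: $lower_len bytes"
echo "bytes printed:   $cut_hex"
echo "-o match bytes:  $match_hex"
```

Under these assumptions the first eight bytes of the original line are four complete U+0130 characters with no newline, and the two bytes at offset 2 are the second U+0130 rather than the "i." that actually matched, which is consistent with both outputs shown in the quoted message.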

First step is (at least this time) to write the test.
I've just pushed this:

From 955695aea8fac194db07009a8673af3aaa6e0f8c Mon Sep 17 00:00:00 2001
From: Jim Meyering <address@hidden>
Date: Wed, 19 Jan 2011 22:12:09 +0100
Subject: [PATCH 1/2] maint: sort test names in Makefile.am

* tests/Makefile.am (TESTS): Sort test names.
---
 tests/Makefile.am |    8 ++++----
 1 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/tests/Makefile.am b/tests/Makefile.am
index ac0e3c1..0d78d26 100644
--- a/tests/Makefile.am
+++ b/tests/Makefile.am
@@ -35,9 +35,9 @@ endif

 TESTS =                                                \
   backref                                      \
+  backref-multibyte-slow                       \
   backref-word                                 \
   bre                                          \
-  backref-multibyte-slow                       \
   case-fold-backref                            \
   case-fold-backslash-w                                \
   case-fold-char-class                         \
@@ -46,8 +46,8 @@ TESTS =                                               \
   char-class-multibyte                         \
   dfaexec-multibyte                            \
   empty                                                \
-  ere                                          \
   equiv-classes                                 \
+  ere                                          \
   euc-mb                                       \
   fedora                                       \
   fgrep-infloop                                        \
@@ -65,15 +65,15 @@ TESTS =                                             \
   options                                      \
   pcre                                         \
   pcre-z                                       \
+  prefix-of-multibyte                          \
   reversed-range-endpoints                     \
   sjis-mb                                      \
   spencer1                                     \
   spencer1-locale                              \
   status                                       \
-  prefix-of-multibyte                          \
   warn-char-classes                            \
-  word-multi-file                              \
   word-delim-multibyte                         \
+  word-multi-file                              \
   yesno

 EXTRA_DIST =                                   \
--
1.7.3.5


From ebfc46553d56ec3ab3feade82e53fac0863fd102 Mon Sep 17 00:00:00 2001
From: Jim Meyering <address@hidden>
Date: Wed, 19 Jan 2011 22:12:43 +0100
Subject: [PATCH 2/2] tests: add a known-to-fail test

* tests/turkish-I: New test.
* tests/Makefile.am (TESTS): Add it.
(XFAIL_TESTS): Add here, too.
Reported by Ilya Basin.
---
 THANKS            |    1 +
 tests/Makefile.am |    2 ++
 tests/turkish-I   |   32 ++++++++++++++++++++++++++++++++
 3 files changed, 35 insertions(+), 0 deletions(-)
 create mode 100755 tests/turkish-I

diff --git a/THANKS b/THANKS
index 8c3d0d9..116b9c4 100644
--- a/THANKS
+++ b/THANKS
@@ -37,6 +37,7 @@ H. Merijn Brand            <address@hidden>
 Harald Hanche-Olsen        <address@hidden>
 Hans-Bernhard Broeker      <address@hidden>
 Heikki Korpela             <address@hidden>
+Ilya Basin                 <address@hidden>
 Isamu Hasegawa             <address@hidden>
 Jaroslav Škarvada          <address@hidden>
 Jeff Bailey                <address@hidden>
diff --git a/tests/Makefile.am b/tests/Makefile.am
index 0d78d26..7233c01 100644
--- a/tests/Makefile.am
+++ b/tests/Makefile.am
@@ -32,6 +32,7 @@ XFAIL_TESTS = \
 if USE_INCLUDED_REGEX
 XFAIL_TESTS += equiv-classes
 endif
+XFAIL_TESTS += turkish-I

 TESTS =                                                \
   backref                                      \
@@ -71,6 +72,7 @@ TESTS =                                               \
   spencer1                                     \
   spencer1-locale                              \
   status                                       \
+  turkish-I                                    \
   warn-char-classes                            \
   word-delim-multibyte                         \
   word-multi-file                              \
diff --git a/tests/turkish-I b/tests/turkish-I
new file mode 100755
index 0000000..ac536c4
--- /dev/null
+++ b/tests/turkish-I
@@ -0,0 +1,32 @@
+#!/bin/sh
+# grep -i in UTF-8: missing NL in output on line containing I WITH DOT (U+0130)
+
+# Copyright (C) 2011 Free Software Foundation, Inc.
+
+# This program is free software: you can redistribute it and/or modify
+# it under the terms of the GNU General Public License as published by
+# the Free Software Foundation, either version 3 of the License, or
+# (at your option) any later version.
+
+# This program is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+# GNU General Public License for more details.
+
+# You should have received a copy of the GNU General Public License
+# along with this program.  If not, see <http://www.gnu.org/licenses/>.
+
+. "${srcdir=.}/init.sh"; path_prepend_ ../src
+
+require_en_utf8_locale_
+
+fail=0
+
+i='\xC4\xB0'
+printf "$i$i$i$i$i$i$i\n" > in || framework_failure_
+
+LC_ALL=en_US.UTF-8 grep -i .... in > out || fail=1
+
+compare out in || fail=1
+
+Exit $fail
--
1.7.3.5


