Re: The bug in the Grep Command
From: Jim Meyering
Subject: Re: The bug in the Grep Command
Date: Mon, 06 Aug 2012 13:34:37 +0200
address@hidden wrote:
> The grep command in the current version of Debian (6), when given the
> two options -i and -n together, does not work correctly when matching
> empty lines.
>
> For example:
>
> grep -in "^$" < /etc/inittab
>
> or
> grep -in "^$" < /etc/X11/xorg.conf
Wow! Thank you for the report.
At first I didn't see a problem with the latest (2.13):
$ printf 'a\n\nb\n' |grep -in '^$'
2:
or with debian unstable's grep-2.12.
Then I remembered that I usually use the C locale...
which means I'm running the equivalent of this:
$ printf 'a\n\nb\n' | LC_ALL=C grep -in '^$'
If I run it in a UTF-8 locale, which is probably what you have, like this:
$ printf 'a\n\nb\n' | LC_ALL=en_US.utf8 grep -in '^$'
I am flabbergasted to see this erroneous output:
2:3:
Worse still, the following command exits with status 0 (i.e., reports a match) even though there is no match(!):
$ seq 2|LC_ALL=en_US.utf8 grep -i '^$'; echo $?
0
$
With -n, and a larger input, the problem is a little clearer:
$ seq 9|LC_ALL=en_US.utf8 grep -in '^$'
2:4:6:8:10:12:14:16:$
I've just fixed the bug (patch below).
Now it does this:
$ seq 9|LC_ALL=en_US.utf8 src/grep -i '^$'; echo $?
1
$
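To check whether an installed grep is affected, something along these lines should do. This is only a sketch: the function name false_match_check is made up for illustration, and it assumes the en_US.utf8 locale is installed. If that locale is missing, grep will typically run with C-locale behavior and the check cannot expose the bug.

```shell
#!/bin/sh
# Hypothetical helper (not part of grep or its test suite): report whether
# grep -i '^$' falsely matches input that has no empty line.
# Usage: false_match_check LOCALE
false_match_check() {
  # The input has two non-empty lines, so a correct grep must exit 1.
  printf 'a\nb\n' | LC_ALL="$1" grep -i '^$' > /dev/null
  status=$?
  case $status in
    0) echo "$1: BUG (exit 0, false match)" ;;
    1) echo "$1: ok (exit 1, no match)" ;;
    *) echo "$1: grep failed (exit $status)" ;;
  esac
}

false_match_check C
false_match_check en_US.utf8
```

The C locale line serves as a baseline, since the bug only shows up in a multi-byte locale.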
=========================================================
Here's a preliminary patch.
The test name, "ni", is going to change,
especially now that I know the bug is independent
of the use of "-n". Also, I will add something
like the examples above.
From b7850c794ee0174774567f55d3d7ef61cd9d1445 Mon Sep 17 00:00:00 2001
From: Jim Meyering <address@hidden>
Date: Sun, 5 Aug 2012 23:22:28 +0200
Subject: [PATCH 1/2] tests: test for bug with -n and -i in a multi-byte
locale
* tests/ni: New file.
* tests/Makefile.am (TESTS): Add it.
Reported by address@hidden
---
tests/Makefile.am | 1 +
tests/ni | 23 +++++++++++++++++++++++
2 files changed, 24 insertions(+)
create mode 100755 tests/ni
diff --git a/tests/Makefile.am b/tests/Makefile.am
index 7d95862..cbd69ee 100644
--- a/tests/Makefile.am
+++ b/tests/Makefile.am
@@ -69,6 +69,7 @@ TESTS = \
inconsistent-range \
khadafy \
max-count-vs-context \
+ ni \
unibyte-bracket-expr \
high-bit-range \
options \
diff --git a/tests/ni b/tests/ni
new file mode 100755
index 0000000..0e78655
--- /dev/null
+++ b/tests/ni
@@ -0,0 +1,23 @@
+#! /bin/sh
+# Test using -n with -i in a multibyte locale.
+#
+# Copyright (C) 2012 Free Software Foundation, Inc.
+#
+# Copying and distribution of this file, with or without modification,
+# are permitted in any medium without royalty provided the copyright
+# notice and this notice are preserved.
+
+. "${srcdir=.}/init.sh"; path_prepend_ ../src
+
+require_en_utf8_locale_
+
+LC_ALL=en_US.UTF-8
+export LC_ALL
+
+printf 'a\n\nb\n' > in || framework_failure_
+printf '2:\n' > exp || framework_failure_
+
+grep -n -i '^$' in > out || fail=1
+compare exp out || fail=1
+
+Exit $fail
--
1.7.12.rc1.22.gbfbf4d4
From cbc79980d39e4db04974b1182d2670fea8b10016 Mon Sep 17 00:00:00 2001
From: Jim Meyering <address@hidden>
Date: Mon, 6 Aug 2012 13:29:51 +0200
Subject: [PATCH 2/2] grep -i '^$' in a multi-byte locale could report a false
match
* src/dfasearch.c (EGexecute): Do not match the sentinel "newline"
that is appended to each buffer.
* NEWS (Bug fixes): Mention it.
tests: test for bug with -i in a multi-byte locale
---
NEWS | 7 +++++++
src/dfasearch.c | 4 +++-
2 files changed, 10 insertions(+), 1 deletion(-)
diff --git a/NEWS b/NEWS
index fdba25e..72a90e7 100644
--- a/NEWS
+++ b/NEWS
@@ -4,6 +4,13 @@ GNU grep NEWS -*- outline -*-
** Bug fixes
+ grep -i '^$' could exit 0 (i.e., report a match) in a multi-byte locale,
+ even though there was no match, and the command generated no output.
+ E.g., printf 'a\nb\n'|LC_ALL=en_US.utf8 grep -il '^$' would mistakenly
+ print "(standard input)". Related, seq 9|LC_ALL=en_US.utf8 grep -in '^$'
+ would print "2:4:6:8:10:12:14:16" with no trailing newline.
+ [bug introduced in grep-2.6]
+
'grep' no longer falsely reports text files as being binary on file
systems that compress contents or that store tiny contents in metadata.
diff --git a/src/dfasearch.c b/src/dfasearch.c
index 1121176..29c096a 100644
--- a/src/dfasearch.c
+++ b/src/dfasearch.c
@@ -277,7 +277,9 @@ EGexecute (char const *buf, size_t size, size_t *match_size,
/* No good fixed strings; start with DFA. */
char const *next_beg = dfaexec (dfa, beg, (char *) buflim,
0, NULL, &backref);
- if (next_beg == NULL)
+ /* If there's no match, or if we've matched the sentinel,
+ we're done. */
+ if (next_beg == NULL || next_beg == buflim)
break;
/* Narrow down to the line we've found. */
beg = next_beg;
--
1.7.12.rc1.22.gbfbf4d4