[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
bug#62267: grep-3.9 bug: \d matches multibyte digits
From: |
Jim Meyering |
Subject: |
bug#62267: grep-3.9 bug: \d matches multibyte digits |
Date: |
Sat, 18 Mar 2023 17:06:37 -0700 |
I was not happy to discover that with grep-3.9 and -P,
\d can match multibyte digits like the Arabic ones:
$ LC_ALL=en_US.UTF-8 grep -Po '\d+' <<< '٠١٢٣٤٥٦٧٨٩'
٠١٢٣٤٥٦٧٨٩
grep -P has never before done that.
Of course, in the C/POSIX locale, there is no such match:
$ LC_ALL=C grep -Po '\d+' <<< '٠١٢٣٤٥٦٧٨٩'
[1]
TL;DR, with the attached fix, grep preprocesses each affected regexp,
changing each eligible "\d" to "[0-9]". Consider this a short-term fix.
Longer term (subject to pcre2 releases), we may instead simply add a
"(?aD)" prefix. If you really want to match non-ASCII digits, use \p{Nd}.
For background, see the PCRE2 documentation:
https://www.pcre.org/current/doc/html/pcre2pattern.html
https://www.pcre.org/current/doc/html/pcre2syntax.html
which say this:
By default, \d, \s, and \w match only ASCII characters, even in UTF-8
mode or in the 16-bit and 32-bit libraries. However, if locale-specific
matching is happening, \s and \w may also match characters with code
points in the range 128-255. If the PCRE2_UCP option is set, the behaviour
of these escape sequences is changed to use Unicode properties and they
match many more characters.
Per upstream pcre2-10.40-112-g6277357, (?aD) does what we want:
PCRE2_EXTRA_ASCII_BSD: This option forces \d to match only ASCII digits,
even when PCRE2_UCP is set. It can be changed within a pattern by
means of the (?aD) option setting.
I used pcre2grep (built from master) to demonstrate how we may eventually use
"(?aD)" under the covers:
$ LC_ALL=en_US.UTF-8 ./pcre2grep --color -u '(?aD)\d' <<< '٠١٢٣٤٥٦٧٨٩'
[Exit 1]
$ LC_ALL=en_US.UTF-8 ./pcre2grep --color -u '(?aD)^\d+$' <<< '٠١٢٣٤٥٦٧٨٩'
٠١٢٣٤٥٦٧٨٩
For the record, https://github.com/PCRE2Project/pcre2 currently declares
10.42 to be the latest, while there's a commit suggesting it's 10.43.
The difference is important: the 10.43 has support for (?aD), while
10.42 does not.
Incidentally, you can demonstrate this in python3, too:
$ LC_ALL=en_US.UTF-8 python3 \
-c "import re; print(re.match(r'\d+', '٠١٢٣٤٥٦٧٨٩'))"
<re.Match object; span=(0, 10), match='٠١٢٣٤٥٦٧٨٩'>
Use flags=re.ASCII to get the often-desired behavior:
$ LC_ALL=en_US.UTF-8 python3 \
-c "import re; print(re.match(r'\d+', '٠١٢٣٤٥٦٧٨٩', flags=re.ASCII))"
None
This is cause for a new snapshot today and soon thereafter,
the release of grep-3.10.
>From 0daefc8c5659e79149a650d97ca12b49ad5e6548 Mon Sep 17 00:00:00 2001
From: Jim Meyering <meyering@fb.com>
Date: Sat, 18 Mar 2023 08:28:36 -0700
Subject: [PATCH] grep: -P (--perl-regexp) \d: match only ASCII digits
Prior to grep-3.9, the PCRE matcher had always treated \d just
like [0-9]. grep-3.9's fix for \w and \b mistakenly relaxed \d
to also match multibyte digits.
* src/grep.c (P_MATCHER_INDEX): Define enum.
(pcre_pattern_expand_backslash_d): New function.
(main): Call it for -P.
* NEWS (Bug fixes): Mention it.
* doc/grep.texi: Document it: with -P, \d matches only ASCII digits.
Provide a PCRE documentation URL and an example of how
to use (?s) with -z.
* tests/pcre-ascii-digits: New test.
* tests/Makefile.am (TESTS): Add that file name.
---
NEWS | 10 +++++
doc/grep.texi | 31 ++++++++++++++++
src/grep.c | 82 ++++++++++++++++++++++++++++++++++++++++-
tests/Makefile.am | 1 +
tests/pcre-ascii-digits | 31 ++++++++++++++++
5 files changed, 154 insertions(+), 1 deletion(-)
create mode 100755 tests/pcre-ascii-digits
diff --git a/NEWS b/NEWS
index 803e14b..a24cebd 100644
--- a/NEWS
+++ b/NEWS
@@ -2,6 +2,16 @@ GNU grep NEWS -*- outline
-*-
* Noteworthy changes in release ?.? (????-??-??) [?]
+** Bug fixes
+
+ With -P, \d now matches only ASCII digits, regardless of PCRE
+ options/modes. The changes in grep-3.9 to make \b and \w work
+ properly had the undesirable side effect of making \d also match
+ e.g., the Arabic digits: ٠١٢٣٤٥٦٧٨٩. With grep-3.9, -P '\d+'
+ would match that ten-digit (20-byte) string. Now, to match such
+ a digit, you would use \p{Nd}.
+ [bug introduced in grep 3.9]
+
* Noteworthy changes in release 3.9 (2023-03-05) [stable]
diff --git a/doc/grep.texi b/doc/grep.texi
index 621beaf..eaad6e1 100644
--- a/doc/grep.texi
+++ b/doc/grep.texi
@@ -1141,6 +1141,37 @@ combined with the @option{-z} (@option{--null-data})
option, and note that
@samp{grep@ -P} may warn of unimplemented features.
@xref{Other Options}.
+For documentation, refer to @url{https://www.pcre.org/}, with these caveats:
+@itemize
+@item
+@samp{\d} always matches only the ten ASCII digits, regardless of locale or
+in-regexp directives like @samp{(?aD)}.
+Use @samp{\p@{Nd@}} if you require to match non-ASCII digits.
+Once pcre2 support for @samp{(?aD)} is widespread enough,
+we expect to make that the default, so it will be overridable.
+@c Using pcre2 git commit pcre2-10.40-112-g6277357, this demonstrates how
+@c we'll prefix with (?aD) to make \d's ASCII-only behavior the default:
+@c $ LC_ALL=en_US.UTF-8 ./pcre2grep -u '(?aD)^\d+' <<< '٠١٢٣٤٥٦٧٨٩'
+@c [Exit 1]
+@c $ LC_ALL=en_US.UTF-8 ./pcre2grep -u '^\d+' <<< '٠١٢٣٤٥٦٧٨٩'
+@c ٠١٢٣٤٥٦٧٨٩
+
+@item
+By default, @command{grep} applies each regexp to a line at a time,
+so the @samp{(?s)} directive (making @samp{.} match line breaks)
+is generally ineffective.
+However, with @option{-z} (@option{--null-data}) it can work:
+@example
+$ printf 'a\nb\n' |grep -zP '(?s)a.b'
+a
+b
+@end example
+But beware: with the @option{-z} (@option{--null-data}) and a file
+containing no NUL byte, grep must read the entire file into memory
+before processing any of it.
+Thus, it will exhaust memory and fail for some large files.
+@end itemize
+
@end table
diff --git a/src/grep.c b/src/grep.c
index 7547b64..6ba881e 100644
--- a/src/grep.c
+++ b/src/grep.c
@@ -2089,7 +2089,8 @@ static struct
#endif
};
/* Keep these in sync with the 'matchers' table. */
-enum { E_MATCHER_INDEX = 1, F_MATCHER_INDEX = 2, G_MATCHER_INDEX = 0 };
+enum { E_MATCHER_INDEX = 1, F_MATCHER_INDEX = 2, G_MATCHER_INDEX = 0,
+ P_MATCHER_INDEX = 6 };
/* Return the index of the matcher corresponding to M if available.
MATCHER is the index of the previous matcher, or -1 if none.
@@ -2378,6 +2379,80 @@ fgrep_to_grep_pattern (char **keys_p, idx_t *len_p)
*len_p = p - new_keys;
}
+/* Replace each \d in *KEYS_P with [0-9], to ensure that \d matches only ASCII
+ digits. Now that we enable PCRE2_UCP for pcre regexps, \d would otherwise
+ match non-ASCII digits in some locales. Use \p{Nd} if you require to match
+ those. */
+static void
+pcre_pattern_expand_backslash_d (char **keys_p, idx_t *len_p)
+{
+ idx_t len = *len_p;
+ char *keys = *keys_p;
+ mbstate_t mb_state = { 0 };
+ char *new_keys = xnmalloc (len / 2 + 1, 5);
+ char *p = new_keys;
+ bool prev_backslash = false;
+
+ for (ptrdiff_t n; len; keys += n, len -= n)
+ {
+ n = mb_clen (keys, len, &mb_state);
+ switch (n)
+ {
+ case -2:
+ n = len;
+ FALLTHROUGH;
+ default:
+ if (prev_backslash)
+ {
+ prev_backslash = false;
+ *p++ = '\\';
+ }
+ p = mempcpy (p, keys, n);
+ break;
+
+ case -1:
+ if (prev_backslash)
+ {
+ prev_backslash = false;
+ *p++ = '\\';
+ }
+ memset (&mb_state, 0, sizeof mb_state);
+ n = 1;
+ FALLTHROUGH;
+ case 1:
+ if (prev_backslash)
+ {
+ prev_backslash = false;
+ switch (*keys)
+ {
+ case 'd':
+ p = mempcpy (p, "[0-9]", 5);
+ break;
+ default:
+ *p++ = '\\';
+ *p++ = *keys;
+ break;
+ }
+ }
+ else
+ {
+ if (*keys == '\\')
+ prev_backslash = true;
+ else
+ *p++ = *keys;
+ }
+ break;
+ }
+ }
+
+ if (prev_backslash)
+ *p++ = '\\';
+ *p = '\n';
+ free (*keys_p);
+ *keys_p = new_keys;
+ *len_p = p - new_keys;
+}
+
/* If it is easy, convert the MATCHER-style patterns KEYS (of size
*LEN_P) to -F style, update *LEN_P to a possibly-smaller value, and
return F_MATCHER_INDEX. If not, leave KEYS and *LEN_P alone and
@@ -2970,6 +3045,11 @@ main (int argc, char **argv)
matcher = try_fgrep_pattern (matcher, keys, &keycc);
}
+ /* If -P, replace each \d with [0-9].
+ Those who want to match non-ASCII digits must use \p{Nd}. */
+ if (matcher == P_MATCHER_INDEX)
+ pcre_pattern_expand_backslash_d (&keys, &keycc);
+
execute = matchers[matcher].execute;
compiled_pattern =
matchers[matcher].compile (keys, keycc, matchers[matcher].syntax,
diff --git a/tests/Makefile.am b/tests/Makefile.am
index a47cf5c..f195c8d 100644
--- a/tests/Makefile.am
+++ b/tests/Makefile.am
@@ -139,6 +139,7 @@ TESTS = \
options \
pcre \
pcre-abort \
+ pcre-ascii-digits \
pcre-context \
pcre-count \
pcre-infloop \
diff --git a/tests/pcre-ascii-digits b/tests/pcre-ascii-digits
new file mode 100755
index 0000000..ae713f7
--- /dev/null
+++ b/tests/pcre-ascii-digits
@@ -0,0 +1,31 @@
+#!/bin/sh
+# Ensure that grep -P's \d matches only the 10 ASCII digits.
+# With, grep-3.9, \d would match e.g., the multibyte Arabic digits.
+#
+# Copyright (C) 2023 Free Software Foundation, Inc.
+#
+# Copying and distribution of this file, with or without modification,
+# are permitted in any medium without royalty provided the copyright
+# notice and this notice are preserved.
+
+. "${srcdir=.}/init.sh"; path_prepend_ ../src
+require_en_utf8_locale_
+LC_ALL=en_US.UTF-8
+export LC_ALL
+require_pcre_
+
+echo . | grep -qP '(*UTF).' 2>/dev/null \
+ || skip_ 'PCRE unicode support is compiled out'
+
+fail=0
+
+# $ printf %s ٠١٢٣٤٥٦٧٨٩|od -An -to1 -w10 |sed 's/ /\\/g'; : arabic digits
+# \331\240\331\241\331\242\331\243\331\244
+# \331\245\331\246\331\247\331\250\331\251
+printf '\331\240\331\241\331\242\331\243\331\244' > in || framework_failure_
+printf '\331\245\331\246\331\247\331\250\331\251' >> in || framework_failure_
+
+grep -P '\d+' in > out && fail=1
+compare /dev/null out || fail=1
+
+Exit $fail
--
2.40.0.rc2
- bug#62267: grep-3.9 bug: \d matches multibyte digits,
Jim Meyering <=
- bug#62267: grep-3.9 bug: \d matches multibyte digits, Paul Eggert, 2023/03/18
- bug#62267: grep-3.9 bug: \d matches multibyte digits, Jim Meyering, 2023/03/19
- bug#62267: grep-3.9 bug: \d matches multibyte digits, Jim Meyering, 2023/03/19
- bug#62267: grep-3.9 bug: \d matches multibyte digits, Paul Eggert, 2023/03/19
- bug#62267: grep-3.9 bug: \d matches multibyte digits, Paul Eggert, 2023/03/19
- bug#62267: grep-3.9 bug: \d matches multibyte digits, Jim Meyering, 2023/03/19
- bug#62267: grep-3.9 bug: \d matches multibyte digits, Jim Meyering, 2023/03/19
- bug#62267: grep-3.9 bug: \d matches multibyte digits, Paul Eggert, 2023/03/19
- bug#62267: grep-3.9 bug: \d matches multibyte digits, Jim Meyering, 2023/03/19