bug#16481: dfa.c and Rational Range Interpretation


From: Paul Eggert
Subject: bug#16481: dfa.c and Rational Range Interpretation
Date: Fri, 17 Jan 2014 14:43:29 -0800
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:24.0) Gecko/20100101 Thunderbird/24.2.0

Thanks for continuing to bird-dog this.

On 01/17/2014 05:39 AM, Aharon Robbins wrote:

> the following diff lets grep check the other awk syntax
> variants.  Feel free to apply it.

I did that (the first patch enclosed below).
Thanks.

> I do think that gawk's code is the correct thing to be doing for RRI.

I agree, and installed the second patch enclosed below to
implement this.  This patch also includes some documentation
changes -- if you have a bit of time to review them I'd
appreciate it.
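
To make the term concrete, here is a minimal sketch of the rational
range interpretation for a unibyte range. It is not the dfa.c code
itself: the names rri_range, rri_count and NBYTES are hypothetical,
and the usage note assumes an ASCII-compatible character set.

```c
#include <stdbool.h>

enum { NBYTES = 256 };  /* hypothetical stand-in for dfa.c's NOTCHAR */

/* Mark every byte value from LO through HI (inclusive) in SET.
   Membership depends only on numeric byte order, never on the
   locale's collating sequence.  A reversed range marks nothing.  */
static void
rri_range (unsigned char lo, unsigned char hi, bool set[NBYTES])
{
  for (int c = lo; c <= hi; c++)
    set[c] = true;
}

/* Return the number of bytes marked in SET.  */
static int
rri_count (bool const set[NBYTES])
{
  int n = 0;
  for (int c = 0; c < NBYTES; c++)
    n += set[c];
  return n;
}
```

Starting from a zeroed array, rri_range ('a', 'z', set) marks 26
bytes on an ASCII system, and 'B' is not among them, whatever
LC_COLLATE says.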

Also, I notice that there are a few "#ifdef GREP"s in dfa.c.
Do you happen to know why they're needed?  It'd be nice if
we could simplify dfa.c to omit the need for the GREP macro.

> Additionally, I recommend that grep's configure check for good RRI
> support in the system regex routines and switch to the included ones
> if the system ones don't support it.

Unfortunately that'd break support for equivalence classes
and multibyte collation symbols on GNU/Linux platforms, so
it may be a bridge too far.  Until we get glibc fixed, I
think it's OK to live with the situation where [a-z]
ordinarily has the rational range interpretation, and this
breaks down only for complicated matches where the DFA
doesn't suffice; at least it'll work in the usual case.
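
For comparison, the following sketch shows roughly what deferring to
the system regex library involves: compile the range with regcomp and
test each byte with regexec. It is only an illustration modeled on the
code that the second patch removes. The name probe_system_range is
hypothetical, and the sketch assumes LO and HI are ordinary characters
with no special meaning inside brackets.

```c
#include <regex.h>
#include <stdbool.h>
#include <stddef.h>

/* Ask the system regex library which single bytes match [LO-HI] in
   the current locale, recording the answers in SET.  Byte 0 is
   skipped because it would make the subject an empty string.
   Return false if the pattern does not compile.  */
static bool
probe_system_range (char lo, char hi, bool set[256])
{
  char pattern[] = { '[', lo, '-', hi, ']', '\0' };
  regex_t re;

  if (regcomp (&re, pattern, REG_NOSUB) != 0)
    return false;

  for (int c = 1; c < 256; c++)
    {
      char subject[2] = { (char) c, '\0' };
      set[c] = regexec (&re, subject, 0, NULL, 0) == 0;
    }

  regfree (&re);
  return true;
}
```

In the C locale this agrees with byte order, so [a-d] yields exactly
a, b, c and d; in other locales the result is whatever the system
library decides, which is the inconsistency at issue here.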

From c862ced6f31f0ccdf2505ac46e354a1a011149cd Mon Sep 17 00:00:00 2001
From: Aharon Robbins <address@hidden>
Date: Fri, 17 Jan 2014 12:42:49 -0800
Subject: [PATCH 1/2] grep: add undocumented '-X gawk' and '-X posixawk'
 options

See <http://bugs.gnu.org/16481>.
* src/grep.c (GAcompile, PAcompile): New functions.
(matchers): Use them.
---
 src/grep.c | 14 ++++++++++++++
 1 file changed, 14 insertions(+)

diff --git a/src/grep.c b/src/grep.c
index 1b2198f..12644a2 100644
--- a/src/grep.c
+++ b/src/grep.c
@@ -19,10 +19,24 @@ Acompile (char const *pattern, size_t size)
   GEAcompile (pattern, size, RE_SYNTAX_AWK);
 }
 
+static void
+GAcompile (char const *pattern, size_t size)
+{
+  GEAcompile (pattern, size, RE_SYNTAX_GNU_AWK);
+}
+
+static void
+PAcompile (char const *pattern, size_t size)
+{
+  GEAcompile (pattern, size, RE_SYNTAX_POSIX_AWK);
+}
+
 struct matcher const matchers[] = {
   { "grep",    Gcompile, EGexecute },
   { "egrep",   Ecompile, EGexecute },
   { "awk",     Acompile, EGexecute },
+  { "gawk",    GAcompile, EGexecute },
+  { "posixawk", PAcompile, EGexecute },
   { "fgrep",   Fcompile, Fexecute },
   { "perl",    Pcompile, Pexecute },
   { NULL, NULL, NULL },
-- 
1.8.4.2


From aba2c718908d6c8fcfd75d55a43a4c9b1e3405a3 Mon Sep 17 00:00:00 2001
From: Paul Eggert <address@hidden>
Date: Fri, 17 Jan 2014 14:32:10 -0800
Subject: [PATCH 2/2] grep: DFA now uses rational ranges in unibyte locales

Problem reported by Aharon Robbins in <http://bugs.gnu.org/16481>.
* NEWS:
* doc/grep.texi (Environment Variables)
(Character Classes and Bracket Expressions):
Document this.
* src/dfa.c (parse_bracket_exp): Treat unibyte locales like multibyte.
---
 NEWS          |  8 ++++++++
 doc/grep.texi | 19 +++++++++----------
 src/dfa.c     | 20 ++------------------
 3 files changed, 19 insertions(+), 28 deletions(-)

diff --git a/NEWS b/NEWS
index 6e46684..589b2ac 100644
--- a/NEWS
+++ b/NEWS
@@ -7,6 +7,14 @@ GNU grep NEWS                                    -*- outline -*-
   grep -i in a multibyte locale is now typically 10 times faster
   for patterns that do not contain \ or [.
 
+  Range expressions in unibyte locales now ordinarily use the rational
+  range interpretation, in which [a-z] matches only lower-case ASCII
+  letters regardless of locale, and similarly for other ranges.  (This
+  was already true for multibyte locales.)  Portable programs should
+  continue to specify the C locale when using range expressions, since
+  these expressions have unspecified behavior in non-GNU systems and
+  are not yet guaranteed to use the rational range interpretation even
+  in GNU systems.
 
 * Noteworthy changes in release 2.16 (2014-01-01) [stable]
 
diff --git a/doc/grep.texi b/doc/grep.texi
index 473a181..42fb9a2 100644
--- a/doc/grep.texi
+++ b/doc/grep.texi
@@ -960,8 +960,8 @@ They are omitted (i.e., false) by default and become true when specified.
 @cindex national language support
 @cindex NLS
 These variables specify the locale for the @code{LC_COLLATE} category,
-which determines the collating sequence
-used to interpret range expressions like @samp{[a-z]}.
+which might affect how range expressions like @samp{[a-z]} are
+interpreted.
 
 @item LC_ALL
 @itemx LC_CTYPE
@@ -1223,14 +1223,13 @@ For example, the regular expression
 Within a bracket expression, a @dfn{range expression} consists of two
 characters separated by a hyphen.
 It matches any single character that
-sorts between the two characters, inclusive, using the locale's
-collating sequence and character set.
-For example, in the default C
-locale, @samp{[a-d]} is equivalent to @samp{[abcd]}.
-Many locales sort
-characters in dictionary order, and in these locales @samp{[a-d]} is
-typically not equivalent to @samp{[abcd]};
-it might be equivalent to @samp{[aBbCcDd]}, for example.
+sorts between the two characters, inclusive.
+In the default C locale, the sorting sequence is the native character
+order; for example, @samp{[a-d]} is equivalent to @samp{[abcd]}.
+In other locales, the sorting sequence is not specified, and
+@samp{[a-d]} might be equivalent to @samp{[abcd]} or to
+@samp{[aBbCcDd]}, or it might fail to match any character, or the set of
+characters that it matches might even be erratic.
 To obtain the traditional interpretation
 of bracket expressions, you can use the @samp{C} locale by setting the
 @env{LC_ALL} environment variable to the value @samp{C}.
diff --git a/src/dfa.c b/src/dfa.c
index 6ab4e05..5e3140d 100644
--- a/src/dfa.c
+++ b/src/dfa.c
@@ -1108,30 +1108,14 @@ parse_bracket_exp (void)
             }
           else
             {
-              /* Defer to the system regex library about the meaning
-                 of range expressions.  */
-              regex_t re;
-              char pattern[6] = { '[', 0, '-', 0, ']', 0 };
-              char subject[2] = { 0, 0 };
               c1 = c;
               if (case_fold)
                 {
                   c1 = tolower (c1);
                   c2 = tolower (c2);
                 }
-
-              pattern[1] = c1;
-              pattern[3] = c2;
-              regcomp (&re, pattern, REG_NOSUB);
-              for (c = 0; c < NOTCHAR; ++c)
-                {
-                  if ((case_fold && isupper (c)))
-                    continue;
-                  subject[0] = c;
-                  if (regexec (&re, subject, 0, NULL, 0) != REG_NOMATCH)
-                    setbit_case_fold_c (c, ccl);
-                }
-              regfree (&re);
+              for (c = c1; c <= c2; c++)
+                setbit_case_fold_c (c, ccl);
             }
 
           colon_warning_state |= 8;
-- 
1.8.4.2






