[PATCH 9/9] grep: match multibyte charsets line-by-line when using -i

From:	Paolo Bonzini
Subject:	[PATCH 9/9] grep: match multibyte charsets line-by-line when using -i
Date:	Sun, 14 Mar 2010 16:35:14 +0100

The slow ("turtle") combination of -i and MB_CUR_MAX > 1 requires case
conversion ahead of time.  Match line by line in that case, so that the
conversion is not repeated when many matches succeed.  Together with the
previous changes, this fixes https://savannah.gnu.org/bugs/?29117 and
https://savannah.gnu.org/bugs/?14472.

* NEWS: Document new speedup.
* src/grep.c (do_execute): New.
(grepbuf): Use it.
---
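Note for readers (not part of the commit message): the line-by-line
idea behind do_execute can be tried in isolation.  What follows is a
minimal, simplified sketch, not the code in this patch.  The names
match_fn, find_x and search_by_line are hypothetical; the toy matcher
only stands in for grep's execute(), and the start_ptr argument and the
MB_CUR_MAX / match_icase check are left out.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for grep's execute(): search [buf, buf + size)
   and return the offset of the first match, or (size_t) -1 if there is
   none.  On success, *match_size receives the match length.  */
typedef size_t (*match_fn) (const char *buf, size_t size, size_t *match_size);

/* Toy matcher, assumed only for illustration: finds the byte 'x'.  */
static size_t
find_x (const char *buf, size_t size, size_t *match_size)
{
  const char *p = memchr (buf, 'x', size);
  if (p == NULL)
    return (size_t) -1;
  *match_size = 1;
  return (size_t) (p - buf);
}

/* Feed the buffer to MATCH one line at a time and stop at the first
   line that matches.  The matcher reports an offset relative to the
   line it was given, so add that line's offset within BUF.  */
static size_t
search_by_line (match_fn match, const char *buf, size_t size,
                char eol, size_t *match_size)
{
  const char *end = buf + size;
  const char *line = buf;

  assert (buf != NULL);
  while (line < end)
    {
      const char *nl = memchr (line, eol, (size_t) (end - line));
      /* Keep the terminator with its line; the last line may lack one.  */
      const char *next = nl ? nl + 1 : end;
      size_t off = match (line, (size_t) (next - line), match_size);
      if (off != (size_t) -1)
        return (size_t) (line - buf) + off;
      line = next;
    }
  return (size_t) -1;
}
```

With this sketch, search_by_line (find_x, "ab\ncxd\n", 7, '\n', &len)
returns 4 and sets len to 1: the first line does not match, and the
match at offset 1 of the second line is shifted by that line's offset
(3) within the buffer.
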
 NEWS       |    6 ++++++
 src/grep.c |   40 ++++++++++++++++++++++++++++++++++++++--
 2 files changed, 44 insertions(+), 2 deletions(-)

diff --git a/NEWS b/NEWS
index a2db324..1a88d21 100644
--- a/NEWS
+++ b/NEWS
@@ -2,6 +2,12 @@ GNU grep NEWS                                    -*- outline -*-
 
 * Noteworthy changes in release ?.? (????-??-??) [?]
 
+** Speed improvements
+
+  grep is much faster on multibyte character sets, especially (but not
+  limited to) UTF-8 character sets.  The speed improvement is also very
+  pronounced with case-insensitive matches.
+
 ** Bug fixes
 
   grep -i with a character class would malfunction in multi-byte locales.
diff --git a/src/grep.c b/src/grep.c
index f1d341a..19e04e2 100644
--- a/src/grep.c
+++ b/src/grep.c
@@ -1025,6 +1025,42 @@ prtext (char const *beg, char const *lim, int *nlinesp)
   used = 1;
 }
 
+static EXECUTE_RET do_execute EXECUTE_ARGS
+{
+  const char *line_buf, *line_end, *line_next;
+  size_t result = (size_t) -1;
+
+  /* -i is a real turtle with multibyte character sets, so match
+     line-by-line.
+
+     FIXME: this is just an ugly workaround, and it doesn't really
+     belong here.  Also, PCRE is always using this same per-line
+     matching algorithm.  Either we fix -i, or we should refactor
+     this code---for example, we could add another function pointer
+     to struct matcher to split the buffer passed to execute.  It would
+     perform the memchr if line-by-line matching is necessary, or just
+     return buf + size otherwise.  */
+  if (MB_CUR_MAX == 1 || !match_icase)
+    return execute(buf, size, match_size, start_ptr);
+
+  for (line_next = buf; result == (size_t)-1 && line_next < buf + size; )
+    {
+      line_buf = line_next;
+      line_end = memchr (line_buf, eolbyte, (buf + size) - line_buf);
+      if (line_end == NULL)
+        line_next = line_end = buf + size;
+      else
+        line_next = line_end + 1;
+
+      if (start_ptr && start_ptr >= line_end)
+        continue;
+
+      result = execute (line_buf, line_next - line_buf, match_size, start_ptr);
+    }
+
+  return result == (size_t)-1 ? result : (line_buf - buf) + result;
+}
+
 /* Scan the specified portion of the buffer, matching lines (or
    between matching lines if OUT_INVERT is true).  Return a count of
    lines printed. */
@@ -1038,8 +1074,8 @@ grepbuf (char const *beg, char const *lim)
 
   nlines = 0;
   p = beg;
-  while ((match_offset = execute(p, lim - p, &match_size,
-                                NULL)) != (size_t) -1)
+  while ((match_offset = do_execute(p, lim - p, &match_size,
+                                   NULL)) != (size_t) -1)
     {
       char const *b = p + match_offset;
       char const *endp = b + match_size;
-- 
1.6.6.1
