[PATCH 17/17] grep: match multibyte charsets line-by-line when using -i

bug-grep

From:	Paolo Bonzini
Subject:	[PATCH 17/17] grep: match multibyte charsets line-by-line when using -i
Date:	Fri, 12 Mar 2010 18:49:18 +0100

The slow ("turtle") combination of -i with MB_CUR_MAX > 1 requires case
conversion ahead of time.  Avoid repeating that conversion when many
matches succeed by matching line by line.  Together with the previous
changes, this fixes https://savannah.gnu.org/bugs/?29117 and
https://savannah.gnu.org/bugs/?14472.

* src/grep.c (do_execute): New.
(grepbuf): Use it.
---
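
Note: the per-line strategy described above can be sketched in isolation
roughly as follows.  This is only an illustration under stated assumptions,
not the code in the patch: toy_execute is a hypothetical stand-in for grep's
execute (it is a plain substring search and reports the match itself rather
than the matching line), and match_by_line is a hypothetical name.  The part
that mirrors the patch is the memchr-based line splitting and the
translation of a line-relative offset back to a buffer-relative one.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for execute(): return the offset of the first
   occurrence of NEEDLE in BUF[0..SIZE), or (size_t) -1 if there is none.
   On success, store the match length in *MATCH_SIZE.  */
size_t
toy_execute (const char *buf, size_t size, const char *needle,
             size_t *match_size)
{
  size_t n = strlen (needle);
  for (size_t i = 0; n != 0 && i + n <= size; i++)
    if (memcmp (buf + i, needle, n) == 0)
      {
        *match_size = n;
        return i;
      }
  return (size_t) -1;
}

/* Run the matcher on one line at a time.  Each call sees only a single
   line (including its terminator, if any), so any per-call preprocessing
   is bounded by the line length.  The returned offset is relative to BUF.  */
size_t
match_by_line (const char *buf, size_t size, char eol,
               const char *needle, size_t *match_size)
{
  const char *lim = buf + size;
  const char *line = buf;

  while (line < lim)
    {
      const char *end = memchr (line, eol, lim - line);
      const char *next = end ? end + 1 : lim;   /* last line may lack EOL */
      size_t off = toy_execute (line, next - line, needle, match_size);

      if (off != (size_t) -1)
        return (line - buf) + off;              /* line-relative -> buffer */
      line = next;
    }
  return (size_t) -1;
}
```

With this sketch, searching "abc\ndef\nxyz\n" for "def" yields offset 4,
while a needle that spans a line terminator never matches, since no single
call sees both sides of the terminator.
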
 src/grep.c |   40 ++++++++++++++++++++++++++++++++++++++--
 1 files changed, 38 insertions(+), 2 deletions(-)

diff --git a/src/grep.c b/src/grep.c
index f1d341a..1f73c70 100644
--- a/src/grep.c
+++ b/src/grep.c
@@ -1025,6 +1025,42 @@ prtext (char const *beg, char const *lim, int *nlinesp)
   used = 1;
 }
 
+EXECUTE_RET do_execute EXECUTE_ARGS
+{
+  const char *line_buf, *line_end, *line_next;
+  size_t result = (size_t) -1;
+
+  /* -i is a real turtle with multibyte character sets, so match
+     line-by-line.
+
+     FIXME: this is just an ugly workaround, and it doesn't really
+     belong here.  Also, PCRE is always using this same per-line
+     matching algorithm.  Either we fix -i, or we should refactor
+     this code---for example, we could add another function pointer
+     to struct matcher to split the buffer passed to execute.  It would
+     perform the memchr if line-by-line matching is necessary, or just
+     return buf + size otherwise.  */
+  if (MB_CUR_MAX == 1 || !match_icase)
+    return execute(buf, size, match_size, start_ptr);
+
+  for (line_next = buf; result == (size_t)-1 && line_next < buf + size; )
+    {
+      line_buf = line_next;
+      line_end = memchr (line_buf, eolbyte, (buf + size) - line_buf);
+      if (line_end == NULL)
+        line_next = line_end = buf + size;
+      else
+        line_next = line_end + 1;
+
+      if (start_ptr && start_ptr >= line_end)
+        continue;
+
+      result = execute (line_buf, line_next - line_buf, match_size, start_ptr);
+    }
+
+  return result == (size_t)-1 ? result : (line_buf - buf) + result;
+}
+
 /* Scan the specified portion of the buffer, matching lines (or
    between matching lines if OUT_INVERT is true).  Return a count of
    lines printed. */
@@ -1038,8 +1074,8 @@ grepbuf (char const *beg, char const *lim)
 
   nlines = 0;
   p = beg;
-  while ((match_offset = execute(p, lim - p, &match_size,
-                                NULL)) != (size_t) -1)
+  while ((match_offset = do_execute(p, lim - p, &match_size,
+                                   NULL)) != (size_t) -1)
     {
       char const *b = p + match_offset;
       char const *endp = b + match_size;
-- 
1.6.6
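
The refactoring suggested in the FIXME comment might look roughly like the
sketch below.  Everything here is hypothetical: struct toy_matcher, the
split_fn type, split_whole, split_line and count_chunks are illustrative
names, not part of grep, and the real struct matcher has other members.
The idea is only that each matcher supplies a function returning the end of
the next chunk to hand to execute: either the whole remaining buffer, or a
single line found with memchr.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical splitter: given the chunk start BEG and the buffer limit
   LIM, return the end of the chunk to pass to execute in one call.  */
typedef const char *(*split_fn) (const char *beg, const char *lim, char eol);

/* Matchers that can take the whole buffer at once.  */
const char *
split_whole (const char *beg, const char *lim, char eol)
{
  (void) beg;
  (void) eol;
  return lim;
}

/* Matchers that need line-by-line input.  */
const char *
split_line (const char *beg, const char *lim, char eol)
{
  const char *end = memchr (beg, eol, lim - beg);
  return end ? end + 1 : lim;
}

/* Hypothetical, reduced stand-in for struct matcher.  */
struct toy_matcher
{
  const char *name;
  split_fn split;
};

/* Count how many execute calls a full scan of the buffer would need.  */
size_t
count_chunks (const struct toy_matcher *m, const char *buf, size_t size,
              char eol)
{
  size_t n = 0;
  const char *lim = buf + size;

  for (const char *p = buf; p < lim; p = m->split (p, lim, eol))
    n++;
  return n;
}
```

Under this arrangement the caller would not need to test MB_CUR_MAX or
match_icase itself; choosing split_line or split_whole when the matcher is
set up would carry that decision.
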
