Re: [PATCH 9/9] grep: match multibyte charsets line-by-line when using -

bug-grep

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [PATCH 9/9] grep: match multibyte charsets line-by-line when using -

From:	Jim Meyering
Subject:	Re: [PATCH 9/9] grep: match multibyte charsets line-by-line when using -i
Date:	Tue, 16 Mar 2010 14:25:14 +0100

Paolo Bonzini wrote:
> The turtle combination -i + MB_CUR_MAX>1 requires case conversion ahead
> of time.  Avoid doing this repeatedly when many matches succeed.  Together
> with the previous changes, this fixes https://savannah.gnu.org/bugs/?29117
> and https://savannah.gnu.org/bugs/?14472.
>
> * NEWS: Document new speedup.
> * src/grep.c (do_execute): New.
> (grepbuf): Use it.
> ---
>  NEWS       |    6 ++++++
>  src/grep.c |   40 ++++++++++++++++++++++++++++++++++++++--
...
> +static EXECUTE_RET do_execute EXECUTE_ARGS
> +{
> +  const char *line_buf, *line_end, *line_next;
> +  size_t result = (size_t) -1;
> +
> +  /* -i is a real turtle with multibyte character sts, so match
> +     line-by-line.
> +
> +     FIXME: this is just an ugly workaround, and it doesn't really
> +     belong here.  Also, PCRE is always using this same per-line
> +     matching algorithm.  Either we fix -i, or we should refactor
> +     this code---for example, we could adding another function pointer
> +     to struct matcher to split the buffer passed to execute.  It would
> +     perform the memchr if line-by-line matching is necessary, or just
> +     returns buf + size otherwise.  */
> +  if (MB_CUR_MAX == 1 || !match_icase)
> +    return execute(buf, size, match_size, start_ptr);
> +
> +  for (line_next = buf; result == (size_t)-1 && line_next < buf + size; )
> +    {
> +      line_buf = line_next;
> +      line_end = memchr (line_buf, eolbyte, (buf + size) - line_buf);
> +      if (line_end == NULL)
> +        line_next = line_end = buf + size;
> +      else
> +        line_next = line_end + 1;
> +
> +      if (start_ptr && start_ptr >= line_end)
> +        continue;
> +
> +      result = execute (line_buf, line_next - line_buf, match_size, 
> start_ptr);
> +    }
> +
> +  return result == (size_t)-1 ? result : (line_buf - buf) + result;
> +}

[I started looking at this one first (yeah, out of order), didn't
 understand things right away, so started adjusting to taste and
 adding comments until I did... ]

This looks fine.
However, I found it easier to read when moving a few declarations "down"
and converting the 2/3-"for" loop into a while-loop.

I've also adjusted and added to comments to reflect my understanding.

While I generally prefer to have a single point of return,
in this case, it seemed slightly clearer to handle
the "found" and "not found" cases separately.  You're
welcome to ignore that part if you prefer it the other way.

In case you're inclined, you're welcome to squash this into your
topmost cset.

diff --git a/src/grep.c b/src/grep.c
index 19e04e2..323f509 100644
--- a/src/grep.c
+++ b/src/grep.c
@@ -1027,26 +1027,26 @@ prtext (char const *beg, char const *lim, int *nlinesp)

 static EXECUTE_RET do_execute EXECUTE_ARGS
 {
-  const char *line_buf, *line_end, *line_next;
-  size_t result = (size_t) -1;
-
-  /* -i is a real turtle with multibyte character sts, so match
-     line-by-line.
+  /* With the current implementation, using --ignore-case with a multi-byte
+     character set is very inefficient when applied to a large buffer
+     containing many matches.  We can avoid much of the wasted effort
+     by matching line-by-line.

      FIXME: this is just an ugly workaround, and it doesn't really
      belong here.  Also, PCRE is always using this same per-line
      matching algorithm.  Either we fix -i, or we should refactor
-     this code---for example, we could adding another function pointer
+     this code---for example, we could add another function pointer
      to struct matcher to split the buffer passed to execute.  It would
      perform the memchr if line-by-line matching is necessary, or just
-     returns buf + size otherwise.  */
+     return buf + size otherwise.  */
   if (MB_CUR_MAX == 1 || !match_icase)
     return execute(buf, size, match_size, start_ptr);

-  for (line_next = buf; result == (size_t)-1 && line_next < buf + size; )
+  const char *line_next = buf;
+  while (line_next < buf + size)
     {
-      line_buf = line_next;
-      line_end = memchr (line_buf, eolbyte, (buf + size) - line_buf);
+      const char *line_buf = line_next;
+      const char *line_end = memchr (line_buf, eolbyte, (buf + size) - 
line_buf);
       if (line_end == NULL)
         line_next = line_end = buf + size;
       else
@@ -1055,10 +1055,13 @@ static EXECUTE_RET do_execute EXECUTE_ARGS
       if (start_ptr && start_ptr >= line_end)
         continue;

-      result = execute (line_buf, line_next - line_buf, match_size, start_ptr);
+      size_t result = execute (line_buf, line_next - line_buf,
+                              match_size, start_ptr);
+      if (result != (size_t) -1)
+       return (line_buf - buf) + result;
     }

-  return result == (size_t)-1 ? result : (line_buf - buf) + result;
+  return (size_t) -1;
 }

 /* Scan the specified portion of the buffer, matching lines (or

[Prev in Thread]

Current Thread

[Next in Thread]

Re: [PATCH 5/9] dfa: optimize simple character sets under UTF-8 charsets, (continued)
- [PATCH 7/9] dfa: run simple UTF-8 regexps as a single-byte character set, Paolo Bonzini, 2010/03/14
  - Re: [PATCH 7/9] dfa: run simple UTF-8 regexps as a single-byte character set, Jim Meyering, 2010/03/15
- [PATCH 6/9] dfa: cache MB_CUR_MAX for dfaexec, Paolo Bonzini, 2010/03/14
  - Re: [PATCH 6/9] dfa: cache MB_CUR_MAX for dfaexec, Jim Meyering, 2010/03/17
    - Re: [PATCH 6/9] dfa: cache MB_CUR_MAX for dfaexec, Paolo Bonzini, 2010/03/17
- [PATCH 8/9] grep: remove check_multibyte_string, fix non-UTF8 missed match, Paolo Bonzini, 2010/03/14
  - Re: [PATCH 8/9] grep: remove check_multibyte_string, fix non-UTF8 missed match, Jim Meyering, 2010/03/17
- [PATCH 9/9] grep: match multibyte charsets line-by-line when using -i, Paolo Bonzini, 2010/03/14
  - Re: [PATCH 9/9] grep: match multibyte charsets line-by-line when using -i, Jim Meyering <=
    - Re: [PATCH 9/9] grep: match multibyte charsets line-by-line when using -i, Paolo Bonzini, 2010/03/16
    - Re: [PATCH 9/9] grep: match multibyte charsets line-by-line when using -i, Jim Meyering, 2010/03/16
    - Re: [PATCH 9/9] grep: match multibyte charsets line-by-line when using -i, Paolo Bonzini, 2010/03/16

Prev by Date: rebased utf-8 series
Next by Date: Re: [PATCH 9/9] grep: match multibyte charsets line-by-line when using -i
Previous by thread: [PATCH 9/9] grep: match multibyte charsets line-by-line when using -i
Next by thread: Re: [PATCH 9/9] grep: match multibyte charsets line-by-line when using -i
Index(es):
- Date
- Thread