Re: [PATCH 7/9] dfa: run simple UTF-8 regexps as a single-byte character set

bug-grep

From:	Jim Meyering
Subject:	Re: [PATCH 7/9] dfa: run simple UTF-8 regexps as a single-byte character set
Date:	Mon, 15 Mar 2010 11:16:21 +0100

Paolo Bonzini wrote:
> This partially works around https://savannah.gnu.org/bugs/?29117,
> but in general provides a speedup whenever fgrep is "almost" sufficient
> but not quite (e.g. grep ^abc).  Speedup is too good to be true :-)
> (can get to 1000x on some not-too-contrived testcases).
>
> * src/dfa.c (dfaoptimize): New.
> (dfacomp): Call it.
> ---
>  src/dfa.c |   25 +++++++++++++++++++++++++
>  1 files changed, 25 insertions(+), 0 deletions(-)
>
> diff --git a/src/dfa.c b/src/dfa.c
> index 6a658c1..f9f7cd3 100644
> --- a/src/dfa.c
> +++ b/src/dfa.c
> @@ -3000,6 +3000,30 @@ dfainit (struct dfa *d)
>  #endif
>  }
>
> +static void
> +dfaoptimize (struct dfa *d)
> +{
> +  int i;
> +  if (!using_utf8)
> +    return;
> +
> +  for (i = 0; i < d->tindex; ++i)
> +    {
> +      switch(d->tokens[i])
> +     {
> +     case ANYCHAR:
> +       return;
> +     case MBCSET:
> +       return;
> +     default:
> +       break; /* can not happen.  */

That comment is false: the default case is reached for every token
other than ANYCHAR and MBCSET.  If it really could not happen, you
could replace the entire loop with

  if (d->tindex)
    return;

Stylistic: please put the two cases together:

        case ANYCHAR:
        case MBCSET:
          return;

Also stylistic, please declare "i" as an unsigned int.

Hmm... that makes me realize that dfa.tindex should probably be
declared as an unsigned type too, along with most of the other
members.  But let's not go there just yet ;-)
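
To make the review points above concrete, here is a minimal standalone
sketch, not the actual dfa.c code.  The token enum and the function
name needs_multibyte are hypothetical and exist only for illustration.
It merges the two case labels, uses an unsigned index, and shows that
the default branch is reached for ordinary tokens.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical token values, for illustration only.
   The real token definitions live in src/dfa.c.  */
enum token { TOK_CHAR, TOK_CAT, TOK_ANYCHAR, TOK_MBCSET };

/* Return 1 if any token in TOKENS[0..NTOKENS-1] needs multibyte
   handling, 0 otherwise.  */
static int
needs_multibyte (const enum token *tokens, unsigned int ntokens)
{
  unsigned int i;               /* unsigned, like the suggested "i" */

  assert (tokens != NULL || ntokens == 0);
  for (i = 0; i < ntokens; ++i)
    {
      switch (tokens[i])
        {
        case TOK_ANYCHAR:       /* the two cases share one body */
        case TOK_MBCSET:
          return 1;
        default:
          break;                /* reached for every other token */
        }
    }
  return 0;
}
```

For a token list such as { TOK_CHAR, TOK_CAT, TOK_CHAR } this returns 0
after visiting the default branch three times, which is the case that
an unconditional "if (d->tindex) return;" would get wrong.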
