Re: [PATCH 7/9] dfa: run simple UTF-8 regexps as a single-byte character
From: Jim Meyering
Subject: Re: [PATCH 7/9] dfa: run simple UTF-8 regexps as a single-byte character set
Date: Mon, 15 Mar 2010 11:16:21 +0100

Paolo Bonzini wrote:
> This partially works around https://savannah.gnu.org/bugs/?29117,
> but in general provides a speedup whenever fgrep is "almost" sufficient
> but not quite (e.g. grep ^abc). Speedup is too good to be true :-)
> (can get to 1000x on some not-too-contrived testcases).
>
> * src/dfa.c (dfaoptimize): New.
> (dfacomp): Call it.
> ---
> src/dfa.c | 25 +++++++++++++++++++++++++
> 1 files changed, 25 insertions(+), 0 deletions(-)
>
> diff --git a/src/dfa.c b/src/dfa.c
> index 6a658c1..f9f7cd3 100644
> --- a/src/dfa.c
> +++ b/src/dfa.c
> @@ -3000,6 +3000,30 @@ dfainit (struct dfa *d)
> #endif
> }
>
> +static void
> +dfaoptimize (struct dfa *d)
> +{
> +  int i;
> +  if (!using_utf8)
> +    return;
> +
> +  for (i = 0; i < d->tindex; ++i)
> +    {
> +      switch(d->tokens[i])
> +        {
> +        case ANYCHAR:
> +          return;
> +        case MBCSET:
> +          return;
> +        default:
> +          break; /* can not happen. */

That comment is false: the default case is reached for every token
other than ANYCHAR and MBCSET.
If it really could not happen, you could replace the entire loop with

  if (d->tindex)
    return;

Stylistic: please put the two cases together:

  case ANYCHAR:
  case MBCSET:
    return;

Also stylistic, please declare "i" as an unsigned int.
Hmm... that makes me realize that dfa.tindex should probably be
declared as an unsigned type too, along with most of the other
members. But let's not go there just yet ;-)
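
For illustration, the two stylistic requests above (merged case labels and an unsigned index) could look roughly like the sketch below. This is not the code from the patch: the token enum, struct toy_dfa and needs_multibyte are hypothetical stand-ins invented here, reduced to the scan over the token array, and tindex is declared unsigned only to match the suggestion in the review.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for the token codes used in dfa.c. */
typedef enum { TOK_CHAR, ANYCHAR, MBCSET } token;

/* Hypothetical, reduced stand-in for struct dfa. */
struct toy_dfa
{
  const token *tokens;
  unsigned int tindex;          /* number of tokens in use */
};

/* Return true if any token needs multibyte handling.  */
static bool
needs_multibyte (const struct toy_dfa *d)
{
  unsigned int i;

  for (i = 0; i < d->tindex; ++i)
    {
      switch (d->tokens[i])
        {
        case ANYCHAR:
        case MBCSET:
          return true;
        default:
          break;                /* reached for every other token */
        }
    }
  return false;
}
```

With a token array that contains neither ANYCHAR nor MBCSET, the loop runs to completion through the default case, which is why the loop is not equivalent to a bare test of d->tindex.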