Re: [PATCH 5/9] dfa: optimize simple character sets under UTF-8 charsets
From: Jim Meyering
Subject: Re: [PATCH 5/9] dfa: optimize simple character sets under UTF-8 charsets
Date: Wed, 17 Mar 2010 11:10:31 +0100
Paolo Bonzini wrote:
> Use a bitset when not involving MBCSET is possible. Testcase:
> yes 'the quick brown fox jumps over the lazy dog' | sed 100000q | \
> time grep -c [ABCDEFGHIJKLMNOPQRSTUVWXYZ,]
>
> Before: 51ms (best of three runs); after: 16ms (best of three runs).
>
> * src/dfa.c (check_utf8, using_utf8): New.
> (parse_bracket_exp): For simple bracket expressions under UTF-8,
> use a CSET.
> (dfacomp): Call check_utf8.
> ---
> src/dfa.c | 34 +++++++++++++++++++++++++++++++++-
> 1 files changed, 33 insertions(+), 1 deletions(-)
>
> diff --git a/src/dfa.c b/src/dfa.c
> index ed4e1ae..da70aa1 100644
> --- a/src/dfa.c
> +++ b/src/dfa.c
> @@ -21,6 +21,7 @@
> Modified July, 1988 by Arthur David Olson to assist BMG speedups */
>
> #include <config.h>
> +#include <assert.h>
> #include <ctype.h>
> #include <stdio.h>
> #include <sys/types.h>
> @@ -78,6 +79,7 @@
> /* We can handle multibyte strings. */
> # include <wchar.h>
> # include <wctype.h>
> +# include <langinfo.h>
> #endif
>
> #include "regex.h"
> @@ -312,8 +314,27 @@ static wchar_t *inputwcs; /* Wide character representation of input
> And inputwcs[i] is the codepoint. */
> static unsigned char const *buf_begin; /* reference to begin in dfaexec(). */
> static unsigned char const *buf_end; /* reference to end in dfaexec(). */
> +
> +/* UTF-8 encoding allows some optimizations that we can't otherwise
> + assume in a multibyte encoding. */
> +static int using_utf8;
> +
> +static void
> +check_utf8 (void)
> +{
> +#ifdef HAVE_LANGINFO_CODESET
> + if (strcmp (nl_langinfo (CODESET), "UTF-8") == 0)
> + using_utf8 = 1;
> +#endif
> +}
> +#else
> +static void
> +check_utf8 (void)
> +{
> +}
> #endif /* MBS_SUPPORT */
What do you think about dropping the global variable
and simply calling the function "using_utf8"?

static inline bool
using_utf8 (void)
{
  static bool utf8;
  static bool first_call = true;
  if (first_call)
    {
#ifdef HAVE_LANGINFO_CODESET
      utf8 = (strcmp (nl_langinfo (CODESET), "UTF-8") == 0);
#else
      utf8 = false;
#endif
      first_call = false;
    }
  return utf8;
}

Hmm... I guess we have to be leery of using "bool" in dfa.c since it's
slated to be shared with gawk (which lacks gnulib). So we should
stick with "int" and 0/1.
Either way, ACK.
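
For reference, here is a rough sketch of how the "int" and 0/1 variant might look. It is only an illustration, not the code that was committed. Two details are assumptions of this sketch: HAVE_LANGINFO_CODESET is defined by hand so the fragment compiles on its own (normally config.h would supply it), and a -1 sentinel replaces the separate first_call flag.

```c
#include <string.h>

/* Assumption: config.h would normally define this on systems that
   provide nl_langinfo (CODESET); it is defined by hand here so the
   sketch compiles standalone on a POSIX system.  */
#define HAVE_LANGINFO_CODESET 1

#ifdef HAVE_LANGINFO_CODESET
# include <langinfo.h>
#endif

/* Return 1 if the current locale's codeset is UTF-8, 0 otherwise.
   The answer is computed on the first call and cached, so later
   locale changes are not noticed.  */
static int
using_utf8 (void)
{
  static int utf8 = -1;		/* -1 means "not determined yet" */
  if (utf8 < 0)
    {
#ifdef HAVE_LANGINFO_CODESET
      utf8 = (strcmp (nl_langinfo (CODESET), "UTF-8") == 0);
#else
      utf8 = 0;
#endif
    }
  return utf8;
}
```

Callers would be expected to run setlocale (LC_ALL, "") before the first call, since the cached result reflects whatever locale was active at that point.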