Re: [PATCH 5/9] dfa: optimize simple character sets under UTF-8 charsets

bug-grep

From:	Jim Meyering
Subject:	Re: [PATCH 5/9] dfa: optimize simple character sets under UTF-8 charsets
Date:	Wed, 17 Mar 2010 11:10:31 +0100

Paolo Bonzini wrote:
> Use a bitset when not involving MBCSET is possible.  Testcase:
>    yes 'the quick brown fox jumps over the lazy dog' | sed 100000q | \
>      time grep -c [ABCDEFGHIJKLMNOPQRSTUVWXYZ,]
>
> Before: 51ms (best of three runs); after: 16ms (best of three runs).
>
> * src/dfa.c (check_utf8, using_utf8): New.
> (parse_bracket_exp): For simple bracket expressions under UTF-8,
> use a CSET.
> (dfacomp): Call check_utf8.
> ---
>  src/dfa.c |   34 +++++++++++++++++++++++++++++++++-
>  1 files changed, 33 insertions(+), 1 deletions(-)
>
> diff --git a/src/dfa.c b/src/dfa.c
> index ed4e1ae..da70aa1 100644
> --- a/src/dfa.c
> +++ b/src/dfa.c
> @@ -21,6 +21,7 @@
>     Modified July, 1988 by Arthur David Olson to assist BMG speedups  */
>
>  #include <config.h>
> +#include <assert.h>
>  #include <ctype.h>
>  #include <stdio.h>
>  #include <sys/types.h>
> @@ -78,6 +79,7 @@
>  /* We can handle multibyte strings. */
>  # include <wchar.h>
>  # include <wctype.h>
> +# include <langinfo.h>
>  #endif
>
>  #include "regex.h"
> @@ -312,8 +314,27 @@ static wchar_t *inputwcs;        /* Wide character representation of input
>                                  And inputwcs[i] is the codepoint.  */
>  static unsigned char const *buf_begin;       /* reference to begin in dfaexec().  */
>  static unsigned char const *buf_end; /* reference to end in dfaexec().  */
> +
> +/* UTF-8 encoding allows some optimizations that we can't otherwise
> +   assume in a multibyte encoding. */
> +static int using_utf8;
> +
> +static void
> +check_utf8 (void)
> +{
> +#ifdef HAVE_LANGINFO_CODESET
> +  if (strcmp (nl_langinfo (CODESET), "UTF-8") == 0)
> +    using_utf8 = 1;
> +#endif
> +}
> +#else
> +static void
> +check_utf8 (void)
> +{
> +}
>  #endif /* MBS_SUPPORT  */

What do you think about dropping the global variable
and simply calling the function "using_utf8"?

static inline bool
using_utf8 (void)
{
  static bool utf8;
  static bool first_call = true;
  if (first_call)
    {
#ifdef HAVE_LANGINFO_CODESET
      utf8 = (strcmp (nl_langinfo (CODESET), "UTF-8") == 0);
#else
      utf8 = false;
#endif
      first_call = false;
    }

  return utf8;
}

Hmm... I guess we have to be leery of using "bool" in dfa.c since it's
slated to be shared with gawk (which lacks gnulib).  So we should
stick with "int" and 0/1.
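
For concreteness, here is a minimal sketch of what that int-based variant might look like. It is only an illustration of the suggestion above, not the code that was committed. It assumes HAVE_LANGINFO_CODESET comes from config.h as in the patch, and it drops "inline" on the guess that dfa.c should avoid C99-only keywords for the same portability reason.

```c
#include <string.h>
#ifdef HAVE_LANGINFO_CODESET   /* assumed to be defined by config.h */
# include <langinfo.h>
#endif

/* Return 1 if the current locale's codeset is UTF-8, 0 otherwise.
   The answer is computed on the first call and cached afterwards.
   Uses int and 0/1 rather than bool, per the gawk concern above. */
static int
using_utf8 (void)
{
  static int utf8;
  static int first_call = 1;
  if (first_call)
    {
#ifdef HAVE_LANGINFO_CODESET
      utf8 = (strcmp (nl_langinfo (CODESET), "UTF-8") == 0);
#else
      utf8 = 0;
#endif
      first_call = 0;
    }

  return utf8;
}
```

Because the result is cached, the first call has to happen after setlocale; a later locale change would not be noticed.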

Either way, ACK.
