Re: [PATCH 8/9] grep: remove check_multibyte

bug-grep

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [PATCH 8/9] grep: remove check_multibyte_string, fix non-UTF8 missed

From:	Jim Meyering
Subject:	Re: [PATCH 8/9] grep: remove check_multibyte_string, fix non-UTF8 missed match
Date:	Wed, 17 Mar 2010 10:38:14 +0100

Paolo Bonzini wrote:
> Avoid computing ahead something that can be computed lazily as efficiently
> (or more efficiently in the case of UTF-8, though this is left as TODO).
> At the same time, "soften" the rejection condition for matching in the
> middle of a multibyte sequence to fix bug 23814.
>
> Multibyte "grep -i" is still very slow.
>
> * NEWS: Document bugfix.
> * src/search.c (check_multibyte_string): Rewrite as...
> (is_mb_middle): ... this.
> (EGexecute, Fexecute): Adjust.
> * tests/Makefile.am (TESTS): Add euc-mb.
> * tests/euc-mb: New testcase.

Nice fix, nice test.

Here are some stylistic suggestions:

> diff --git a/src/search.c b/src/search.c
> index 9d73abc..ce9f92c 100644
> --- a/src/search.c
> +++ b/src/search.c
> @@ -220,20 +220,22 @@ kwsmusts (void)
>
>  #ifdef MBS_SUPPORT
>
> -static char*
> -check_multibyte_string(const char *buf, size_t size)
> +static bool
> +is_mb_middle(const char **good, const char *buf, const char *end)
>  {
> -  char *mb_properties = xcalloc(size, 1);
> +  const char *p = *good, *prev = p;

Please declare only one variable per line.
IMHO, this is far more readable than the one-line version above:

    const char *p = *good;
    const char *prev = p;

I haven't said much about this yet because there are so many legacy violations.
This is a general rule discussed in the GCS, but it is especially
important when a declaration comes with an initialization.

>    mbstate_t cur_state;
> -  wchar_t wc;
> -  int i;
>
> +  /* TODO: can be optimized for UTF-8.  */
>    memset(&cur_state, 0, sizeof(mbstate_t));
> -
> -  for (i = 0; i < size ;)
> +  while (p < buf)
>      {
>        size_t mbclen;
> -      mbclen = mbrtowc(&wc, buf + i, size - i, &cur_state);
> +      mbclen = mbrlen(p, end - p, &cur_state);

Saving vertical space is good for readability, too.
Please merge declaration and first assignment when possible:

         size_t mbclen = mbrlen(p, end - p, &cur_state);

> +      /* Store the beginning of the previous complete multibyte character.  
> */
> +      if (mbclen != (size_t) -2)
> +        prev = p;
>
>        if (mbclen == (size_t) -1 || mbclen == (size_t) -2 || mbclen == 0)
>       {
> @@ -241,11 +243,11 @@ check_multibyte_string(const char *buf, size_t size)
>            We treat it as a single byte character.  */
>         mbclen = 1;
>       }
> -      mb_properties[i] = mbclen;
> -      i += mbclen;
> +      p += mbclen;
>      }
>
> -  return mb_properties;
> +  *good = prev;
> +  return p > buf;
...
> diff --git a/tests/euc-mb b/tests/euc-mb
> new file mode 100644
> index 0000000..a59c295
> --- /dev/null
> +++ b/tests/euc-mb
> @@ -0,0 +1,23 @@
> +#!/bin/sh
> +# test that matches starting in the middle of a multibyte char aren't 
> rejected
> +# too greedily.
> +# Derived from https://savannah.gnu.org/bugs/?23814
> +: ${srcdir=.}
> +. "$srcdir/init.sh"; path_prepend_ ../src
> +
> +make_input () {
> +  echo "$1" | tr AB '\244\263'
> +}
> +
> +euc_grep () {
> +  LC_ALL=ja_JP.EUC-JP grep `make_input "$1"`

I'm a little nervous about using an unquoted multi-byte argument to grep.
How about adding quotes?

    pat=$(make_input "$1")
    LC_ALL=ja_JP.EUC-JP grep "$pat"

> +}
> +
> +if make_input BABA |euc_grep AB ; then
> +  skip_ 'EUC-JP locale seems not to work'
> +fi
>
> +make_input BABAAB |euc_grep AB || \
> +  fail_ 'whole line rejected after matching in the middle of a multibyte 
> char'
> +
> +exit 0

It's slightly more portable to finish with "Exit 0".

[Prev in Thread]

Current Thread

[Next in Thread]

Re: [PATCH 4/9] dfa: speed up handling of brackets, (continued)
- [PATCH 5/9] dfa: optimize simple character sets under UTF-8 charsets, Paolo Bonzini, 2010/03/14
  - Re: [PATCH 5/9] dfa: optimize simple character sets under UTF-8 charsets, Jim Meyering, 2010/03/17
    - Re: [PATCH 5/9] dfa: optimize simple character sets under UTF-8 charsets, Paolo Bonzini, 2010/03/17
- [PATCH 7/9] dfa: run simple UTF-8 regexps as a single-byte character set, Paolo Bonzini, 2010/03/14
  - Re: [PATCH 7/9] dfa: run simple UTF-8 regexps as a single-byte character set, Jim Meyering, 2010/03/15
- [PATCH 6/9] dfa: cache MB_CUR_MAX for dfaexec, Paolo Bonzini, 2010/03/14
  - Re: [PATCH 6/9] dfa: cache MB_CUR_MAX for dfaexec, Jim Meyering, 2010/03/17
    - Re: [PATCH 6/9] dfa: cache MB_CUR_MAX for dfaexec, Paolo Bonzini, 2010/03/17
- [PATCH 8/9] grep: remove check_multibyte_string, fix non-UTF8 missed match, Paolo Bonzini, 2010/03/14
  - Re: [PATCH 8/9] grep: remove check_multibyte_string, fix non-UTF8 missed match, Jim Meyering <=
- [PATCH 9/9] grep: match multibyte charsets line-by-line when using -i, Paolo Bonzini, 2010/03/14
  - Re: [PATCH 9/9] grep: match multibyte charsets line-by-line when using -i, Jim Meyering, 2010/03/16
    - Re: [PATCH 9/9] grep: match multibyte charsets line-by-line when using -i, Paolo Bonzini, 2010/03/16
    - Re: [PATCH 9/9] grep: match multibyte charsets line-by-line when using -i, Jim Meyering, 2010/03/16
    - Re: [PATCH 9/9] grep: match multibyte charsets line-by-line when using -i, Paolo Bonzini, 2010/03/16

Prev by Date: Re: [PATCH 6/9] dfa: cache MB_CUR_MAX for dfaexec
Next by Date: Re: [PATCH] dfa: improve documentation of struct dfa
Previous by thread: [PATCH 8/9] grep: remove check_multibyte_string, fix non-UTF8 missed match
Next by thread: [PATCH 9/9] grep: match multibyte charsets line-by-line when using -i
Index(es):
- Date
- Thread