[Top][All Lists]
[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
Re: [PATCH 8/9] grep: remove check_multibyte_string, fix non-UTF8 missed
From: |
Jim Meyering |
Subject: |
Re: [PATCH 8/9] grep: remove check_multibyte_string, fix non-UTF8 missed match |
Date: |
Wed, 17 Mar 2010 10:38:14 +0100 |
Paolo Bonzini wrote:
> Avoid computing ahead something that can be computed lazily as efficiently
> (or more efficiently in the case of UTF-8, though this is left as TODO).
> At the same time, "soften" the rejection condition for matching in the
> middle of a multibyte sequence to fix bug 23814.
>
> Multibyte "grep -i" is still very slow.
>
> * NEWS: Document bugfix.
> * src/search.c (check_multibyte_string): Rewrite as...
> (is_mb_middle): ... this.
> (EGexecute, Fexecute): Adjust.
> * tests/Makefile.am (TESTS): Add euc-mb.
> * tests/euc-mb: New testcase.
Nice fix, nice test.
Here are some stylistic suggestions:
> diff --git a/src/search.c b/src/search.c
> index 9d73abc..ce9f92c 100644
> --- a/src/search.c
> +++ b/src/search.c
> @@ -220,20 +220,22 @@ kwsmusts (void)
>
> #ifdef MBS_SUPPORT
>
> -static char*
> -check_multibyte_string(const char *buf, size_t size)
> +static bool
> +is_mb_middle(const char **good, const char *buf, const char *end)
> {
> - char *mb_properties = xcalloc(size, 1);
> + const char *p = *good, *prev = p;
Please declare only one variable per line.
IMHO, this is far more readable than the one-line version above:
const char *p = *good;
const char *prev = p;
I haven't said much about this yet because there are so many legacy violations.
This is a general rule discussed in the GCS, but it is especially
important when a declaration comes with an initialization.
> mbstate_t cur_state;
> - wchar_t wc;
> - int i;
>
> + /* TODO: can be optimized for UTF-8. */
> memset(&cur_state, 0, sizeof(mbstate_t));
> -
> - for (i = 0; i < size ;)
> + while (p < buf)
> {
> size_t mbclen;
> - mbclen = mbrtowc(&wc, buf + i, size - i, &cur_state);
> + mbclen = mbrlen(p, end - p, &cur_state);
Saving vertical space is good for readability, too.
Please merge declaration and first assignment when possible:
size_t mbclen = mbrlen(p, end - p, &cur_state);
> + /* Store the beginning of the previous complete multibyte character.
> */
> + if (mbclen != (size_t) -2)
> + prev = p;
>
> if (mbclen == (size_t) -1 || mbclen == (size_t) -2 || mbclen == 0)
> {
> @@ -241,11 +243,11 @@ check_multibyte_string(const char *buf, size_t size)
> We treat it as a single byte character. */
> mbclen = 1;
> }
> - mb_properties[i] = mbclen;
> - i += mbclen;
> + p += mbclen;
> }
>
> - return mb_properties;
> + *good = prev;
> + return p > buf;
...
> diff --git a/tests/euc-mb b/tests/euc-mb
> new file mode 100644
> index 0000000..a59c295
> --- /dev/null
> +++ b/tests/euc-mb
> @@ -0,0 +1,23 @@
> +#!/bin/sh
> +# test that matches starting in the middle of a multibyte char aren't
> rejected
> +# too greedily.
> +# Derived from https://savannah.gnu.org/bugs/?23814
> +: ${srcdir=.}
> +. "$srcdir/init.sh"; path_prepend_ ../src
> +
> +make_input () {
> + echo "$1" | tr AB '\244\263'
> +}
> +
> +euc_grep () {
> + LC_ALL=ja_JP.EUC-JP grep `make_input "$1"`
I'm a little nervous about using an unquoted multi-byte argument to grep.
How about adding quotes?
pat=$(make_input "$1")
LC_ALL=ja_JP.EUC-JP grep "$pat"
> +}
> +
> +if make_input BABA |euc_grep AB ; then
> + skip_ 'EUC-JP locale seems not to work'
> +fi
>
> +make_input BABAAB |euc_grep AB || \
> + fail_ 'whole line rejected after matching in the middle of a multibyte
> char'
> +
> +exit 0
It's slightly more portable to finish with "Exit 0".
- Re: [PATCH 4/9] dfa: speed up handling of brackets, (continued)
- [PATCH 5/9] dfa: optimize simple character sets under UTF-8 charsets, Paolo Bonzini, 2010/03/14
- [PATCH 7/9] dfa: run simple UTF-8 regexps as a single-byte character set, Paolo Bonzini, 2010/03/14
- [PATCH 6/9] dfa: cache MB_CUR_MAX for dfaexec, Paolo Bonzini, 2010/03/14
- [PATCH 8/9] grep: remove check_multibyte_string, fix non-UTF8 missed match, Paolo Bonzini, 2010/03/14
- Re: [PATCH 8/9] grep: remove check_multibyte_string, fix non-UTF8 missed match,
Jim Meyering <=
- [PATCH 9/9] grep: match multibyte charsets line-by-line when using -i, Paolo Bonzini, 2010/03/14