bug-grep

Re: [bug #36567] grep -i (case-insensitive) is broken with UTF8


From: Jim Meyering
Subject: Re: [bug #36567] grep -i (case-insensitive) is broken with UTF8
Date: Fri, 01 Jun 2012 22:40:16 +0200

Jim Meyering wrote:
...
> Here's a preliminary view of the fix for the old one
> that tests/turkish-I tested for:
>
> [still to do: address the FIXME in NEWS,
> and, I've just realized, to add bug URLs and reporter names to the log]
...
> Subject: [PATCH] fix how -i works with a match containing the Turkish
>  I-with-dot
>
> Fix a long-standing problem in the way grep's -i interacts with
> data whose byte count changes when we convert it to lower case.
> For example, the UTF-8 Turkish I-with-dot (İ) occupies two bytes,
> but its lower case analog, i, occupies just one byte.  The code
> converts both search string and the haystack data to lower case,
> and then searches for the modified string in the modified buffer.
> The trouble arose when using a lowercase buffer <offset,length>
> pair to manipulate the original (longer) buffer.
>
> The solution is to change mbtolower to return additional information:
> a malloc'd mapping vector M.  With that, the caller maps the
> lowercase-relative <offset,length> to numbers that refer to the
> original buffer.  This mapping is used only when lengths actually
> differ, so the cost in general should be small.
>
> * src/searchutils.c (mbtolower): Add the new map parameter.
> * src/search.h (mb_case_map_apply): New function.
> * src/kwsearch.c (Fexecute): Update mbtolower caller, and upon
> success, apply the new map.
> * src/dfasearch.c (EGexecute): Likewise.
> * tests/Makefile.am (XFAIL_TESTS): Remove turkish-I from this list;
> that test is no longer expected to fail.
> * NEWS (Bug fixes): Mention it.
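
The mapping idea in the log above can be sketched in isolation. What follows is a minimal illustration under stated assumptions, not the code from the patch: lower_with_map and map_apply are hypothetical names (they are not grep's mbtolower or mb_case_map_apply), the map layout is an assumption chosen for clarity, and only ASCII letters and the two-byte UTF-8 İ (0xC4 0xB0) are lower-cased, so no locale is required.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper, NOT grep's mbtolower.  Lower-case SRC (LEN bytes
   of UTF-8) into a malloc'd, NUL-terminated buffer.  Only ASCII letters
   and U+0130 (bytes 0xC4 0xB0, lower-cased here to ASCII 'i') change.
   *MAP receives a malloc'd vector with one entry per output byte plus a
   sentinel: (*map)[k] is the offset in SRC of the character that
   produced output byte k, and (*map)[*out_len] == len.  */
static char *
lower_with_map (const char *src, size_t len, size_t *out_len, size_t **map)
{
  char *dst = malloc (len + 1);
  size_t *m = malloc ((len + 1) * sizeof *m);
  size_t i = 0, o = 0;

  if (!dst || !m)
    {
      free (dst);
      free (m);
      return NULL;
    }

  while (i < len)
    {
      unsigned char c = src[i];
      m[o] = i;                 /* output byte o came from input offset i */
      if (c == 0xC4 && i + 1 < len && (unsigned char) src[i + 1] == 0xB0)
        {
          dst[o++] = 'i';       /* two input bytes shrink to one */
          i += 2;
        }
      else
        {
          dst[o++] = ('A' <= c && c <= 'Z') ? c - 'A' + 'a' : c;
          i += 1;
        }
    }

  assert (o <= len);            /* this sketch never grows the data */
  m[o] = len;                   /* sentinel: end of the original buffer */
  dst[o] = '\0';
  *out_len = o;
  *map = m;
  return dst;
}

/* Hypothetical helper: translate a match <*off,*len> found in the
   lower-cased buffer into the <offset,length> of the same text in the
   original buffer.  */
static void
map_apply (const size_t *map, size_t *off, size_t *len)
{
  size_t start = map[*off];
  size_t end = map[*off + *len];
  *off = start;
  *len = end - start;
}
```

For the 9-byte input "\xC4\xB0" "stanbul", the lower-cased buffer is the 8-byte "istanbul". A match at <0,8> in that buffer maps back to <0,9> in the original, so the whole line is covered instead of being cut one byte short.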

I've appended this to the log:

    Reported by Ilya Basin in
    http://thread.gmane.org/gmane.comp.gnu.grep.bugs/3413 and later
    by Strahinja Kustudic in http://savannah.gnu.org/bugs/?36567

...
> diff --git a/NEWS b/NEWS
> index 6926276..ebe0e2f 100644
> --- a/NEWS
> +++ b/NEWS
> @@ -4,6 +4,11 @@ GNU grep NEWS                                    -*- outline -*-
>
>  ** Bug fixes
>
> +  grep -i, in a multi-byte locale, when matching a line containing a character
> +  like the UTF-8 Turkish I-with-dot (İ) (whose lower-case representation
> +  occupies fewer bytes), would print an incomplete output line.
> +  [bug introduced in grep- FIXME]

I built all versions of grep that I had handy and ran the following,
which shows that the problem was introduced in grep-2.6:

$ for i in $(env ls -1vd /p/p/grep*); do echo -n "$i: "; LC_ALL=en_US.UTF-8 $i/bin/grep -i .... in|wc -c; done
/p/p/grep-2.0: 15
/p/p/grep-2.2: 15
/p/p/grep-2.3: 15
/p/p/grep-2.4: 15
/p/p/grep-2.4.1: 15
/p/p/grep-2.4.2: 15
/p/p/grep-2.5.1: 15
/p/p/grep-2.5.3: 15
/p/p/grep-2.5.4: 15
/p/p/grep-2.6: 8
/p/p/grep-2.6.1: 8
/p/p/grep-2.6.2: 8
/p/p/grep-2.6.3: 8
/p/p/grep-2.7: 8
/p/p/grep-2.8: 8
/p/p/grep-2.9: 8
/p/p/grep-2.10: 8
/p/p/grep-2.11: 8
/p/p/grep-2.12: 8


