From: Jim Meyering
Subject: Re: [bug #36567] grep -i (case-insensitive) is broken with UTF8
Date: Fri, 01 Jun 2012 22:40:16 +0200

Jim Meyering wrote:
...
> Here's a preliminary view of the fix for the old one
> that tests/turkish-I tested for:
>
> [still to do: address the FIXME in NEWS,
> and, I've just realized, to add bug URLs and reporter names to the log]
...
> Subject: [PATCH] fix how -i works with a match containing the Turkish
> I-with-dot
>
> Fix a long-standing problem in the way grep's -i interacts with
> data whose byte count changes when we convert it to lower case.
> For example, the UTF-8 Turkish I-with-dot (İ) occupies two bytes,
> but its lower case analog, i, occupies just one byte. The code
> converts both search string and the haystack data to lower case,
> and then searches for the modified string in the modified buffer.
> The trouble arose when using a lowercase buffer <offset,length>
> pair to manipulate the original (longer) buffer.
>
> The solution is to change mbtolower to return additional information:
> a malloc'd mapping vector M. With that, the caller maps the
> lowercase-relative <offset,length> to numbers that refer to the
> original buffer. This mapping is used only when lengths actually
> differ, so the cost in general should be small.
>
> * src/searchutils.c (mbtolower): Add the new map parameter.
> * src/search.h (mb_case_map_apply): New function.
> * src/kwsearch.c (Fexecute): Update mbtolower caller, and upon
> success, apply the new map.
> * src/dfasearch.c (EGexecute): Likewise.
> * tests/Makefile.am (XFAIL_TESTS): Remove turkish-I from this list;
> that test is no longer expected to fail.
> * NEWS (Bug fixes): Mention it.
I've appended this to the log:
Reported by Ilya Basin in
http://thread.gmane.org/gmane.comp.gnu.grep.bugs/3413 and later
by Strahinja Kustudic in http://savannah.gnu.org/bugs/?36567
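
The offset-mapping idea from the commit message can be illustrated with a small, self-contained sketch. This is not the code from the patch: `toy_lower_with_map` and `toy_map_apply` are hypothetical names, the map layout (one original-buffer offset per lowercase byte, plus a final end entry) is an assumption made for clarity, and the lowercasing handles only ASCII plus the two-byte sequence 0xC4 0xB0 (U+0130, İ) rather than consulting the locale as grep does.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Toy lowercasing (hypothetical helper, not grep's mbtolower).
   Handles ASCII letters and the single two-byte UTF-8 sequence
   0xC4 0xB0 (U+0130, Turkish I-with-dot), which becomes 'i'.
   All other bytes are copied unchanged.

   Returns a malloc'd, NUL-terminated lowercase copy of BUF[0..LEN)
   and stores its length in *OUT_LEN.  *MAP receives a malloc'd vector
   of *OUT_LEN + 1 entries: (*MAP)[k] is the offset in BUF of the
   character that produced lowercase byte k; the last entry is LEN.
   Returns NULL on allocation failure.  */
static char *
toy_lower_with_map (const char *buf, size_t len,
                    size_t *out_len, size_t **map)
{
  /* The lowercase copy is never longer than the original here.  */
  char *out = malloc (len + 1);
  size_t *m = malloc ((len + 1) * sizeof *m);
  if (!out || !m)
    {
      free (out);
      free (m);
      return NULL;
    }

  size_t i = 0, o = 0;
  while (i < len)
    {
      unsigned char c = (unsigned char) buf[i];
      m[o] = i;                 /* lowercase byte o came from offset i */
      if (c == 0xC4 && i + 1 < len && (unsigned char) buf[i + 1] == 0xB0)
        {
          out[o++] = 'i';       /* two bytes shrink to one */
          i += 2;
        }
      else
        {
          out[o++] = (char) (('A' <= c && c <= 'Z') ? c - 'A' + 'a' : c);
          i++;
        }
    }
  m[o] = len;                   /* end sentinel */
  out[o] = '\0';
  *out_len = o;
  *map = m;
  return out;
}

/* Convert a match <*OFF, *LEN> found in the lowercase buffer into the
   corresponding <offset, length> in the original buffer.  */
static void
toy_map_apply (const size_t *map, size_t *off, size_t *len)
{
  size_t start = map[*off];
  size_t end = map[*off + *len];
  *off = start;
  *len = end - start;
}
```

With the four-byte input "A", 0xC4, 0xB0, "Z", the lowercase copy is the three-byte "aiz". A match of "iz" at lowercase <1,2> maps back to <1,3> in the original, which covers the full two-byte İ. Using the unmapped <1,2> on the original buffer would cut the match short, which is the kind of truncated output the bug report describes.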
...
> diff --git a/NEWS b/NEWS
> index 6926276..ebe0e2f 100644
> --- a/NEWS
> +++ b/NEWS
> @@ -4,6 +4,11 @@ GNU grep NEWS -*- outline -*-
>
> ** Bug fixes
>
> + grep -i, in a multi-byte locale, when matching a line containing a character
> + like the UTF-8 Turkish I-with-dot (İ) (whose lower-case representation
> + occupies fewer bytes), would print an incomplete output line.
> + [bug introduced in grep- FIXME]
I built all versions of grep that I had handy, and ran the following
to conclude that the problem was introduced in grep-2.6:
$ for i in $(env ls -1vd /p/p/grep*); do echo -n "$i: "; LC_ALL=en_US.UTF-8 $i/bin/grep -i .... in|wc -c; done
/p/p/grep-2.0: 15
/p/p/grep-2.2: 15
/p/p/grep-2.3: 15
/p/p/grep-2.4: 15
/p/p/grep-2.4.1: 15
/p/p/grep-2.4.2: 15
/p/p/grep-2.5.1: 15
/p/p/grep-2.5.3: 15
/p/p/grep-2.5.4: 15
/p/p/grep-2.6: 8
/p/p/grep-2.6.1: 8
/p/p/grep-2.6.2: 8
/p/p/grep-2.6.3: 8
/p/p/grep-2.7: 8
/p/p/grep-2.8: 8
/p/p/grep-2.9: 8
/p/p/grep-2.10: 8
/p/p/grep-2.11: 8
/p/p/grep-2.12: 8