Re: [PATCH] grep -i: work also when converting to lower-case inflates by

bug-grep

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [PATCH] grep -i: work also when converting to lower-case inflates by

From:	Jim Meyering
Subject:	Re: [PATCH] grep -i: work also when converting to lower-case inflates byte count
Date:	Sat, 16 Jun 2012 18:23:54 +0200

Paul Eggert wrote:
> On 06/16/2012 05:34 AM, Jim Meyering wrote:
>> -  unsigned char *map = NULL;
>> +  char *map = NULL;
>
> This isn't portable, as 'char' is unsigned on some platforms.
> Changing it to 'signed char' or to 'int_least8_t'
> should fix the problem.  Similarly elsewhere.

Thanks!  Good catch.
Also, the code should work as well with gcc's -funsigned-char as without.
I tested with that, to ensure that the tests pass:

    make CC='gcc -std=gnu99 -funsigned-char' CFLAGS=-ggdb3 WERROR_CFLAGS=

Now they do, before this adjustment, they turkish-I-without-dot failed.

Here's the incremental, followed by updated complete patch
and updated log:

(Yeah, I know that *_t names are officially "reserved",
but search.h is strictly internal)

diff --git a/src/dfasearch.c b/src/dfasearch.c
index 358e5b5..ab30cd5 100644
--- a/src/dfasearch.c
+++ b/src/dfasearch.c
@@ -78,7 +78,7 @@ static char const *
 kwsincr_case (const char *must)
 {
   size_t n = strlen (must);
-  char *map = NULL;
+  mb_len_map_t *map = NULL;
   const char *buf = (match_icase && MB_CUR_MAX > 1
                      ? mbtolower (must, &n, &map)
                      : must);
@@ -218,7 +218,7 @@ EGexecute (char const *buf, size_t size, size_t *match_size,
   ptrdiff_t len, best_len;
   struct kwsmatch kwsm;
   size_t i, ret_val;
-  char *map = NULL;
+  mb_len_map_t *map = NULL;

   if (MB_CUR_MAX > 1)
     {
diff --git a/src/kwsearch.c b/src/kwsearch.c
index c39fca3..b56b465 100644
--- a/src/kwsearch.c
+++ b/src/kwsearch.c
@@ -34,7 +34,7 @@ Fcompile (char const *pattern, size_t size)
 {
   char const *err;
   size_t psize = size;
-  char *map = NULL;
+  mb_len_map_t *map = NULL;
   char const *pat = (match_icase && MB_CUR_MAX > 1
                      ? mbtolower (pattern, &psize, &map)
                      : pattern);
@@ -84,7 +84,7 @@ Fexecute (char const *buf, size_t size, size_t *match_size,
   char eol = eolbyte;
   struct kwsmatch kwsmatch;
   size_t ret_val;
-  char *map = NULL;
+  mb_len_map_t *map = NULL;

   if (MB_CUR_MAX > 1)
     {
diff --git a/src/search.h b/src/search.h
index 3a03535..3820d7a 100644
--- a/src/search.h
+++ b/src/search.h
@@ -36,10 +36,16 @@
 #include "kwset.h"
 #include "xalloc.h"

+/* This must be a signed type.  Each value is the difference in the size
+   of a character (in bytes) induced by converting to lower case.
+   The vast majority of values are 0, but a few are 1 or -1, so
+   technically, two bits may be sufficient.  */
+typedef signed char mb_len_map_t;
+
 /* searchutils.c */
 extern void kwsinit (kwset_t *);

-extern char *mbtolower (const char *, size_t *, char **);
+extern char *mbtolower (const char *, size_t *, mb_len_map_t **);
 extern bool is_mb_middle (const char **, const char *, const char *, size_t);

 /* dfasearch.c */
@@ -57,7 +63,7 @@ extern size_t Pexecute (char const *, size_t, size_t *, char 
const *);
 /* Apply the MAP (created by mbtolower) to the lowercase-buffer-relative
    *OFF and *LEN, converting them to be relative to the original buffer.  */
 static inline void
-mb_case_map_apply (char const *map, size_t *off, size_t *len)
+mb_case_map_apply (mb_len_map_t const *map, size_t *off, size_t *len)
 {
   if (map)
     {
diff --git a/src/searchutils.c b/src/searchutils.c
index b906ba5..ca30134 100644
--- a/src/searchutils.c
+++ b/src/searchutils.c
@@ -74,16 +74,16 @@ kwsinit (kwset_t *kwset)
    part of the original buffer.  */

 char *
-mbtolower (const char *beg, size_t *n, char **len_map_p)
+mbtolower (const char *beg, size_t *n, mb_len_map_t **len_map_p)
 {
   static char *out;
-  static char *len_map;
+  static mb_len_map_t *len_map;
   static size_t outalloc;
   size_t outlen, mb_cur_max;
   mbstate_t is, os;
   const char *end;
   char *p;
-  char *m;
+  mb_len_map_t *m;
   bool lengths_differ = false;

   if (*n > outalloc || outalloc == 0)


>From 2be0c6591ea5d56ee7fbc364228f24b85f569a5c Mon Sep 17 00:00:00 2001
From: Jim Meyering <address@hidden>
Date: Sat, 16 Jun 2012 12:32:35 +0200
Subject: [PATCH] grep -i: work also when converting to lower-case inflates
 byte count

Commit v2.12-16-g7aa698d addressed the case in which the lower-case
representation of an input byte occupies fewer bytes than the original.
However, even with commit v2.12-20-g074842d, grep -i would still
misbehave when converting a character to lower-case increased its
byte count.  The map-manipulation code assumed that the case conversion
could only shrink the byte count.  With the consideration that it may
also inflate it, the deltas recorded in the map array must be signed,
and we must account for the one-to-two-or-more mapping when the
original-to-lower-case conversion causes the byte count to increase.
* src/searchutils.c (mbtolower): When a lower-case character occupies
more than one byte, set its remaining map slots to zero.  Change the
type of the map to be signed, and compute the change in character
byte count as new_length - old_length.
* src/search.h: Include <stdint.h>, for decl of intmax_t.
(mb_case_map_apply): Adjust for signed increments:
each map entry is now signed.
(mb_len_map_t): Define type.  Thanks to Paul Eggert for noticing
in review that using a bare "char" as the base type would be wrong on
systems for which it is a signed type (as with gcc's -funsigned-char).
* src/kwsearch.c (Fcompile, Fexecute): Likewise.
* src/dfasearch.c (kwsincr_case, EGexecute): Likewise.
* tests/turkish-I-without-dot: New test.  Thanks to Paolo Bonzini
for the tip that in the tr_TR.utf8 locale, mapping "I" to lower case
increases the character's byte count.
* tests/Makefile.am (TESTS): Add it.
* tests/init.cfg (require_tr_utf8_locale_): New function.
* NEWS (Bug fixes): Expand the existing entry.
---
 NEWS                        |  3 +++
 src/dfasearch.c             |  4 ++--
 src/kwsearch.c              |  4 ++--
 src/search.h                | 17 +++++++++++-----
 src/searchutils.c           | 41 ++++++++++++++++++++++++---------------
 tests/Makefile.am           |  1 +
 tests/init.cfg              |  9 +++++++++
 tests/turkish-I-without-dot | 47 +++++++++++++++++++++++++++++++++++++++++++++
 8 files changed, 102 insertions(+), 24 deletions(-)
 create mode 100755 tests/turkish-I-without-dot

diff --git a/NEWS b/NEWS
index d0ea60a..499d18d 100644
--- a/NEWS
+++ b/NEWS
@@ -7,6 +7,9 @@ GNU grep NEWS                                    -*- outline -*-
   grep -i, in a multi-byte locale, when matching a line containing a character
   like the UTF-8 Turkish I-with-dot (U+0130) (whose lower-case representation
   occupies fewer bytes), would print an incomplete output line.
+  Similarly, with a matched line containing a character (e.g., the Latin
+  capital I in a Turkish UTF-8 locale), where the lower-case representation
+  occupies more bytes, grep could print garbage.
   [bug introduced in grep-2.6]

   --include and --exclude can again be combined, and again apply to
diff --git a/src/dfasearch.c b/src/dfasearch.c
index a48333d..ab30cd5 100644
--- a/src/dfasearch.c
+++ b/src/dfasearch.c
@@ -78,7 +78,7 @@ static char const *
 kwsincr_case (const char *must)
 {
   size_t n = strlen (must);
-  unsigned char *map = NULL;
+  mb_len_map_t *map = NULL;
   const char *buf = (match_icase && MB_CUR_MAX > 1
                      ? mbtolower (must, &n, &map)
                      : must);
@@ -218,7 +218,7 @@ EGexecute (char const *buf, size_t size, size_t *match_size,
   ptrdiff_t len, best_len;
   struct kwsmatch kwsm;
   size_t i, ret_val;
-  unsigned char *map = NULL;
+  mb_len_map_t *map = NULL;

   if (MB_CUR_MAX > 1)
     {
diff --git a/src/kwsearch.c b/src/kwsearch.c
index d0bb201..b56b465 100644
--- a/src/kwsearch.c
+++ b/src/kwsearch.c
@@ -34,7 +34,7 @@ Fcompile (char const *pattern, size_t size)
 {
   char const *err;
   size_t psize = size;
-  unsigned char *map = NULL;
+  mb_len_map_t *map = NULL;
   char const *pat = (match_icase && MB_CUR_MAX > 1
                      ? mbtolower (pattern, &psize, &map)
                      : pattern);
@@ -84,7 +84,7 @@ Fexecute (char const *buf, size_t size, size_t *match_size,
   char eol = eolbyte;
   struct kwsmatch kwsmatch;
   size_t ret_val;
-  unsigned char *map = NULL;
+  mb_len_map_t *map = NULL;

   if (MB_CUR_MAX > 1)
     {
diff --git a/src/search.h b/src/search.h
index 529e7e2..3820d7a 100644
--- a/src/search.h
+++ b/src/search.h
@@ -22,6 +22,7 @@
 #include <config.h>

 #include <sys/types.h>
+#include <stdint.h>

 #include "mbsupport.h"

@@ -35,10 +36,16 @@
 #include "kwset.h"
 #include "xalloc.h"

+/* This must be a signed type.  Each value is the difference in the size
+   of a character (in bytes) induced by converting to lower case.
+   The vast majority of values are 0, but a few are 1 or -1, so
+   technically, two bits may be sufficient.  */
+typedef signed char mb_len_map_t;
+
 /* searchutils.c */
 extern void kwsinit (kwset_t *);

-extern char *mbtolower (const char *, size_t *, unsigned char **);
+extern char *mbtolower (const char *, size_t *, mb_len_map_t **);
 extern bool is_mb_middle (const char **, const char *, const char *, size_t);

 /* dfasearch.c */
@@ -53,15 +60,15 @@ extern size_t Fexecute (char const *, size_t, size_t *, 
char const *);
 extern void Pcompile (char const *, size_t);
 extern size_t Pexecute (char const *, size_t, size_t *, char const *);

-/* Apply a non-NULL MAP from mbtolower to the lowercase-buffer-relative
+/* Apply the MAP (created by mbtolower) to the lowercase-buffer-relative
    *OFF and *LEN, converting them to be relative to the original buffer.  */
 static inline void
-mb_case_map_apply (unsigned char const *map, size_t *off, size_t *len)
+mb_case_map_apply (mb_len_map_t const *map, size_t *off, size_t *len)
 {
   if (map)
     {
-      size_t off_incr = 0;
-      size_t len_incr = 0;
+      intmax_t off_incr = 0;
+      intmax_t len_incr = 0;
       size_t k;
       for (k = 0; k < *off; k++)
         off_incr += map[k];
diff --git a/src/searchutils.c b/src/searchutils.c
index c1fb656..ca30134 100644
--- a/src/searchutils.c
+++ b/src/searchutils.c
@@ -43,10 +43,10 @@ kwsinit (kwset_t *kwset)
 }

 #if MBS_SUPPORT
-/* Convert the *N-byte string, BEG, to lowercase, and write the
+/* Convert the *N-byte string, BEG, to lower-case, and write the
    NUL-terminated result into malloc'd storage.  Upon success, set *N
    to the length (in bytes) of the resulting string (not including the
-   trailing NUL byte), and return a pointer to the lowercase string.
+   trailing NUL byte), and return a pointer to the lower-case string.
    Upon memory allocation failure, this function exits.
    Note that on input, *N must be larger than zero.

@@ -55,26 +55,35 @@ kwsinit (kwset_t *kwset)
    to the buffer and reuses it on any subsequent call.  As a consequence,
    this function is not thread-safe.

-   When all the characters in the lowercase result string have the
-   same length as corresponding characters in the input string,
-   set *LEN_MAP_P to NULL.  Otherwise, set it to a malloc'd buffer (like the
-   returned buffer, this must not be freed by caller) of the same length as
-   the result string.  (*LEN_MAP_P)[J] is one less than the length-in-bytes
-   of the character in BEG that formed byte J of the result.  This map is
-   used by the caller to convert offset,length pairs that reference the
-   lowercase result to numbers that refer to the corresponding parts of
-   the original buffer.  */
+   When each character in the lower-case result string has the same length
+   as the corresponding character in the input string, set *LEN_MAP_P
+   to NULL.  Otherwise, set it to a malloc'd buffer (like the returned
+   buffer, this must not be freed by caller) of the same length as the
+   result string.  (*LEN_MAP_P)[J] is the change in byte-length of the
+   character in BEG that formed byte J of the result as it was converted to
+   lower-case.  It is usually zero.  For the upper-case Turkish I-with-dot
+   it is -1, since the upper-case character occupies two bytes, while the
+   lower-case one occupies only one byte.  For the Turkish-I-without-dot
+   in the tr_TR.utf8 locale, it is 1 because the lower-case representation
+   is one byte longer than the original.  When that happens, we have two
+   or more slots in *LEN_MAP_P for each such character.  We store the
+   difference in the first one and 0's in any remaining slots.
+
+   This map is used by the caller to convert offset,length pairs that
+   reference the lower-case result to numbers that refer to the matched
+   part of the original buffer.  */
+
 char *
-mbtolower (const char *beg, size_t *n, unsigned char **len_map_p)
+mbtolower (const char *beg, size_t *n, mb_len_map_t **len_map_p)
 {
   static char *out;
-  static unsigned char *len_map;
+  static mb_len_map_t *len_map;
   static size_t outalloc;
   size_t outlen, mb_cur_max;
   mbstate_t is, os;
   const char *end;
   char *p;
-  unsigned char *m;
+  mb_len_map_t *m;
   bool lengths_differ = false;

   if (*n > outalloc || outalloc == 0)
@@ -123,9 +132,11 @@ mbtolower (const char *beg, size_t *n, unsigned char 
**len_map_p)
         }
       else
         {
-          *m++ = mbclen - 1;
           beg += mbclen;
           size_t ombclen = wcrtomb (p, towlower ((wint_t) wc), &os);
+          *m = mbclen - ombclen;
+          memset (m + 1, 0, ombclen - 1);
+          m += ombclen;
           p += ombclen;
           outlen += ombclen;
           lengths_differ |= (mbclen != ombclen);
diff --git a/tests/Makefile.am b/tests/Makefile.am
index 388d48a..7d95862 100644
--- a/tests/Makefile.am
+++ b/tests/Makefile.am
@@ -86,6 +86,7 @@ TESTS =                                               \
   status                                       \
   symlink                                      \
   turkish-I                                    \
+  turkish-I-without-dot                                \
   warn-char-classes                            \
   word-delim-multibyte                         \
   word-multi-file                              \
diff --git a/tests/init.cfg b/tests/init.cfg
index 408400c..3b6242f 100644
--- a/tests/init.cfg
+++ b/tests/init.cfg
@@ -62,6 +62,15 @@ require_en_utf8_locale_()
   esac
 }

+require_tr_utf8_locale_()
+{
+  path_prepend_ .
+  case $(get-mb-cur-max tr_TR.UTF-8) in
+    [3456]) ;;
+    *) skip_ 'en_US.UTF-8 locale not found' ;;
+  esac
+}
+
 require_ru_RU_koi8_r()
 {
   path_prepend_ .
diff --git a/tests/turkish-I-without-dot b/tests/turkish-I-without-dot
new file mode 100755
index 0000000..cab4ede
--- /dev/null
+++ b/tests/turkish-I-without-dot
@@ -0,0 +1,47 @@
+#!/bin/sh
+# grep -i would misbehave for any matched line containing a character
+# (like "I" in the tr_TR.utf8 locale) whose lower-case representation
+# occupies more bytes (two in this case, for 0xc4b1, aka U+0131).
+
+# Copyright (C) 2011-2012 Free Software Foundation, Inc.
+
+# This program is free software: you can redistribute it and/or modify
+# it under the terms of the GNU General Public License as published by
+# the Free Software Foundation, either version 3 of the License, or
+# (at your option) any later version.
+
+# This program is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+# GNU General Public License for more details.
+
+# You should have received a copy of the GNU General Public License
+# along with this program.  If not, see <http://www.gnu.org/licenses/>.
+
+. "${srcdir=.}/init.sh"; path_prepend_ ../src
+
+require_tr_utf8_locale_
+require_compiled_in_MB_support
+
+# Before this change, grep could print a lot of uninitialized memory:
+# $ printf "IIIIIII\n" > in
+# $ for i in $(seq 10); do LC_ALL=tr_TR.utf8 src/grep -i . in|wc -c; done
+# 760
+# 754
+# 585
+# 298
+# 273
+# 458
+# 660
+# 552
+# 936
+# 678
+
+fail=0
+
+printf "IIIIIII\n" > in || framework_failure_
+LC_ALL=tr_TR.utf8 grep -i .... in > out || fail=1
+
+compare out in || fail=1
+
+Exit $fail
--
1.7.11.rc3.11.g7dba3f7

[Prev in Thread]

Current Thread

[Next in Thread]

[bug #36567] grep -i (case-insensitive) is broken with UTF8, Strahinja Kustudic, 2012/06/01
- [bug #36567] grep -i (case-insensitive) is broken with UTF8, Paul Eggert, 2012/06/12
  - [bug #36567] grep -i (case-insensitive) is broken with UTF8, Paul Eggert, 2012/06/12
    - [bug #36567] grep -i (case-insensitive) is broken with UTF8, Paul Eggert, 2012/06/12
  - Re: [bug #36567] grep -i (case-insensitive) is broken with UTF8, Jim Meyering, 2012/06/12
    - Re: [bug #36567] grep -i (case-insensitive) is broken with UTF8, Paul Eggert, 2012/06/12
    - Re: [bug #36567] grep -i (case-insensitive) is broken with UTF8, Paolo Bonzini, 2012/06/15
    - [PATCH] grep -i: work also when converting to lower-case inflates byte count, Jim Meyering, 2012/06/16
    - Re: [PATCH] grep -i: work also when converting to lower-case inflates byte count, Paul Eggert, 2012/06/16
    - Re: [PATCH] grep -i: work also when converting to lower-case inflates byte count, Jim Meyering <=
    - Re: [PATCH] grep -i: work also when converting to lower-case inflates byte count, Paolo Bonzini, 2012/06/23
- Re: [bug #36567] grep -i (case-insensitive) is broken with UTF8, Jim Meyering, 2012/06/01
  - Re: [bug #36567] grep -i (case-insensitive) is broken with UTF8, Jim Meyering, 2012/06/01
    - Re: [bug #36567] grep -i (case-insensitive) is broken with UTF8, Paul Eggert, 2012/06/01
    - Re: [bug #36567] grep -i (case-insensitive) is broken with UTF8, Jim Meyering, 2012/06/01
    - Re: [bug #36567] grep -i (case-insensitive) is broken with UTF8, Paul Eggert, 2012/06/01
    - Re: [bug #36567] grep -i (case-insensitive) is broken with UTF8, Jim Meyering, 2012/06/01
    - Re: [bug #36567] grep -i (case-insensitive) is broken with UTF8, Jim Meyering, 2012/06/01
    - Re: [bug #36567] grep -i (case-insensitive) is broken with UTF8, Jim Meyering, 2012/06/02
  - Re: [bug #36567] grep -i (case-insensitive) is broken with UTF8, Johannes Meixner, 2012/06/12

Prev by Date: Re: [PATCH] grep -i: work also when converting to lower-case inflates byte count
Next by Date: [bug #36682] Ignore case handling of special unicode characters (case folding)
Previous by thread: Re: [PATCH] grep -i: work also when converting to lower-case inflates byte count
Next by thread: Re: [PATCH] grep -i: work also when converting to lower-case inflates byte count
Index(es):
- Date
- Thread