bug-grep
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: UTF-16 surrogate pair handling in grep -i option


From: Corinna Vinschen
Subject: Re: UTF-16 surrogate pair handling in grep -i option
Date: Mon, 19 Aug 2013 14:43:52 +0200
User-agent: Mutt/1.5.21 (2010-09-15)

Hi Jim,

On Aug 18 14:26, Jim Meyering wrote:
> Thank you for adjusting that.  I've added a test script and tweaked a
> few minor things.
> 
> I added curly braces around another 1-line else statement, adjusted the
> commit log to use the explanation from NEWS and added our usual
> ChangeLog-style entries.
> 
> In NEWS, I've added the statement that the bug was introduced in 2.6,
> but that is only a guess.  Can you tell me if that is true?

Well, it depends.  In 2.6, the then-new mbtolower function wouldn't have
done the right thing with surrogates, but it wouldn't have SEGV'ed,
either.  The SEGV is a result of introducing the call to

  memset (m + 1, 0, ombclen - 1);

Since ombclen is 0 after encountering the high surrogate, ombclen - 1
is -1, which, as a size_t value, tries to memset a lot of memory...

This memset call has been introduced with grep 2.13, so this is where
the problem started.

Btw., for a discussion why wcrtomb returns 0 for the high surrogate, and
some problems surrounding UTF-16 wchar_t handling in general, see the
thread starting at 
http://cygwin.com/ml/cygwin-developers/2009-09/msg00065.html

> If not,
> for which releases does grep segfault on cygwin?  Since you are still
> listed as the "Author" of this entire patch, please review my changes
> carefully.  This test passes for me, but I haven't tried it on cygwin,
> and from what you say, part of it will fail there.
> 
> Note that I've removed your Patch by: line. The Author: field of the git
> commit identifies the person who wrote the patch (you), so including
> "Patch by" would be redundant.  Signed-off-by is omitted for the same
> reason.
> 
> Does using the en_US.UTF-8 locale on cygwin provoke the failure when
> run against an unpatched grep?

Yes.  Any UTF-8 locale does it.  I just have some trouble to build
git master for testing due to some gnulib peculiarities:

- Building gnulib requires gperf now without configure testing if gperf
  is installed on the system.  Configure works, but make fails then.
  Shouldn't gperf be checked for in configure.ac?

- -Werror is default but:

    lib/openat-die.c: In function 'openat_restore_fail':
    lib/openat-die.c:34:1: warning: function might be candidate for attribute
    'noreturn' [-Wsuggest-attribute=noreturn]
     openat_save_fail (int errnum)
     ^
    lib/openat-die.c: In function 'openat_restore_fail':
    lib/openat-die.c:53:1: warning: function might be candidate for attribute 
'noreturn' [-Wsuggest-attribute=noreturn]
     openat_restore_fail (int errnum)
     ^
    lib/obstack.c: In function '_obstack_allocated_p':
    lib/obstack.c:319:1: warning: function might be candidate for attribute 
'pure' if it is known to return normally [-Wsuggest-attribute=pure]
     _obstack_allocated_p (struct obstack *h, void *obj)
     ^
    lib/obstack.c: In function '_obstack_memory_used':
    lib/obstack.c:378:1: warning: function might be candidate for attribute 
'pure' if it is known to return normally [-Wsuggest-attribute=pure]
     _obstack_memory_used (struct obstack *h)
     ^

    src/dfa.c: In function 'lex':
    src/dfa.c:1096:38: error: 'wc2' may be used uninitialized in this function 
[-Werror=maybe-uninitialized]
                     case_fold ? towlower (wc2) : (wchar_t) wc2;
                                          ^
    src/dfa.c:929:10: note: 'wc2' was declared here
       wint_t wc2;
              ^
    src/dfa.c:1082:14: error: 'c2' may be used uninitialized in this function 
[-Werror=maybe-uninitialized]
               if (c2 == '\\' && (syntax_bits & RE_BACKSLASH_ESCAPE_IN_LISTS))
                  ^
    src/dfa.c:918:14: note: 'c2' was declared here
       int c, c1, c2;
                  ^
    src/dfa.c:1167:22: error: 'wc' may be used uninitialized in this function 
[-Werror=maybe-uninitialized]
           if (!setbit_wc (wc, ccl))
                          ^
    src/dfa.c:928:10: note: 'wc' was declared here
       wint_t wc;
              ^
    src/dfa.c:958:6: error: 'c' may be used uninitialized in this function 
[-Werror=maybe-uninitialized]
       if (c == '^')
          ^
    src/dfa.c:918:7: note: 'c' was declared here
       int c, c1, c2;
           ^
    src/dfasearch.c: In function 'dfaerror':
    src/dfasearch.c:50:1: warning: function might be candidate for attribute 
'noreturn' [-Wsuggest-attribute=noreturn]
     dfaerror (char const *mesg)
     ^
    src/main.c: In function 'usage':
    src/main.c:1525:1: warning: function might be candidate for attribute 
'noreturn' [-Wsuggest-attribute=noreturn]
     usage (int status)
     ^

  This is with gcc 4.8.1 on x86_64-pc-cygwin.

- Even though I called bootstrap right before trying to build, the build
  of lib/fpending.c fails due the fpending problem reported by

Thanks for writing the testcase.  As you noted above, it fails in
some cases and works in others, namely, the -F and -iF cases work,
the other cases don't, since the regex compiler mishandles surrogates.

But, here's a question:  If the surrogate-pair test fails without the
patch due to the SEGV, and it also fails with the patch, just in a
different way, what's the idea of the testcase?  In theory, shouldn't
there be two tests, one of them testing only for this very SEGV, and
another test testing how grep handles 4 byte UTF-8 values, since that's
another problem entirely?


Thanks,
Corinna

-- 
Corinna Vinschen
Cygwin Maintainer
Red Hat

Attachment: pgpEWvDU0POrF.pgp
Description: PGP signature


reply via email to

[Prev in Thread] Current Thread [Next in Thread]