
bug#16581: suggested code simplification in dfa.c


From: Aaron Crane
Subject: bug#16581: suggested code simplification in dfa.c
Date: Wed, 29 Jan 2014 14:20:10 +0000

Paul Eggert <address@hidden> wrote:
> +/* The following functions exploit the commutativity and associativity of ^,
> +   and the fact that X ^ X is zero.  POSIX requires that C equals
> +   either tolower (C) or toupper (C); if the former, then C ^ tolower (C)
> +   is zero so C ^ xor_other (C) equals toupper (C), and similarly
> +   for the latter.  */
> +
> +/* Return the exclusive-OR of C and C's other case, or zero if C is
> +   not a letter that changes case.  */
> +
> +static wint_t
> +xor_wother (wint_t c)
> +{
> +  return towlower (c) ^ towupper (c);
> +}
[…]
> +      if (case_fold)
>          {
> +          wchar_t xor = xor_wother (wc);
> +          if (xor)
> +            {
> +              addtok_wc (wc ^ xor);
> +              addtok (OR);
> +            }

I don't think this works in the wide-character case, because a
titlecase letter is neither its own lowercase nor its own uppercase
form. For example, in a suitable locale, I'd expect U+01C8 LATIN
CAPITAL LETTER L WITH SMALL LETTER J ("Lj", roughly) to map to U+01C7
LATIN CAPITAL LETTER LJ ("LJ") under towupper(), and to U+01C9 LATIN
SMALL LETTER LJ ("lj") under towlower(). This matches the behaviour I
observe with a simple test program under the en_GB.UTF-8 locale on
both Linux and Mac OS.

Since 0x1c7 ^ 0x1c9 == 14, and 0x1c8 ^ 14 == 0x1c6, this means we'd
call addtok_wc(0x1c6). U+01C6 is LATIN SMALL LETTER DZ WITH CARON,
which is not a case variant of U+01C8 at all.
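
The arithmetic above can be checked with a small sketch. The helper
names below (xor_of_mappings, xor_candidate, is_lower_or_upper_form)
are hypothetical and are not part of dfa.c; the first two take the
case mappings as explicit arguments so that the result does not depend
on the locale in effect, and the third shows the precondition the XOR
trick appears to rely on.

```c
#include <stdbool.h>
#include <wchar.h>
#include <wctype.h>

/* XOR of the two case mappings of a character, mirroring what the
   quoted xor_wother computes from towlower and towupper.  */
static wint_t
xor_of_mappings (wint_t lower, wint_t upper)
{
  return lower ^ upper;
}

/* The character the quoted patch would pass to addtok_wc for C,
   given C's lowercase and uppercase mappings.  */
static wint_t
xor_candidate (wint_t c, wint_t lower, wint_t upper)
{
  return c ^ xor_of_mappings (lower, upper);
}

/* True if C equals its own lowercase or uppercase form in the
   current locale.  The XOR shortcut yields C's other case only
   when this holds; a titlecase letter such as U+01C8 is expected
   to fail it in a locale that maps it both ways.  */
static bool
is_lower_or_upper_form (wint_t c)
{
  return c == towlower (c) || c == towupper (c);
}
```

With the mappings described above, xor_candidate (0x1c8, 0x1c9, 0x1c7)
gives 0x1c6, whereas starting from the uppercase form,
xor_candidate (0x1c7, 0x1c9, 0x1c7) gives 0x1c9 as intended. One
possible guard, under these assumptions, would be to fall back to
adding towlower (c) and towupper (c) separately whenever
is_lower_or_upper_form (c) is false.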

-- 
Aaron Crane ** http://aaroncrane.co.uk/




