Re: [PATCH] mbrtowc: remove a redundant condition

bug-gnulib

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [PATCH] mbrtowc: remove a redundant condition

From:	Bruno Haible
Subject:	Re: [PATCH] mbrtowc: remove a redundant condition
Date:	Mon, 22 Mar 2021 13:49:04 +0100
User-agent:	KMail/5.1.3 (Linux/4.4.0-203-generic; KDE/5.18.0; x86_64; ; )

Hi Benno,

> By staring at the code and trying to understand whether I am missing
> something, why that c == 0xf4 is there.

I see. For consistency with the other occurrences of this idiom

lib/unistr/u8-check.c:66:                  && (c < 0xf4 || (c == 0xf4 && s[1] < 
0x90)))
lib/unistr/u8-mblen.c:56:                  && (c < 0xf4 || (c == 0xf4 && s[1] < 
0x90)))
lib/unistr/u8-mbtouc-aux.c:98:                              && (c < 0xf4 || (c 
== 0xf4 && s[1] < 0x90)))
lib/unistr/u8-mbtouc-unsafe-aux.c:98:                              && (c < 0xf4 
|| (c == 0xf4 && s[1] < 0x90))
lib/unistr/u8-mbtouc-unsafe.c:109:                              && (c < 0xf4 || 
(c == 0xf4 && s[1] < 0x90))
lib/unistr/u8-mbtouc.c:108:                              && (c < 0xf4 || (c == 
0xf4 && s[1] < 0x90)))
lib/unistr/u8-mbtoucr.c:95:                  && (c < 0xf4 || (c == 0xf4 && s[1] 
< 0x90)))
lib/unistr/u8-prev.c:68:                            && (c_4 < 0xf4 || (c_4 == 
0xf4 && c_3 < 0x90)))
lib/unistr/u8-strmblen.c:51:              && (c < 0xf4 || (c == 0xf4 && s[1] < 
0x90)))
lib/unistr/u8-strmbtouc.c:63:              && (c < 0xf4 || (c == 0xf4 && s[1] < 
0x90)))

I prefer to keep the test in, but commented out. I apply the patch below.

> I was looking at the code to see how conversion from multibyte to
> wide character is done, whether it's worth it to give nano its own
> copy.  There is quite some overhead at the beginning, and nano doesn't
> need to do the checks on m.  Also, I want nano's version to return -1
> for codes beyond U+10FFFF.  The mbtowc from glibc returns 4 for codes
> from U+110000 to U+1FFFFF, but those are not valid UTF-8 sequences,

Indeed, it's registered as a glibc bug:
  https://sourceware.org/bugzilla/show_bug.cgi?id=2373
  https://sourceware.org/bugzilla/show_bug.cgi?id=26034

> so nano would need to do another check afterward.

Yup. In GNU gettext, I've had abort()s due to the fact that some multibyte-to-
Unicode conversion allowed characters up to U+7FFFFFFF and a subsequent check
for a valid Unicode character (<= U+0010FFFF) failed.

You're right, this is worth documenting.

Bruno


2021-03-22  Benno Schulenberg  <bensberg@telfort.nl>  (tiny change)

        mbrtowc: Remove a redundant condition.
        * lib/mbrtowc-impl-utf8.h: There is no need to check for c == 0xf4
        when !(c < 0xf4), as ten lines earlier c <= 0xf4 was established.

diff --git a/lib/mbrtowc-impl-utf8.h b/lib/mbrtowc-impl-utf8.h
index 4f3bbb3..58006d3 100644
--- a/lib/mbrtowc-impl-utf8.h
+++ b/lib/mbrtowc-impl-utf8.h
@@ -96,7 +96,7 @@
 
                     if ((c2 ^ 0x80) < 0x40
                         && (c >= 0xf1 || c2 >= 0x90)
-                        && (c < 0xf4 || (c == 0xf4 && c2 < 0x90)))
+                        && (c < 0xf4 || (/* c == 0xf4 && */ c2 < 0x90)))
                       {
                         if (m == 2)
                           goto incomplete;


2021-03-22  Bruno Haible  <bruno@clisp.org>

        doc: Mention an open glibc bug.
        * doc/posix-functions/mbrtowc.texi: Mention the possible out-of-range
        wchar_t values returned by this function on glibc.
        * doc/posix-functions/mbtowc.texi: Likewise.

diff --git a/doc/posix-functions/mbrtowc.texi b/doc/posix-functions/mbrtowc.texi
index 897e4da..291207e 100644
--- a/doc/posix-functions/mbrtowc.texi
+++ b/doc/posix-functions/mbrtowc.texi
@@ -44,6 +44,12 @@ Solaris 9.
 Portability problems not fixed by Gnulib:
 @itemize
 @item
+In UTF-8 locales, this function may return wide characters up to 0x7FFFFFFF
+(that is, beyond 0x0010FFFF) on some platforms:
+@c https://sourceware.org/bugzilla/show_bug.cgi?id=2373
+@c https://sourceware.org/bugzilla/show_bug.cgi?id=26034
+glibc 2.33.
+@item
 On Windows and 32-bit AIX platforms, @code{wchar_t} is a 16-bit type and
 therefore cannot accommodate all Unicode characters.
 However, the ISO C11 function @code{mbrtoc32}, provided by Gnulib module
diff --git a/doc/posix-functions/mbtowc.texi b/doc/posix-functions/mbtowc.texi
index e7e5376..855f825 100644
--- a/doc/posix-functions/mbtowc.texi
+++ b/doc/posix-functions/mbtowc.texi
@@ -16,6 +16,12 @@ Android 4.4.
 Portability problems not fixed by Gnulib:
 @itemize
 @item
+In UTF-8 locales, this function may return wide characters up to 0x7FFFFFFF
+(that is, beyond 0x0010FFFF) on some platforms:
+@c https://sourceware.org/bugzilla/show_bug.cgi?id=2373
+@c https://sourceware.org/bugzilla/show_bug.cgi?id=26034
+glibc 2.33.
+@item
 This function accumulates hidden state on some platforms:
 glibc 2.8 (see @url{https://sourceware.org/bugzilla/show_bug.cgi?id=9674}).
 @item

[Prev in Thread]

Current Thread

[Next in Thread]

[PATCH] mbrtowc: remove a redundant condition, Benno Schulenberg, 2021/03/21
- Re: [PATCH] mbrtowc: remove a redundant condition, Bruno Haible, 2021/03/21
  - Re: [PATCH] mbrtowc: remove a redundant condition, Benno Schulenberg, 2021/03/22
    - Re: [PATCH] mbrtowc: remove a redundant condition, Bruno Haible <=

Prev by Date: Re: [PATCH] mbrtowc: remove a redundant condition
Next by Date: error, getprogname: Relicense under LGPLv2+
Previous by thread: Re: [PATCH] mbrtowc: remove a redundant condition
Next by thread: Is the 'diacrit' module still needed?
Index(es):
- Date
- Thread