[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
Re: [PATCH] mbrtowc: remove a redundant condition
From: |
Bruno Haible |
Subject: |
Re: [PATCH] mbrtowc: remove a redundant condition |
Date: |
Mon, 22 Mar 2021 13:49:04 +0100 |
User-agent: |
KMail/5.1.3 (Linux/4.4.0-203-generic; KDE/5.18.0; x86_64; ; ) |
Hi Benno,
> By staring at the code and trying to understand whether I am missing
> something, why that c == 0xf4 is there.
I see. For consistency with the other occurrences of this idiom
lib/unistr/u8-check.c:66: && (c < 0xf4 || (c == 0xf4 && s[1] <
0x90)))
lib/unistr/u8-mblen.c:56: && (c < 0xf4 || (c == 0xf4 && s[1] <
0x90)))
lib/unistr/u8-mbtouc-aux.c:98: && (c < 0xf4 || (c
== 0xf4 && s[1] < 0x90)))
lib/unistr/u8-mbtouc-unsafe-aux.c:98: && (c < 0xf4
|| (c == 0xf4 && s[1] < 0x90))
lib/unistr/u8-mbtouc-unsafe.c:109: && (c < 0xf4 ||
(c == 0xf4 && s[1] < 0x90))
lib/unistr/u8-mbtouc.c:108: && (c < 0xf4 || (c ==
0xf4 && s[1] < 0x90)))
lib/unistr/u8-mbtoucr.c:95: && (c < 0xf4 || (c == 0xf4 && s[1]
< 0x90)))
lib/unistr/u8-prev.c:68: && (c_4 < 0xf4 || (c_4 ==
0xf4 && c_3 < 0x90)))
lib/unistr/u8-strmblen.c:51: && (c < 0xf4 || (c == 0xf4 && s[1] <
0x90)))
lib/unistr/u8-strmbtouc.c:63: && (c < 0xf4 || (c == 0xf4 && s[1] <
0x90)))
I prefer to keep the test in, but commented out. I apply the patch below.
> I was looking at the code to see how conversion from multibyte to
> wide character is done, whether it's worth it to give nano its own
> copy. There is quite some overhead at the beginning, and nano doesn't
> need to do the checks on m. Also, I want nano's version to return -1
> for codes beyond U+10FFFF. The mbtowc from glibc returns 4 for codes
> from U+110000 to U+1FFFFF, but those are not valid UTF-8 sequences,
Indeed, it's registered as a glibc bug:
https://sourceware.org/bugzilla/show_bug.cgi?id=2373
https://sourceware.org/bugzilla/show_bug.cgi?id=26034
> so nano would need to do another check afterward.
Yup. In GNU gettext, I've had abort()s due to the fact that some multibyte-to-
Unicode conversion allowed characters up to U+7FFFFFFF and a subsequent check
for a valid Unicode character (<= U+0010FFFF) failed.
You're right, this is worth documenting.
Bruno
2021-03-22 Benno Schulenberg <bensberg@telfort.nl> (tiny change)
mbrtowc: Remove a redundant condition.
* lib/mbrtowc-impl-utf8.h: There is no need to check for c == 0xf4
when !(c < 0xf4), as ten lines earlier c <= 0xf4 was established.
diff --git a/lib/mbrtowc-impl-utf8.h b/lib/mbrtowc-impl-utf8.h
index 4f3bbb3..58006d3 100644
--- a/lib/mbrtowc-impl-utf8.h
+++ b/lib/mbrtowc-impl-utf8.h
@@ -96,7 +96,7 @@
if ((c2 ^ 0x80) < 0x40
&& (c >= 0xf1 || c2 >= 0x90)
- && (c < 0xf4 || (c == 0xf4 && c2 < 0x90)))
+ && (c < 0xf4 || (/* c == 0xf4 && */ c2 < 0x90)))
{
if (m == 2)
goto incomplete;
2021-03-22 Bruno Haible <bruno@clisp.org>
doc: Mention an open glibc bug.
* doc/posix-functions/mbrtowc.texi: Mention the possible out-of-range
wchar_t values returned by this function on glibc.
* doc/posix-functions/mbtowc.texi: Likewise.
diff --git a/doc/posix-functions/mbrtowc.texi b/doc/posix-functions/mbrtowc.texi
index 897e4da..291207e 100644
--- a/doc/posix-functions/mbrtowc.texi
+++ b/doc/posix-functions/mbrtowc.texi
@@ -44,6 +44,12 @@ Solaris 9.
Portability problems not fixed by Gnulib:
@itemize
@item
+In UTF-8 locales, this function may return wide characters up to 0x7FFFFFFF
+(that is, beyond 0x0010FFFF) on some platforms:
+@c https://sourceware.org/bugzilla/show_bug.cgi?id=2373
+@c https://sourceware.org/bugzilla/show_bug.cgi?id=26034
+glibc 2.33.
+@item
On Windows and 32-bit AIX platforms, @code{wchar_t} is a 16-bit type and
therefore cannot accommodate all Unicode characters.
However, the ISO C11 function @code{mbrtoc32}, provided by Gnulib module
diff --git a/doc/posix-functions/mbtowc.texi b/doc/posix-functions/mbtowc.texi
index e7e5376..855f825 100644
--- a/doc/posix-functions/mbtowc.texi
+++ b/doc/posix-functions/mbtowc.texi
@@ -16,6 +16,12 @@ Android 4.4.
Portability problems not fixed by Gnulib:
@itemize
@item
+In UTF-8 locales, this function may return wide characters up to 0x7FFFFFFF
+(that is, beyond 0x0010FFFF) on some platforms:
+@c https://sourceware.org/bugzilla/show_bug.cgi?id=2373
+@c https://sourceware.org/bugzilla/show_bug.cgi?id=26034
+glibc 2.33.
+@item
This function accumulates hidden state on some platforms:
glibc 2.8 (see @url{https://sourceware.org/bugzilla/show_bug.cgi?id=9674}).
@item