Re: mbswidth "failure" on Solaris
From: Bruno Haible
Subject: Re: mbswidth "failure" on Solaris
Date: Sun, 05 May 2019 13:35:56 +0200
User-agent: KMail/5.1.3 (Linux/4.4.0-145-generic; KDE/5.18.0; x86_64; ; )
Hi,
> > 15 | e: {∇⃗×𝐸⃗ = -∂𝐵⃗/∂t}
> > - | ^~~~~~~~~~~~~~
> > + | ^~~~~~~~~~~~~~~~~
Indeed, mbswidth seems to have returned 3 more columns.
> The error (three more columns than expected) seems to indicate something
> related to the combining arrow.
No. The issue comes from the math symbols. The following test program shows
it:
#include <config.h>
#include <stdio.h>
#include <locale.h>
#include <wchar.h>
#include "mbswidth.h"

/* Each comment gives the expected result, followed (after "vs.") by the
   result seen on Solaris where the two differ.  */
int main ()
{
  setlocale (LC_ALL, "en_US.UTF-8");
  printf ("%d\n", (int) mbswidth ("{∇⃗×𝐸⃗ = -∂𝐵⃗/∂t}", 0)); // 14 vs. 17
  printf ("%d\n", wcwidth (0x2207)); // 1 vs. 2
  printf ("%d\n", wcwidth (0x20D7)); // 0
  printf ("%d\n", wcwidth (0x00D7)); // 1
  printf ("%d\n", wcwidth (0x1D438)); // 1
  printf ("%d\n", wcwidth (0x2202)); // 1 vs. 2
  printf ("%d\n", wcwidth (0x1D435)); // 1
  return 0;
}
The following patch should fix it.
The patch changes the behaviour of wcwidth(0x2202) for UTF-8 locales.
It would be possible to limit the change to the non-East-Asian UTF-8
locales (by using the function uc_locale_language() and testing
whether its result is not one of "zh", "ja", "ko"), but glibc does not
do this (it uses the same width across all UTF-8 locales), therefore
I'm not doing it here either.
2019-05-05  Bruno Haible  <address@hidden>

        wcwidth: Ensure width 1, not 2, for ambiguous characters.
        Reported by Kiyoshi KANAZAWA <address@hidden>
        via Akim Demaille <address@hidden>.
        * m4/wcwidth.m4 (gl_FUNC_WCWIDTH): Check the width of U+2202. Use an
        en_US.UTF-8 locale, since that is more likely to be present than an
        fr_FR.UTF-8 locale.
        * tests/test-wcwidth.c (main): Check the width of U+2202.
        * doc/posix-functions/wcwidth.texi: Mention the issue.

diff --git a/m4/wcwidth.m4 b/m4/wcwidth.m4
index 3952fd2..e9b5bf4 100644
--- a/m4/wcwidth.m4
+++ b/m4/wcwidth.m4
@@ -1,4 +1,4 @@
-# wcwidth.m4 serial 28
+# wcwidth.m4 serial 29
dnl Copyright (C) 2006-2019 Free Software Foundation, Inc.
dnl This file is free software; the Free Software Foundation
dnl gives unlimited permission to copy and/or distribute it,
@@ -54,6 +54,8 @@ AC_DEFUN([gl_FUNC_WCWIDTH],
dnl On OSF/1 5.1, wcwidth(0x200B) (ZERO WIDTH SPACE) returns 1.
dnl On OpenBSD 5.8, wcwidth(0xFF1A) (FULLWIDTH COLON) returns 0.
dnl This leads to bugs in 'ls' (coreutils).
+ dnl On Solaris 11.4, wcwidth(0x2202) (PARTIAL DIFFERENTIAL) returns 2,
+ dnl even in Western locales.
AC_CACHE_CHECK([whether wcwidth works reasonably in UTF-8 locales],
[gl_cv_func_wcwidth_works],
[
@@ -80,7 +82,7 @@ int wcwidth (int);
int main ()
{
int result = 0;
- if (setlocale (LC_ALL, "fr_FR.UTF-8") != NULL)
+ if (setlocale (LC_ALL, "en_US.UTF-8") != NULL)
{
if (wcwidth (0x0301) > 0)
result |= 1;
@@ -90,6 +92,8 @@ int main ()
result |= 4;
if (wcwidth (0xFF1A) == 0)
result |= 8;
+ if (wcwidth (0x2202) > 1)
+ result |= 16;
}
return result;
}]])],
diff --git a/tests/test-wcwidth.c b/tests/test-wcwidth.c
index eb7bdd2..8e9cea3 100644
--- a/tests/test-wcwidth.c
+++ b/tests/test-wcwidth.c
@@ -72,6 +72,22 @@ main ()
ASSERT (wcwidth (0x200B) == 0);
ASSERT (wcwidth (0xFEFF) <= 0);
+ /* Test width of some math symbols.
+ U+2202 is marked as having ambiguous width (A) in EastAsianWidth.txt
+ (see <https://www.unicode.org/Public/12.0.0/ucd/EastAsianWidth.txt>).
+ The Unicode Standard Annex 11
+ <https://www.unicode.org/reports/tr11/tr11-36.html>
+ says
+ "Ambiguous characters behave like wide or narrow characters
+ depending on the context (language tag, script identification,
+ associated font, source of data, or explicit markup; all can
+ provide the context). If the context cannot be established
+ reliably, they should be treated as narrow characters by default."
+ For wcwidth(), the only available context information is the locale.
+ "fr_FR.UTF-8" is a Western locale, not an East Asian locale, therefore
+ U+2202 should be treated like a narrow character. */
+ ASSERT (wcwidth (0x2202) == 1);
+
/* Test width of some CJK characters. */
ASSERT (wcwidth (0x3000) == 2);
ASSERT (wcwidth (0xB250) == 2);
diff --git a/doc/posix-functions/wcwidth.texi b/doc/posix-functions/wcwidth.texi
index 741be8e..ecdf758 100644
--- a/doc/posix-functions/wcwidth.texi
+++ b/doc/posix-functions/wcwidth.texi
@@ -18,6 +18,10 @@ glibc 2.8.
This function handles combining characters in UTF-8 locales incorrectly on some
platforms:
Mac OS X 10.3, OpenBSD 5.8.
+@item
+This function returns 2 for characters with ambiguous east asian width, even in
+Western locales, on some platforms:
+Solaris 11.4.
@end itemize
Portability problems not fixed by Gnulib: