Re: u32_normalize UNINORM_NFKC on 0xD800
From: Simon Josefsson
Subject: Re: u32_normalize UNINORM_NFKC on 0xD800
Date: Fri, 27 May 2011 11:23:03 +0200
User-agent: Gnus/5.110018 (No Gnus v0.18) Emacs/23.2 (gnu/linux)

Bruno Haible <address@hidden> writes:
> Simon Josefsson wrote:
>> I'm doing some Unicode NFKC operations and noticing that u32_normalize
>> fails for U+D800.
>
> This is a valid behaviour, because U+D800 is a "surrogate" code point
> and therefore not a valid character code point.
>
> See the Unicode standard, chapter 2 [1], pages 23..24:
> Surrogate code points and other non-character code points "should never be
> interchanged". This means, for libunistring, that they are invalid input
> and invalid output in all functions taking or returning UTF-32 strings or
> UTF-8 strings.
>
> Character code points and code points that are in regions that may be assigned
> in future Unicode versions must not be rejected; these are valid input.
I'm not interchanging the code points; I'm calculating this IDNA2008
property

  toNFKC(toCaseFold(toNFKC(cp))) != cp

for all code points. Is this impossible to do with the u32_normalize
interface?
I notice that ICU also gives an error in this situation:
http://demo.icu-project.org/icu-bin/nbrowser?t=&s=D800&uv=0
I wonder what the above expression means when toNFKC fails.
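To make the question concrete, here is a minimal sketch of one way a caller could handle it: give the property a third "undefined" outcome instead of forcing true or false when the mapping cannot be computed. This is only an illustration, not what IDNA2008 or libunistring prescribes. The names `classify`, `cp_mapper`, `enum stability` and `toy_ascii_fold` are hypothetical; a real mapper would wrap the NFKC and case-folding calls, and the toy one below only lowercases ASCII so that the example stays self-contained.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical three-way result for toNFKC(toCaseFold(toNFKC(cp))) != cp. */
enum stability { CP_STABLE, CP_UNSTABLE, CP_UNDEFINED };

/* Hypothetical callback: maps one code point to a string of at most
   `cap` code points, stores the length in *len, and returns false
   if the mapping fails (for example because normalization fails). */
typedef bool (*cp_mapper) (uint32_t cp, uint32_t *out, size_t cap, size_t *len);

static bool
is_surrogate (uint32_t cp)
{
  return cp >= 0xD800 && cp <= 0xDFFF;
}

/* Classify one code point.  Surrogates and mapping failures are reported
   as CP_UNDEFINED (an assumption of this sketch) rather than guessed. */
static enum stability
classify (uint32_t cp, cp_mapper map)
{
  uint32_t out[32];   /* assumed large enough for one mapped code point */
  size_t len = 0;

  if (is_surrogate (cp))
    return CP_UNDEFINED;
  if (!map (cp, out, sizeof out / sizeof out[0], &len))
    return CP_UNDEFINED;
  return (len == 1 && out[0] == cp) ? CP_STABLE : CP_UNSTABLE;
}

/* Stand-in mapper for demonstration only: lowercases ASCII letters. */
static bool
toy_ascii_fold (uint32_t cp, uint32_t *out, size_t cap, size_t *len)
{
  if (cap < 1)
    return false;
  out[0] = (cp >= 'A' && cp <= 'Z') ? cp + ('a' - 'A') : cp;
  *len = 1;
  return true;
}
```

With this shape, the caller decides explicitly what CP_UNDEFINED means for the derived property, instead of inheriting whatever the normalizer does with invalid input.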
I managed to work around this with a local patch that makes u32_uctomb
mimic u32_mbtouc_unsafe's behaviour, but I'm not sure I'm going to
use it.
--- lib/unistr/u32-uctomb.c.orig 2011-05-27 11:16:00.112466242 +0200
+++ lib/unistr/u32-uctomb.c 2011-05-27 11:16:01.696467065 +0200
@@ -30,8 +30,10 @@
int
u32_uctomb (uint32_t *s, ucs4_t uc, int n)
{
+#if CONFIG_UNICODE_SAFETY
if (uc < 0xd800 || (uc >= 0xe000 && uc < 0x110000))
{
+#endif
if (n > 0)
{
*s = uc;
@@ -39,9 +41,11 @@
}
else
return -2;
+#if CONFIG_UNICODE_SAFETY
}
else
return -1;
+#endif
}
#endif
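
For comparison, the effect of the patch can be sketched as a standalone function with a runtime switch instead of the preprocessor one. This is an illustration only, not libunistring code: `store_u32`, `is_unicode_scalar` and the `check` parameter are hypothetical names, and the return value 1 on success is assumed from the usual "number of units written" convention.

```c
#include <stdbool.h>
#include <stdint.h>

/* Same range test as in the patched function: everything except
   surrogates (0xD800..0xDFFF) and values of 0x110000 or above. */
static bool
is_unicode_scalar (uint32_t uc)
{
  return uc < 0xD800 || (uc >= 0xE000 && uc < 0x110000);
}

/* Store UC into S, which has room for N units.
   Returns 1 on success, -1 if UC is rejected, -2 if there is no room.
   With CHECK false, the validity test is skipped entirely. */
static int
store_u32 (uint32_t *s, uint32_t uc, int n, bool check)
{
  if (check && !is_unicode_scalar (uc))
    return -1;
  if (n <= 0)
    return -2;
  *s = uc;
  return 1;
}
```

Note that skipping the test lets through not only surrogates but also values of 0x110000 and above, which is one reason to hesitate about using it.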
/Simon