[Bug-gnulib] Re: strtok_r


From: Simon Josefsson
Subject: [Bug-gnulib] Re: strtok_r
Date: Fri, 12 Nov 2004 14:40:28 +0100
User-agent: Gnus/5.110003 (No Gnus v0.3) Emacs/21.3.50 (gnu/linux)

Bruno Haible <address@hidden> writes:

> Simon Josefsson wrote:
>> I'll install this in gnulib now.
>>
>> /* Parse S into tokens separated by characters in DELIM.
>>    If S is NULL, the saved pointer in SAVE_PTR is used as
>>    the next starting point.  For example:
>>      char s[] = "-abc-=-def";
>>      char *sp;
>>      x = strtok_r(s, "-", &sp);      // x = "abc", sp = "=-def"
>>      x = strtok_r(NULL, "-=", &sp);  // x = "def", sp = NULL
>>      x = strtok_r(NULL, "=", &sp);   // x = NULL
>>              // s = "-abc\0=-def\0"
>>
>>    For the POSIX documentation for this function, see:
>>    http://www.opengroup.org/onlinepubs/009695399/functions/strtok.html
>>
>>    Caveat: It modifies the original string.
>>    Caveat: These functions cannot be used on constant strings.
>>    Caveat: The identity of the delimiting character is lost.
>>    Caveat: It doesn't work with multibyte strings unless all of the
>>            delimiter characters are ASCII characters < 0x80.
>>
>>    See also strsep().
>> */
>
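
For reference, here is a minimal sketch of the example from the comment
as compilable C. The helper name tokenize_example is made up for
illustration. What SAVE_PTR holds after the last token differs between
implementations, so the sketch does not inspect it.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <string.h>

/* Hypothetical helper: runs the three calls from the comment on BUF and
   stores the returned tokens in OUT[0..2].  BUF must be writable,
   because strtok_r overwrites delimiters with '\0'.  */
static void tokenize_example(char *buf, char *out[3])
{
    char *sp;

    assert(buf != NULL && out != NULL);
    out[0] = strtok_r(buf, "-", &sp);    /* skips the leading "-" */
    out[1] = strtok_r(NULL, "-=", &sp);  /* the delimiter set may change per call */
    out[2] = strtok_r(NULL, "=", &sp);   /* no tokens left: returns NULL */
}
```

Called on char s[] = "-abc-=-def", this yields "abc", "def" and NULL.
Afterwards the buffer holds "-abc\0=-def": the leading '-' is skipped,
not removed, and only the delimiter that ended "abc" is overwritten.
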
> Yes, this looks good. Except the 0x80 should really be 0x30. Most multibyte
> encodings have the property that an ASCII character is encoded as a single
> byte, with the same value as in ASCII. But here, in order to use, say, '0'
> or 'A' as a delimiter, you need a different property: That every occurrence
> of a byte with a given ASCII value means that ASCII character and is not
> part of a multibyte character. This property is fulfilled for UTF-8 and the
> EUC-*. Unfortunately, the following widely used encodings don't have this
> property:
>
>   BIG5 BIG5-HKSCS GBK SHIFT_JIS
>             don't have the property for 0x40 <= c <= 0x7E
>   GB18030   doesn't have the property for 0x30 <= c <= 0x39, 0x40 <= c <= 0x7E
>   JOHAB     doesn't have the property for 0x31 <= c <= 0x7E
>
> Especially GB18030 is probably bound to stay around for a long time.
> Therefore really 0x30 is the limit of the usable delimiters.
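
To make the byte-collision problem concrete, here is a small sketch.
It assumes the well-known case of U+8868, which Shift_JIS encodes as
the bytes 0x95 0x5C, so its second byte equals ASCII backslash. The
helper count_tokens and the two test strings are made up for
illustration.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <string.h>

/* Hypothetical helper: counts the tokens strtok_r finds in a copy of
   TEXT, treating every byte of DELIM as a delimiter.  */
static int count_tokens(const char *text, const char *delim)
{
    char buf[64];
    char *sp;
    int n = 0;

    assert(strlen(text) < sizeof buf);
    strcpy(buf, text);
    for (char *t = strtok_r(buf, delim, &sp); t != NULL;
         t = strtok_r(NULL, delim, &sp))
        n++;
    return n;
}

/* "a", U+8868, "b": neither string contains a backslash character.  */
static const char text_sjis[] = "a\x95\x5c" "b";      /* Shift_JIS: 95 5C */
static const char text_utf8[] = "a\xe8\xa1\xa8" "b";  /* UTF-8: E8 A1 A8 */
```

With "\\" as the delimiter, count_tokens(text_utf8, "\\") finds one
token, while count_tokens(text_sjis, "\\") finds two, because the
trailing byte of the Shift_JIS character is taken for a delimiter and
the character is cut in half.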

I think that is a rather subtle discussion -- especially considering
that, e.g., UCS-4 is a widely used multibyte encoding that is not
compatible with ASCII for any character.

Can't we say:

    Caveat: It only supports one-octet delimiters.  With many character
            sets, non-ASCII characters cannot be used as delimiters.

?

Perhaps put your discussion in the manual?

("many" because, e.g., ISO-8859-1 delimiters are supported.)
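
A sketch of what "one-octet delimiters" means in practice: DELIM is
treated as a set of bytes, so a delimiter that takes two octets in
UTF-8 acts as two independent one-octet delimiters. The helper
first_token_len is made up for illustration.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: length in bytes of the first token strtok_r
   finds in a copy of TEXT, or 0 if there is no token.  */
static size_t first_token_len(const char *text, const char *delim)
{
    char buf[64];
    char *sp;
    char *tok;

    assert(strlen(text) < sizeof buf);
    strcpy(buf, text);
    tok = strtok_r(buf, delim, &sp);
    return tok != NULL ? strlen(tok) : 0;
}
```

In ISO-8859-1, e-acute is the single octet 0xE9, so
first_token_len("a\xe8" "b", "\xe9") is 3 (no delimiter present) and
first_token_len("a\xe9" "b", "\xe9") is 1. In UTF-8, e-acute is
0xC3 0xA9 and e-grave is 0xC3 0xA8, so
first_token_len("a\xc3\xa8" "b", "\xc3\xa9") is also 1: the string is
split at 0xC3 even though it contains no e-acute.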

Thanks.




