[Bug-gnulib] Re: strtok_r
From: Simon Josefsson
Subject: [Bug-gnulib] Re: strtok_r
Date: Fri, 12 Nov 2004 14:40:28 +0100
User-agent: Gnus/5.110003 (No Gnus v0.3) Emacs/21.3.50 (gnu/linux)

Bruno Haible <address@hidden> writes:
> Simon Josefsson wrote:
>> I'll install this in gnulib now.
>>
>> /* Parse S into tokens separated by characters in DELIM.
>> If S is NULL, the saved pointer in SAVE_PTR is used as
>> the next starting point. For example:
>> char s[] = "-abc-=-def";
>> char *sp;
>> x = strtok_r(s, "-", &sp); // x = "abc", sp = "=-def"
>> x = strtok_r(NULL, "-=", &sp); // x = "def", sp = NULL
>> x = strtok_r(NULL, "=", &sp); // x = NULL
>> // s = "abc\0-def\0"
>>
>> For the POSIX documentation for this function, see:
>> http://www.opengroup.org/onlinepubs/009695399/functions/strtok.html
>>
>> Caveat: It modifies the original string.
>> Caveat: These functions cannot be used on constant strings.
>> Caveat: The identity of the delimiting character is lost.
>> Caveat: It doesn't work with multibyte strings unless all of the
>> delimiter characters are ASCII characters < 0x80.
>>
>> See also strsep().
>> */
>
> Yes, this looks good. Except the 0x80 should really be 0x30. Most multibyte
> encodings have the property that an ASCII character is encoded as a single
> byte, with the same value as in ASCII. But here, in order to use, say, '0'
> or 'A' as a delimiter, you need a different property: That every occurrence
> of a byte with a given ASCII value means that ASCII character and is not
> part of a multibyte character. This property is fulfilled for UTF-8 and the
> EUC-*. Unfortunately, the following widely used encodings don't have this
> property:
>
> BIG5 BIG5-HKSCS GBK SHIFT_JIS
> don't have the property for 0x40 <= c <= 0x7E
> GB18030 doesn't have the property for 0x30 <= c <= 0x39, 0x40 <= c <= 0x7E
> JOHAB doesn't have the property for 0x31 <= c <= 0x7E
>
> Especially GB18030 is probably bound to stay around for a long time.
> Therefore really 0x30 is the limit of the usable delimiters.

I think that is a rather subtle discussion -- especially considering
that, e.g., UCS-4 is a widely used multibyte encoding that is not
compatible with ASCII for any character.

Can't we say:

   Caveat: It only supports one-octet delimiters.  With many character
   sets, non-ASCII characters cannot be used as delimiters.

?

Perhaps put your discussion in the manual?

("many" because, e.g., ISO-8859-1 delimiters are supported.)

Thanks.