Re: new module c-strstr
From: Bruno Haible
Subject: Re: new module c-strstr
Date: Fri, 18 Aug 2006 20:44:33 +0200
User-agent: KMail/1.9.1
Paul Eggert wrote:
> > /* The functions defined in this file assume the "C" locale and a character
> > set without diacritics (ASCII-US or EBCDIC-US or something like that).
> > Even if the "C" locale on a particular system is an extension of the ASCII
> > character set (like on BeOS, where it is UTF-8, or on AmigaOS, where it
> > is ISO-8859-1), the functions in this file recognize only the ASCII
> > characters. More precisely, one of the string arguments must be an ASCII
> > string with additional restrictions. */
>
> The intent here is to act like the "C" locale, where all single bytes
> count as characters, ...
The "C" locale is not always a unibyte locale. On some systems, like
BeOS or MacOS X, even the C locale is a multibyte locale (with UTF-8
encoding). Therefore most of our "c-*" modules would be better named
"ascii-*" or "unibyte-*".
> even when some other locale is in effect, right?
The purpose is either to provide the semantics of a unibyte locale without
actually switching locales, or to provide the correct locale-dependent
semantics through a speedier algorithm. I now see where the confusion comes
from: the first paragraph of comments highlights the first purpose; the
second highlights the second purpose; and they contradict each other.
> > This function is safe to be called, even in a multibyte locale, if NEEDLE
> > ...
>
> I think this claim isn't true for some weird non-ASCII encoding
> schemes like DBCS-Host.
Are these used as locale encodings? Many of these so-called DBCS encodings
are stateful and therefore not usable as locale encodings.
Encodings that are not nearly ASCII compatible don't appear in the world
where GNU programs are deployed. I added a check to gperf: if a
gperf-generated program is compiled in an environment whose encoding
is not nearly ASCII compatible (testing only the printable characters,
not the control characters), compilation fails with a message that
asks for a bug report. No such bug report has ever been filed.
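A compile-time check of this kind can be sketched as follows. This is not
the actual gperf check: the macro name SAMPLE_CHARS_ARE_ASCII, the helper
sample_chars_are_ascii, and the handful of characters tested are assumptions
made for illustration. A real check would cover all printable characters.

```c
/* Hypothetical sketch: compare a few printable characters with their
   ASCII codes.  A real check would list every printable character.  */
#define SAMPLE_CHARS_ARE_ASCII \
  ((' ' == 32) && ('0' == 48) && ('9' == 57) && ('A' == 65) \
   && ('Z' == 90) && ('_' == 95) && ('a' == 97) && ('z' == 122))

#if !SAMPLE_CHARS_ARE_ASCII
# error "This code assumes a nearly ASCII compatible character set; please report a bug."
#endif

/* Run-time view of the same condition, using the execution character set.  */
static int
sample_chars_are_ascii (void)
{
  return SAMPLE_CHARS_ARE_ASCII;
}
```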
> Also, it wouldn't be true if someone introduced a new encoding that
> varies from ASCII in some other way.
This is true, but the pace of creation of new encodings has slowed down
a lot in recent years. The most recently created encoding scheme is
GB18030, and that was 6 years ago. I expect that from now on, only minor
variations of existing encodings will be created.
> How about changing the wording to be:
>
> In all practical encodings that we know of that are extensions or
> near-extensions of ASCII, this function is safe to be called, even
> in a multibyte locale, if NEEDLE ...
The "nearly an ASCII extension" assumption is ubiquitous; think of
(c >= '0') tests and the like. Do you really find it worth mentioning?
> Another possibility would be to remove the claim entirely
But it's important to know that c_strstr (s, "x") is not safe and
c_strstr (s, "123") is also not safe. The programmer needs to have the
precise criteria.
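To see why a one-character needle is unsafe, here is a small sketch. It uses
the standard strstr as a stand-in for any byte-wise search, and the names
bytewise_match_offset and sjis_sample are hypothetical. The haystack is the
byte pair 0x83 0x78, which forms a single double-byte character in Shift_JIS,
yet its trail byte has the same value as ASCII 'x'.

```c
#include <stddef.h>
#include <string.h>

/* Return the byte offset of the first byte-wise match of NEEDLE in
   HAYSTACK, or -1 if there is none.  */
static long
bytewise_match_offset (const char *haystack, const char *needle)
{
  const char *p = strstr (haystack, needle);
  return p != NULL ? (long) (p - haystack) : -1L;
}

/* One Shift_JIS double-byte character: lead byte 0x83, trail byte 0x78.  */
static const char sjis_sample[] = "\x83\x78";
```

With an ASCII execution character set, bytewise_match_offset (sjis_sample, "x")
reports a match at offset 1, in the middle of the character, although the
text contains no letter x.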
> > foundneedle:
> > return (char*) haystack;
>
> The usual GNU style puts a space before the "*".
Yes. Fixed.
How about these comments? They don't talk about the C locale any more.
Bruno
======================== lib/strstr.h ==============================
/* Searching in a string.
Copyright (C) 2001-2003, 2006 Free Software Foundation, Inc.
This program is free software; you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
the Free Software Foundation; either version 2, or (at your option)
any later version.
This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
GNU General Public License for more details.
You should have received a copy of the GNU General Public License
along with this program; if not, write to the Free Software Foundation,
Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301, USA. */
/* The functions defined in this file assume a nearly ASCII compatible
character set. */
#ifdef __cplusplus
extern "C" {
#endif
/* Find the first occurrence of NEEDLE in HAYSTACK.
This function is safe to be called, even in a multibyte locale, if NEEDLE
1. consists solely of printable ASCII characters excluding '\\' and '~'
[this restriction is needed because of Shift_JIS and JOHAB]
or of the control ASCII characters '\a' '\b' '\f' '\n' '\r' '\t' '\v'
[this restriction is needed because of VISCII], and
2. has at least length 2
[this restriction is needed because of BIG5, BIG5-HKSCS, GBK, GB18030,
Shift_JIS, JOHAB], and
3. does not consist entirely of decimal digits, or has at least length 4
[this restriction is needed because of GB18030].
This function is also safe to be called, even in a multibyte locale, if
HAYSTACK and NEEDLE are known to both consist solely of printable ASCII
characters excluding '\\' and '~'. */
extern char *c_strstr (const char *haystack, const char *needle);
#ifdef __cplusplus
}
#endif
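
The three criteria in the comment above can be turned into a checker, for
example for use in a test suite. This is only a sketch and not part of the
proposed module: the names needle_is_safe and byte_is_allowed are
hypothetical, criterion 1 is read as a per-byte condition (each byte is
either an allowed printable character or one of the listed control
characters), and the ranges 0x20..0x7E assume an ASCII execution character
set.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Criterion 1, per byte: printable ASCII except '\\' and '~', or one of
   the control characters '\a' '\b' '\f' '\n' '\r' '\t' '\v'.  */
static bool
byte_is_allowed (unsigned char c)
{
  if (c >= 0x20 && c <= 0x7E)
    return c != '\\' && c != '~';
  return c != '\0' && strchr ("\a\b\f\n\r\t\v", c) != NULL;
}

/* Return true if NEEDLE satisfies criteria 1 to 3 as stated in the
   comment of c_strstr.  */
static bool
needle_is_safe (const char *needle)
{
  size_t len = strlen (needle);
  bool all_digits = true;
  size_t i;

  /* Criterion 2: at least length 2.  */
  if (len < 2)
    return false;

  for (i = 0; i < len; i++)
    {
      unsigned char c = (unsigned char) needle[i];
      if (!byte_is_allowed (c))
        return false;
      if (c < '0' || c > '9')
        all_digits = false;
    }

  /* Criterion 3: not entirely decimal digits, or at least length 4.  */
  return !all_digits || len >= 4;
}
```

Under this reading, needle_is_safe ("x") and needle_is_safe ("123") are both
false, matching the two unsafe calls mentioned earlier, while
needle_is_safe ("1234") and needle_is_safe ("ab") are true.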