[bug-libunistring] observations on the manual


From: Paolo Bonzini
Subject: [bug-libunistring] observations on the manual
Date: Tue, 28 Apr 2009 10:47:41 +0200
User-agent: Thunderbird 2.0.0.21 (Macintosh/20090302)

In libunistring.texi:

> The workarounds can be found in GNU gnulib 
> @url{http://wwww.gnu.org/software/gnulib/}.

Typo in the host name: "wwww" should be "www".

> or --- if @code{wchar_t *}

I'm not sure whether the spaces before and after --- are appropriate.

In unicase.texi:

> @cindex locale language
> These functions are locale dependent.  The @var{iso639_language} argument
> identifies the language (e.g. @code{"tr"} for Turkish).  NULL means to use
> locale independent case mappings.

Is it possible to pass just a POSIX locale name like tr_TR.UTF-8
directly as @var{iso639_language}, with everything but the language part
discarded?
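
To make the question concrete, here is a minimal sketch of the kind of
truncation I have in mind.  The helper name locale_language is
hypothetical and not part of libunistring; it only assumes the usual
language[_territory][.codeset][@modifier] layout of POSIX locale names.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not part of libunistring: copy the language part
   of a POSIX locale name (e.g. "tr_TR.UTF-8" -> "tr") into BUF.
   Returns BUF, or NULL if the language part is empty or does not fit.  */
const char *
locale_language (const char *locale_name, char *buf, size_t bufsize)
{
  /* The language part ends at the first '_', '.' or '@', or at the end
     of the string if none of them occurs.  */
  size_t len = strcspn (locale_name, "_.@");

  if (len == 0 || len >= bufsize)
    return NULL;
  memcpy (buf, locale_name, len);
  buf[len] = '\0';
  return buf;
}
```

A caller would still have to decide what to do with names such as "C" or
"POSIX", which carry no language; mapping them to NULL seems the natural
choice.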

Also, maybe you could add a special constant for the current locale
name, like ((void *)1), or even make that the default and specify "" for
locale-independent case mappings?

It seems to me that there is a limitation, in that you cannot convert
only part of a string to lowercase, uppercase, or titlecase; for that you
have to use uc_toupper, uc_tolower, or uc_totitle and give up the
locale-specific mappings.

However, in many cases the context is available.  For example, if I
modified sed to use u8_tolower, this:

  s/[Α-Ωα-ω]/\L&/g

should have the same effect as doing the conversion on the entire string
(maybe more slowly).  I have not thought about the API so far, but it
seems to me that the only context needed is the character following the
converted part, which makes it noticeably easier.  You could pass to the
functions the overall length of the string and the length of the part to
be converted.

> @code{memcmp2}

This function is provided by gnulib and should be defined somewhere in
the documentation.  It is also mentioned in unistr.texi.
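
For readers who do not know it, the semantics could be documented along
the lines of the following sketch.  This is my own illustration under the
name memcmp2_sketch, written from the behaviour I understand the gnulib
function to have (lexicographic comparison of two memory regions of
possibly different lengths); it is not the gnulib source.

```c
#include <stddef.h>
#include <string.h>

/* Illustrative stand-in for memcmp2: compare the regions [s1, s1+n1) and
   [s2, s2+n2) lexicographically.  Returns a negative value, zero, or a
   positive value.  A region that is a proper prefix of the other sorts
   first.  */
int
memcmp2_sketch (const char *s1, size_t n1, const char *s2, size_t n2)
{
  /* Compare the common prefix first.  */
  int cmp = memcmp (s1, s2, n1 <= n2 ? n1 : n2);

  if (cmp == 0)
    {
      /* The common prefix is identical, so the shorter region is smaller.  */
      if (n1 < n2)
        cmp = -1;
      else if (n1 > n2)
        cmp = 1;
    }
  return cmp;
}
```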

> Converts the string @var{s} of length @var{n} to a string in locale encoding,

The output of xfrm functions is not guaranteed to be in locale encoding.
In fact, it is just a sequence of bytes that represent the
locale-specific collation rules.
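
The same holds for the standard C strxfrm, which may make the point
clearer.  In the sketch below, xfrm_alloc is a hypothetical wrapper of my
own; the only property the result is guaranteed to have is that comparing
two such keys with strcmp orders them like strcoll orders the original
strings.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical wrapper around the standard strxfrm: return a freshly
   allocated collation key for S, or NULL if allocation fails.  The key is
   an opaque byte sequence meant to be compared with strcmp; it is not
   text in the locale encoding and should not be displayed.  */
char *
xfrm_alloc (const char *s)
{
  /* With a size of 0 the destination may be NULL; the return value is
     the length the key needs, not counting the terminating NUL.  */
  size_t needed = strxfrm (NULL, s, 0) + 1;
  char *key = malloc (needed);

  if (key != NULL)
    strxfrm (key, s, needed);
  return key;
}
```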

I noticed that there are no functions accepting NUL-terminated strings.
Is this by design, or could they be introduced in the future (either as
u8_strtoupper, or for example with something like a -1 value for the
length)?
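
The second option could look roughly like this.  Both the macro
LEN_NUL_TERMINATED and the helper resolve_length are hypothetical names
used only for illustration; nothing like them exists in libunistring as
far as I know.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical sentinel meaning "S is NUL-terminated, compute the length".  */
#define LEN_NUL_TERMINATED ((size_t) -1)

/* Hypothetical helper: return the number of UTF-8 units in S, honouring
   the sentinel.  A function taking (s, n) would call this first and then
   proceed exactly as it does today.  */
size_t
resolve_length (const uint8_t *s, size_t n)
{
  return n == LEN_NUL_TERMINATED ? strlen ((const char *) s) : n;
}
```

The drawback is that a genuine length of (size_t) -1 can no longer be
expressed, which should not matter in practice.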

In unistr.texi:

> @deftypefun {uint8_t *} u8_cpy_alloc (const uint8_t *@var{s}, size_t 
> @var{n})

Why not u8_dup?

In uniwidth.texi:

> These functions are locale dependent.  The @var{encoding} argument identifies
> the encoding (e.g.@: @code{"ISO-8859-2"} for Polish).

The manual does not explain why an encoding is required rather than a
language.  I found this comment in the code:

  /* In ancient CJK encodings, Cyrillic and most other characters are
     double-width as well.  */

I believe it should be possible to make the encoding argument optional
(NULL = assume not in ancient CJK encodings).
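
In code, the suggestion amounts to something like the sketch below.  The
function name is hypothetical and the list of encoding names is only my
guess at what counts as an ancient CJK encoding; the real check in the
library may differ.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical check: return true if ENCODING names a legacy CJK encoding
   in which most non-ASCII characters are double-width.  NULL means "no
   encoding given" and is treated as not CJK.  The list is illustrative.  */
bool
is_legacy_cjk_encoding (const char *encoding)
{
  static const char *const cjk[] =
    { "EUC-JP", "GB2312", "GBK", "EUC-TW", "BIG5", "EUC-KR", "CP949", "JOHAB" };
  size_t i;

  if (encoding == NULL)
    return false;
  for (i = 0; i < sizeof cjk / sizeof cjk[0]; i++)
    if (strcmp (encoding, cjk[i]) == 0)
      return true;
  return false;
}
```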

In unistdio.texi:

> The following functions take an ASCII format string and produce output in
> locale encoding to a @code{FILE} stream.

I think these should be moved up with the other ulc_* functions, like:

"The following functions take an ASCII format string and produce output
in locale encoding---either returning it as a @code{char *} string or
emitting it to a @code{FILE} stream".

Finally, I think that the manual should state somewhere the intended
ABI/API stability of libunistring (e.g. that it may change incompatibly
until 1.0).

Thanks for the great work!

Paolo



