Re: [bug-libunistring] observations on the manual


From: Paolo Bonzini
Subject: Re: [bug-libunistring] observations on the manual
Date: Wed, 29 Apr 2009 14:29:01 +0200
User-agent: Thunderbird 2.0.0.21 (Macintosh/20090302)

>> Also, maybe you could add a special constant for the current locale
>> name, like ((void *)1), or even make that the default
> 
> The function uc_locale_language() exists precisely for this purpose.
> You are supposed to call it once. It speeds up the u*_tolower etc. functions
> to not have to look up the current locale over and over again.

I think the function should be mentioned.

>> and specify "" for locale-independent case mappings?
> 
> For the locale independent mappings, you can use NULL, or "", or any
> other invalid territory name.

NULL as documented now is more than okay given your other observation.

>> However, in many cases the context is available.  For example, if I
>> modified sed to use u8_tolower, this:
>>
>>   s/[Α-Ωα-ω]/\L&/g
>>
>> should have the same effect as doing the conversion on the entire string
>> (maybe more slowly).
> 
> Well, I cannot really speak about 'sed'; but that sed command appears to
> request character-by-character processing.

Not necessarily; for example, ὰ (GREEK SMALL LETTER ALPHA U+03B1 followed
by COMBINING GRAVE TONE MARK U+0340) might match the character class in the
given locale.  This, however, was not the point; the point is that it
would be nice if these two sed commands

     s/[Α-Ωα-ω]/\L&/g
     s/[Α-Ωα-ω]*/\L&/g

would be equivalent on a version of GNU sed using Unicode case mappings
for its \L\l\U\u extension.

By the way, it would be nice to have an example of titlecase.  It is not
clear right now from the documentation if "foo bar" would be converted
to "Foo bar" or "Foo Bar".  In other words, a definition of
"titlecasing" would be useful.

Also, from reading the code it seems to be the latter, so sample code in
the documentation showing how to do the former kind of conversion would
be nice.  I suppose that would be something like:

   call uN_wordbreaks
   look for the second wordbreak
   use uN_totitle until the second wordbreak, excluded
   use uN_tolower from the second wordbreak on
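
A minimal sketch of the four steps above, under loud assumptions: it is
ASCII-only and does not call libunistring at all.  The helpers
ascii_totitle_word, ascii_tolower_range and title_first_word are
hypothetical names invented for illustration; the end of the first run of
letters stands in for "the second wordbreak", and toupper/tolower stand in
for uN_totitle/uN_tolower.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>   /* strlen, for callers that pass (s, strlen (s)) */

/* Stand-in for uN_totitle on one word: uppercase the first letter in
   s[0..n), lowercase every byte after it.  */
static void ascii_totitle_word (char *s, size_t n)
{
  size_t i = 0;
  while (i < n && !isalpha ((unsigned char) s[i]))
    i++;                                  /* skip leading non-letters */
  if (i < n)
    {
      s[i] = (char) toupper ((unsigned char) s[i]);
      i++;
    }
  for (; i < n; i++)
    s[i] = (char) tolower ((unsigned char) s[i]);
}

/* Stand-in for uN_tolower: lowercase every byte of s[0..n).  */
static void ascii_tolower_range (char *s, size_t n)
{
  for (size_t i = 0; i < n; i++)
    s[i] = (char) tolower ((unsigned char) s[i]);
}

/* Titlecase only the first word of s[0..n); lowercase the rest.  */
static void title_first_word (char *s, size_t n)
{
  assert (s != NULL || n == 0);
  size_t i = 0;
  while (i < n && !isalpha ((unsigned char) s[i]))
    i++;                                  /* start of the first word */
  while (i < n && isalpha ((unsigned char) s[i]))
    i++;                                  /* end of the first word */
  ascii_totitle_word (s, i);              /* "totitle until the break" */
  ascii_tolower_range (s + i, n - i);     /* "tolower from the break on" */
}
```

With this sketch, calling title_first_word on "foo bar" with length 7
turns the buffer into "Foo bar".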

Having a function for this would be better though, because the function
would not need to find wordbreaks at all.  In fact, just defining

  #define U_WORDBREAKS(s, n, wordbreaks)  memset ((wordbreaks), 0, (n))

and using u-totitle.h is enough, if suboptimal.
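
To illustrate why an all-zero wordbreaks array would give the "Foo bar"
behaviour, here is a small ASCII-only model.  It is not the code in
u-totitle.h: ascii_wordbreaks and ascii_totitle are hypothetical
stand-ins, and the model simply assumes that titlecasing uppercases the
first letter of the string and the first letter at or after each marked
break, lowercasing all other letters.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for uN_wordbreaks: p[i] = 1 when a word starts
   at s[i] (a letter preceded by a non-letter), else 0; p[0] is 0.  */
static void ascii_wordbreaks (const char *s, size_t n, char *p)
{
  for (size_t i = 0; i < n; i++)
    p[i] = (i > 0
            && isalpha ((unsigned char) s[i])
            && !isalpha ((unsigned char) s[i - 1]));
}

/* Model of titlecasing driven by a break array: the first letter of the
   string, and the first letter at or after each break, is uppercased;
   every other letter is lowercased.  */
static void ascii_totitle (char *s, size_t n, const char *breaks)
{
  assert (n == 0 || (s != NULL && breaks != NULL));
  int word_start = 1;
  for (size_t i = 0; i < n; i++)
    {
      unsigned char c = (unsigned char) s[i];
      if (breaks[i])
        word_start = 1;
      if (isalpha (c))
        {
          s[i] = (char) (word_start ? toupper (c) : tolower (c));
          word_start = 0;
        }
    }
}
```

Filling the array with ascii_wordbreaks turns "foo bar" into "Foo Bar";
filling it with memset (breaks, 0, n), as the macro above does, turns it
into "Foo bar", since the whole string then counts as a single word.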

> No, these functions have an arbitrary long lookahead and an arbitrary
> long "look backwards". They don't need to look across lines, though.

Ouch. :-)  Only looking ahead/behind for combining characters (for
Lithuanian i) and end-of-word (for Greek sigma), or even more generally?

> [The need to call strlen manually] is more or less desired

Ok.

>>> @deftypefun {uint8_t *} u8_cpy_alloc (const uint8_t *@var{s}, size_t @var{n})
>> Why not u8_dup?
> 
> Indeed, that would make a better analogy with u8_strdup. But OTOH, the
> C function dup() does something entirely different...

I see the point.

Paolo
