Re: [bug-libunistring] observations on the manual

From:	Bruno Haible
Subject:	Re: [bug-libunistring] observations on the manual
Date:	Mon, 11 May 2009 00:34:34 +0200
User-agent:	KMail/1.9.9

Hello Paolo,

Paolo Bonzini wrote on 2009-04-29:
> > The function uc_locale_language() exists precisely for this purpose.
> > You are supposed to call it once. It speeds up the u*_tolower etc. functions
> > to not have to look up the current locale over and over again.
> 
> I think the function should be mentioned.

The function is mentioned just a few lines away from the functions that
take an iso639_language argument.

> the question is that it would be nice if these two sed commands
> 
>      s/[Α-Ωα-ω]/\L&/g
>      s/[Α-Ωα-ω]*/\L&/g
> 
> would be equivalent on a version of GNU sed using Unicode case mappings
> for its \L\l\U\u extension.

Yes, these can be equivalent if, for each substring to be transformed to
lowercase (a 1-character string in the first case), the context is
specified in the form of the entire string. This is just the
casemap-a-substring functionality that you requested a few days ago.
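
A minimal sketch of why the surrounding string matters, using Python's
built-in str.lower() as a stand-in for a Unicode-aware \L (an assumption
for illustration; this is neither GNU sed nor the libunistring API).
Lowercasing a word one character at a time loses the end-of-word context
that the Greek final sigma rule needs:

```python
# Illustration only: Python's str.lower() stands in for a context-aware
# lowercase mapping. Requires Python 3.3 or newer for the final sigma rule.

word = "\u039f\u0394\u039f\u03a3"  # GREEK CAPITAL OMICRON, DELTA, OMICRON, SIGMA

# Like s/[Α-Ωα-ω]/\L&/g: every character is mapped in isolation,
# so the mapping never sees what precedes or follows the sigma.
per_character = "".join(ch.lower() for ch in word)

# Like s/[Α-Ωα-ω]*/\L&/g: the whole word is mapped at once, so the
# mapping can see that the sigma is the last letter of the word.
whole_word = word.lower()

print(per_character)  # ends in U+03C3 GREEK SMALL LETTER SIGMA
print(whole_word)     # ends in U+03C2 GREEK SMALL LETTER FINAL SIGMA
```

The two results differ only in the last character, which is exactly the
case where the 1-character substring needs the entire string as context.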

> By the way, it would be nice to have an example of titlecase.  It is not
> clear right now from the documentation if "foo bar" would be converted
> to "Foo bar" or "Foo Bar".  In other words, a definition of
> "titlecasing" would be useful.

Good point. I'm adding a piece of doc about it, see below.

> Also, it seems the latter from reading the code, so having sample code
> in the documentation on how to do the former kind of conversion would be
> nice.  I suppose that would be like:
> 
>    call uN_wordbreaks
>    look for the second wordbreak
>    use uN_totitle until the second wordbreak, excluded
>    use uN_tolower from the second wordbreak on

This could be some sample code, probably outside of the doc. Would you
like to help write it?
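
A rough sketch of the two conversions, written in Python rather than
against libunistring. The helper names are hypothetical; str.title() and
a \w+ regular expression stand in for uN_totitle and uN_wordbreaks, and
their notion of a word is simpler than the Unicode word break rules:

```python
import re


def titlecase_every_word(s):
    """Map "foo bar" to "Foo Bar": title case the start of each word."""
    return s.title()


def titlecase_first_word_only(s):
    """Map "foo bar" to "Foo bar", following the steps quoted above."""
    first_word = re.search(r"\w+", s)
    if first_word is None:
        # No word at all: nothing to title case, just lowercase.
        return s.lower()
    end = first_word.end()  # plays the role of the second word break
    # Title case up to the end of the first word, lowercase the rest.
    return s[:end].title() + s[end:].lower()


print(titlecase_every_word("foo bar"))       # Foo Bar
print(titlecase_first_word_only("foo bar"))  # Foo bar
```

Note that lowercasing the tail separately drops the context of the first
word, so a real implementation would want the context-aware variant.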

> Having a function for this would be better though, because the function
> would not need to find wordbreaks at all.  In fact, just defining
> 
>   #define U_WORDBREAKS(s, n, wordbreaks)  memset ((wordbreaks), 0, (n))
> 
> and using u-totitle.h is enough, if suboptimal.

This should IMO be a sample as well, not a built-in function. The
transformation that you are suggesting (titlecase the first letter in the
string, lowercase the rest) is "sentence capitalization" for some languages.
But in other languages the sentence capitalization is different.
  http://en.wikipedia.org/wiki/Capitalization
  http://en.wikipedia.org/wiki/Sentence_case

> > No, these functions have an arbitrary long lookahead and an arbitrary
> > long "look backwards". They don't need to look across lines, though.
> 
> Ouch. :-)  Only looking ahead/behind for combining characters (for
> Lithuanian i) and end-of-word (for Greek sigma), or even more generally?

There are a couple of cases that require looking ahead/behind. See
  http://www.unicode.org/versions/Unicode5.0.0/ch03.pdf
table 3-14 on page 124.

Bruno


2009-05-10  Bruno Haible  <address@hidden>

        * doc/unicase.texi (Case mappings of characters,
        Case mappings of strings): Explain what title case is.
        Reported by Paolo Bonzini.

--- doc/unicase.texi.orig       2009-05-11 00:24:37.000000000 +0200
+++ doc/unicase.texi    2009-05-11 00:23:53.000000000 +0200
@@ -42,6 +42,33 @@
 
 @deftypefun ucs4_t uc_totitle (ucs4_t @var{uc})
 Returns the titlecase mapping of the Unicode character @var{uc}.
+
+The titlecase mapping of a character is to be used when the character should
+look like upper case and the following characters are lower cased.
+
+For most characters, this is the same as the uppercase mapping.  There are
+only a few characters where the title case variant and the upper case variant
+are different.  These characters occur in the Latin writing of the Croatian,
+Bosnian, and Serbian languages.
+
+@c Normally we would use .33 space for each column, but this is too much in
+@c TeX mode, see
+@c <http://lists.gnu.org/archive/html/bug-texinfo/2009-05/msg00016.html>.
+@multitable @columnfractions .31 .31 .31
+@headitem Lower case @tab Title case @tab Upper case
+@item LATIN SMALL LETTER LJ
+ @tab LATIN CAPITAL LETTER L WITH SMALL LETTER J
+ @tab LATIN CAPITAL LETTER LJ
+@item LATIN SMALL LETTER NJ
+ @tab LATIN CAPITAL LETTER N WITH SMALL LETTER J
+ @tab LATIN CAPITAL LETTER NJ
+@item LATIN SMALL LETTER DZ
+ @tab LATIN CAPITAL LETTER D WITH SMALL LETTER Z
+ @tab LATIN CAPITAL LETTER DZ
+@item LATIN SMALL LETTER DZ WITH CARON
+ @tab LATIN CAPITAL LETTER D WITH SMALL LETTER Z WITH CARON
+ @tab LATIN CAPITAL LETTER DZ WITH CARON
+@end multitable
 @end deftypefun
 
 @node Case mappings of strings
@@ -94,6 +121,10 @@
 @deftypefunx {uint32_t *} u32_totitle (const uint32_t *@var{s}, size_t @var{n}, const char *@var{iso639_language}, uninorm_t @var{nf}, uint32_t *@var{resultbuf}, size_t *@var{lengthp})
 Returns the titlecase mapping of a string.
 
+Mapping to title case means that, in each word, the first cased character
+is mapped to title case and the remaining characters of the word
+are mapped to lower case.
+
 The @var{nf} argument identifies the normalization form to apply after the
 case-mapping.  It can also be NULL, for no normalization.
 @end deftypefun
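
As a cross-check of the table added to doc/unicase.texi, the following
sketch prints the three case variants of each digraph. It uses Python's
unicodedata module and str methods as a stand-in for uc_tolower,
uc_totitle and uc_toupper (an assumption for illustration, not the
libunistring API); the result depends on the Unicode data shipped with
the Python interpreter.

```python
import unicodedata

# Lower case digraphs from the table: lj, nj, dz, dz with caron.
LOWER_CASE_DIGRAPHS = "\u01c9\u01cc\u01f3\u01c6"


def case_variants(ch):
    """Return the names of the lower, title and upper case forms of ch."""
    forms = (ch.lower(), ch.title(), ch.upper())
    return tuple(unicodedata.name(form) for form in forms)


# One line per table row: lower case | title case | upper case.
for ch in LOWER_CASE_DIGRAPHS:
    print(" | ".join(case_variants(ch)))
```

For these four characters the title case form is a distinct code point,
neither the lower case nor the upper case one.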
