bug-coreutils

bug#17196: UTF-8 printf string formating problem


From: Steffen Nurpmeso
Subject: bug#17196: UTF-8 printf string formating problem
Date: Thu, 10 Apr 2014 18:16:24 +0200
User-agent: s-nail v14.6.4-1-ga39836e

Rich Felker <address@hidden> wrote:
 |On Wed, Apr 09, 2014 at 02:49:37PM +0200, Steffen Nurpmeso wrote:
 |> Eric Blake <address@hidden> wrote:
 |>|Hmm.  POSIX requires support for %ls (aka %S) according to byte counts,
 |>|and currently states that %Ls is undefined.  But I would LOVE to have a
 |>|standardized spelling for counting characters instead of bytes.  The
 |>|extension %Ls looks like a good candidate for standardization, precisely
 |>|because counting characters when printing a multibyte string is more
 |>|useful than counting bytes (you do NOT want to end in the middle of a
 |>|multibyte character), and because ksh offers it as existing practice.
 |>|
 |>|Your idea for counting "cells" (by which I'm assuming you mean one or
 |>|more characters that all display within the same cell of the terminal,
 |>|as if the end user saw only one grapheme), on the other hand, does not
 |>|seem to have any precedence, and I would strongly object to having %s
 [.]
 |> I see you are trying to invent the word character for code points
 |> and reserve the term "graphem" for user-perceived characters.
 |> This goes in line with the GNU library which has the existing
 |> practice to let wcwidth(3) return the value 1 for accents and
 |> other combining code points as well as so-called (Unicode)
 |> noncharacters.  And who would call wcwidth(3) on something that is
 |> not to be drawn onto the screen directly afterwards.  And, of
 |> course, which terminal will perform the composition of code points
 |> written via STD I/O to characters on its own.
 |> I think for quite a while it is up to the input methods to combine
 |> into something precomposed in order to let POSIX programs finally
 |> work with it.
 |
 |Many languages do not have precomposed forms for all the character
 |sequences they need, and for some, it would not even be practical to
 |have precomposed forms, and would force the use of complex input
 |methods instead of simple keyboard maps.

And of course, with UTF-8, decomposed forms of characters from an
immense number of languages can occur, at least in theory, in,
e.g., a text file.
The German U+00FC (LATIN SMALL LETTER U WITH DIAERESIS) «ü» could
just as well arrive as U+0075 U+0308 «u ̈», depending on where it
came from.  Note that my vim(1) automatically composed U+00FC when
I tried to input the latter sequence; I had to enter the two code
points separately and join them afterwards to get «u» plus an
actually non-combining diaeresis.  (In fact it was «combining with
a space».)
Of course a wcwidth(3) of 1 for U+0308 is much better than 0 when
it really produces something visual.
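
The two spellings of «ü» can be compared with a small sketch.  It
uses Python's standard unicodedata module purely for illustration;
the variable names are made up here, and nothing in it reflects how
printf(1) or any terminal treats the sequences.

```python
import unicodedata

precomposed = "\u00fc"   # LATIN SMALL LETTER U WITH DIAERESIS
decomposed = "u\u0308"   # 'u' followed by COMBINING DIAERESIS

# Both spellings are canonically equivalent: NFC composes, NFD decomposes.
assert unicodedata.normalize("NFC", decomposed) == precomposed
assert unicodedata.normalize("NFD", precomposed) == decomposed

# Byte and code point counts differ although the user sees one letter.
for s in (precomposed, decomposed):
    print(len(s.encode("utf-8")), len(s), [unicodedata.name(c) for c in s])

# Not every sequence has a precomposed form: 'q' + COMBINING TILDE stays
# two code points even after NFC normalization.
no_precomposed = unicodedata.normalize("NFC", "q\u0303")
print(len(no_precomposed))
```

The precomposed form takes 2 bytes and 1 code point, the decomposed
form 3 bytes and 2 code points, so a byte or code point limit treats
the same visible letter differently depending on where the text came
from.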

Even better, nonetheless, would be the bigger picture: a termios(4)
IUTF8 flag, some extended xywidth(3) that returns a tuple of
{[EastAsianWidth indication,] is-combining, width-if-non-combining}
and, best of all, some composition function.
I don't think that «user-perceived characters don't have any
precedence».  A whole lot of development in the past decade on the
winner side (that is, the other :) was exactly that -- making
software barrier-free.
If POSIX moves to UTF-8 it should really consider offering a way
to act on what the user really deals with.
And that is, in the Unicode world -- and isn't that what the bug
report is about? --, not necessarily an mbrlen(3) division of bytes.
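
A rough sketch of what such a tuple-returning helper might look
like, and of the difference between counting bytes, code points and
user-perceived characters.  Everything here is an assumption made
for illustration: xywidth() is a hypothetical stand-in (no such
function exists in libc), and the helpers clusters() and truncate()
are invented names.  Grouping by a non-zero canonical combining
class is only an approximation of the grapheme cluster rules in
Unicode Standard Annex #29.

```python
import unicodedata

def xywidth(ch):
    """Hypothetical stand-in for the proposed xywidth(3).

    Returns (East Asian Width class, is-combining, width-if-non-combining).
    """
    eaw = unicodedata.east_asian_width(ch)
    is_combining = unicodedata.combining(ch) != 0
    width = 2 if eaw in ("W", "F") else 1
    return eaw, is_combining, width

def clusters(text):
    """Group each base code point with the combining marks that follow it."""
    out = []
    for ch in text:
        if out and xywidth(ch)[1]:
            out[-1] += ch      # attach the mark to the preceding base
        else:
            out.append(ch)     # start a new user-perceived character
    return out

def truncate(text, n):
    """Keep the first n user-perceived characters, never splitting one."""
    return "".join(clusters(text)[:n])

s = "a\u0301\u2605\u2605\u2605"   # same input as the ksh example in this thread
print(len(s.encode("utf-8")), len(s), len(clusters(s)))
print(truncate(s, 3))
```

For this input the three counts are 12 bytes, 5 code points and
4 clusters, and truncating to 3 clusters keeps «á★★» without
separating the accent from its base letter.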

--steffen
--- Begin Message ---
Subject: Re: bug#17196: UTF-8 printf string formating problem
Date: Thu, 10 Apr 2014 03:56:10 -0400
User-agent: Mutt/1.5.21 (2010-09-15)

On Wed, Apr 09, 2014 at 02:49:37PM +0200, Steffen Nurpmeso wrote:
> Eric Blake <address@hidden> wrote:
>  |>>   Dan Douglas wrote:
>  |>>> ksh93 already has this feature using the "L" modifier:
>  |>>> 
>  |>>> ksh -c "printf '%.3Ls\n' $'\u2605\u2605\u2605\u2605\u2605'"
>  |>>> ★★★
>  |>>
>  |>> At least there is prior art for it.
>  |> 
>  |> So we can count bytes, chars or cells (graphemes).
>  |> 
>  |> Thinking a bit more about it, I think shell level printf
>  |> should be dealing in text of the current encoding and counting cells.
>  |> In the edge case where you want to deal in bytes one can do:
>  |>   LC_ALL=C printf ...
>  |> 
>  |> I see that ksh behaves as I would expect and counts cells,
>  |> though requires the explicit %L enabler:
>  |>   $ ksh -c "printf '%.3Ls\n' $'a\u0301\u2605\u2605\u2605'"
>  |>   á★★
>  |>   $ ksh -c "printf '%.3Ls\n' $'A\u2605\u2605\u2605'"
>  |>   A★
>  |>   $ ksh -c "printf '%.3Ls\n' $'AA\u2605\u2605\u2605'"
>  |>   A
>  |> 
>  |> zsh seems to just count characters:
>  |>   $ zsh -c "printf '%.3Ls\n' $'a\u0301\u2605\u2605\u2605'"
>  |>   á★
>  |>   $ zsh -c "printf '%.3s\n' $'a\u0301\u2605\u2605\u2605'"
>  |>   á★
>  |>   $ zsh -c "printf '%.3Ls\n' $'A\u2605\u2605\u2605'"
>  |>   A★★
>  |> 
>  |> I see that dash gives invalid directive for any of %ls %Ls %S.
>  |> 
>  |> Pity there is no consensus here.
>  |> Personally I would go for:
>  |>   printf '%3s' 'blah'  # count cells
>  |>   printf '%3Ls' 'blah' # count chars
>  |>   LANG=C '%3Ls' 'blah' # count bytes
>  |>   LANG=C '%3s' 'blah'  # count bytes
>  |
>  |Hmm.  POSIX requires support for %ls (aka %S) according to byte counts,
>  |and currently states that %Ls is undefined.  But I would LOVE to have a
>  |standardized spelling for counting characters instead of bytes.  The
>  |extension %Ls looks like a good candidate for standardization, precisely
>  |because counting characters when printing a multibyte string is more
>  |useful than counting bytes (you do NOT want to end in the middle of a
>  |multibyte character), and because ksh offers it as existing practice.
>  |
>  |Your idea for counting "cells" (by which I'm assuming you mean one or
>  |more characters that all display within the same cell of the terminal,
>  |as if the end user saw only one grapheme), on the other hand, does not
>  |seem to have any precedence, and I would strongly object to having %s
>  |count by cells because %s already has a standardized (if unfortunate)
>  |meaning of counting by bytes.  Maybe yet another extension is warranted
>  |(perhaps %LLs?) as a new notion for counting by cells instead of
>  |characters, but it's harder to justify that without existing practice.
> 
> I see you are trying to invent the word character for code points
> and reserve the term "graphem" for user-perceived characters.
> This goes in line with the GNU library which has the existing
> practice to let wcwidth(3) return the value 1 for accents and
> other combining code points as well as so-called (Unicode)
> noncharacters.  And who would call wcwidth(3) on something that is
> not to be drawn onto the screen directly afterwards.  And, of
> course, which terminal will perform the composition of code points
> written via STD I/O to characters on its own.
> I think for quite a while it is up to the input methods to combine
> into something precomposed in order to let POSIX programs finally
> work with it.

Many languages do not have precomposed forms for all the character
sequences they need, and for some, it would not even be practical to
have precomposed forms, and would force the use of complex input
methods instead of simple keyboard maps.

Rich


--- End Message ---
