bug#17196: UTF-8 printf string formating problem

bug-coreutils

From:	Pádraig Brady
Subject:	bug#17196: UTF-8 printf string formating problem
Date:	Tue, 08 Apr 2014 01:11:13 +0100
User-agent:	Mozilla/5.0 (X11; Linux x86_64; rv:17.0) Gecko/20130110 Thunderbird/17.0.2

On 04/07/2014 10:57 PM, Eric Blake wrote:
> [adding the Austin Group]
> 
> On 04/07/2014 07:08 AM, Pádraig Brady wrote:
>> On 04/06/2014 07:24 PM, Bob Proulx wrote:
>>> Pádraig Brady wrote:
>>>> Yes printf follows the C standard which only considers bytes.
>>>> ...
>>>> I don't think we'd be able to change the current operation of printf
>>>> due to backwards compat reasons? Though we might be able to somehow
>>>> leverage the existing multibyte character aware alignment/truncation
>>>> code in:
>>>> http://git.sv.gnu.org/gitweb/?p=coreutils.git;a=blob;f=gl/lib/mbsalign.c;hb=HEAD
>>>
>>> Dan Douglas pointed out in the corresponding discussion in bug-bash
>>> that ksh uses the L modifier.
>>>
>>>   http://lists.gnu.org/archive/html/bug-bash/2014-04/msg00021.html
>>>
>>>   Dan Douglas wrote:
>>>   > ksh93 already has this feature using the "L" modifier:
>>>   > 
>>>   > ksh -c "printf '%.3Ls\n' $'\u2605\u2605\u2605\u2605\u2605'"
>>>   > ★★★
>>>
>>> At least there is prior art for it.
>>
>> So we can count bytes, chars or cells (graphemes).
>>
>> Thinking a bit more about it, I think shell level printf
>> should be dealing in text of the current encoding and counting cells.
>> In the edge case where you want to deal in bytes one can do:
>>   LC_ALL=C printf ...
>>
>> I see that ksh behaves as I would expect and counts cells,
>> though requires the explicit %L enabler:
>>   $ ksh -c "printf '%.3Ls\n' $'a\u0301\u2605\u2605\u2605'"
>>   á★★
>>   $ ksh -c "printf '%.3Ls\n' $'Ａ\u2605\u2605\u2605'"
>>   Ａ★
>>   $ ksh -c "printf '%.3Ls\n' $'ＡＡ\u2605\u2605\u2605'"
>>   Ａ
>>
>> zsh seems to just count characters:
>>   $ zsh -c "printf '%.3Ls\n' $'a\u0301\u2605\u2605\u2605'"
>>   á★
>>   $ zsh -c "printf '%.3s\n' $'a\u0301\u2605\u2605\u2605'"
>>   á★
>>   $ zsh -c "printf '%.3Ls\n' $'Ａ\u2605\u2605\u2605'"
>>   Ａ★★
>>
>> I see that dash gives invalid directive for any of %ls %Ls %S.
>>
>> Pity there is no consensus here.
>> Personally I would go for:
>>   printf '%3s' 'blah'  # count cells
>>   printf '%3Ls' 'blah' # count chars
>>   LANG=C printf '%3Ls' 'blah' # count bytes
>>   LANG=C printf '%3s' 'blah'  # count bytes
> 
> Hmm.  POSIX requires support for %ls (aka %S) according to byte counts,
> and currently states that %Ls is undefined.  But I would LOVE to have a
> standardized spelling for counting characters instead of bytes.  The
> extension %Ls looks like a good candidate for standardization, precisely
> because counting characters when printing a multibyte string is more
> useful than counting bytes (you do NOT want to end in the middle of a
> multibyte character), and because ksh offers it as existing practice.

Note that ksh seems to count cells with %Ls, not characters.
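
For reference, the byte-counting behavior of plain %s that makes
truncation awkward can be reproduced roughly as follows. This is only a
sketch: truncated_bytes is a made-up helper name, and it assumes a printf
that counts bytes under LC_ALL=C, with wc and tr available.

```shell
# truncated_bytes PRECISION STRING (hypothetical helper)
# Print how many bytes a byte-oriented '%.Ns' emits for STRING.
# LC_ALL=C forces byte semantics regardless of the caller's locale.
truncated_bytes() {
    LC_ALL=C printf "%.${1}s" "$2" | wc -c | tr -d ' '
}

# U+2605 BLACK STAR spelled as UTF-8 octal escapes: 3 bytes, 1 char, 1 cell.
star=$(printf '\342\230\205')

truncated_bytes 3 "$star$star"   # prints 3: exactly one whole star
truncated_bytes 4 "$star$star"   # prints 4: one star plus a dangling lead byte
```

With a precision of 4 the output ends in the middle of the second star,
which is the case a character or cell count would avoid.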

> Your idea for counting "cells" (by which I'm assuming you mean one or
> more characters that all display within the same cell of the terminal,
> as if the end user saw only one grapheme), on the other hand, does not
> seem to have any precedent, and I would strongly object to having %s
> count by cells because %s already has a standardized (if unfortunate)
> meaning of counting by bytes.  Maybe yet another extension is warranted
> (perhaps %LLs?) as a new notion for counting by cells instead of
> characters, but it's harder to justify that without existing practice.

At the shell level I expect that the vast majority
of uses would prefer to be specifying cell counts.
I thought there might not be many backwards compat issues
with doing that, especially since zsh and gawk adjust
the meaning of %s according to the locale
(albeit for char rather than cell count).

But it's a fair point that there may be scripts
that don't consider the zsh behavior.

If we had to make it explicit for backwards compat reasons,
then I suppose counting by characters is the least useful,
so we could just standardize the existing ksh behavior and have:

   printf '%3s' 'blah'  # count bytes
   printf '%3Ls' 'blah' # count cells
   LANG=C printf '%3Ls' 'blah' # count bytes

This has the disadvantage of not degrading gracefully
on dash for example where %Ls is rejected.
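
A script could work around that by probing for the modifier first.
The following is a sketch only: pad3 is a hypothetical helper name, and it
assumes that a printf which rejects %Ls exits nonzero (as I'd expect from
dash's "invalid directive" error) rather than silently printing nothing.

```shell
# pad3 STRING (hypothetical helper)
# Right-align STRING in a 3-wide field, preferring %Ls where available.
pad3() {
    # Probe: does this shell's printf accept the L modifier on %s?
    # Assumption: an unsupported directive makes printf exit nonzero.
    if printf '%1Ls' x >/dev/null 2>&1; then
        printf '%3Ls\n' "$1"
    else
        # Fall back to the standardized byte-counting %s.
        printf '%3s\n' "$1"
    fi
}

pad3 ab
```

For ASCII input both branches should give the same padding; they would only
differ for multibyte text, and only where %Ls is actually implemented.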

thanks,
Pádraig.
