[bug-gawk] Possible printf %c width multi-byte bug

From:	Nethox
Subject:	[bug-gawk] Possible printf %c width multi-byte bug
Date:	Fri, 10 May 2013 04:06:37 +0200
User-agent:	Mozilla/5.0 (X11; Linux i686; rv:17.0) Gecko/20130413 Icedove/17.0.5

I am not sure whether the following is a bug or intended behaviour, but
I find gawk's printf %c and %s inconsistent when a width is specified
and a multi-byte encoding (UTF-8) is in use.


Test program:
        BEGIN { printf "%2c\n", "ú" }
Versions:
        GNU Awk 4.0.75, API: 0.0
        GNU Awk 4.0.1
        mawk 1.3.3 Nov 1996

Command line                                            |Output
--------------------------------------------------------+------
LC_ALL=C.UTF-8 gawk 'BEGIN { printf "%2c\n", "ú" }'     |ú    <-- ???
LC_ALL=C       gawk 'BEGIN { printf "%2c\n", "ú" }'     | Ã
LC_ALL=C.UTF-8 gawk 'BEGIN { printf "%2s\n", "ú" }'     | ú
LC_ALL=C       gawk 'BEGIN { printf "%2s\n", "ú" }'     |ú

LC_ALL=C.UTF-8 gawk -b 'BEGIN { printf "%2c\n", "ú" }'  | Ã
LC_ALL=C       gawk -b 'BEGIN { printf "%2c\n", "ú" }'  | Ã
LC_ALL=C.UTF-8 gawk -b 'BEGIN { printf "%2s\n", "ú" }'  |ú
LC_ALL=C       gawk -b 'BEGIN { printf "%2s\n", "ú" }'  |ú

Both gawk versions produce the same output, and the -c and -P options
never alter it. gawk -b and mawk also agree with each other (neither
has multi-byte support).
Java 7's printf (Formatter) outputs the correct " ú" with %2c and %2s.

The "ú" character is Unicode U+00FA: LATIN SMALL LETTER U WITH ACUTE.
In UTF-8 it is encoded as 2 bytes: 0xC3 0xBA.
When multi-byte support is not available, the gibberish result is "Ã",
because only the first byte (0xC3) is used and the second (0xBA) is
discarded.
The "Ã" character is Unicode U+00C3: LATIN CAPITAL LETTER A WITH TILDE.
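
To make the byte-level explanation above concrete, here is a minimal
shell sketch that dumps the bytes of "ú" and then only its first byte.
The helper names hex_bytes and first_byte_hex are introduced purely for
illustration (they are not part of gawk), and the sketch assumes only a
POSIX printf, od and tr.

```shell
# Print all bytes of the argument as lowercase hex, with no separators.
hex_bytes() {
    printf '%s' "$1" | od -An -tx1 | tr -d ' \n'
}

# Print only the first byte of the argument as hex (od -N1 reads 1 byte).
first_byte_hex() {
    printf '%s' "$1" | od -An -tx1 -N1 | tr -d ' \n'
}

# "ú" written as explicit UTF-8 bytes (octal 303 272 = hex C3 BA), so the
# script does not depend on the encoding of this file or the terminal.
u_acute=$(printf '\303\272')

hex_bytes "$u_acute"; echo        # prints c3ba
first_byte_hex "$u_acute"; echo   # prints c3
```

The second value is the lone byte that a byte-oriented %c would emit,
which a Latin-1 display then shows as "Ã".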

I have not found related tests in test/print* and test/mb* .


The problem I see is in the first command with %c, where I expected:
LC_ALL=C.UTF-8 gawk 'BEGIN { printf "%2c\n", "ú" }'    | ú

Which would be consistent with the padding behaviour of gawk's printf
%2s, length() and other functions which count chars and not bytes when
the user locale is UTF-8, and also with the man page:
%c      A single character.  If the argument used for %c is numeric, it
        is treated as a character and printed.  Otherwise, the argument
        is assumed to be a string, and the only first character of that
        string is printed.
I think this is the latest POSIX spec, which also talks about characters:
http://pubs.opengroup.org/onlinepubs/9699919799/utilities/awk.html
        7. For the c conversion specifier character: [...] If the
        argument does not have a numeric value, the first character of
        the string value shall be output; if the string does not contain
        any characters, the behavior is undefined.

Currently, both the padding and the content (gibberish in one case)
change when the user locale changes. As a workaround,  printf "%2s\n",
substr("último", 1, 1)  cuts out the first character of a longer
multi-byte encoded string.
If this is due to POSIX compliance, maybe the current behaviour could be
kept in -c or -P modes, while by default both %c and %s would print
padded characters in UTF-8 and other multi-byte locales.
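
The substr() workaround above could be wrapped in a small shell helper,
sketched here. The name pad_first_char is hypothetical, and "awk" is
assumed to be on the PATH. Whether a multi-byte character is treated as
one character depends on the awk implementation and the locale; per the
results above, that needs gawk without -b in a UTF-8 locale.

```shell
# Print the first character of string $1, right-aligned in a field of
# width $2, using %s with substr() instead of %c.
pad_first_char() {
    awk 'BEGIN {
        s = ARGV[1]               # the string, passed as a plain argument
        w = ARGV[2]               # the field width
        fmt = "%" w "s\n"         # build e.g. "%2s\n" from the width
        printf fmt, substr(s, 1, 1)
    }' "$1" "$2"
}

# ASCII input behaves the same in any awk and any locale.
pad_first_char "abc" 3

# The case from this report; run it with gawk in a UTF-8 locale.
pad_first_char "último" 2
```

The string is passed through ARGV rather than -v so that awk does not
interpret backslash escapes in it, and a program with only a BEGIN rule
never tries to open the arguments as files.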

Thanks and regards.
