[bug-gawk] Possible printf %c width multi-byte bug
From: Nethox
Subject: [bug-gawk] Possible printf %c width multi-byte bug
Date: Fri, 10 May 2013 04:06:37 +0200
User-agent: Mozilla/5.0 (X11; Linux i686; rv:17.0) Gecko/20130413 Icedove/17.0.5
I am not sure whether the following is a bug or intended behaviour, but I
find gawk's printf %c and %s inconsistent when a width is specified and
a multi-byte encoding (UTF-8) is in use.
Test program:
BEGIN { printf "%2c\n", "ú" }
Versions:
GNU Awk 4.0.75, API: 0.0
GNU Awk 4.0.1
mawk 1.3.3 Nov 1996
Command line |Output
--------------------------------------------------------+------
LC_ALL=C.UTF-8 gawk 'BEGIN { printf "%2c\n", "ú" }' |ú <-- ???
LC_ALL=C gawk 'BEGIN { printf "%2c\n", "ú" }' | Ã
LC_ALL=C.UTF-8 gawk 'BEGIN { printf "%2s\n", "ú" }' | ú
LC_ALL=C gawk 'BEGIN { printf "%2s\n", "ú" }' |ú
LC_ALL=C.UTF-8 gawk -b 'BEGIN { printf "%2c\n", "ú" }' | Ã
LC_ALL=C gawk -b 'BEGIN { printf "%2c\n", "ú" }' | Ã
LC_ALL=C.UTF-8 gawk -b 'BEGIN { printf "%2s\n", "ú" }' |ú
LC_ALL=C gawk -b 'BEGIN { printf "%2s\n", "ú" }' |ú
Both gawk versions produce the same output. Options -c and -P never alter it.
gawk -b and mawk also agree with each other (lack of multi-byte support).
Java 7's printf (Formatter) outputs the correct " ú" with %2c and %2s.
The "ú" character is Unicode U+00FA: LATIN SMALL LETTER U WITH ACUTE.
In UTF-8 it is encoded with 2 bytes: 0xC3 0xBA.
When multi-byte support is not available, the gibberish result is "Ã",
because only the first byte (0xC3) is used and the second (0xBA) is discarded.
The "Ã" character is Unicode U+00C3: LATIN CAPITAL LETTER A WITH TILDE.
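The byte-level claims above can be checked from the shell. This is only a minimal sketch, assuming a POSIX `printf` and `od` plus an `iconv` that knows ISO-8859-1; the variable names are made up for illustration. The octal escapes `\303\272` are the two UTF-8 bytes of "ú", so the snippet does not depend on how the script file itself is encoded.

```shell
# Hex dump of the UTF-8 encoding of U+00FA (ú): two bytes, C3 BA.
u_acute_bytes=$(printf '\303\272' | od -An -tx1 | tr -d ' \n')
echo "$u_acute_bytes"

# Keep only the first byte (0xC3), read it as ISO-8859-1, and re-encode it
# as UTF-8. The result is the encoding of U+00C3 (Ã): C3 83.
first_byte_as_latin1=$(printf '\303' | iconv -f ISO-8859-1 -t UTF-8 | od -An -tx1 | tr -d ' \n')
echo "$first_byte_as_latin1"
```

The second value is what a byte-oriented %c effectively hands to a terminal or viewer that treats the lone byte as Latin-1.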
I have not found related tests in test/print* and test/mb* .
The problem I see is in the first command with %c, where I expected:
LC_ALL=C.UTF-8 gawk 'BEGIN { printf "%2c\n", "ú" }' | ú
Which would be consistent with the padding behaviour of gawk's printf
%2s, length() and other functions which count chars and not bytes when
the user locale is UTF-8, and also with the man page:
%c A single character. If the argument used for %c is numeric, it
is treated as a character and printed. Otherwise, the argument
is assumed to be a string, and only the first character of that
string is printed.
I think this is the latest POSIX spec, which also talks about characters:
http://pubs.opengroup.org/onlinepubs/9699919799/utilities/awk.html
7. For the c conversion specifier character: [...] If the
argument does not have a numeric value, the first character of
the string value shall be output; if the string does not contain
any characters, the behavior is undefined.
Now, both the padding and the content (gibberish in one case) change
when the user locale changes. As a workaround, printf "%2s\n",
substr("último", 1, 1) extracts the first character of a longer
multi-byte encoded string and pads it by characters rather than bytes.
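The workaround can be wrapped in a small shell helper. This is a sketch under assumptions, not a fix for %c itself: `pad_first_char` is a hypothetical name, and it assumes gawk is installed and that the C.UTF-8 locale exists on the system (as on the Debian-style setup used for the table above).

```shell
# pad_first_char WIDTH STRING
# Print the first character of STRING right-aligned in WIDTH columns.
# Uses %s with substr() instead of %c, so gawk counts characters, not bytes.
# Assumption: gawk is available and the C.UTF-8 locale is installed.
pad_first_char() {
    LC_ALL=C.UTF-8 gawk -v w="$1" -v s="$2" \
        'BEGIN { printf "%*s\n", w, substr(s, 1, 1) }'
}
```

With that in place, `pad_first_char 2 "último"` should print a space followed by "ú", matching the %2s row for LC_ALL=C.UTF-8 in the table.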
If this is due to POSIX compliance, maybe the current behaviour could be
kept in -c or -P mode, while by default both %c and %s pad by characters
in UTF-8 and other multi-byte locales.
Thanks and regards.