bug#12285: uniq on a UTF8 file with roman numerals
From: Pádraig Brady
Subject: bug#12285: uniq on a UTF8 file with roman numerals
Date: Mon, 27 Aug 2012 01:16:24 +0100
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:6.0) Gecko/20110816 Thunderbird/6.0
tag 12285 + notabug
close 12285
stop
more info below...
On 08/26/2012 09:53 PM, Pádraig Brady wrote:
> On 08/26/2012 06:49 PM, P. Michaud wrote:
>> Hello,
>>
>> I used the command
>>
>> "uniq -dc myfile.txt"
>>
>> here are some lines of the output
>>
>> 2 ☼ turvy
>> 2 ☼ with gay abandon
>> 2 ☼ with reckless abandon
>> 10 ☼ yyⅰ
>> 9 ☼ yyⅹⅲ
>> 2 ☼ yyⅺ
>> 12 ☼ zzⅰ
>>
>>
>> The first three lines above are correct and correspond to real duplicate
>> lines in the file, but the counts on the last four are erroneous: each of
>> them corresponds to a single line in the file.
>>
>> Yours faithfully.
>>
>> Pierre Michaud
>
> What system are you on?
> What version of uniq?
> What is the input exactly?
>
> I suspect your locale is equating roman numerals (though that is surprising),
It seems that these roman numerals are treated as equal in collating order,
so uniq is behaving as expected:
$ sort <(printf "%s\n" ⅲ ⅱ ⅰ)
ⅲ
ⅱ
ⅰ
$ uniq -dc <(printf "%s\n" ⅲ ⅱ ⅰ)
3 ⅲ
You can avoid this behaviour by forcing a byte-wise comparison
with LC_ALL=C:
$ LC_ALL=C sort <(printf "%s\n" ⅲ ⅱ ⅰ)
ⅰ
ⅱ
ⅲ
$ LC_ALL=C uniq -c <(printf "%s\n" ⅲ ⅱ ⅰ)
1 ⅲ
1 ⅱ
1 ⅰ
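For a real file, the same idea would apply to the whole pipeline. The
following is only a sketch: the file name sample.txt and its contents are
made up for illustration, and it assumes the file is not already sorted,
so sort is run first with the same byte-wise comparison.

```shell
# Hypothetical sample input; sample.txt is an illustrative name only.
printf '%s\n' ⅲ ⅱ ⅰ ⅰ > sample.txt

# Use LC_ALL=C for both commands so lines are compared byte by byte
# and distinct characters are never merged by the locale's collation.
result=$(LC_ALL=C sort sample.txt | LC_ALL=C uniq -dc)

# Only the line that really occurs twice (ⅰ) should be reported.
printf '%s\n' "$result"
```

Setting LC_ALL=C on sort as well keeps the ordering consistent with the
comparison uniq then makes on adjacent lines.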
thanks,
Pádraig.