bug#21916: sort -u drops unique lines with some locales
From: Christoph Anton Mitterer
Subject: bug#21916: sort -u drops unique lines with some locales
Date: Sat, 14 Nov 2015 22:19:55 +0100
Hey Pádraig
On Sat, 2015-11-14 at 11:06 +0000, Pádraig Brady wrote:
> Unfortunately the roman numeral code points compare equal:
>
> $ printf '%s\n' Ⅱ Ⅰ | ltrace -e strcoll sort
> sort->strcoll("\342\205\241", "\342\205\240") = 0
> Ⅱ
> Ⅰ
>
> If you compare at the byte level you'll get appropriate grouping:
>
> $ printf '%s\n' Ⅱ Ⅰ | LC_ALL=C sort
> Ⅰ
> Ⅱ
>
> The same goes for other similar representations,
> like full width forms of latin numbers:
>
> $ printf '%s\n' ２ １ | ltrace -e strcoll sort
> sort->strcoll("\357\274\222", "\357\274\221") = 0
> ２
> １
So the bug's basically in the locales?
> That's a bit surprising, though maybe since only a limited
> number of these representations are provided, it was
> not thought appropriate to provide collation orders for them.
Really strange...
> One thing we might do immediately, is maybe with the sort --debug option,
> to provide some indication when strcoll() and memcmp() differ in
> direction.
Well, I think the main problem here is that -u then doesn't actually do
what most people would expect from it.
AFAIU, it removes any lines that *collation would consider as
duplicate* ... and not any lines which *actually are duplicates*.
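To illustrate the difference, here is a minimal sketch, assuming a POSIX shell and a sort that honours LC_ALL. The helper name bytewise_uniq is made up for this example, and the octal escapes are the UTF-8 bytes of Ⅱ and Ⅰ from the ltrace output quoted above:

```shell
# bytewise_uniq: hypothetical helper, not part of coreutils.
# Forcing the C locale makes sort compare raw bytes, so -u only
# drops lines that are byte-for-byte identical.
bytewise_uniq() {
    LC_ALL=C sort -u "$@"
}

# Three input lines: Ⅱ, Ⅰ, Ⅱ (the second Ⅱ is a real duplicate).
printf '\342\205\241\n\342\205\240\n\342\205\241\n' | bytewise_uniq
```

With byte-wise comparison this prints Ⅰ and then Ⅱ, one per line: only the genuinely repeated line is removed.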
God knows how many scripts and other stuff this already breaks... and I
wonder whether any other tools may be badly affected by that collation
stuff, too...
Imagine you do a cp -a ... or diff -qr and these would leave out any
files they consider duplicates :-(
That could really result in data loss.
Actually that's how I stumbled over it... I made some lists with find,
of files which are then to be binary compared on a source and copy
filesystem... I ran the find output once through plain sort and once
through sort -u, and was quite shocked by the difference.
If I had taken the sort -u sorted list, then I might have lost some
files to copy / compare.
The semantics of -u are IMHO even more problematic, as it (AFAIU) won't
happen with LANG=C.
But normally people wouldn't expect that different locales lead to
completely different behaviour, especially with respect to collation -
they would only expect that things are ordered differently.
Does it seem possible for sort -u to print a warning on stderr
whenever -u drops lines which are considered identical in terms of
collation but which aren't really identical?
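Until something like that exists, the check can be approximated outside of sort. What follows is only a rough sketch under stated assumptions: sort_u_warn is a hypothetical wrapper name, not a sort feature, and it assumes a POSIX shell with mktemp available. It compares how many lines survive -u in the current locale with how many survive under byte-wise comparison:

```shell
# sort_u_warn: hypothetical wrapper, not a feature of GNU sort.
# Prints the result of "sort -u" in the current locale, and warns on
# stderr if byte-wise deduplication would have kept more lines.
sort_u_warn() {
    tmp=$(mktemp) || return 1
    # Buffer the input (files given as arguments, or stdin) so it can be read twice.
    cat -- "$@" > "$tmp"

    locale_n=$(sort -u < "$tmp" | wc -l)          # lines kept by collation
    bytes_n=$(LC_ALL=C sort -u < "$tmp" | wc -l)  # lines kept byte-wise

    if [ "$locale_n" -lt "$bytes_n" ]; then
        echo "sort_u_warn: $((bytes_n - locale_n)) line(s) dropped that differ at the byte level" >&2
    fi

    sort -u < "$tmp"
    rm -f "$tmp"
}

# Usage: the UTF-8 bytes of Ⅱ and Ⅰ from the ltrace output above.
printf '\342\205\241\n\342\205\240\n' | sort_u_warn
```

Whether the warning fires depends on the active locale and its collation tables; under LANG=C the two counts always agree.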
Cheers,
Chris.
btw: Is that bug tracker accessible somewhere? I'd like to mark
the Debian bug as having been forwarded to this one here.