bug-coreutils

bug#18540: Sorting bug?


From: Eric Blake
Subject: bug#18540: Sorting bug?
Date: Tue, 23 Sep 2014 14:58:01 -0600
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:31.0) Gecko/20100101 Thunderbird/31.1.0

On 09/23/2014 02:24 PM, Göran Uddeborg wrote:
> I discovered a behaviour of "sort" that looks like a bug to me.  When

Thanks for the report.  Most likely, it's not a bug in sort but in your
expectations, but let's analyze it as we go...

> one key in the input is an initial part of another key, the shorter
> key is sorted first if the key is all there is on the line.  But if
> there are other fields too, not included in the key, the order
> changes.  That is true even with the --stable flag, so "sort" seems to
> consider the order of the keys different in the two cases.
> 
> I sort in a non-C locale.  sv_SE.utf8 actually, but en_US.utf8 behaves
> the same so I illustrate using that.

Thanks for realizing that locale matters.  Not everyone does!

> 
> First case, the key is all there is on the line.  The shorter line
> gets sorted earlier, regardless of input order:
> 
>     address@hidden Hämtat]$ { echo 'binutils x86_64'; echo 'binutils-x86_64-linux-gnu x86_64'; } | LANG=en_US.utf8 sort --stable --debug --key=1,1 --field-separator=!

Wow - someone actually using the --debug option!  Makes diagnosis a LOT
easier, doesn't it!

Makes sense so far; in your locale, space and '-' collate identically, and
the rest of the line is a common prefix up to the shorter length.  Then
the NUL byte that ends the shorter line sorts before the suffix of the
second line.

> 
> Second case, the input lines contain a second field.  Now the longer
> field gets sorted earlier, regardless of input order:
> 

Oh my, it looks like you have indeed found an issue.  First, I'm going
to try and whittle it down to something that fits in a narrower window:

$ printf 'a b\na-b-c\n' | LANG=en_US.utf8 sort -s --debug -k1,1 -t!
sort: using ‘en_US.utf8’ sorting rules
a b
___
a-b-c
_____
$ printf 'a b!x\na-b-c!x\n' | LANG=en_US.utf8 sort -s --debug -k1,1 -t!
sort: using ‘en_US.utf8’ sorting rules
a-b-c!x
_____
a b!x
___


And of course, switching to the C locale makes the problem disappear,
even when I munge the prefix to be identical (rather than merely
collating identically).
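
A quick way to double-check the C-locale behaviour on the two reduced
inputs, as a minimal sketch.  It assumes a POSIX shell and uses LC_ALL=C
(my choice here, because it overrides both LANG and LC_COLLATE); the
variable names short and long are only for illustration, and the
separator is quoted so an interactive shell cannot treat '!' as history
expansion.

```shell
# Sort both reduced inputs on field 1 only, in the C locale.
# With -s there is no last-resort whole-line comparison.
short=$(printf 'a b\na-b-c\n' | LC_ALL=C sort -s -k1,1 -t '!')
long=$(printf 'a b!x\na-b-c!x\n' | LC_ALL=C sort -s -k1,1 -t '!')

# In the C locale, space (0x20) sorts before '-' (0x2d), so the
# "a b" line comes first in both cases.
printf '%s\n' "$short" "$long"
```

Here both inputs come out in the same relative order, which is what one
would expect from any locale if only the key were being compared.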

> I can't see any reason for this.  Is it me not understanding sorting,
> or is it actually a bug?

Let's look further:

$ printf 'a b\na-b-c\n' | LANG=en_US.utf8 ltrace -e strcoll sort -s --debug -k1,1 -t!
sort: using ‘en_US.utf8’ sorting rules
sort->strcoll("a b", "a-b-c")                    = -1
a b
___
a-b-c
_____
+++ exited (status 0) +++

ltrace says that we are indeed using strcoll(), and on the short form,
we are comparing the entire line.  That much is expected: with no '!'
in the input, the whole line is field 1.

Then on the longer form,

$ printf 'a b!x\na-b-c!x\n' | LANG=en_US.utf8 ltrace -e strcoll sort -s --debug -k1,1 -t!
sort: using ‘en_US.utf8’ sorting rules
sort->strcoll("a b!x", "a-b-c!x")                = 21
a-b-c!x
_____
a b!x
___
+++ exited (status 0) +++


Huh? Why are we passing the ENTIRE line to strcoll?  Shouldn't we only
be passing the key?
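
For reference, the strings that a key-only comparison would be expected
to see can be pulled out with cut.  This is only a sketch of the
expectation, not of what sort does internally; the variable name keys is
illustrative.

```shell
# Extract field 1 (the -k1,1 key) from the longer inputs, using '!'
# as the delimiter, the same separator given to sort with -t.
keys=$(printf 'a b!x\na-b-c!x\n' | cut -d '!' -f 1)

# Prints "a b" and "a-b-c": the same two strings as the short form.
printf '%s\n' "$keys"
```

Since the extracted keys are exactly the two lines of the short form,
a comparison restricted to the key should order both cases the same way.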

Count yourself lucky - you may have actually found a bug! Very few
people can claim to find sort bugs (most reports are due to faulty user
expectations).  I'm still not sure where the code is going wrong, but it
indeed looks like something we need to fix.

-- 
Eric Blake   eblake redhat com    +1-919-301-3266
Libvirt virtualization library http://libvirt.org
