emacs-bug-tracker

[debbugs-tracker] bug#22155: closed (Wrong char count with UTF8 in sort


From: GNU bug Tracking System
Subject: [debbugs-tracker] bug#22155: closed (Wrong char count with UTF8 in sort -k)
Date: Sun, 13 Dec 2015 02:33:02 +0000

Your message dated Sun, 13 Dec 2015 02:32:51 +0000
with message-id <address@hidden>
and subject line Re: bug#22155: Wrong char count with UTF8 in sort -k
has caused the debbugs.gnu.org bug report #22155,
regarding Wrong char count with UTF8 in sort -k
to be marked as done.

(If you believe you have received this mail in error, please contact
address@hidden)


-- 
22155: http://debbugs.gnu.org/cgi/bugreport.cgi?bug=22155
GNU Bug Tracking System
Contact address@hidden with problems
--- Begin Message ---
Subject: Wrong char count with UTF8 in sort -k
Date: Sat, 12 Dec 2015 23:53:40 +0100
User-agent: KMail/4.14.6 (Linux/3.19.0-39-generic; KDE/4.14.6; x86_64; ; )

Hello!

Given a text file "sort.bug.txt" with find output like this:

07. Feb 2015 15:57 ./mess.jpg
05. Mär 2015 13:30 ./mess.jpg

Basically two columns: a date and a filename. I want sort to discard the duplicate lines for the same file, using -u to keep only the first and -k to skip over the date column.

> sort sort.bug.txt -u -s -k 1.20 --debug
sort: using 'de_DE.UTF-8' sorting rules
sort: leading blanks are significant in key 1; consider also specifying 'b'
05. Mär 2015 13:30 ./mess.jpg
                  ___________
07. Feb 2015 15:57 ./mess.jpg
                   __________

As the underlines in debug mode show, the key's start position depends on whether the month name is pure ASCII or contains the German umlaut ä.
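
The one-character shift is consistent with the -k offset being counted in bytes rather than characters: in UTF-8, "ä" occupies two bytes, so offset 20 lands one position earlier on the "Mär" line. Below is a minimal sketch of that assumption. It uses cut -b, which always selects by byte, as a stand-in for sort's own key extraction; the variable names are illustrative only.

```shell
# Build the two sample lines; \303\244 is the UTF-8 encoding of "ä" (two bytes).
umlaut_line=$(printf '05. M\303\244r 2015 13:30 ./mess.jpg')
ascii_line=$(printf '07. Feb 2015 15:57 ./mess.jpg')

# cut -b selects by byte position, independent of the locale.
# Take everything from byte 20 onward, mirroring the start of key 1.20.
umlaut_key=$(printf '%s\n' "$umlaut_line" | cut -b 20-)
ascii_key=$(printf '%s\n' "$ascii_line" | cut -b 20-)

# Brackets make a leading space visible.
printf '[%s]\n' "$umlaut_key"   # prints "[ ./mess.jpg]"
printf '[%s]\n' "$ascii_key"    # prints "[./mess.jpg]"
```

If the assumption holds, the first key carries a leading space and the second does not, which would explain why the two underlines differ in length and why -u treats the lines as different.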

There's a hint suggesting option -b; this one-character offset could possibly be overcome that way, thanks to the separating whitespace between the columns.

> sort sort.bug.txt -u -s -k 1.20 -b --debug
sort: using 'de_DE.UTF-8' sorting rules
05. Mär 2015 13:30 ./mess.jpg
                   __________
07. Feb 2015 15:57 ./mess.jpg
                   __________

In fact, it does correct the underlines, but still -u gives both lines, though I want it to discard the second line. You can add more lines for the same file, but sort insists on keeping exactly two: one with Umlaut and the other without.
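
One way to sidestep character offsets altogether is to select the filename by field number instead. This is a sketch, not something proposed in this thread: it assumes the date always consists of exactly four blank-separated fields, so the filename starts at field 5, and it sets LC_ALL=C only to make the run reproducible. The variable names are illustrative.

```shell
# Recreate the two sample lines in input order; \303\244 is UTF-8 "ä".
sample=$(printf '07. Feb 2015 15:57 ./mess.jpg\n05. M\303\244r 2015 13:30 ./mess.jpg\n')

# -k 5 starts the key at the fifth blank-separated field (the filename),
# -b ignores the blank in front of it, -s keeps input order for equal keys,
# and -u drops all but one line per key.
deduped=$(printf '%s\n' "$sample" | LC_ALL=C sort -u -s -b -k 5)

printf '%s\n' "$deduped"
```

Because the key no longer depends on how many bytes the month name takes, both sample lines compare equal and only one of them is printed. With -s, the surviving line should be the first one in the input for each filename.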

This is: sort (GNU coreutils) 8.23

Thanks for the great utilities.
Holger

--
|_|/ MfG
| |\ Holger Klene

PGP Key ID: 0x22FFE57E

Attachment: signature.asc
Description: This is a digitally signed message part.

--- End Message ---
--- Begin Message ---
Subject: Re: bug#22155: Wrong char count with UTF8 in sort -k
Date: Sun, 13 Dec 2015 02:32:51 +0000
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:38.0) Gecko/20100101 Thunderbird/38.3.0

On 13/12/15 01:32, Pádraig Brady wrote:
> On 12/12/15 22:53, Holger Klene wrote:
>>> sort sort.bug.txt -u -s -k 1.20 -b --debug
>> sort: using 'de_DE.UTF-8' sorting rules
>> 05. Mär 2015 13:30 ./mess.jpg
>>                    __________
>> 07. Feb 2015 15:57 ./mess.jpg
>>                    __________
>>
>> In fact, it does correct the underlines, but still -u gives both lines, 
>> though I want it to discard the second line. You can add more lines for the 
>> same file, but sort insists on keeping exactly two: one with Umlaut and the 
>> other without.
> 
> That's a bug in --debug because the implementation was split
> from the actual processing done during the sort (for performance reasons).
> Therefore we'll need to fix --debug to show what's actually being done.

Patch attached.

thanks,
Pádraig.

Attachment: sort-debug-b.patch
Description: Text Data


--- End Message ---
