emacs-bug-tracker

[debbugs-tracker] bug#15450: closed (SORT failing on some lines)


From: GNU bug Tracking System
Subject: [debbugs-tracker] bug#15450: closed (SORT failing on some lines)
Date: Wed, 25 Sep 2013 19:42:03 +0000

Your message dated Wed, 25 Sep 2013 13:41:52 -0600
with message-id <address@hidden>
and subject line Re: bug#15450: SORT failing on some lines
has caused the debbugs.gnu.org bug report #15450,
regarding SORT failing on some lines
to be marked as done.

(If you believe you have received this mail in error, please contact
address@hidden)


-- 
15450: http://debbugs.gnu.org/cgi/bugreport.cgi?bug=15450
GNU Bug Tracking System
Contact address@hidden with problems
--- Begin Message ---
Subject: SORT failing on some lines
Date: Mon, 23 Sep 2013 05:28:38 +0300
User-agent: Internet Messaging Program (IMP) H3 (4.3.11)

Hi there.
I am using Ubuntu Linux and the SORT command to sort a 605MB index file I have created from a Wikipedia dump. Each line of the index file contains the article name, followed by a separator and then the byte offsets of the article within the Wikipedia dump file.

This index file needs to be sorted into alphabetical order so that it can be searched quickly. I have found that although most of the lines in the file are sorted correctly, some are not, and this is throwing off the index searching.

While most items are alphabetically sorted, the following occurs (for example):

"Universe (1960 film)"
"Universe"

"Yellow 2G"
"Yellow"

The lines are in the wrong order. My C++ program, which searches the index, expects "Universe" to come before "Universe (1960 film)" when doing a string compare.
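
For reference, a byte-wise string compare of the kind a C++ program typically performs corresponds to the ordering that sort uses in the C locale. A minimal sketch with the two titles above, assuming the surrounding quotes are not part of the data (the variable name is only for illustration):

```shell
# Sort the two example titles by raw byte value (C locale).
# In this ordering a string sorts before any longer string
# of which it is a prefix, so "Universe" comes out first.
titles_sorted=$(printf '%s\n' 'Universe (1960 film)' 'Universe' | LC_ALL=C sort)
printf '%s\n' "$titles_sorted"
```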

Interestingly, if I copy these problem lines into a separate text file and run SORT on them, it sorts correctly.
I have tried every switch combination I can think of but the problem remains.
I am wondering whether it has something to do with the size of the file I am trying to sort: 605 megabytes, about 10,000,000 lines of text. Again, most of the lines are sorted correctly, but some are not. I haven't checked exactly how many, but I keep finding them at random.

I would appreciate any help or comments you could offer.
Many thanks

Best regards,
Sam


Sam Brown
Netinetics Oy
PL 23
00251
Helsinki
Finland





--- End Message ---
--- Begin Message ---
Subject: Re: bug#15450: SORT failing on some lines
Date: Wed, 25 Sep 2013 13:41:52 -0600
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:17.0) Gecko/20130805 Thunderbird/17.0.8

tag 15450 -moreinfo
tag 15450 +notabug
thanks

On 09/25/2013 12:28 PM, address@hidden wrote:
> 
> Hello Eric,
> Thank you kindly for your speedy reply.
> I should apologize for the lack of information included with my email.
> It was a hurried one.

Re-adding the list for closure, with permission.

> 
> In fact your suggestions, the link, and a bit of tinkering have cured the
> problem. SORT works fine, it seems; I should have had more faith.
> The problem was purely with locale, which I read up on in the FAQ link
> you sent. I had looked at locale previously but had no success with it.
> I had also been trying various options for SORT, including -i, -d and
> even field separation (-t'#' -k1,1), again without luck. After reading
> through your reply I realized that it was the combination of these
> things which I hadn't got right.
> 
> I'd just like to add here for anybody else who stumbles across this same
> problem, a description of the problem I was having in more detail (now
> solved)
> 
> The text file was a 605MB list of titles extracted from Wikipedia, each
> followed by the separator #--# and then the 'long long' integer offsets
> of where the article appears in the (XML) dump file.
> Example lines:
> 
> Alps Electric#--#7701298893,12,24,364,394,420
> Alps Electric Co.#--#4280442890,12,28,339,3144,3170
> Alps Electric Corporation#--#9562165739,12,36,447,477,503
> 
> My machine was set to the en_GB locale, although I had switched this to
> en_US with the same (wrong) results.
> 
> It was necessary to set LC_ALL=C and also to instruct SORT to look only
> at the first field (up to the first #) using the -t'#' and -k1,1
> switches, as you mentioned.
> Obvious really, but needing both at once is what caused my confusion.
> 
> It is really worth reading up on locale for anybody using SORT and other
> utilities, as it can profoundly change the results of an operation.
> Even setting the locale to en_US doesn't help because, as I read in the
> FAQ you linked, en_US collation largely ignores case and punctuation
> when comparing lines.
> 
> I'm sorry for the bother - but you put me on the right track.
> Many thanks for that.

Glad to hear it.  As such, I've closed the bug in the tracker.
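
For anyone arriving with the same symptoms, the fix described above can be sketched as follows. The three sample lines are the ones quoted above; the printf pipeline stands in for the real index file, and the variable names are only for illustration. The sketch also shows why LC_ALL=C alone is not enough here: in byte order a space (0x20) sorts before '#' (0x23), so a whole-line comparison places "Alps Electric#..." after the longer titles.

```shell
# Three sample index lines from the report, deliberately out of order.
sample='Alps Electric Corporation#--#9562165739,12,36,447,477,503
Alps Electric#--#7701298893,12,24,364,394,420
Alps Electric Co.#--#4280442890,12,28,339,3144,3170'

# Whole-line comparison in the C locale: the separator takes part in
# the comparison, so the shortest title ends up last.
whole_line=$(printf '%s\n' "$sample" | LC_ALL=C sort)

# Restrict the key to the first '#'-delimited field, so that only the
# titles are compared, byte by byte.
by_title=$(printf '%s\n' "$sample" | LC_ALL=C sort -t'#' -k1,1)

printf 'whole line:\n%s\n\nby title:\n%s\n' "$whole_line" "$by_title"
```

On a real file the same options would be passed together with the file name. A program that searches the sorted result would then need to compare only the title part of each line, byte-wise, to agree with this ordering.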

-- 
Eric Blake   eblake redhat com    +1-919-301-3266
Libvirt virtualization library http://libvirt.org



--- End Message ---
