bug-coreutils
bug#9321: repeated segfaults sorting large files in 8.12


From: Andras Salamon
Subject: bug#9321: repeated segfaults sorting large files in 8.12
Date: Sat, 20 Aug 2011 21:58:57 +0100
User-agent: Mutt/1.5.21 (2010-09-15)

On Fri, Aug 19, 2011 at 11:54:46PM +0100, Pádraig Brady wrote:
> On 08/18/2011 03:30 PM, Andras Salamon wrote:
>> I am seeing repeated (but not reliably repeatable) segmentation faults
>> sorting datasets in the 100MB-100GB range on a 64-bit Debian system
>> using GNU sort 8.12 (and also 8.9).  Stack traces seem to indicate
>> problems during the merge phase, usually when the temporary files
>> are being combined.
>
> Andras, could you give the exact command line you're having issues with,
> and perhaps make sort inputs available too?

The sort inputs are several-gigabyte-range files containing strings,
each typically 60 to 140 bytes long, one per line.  There are
many duplicates, and the first reason to sort is to establish the
distribution of duplicates.  I would be happy to make available data
if I could find a reasonably sized file that causes a reproducible
segfault.  The problem seems easier to reproduce with larger files,
unfortunately.

> Do the --batch-size=NMERGE or --compress-program=PROG options change anything?

Thanks for the suggestion, I will try forcing smaller batches.

Compressing batches was something I tried early on with no apparent
change in likelihood of failure, but it led to much slower runtimes.

> Also there were temp file handling changes made in 7.2 so could you try:
> ftp://ftp.gnu.org/gnu/coreutils/coreutils-7.1.tar.gz

Here are some of the relevant-seeming parts of a gdb session for
coreutils-7.1.  Here ?.xz is a compressed file which has already been
sorted, around 35MB in size.

Built with: configure CFLAGS=-g --disable-nls

Commandline:
% nohup xzcat 1.xz 2.xz 3.xz 4.xz | sort -S 100M -T /home/a/tmp | xz > o.xz &
Segmentation fault  ../bin/sort -T /home/a/tmp -S 100M | (core dumped)

During the run there were 435 temp files active at one point.
There may have been more at a later stage, but these were reduced
to a final 32 which remained after the crash.  There is around 600GB
free disk space on this volume.

% du -smc sort* | tail -1
29556   total

% ls -sktr sort*
  62776 sortR07gPu
  62056 sortS3H1Mu
  10848 sortECN8Nx
 951020 sortlk9Xd1
1001668 sortrDhnFQ
1001420 sortItDvPu
1001216 sortIBlIVY
1001500 sortDWg5Vj
1012504 sortOulxqu
 916424 sortOTNgnn
 907976 sortRlRPsA
 997840 sortuQbWXj
1001328 sortoWTS4K
1001436 sort3GpGf2
1001544 sortVudEk7
1009412 sortJou3Y3
 926628 sortL2SeVF
 950584 sortSTuAkJ
1001376 sortX9rCaf
1000928 sortAjXZkz
1001120 sortQzXcgK
1001412 sortLwoe9K
1012704 sortM4WHnD
 955044 sort1c8ja8
 981680 sortJhX3rd
1001040 sortqGq4yV
1000596 sort7obBHs
1000540 sortW4fLHR
1000800 sortSzB3s6
 999624 sortMD7K0b
 305892 sortqSxpe4
3183480 sortcOqzkh

(gdb) bt
#0  0x000000000040e6bc in memcoll (
s1=0x7800000005824d58 <Address 0x7800000005824d58 out of bounds>, s1len=15564440312192434243, s2=0x2b2a1a0 "<http://scinets.org/item/cria214s2ria214u225719i~files~P58066.fasta>\n<http://scinets.org/item/cria214s2ria214u225719i~files~P58066.fasta>\n<http://scinets.org/item/cria214s2ria214u225719i~files~P58066.";..., s2len=68)
    at memcoll.c:50
#1  0x000000000040af4c in xmemcoll (
s1=0x7800000005824d58 <Address 0x7800000005824d58 out of bounds>, s1len=15564440312192434243, s2=0x2b2a1a0 "<http://scinets.org/item/cria214s2ria214u225719i~files~P58066.fasta>\n<http://scinets.org/item/cria214s2ria214u225719i~files~P58066.fasta>\n<http://scinets.org/item/cria214s2ria214u225719i~files~P58066.";..., s2len=68)
    at xmemcoll.c:43
#2  0x00000000004059ee in compare (a=0x5b4a7f0, b=0x301dfc0) at sort.c:2059
#3  0x0000000000406815 in mergefps (files=0x24063e0, ntemps=15, nfiles=15, ofp=0x23ff8e0, output_file=0x24062ec "/home/a/tmp/sortcOqzkh")
    at sort.c:2326
#4  0x000000000040708f in merge (files=0x24063e0, ntemps=16, nfiles=32, output_file=0x0) at sort.c:2567
#5  0x000000000040766a in sort (files=0x61c660, nfiles=0, output_file=0x0)
    at sort.c:2699
#6  0x000000000040908c in main (argc=5, argv=0x7fff149247a8) at sort.c:3425

In context, line 2326 marked with ***:

      {
        size_t lo = 1;
        size_t hi = nfiles;
        size_t probe = lo;
        size_t ord0 = ord[0];
        size_t count_of_smaller_lines;

        while (lo < hi)
          {
***         int cmp = compare (cur[ord0], cur[ord[probe]]);  ***
            if (cmp < 0 || (cmp == 0 && ord0 < ord[probe]))
              hi = probe;
            else
              lo = probe + 1;
            probe = (lo + hi) / 2;
          }

        count_of_smaller_lines = lo - 1;
        for (j = 0; j < count_of_smaller_lines; j++)
          ord[j] = ord[j + 1];
        ord[count_of_smaller_lines] = ord0;
      }
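
As a standalone illustration of the excerpt above: the loop binary-searches ord[1..nfiles-1] for the slot where ord[0] belongs, breaking ties by file index, then shifts the smaller entries down by one. The sketch below is an assumption-laden reduction, not the sort.c code itself: the names reinsert_head and compare_keys are hypothetical, and plain ints stand in for the struct line pointers that the real compare() receives.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for sort.c's compare(): the "lines" are ints here. */
int compare_keys(int a, int b)
{
    return (a > b) - (a < b);
}

/* Move ord[0] to its ordered position, assuming ord[1..nfiles-1] is already
   ordered by (key, index).  Same control flow as the quoted loop. */
void reinsert_head(const int *cur, size_t *ord, size_t nfiles)
{
    assert(nfiles > 0);

    size_t lo = 1;
    size_t hi = nfiles;
    size_t probe = lo;
    size_t ord0 = ord[0];

    while (lo < hi) {
        int cmp = compare_keys(cur[ord0], cur[ord[probe]]);
        if (cmp < 0 || (cmp == 0 && ord0 < ord[probe]))
            hi = probe;          /* ord0 sorts before the probed entry */
        else
            lo = probe + 1;      /* ord0 sorts after the probed entry */
        probe = (lo + hi) / 2;
    }

    /* Entries 1..lo-1 are smaller: shift them down and drop ord0 in. */
    size_t count_of_smaller_lines = lo - 1;
    for (size_t j = 0; j < count_of_smaller_lines; j++)
        ord[j] = ord[j + 1];
    ord[count_of_smaller_lines] = ord0;
}
```

Every probe dereferences cur[ord0], so in this sketch, as in the excerpt, a single damaged entry at the head of ord is enough to fault on the first comparison.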

In stack frame 3:

(gdb) p address@hidden
$51 = {7, 0, 14, 8, 1, 2, 9, 3, 10, 4, 11, 12, 5, 13, 6}
(gdb) print *cur[7]
$52 = {text = 0x7800000005824d58 <Address 0x7800000005824d58 out of bounds>, length = 15564440312192434244, keybeg = 0x0, keylim = 0x0}
(gdb) print *(cur[7]-1)
$54 = {
text = 0x5824d9c "<http://scinets.org/item/cria214s2ria214u225704i?update=2010-10-05>\n<http://scinets.org/item/cria214s2ria214u225704i?update=2010-10-05>\n<http://scinets.org/item/cria214s2ria214u225704i?update=2010-10-";..., length = 68, keybeg = 0xa500000000000000 <Address 0xa500000000000000 out of bounds>, keylim = 0x8900000000000000 <Address 0x8900000000000000 out of bounds>}
(gdb) print *(cur[7]+1)
$55 = {
text = 0x5824d14 "<http://scinets.org/item/cria214s2ria214u225704i?update=2010-10-05>\n<http://scinets.org/item/cria214s2ria214u225704i?update=2010-10-05>\n<http://scinets.org/item/cria214s2ria214u225704i?update=2010-10-";..., length = 68, keybeg = 0x0, keylim = 0x0}
(gdb) p (char *) 0x5824d58
$70 = 0x5824d58 "<http://scinets.org/item/cria214s2ria214u225704i?update=2010-10-05>\n<http://scinets.org/item/cria214s2ria214u225704i?update=2010-10-05>\n<http://scinets.org/item/cria214s2ria214u225704i?update=2010-10-";...

I printed that last one because cur[7].text=0x7800000005824d58
differs from this location only in its most significant byte (0x78
rather than 0x00), and 0x58-0x14=0x9c-0x58=0x44=68, the length of the
neighbouring lines, so it might be relevant.
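
The same arithmetic can be applied to the other fields of *cur[7]. The sketch below only restates numbers from the gdb session and is not a diagnosis; the helper names top_byte and low_56_bits are hypothetical. It splits a 64-bit word into its most significant byte and the remaining 56 bits.

```c
#include <stdint.h>

/* Most significant byte of a 64-bit word. */
uint8_t top_byte(uint64_t word)
{
    return (uint8_t)(word >> 56);
}

/* The same word with its most significant byte cleared. */
uint64_t low_56_bits(uint64_t word)
{
    return word & ((UINT64_C(1) << 56) - 1);
}
```

Applied to the values printed above, low_56_bits(0x7800000005824d58) is 0x5824d58 with a top byte of 0x78, and the reported length 15564440312192434244 is 0xd800000000000044, whose low 56 bits are 68. Both fields therefore look plausible once the top byte is ignored.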

For interest, here is some gdb output on a core I saved with 8.12:

#5  0x00000000004073f5 in compare (a=0x228b5e0, b=0x68ce2d0) at sort.c:2668
2668      diff = xmemcoll0 (a->text, alen + 1, b->text, blen + 1);
#6  0x000000000040837b in mergefps (files=0x119e230, ntemps=11, nfiles=11, ofp=0x11978b0, output_file=0x119787d "/home/a/tmp/sort1mESrU", fps=0x1197af0) at sort.c:2995
2995            int cmp = compare (cur[ord0], cur[ord[probe]]);

In frame 6:

(gdb) p address@hidden
$6 = {0x228b5e0, 0x2a9ff30, 0x30dff60, 0x35293b0, 0x4913940, 0x5020050, 0x5660080, 0x5bd0290, 0x68ce2d0, 0x6f60140, 0x75a0170}
(gdb) p address@hidden
$8 = {0, 8, 4, 9, 1, 5, 10, 2, 6, 3, 7}
(gdb) p ord0
$9 = 0
(gdb) p probe
$10 = 1
(gdb) p *(const struct line *)0x2a9ff30
$15 = {
  text = 0x245ff30 "_:httpx3Ax2Fx2Fapix2Ehi5x2Ecomx2Frestx2Fprofilex2Ffoafx2F350598182xxbnode337", length = 77, keybeg = 0x0, keylim = 0x0}
(gdb) p *(const struct line *)0x228b5e0
$16 = {text = 0x600000000226d720 <Address 0x600000000226d720 out of bounds>, length = 14843864371813154892, keybeg = 0x756566736f4e2f72 <Address 0x756566736f4e2f72 out of bounds>, keylim = 0x66626f5f6f746c61 <Address 0x66626f5f6f746c61 out of bounds>}
(gdb) p *(const struct line *)0x75a0170
$18 = {
  text = 0x6f60170 "_:httpx3Ax2Fx2Fapix2Ehi5x2Ecomx2Frestx2Fprofilex2Ffoafx2F492419832xxbnode215", length = 77, keybeg = 0x0, keylim = 0x0}
(gdb) p *buffer
$33 = {
buf = 0x1e1ff00 "_:httpx3Ax2Fx2Fapix2Ehi5x2Ecomx2Frestx2Fprofilex2Ffoafx2F104700830xxbnode271", used = 4596991, nlines = 61144, alloc = 6553632, left = 62, line_bytes = 32, eof = false}

-- Andras Salamon                   address@hidden




