From: McFarland, Jeffrey
Subject: Multithreaded sort hangs on Solaris
Date: Mon, 11 Mar 2013 15:47:58 +0000

I have come across some odd results regarding the sort utility in coreutils version 8.20. I’ve looked through the archives and don’t see any similar issues so it may be something specific to our systems.
System: SunOS 5.10 Generic_147440-26 sun4u sparc SUNW,Sun-Fire-V890
Issue: When running sort on a 22.5 GB file, the process appears to hang in roughly 1 out of every 10 runs (over 100+ tests). The process is still running, but the temp files are no longer changing and the final file either has not been created or is a 0-byte file. When this happens, the temp files are never in exactly the same state as in a previous hung run. On this machine a complete sort normally takes about 20 minutes. On one occasion the process hung for over 48 hours before I killed it. Running top shows no significant load on the system.
Command run: ./sort -t\n -S 256M --batch-size=100 -T /disk/craiwk01/prod/SORTWK -T /disk/craiwk02/prod/SORTWK -T /disk/craiwk03/prod/SORTWK -T /disk/craiwk04/prod/SORTWK -T /disk/craiwk06/prod/SORTWK -k1.1,1.10 infile -o infile.sorted
>: ps
  PID TTY      TIME CMD
16328 pts/3    5:06 sort
12697 pts/3    0:00 ps
>: sudo truss -rall -wall -f -p 16328
16328: lwp_park(0x00000000, 0) (sleeping...)
>: sudo pstack 16328
16328: /usr/local/abacus/etsort/sort -tn -S 295063 --batch-size=100 -T /disk/
----------------- lwp# 1 / thread# 1 --------------------
 ffffffff7d4d8818 lwp_park (0, 0, 0)
 0000000100009c74 sortlines (111b56580, 111c56080, ffffffff7fffeab0, 10012a321, ffffffff7fffead0, 10012a328) + 514
 000000010000a5cc sortlines (111558380, 2, ffffffff7fffeab0, 1121765e0, 0, ffffffff7fffeab0) + e6c
 000000010000a5cc sortlines (111956f80, 4, ffffffff7fffeab0, 112176420, 0, ffffffff7fffeab0) + e6c
 000000010000a5cc sortlines (112154760, 8, ffffffff7fffeab0, 1121760a0, 1, ffffffff7fffeab0) + e6c
 000000010000c070 sort (10012a740, 0, ffffffff7fffead0, 23, 10012cddd, 112154760) + 350
 000000010000e6e8 main (13, ffffffff7ffff148, 0, 10012c220, fffd, 10012b1e0) + 1ee8
 00000001000041bc _start (0, 0, 0, 0, 0, 0) + 7c
----------------- lwp# 240 / thread# 240 --------------------
 000000010000a600 sortlines_thread(), exit value = 0x0000000000000000
 ** zombie (exited, not detached, not yet joined) **
----------------- lwp# 241 / thread# 241 --------------------
 000000010000a600 sortlines_thread(), exit value = 0x0000000000000000
 ** zombie (exited, not detached, not yet joined) **
----------------- lwp# 242 / thread# 242 --------------------
 000000010000a600 sortlines_thread(), exit value = 0x0000000000000000
 ** zombie (exited, not detached, not yet joined) **
If I change the sort to run as a single-threaded process (adding --parallel=1 to the above command), it doesn’t hang. This makes me think that it’s most likely a threading issue. I ran the same tests on a Linux machine and did not see the hang there, so it’s most likely specific to Solaris.
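
For anyone wanting to reproduce the comparison between the default and --parallel=1 runs, here is a minimal sketch of a harness that counts how many runs exceed a time limit. It assumes Python 3 is available on the test machine (not something stated above); the function name `count_hangs` and the timeout value in the comment are hypothetical choices for illustration, not what was used for the tests described here.

```python
import subprocess
import sys


def count_hangs(cmd, runs, timeout_s):
    """Run `cmd` `runs` times and return how many runs exceeded `timeout_s` seconds.

    A run that times out is killed by subprocess.run and counted as a hang.
    Exit status is ignored on purpose: only hangs are being measured here.
    """
    hangs = 0
    for _ in range(runs):
        try:
            subprocess.run(cmd, timeout=timeout_s, check=False,
                           stdout=subprocess.DEVNULL)
        except subprocess.TimeoutExpired:
            hangs += 1
    return hangs


# Hypothetical usage: a normal run takes about 20 minutes, so treat anything
# over an hour as hung. The argument list would be the full sort command.
#   count_hangs(["./sort", "--parallel=1", "infile", "-o", "infile.sorted"],
#               runs=100, timeout_s=3600)
```

Running it once with and once without --parallel=1 would give two hang counts to compare over the same number of runs.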
I initially found this issue using coreutils 8.9 and I changed to 8.20 to see if a fix had been made but no luck.
Is this a known issue? Are there any additional tests I should run to further narrow down this issue?
Thanks,
Jeff