Re: Multithreaded sort hangs on Solaris
From: Pádraig Brady
Subject: Re: Multithreaded sort hangs on Solaris
Date: Wed, 13 Mar 2013 17:25:12 +0000
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:17.0) Gecko/20130110 Thunderbird/17.0.2
On 03/13/2013 02:18 PM, McFarland, Jeffrey wrote:
> Here are the values from another sort that has been running for over 12 hours
> now. This time that second argument (number of threads) looks fine in all
> three cases. And this time there are no zombie threads.
>
>> : pstack 20632
> 20632: /usr/local/abacus/etsort/sort -tn -S 295063 --batch-size=100 -T
> /disk/
> ----------------- lwp# 1 / thread# 1 --------------------
> ffffffff7eadc810 lwp_wait (f2, ffffffff7fffea9c)
> ffffffff7ead4d74 _thrp_join (f2, 0, 0, 1, ffffffff7fffeca0,
> ffffffff7fffea9c) + 38
> 000000010000f2f4 sortlines (110137e90, 8, 7194a, 11015bfe0,
> ffffffff7fffeca0, 100136240) + 174
> 0000000100010144 sort (100137cd0, 1, ffffffff7ffff660, 8, ffffffff7fffeeac,
> ffffffff7ed00200) + 2f0
> 0000000100012bf4 main (13, ffffffff7ffff1f8, ffffffff7ffff298, 100136ca8,
> 100000000, ffffffff7ed00200) + 21cc
> 0000000100004ca4 _start (0, 0, 0, 0, 0, 0) + 7c
> ----------------- lwp# 242 / thread# 242 --------------------
> ffffffff7eadc810 lwp_wait (f4, ffffffff7e1fbd2c)
> ffffffff7ead4d74 _thrp_join (f4, 0, 0, 1, ffffffff7fffeca0,
> ffffffff7e1fbd2c) + 38
> 000000010000f2f4 sortlines (110137e90, 4, 7194a, 11015c050,
> ffffffff7fffeca0, 100136240) + 174
> 000000010000f168 sortlines_thread (ffffffff7fffeb60, 1fc000, 0, 0,
> 10000f104, 0) + 64
> ffffffff7ead8778 _lwp_start (0, 0, 0, 0, 0, 0)
> ----------------- lwp# 244 / thread# 244 --------------------
> ffffffff7ead8818 lwp_park (0, 0, 0)
> 000000010000e710 lock_node (11015c360, 10f691fb0, ffffffff7ec4a300,
> ffffffff7fffecac, ffffffff7ed00a00, 0) + 14
> 000000010000efbc queue_check_insert_parent (ffffffff7fffeca0, 11015c3d0,
> 100136240, 1101597dd, ffffffff7ed00a00, 1c00) + 2c
> 000000010000f0e8 merge_loop (ffffffff7fffeca0, 7194a, 100136240, 1101597dd,
> ffffffff7eacff0c, 3) + 90
> 000000010000f43c sortlines (110137e90, 2, 7194a, 11015c0c0,
> ffffffff7fffeca0, 100136240) + 2bc
> 000000010000f168 sortlines_thread (ffffffff7e1fbdf0, 1fc000, 0, 0,
> 10000f104, 0) + 64
> ffffffff7ead8778 _lwp_start (0, 0, 0, 0, 0, 0)
This looks like a deadlock, though it may be triggered by stack corruption,
since the failure modes vary.
Would it be possible to annotate lock_node() with the attached patch?
That should at least verify that we're not missing an unlock() somewhere.
You can then capture the annotations by adding '2> locks' at the end of the
command.
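For anyone reading without the attachment, here is a rough sketch of what
this kind of annotation could look like. It is not the attached diff: the
names demo_node, demo_lock_node, demo_unlock_node, trace_lock and
lock_balance are all hypothetical, and the real merge node in sort has more
fields than a mutex. The idea is only to log each lock and unlock to stderr
so that the two can be paired up afterwards.

```c
#include <pthread.h>
#include <stdio.h>

/* Hypothetical stand-in for sort's merge node; only the mutex matters here. */
struct demo_node
{
  pthread_mutex_t lock;
};

/* Locks currently held across all threads (lock calls minus unlock calls). */
static int lock_balance;
static pthread_mutex_t balance_mutex = PTHREAD_MUTEX_INITIALIZER;

/* Adjust the counter and write one trace line to stderr. */
static void
trace_lock (char const *op, struct demo_node *node, int delta)
{
  pthread_mutex_lock (&balance_mutex);
  lock_balance += delta;
  fprintf (stderr, "%s %p held=%d\n", op, (void *) node, lock_balance);
  pthread_mutex_unlock (&balance_mutex);
}

/* Assumed shape of an annotated lock wrapper, not the real lock_node().  */
static void
demo_lock_node (struct demo_node *node)
{
  trace_lock ("wait", node, 0);     /* logged before we may block */
  pthread_mutex_lock (&node->lock);
  trace_lock ("lock", node, 1);     /* logged once the lock is held */
}

/* Assumed shape of the matching annotated unlock wrapper.  */
static void
demo_unlock_node (struct demo_node *node)
{
  trace_lock ("unlock", node, -1);  /* logged while still holding the lock */
  pthread_mutex_unlock (&node->lock);
}
```

In a trace like this, a node address with a "wait" line but no later "lock"
line is where a thread is parked, and a "lock" line with no later "unlock"
for the same address would point at a missing unlock.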
thanks,
Pádraig.
Attachment: sort-lock-annotate.diff
Description: Text Data