qemu-devel

Re: [Qemu-devel] [Qemu-block] segfault in parallel blockjobs (iotest 30)


From: Anton Nefedov
Subject: Re: [Qemu-devel] [Qemu-block] segfault in parallel blockjobs (iotest 30)
Date: Wed, 15 Nov 2017 19:31:20 +0300
User-agent: Mozilla/5.0 (Windows NT 6.3; WOW64; rv:52.0) Gecko/20100101 Thunderbird/52.4.0



On 15/11/2017 6:42 PM, Alberto Garcia wrote:
On Fri 10 Nov 2017 04:02:23 AM CET, Fam Zheng wrote:
I'm thinking that perhaps we should add the pause point directly to
block_job_defer_to_main_loop(), to prevent any block job from
running the exit function when it's paused.

I was trying this and unfortunately this breaks the mirror job at
least (reproduced with a simple block-commit on the topmost node,
which uses commit_active_start() -> mirror_start_job()).

So what happens is that mirror_run() always calls
bdrv_drained_begin() before returning, pausing the block job. The
closing bdrv_drained_end() call is at the end of mirror_exit(),
already in the main loop.

So the mirror job is always calling block_job_defer_to_main_loop()
and mirror_exit() when the job is paused.
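
To make the ordering problem concrete, here is a minimal toy model in C. It is only a sketch: ToyJob and every toy_* function are hypothetical stand-ins invented for illustration, not QEMU's BlockJob code. The run function ends inside a drained section, the exit callback is the only thing that closes it, and the pause_point parameter models the proposed check in the deferral path.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical toy job, NOT QEMU's BlockJob. */
typedef struct ToyJob {
    int pause_count;   /* > 0 means the job is considered paused */
    bool exited;       /* set once the exit callback has run */
} ToyJob;

static void toy_drained_begin(ToyJob *job)
{
    job->pause_count++;
}

static void toy_drained_end(ToyJob *job)
{
    assert(job->pause_count > 0);
    job->pause_count--;
}

/* Stand-in for an exit callback that closes the drained section
 * opened at the end of the run function. */
static void toy_exit(ToyJob *job)
{
    toy_drained_end(job);
    job->exited = true;
}

/* Stand-in for the deferral step. With pause_point set, the exit
 * callback is refused while the job is paused. Returns true if the
 * exit callback ran. */
static bool toy_defer_to_main_loop(ToyJob *job, bool pause_point)
{
    if (pause_point && job->pause_count > 0) {
        /* A real job would wait here, but only toy_exit() can
         * unpause it, so it would wait forever. */
        return false;
    }
    toy_exit(job);
    return true;
}

/* Run a job that finishes inside a drained section, then defer. */
static bool toy_run_and_exit(bool pause_point)
{
    ToyJob job = { 0, false };
    toy_drained_begin(&job);
    return toy_defer_to_main_loop(&job, pause_point);
}
```

In this model toy_run_and_exit(false) completes, while toy_run_and_exit(true) never reaches the exit callback, which mirrors the breakage described above.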

FWIW, I think Max's report on 194 failures is related:

https://lists.gnu.org/archive/html/qemu-devel/2017-11/msg01822.html

so perhaps it's worth testing his patch too:

https://lists.gnu.org/archive/html/qemu-devel/2017-11/msg01835.html

Well, that doesn't solve the original crash with parallel block jobs.
The root of the crash is that the mirror job manipulates the graph
_while being paused_, so the BlockReopenQueue in bdrv_reopen_multiple()
gets messed up, and pausing the jobs (commit 40840e419be31e) won't help.

I have the impression that one major source of headaches is the fact
that the reopen queue contains nodes that don't need to be reopened at
all. Ideally this should be detected early on in bdrv_reopen_queue(), so
there's no chance that the queue contains nodes used by a different
block job. If we had that then op blockers should be enough to prevent
these things. Or am I missing something?
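
A rough sketch of that idea, in C. The types and the toy_* functions below are hypothetical and exist only for illustration. The sketch also assumes that "needs to be reopened" can be reduced to comparing a single flags value, which is a simplification and not how bdrv_reopen_queue() decides anything today.

```c
#include <stddef.h>

/* Hypothetical toy types, NOT QEMU's BlockDriverState or
 * BlockReopenQueue. */
typedef struct ToyNode {
    const char *name;
    int flags;            /* current open flags */
} ToyNode;

typedef struct ToyReopenEntry {
    ToyNode *node;
    int new_flags;        /* flags requested by the reopen */
} ToyReopenEntry;

/* Compact the queue in place, keeping only entries whose flags
 * would actually change. Returns the number of entries kept. */
static size_t toy_prune_reopen_queue(ToyReopenEntry *q, size_t n)
{
    size_t kept = 0;
    for (size_t i = 0; i < n; i++) {
        if (q[i].node->flags != q[i].new_flags) {
            q[kept++] = q[i];
        }
    }
    return kept;
}

/* Three-node chain where only "mid" changes its flags. */
static size_t toy_prune_demo(void)
{
    ToyNode top = { "top", 1 }, mid = { "mid", 0 }, base = { "base", 0 };
    ToyReopenEntry q[] = { { &top, 1 }, { &mid, 1 }, { &base, 0 } };
    return toy_prune_reopen_queue(q, 3);
}
```

With the queue pruned this way, nodes that belong to another block job and need no change would never be touched by the reopen.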

Berto


After applying Max's patch I tried a similar approach, that is, keeping
BDSes referenced while they are in the reopen queue.
Now the stream job hangs: somehow one blk_root_drained_begin() call
is not paired with a blk_root_drained_end(), so the job stays paused.
I haven't dug deeper yet, but at first glance the reduced reopen queue
won't help with this, as reopen drains all BDSes anyway (or can we avoid
that too?)
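
For reference, the "keep referenced while queued" idea amounts to something like the toy C model below. All names are hypothetical; this is not the patch I tested and not QEMU's refcounting API. It only shows the intended lifetime rule: the queue takes its own reference on insertion and drops it when the queue is released.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical toy node with a reference count, NOT QEMU's BDS. */
typedef struct ToyBDS {
    int refcnt;
} ToyBDS;

static void toy_ref(ToyBDS *bs)
{
    bs->refcnt++;
}

static void toy_unref(ToyBDS *bs)
{
    assert(bs->refcnt > 0);
    bs->refcnt--;
}

#define TOY_QUEUE_MAX 8

typedef struct ToyQueue {
    ToyBDS *entries[TOY_QUEUE_MAX];
    size_t len;
} ToyQueue;

/* The queue takes its own reference on every node it holds. */
static void toy_queue_add(ToyQueue *q, ToyBDS *bs)
{
    assert(q->len < TOY_QUEUE_MAX);
    toy_ref(bs);
    q->entries[q->len++] = bs;
}

/* Drop the queue's references once the reopen is finished. */
static void toy_queue_release(ToyQueue *q)
{
    for (size_t i = 0; i < q->len; i++) {
        toy_unref(q->entries[i]);
    }
    q->len = 0;
}

/* Returns the refcount seen while the node is still queued but its
 * original owner (e.g. a job changing the graph) has let go of it. */
static int toy_refcnt_while_queued(void)
{
    ToyBDS bs = { 1 };          /* one reference held by the owner */
    ToyQueue q = { { NULL }, 0 };
    toy_queue_add(&q, &bs);     /* the queue holds a second one */
    toy_unref(&bs);             /* owner drops its reference */
    int during = bs.refcnt;     /* still alive for the queue */
    toy_queue_release(&q);
    assert(bs.refcnt == 0);
    return during;
}
```

This only addresses the lifetime of the queued nodes; it says nothing about the unpaired drained_begin/drained_end described above.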

/Anton


