qemu-devel
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [Qemu-devel] [PATCH 0/4] block: Keep track of parent quiescing


From: Kevin Wolf
Subject: Re: [Qemu-devel] [PATCH 0/4] block: Keep track of parent quiescing
Date: Fri, 14 Jun 2019 17:25:16 +0200
User-agent: Mutt/1.11.3 (2019-02-01)

Am 05.06.2019 um 18:11 hat Max Reitz geschrieben:
> We currently do not keep track of how many times a child has quiesced
> its parent.  We just guess based on the child’s quiesce_counter.  That
> keeps biting us when we try to leave drained sections or detach children
> (see e.g. commit 5cb2737e925042e).
> 
> I think we need an explicit counter to keep track of how many times a
> parent has been quiesced (patch 1).  That just makes the code more
> resilient.
> 
> Actually, no, we don’t need a counter, we need a boolean.  See patch 2
> for the explanation.
> 
> Yes, it is a bit weird to introduce a counter first (patch 1) and then
> immediately make it a boolean (patch 2).  But I believe this to be the
> most logical change set.
> 
> (“Our current model relies on counting, so adding an explicit counter
> makes sense.  It then turns out that counting is probably not the best
> idea, so why not make it a boolean?”)

Trying to summarise an IRC discussion I just had with Max:

The real root problem isn't that the recursion in bdrv_do_drained_end()
doesn't correctly deal with graph changes, but that those graph changes
happen in the first place. The one basic guiding principle in my drain
rewrite was that during the recursion (both to children and parents), no
graph changes are allowed, which means that no aio_poll() calls are
allowed either.

Of course, while I think the principle is right and greatly simplifies
the code (or actually is the only thing that gives us any hope to get
things right), I messed up the implementation because
bdrv_drain_invoke() does use BDRV_POLL_WHILE() for ending a drained
section. This is wrong, and it could still cause crashes after this
series because a recursive call could remove a node that is currently
processed somewhere up the call chain.

The fix for the observed bugs should be to make drained_end completely
symmetric to drained_begin: Just start the bdrv_drain_invoke_entry()
coroutine, do the recursion and call all the callbacks (none of which
may modify the graph) and only after all of this is done, poll once at
the top level drain. (The poll condition could be simplified to just
wait for bdrv_drain_invoke() to be completed, we don't care about other
running requests in drained_end. But this is only an optimisation.)

Despite this being a full fix, we also agreed that patch 1 is a nice
cleanup and we want to keep it even if it doesn't strictly fix a bug any
more.

Kevin



reply via email to

[Prev in Thread] Current Thread [Next in Thread]