
Re: [PATCH 05/21] block: Introduce bdrv_schedule_unref()


From: Kevin Wolf
Subject: Re: [PATCH 05/21] block: Introduce bdrv_schedule_unref()
Date: Wed, 6 Sep 2023 11:17:09 +0200

On 05.09.2023 at 18:39, Kevin Wolf wrote:
> On 22.08.2023 at 21:01, Stefan Hajnoczi wrote:
> > On Thu, Aug 17, 2023 at 02:50:04PM +0200, Kevin Wolf wrote:
> > > bdrv_unref() is called by a lot of places that need to hold the graph
> > > lock (it naturally happens in the context of operations that change the
> > > graph). However, bdrv_unref() takes the graph writer lock internally, so
> > > it can't actually be called while already holding a graph lock without
> > > causing a deadlock.
> > > 
> > > bdrv_unref() also can't just become GRAPH_WRLOCK because it drains the
> > > node before closing it, and draining requires that the graph is
> > > unlocked.
> > > 
> > > The solution is to defer deleting the node until we don't hold the lock
> > > any more and draining is possible again.
> > > 
> > > Note that keeping images open for longer than necessary can create
> > > problems, too: You can't open an image again before it is really closed
> > > (if image locking didn't prevent it, it would cause corruption).
> > > Reopening an image immediately happens at least during bdrv_open() and
> > > bdrv_co_create().
> > > 
> > > In order to solve this problem, make sure to run the deferred unref in
> > > bdrv_graph_wrunlock(), i.e. the first possible place where we can drain
> > > again. This is also why bdrv_schedule_unref() is marked GRAPH_WRLOCK.
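As an illustration of the intended usage (not code from the patch; the
surrounding function is made up), a caller that needs to drop a reference
while it still holds the writer lock would look roughly like this:

    static void example_remove_node(BlockDriverState *bs)
    {
        bdrv_graph_wrlock(NULL);

        /* ... some graph change that requires the writer lock ... */

        /*
         * bdrv_unref(bs) would deadlock here because it takes the writer
         * lock internally (and may drain), so defer it instead:
         */
        bdrv_schedule_unref(bs);

        /* The deferred bdrv_unref() runs in here, once draining is possible. */
        bdrv_graph_wrunlock();
    }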
> > > 
> > > The output of iotest 051 is updated because the additional polling
> > > changes the order of HMP output, resulting in a new "(qemu)" prompt in
> > > the test output that was previously on a separate line and filtered out.
> > > 
> > > Signed-off-by: Kevin Wolf <kwolf@redhat.com>
> > > ---
> > >  include/block/block-global-state.h |  1 +
> > >  block.c                            |  9 +++++++++
> > >  block/graph-lock.c                 | 23 ++++++++++++++++-------
> > >  tests/qemu-iotests/051.pc.out      |  6 +++---
> > >  4 files changed, 29 insertions(+), 10 deletions(-)
> > > 
> > > diff --git a/include/block/block-global-state.h b/include/block/block-global-state.h
> > > index f347199bff..e570799f85 100644
> > > --- a/include/block/block-global-state.h
> > > +++ b/include/block/block-global-state.h
> > > @@ -224,6 +224,7 @@ void bdrv_img_create(const char *filename, const char *fmt,
> > >  void bdrv_ref(BlockDriverState *bs);
> > >  void no_coroutine_fn bdrv_unref(BlockDriverState *bs);
> > >  void coroutine_fn no_co_wrapper bdrv_co_unref(BlockDriverState *bs);
> > > +void GRAPH_WRLOCK bdrv_schedule_unref(BlockDriverState *bs);
> > >  void bdrv_unref_child(BlockDriverState *parent, BdrvChild *child);
> > >  BdrvChild *bdrv_attach_child(BlockDriverState *parent_bs,
> > >                               BlockDriverState *child_bs,
> > > diff --git a/block.c b/block.c
> > > index 6376452768..9c4f24f4b9 100644
> > > --- a/block.c
> > > +++ b/block.c
> > > @@ -7033,6 +7033,15 @@ void bdrv_unref(BlockDriverState *bs)
> > >      }
> > >  }
> > >  
> > > +void bdrv_schedule_unref(BlockDriverState *bs)
> > 
> > Please add a doc comment explaining when and why this should be used.
> 
> Ok.
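Maybe something along these lines (just a sketch of possible wording, based
on what the commit message already says):

    /*
     * Release a BlockDriverState reference while holding the graph write
     * lock.
     *
     * Calling bdrv_unref() directly is not possible here because it takes
     * the writer lock internally and may drain the node. The actual
     * bdrv_unref() is therefore deferred to a BH that runs from
     * bdrv_graph_wrunlock(), i.e. the first point where draining is
     * possible again.
     */
    void GRAPH_WRLOCK bdrv_schedule_unref(BlockDriverState *bs);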
> 
> > > +{
> > > +    if (!bs) {
> > > +        return;
> > > +    }
> > > +    aio_bh_schedule_oneshot(qemu_get_aio_context(),
> > > +                            (QEMUBHFunc *) bdrv_unref, bs);
> > > +}
> > > +
> > >  struct BdrvOpBlocker {
> > >      Error *reason;
> > >      QLIST_ENTRY(BdrvOpBlocker) list;
> > > diff --git a/block/graph-lock.c b/block/graph-lock.c
> > > index 5e66f01ae8..395d387651 100644
> > > --- a/block/graph-lock.c
> > > +++ b/block/graph-lock.c
> > > @@ -163,17 +163,26 @@ void bdrv_graph_wrlock(BlockDriverState *bs)
> > >  void bdrv_graph_wrunlock(void)
> > >  {
> > >      GLOBAL_STATE_CODE();
> > > -    QEMU_LOCK_GUARD(&aio_context_list_lock);
> > >      assert(qatomic_read(&has_writer));
> > >  
> > > +    WITH_QEMU_LOCK_GUARD(&aio_context_list_lock) {
> > > +        /*
> > > +         * No need for memory barriers, this works in pair with
> > > +         * the slow path of rdlock() and both take the lock.
> > > +         */
> > > +        qatomic_store_release(&has_writer, 0);
> > > +
> > > +        /* Wake up all coroutine that are waiting to read the graph */
> > 
> > s/coroutine/coroutines/
> 
> I only changed the indentation, but I guess I can just fix it while I
> touch it.
> 
> > > +        qemu_co_enter_all(&reader_queue, &aio_context_list_lock);
> > > +    }
> > > +
> > >      /*
> > > -     * No need for memory barriers, this works in pair with
> > > -     * the slow path of rdlock() and both take the lock.
> > > +     * Run any BHs that were scheduled during the wrlock section and that
> > > +     * callers might expect to have finished (e.g. bdrv_unref() calls). Do this
> > 
> > Referring directly to bdrv_schedule_unref() would help make it clearer
> > what you mean.
> > 
> > > +     * only after restarting coroutines so that nested event loops in BHs don't
> > > +     * deadlock if their condition relies on the coroutine making progress.
> > >       */
> > > -    qatomic_store_release(&has_writer, 0);
> > > -
> > > -    /* Wake up all coroutine that are waiting to read the graph */
> > > -    qemu_co_enter_all(&reader_queue, &aio_context_list_lock);
> > > +    aio_bh_poll(qemu_get_aio_context());
> > 
> > Keeping a dedicated list of BDSes pending unref seems cleaner than using
> > the aio_bh_poll(). There's a lot of stuff happening in the event loop
> > and I wonder if those other BHs might cause issues if they run at the
> > end of bdrv_graph_wrunlock(). The change to qemu-iotests 051's output is
> > an example of this, but other things could happen too (e.g. monitor
> > command processing).
> 
> I would argue that it's no worse than the old code: The bdrv_unref()
> that we're replacing would already run a nested event loop if it was the
> last reference. We only moved this a bit later, making the part of the
> code that doesn't run an event loop a bit larger. The difference is that
> we're running it unconditionally now, not only in a special case, but I
> think that's actually an improvement (in terms of test coverage etc.).
> 
> We run nested event loops in all kinds of different places without
> thinking much about it. If we're worried about it here, what do we do
> about all these other places?
> 
> Anyway, if you think that we should process only bdrv_schedule_unref()
> here, how would you approach this? Restrict it to bdrv_schedule_unref()
> in the hope that this is the only operation we'll need to delay, or
> implement another BH mechanism from scratch in graph-lock.c? In theory
> we could also add another AioContext where actual BHs can be queued and
> that is only polled here, but the iohandler context is already painful
> enough that I don't necessarily want to create another thing like it.
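To make that concrete, the dedicated list you're suggesting could look
roughly like this (just a sketch; PendingUnref and bdrv_run_pending_unrefs()
are names I'm making up for illustration):

    typedef struct PendingUnref {
        BlockDriverState *bs;
        QSLIST_ENTRY(PendingUnref) next;
    } PendingUnref;

    /* Only touched with the writer lock held, i.e. from the main loop, so a
     * plain list without extra locking should do. */
    static QSLIST_HEAD(, PendingUnref) pending_unrefs =
        QSLIST_HEAD_INITIALIZER(pending_unrefs);

    void bdrv_schedule_unref(BlockDriverState *bs)
    {
        PendingUnref *p;

        if (!bs) {
            return;
        }
        p = g_new(PendingUnref, 1);
        p->bs = bs;
        QSLIST_INSERT_HEAD(&pending_unrefs, p, next);
    }

    /* Called from bdrv_graph_wrunlock() after waking up the readers, in
     * place of the aio_bh_poll() call. */
    static void bdrv_run_pending_unrefs(void)
    {
        while (!QSLIST_EMPTY(&pending_unrefs)) {
            PendingUnref *p = QSLIST_FIRST(&pending_unrefs);

            QSLIST_REMOVE_HEAD(&pending_unrefs, next);
            /* Can still drain and run a nested event loop if this drops the
             * last reference. */
            bdrv_unref(p->bs);
            g_free(p);
        }
    }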

Actually, come to think of it, even processing only scheduled unrefs is
effectively a nested event loop (via bdrv_close()) that could run for
any caller. The only thing we would achieve is making it conditional
again so that it triggers only sometimes. But callers still have to make
sure that polling is safe because it could be the last reference.

So even the little use I saw in this yesterday, I'm not sure about it any
more now.

Kevin


