

From: Pavel Butsykin
Subject: Re: [Qemu-devel] [PATCH v3 3/4] migration: avoid recursive AioContext locking in save_vmstate()
Date: Thu, 15 Jun 2017 16:33:53 +0300
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:52.0) Gecko/20100101 Thunderbird/52.1.1

On 14.06.2017 17:43, Kevin Wolf wrote:
Am 14.06.2017 um 15:15 hat Pavel Butsykin geschrieben:
On 14.06.2017 13:10, Pavel Butsykin wrote:

On 22.05.2017 16:57, Stefan Hajnoczi wrote:
AioContext was designed to allow nested acquire/release calls.  It uses
a recursive mutex so callers don't need to worry about nesting...or so
we thought.

BDRV_POLL_WHILE() is used to wait for block I/O requests.  It releases
the AioContext temporarily around aio_poll().  This gives IOThreads a
chance to acquire the AioContext to process I/O completions.

It turns out that recursive locking and BDRV_POLL_WHILE() don't mix.
BDRV_POLL_WHILE() only releases the AioContext once, so the IOThread
will not be able to acquire the AioContext if it was acquired
multiple times.

Instead of trying to release AioContext n times in BDRV_POLL_WHILE(),
this patch simply avoids nested locking in save_vmstate().  It's the
simplest fix and we should step back to consider the big picture with
all the recent changes to block layer threading.

This patch is the final fix to solve 'savevm' hanging with -object
iothread.

I see the same thing in external_snapshot_prepare():
[...]
and at the point of BDRV_POLL_WHILE(bs, flush_co.ret == NOT_DONE),
we hold at least two locks. So here is another deadlock.

Sorry, this is a different kind of deadlock. In the external_snapshot case, the
deadlock can happen only if state->old_bs->aio_context == my_iothread->ctx,
because in this case aio_co_enter() always calls aio_co_schedule():

Can you please write a qemu-iotests case for any deadlock that we're
seeing? Stefan, we could also use one for the bug fixed in this series.

It's test 085; it only needs iothread enabled. (patch attached)
# ./check -qcow2 085 -iothread
...
+Timeout waiting for return on handle 0
Failures: 085
Failed 1 of 1 tests

The timeout is caused by the deadlock.


Actually, the deadlock has the same cause :D

A recursive lock occurs when the transaction contains more than one action:
void qmp_transaction(TransactionActionList *dev_list,
...
/* We don't do anything in this loop that commits us to the operations */
    while (NULL != dev_entry) {
...
        state->ops->prepare(state, &local_err);
        if (local_err) {
            error_propagate(errp, local_err);
            goto delete_and_fail;
        }
    }

    QSIMPLEQ_FOREACH(state, &snap_bdrv_states, entry) {
        if (state->ops->commit) {
            state->ops->commit(state);

There the context lock is acquired in state->ops->prepare(), but is
released in state->ops->commit() (in bdrv_reopen_multiple()). So when
a transaction contains two or more actions, we get a recursive lock.
Unfortunately, I have no idea how cheap it is to fix this.
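
To make the nesting concrete, here is a minimal single-threaded model of the two loops above. It is a sketch under assumptions, not QEMU code: ModelContext, ModelAction and the model_* functions are hypothetical, every action is assumed to use the same context, and the lock is reduced to a nesting counter.

```c
#include <assert.h>
#include <stddef.h>

#define MODEL_MAX_ACTIONS 16

/* Hypothetical model of a recursive lock: only the nesting depth. */
typedef struct {
    int depth;
} ModelContext;

typedef struct {
    ModelContext *ctx;
} ModelAction;

/* Models prepare(): acquires the context and keeps it held. */
static void model_prepare(ModelAction *action)
{
    action->ctx->depth++;
}

/* Models commit(): releases what prepare() acquired. */
static void model_commit(ModelAction *action)
{
    assert(action->ctx->depth > 0);
    action->ctx->depth--;
}

/*
 * Run all prepares, then all commits, as the transaction loops do.
 * Returns the nesting depth held once the prepare loop has finished.
 */
static int model_transaction_peak_depth(ModelContext *ctx, size_t n_actions)
{
    ModelAction actions[MODEL_MAX_ACTIONS];
    int peak;
    size_t i;

    assert(n_actions <= MODEL_MAX_ACTIONS);
    ctx->depth = 0;

    for (i = 0; i < n_actions; i++) {
        actions[i].ctx = ctx;
        model_prepare(&actions[i]);
    }
    peak = ctx->depth;          /* held by every prepared action at once */

    for (i = 0; i < n_actions; i++) {
        model_commit(&actions[i]);
    }
    return peak;
}
```

In this model a transaction with one action holds the context once, while two actions hold it twice before any commit runs, so any poll that releases the lock only once in between leaves it held.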

Kevin

Attachment: 0001-qemu-iotests-add-iothread-option.patch
Description: Text Data

