
Re: [PULL 00/28] Block layer patches


From: Kevin Wolf
Subject: Re: [PULL 00/28] Block layer patches
Date: Tue, 19 Sep 2023 12:26:22 +0200

Am 18.09.2023 um 20:56 hat Stefan Hajnoczi geschrieben:
> Hi Kevin,
> I believe that my own commit "block-coroutine-wrapper: use
> qemu_get_current_aio_context()" breaks this test. The failure is
> non-deterministic (happens about 1 out of 4 runs).
> 
> It seems the job hangs and the test times out in vm.run_job('job1', wait=5.0).
> 
> I haven't debugged it yet but wanted to share this information to save
> some time. Tomorrow I'll investigate further.

Yes, it's relatively easy to reproduce if I run the test in a loop, and I
can't seem to reproduce it without the last patch. Should I unstage the
full series again, or do you think the last patch is really optional this
time?

However, I'm unsure how the stack traces I'm seeing are related to your
patch. Maybe it just made an existing bug more likely to be triggered?

What I'm seeing is that the reader lock is held by an iothread that is
waiting for its AioContext lock before it can make progress:

Thread 3 (Thread 0x7f811e9346c0 (LWP 26390) "qemu-system-x86"):
#0  0x00007f81250aaf80 in __lll_lock_wait () at /lib64/libc.so.6
#1  0x00007f81250b149a in pthread_mutex_lock@@GLIBC_2.2.5 () at /lib64/libc.so.6
#2  0x000055b7b170967e in qemu_mutex_lock_impl (mutex=0x55b7b34e3080, file=0x55b7b199e1f7 "../util/async.c", line=728) at ../util/qemu-thread-posix.c:94
#3  0x000055b7b1709953 in qemu_rec_mutex_lock_impl (mutex=0x55b7b34e3080, file=0x55b7b199e1f7 "../util/async.c", line=728) at ../util/qemu-thread-posix.c:149
#4  0x000055b7b1728318 in aio_context_acquire (ctx=0x55b7b34e3020) at ../util/async.c:728
#5  0x000055b7b1727c49 in co_schedule_bh_cb (opaque=0x55b7b34e3020) at ../util/async.c:565
#6  0x000055b7b1726f1c in aio_bh_call (bh=0x55b7b34e2e70) at ../util/async.c:169
#7  0x000055b7b17270ee in aio_bh_poll (ctx=0x55b7b34e3020) at ../util/async.c:216
#8  0x000055b7b170351d in aio_poll (ctx=0x55b7b34e3020, blocking=true) at ../util/aio-posix.c:722
#9  0x000055b7b1518604 in iothread_run (opaque=0x55b7b2904460) at ../iothread.c:63
#10 0x000055b7b170a955 in qemu_thread_start (args=0x55b7b34e36b0) at ../util/qemu-thread-posix.c:541
#11 0x00007f81250ae15d in start_thread () at /lib64/libc.so.6
#12 0x00007f812512fc00 in clone3 () at /lib64/libc.so.6

On the other hand, the main thread wants to acquire the writer lock,
but it holds the AioContext lock of the iothread (it takes it in
job_prepare_locked()):

Thread 1 (Thread 0x7f811f4b7b00 (LWP 26388) "qemu-system-x86"):
#0  0x00007f8125122356 in ppoll () at /lib64/libc.so.6
#1  0x000055b7b172eae0 in qemu_poll_ns (fds=0x55b7b34ec910, nfds=1, timeout=-1) at ../util/qemu-timer.c:339
#2  0x000055b7b1704ebd in fdmon_poll_wait (ctx=0x55b7b3269210, ready_list=0x7ffc90b05680, timeout=-1) at ../util/fdmon-poll.c:79
#3  0x000055b7b1703284 in aio_poll (ctx=0x55b7b3269210, blocking=true) at ../util/aio-posix.c:670
#4  0x000055b7b1567c3b in bdrv_graph_wrlock (bs=0x0) at ../block/graph-lock.c:145
#5  0x000055b7b1554c1c in blk_remove_bs (blk=0x55b7b4425800) at ../block/block-backend.c:916
#6  0x000055b7b1554779 in blk_delete (blk=0x55b7b4425800) at ../block/block-backend.c:497
#7  0x000055b7b1554133 in blk_unref (blk=0x55b7b4425800) at ../block/block-backend.c:557
#8  0x000055b7b157a149 in mirror_exit_common (job=0x55b7b4419000) at ../block/mirror.c:696
#9  0x000055b7b1577015 in mirror_prepare (job=0x55b7b4419000) at ../block/mirror.c:807
#10 0x000055b7b153a1a7 in job_prepare_locked (job=0x55b7b4419000) at ../job.c:988
#11 0x000055b7b153a0d9 in job_txn_apply_locked (job=0x55b7b4419000, fn=0x55b7b153a110 <job_prepare_locked>) at ../job.c:191
#12 0x000055b7b1538b6d in job_do_finalize_locked (job=0x55b7b4419000) at ../job.c:1011
#13 0x000055b7b153a886 in job_completed_txn_success_locked (job=0x55b7b4419000) at ../job.c:1068
#14 0x000055b7b1539372 in job_completed_locked (job=0x55b7b4419000) at ../job.c:1082
#15 0x000055b7b153a71b in job_exit (opaque=0x55b7b4419000) at ../job.c:1103
#16 0x000055b7b1726f1c in aio_bh_call (bh=0x7f8110005470) at ../util/async.c:169
#17 0x000055b7b17270ee in aio_bh_poll (ctx=0x55b7b3269210) at ../util/async.c:216
#18 0x000055b7b1702c05 in aio_dispatch (ctx=0x55b7b3269210) at ../util/aio-posix.c:423
#19 0x000055b7b1728a14 in aio_ctx_dispatch (source=0x55b7b3269210, callback=0x0, user_data=0x0) at ../util/async.c:358
#20 0x00007f8126c31c7f in g_main_dispatch (context=0x55b7b3269720) at ../glib/gmain.c:3454
#21 g_main_context_dispatch (context=0x55b7b3269720) at ../glib/gmain.c:4172
#22 0x000055b7b1729c98 in glib_pollfds_poll () at ../util/main-loop.c:290
#23 0x000055b7b1729572 in os_host_main_loop_wait (timeout=27462700) at ../util/main-loop.c:313
#24 0x000055b7b1729452 in main_loop_wait (nonblocking=0) at ../util/main-loop.c:592
#25 0x000055b7b119a1eb in qemu_main_loop () at ../softmmu/runstate.c:772
#26 0x000055b7b14c102d in qemu_default_main () at ../softmmu/main.c:37
#27 0x000055b7b14c1068 in main (argc=44, argv=0x7ffc90b05d58) at ../softmmu/main.c:48
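
Putting the two traces together, this looks like a classic ABBA cycle: the
iothread holds the graph reader lock and blocks on its AioContext lock,
while the main thread holds that AioContext lock and blocks in
bdrv_graph_wrlock() waiting for the reader to go away. Just to make the
shape explicit, here is a minimal standalone sketch of the same pattern in
plain pthreads (illustration only, nothing QEMU-specific; the graph lock
and AioContext lock are of course not pthread primitives):

/*
 * Minimal pthread-only sketch of the ABBA pattern above (not QEMU code):
 * the "iothread" takes the reader lock and then needs the mutex, "main"
 * takes the mutex and then needs the writer lock. Both threads hang.
 */
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

static pthread_rwlock_t graph_lock = PTHREAD_RWLOCK_INITIALIZER; /* ~ graph lock */
static pthread_mutex_t aio_ctx_lock = PTHREAD_MUTEX_INITIALIZER; /* ~ AioContext lock */

static void *iothread_fn(void *arg)
{
    (void)arg;
    pthread_rwlock_rdlock(&graph_lock);  /* reader lock held */
    sleep(1);                            /* give "main" time to take the mutex */
    pthread_mutex_lock(&aio_ctx_lock);   /* blocks: mutex held by main thread */
    printf("iothread: never reached\n");
    pthread_mutex_unlock(&aio_ctx_lock);
    pthread_rwlock_unlock(&graph_lock);
    return NULL;
}

int main(void)
{
    pthread_t t;
    pthread_create(&t, NULL, iothread_fn, NULL);

    pthread_mutex_lock(&aio_ctx_lock);   /* ~ job_prepare_locked() taking the
                                          * iothread's AioContext lock */
    sleep(1);
    pthread_rwlock_wrlock(&graph_lock);  /* blocks: waits for the reader, which
                                          * itself waits for the mutex -> deadlock */
    printf("main: never reached\n");
    pthread_join(t, NULL);
    return 0;
}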

At first I thought we would just need to look into the AioContext locking
in job completion and drop the lock in the right places.

But in fact, first of all, blk_remove_bs() needs to make up its mind
whether it wants the caller to hold the AioContext lock or not, and
document that, because it calls both bdrv_drained_begin() (which requires
holding the AioContext lock) and bdrv_graph_wrlock(NULL) (which forbids
it).
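
In other words, as far as I can tell the function currently has roughly
this shape (only a sketch of the two conflicting calls and the lock
expectations described above, not the actual block-backend.c code):

void blk_remove_bs(BlockBackend *blk)
{
    BlockDriverState *bs = blk_bs(blk);
    /* ... */
    bdrv_drained_begin(bs);   /* expects the caller to hold the
                               * AioContext lock */
    /* ... */
    bdrv_graph_wrlock(NULL);  /* expects the caller to *not* hold the
                               * AioContext lock (it polls, see frame #3
                               * in the trace above) */
    /* ... */
}

So whichever way the caller holds the lock, one of the two expectations is
violated.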

If we could fully get rid of the AioContext lock (as we originally stated
as a goal), that would automatically solve this kind of deadlock.

Kevin



