Re: [Qemu-devel] [PATCH 5/7] block/mirror: utilize job_exit shim
From: Max Reitz
Subject: Re: [Qemu-devel] [PATCH 5/7] block/mirror: utilize job_exit shim
Date: Wed, 29 Aug 2018 10:28:57 +0200
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:52.0) Gecko/20100101 Thunderbird/52.9.1
On 2018-08-28 22:25, John Snow wrote:
>
>
> On 08/25/2018 11:02 AM, Max Reitz wrote:
>> If you say so... I have to admit I don't really understand. The
>> comment doesn't explain why it's so important to keep src around until
>> job_completed(), so I don't know. I thought AioContexts are recursive
>> so it doesn't matter whether you take them recursively or not.
>
> bdrv_flush runs into trouble under a recursive lock if it is invoked
> from a different thread. It tries to poll for flush completion, but the
> coroutine that gets scheduled (instead of entered) can't actually run
> if we hold the lock twice from, say, the main thread -- which is where
> we're doing the graph manipulation from.
>
>> co = qemu_coroutine_create(bdrv_flush_co_entry, &flush_co);
>> bdrv_coroutine_enter(bs, co);
>> BDRV_POLL_WHILE(bs, flush_co.ret == NOT_DONE);
>
> BDRV_POLL_WHILE there is what causes the grief, via AIO_WAIT_WHILE,
> which drops only one reference, so we deadlock in bdrv_flush when it is
> called in a recursive context.
>
> Kevin helped me figure it out; I CC'd him off-list on a mail that you
> were also CC'd on ("hang in mirror_exit") that's probably pretty helpful:
>
>> Took a little more than five minutes, but I think I've got it now. The
>> important thing is that the test case is for dataplane, i.e. the job
>> runs in an I/O thread AioContext. Job completion, however, happens in
>> the main loop thread.
>>
>> Therefore, when bdrv_delete() wants to flush the node first, it needs to
>> run bdrv_co_flush() in a foreign AioContext, so the coroutine is
>> scheduled. The I/O thread backtrace shows that it's waiting for the
>> AioContext lock before it can actually enter the bdrv_co_flush()
>> coroutine, so we must have deadlocked:
OK, that makes more sense. I still would have thought that you should
always be allowed to take an AioContext lock more than once from a
single thread, but on my way through the commit log I found
bd6458e410c1e7, d79df2a2ceb3cb07711, and, maybe most importantly,
17e2a4a47d4. So maybe not.
So at some point we decided that, yeah, you can take them multiple times
in a single context, and, yeah, that was how it was designed, but don't
do that if you expect a BDRV_POLL_WHILE().
OK. Got it now.
Max