Re: [Qemu-block] [PATCH] qemu-iotests: fix 203 migration completion race
From: Max Reitz
Subject: Re: [Qemu-block] [PATCH] qemu-iotests: fix 203 migration completion race
Date: Wed, 7 Mar 2018 19:01:53 +0100
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:52.0) Gecko/20100101 Thunderbird/52.6.0

On 2018-03-06 17:18, Stefan Hajnoczi wrote:
> On Mon, Mar 05, 2018 at 05:04:52PM +0100, Max Reitz wrote:
>> On 2018-03-05 16:59, Stefan Hajnoczi wrote:
>>> There is a race between the test's 'query-migrate' QMP command after the
>>> QMP 'STOP' event and completing the migration:
>>>
>>> The test case invokes 'query-migrate' upon receiving 'STOP'. At this
>>> point the migration thread may still be in the process of completing.
>>> Therefore 'query-migrate' can return 'status': 'active' for a brief
>>> window of time instead of 'status': 'completed'. This results in
>>> qemu-iotests 203 hanging.
>>>
>>> Solve the race by enabling the 'events' migration capability, which
>>> causes QEMU to emit migration-specific QMP events that do not suffer
>>> from this race condition. Wait for the QMP 'MIGRATION' event with
>>> 'status': 'completed'.
>>>
>>> Reported-by: Max Reitz <address@hidden>
>>> Signed-off-by: Stefan Hajnoczi <address@hidden>
>>> ---
>>> tests/qemu-iotests/203 | 15 +++++++++++----
>>> tests/qemu-iotests/203.out | 5 +++++
>>> 2 files changed, 16 insertions(+), 4 deletions(-)
>>
>> So much for "the ppoll() dungeon"...
>
> It was still a pain to debug :).
>
> I put a ring buffer into the QMP monitor input/output code.

Oh, wow.

> Then it was
> possible to figure out the issue via GDB on a hung QEMU:
>
> (gdb) p current_run_state
> RUN_STATE_POSTMIGRATE
> (gdb) p current_migration->status
> MIGRATION_STATUS_COMPLETED
> (gdb) p monitor_out_ring
> ...'STOP' event...
> (gdb) p monitor_in_ring
> ...query-migrate... <-- okay, the test checked if migration finished
>
> Then looking at the code:
>
> static void migration_completion(MigrationState *s)
> {
>     ...
>     if (s->state == MIGRATION_STATUS_ACTIVE) {
>         qemu_mutex_lock_iothread();
>         s->downtime_start = qemu_clock_get_ms(QEMU_CLOCK_REALTIME);
>         qemu_system_wakeup_request(QEMU_WAKEUP_REASON_OTHER);
>         s->vm_was_running = runstate_is_running();
>         ret = global_state_store();
>
>         if (!ret) {
>             bool inactivate = !migrate_colo_enabled();
>
>             v---- The stop event comes from here
>             ret = vm_stop_force_state(RUN_STATE_FINISH_MIGRATE);
>             ...
>         }
>         qemu_mutex_unlock_iothread(); <--- oh, no!
>     ...
>     if (!migrate_colo_enabled()) {
>         migrate_set_state(&s->state, current_active_state,
>                           MIGRATION_STATUS_COMPLETED); <-- too late!
>     }
>
>     return;

OK... I guess the answer to this is simply "the 'STOP' event doesn't mean
anything, use the migration events instead" (i.e. what your patch does).
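
For illustration only, the idea of waiting on migration events instead of
reacting to 'STOP' can be sketched in Python roughly as follows. This is not
the code from the patch: the helper names enable_events_command() and
wait_for_migration_completed() and the sample event list are hypothetical,
and the events are plain dicts rather than a live QMP connection. Only the
'migrate-set-capabilities' command, the 'events' capability, and the
'MIGRATION' event with its 'status' field are taken from QMP itself.

```python
# Hypothetical helpers for illustration; they are not part of qemu-iotests.


def enable_events_command():
    """Build the QMP command that turns on the 'events' migration capability."""
    return {
        "execute": "migrate-set-capabilities",
        "arguments": {
            "capabilities": [{"capability": "events", "state": True}],
        },
    }


def wait_for_migration_completed(events):
    """Consume QMP events until a MIGRATION event reports 'completed'.

    'events' is any iterable of decoded QMP event dicts. Returns the list of
    MIGRATION statuses seen, in order. Raises RuntimeError if the migration
    fails or the stream ends first.
    """
    seen = []
    for ev in events:
        if ev.get("event") != "MIGRATION":
            # Other events (including STOP) say nothing reliable about
            # whether the migration has finished, so they are skipped.
            continue
        status = ev.get("data", {}).get("status")
        seen.append(status)
        if status == "completed":
            return seen
        if status == "failed":
            raise RuntimeError("migration failed")
    raise RuntimeError("event stream ended before migration completed")


# Made-up event sequence: STOP arrives before the final status change,
# which is the ordering that caused the hang discussed above.
sample_events = [
    {"event": "MIGRATION", "data": {"status": "setup"}},
    {"event": "MIGRATION", "data": {"status": "active"}},
    {"event": "STOP"},
    {"event": "MIGRATION", "data": {"status": "completed"}},
]
statuses = wait_for_migration_completed(sample_events)
```

The point of the sketch is that the loop never issues 'query-migrate' in
response to 'STOP', so it cannot observe the transient 'active' status.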

Thanks a lot, applied to my block branch:

https://github.com/XanClic/qemu/commits/block

Max