
Re: [Qemu-devel] [PATCH] migration: invalidate cache before source start


From: John Snow
Subject: Re: [Qemu-devel] [PATCH] migration: invalidate cache before source start
Date: Mon, 8 Oct 2018 16:21:05 -0400
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:52.0) Gecko/20100101 Thunderbird/52.9.1


On 10/08/2018 11:36 AM, Vladimir Sementsov-Ogievskiy wrote:
> 26.06.2018 11:44, Vladimir Sementsov-Ogievskiy wrote:
>> 25.06.2018 20:50, Dr. David Alan Gilbert wrote:
>>> * Dr. David Alan Gilbert (address@hidden) wrote:
>>>> * Vladimir Sementsov-Ogievskiy (address@hidden) wrote:
>>>>> 15.06.2018 15:06, Dr. David Alan Gilbert wrote:
>>>>>> * Vladimir Sementsov-Ogievskiy (address@hidden) wrote:
>>>>>>> Invalidate cache before source start in case of failed migration.
>>>>>>>
>>>>>>> Signed-off-by: Vladimir Sementsov-Ogievskiy <address@hidden>
>>>>>> Why doesn't the code at the bottom of migration_completion
>>>>>> (fail_invalidate:) and the code in migrate_fd_cancel handle this?
>>>>>>
>>>>>> What case did you see it in that those didn't handle?
>>>>>> (Also I guess it probably should set s->block_inactive = false)
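>>>>>>
>>>>>> For reference, the recovery I mean looks roughly like this (the
>>>>>> bottom of migration_completion, paraphrased from memory, so treat
>>>>>> it as a sketch rather than the exact code):
>>>>>>
>>>>>> fail_invalidate:
>>>>>>     /* If not doing postcopy, vm_start() will be called, so we
>>>>>>      * need to regain control of the images. */
>>>>>>     if (s->state == MIGRATION_STATUS_ACTIVE) {
>>>>>>         Error *local_err = NULL;
>>>>>>
>>>>>>         qemu_mutex_lock_iothread();
>>>>>>         bdrv_invalidate_cache_all(&local_err);
>>>>>>         if (local_err) {
>>>>>>             error_report_err(local_err);
>>>>>>         }
>>>>>>         qemu_mutex_unlock_iothread();
>>>>>>     }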
>>>>> on source I see:
>>>>>
>>>>> address@hidden:migrate_set_state new state 7
>>>>> address@hidden:migration_thread_file_err
>>>>> address@hidden:migration_thread_after_loop
>>>>>
>>>>> so, we are leaving loop on
>>>>>          if (qemu_file_get_error(s->to_dst_file)) {
>>>>>              migrate_set_state(&s->state, current_active_state,
>>>>>                                MIGRATION_STATUS_FAILED);
>>>>>              trace_migration_thread_file_err();
>>>>>              break;
>>>>>          }
>>>>>
>>>>> and skip migration_completion()
>>
>>
>> John is right, this is an unrelated log; here we fail before 
>> inactivation, so there is no problem.
>>
>> The actual problem is when we fail in postcopy_start, at the end. The 
>> source log looks like:
>>
>> address@hidden:migrate_set_state new state 1
>> address@hidden:migration_fd_outgoing fd=101
>> address@hidden:migration_set_outgoing_channel 
>> ioc=0x56363454d630 ioctype=qio-channel-socket hostname=(null)
>> address@hidden:migration_bitmap_sync_start
>> address@hidden:migration_bitmap_sync_end dirty_pages 932
>> address@hidden:migrate_set_state new state 4
>> address@hidden:migration_thread_setup_complete
>> address@hidden:migrate_pending pending size 1107976192 max 0 
>> (pre = 0 compat=1107976192 post=0)
>> address@hidden:migrate_set_state new state 5
>> Tap fd 33 disable, ret 0
>> address@hidden:migration_bitmap_sync_start
>> address@hidden:migration_bitmap_sync_end dirty_pages 1091
>> address@hidden:migrate_global_state_pre_save saved state: 
>> running
>> 2018-06-26T08:29:56.439134Z qemu-kvm: postcopy_start: Migration stream 
>> errored -5
>> address@hidden:migrate_set_state new state 7
>> address@hidden:migration_thread_after_loop
>> Tap fd 33 enable, ret 0
>> address@hidden:migrate_fd_cleanup
>> qemu-kvm: block/io.c:1655: bdrv_co_pwritev: Assertion 
>> `!(bs->open_flags & 0x0800)' failed.
>> 2018-06-26 08:29:56.605+0000: shutting down, reason=crashed
>>
>>
>>>> Yeh, OK; I'd seen something else a few days ago, where a cancellation
>>>> test that had previously ended in a 'cancelled' state now ends up
>>>> in 'failed' (which is the state 7 you have above).
>>>> I suspect there's something else going on as well; I think what is
>>>> supposed to happen in the case of 'cancel' is that it spins in 
>>>> 'cancelling' for
>>>> a while in migrate_fd_cancel and then at the bottom of 
>>>> migrate_fd_cancel
>>>> it does the recovery; but because it's going to 'failed' instead,
>>>> it's jumping over that recovery.
>>> Going back and actually looking at the patch again,
>>> can I ask for one small change:
>>> can you set s->block_inactive = false in the case where you
>>> don't get the local_err (like we do at the bottom of migrate_fd_cancel)?
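>>>
>>> Something like this, mirroring what migrate_fd_cancel does (only a
>>> sketch of the shape I mean; the exact context in your patch may
>>> differ):
>>>
>>>     bdrv_invalidate_cache_all(&local_err);
>>>     if (local_err) {
>>>         error_report_err(local_err);
>>>     } else {
>>>         s->block_inactive = false;   /* disks are active again */
>>>     }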
>>>
>>>
>>> Does that make sense?
>>
>> Ok, I'll resend.
>>
>> Hm, looks like I'm fixing an outdated version (based on v2.9.0), and my 
>> reproducer doesn't apply to upstream.
>> But it looks like the current code has the possibility of the same failure:
>>
>> postcopy_start()
>>     ....
>>     ret = qemu_file_get_error(ms->to_dst_file);
>>     if (ret) {
>>         error_report("postcopy_start: Migration stream errored");
>>
>> leads to "return MIG_ITERATE_SKIP;" in migration_iteration_run
>>
>> then the loop should finish, as state should be 
>> MIGRATION_STATUS_FAILED, so we will not call migration_completion.
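>>
>> For reference, the branch I mean in migration_iteration_run() is 
>> roughly this (paraphrasing the current code, so details may differ):
>>
>>     if (postcopy_start(s)) {
>>         error_report("%s: postcopy failed to start", __func__);
>>     }
>>     return MIG_ITERATE_SKIP;   /* nothing on this path re-activates disks */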
>>
>> Hm, I have questions now:
>>
>> 1. Should we check s->block_inactive and, if it is false, skip the 
>> invalidation (see the sketch below)? That check is done in 
>> migrate_fd_cancel(), but not in migration_completion().
>> 2. Should we call qemu_mutex_lock_iothread() like in 
>> migration_completion()? Why is it needed in migration_completion() 
>> when the vm is not running?
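>>
>> To illustrate both questions, a guard like this is what I have in mind 
>> (a hypothetical placement, not actual upstream code):
>>
>>     if (s->block_inactive) {
>>         Error *local_err = NULL;
>>
>>         qemu_mutex_lock_iothread();   /* question 2: is this needed? */
>>         bdrv_invalidate_cache_all(&local_err);
>>         if (local_err) {
>>             error_report_err(local_err);
>>         } else {
>>             s->block_inactive = false;
>>         }
>>         qemu_mutex_unlock_iothread();
>>     }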
> 
> 
> Hm, a forgotten thread; I should resend, but what do you think about these 
> questions?
> 

Personally, I can't remember where we stand on this thread at all. If
you have changes that you think are still important (because you ran
into the issue again), probably best to send a respin (with a new cover
letter and updated justification) and we'll take it from there.

(I seem to recall we were working off of different versions which
handled migration differently, which was causing confusion...)

--js


