From: Dr. David Alan Gilbert
Subject: Re: [Qemu-block] [Qemu-devel] [PATCH] migration: re-active images when migration fails to complete
Date: Thu, 8 Dec 2016 20:02:31 +0000
User-agent: Mutt/1.7.1 (2016-10-04)
* Hailiang Zhang (address@hidden) wrote:
> Hi,
>
> On 2016/12/6 23:24, Dr. David Alan Gilbert wrote:
> > * Kevin Wolf (address@hidden) wrote:
> > > Am 19.11.2016 um 12:43 hat zhanghailiang geschrieben:
> > > > commit fe904ea8242cbae2d7e69c052c754b8f5f1ba1d6 fixed a case in which
> > > > migration aborted QEMU because it did not regain control of the images
> > > > after some errors happened.
> > > >
> > > > Actually, we have another case in that error path that aborts QEMU
> > > > for the same reason:
> > > >   migration_thread()
> > > >     migration_completion()
> > > >       bdrv_inactivate_all() ---------------> inactivate images
> > > >       qemu_savevm_state_complete_precopy()
> > > >         socket_writev_buffer() ------------> error because destination fails
> > > >       qemu_fflush() ---------------------> set error on migration stream
> > > >     qemu_mutex_unlock_iothread() ----------> unlock
> > > >   qmp_migrate_cancel() --------------------> user cancelled migration
> > > >     migrate_set_state() -------------------> set migrate CANCELLING
> > >
> > > Important to note here: qmp_migrate_cancel() is executed by a concurrent
> > > thread, it doesn't depend on any code paths in migration_completion().
> > >
> > > >   migration_completion() ----------------> go on to fail_invalidate
> > > >     if (s->state == MIGRATION_STATUS_ACTIVE) -> skip this branch
> > > >   migration_thread() --------------------> break migration loop
> > > >   vm_start() ----------------------------> restart guest with inactive images
> > > > We failed to regain control of the images because we only regain it
> > > > while the migration state is "active", but here the user cancelled the
> > > > migration after noticing that some error happened (for example, the
> > > > libvirtd daemon was shut down unexpectedly on the destination).
> > > >
> > > > Signed-off-by: zhanghailiang <address@hidden>
> > > > ---
> > > > migration/migration.c | 3 ++-
> > > > 1 file changed, 2 insertions(+), 1 deletion(-)
> > > >
> > > > diff --git a/migration/migration.c b/migration/migration.c
> > > > index f498ab8..0c1ee6d 100644
> > > > --- a/migration/migration.c
> > > > +++ b/migration/migration.c
> > > > @@ -1752,7 +1752,8 @@ fail_invalidate:
> > > > /* If not doing postcopy, vm_start() will be called: let's regain
> > > > * control on images.
> > > > */
> > > > - if (s->state == MIGRATION_STATUS_ACTIVE) {
> > >
> > > This if condition tries to check whether we ran the code path that
> > > called bdrv_inactivate_all(), so that we only try to reactivate the
> > > images if we really inactivated them first.
> > >
> > > The problem with it is that it ignores a possible concurrent
> > > modification of s->state.
> > >
> > > > + if (s->state == MIGRATION_STATUS_ACTIVE ||
> > > > + s->state == MIGRATION_STATUS_CANCELLING) {
> > >
> > > This adds another state that we could end up in via a concurrent
> > > modification, so that even in this case we undo the inactivation.
> > >
> > > However, it is no longer limited to the cases where we inactivated the
> > > image. It also applies to other code paths (like the postcopy one) where
> > > we didn't inactivate images.
> > >
> > > What saves the patch is that bdrv_invalidate_cache() is a no-op for
> > > block devices that aren't inactivated, so calling it more often than
> > > necessary is okay.
> > >
> > > But then, if we're going to rely on this, it would be much better to
> > > just remove the if altogether. I can't say whether there are any other
> > > possible values of s->state that we should consider, and by removing the
> > > if we would be guaranteed to catch all of them.
> > >
> > > If we don't want to rely on it, just keep a local bool that remembers
> > > whether we inactivated images and check that here.
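
A minimal sketch of that local-flag idea, with deliberately simplified
control flow and a hypothetical 'inactivated' variable; the names follow
migration/migration.c of that era, but this is illustrative, not the
actual code:

    /* Sketch only: simplified shape of migration_completion() using a
     * local flag instead of re-checking s->state at fail_invalidate. */
    static void migration_completion(MigrationState *s, int current_active_state)
    {
        bool inactivated = false;  /* hypothetical: did *this* thread inactivate? */
        int ret;

        qemu_mutex_lock_iothread();
        ret = bdrv_inactivate_all();
        if (ret >= 0) {
            inactivated = true;
            ret = qemu_savevm_state_complete_precopy(s->to_dst_file, false);
        }
        qemu_mutex_unlock_iothread();

        if (ret < 0) {
            goto fail_invalidate;
        }
        migrate_set_state(&s->state, current_active_state,
                          MIGRATION_STATUS_COMPLETED);
        return;

    fail_invalidate:
        /* Decide from what this thread actually did, not from s->state,
         * which a concurrent qmp_migrate_cancel() may have changed. */
        if (inactivated) {
            Error *local_err = NULL;
            bdrv_invalidate_cache_all(&local_err);
            if (local_err) {
                error_report_err(local_err);
            }
        }
        migrate_set_state(&s->state, current_active_state,
                          MIGRATION_STATUS_FAILED);
    }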
> > >
> > > > Error *local_err = NULL;
> > > >
> > > > bdrv_invalidate_cache_all(&local_err);
> > >
> > > So in summary, this is a horrible patch because it checks the wrong
> > > thing, and I can't really say whether it covers everything it needs to
> > > cover, but arguably it happens to correctly fix the outcome of a
> > > previously failing case.
> > >
> > > Normally I would reject such a patch and require a clean solution, but
> > > then we're on the day of -rc3, so if you can't send v2 right away, we
> > > might not have the time for it.
> > >
> > > Tough call...
> >
> > Hmm, this case is messy; I created this function by splitting it out
> > of the main loop a couple of years back, but it got messier as more
> > s->state checks were added. As far as I can tell it has always done
> > the transition to COMPLETED at the end, well after the locked section,
> > so there has always been a chance that a cancellation sneaks in just
> > before or just after the locked section.
> >
> > Some of the bad cases that can happen:
> > a) A cancel sneaks in after the ACTIVE check but before or after the
> >    locked section; should we reactivate the disks? Well, that depends
> >    on whether the destination actually got the full migration stream -
> >    we don't know! If the destination actually starts running, we must
> >    not reactivate the disk on the source even if the CPU is stopped.
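
The window in (a) is a classic check-then-act race. Here is a
self-contained toy program (plain C with pthreads; all names are
hypothetical, none of this is QEMU code) showing how a stale state check
can disagree with the state a concurrent canceller leaves behind:

    #include <pthread.h>
    #include <stdio.h>
    #include <unistd.h>

    /* Toy model: one thread acts on a stale state check while another
     * thread cancels concurrently. */
    enum { ACTIVE, CANCELLING, COMPLETED };
    static int state = ACTIVE;
    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

    static void *canceller(void *arg)
    {
        (void)arg;
        pthread_mutex_lock(&lock);      /* qmp_migrate_cancel() analogue */
        if (state == ACTIVE) {
            state = CANCELLING;
        }
        pthread_mutex_unlock(&lock);
        return NULL;
    }

    int main(void)
    {
        pthread_t t;

        pthread_mutex_lock(&lock);
        int seen = state;               /* the "s->state == ACTIVE" check */
        pthread_mutex_unlock(&lock);    /* ...and the lock is dropped here */

        pthread_create(&t, NULL, canceller, NULL);
        usleep(10000);                  /* widen the race window for the demo */

        pthread_mutex_lock(&lock);
        if (seen == ACTIVE) {           /* acting on the stale check */
            printf("check said ACTIVE, but state is now %s\n",
                   state == CANCELLING ? "CANCELLING" : "ACTIVE");
        }
        pthread_mutex_unlock(&lock);

        pthread_join(t, NULL);
        return 0;
    }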
> >
>
> Yes, we don't have a mechanism to know exactly whether or not the VM
> state was fully received on the destination.
>
> > b) If bdrv_inactivate_all() fails for one device, but the others are
> >    fine, we go down the fail: label and don't reactivate, so the
> >    source dies even though it might have been mostly OK.
> >
>
> > We can move the _lock to before the check of s->state at the top,
> > which would stop the cancel sneaking in early.
> > In the case where postcopy was never enabled (!migrate_postcopy_ram()),
> > we can move the COMPLETED transition inside the lock as well; I think
> > we then become reasonably safe.
> > In the case where postcopy was enabled, I think we can do the COMPLETED
> > transition before waiting for the return path to close - but I need to
> > think more about that.
> > And there seem to be some dodgy cases where we call the invalidate
> > there after a late postcopy failure; that's bad, we shouldn't reactivate
> > the source disks after going into postcopy.
> >
> > So, in summary; this function is a mess - it needs a much bigger
> > fix than this patch.
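
A rough sketch of the reordering suggested above for the non-postcopy
path, holding the big lock across the state check, the inactivation and
the COMPLETED transition (illustrative only; the real function has more
paths to handle):

    /* Sketch only: with the lock taken before the state check and the
     * COMPLETED transition moved inside it, qmp_migrate_cancel() can no
     * longer slip in between the check and the action. */
    qemu_mutex_lock_iothread();
    if (s->state == MIGRATION_STATUS_ACTIVE) {
        ret = bdrv_inactivate_all();
        if (ret >= 0) {
            ret = qemu_savevm_state_complete_precopy(s->to_dst_file, false);
        }
        if (ret >= 0 && !migrate_postcopy_ram()) {
            migrate_set_state(&s->state, MIGRATION_STATUS_ACTIVE,
                              MIGRATION_STATUS_COMPLETED);
        }
    }
    qemu_mutex_unlock_iothread();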
> >
>
> So what's the conclusion? Will you send a patch to fix it, or shall we
> fix it step by step?
> I think Kevin's suggestion to just remove the *if* check is OK.
I don't see that any of the simple solutions are easy, so I plan
to look at it properly, but I'm not sure when; if you have time
to do it, then feel free.
Dave
>
> Thanks,
> Hailiang
>
> > Dave
> >
> > > Kevin
> > >
> > --
> > Dr. David Alan Gilbert / address@hidden / Manchester, UK
> >
> >
>
--
Dr. David Alan Gilbert / address@hidden / Manchester, UK