qemu-devel

From: Peter Xu
Subject: Re: [PATCH RFC 4/5] cpu: Allow cpu_synchronize_all_post_init() to take an errp
Date: Fri, 10 Jun 2022 10:19:57 -0400

On Thu, Jun 09, 2022 at 05:02:29PM -0400, Peter Xu wrote:
> On Wed, Jun 08, 2022 at 06:05:28PM +0100, Dr. David Alan Gilbert wrote:
> > > @@ -2005,7 +2005,17 @@ static void loadvm_postcopy_handle_run_bh(void *opaque)
> > >      /* TODO we should move all of this lot into postcopy_ram.c or a shared code
> > >       * in migration.c
> > >       */
> > > -    cpu_synchronize_all_post_init();
> > > +    cpu_synchronize_all_post_init(&local_err);
> > > +    if (local_err) {
> > > +        /*
> > > +         * TODO: a better way to do this is to tell the src that we cannot
> > > +         * run the VM here so hopefully we can keep the VM running on src
> > > +         * and immediately halt the switch-over.  But that needs work.
> > 
> > Yes, I think it is possible; unlike some of the later errors in the same
> > function, in this case we know no disks/network/etc have been touched,
> > so we should be able to recover.
> > I wonder if we can move the postcopy_state_set(POSTCOPY_INCOMING_RUNNING)
> > out of loadvm_postcopy_handle_run to after this point.
> > 
> > We've already got the return path, so we should be able to signal the
> > failure unless we're very unlucky.
> 
> Right.  It's just that for the new ACK we'd need to modify the return
> path protocol, because none of the existing messages can carry such
> information.
> 
> One idea is to reuse MIG_RP_MSG_RESUME_ACK.  So far it has only been used
> by postcopy recovery for the final handshake, and only with payload=1
> (which is defined as MIGRATION_RESUME_ACK_VALUE).  We could fill in the
> payload with some !1 value to tell the source that we NACK the migration,
> so that the src fails the migration as soon as possible?
> 
> That even seems to be compatible with the scenario of an old qemu
> migrating to a new qemu, because when the old qemu notices a
> MIG_RP_MSG_RESUME_ACK message with a !1 payload, it'll mark the rp bad:

Oh, it won't be compatible..  The clean way to do this is to modify the src
qemu to halt in postcopy_start() and wait for that ack before continuing.
That may need another cap/param to enable.

The thing is I'm not very sure whether this will be worth it.

Migrations that fail on putting registers due to incompatibility should be
rare.  For the issue I was working on, it was actually a kernel bug that
triggered it; it was just hard to figure out what went wrong.  With properly
working kernels and matching hosts this should not really happen.  I'm
worried that adding too much complexity would over-engineer things without
much benefit.
  
In that case, I'd think it proper if we start with what this patchset
provides, which at least allows us to fail in a crystal clear way?
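
As a reminder of what "fail in a clear way" means here, below is a minimal
sketch of the errp propagation pattern.  It assumes a toy error type rather
than QEMU's real Error API: ToyError, toy_put_registers() and
toy_synchronize_all_post_init() are hypothetical stand-ins, and the
"bad_cpu" parameter exists only to simulate a failing vCPU.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for QEMU's Error object; not the real API. */
typedef struct ToyError {
    const char *msg;
} ToyError;

/* Pretend to push registers for one vCPU; fails when cpu == bad_cpu. */
static bool toy_put_registers(int cpu, int bad_cpu, ToyError **errp)
{
    static ToyError put_failed = { "failed to put registers" };

    /* Callers must pass a valid errp that holds no earlier error. */
    assert(errp != NULL && *errp == NULL);

    if (cpu == bad_cpu) {
        *errp = &put_failed;
        return false;
    }
    return true;
}

/* Stops at the first failing vCPU and leaves the reason in *errp. */
static bool toy_synchronize_all_post_init(int nr_cpus, int bad_cpu,
                                          ToyError **errp)
{
    for (int cpu = 0; cpu < nr_cpus; cpu++) {
        if (!toy_put_registers(cpu, bad_cpu, errp)) {
            return false;
        }
    }
    return true;
}
```

The caller checks the return value (or the error pointer) and can report
the reason and stop, instead of continuing with half-synchronized vCPUs.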

> 
>   if (migrate_handle_rp_resume_ack(ms, tmp32)) {
>       mark_source_rp_bad(ms);
>       goto out;
>   }
> 
>   static int migrate_handle_rp_resume_ack(MigrationState *s, uint32_t value)
>   {
>       trace_source_return_path_thread_resume_ack(value);
>   
>       if (value != MIGRATION_RESUME_ACK_VALUE) {
>           error_report("%s: illegal resume_ack value %"PRIu32,
>                        __func__, value);
>           return -1;
>       }
>       ...
>   }
> 
> If it looks generally good, I can try with such a change in v2.
> 
> Thanks,
> 
> -- 
> Peter Xu

-- 
Peter Xu



