Re: [Qemu-ppc] Problem with "savevm" on ppc64


From: Dr. David Alan Gilbert
Subject: Re: [Qemu-ppc] Problem with "savevm" on ppc64
Date: Fri, 21 Oct 2016 10:12:28 +0100
User-agent: Mutt/1.7.1 (2016-10-04)

* David Gibson (address@hidden) wrote:
> On Thu, Oct 20, 2016 at 03:17:12PM +0200, Thomas Huth wrote:
> >  Hi all,
> > 
> > I'm currently facing a strange problem with the "savevm" HMP command on
> > ppc64 with TCG and the pseries machine. Steps for reproduction:
> > 
> > 1) Create a disk image:
> >    qemu-img create -f qcow2  /tmp/test.qcow2 1M
> > 
> > 2) Start QEMU (in TCG mode):
> >    qemu-system-ppc64 -nographic -vga none -m 256 -hda /tmp/test.qcow2
> > 
> > 3) Hit "CTRL-a c" to enter the HMP monitor
> > 
> > 4) Type "savevm test1" to save a snapshot
> > 
> > The savevm command then hangs forever and the test.qcow2 image keeps
> > growing and growing.
> > 
> > It seems like qemu_savevm_state_iterate() does not make any more
> > progress because ram_save_iterate() keeps returning 0 ... but why can
> > that happen?
> 
> Hmm.  You don't mention installing anything on the disk image, so I'm
> assuming the VM is just sitting in SLOF, unable to boot.  And it's
> > TCG, so there's no VRMA.  Which means I'd expect the HPT to be
> completely empty.
> 
> I strongly suspect that's what's triggering the problem, although I'm
> not quite sure of the mechanism.
> 
> > I've tinkered with the code a little bit, and I can get it to work with
> > the following patch - which however is quite independent of the
> > ram_save_iterate() code:
> > 
> > diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
> > index ddb7438..a7ac0bf 100644
> > --- a/hw/ppc/spapr.c
> > +++ b/hw/ppc/spapr.c
> > @@ -1290,7 +1290,7 @@ static int htab_save_setup(QEMUFile *f, void *opaque)
> >      return 0;
> >  }
> > 
> > -static void htab_save_first_pass(QEMUFile *f, sPAPRMachineState *spapr,
> > +static int htab_save_first_pass(QEMUFile *f, sPAPRMachineState *spapr,
> >                                   int64_t max_ns)
> >  {
> >      bool has_timeout = max_ns != -1;
> > @@ -1340,6 +1340,8 @@ static void htab_save_first_pass(QEMUFile *f, sPAPRMachineState *spapr,
> >          spapr->htab_first_pass = false;
> >      }
> >      spapr->htab_save_index = index;
> > +
> > +    return !spapr->htab_first_pass;
> >  }
> > 
> >  static int htab_save_later_pass(QEMUFile *f, sPAPRMachineState *spapr,
> > @@ -1444,7 +1446,7 @@ static int htab_save_iterate(QEMUFile *f, void *opaque)
> >              return rc;
> >          }
> >      } else  if (spapr->htab_first_pass) {
> > -        htab_save_first_pass(f, spapr, MAX_ITERATION_NS);
> > +        rc = htab_save_first_pass(f, spapr, MAX_ITERATION_NS);
> >      } else {
> >          rc = htab_save_later_pass(f, spapr, MAX_ITERATION_NS);
> >      }
> > 
> > That means if htab_save_iterate() does not initially return 0, then
> > ram_save_iterate() also does not run into the condition of returning
> > zeroes later. But how are these related? And is it safe to return a
> > non-zero value from htab_save_first_pass()? Does anybody with more
> > experience in this area have a clue what's going on here?
> 
> So, looking at this I think it's unsafe.  htab_save_first_pass() never
> examines dirty bits, so we could get:
>     htab_save_first_pass() called once, saves part of HPT
>     guest dirties an HPTE in the already saved region
>     enter migration completion stage
>     htab_save_first_pass() saves the remainder of the HPT, returns 1
> 
> That would trigger the code to think the HPT migration is complete,
> without ever saving the HPTE that got dirtied part way through.
> 
> But.. then I looked further and got *really* confused.

Ewww, my head.....this is very confusing.

> The comment above qemu_savevm_state_iterate() and the logic in
> qemu_savevm_state() say that the iterate function returns >0 to
> indicate that it is done and we can move onto the completion phase.
> 
> But both ram_save_iterate() and block_save_iterate() seem to have that
> inverted: they return >0 if they actually saved something.

I think you're right that the logic looks inverted somewhere, but be careful
that we have:
    a) The return value of qemu_savevm_state_iterate
    b) The return value of the iterators it calls

Both those return values individually make sense *if* the logic
within qemu_savevm_state_iterate did the right thing; which I'm not
convinced of.

> I think the only reason migration has been working *at all* is that
> migration_thread() and qemu_savevm_state_complete_precopy() ignore the
> return value of the iterate() function (except for <0 errors).
>
> qemu_savevm_state(), however, does not.

However, there are two other reasons it works:
   a) We keep iterating in the migration code while the amount of remaining
      iterable RAM/etc (i.e. the result of qemu_savevm_state_pending) stays
      large (rough sketch after this list).
   b) Even if we never called the _iterate function the stream should be
      correct because the _complete function should ensure that anything
      left is migrated.  I don't *think* this can happen in a normal migration
      but hmm maybe if you set a very large downtime.
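
For (a), the completion decision is roughly of this shape - an illustrative
sketch with invented names and units, not the QEMU source:

#include <stdbool.h>
#include <stdint.h>

/* Sketch of reason (a), not the QEMU source: keep calling the iterators
 * while the pending amount is more than we could transfer within the
 * permitted downtime; only once it drops below that threshold do we
 * move on to the completion phase. */
bool toy_should_complete(uint64_t pending_bytes,
                         uint64_t bandwidth_bytes_per_sec,
                         uint64_t max_downtime_ms)
{
    /* how much could still be transferred within the allowed downtime */
    uint64_t max_size = bandwidth_bytes_per_sec * max_downtime_ms / 1000;

    return pending_bytes <= max_size;
}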

> So, I don't think this is actually a Power bug.  But, it's masked on
> machine types other than pseries, because they will almost never have
> more than one iterable state handler.

Well, they have it in the case of qemu block migration, but I guess that's
probably only used in migration, not savevm.

> I think what must be happening is that
> because we don't have time limits in the savevm case, the first call
> to ram_save_iterate() from qemu_savevm_state_iterate() is saving all
> of RAM.  On x86 that would be the end of the QTAILQ_FOREACH(), we'd
> return non-zero and that would kick qemu_savevm_state() into the
> completion phase.
> 
> But on Power, we save all RAM, then move onto the HPT iterator.  It
> returns 0 on the first pass, and we loop again in
> qemu_savevm_state().  But on the next call, RAM is all saved, so
> ram_save_iterate now returns 0.  That breaks us out of the FOREACH in
> qemu_savevm_state_iterate(), returning 0.  Repeat forever.
> 
> Juan, Dave, sound right?

I think so.
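
To convince myself, here is a tiny standalone model of the loop as described
above (purely illustrative - invented names and numbers, not QEMU code); it
gets stuck in exactly the way Thomas describes, capped here so it terminates:

#include <stdio.h>

/* Toy model, not QEMU code.  Both per-device iterators follow the
 * "return >0 only if I saved something this call" convention, while the
 * outer loop treats >0 as "live phase finished, move on to completion". */

static int ram_left = 1;        /* no time limit: all RAM goes in one call */
static int htab_first_pass = 1;

/* model of ram_save_iterate(): >0 only when it actually saved something */
static int toy_ram_iterate(void)
{
    if (ram_left) {
        ram_left = 0;
        return 1;
    }
    return 0;
}

/* model of htab_save_iterate(): the first-pass path leaves rc at 0 */
static int toy_htab_iterate(void)
{
    if (htab_first_pass) {
        htab_first_pass = 0;
        return 0;
    }
    return 1;                   /* later passes would report completion */
}

/* model of qemu_savevm_state_iterate(): stop at the first iterator that
 * returns <= 0 and pass its value back up */
static int toy_savevm_iterate(void)
{
    int (*iters[])(void) = { toy_ram_iterate, toy_htab_iterate };
    int ret = 1;

    for (unsigned i = 0; i < sizeof(iters) / sizeof(iters[0]); i++) {
        ret = iters[i]();
        if (ret <= 0) {
            break;              /* "do not proceed to the next vmstate" */
        }
    }
    return ret;
}

int main(void)
{
    /* model of qemu_savevm_state(): keep iterating until something
     * positive comes back - capped so the demo terminates */
    for (int pass = 0; pass < 5; pass++) {
        int ret = toy_savevm_iterate();

        printf("pass %d: ret=%d\n", pass, ret);
        if (ret > 0) {
            printf("would move on to the completion phase\n");
            return 0;
        }
    }
    printf("stuck: once RAM is all saved its iterator returns 0, so the "
           "HPT iterator is never reached again and 0 comes back forever\n");
    return 1;
}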

> I suspect the right fix is to make ram_save_iterate() return void.
> Migration will use the bandwidth / downtime calculations to determine
> completion as always.  savevm would skip the "live" phase entirely and
> move directly on to completion phase unconditionally.

Skipping the live phase sounds OK - I haven't got a clue whether any other
iterators will like that.  But then why did qemu_savevm_state() ever bother
calling the iterate functions?

However, there's the comment near the end of qemu_savevm_state_iterate
which I don't think we're obeying (because of the disagreement over the
return values), but it sounds like a reasonable goal:

            /* Do not proceed to the next vmstate before this one reported
               completion of the current stage. This serializes the migration
               and reduces the probability that a faster changing state is
               synchronized over and over again. */

I think the logic here is that, if the return value worked the way the
comment at the top of the function (and the htab iterator) expects, then
with a bandwidth limit you'd end up repeatedly calling ram_save_iterate and
never calling the block iterator until RAM was done; then you'd do some
block, and maybe come back to RAM, but the iteration would always prefer
the items near the top of the list of iterators (small illustration below).
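
Something like this toy loop (invented numbers, not QEMU code) shows the
ordering that convention would give: with a per-pass budget, the first
iterator is drained completely before the second one is ever called:

#include <stdio.h>

/* Toy model, not QEMU code.  Intended convention here: 0 = "not finished
 * yet, call me again", >0 = "done".  A finished iterator keeps reporting
 * done. */

static int ram_chunks = 3;      /* needs three budget-limited passes */
static int block_chunks = 2;

static int toy_ram_iterate(void)
{
    if (ram_chunks > 0) {
        printf("ram pass, saving chunk %d\n", ram_chunks--);
    }
    return ram_chunks == 0;
}

static int toy_block_iterate(void)
{
    if (block_chunks > 0) {
        printf("block pass, saving chunk %d\n", block_chunks--);
    }
    return block_chunks == 0;
}

int main(void)
{
    int (*iters[])(void) = { toy_ram_iterate, toy_block_iterate };
    int ret = 0;

    do {
        for (unsigned i = 0; i < sizeof(iters) / sizeof(iters[0]); i++) {
            ret = iters[i]();
            if (ret <= 0) {
                break;          /* don't touch later iterators yet */
            }
        }
    } while (ret <= 0);

    printf("all iterators report done\n");
    return 0;
}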

Dave

> -- 
> David Gibson                  | I'll have my music baroque, and my code
> david AT gibson.dropbear.id.au        | minimalist, thank you.  NOT _the_ _other_
>                               | _way_ _around_!
> http://www.ozlabs.org/~dgibson


--
Dr. David Alan Gilbert / address@hidden / Manchester, UK


