From: Alex Williamson
Subject: [Qemu-devel] Re: [PATCH 0/6] Save state error handling (kill off no_migrate)
Date: Tue, 09 Nov 2010 08:47:00 -0700

On Tue, 2010-11-09 at 17:42 +0200, Michael S. Tsirkin wrote:
> On Tue, Nov 09, 2010 at 08:34:54AM -0700, Alex Williamson wrote:
> > On Tue, 2010-11-09 at 17:07 +0200, Michael S. Tsirkin wrote:
> > > On Tue, Nov 09, 2010 at 07:58:23AM -0700, Alex Williamson wrote:
> > > > On Tue, 2010-11-09 at 14:00 +0200, Michael S. Tsirkin wrote:
> > > > > On Mon, Nov 08, 2010 at 02:23:37PM -0700, Alex Williamson wrote:
> > > > > > On Mon, 2010-11-08 at 22:59 +0200, Michael S. Tsirkin wrote:
> > > > > > > On Mon, Nov 08, 2010 at 10:20:46AM -0700, Alex Williamson wrote:
> > > > > > > > On Mon, 2010-11-08 at 18:54 +0200, Michael S. Tsirkin wrote:
> > > > > > > > > On Mon, Nov 08, 2010 at 07:59:57AM -0700, Alex Williamson wrote:
> > > > > > > > > > On Mon, 2010-11-08 at 13:40 +0200, Michael S. Tsirkin wrote:
> > > > > > > > > > > On Wed, Oct 06, 2010 at 02:58:57PM -0600, Alex Williamson wrote:
> > > > > > > > > > > > Our code paths for saving or migrating a VM are full of functions that
> > > > > > > > > > > > return void, leaving no opportunity for a device to cancel a migration,
> > > > > > > > > > > > either from error or incompatibility.  The ivshmem driver attempted to
> > > > > > > > > > > > solve this with a no_migrate flag on the save state entry.  I think the
> > > > > > > > > > > > more generic and flexible way to solve this is to allow driver save
> > > > > > > > > > > > functions to fail.  This series implements that and converts ivshmem
> > > > > > > > > > > > to use a set_params function to NAK migration much earlier in the
> > > > > > > > > > > > process.  This touches a lot of files, but the bulk of those changes are
> > > > > > > > > > > > simply s/void/int/ and tacking a "return 0" to the end of functions.
> > > > > > > > > > > > Thanks,
> > > > > > > > > > > > 
> > > > > > > > > > > > Alex
> > > > > > > > > > > 
> > > > > > > > > > > Well error handling is always tricky: it seems easier to
> > > > > > > > > > > require save handlers to never fail.
> > > > > > > > > > 
> > > > > > > > > > Sure it's easier, but does that make it robust?
> > > > > > > > > 
> > > > > > > > > More robust in the face of what kind of failure?
> > > > > > > > 
> > > > > > > > I really don't understand why we're having a discussion about whether
> > > > > > > > providing a means to return an error is a good thing or not.  These
> > > > > > > > patches touch a lot of files, but the change is dead simple.
> > > > > > > 
> > > > > > > I just don't see the motivation. Presumably your patches are
> > > > > > > there to achieve some kind of goal, right? I am trying to
> > > > > > > figure out what that goal is.
> > > > > > 
> > > > > > My goal is that I want to be able to NAK a migration when devices are
> > > > > > assigned, and I think we can do it more generically than the no_migrate
> > > > > > flag so that it supports this application and any other reason that
> > > > > > saves might fail in the future.
> > > > > 
> > > > > More generically but harder to understand and debug, IMO.
> > > > 
> > > > How is returning an error condition hard to understand?  Debugging seems
> > > > easier to me, especially if drivers follow the precedent set in the last
> > > > patch and fprintf the reason for the failure.  Ideally this would be
> > > > some kind of push out to qmp, but it still seems easier than figuring
> > > > out which driver called register_device_unmigratable().
> > > > 
> > > > > > > Currently savevm callbacks never fail. So they
> > > > > > > return void. Why is returning 0 and adding a bunch of code to test the
> > > > > > > condition that never happens a good idea?  It just seems to create more
> > > > > > > ways for devices to shoot themselves in the foot.
> > > > > > 
> > > > > > And more ways to indicate something bad happened and keep running.  We
> > > > > > already have far too many abort() calls in the code.
> > > > > 
> > > > > If you can keep running why can't you migrate?
> > > > 
> > > > Well, as you know device assignment is tied to the hardware, so can't
> > > > migrate, but can always keep running.  The ivshmem driver has a peer
> > > > role, where it's tied to the host memory, so can't migrate, but can keep
> > > > running.
> > > 
> > > Right. All these are covered with no_migrate flag well enough.
> > > Their inability to migrate does not change at runtime.
> > 
> > But it could.  What if ivshmem is acting in a peer role, but has no
> > clients, could it migrate?  What if ivshmem is migratable when the
> > migration begins, but while the migration continues, a connection is
> > set up and it becomes unmigratable?
> 
> Sounds like something we should work to prevent, not support :)

s/:)/:(/  why?

> >  Using this series, ivshmem would
> > have multiple options how to support this.  It could a) NAK the
> > migration, b) drop connections and prevent new connections until the
> > migration finishes, c) detect that new connections have happened since
> > the migration started and cancel.  And probably more.  no_migrate can
> > only do a).  And in fact, we can only test no_migrate after the VM is
> > stopped (after all memory is migrated) because otherwise it could race
> > with devices setting no_migrate during migration.
> 
> We really want no_migrate to be static. changing it is abusing
> the infrastructure.

You call it abusing, I call it making use of the infrastructure.  Why
unnecessarily restrict ourselves?  Is return 0/-1 really that scary,
unmaintainable, undebuggable?  I don't understand the resistance.

Alex
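
For illustration only, and not code from the patch series: a minimal C sketch
of the idea argued for above, where save handlers return int and the save loop
stops at the first failure.  It models option c) from the list above, with a
device that cancels if a connection appeared after the migration started.
Every name here (save_entry, shm_dev, ok_save, shm_save, save_all) is a
hypothetical stand-in; QEMU's real savevm registration API is not reproduced.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical, simplified save handler type: 0 on success, <0 on failure. */
typedef int (*save_fn)(void *opaque);

/* Hypothetical registry entry; not QEMU's actual save state entry. */
struct save_entry {
    const char *name;
    save_fn save;
    void *opaque;
};

/* Hypothetical state for an ivshmem-like device acting in a peer role. */
struct shm_dev {
    int peers_at_start; /* connections when the migration began */
    int peers_now;      /* connections when the device is saved */
};

/* A device whose save cannot fail: the s/void/int/ plus "return 0" case. */
static int ok_save(void *opaque)
{
    (void)opaque;
    return 0;
}

/* Cancel if a peer connected after the migration started. */
static int shm_save(void *opaque)
{
    struct shm_dev *d = opaque;

    if (d->peers_now != d->peers_at_start) {
        /* Report the reason, so nobody has to hunt for the culprit. */
        fprintf(stderr, "shm: peer connected during migration, cancelling\n");
        return -1;
    }
    return 0;
}

/* Run every handler; stop and propagate the first error. */
static int save_all(const struct save_entry *entries, int count)
{
    for (int i = 0; i < count; i++) {
        assert(entries[i].save != NULL);
        int ret = entries[i].save(entries[i].opaque);
        if (ret < 0) {
            fprintf(stderr, "save failed for device '%s'\n", entries[i].name);
            return ret; /* the caller cancels and the VM keeps running */
        }
    }
    return 0;
}
```

A caller would treat a negative return from save_all() as "cancel the
migration and keep the guest running" rather than calling abort().  A static
no_migrate flag can only express the unconditional refusal; a return code also
covers a condition that changes while the migration is in flight.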





