
From: Alex Williamson
Subject: [Qemu-devel] Re: [PATCH 0/6] Save state error handling (kill off no_migrate)
Date: Tue, 09 Nov 2010 08:34:54 -0700

On Tue, 2010-11-09 at 17:07 +0200, Michael S. Tsirkin wrote:
> On Tue, Nov 09, 2010 at 07:58:23AM -0700, Alex Williamson wrote:
> > On Tue, 2010-11-09 at 14:00 +0200, Michael S. Tsirkin wrote:
> > > On Mon, Nov 08, 2010 at 02:23:37PM -0700, Alex Williamson wrote:
> > > > On Mon, 2010-11-08 at 22:59 +0200, Michael S. Tsirkin wrote:
> > > > > On Mon, Nov 08, 2010 at 10:20:46AM -0700, Alex Williamson wrote:
> > > > > > On Mon, 2010-11-08 at 18:54 +0200, Michael S. Tsirkin wrote:
> > > > > > > On Mon, Nov 08, 2010 at 07:59:57AM -0700, Alex Williamson wrote:
> > > > > > > > On Mon, 2010-11-08 at 13:40 +0200, Michael S. Tsirkin wrote:
> > > > > > > > > > On Wed, Oct 06, 2010 at 02:58:57PM -0600, Alex Williamson wrote:
> > > > > > > > > > > Our code paths for saving or migrating a VM are full of functions that
> > > > > > > > > > > return void, leaving no opportunity for a device to cancel a migration,
> > > > > > > > > > > either from error or incompatibility.  The ivshmem driver attempted to
> > > > > > > > > > > solve this with a no_migrate flag on the save state entry.  I think the
> > > > > > > > > > > more generic and flexible way to solve this is to allow driver save
> > > > > > > > > > > functions to fail.  This series implements that and converts ivshmem
> > > > > > > > > > > to use a set_params function to NAK migration much earlier in the
> > > > > > > > > > > process.  This touches a lot of files, but the bulk of those changes are
> > > > > > > > > > > simply s/void/int/ and tacking a "return 0" onto the end of functions.
> > > > > > > > > > Thanks,
> > > > > > > > > > 
> > > > > > > > > > Alex
> > > > > > > > > 
> > > > > > > > > Well error handling is always tricky: it seems easier to
> > > > > > > > > require save handlers to never fail.
> > > > > > > > 
> > > > > > > > Sure it's easier, but does that make it robust?
> > > > > > > 
> > > > > > > > More robust in the face of what kind of failure?
> > > > > > 
> > > > > > > I really don't understand why we're having a discussion about whether
> > > > > > > providing a means to return an error is a good thing or not.  These
> > > > > > > patches touch a lot of files, but the change is dead simple.
> > > > > 
> > > > > I just don't see the motivation. Presumably your patches are
> > > > > there to achieve some kind of goal, right? I am trying to
> > > > > figure out what that goal is.
> > > > 
> > > > My goal is that I want to be able to NAK a migration when devices are
> > > > assigned, and I think we can do it more generically than the no_migrate
> > > > flag so that it supports this application and any other reason that
> > > > saves might fail in the future.
> > > 
> > > More generically but harder to understand and debug, IMO.
> > 
> > How is returning an error condition hard to understand?  Debugging seems
> > easier to me, especially if drivers follow the precedent set in the last
> > patch and fprintf the reason for the failure.  Ideally this would be
> > some kind of push out to qmp, but it still seems easier than figuring
> > out which driver called register_device_unmigratable().
> > 
> > > > > > Currently savevm callbacks never fail. So they
> > > > > > return void. Why is returning 0 and adding a bunch of code to test a
> > > > > > condition that never happens a good idea?  It just seems to create more
> > > > > > ways for devices to shoot themselves in the foot.
> > > > 
> > > > And more ways to indicate something bad happened and keep running.  We
> > > > already have far too many abort() calls in the code.
> > > 
> > > If you can keep running why can't you migrate?
> > 
> > Well, as you know device assignment is tied to the hardware, so can't
> > migrate, but can always keep running.  The ivshmem driver has a peer
> > role, where it's tied to the host memory, so can't migrate, but can keep
> > running.
> 
> Right. All these are covered with no_migrate flag well enough.
> Their inability to migrate does not change at runtime.

But it could.  What if ivshmem is acting in a peer role, but has no
clients, could it migrate?  What if ivshmem is migratable when the
migration begins, but while the migration continues, a connection is
set up and it becomes unmigratable?  Using this series, ivshmem would
have multiple options for how to support this.  It could a) NAK the
migration, b) drop connections and prevent new connections until the
migration finishes, or c) detect that new connections have happened
since the migration started and cancel.  And probably more.  no_migrate
can only do a).  And in fact, we can only test no_migrate after the VM
is stopped (after all memory is migrated) because otherwise it could
race with devices setting no_migrate during migration.
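
To make options a), b) and c) concrete, here is a minimal sketch of how a
device might implement each one once its hooks are allowed to return an
error.  Everything in it is hypothetical: struct peer_dev, the policy
enum and the three functions are illustration only, not the real ivshmem
code and not the interface added by this series.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical policies mirroring options a), b) and c). */
enum peer_policy { POLICY_NAK, POLICY_DROP, POLICY_CANCEL_ON_CHANGE };

/* Hypothetical device state; not the real ivshmem structure. */
struct peer_dev {
    enum peer_policy policy;
    int connections;       /* live peer connections */
    int conn_generation;   /* bumped on every accepted connection */
    int gen_at_start;      /* snapshot taken when migration starts */
    bool migrating;
};

/* Early hook: return 0 to proceed, negative errno to NAK. */
int peer_migration_start(struct peer_dev *d)
{
    assert(d != NULL);
    if (d->policy == POLICY_NAK && d->connections > 0) {
        return -EBUSY;              /* a) refuse up front */
    }
    if (d->policy == POLICY_DROP) {
        d->connections = 0;         /* b) drop existing peers ... */
    }
    d->gen_at_start = d->conn_generation;
    d->migrating = true;
    return 0;
}

/* A peer tries to connect; return 0 if the connection is accepted. */
int peer_connect(struct peer_dev *d)
{
    assert(d != NULL);
    if (d->migrating && d->policy == POLICY_DROP) {
        return -EAGAIN;             /* b) ... and hold off new ones */
    }
    d->connections++;
    d->conn_generation++;
    return 0;
}

/* Late hook (save time): may still fail and cancel the migration. */
int peer_save(struct peer_dev *d)
{
    assert(d != NULL);
    if (d->policy == POLICY_CANCEL_ON_CHANGE &&
        d->conn_generation != d->gen_at_start) {
        return -EBUSY;              /* c) connections changed mid-flight */
    }
    if (d->policy == POLICY_NAK && d->connections > 0) {
        return -EBUSY;              /* a peer raced in after the start check */
    }
    return 0;
}
```

A static flag can only express the first branch of peer_migration_start();
the other two paths need a hook that runs during migration and can fail.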

> > > > > > > > > > So there's a bunch of code here but what exactly is the benefit?
> > > > > > > > > > Since save handlers have no idea what the remote does,
> > > > > > > > > > what is the compatibility you mention?
> > > > > > > > 
> > > > > > > > > There are two users I currently have in mind.  ivshmem currently makes
> > > > > > > > > use of register_device_unmigratable() because it makes use of host
> > > > > > > > > specific resources and connections (aiui).  This sets the no_migrate
> > > > > > > > > flag, which is not dynamic and a bit of a band-aid.  The other is
> > > > > > > > > device assignment, which needs a way to NAK a migration since physical
> > > > > > > > > devices are never migratable.
> > > > > > > 
> > > > > > > > Well since all these can't be migrated ever, a fixed property actually
> > > > > > > > seems a good match.  Sure it's not dynamic but all the easier to debug.
> > > > > > > 
> > > > > > > > >  I imagine we could at some point have
> > > > > > > > > devices with state tied to other features that can't always be detached
> > > > > > > > > from the host; this tries to provide the infrastructure for that to
> > > > > > > > > happen.
> > > > > > > > 
> > > > > > > > Alex
> > > > > > > 
> > > > > > > Let guest control whether you can migrate?
> > > > > > > Sounds like something that is more likely to be abused
> > > > > > > than used constructively. 
> > > > > > 
> > > > > > s/guest/device/  So you would rather the migration failed on the
> > > > > > incoming side where it may not be detected
> > > > > 
> > > > > And incoming migration handlers *must* validate the input, anyway.
> > > > > We should not plaster over this with checks on outgoing side.
> > > > 
> > > > I'm not in any way suggesting incoming shouldn't do validation.
> > > 
> > > So that's enough to detect the problem.
> > 
> > No.  Let's say I have a migration source with an assigned device
> > (rombar=0 to even avoid ramblock migration issues), the migration target
> > is identical except it doesn't include the assigned device.  pci-assign
> > on the source can't NAK a migration because save doesn't currently allow
> > error returns.  The target doesn't even have the driver loaded, so
> > there's nothing to NAK the load... migration happens and the device
> > disappeared, wait for crash.  Maybe we could assume that the user did
> > something sane and used pci-assign on the target to match the source,
> > then we could NAK the load, but only after we wait for the entire memory
> > state of the guest to be transferred.
> 
> So set no_migrate flag and that should be enough.

no_migrate is sufficient for today's usage, but undesirable IMO.

> > > > > > or it may be detected too
> > > > > > late to stop the migration?
> > > > > > 
> > > > > > Alex
> > > > > 
> > > > > So there's a bug and device is in an unexpected state.
> > > > > What can we do? Assert, print an error, notify guest - all these
> > > > > come to mind. But stop migration? Seems arbitrary.
> > > > 
> > > > Perhaps the problem is that either an assert or an fprintf are the first
> > > > things that come to mind.  We shouldn't have guests randomly blowing up
> > > > or telling users to go scan through their log files to find errors.
> > > > It's not very hard to allow simple error handling, so why shouldn't our
> > > > first plan of attack be to return an error so that the human/qmp monitor
> > > > > can detect it and inform the user?  For the current candidates for this
> > > > interface, there's no point notifying the guest, it's the interface
> > > > attempting to do the migration that needs to know there's something
> > > > blocking it.
> > > > 
> > > > Alex
> > > 
> > > I still don't understand, I am sorry.  When will migration fail?
> > > Assigned devices always fail migration so it's not a good example.
> > 
> > Seems like the perfect example, especially in the scenario above where
> > load failure is insufficient.  This is why the no_migrate flag was
> > introduced and why ivshmem makes use of it today.  This series starts
> > from the assumption that we need a way to NAK a migration, can we do it
> > better than the no_migrate flag, generically, and as early as possible
> > in the process.
> > 
> > Alex
> 
> no_migrate seems better in that we can check it at any point,
> unlike tying it to save callback which can only be invoked
> with VM stopped.

Please see the patches, I switched ivshmem over to NAK in set_params, which
happens before the VM is stopped and before memory is migrated.  An
interface could be added to call this at any point, it's dynamic, and it
keeps all of the tests for when and why a driver might NAK in the driver
itself.
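
As a rough sketch of that flow (hypothetical names throughout; the
handler type, struct save_entry and migration_precheck() are assumptions
made for illustration and are not the signatures used in the patches),
the migration core asks every registered entry up front and stops at the
first NAK, before any memory has been transferred:

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical handler type: return 0 to allow migration, non-zero to NAK. */
typedef int (*set_params_fn)(void *opaque);

/* Hypothetical save state entry; the real one carries more fields. */
struct save_entry {
    const char *idstr;
    set_params_fn set_params;   /* may be NULL if the device has no opinion */
    void *opaque;
};

/*
 * Ask every entry whether migration may start.  Returns 0 if nobody
 * objects, or the first non-zero handler result.  On failure *blocker
 * (if non-NULL) names the entry that refused.
 */
int migration_precheck(const struct save_entry *entries, size_t n,
                       const char **blocker)
{
    assert(entries != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (!entries[i].set_params) {
            continue;
        }
        int ret = entries[i].set_params(entries[i].opaque);
        if (ret != 0) {
            if (blocker) {
                *blocker = entries[i].idstr;
            }
            return ret;
        }
    }
    if (blocker) {
        *blocker = NULL;
    }
    return 0;
}

/* Example handler for a device that can always migrate. */
int always_ok(void *opaque)
{
    (void)opaque;
    return 0;
}

/* Example handler: the reason for the NAK lives in the driver itself. */
int nak_if_assigned(void *opaque)
{
    const int *assigned = opaque;
    if (*assigned) {
        fprintf(stderr, "device is assigned to host hardware, cannot migrate\n");
        return -1;
    }
    return 0;
}
```

The caller only sees a return code and the name of the blocking entry,
which is what a monitor command would need to report the failure.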

Alex



