qemu-devel

[Qemu-devel] Re: [PATCH 0/6] Save state error handling (kill off no_migr


From: Alex Williamson
Subject: [Qemu-devel] Re: [PATCH 0/6] Save state error handling (kill off no_migrate)
Date: Tue, 09 Nov 2010 07:58:23 -0700

On Tue, 2010-11-09 at 14:00 +0200, Michael S. Tsirkin wrote:
> On Mon, Nov 08, 2010 at 02:23:37PM -0700, Alex Williamson wrote:
> > On Mon, 2010-11-08 at 22:59 +0200, Michael S. Tsirkin wrote:
> > > On Mon, Nov 08, 2010 at 10:20:46AM -0700, Alex Williamson wrote:
> > > > On Mon, 2010-11-08 at 18:54 +0200, Michael S. Tsirkin wrote:
> > > > > On Mon, Nov 08, 2010 at 07:59:57AM -0700, Alex Williamson wrote:
> > > > > > On Mon, 2010-11-08 at 13:40 +0200, Michael S. Tsirkin wrote:
> > > > > > > On Wed, Oct 06, 2010 at 02:58:57PM -0600, Alex Williamson wrote:
> > > > > > > > > Our code paths for saving or migrating a VM are full of functions that
> > > > > > > > > return void, leaving no opportunity for a device to cancel a migration,
> > > > > > > > > either from error or incompatibility.  The ivshmem driver attempted to
> > > > > > > > > solve this with a no_migrate flag on the save state entry.  I think the
> > > > > > > > > more generic and flexible way to solve this is to allow driver save
> > > > > > > > > functions to fail.  This series implements that and converts ivshmem
> > > > > > > > > to use a set_params function to NAK migration much earlier in the
> > > > > > > > > process.  This touches a lot of files, but the bulk of those changes are
> > > > > > > > > simply s/void/int/ and tacking a "return 0" onto the end of functions.
> > > > > > > > Thanks,
> > > > > > > > 
> > > > > > > > Alex
> > > > > > > 
> > > > > > > Well error handling is always tricky: it seems easier to
> > > > > > > require save handlers to never fail.
> > > > > > 
> > > > > > Sure it's easier, but does that make it robust?
> > > > > 
> > > > > More robust in the face of what kind of failure?
> > > > 
> > > > I really don't understand why we're having a discussion about whether
> > > > providing a means to return an error is a good thing or not.  These
> > > > patches touch a lot of files, but the change is dead simple.
> > > 
> > > I just don't see the motivation. Presumably your patches are
> > > there to achieve some kind of goal, right? I am trying to
> > > figure out what that goal is.
> > 
> > My goal is that I want to be able to NAK a migration when devices are
> > assigned, and I think we can do it more generically than the no_migrate
> > flag so that it supports this application and any other reason that
> > saves might fail in the future.
> 
> More generically but harder to understand and debug, IMO.

How is returning an error condition hard to understand?  Debugging seems
easier to me, especially if drivers follow the precedent set in the last
patch and fprintf the reason for the failure.  Ideally this would be
some kind of push out to qmp, but it still seems easier than figuring
out which driver called register_device_unmigratable().
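
As a rough illustration of that point (this is not the code from the
series; the structure and every demo_* name below are made up for the
example), a save handler that can fail and a caller that propagates the
failure might look like this:

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical stand-in for a save state entry; not QEMU's real structure. */
typedef struct DemoSaveEntry {
    const char *idstr;
    int (*save_state)(void *opaque);   /* 0 on success, negative errno on failure */
    void *opaque;
} DemoSaveEntry;

/* The common case after s/void/int/: nothing can go wrong, so "return 0". */
int demo_save_ok(void *opaque)
{
    (void)opaque;
    return 0;
}

/* A device tied to the host says why it is refusing, then returns an error. */
int demo_save_assigned(void *opaque)
{
    (void)opaque;
    fprintf(stderr, "demo-assign: device is tied to host hardware, cannot migrate\n");
    return -EINVAL;
}

/* Walk the entries, stop at the first failure and hand it to the caller. */
int demo_save_all(const DemoSaveEntry *entries, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        int ret = entries[i].save_state(entries[i].opaque);
        if (ret < 0) {
            fprintf(stderr, "save of '%s' failed: %d\n", entries[i].idstr, ret);
            return ret;
        }
    }
    return 0;
}
```

The caller (monitor command, migration code) gets a return value it can
report, and the message names the device that refused.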

> > > Currently savevm callbacks never fail. So they
> > > return void. Why is returning 0 and adding a bunch of code to test the
> > > condition that never happens a good idea?  It just seems to create more
> > > ways for devices to shoot themselves in the foot.
> > 
> > And more ways to indicate something bad happened and keep running.  We
> > already have far too many abort() calls in the code.
> 
> If you can keep running why can't you migrate?

Well, as you know device assignment is tied to the hardware, so can't
migrate, but can always keep running.  The ivshmem driver has a peer
role, where it's tied to the host memory, so can't migrate, but can keep
running.

> > > > > > > So there's a bunch of code here but what exactly is the benefit?
> > > > > > > Since save handlers have no idea what the remote does,
> > > > > > > what is the compatibility you mention?
> > > > > > 
> > > > > > There are two users I currently have in mind.  ivshmem currently makes
> > > > > > use of register_device_unmigratable() because it makes use of host
> > > > > > specific resources and connections (aiui).  This sets the no_migrate
> > > > > > flag, which is not dynamic and a bit of a band-aid.  The other is
> > > > > > device assignment, which needs a way to NAK a migration since physical
> > > > > > devices are never migratable.
> > > > > 
> > > > > Well since all these can't be migrated ever, a fixed property actually seems
> > > > > a good match.  Sure it's not dynamic but all the easier to debug.
> > > > > 
> > > > > >  I imagine we could at some point have
> > > > > > devices with state tied to other features that can't always be detached
> > > > > > from the host; this tries to provide the infrastructure for that to
> > > > > > happen.
> > > > > > 
> > > > > > Alex
> > > > > 
> > > > > Let guest control whether you can migrate?
> > > > > Sounds like something that is more likely to be abused
> > > > > than used constructively. 
> > > > 
> > > > s/guest/device/  So you would rather the migration failed on the
> > > > incoming side where it may not be detected
> > > 
> > > And incoming migration handlers *must* validate the input, anyway.
> > > We should not plaster over this with checks on outgoing side.
> > 
> > I'm not in any way suggesting incoming shouldn't do validation.
> 
> So that's enough to detect the problem.

No.  Let's say I have a migration source with an assigned device
(rombar=0 to even avoid ramblock migration issues), the migration target
is identical except it doesn't include the assigned device.  pci-assign
on the source can't NAK a migration because save doesn't currently allow
error returns.  The target doesn't even have the driver loaded, so
there's nothing to NAK the load... the migration completes, the device
has disappeared, and we wait for the crash.  Maybe we could assume that the user did
something sane and used pci-assign on the target to match the source,
then we could NAK the load, but only after we wait for the entire memory
state of the guest to be transferred.
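
To make the ordering concrete, here is a minimal sketch of an early
check that runs before any guest memory is sent.  It assumes a
per-device can_migrate hook, which is a name invented for this example
and not the set_params interface from the series; all demo_* names are
hypothetical as well.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical device record with an early "can you migrate?" hook. */
typedef struct DemoDevice {
    const char *name;
    int (*can_migrate)(void);   /* 0 = fine, negative errno = NAK */
} DemoDevice;

int demo_emulated_check(void) { return 0; }
int demo_assigned_check(void) { return -EBUSY; }   /* tied to host hardware */

/* Counts pages "transferred" so the cost of a late failure is visible. */
size_t demo_ram_pages_sent;

void demo_send_ram(size_t pages)
{
    demo_ram_pages_sent += pages;
}

/* Ask every device first; only start the expensive RAM transfer if none object. */
int demo_migrate(const DemoDevice *devs, size_t n, size_t ram_pages)
{
    for (size_t i = 0; i < n; i++) {
        int ret = devs[i].can_migrate();
        if (ret < 0) {
            fprintf(stderr, "%s blocks migration (%d)\n", devs[i].name, ret);
            return ret;   /* NAK on the source, before any memory is sent */
        }
    }
    demo_send_ram(ram_pages);
    return 0;
}
```

With a check like this on the source, the refusal does not depend on
the target having the driver loaded, and it costs nothing in
transferred memory.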

> > > > or it may be detected too
> > > > late to stop the migration?
> > > > 
> > > > Alex
> > > 
> > > So there's a bug and device is in an unexpected state.
> > > What can we do? Assert, print an error, notify guest - all these
> > > come to mind. But stop migration? Seems arbitrary.
> > 
> > Perhaps the problem is that either an assert or an fprintf are the first
> > things that come to mind.  We shouldn't have guests randomly blowing up
> > or telling users to go scan through their log files to find errors.
> > It's not very hard to allow simple error handling, so why shouldn't our
> > first plan of attack be to return an error so that the human/qmp monitor
> > can detect it and inform the user.  For the current candidates for this
> > interface, there's no point notifying the guest, it's the interface
> > attempting to do the migration that needs to know there's something
> > blocking it.
> > 
> > Alex
> 
> I still don't understand, I am sorry.  When will migration fail?
> Assigned devices always fail migration so it's not a good example.

Seems like the perfect example, especially in the scenario above where
load failure is insufficient.  This is why the no_migrate flag was
introduced and why ivshmem makes use of it today.  This series starts
from the assumption that we need a way to NAK a migration, and asks
whether we can do it better than the no_migrate flag: generically, and
as early as possible in the process.

Alex




