From: Alex Williamson
Subject: [Qemu-devel] Re: [PATCH 0/6] Save state error handling (kill off no_migrate)
Date: Tue, 09 Nov 2010 10:44:06 -0700

On Tue, 2010-11-09 at 18:49 +0200, Michael S. Tsirkin wrote:
> On Tue, Nov 09, 2010 at 09:30:45AM -0700, Alex Williamson wrote:
> > On Tue, 2010-11-09 at 18:15 +0200, Michael S. Tsirkin wrote:
> > > On Tue, Nov 09, 2010 at 08:47:00AM -0700, Alex Williamson wrote:
> > > > > > But it could.  What if ivshmem is acting in a peer role, but has no
> > > > > > clients; could it migrate?  What if ivshmem is migratable when the
> > > > > > migration begins, but while the migration continues, a connection is
> > > > > > set up and it becomes unmigratable?
> > > > > 
> > > > > Sounds like something we should work to prevent, not support :)
> > > > 
> > > > s/:)/:(/  why?
> > > 
> > > It will just confuse everyone. Also if it happens after sending
> > > all of memory, it's pretty painful.
> > 
> > It happens after sending all of memory with no_migrate, and I think
> > pushing that earlier might introduce some races around when
> > register_device_unmigratable() can be called.
> 
> Good point. I guess we could check it twice just to speed things up.
> 
> > > > > >  Using this series, ivshmem would
> > > > > > have multiple options for how to support this.  It could a) NAK
> > > > > > the migration, b) drop connections and prevent new connections
> > > > > > until the migration finishes, or c) detect that new connections
> > > > > > have happened since the migration started and cancel.  And
> > > > > > probably more.  no_migrate can only do a).  And in fact, we can
> > > > > > only test no_migrate after the VM is stopped (after all memory
> > > > > > is migrated), because otherwise it could race with devices
> > > > > > setting no_migrate during migration.
> > > > > 
> > > > > We really want no_migrate to be static.  Changing it is abusing
> > > > > the infrastructure.
> > > > 
> > > > You call it abusing, I call it making use of the infrastructure.  Why
> > > > unnecessarily restrict ourselves?  Is return 0/-1 really that scary,
> > > > unmaintainable, undebuggable?  I don't understand the resistance.
> > > > 
> > > > Alex
> > > 
> > > Management really does not know how to handle unexpected
> > > migration failures. They must be avoided.
> > > 
> > > There are some very special cases that fail migration. They are
> > > currently easy to find with grep register_device_unmigratable.
> > > I prefer to keep it that way.
> > 
> > How can management tools be improved to better handle unexpected
> > migration failures when the only way for qemu to fail is an abort?
> > We need the infrastructure to at least return an error first.  Do we
> > just need to add some fprintfs to the save core to print the id string
> > of the device that failed to save?  I just can't buy "the code is
> > easier to grep" as an argument against adding better error handling
> > to the save code path.
> 
> I just don't buy the 'we'll return meaningless error codes at random
> points in time and management will figure it out' as an argument :)

Why is the error code meaningless?  The error code stops the migration
in qemu and hopefully prints an error message (we could easily add an
fprintf to the core save code to ensure we know the device responsible
for the NAK).  From there we can figure out how to return the error to
monitors and management tools, but we need to have a way to know there's
an error first.

> >  Anyone else want to chime in?
> > 
> > Alex
> 
> Maybe try coding up some user using the new infrastructure to do
> something useful, that register_device_unmigratable can't do.

With the number of people I hear complaining about how qemu has too many
aborts, no error checking, and no way to return errors, I'm a little
dumbfounded that there's such a roadblock to actually adding some simple
error handling.  Is it the error handling you're opposed to, or the way
I'm using it to NAK a migration?

Alex
