Re: [Qemu-devel] [PATCH] migration/fd: abort migration if receive POLLHUP event
From: Peter Xu
Subject: Re: [Qemu-devel] [PATCH] migration/fd: abort migration if receive POLLHUP event
Date: Wed, 25 Apr 2018 11:14:23 +0800
User-agent: Mutt/1.9.1 (2017-09-22)

On Tue, Apr 24, 2018 at 07:24:05PM +0100, Daniel P. Berrangé wrote:
> On Tue, Apr 24, 2018 at 06:16:31PM +0100, Dr. David Alan Gilbert wrote:
> > * Wang Xin (address@hidden) wrote:
> > > If the fd socket peer closed shortly, ppoll may receive a POLLHUP
> > > event before the expected POLLIN event, and qemu will do nothing
> > > but goes into an infinite loop of the POLLHUP event.
> > >
> > > So, abort the migration if we receive a POLLHUP event.
> >
> > Hi Wang Xin,
> > Can you explain how you manage to trigger this case; I've not hit it.
> >
> > > Signed-off-by: Wang Xin <address@hidden>
> > >
> > > diff --git a/migration/fd.c b/migration/fd.c
> > > index cd06182..5932c87 100644
> > > --- a/migration/fd.c
> > > +++ b/migration/fd.c
> > > @@ -15,6 +15,7 @@
> > > */
> > >
> > > #include "qemu/osdep.h"
> > > +#include "qemu/error-report.h"
> > > #include "channel.h"
> > > #include "fd.h"
> > > #include "monitor/monitor.h"
> > > @@ -46,6 +47,11 @@ static gboolean
> > > fd_accept_incoming_migration(QIOChannel *ioc,
> > > GIOCondition condition,
> > > gpointer opaque)
> > > {
> > > + if (condition & G_IO_HUP) {
> > > + error_report("The migration peer closed, job abort");
> > > + exit(EXIT_FAILURE);
> > > + }
> > > +
> >
> > OK, I wish we had a nicer way for failing; especially for the
> > multifd/postcopy recovery worlds where one failed connection might not
> > be fatal; but I don't see how to do that here.
>
> This doesn't feel right to me.
>
> We have passed in a pre-opened FD to QEMU, and we registered a watch
> on it to detect when there is data from the src QEMU that is available
> to read. Normally the src will have sent something so we'll get G_IO_IN,
> but you're suggesting the client has quit immediately, so we're getting
> G_IO_HUP due to end of file.
>
> The migration_channel_process_incoming() method that we pass the ioc
> object to will be calling qio_channel_read(ioc) somewhere to try to
> read that data.
>
> For QEMU to spin in infinite loop there must be code in the
> migration_channel_process_incoming() that is ignoring the return
> value of qio_channel_read() in some manner causing it to retry
> the read again & again I presume.
>
> Putting this check for G_IO_HUP fixes your immediate problem scenario,
> but whatever code was spinning in infinite loop is still broken and
> I'd guess it was possible to still trigger the loop. eg by writing
> 1 single byte and then closing the socket.
>
> So, IMHO this fix is wrong - we need to find the root cause and fix
> that, not try to avoid calling the buggy code.

I agree. AFAIU the first read should be in qemu_loadvm_state():

    v = qemu_get_be32(f);
    if (v != QEMU_VM_FILE_MAGIC) {
        error_report("Not a migration stream");
        return -EINVAL;
    }

So I would be more curious about how that infinite loop happened.
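
To illustrate the behaviour we would expect from the read path, here is a
minimal standalone sketch. It is not QEMU code: demo_read_all(),
demo_check_magic() and DEMO_MAGIC are names made up for this example, and
it uses plain POSIX poll()/read() on a blocking fd rather than QIOChannel.
The point is only that a reader which checks the return value of read()
terminates on EOF, whether the peer closed immediately or after sending a
single byte, so no special G_IO_HUP check is needed.

```c
#include <assert.h>
#include <errno.h>
#include <poll.h>
#include <stdint.h>
#include <unistd.h>

/* Illustrative value only; it stands in for QEMU_VM_FILE_MAGIC. */
#define DEMO_MAGIC 0x5145564dU

/*
 * Hypothetical helper: read exactly len bytes from a blocking fd.
 * Returns 0 on success, -1 on error or if EOF arrives first.
 */
static int demo_read_all(int fd, void *buf, size_t len)
{
    unsigned char *p = buf;

    assert(buf != NULL);
    while (len > 0) {
        ssize_t n = read(fd, p, len);
        if (n < 0) {
            if (errno == EINTR) {
                continue;       /* interrupted, retry */
            }
            return -1;          /* real I/O error */
        }
        if (n == 0) {
            return -1;          /* peer closed: stop, do not retry */
        }
        p += n;
        len -= (size_t)n;
    }
    return 0;
}

/*
 * Wait for the fd to become ready, then check a 4-byte big-endian magic.
 * Returns 0 if the magic matches, -1 otherwise.
 */
static int demo_check_magic(int fd)
{
    struct pollfd pfd = { .fd = fd, .events = POLLIN };
    unsigned char b[4];
    uint32_t v;
    int rc;

    do {
        rc = poll(&pfd, 1, -1);
    } while (rc < 0 && errno == EINTR);
    if (rc < 0) {
        return -1;
    }

    /* POLLHUP may be reported with or without POLLIN; let read() decide. */
    if (demo_read_all(fd, b, sizeof(b)) < 0) {
        return -1;
    }

    v = ((uint32_t)b[0] << 24) | ((uint32_t)b[1] << 16) |
        ((uint32_t)b[2] << 8) | (uint32_t)b[3];
    return v == DEMO_MAGIC ? 0 : -1;
}
```

With a reader like this, both an immediate close and the "write one byte,
then close" case end in a clean failure instead of a spin, which is why
finding where the real code keeps retrying matters more than filtering
the poll event.
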
--
Peter Xu