[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
Re: [PATCH v8 17/20] multi-process: heartbeat messages to remote
From: |
Elena Ufimtseva |
Subject: |
Re: [PATCH v8 17/20] multi-process: heartbeat messages to remote |
Date: |
Fri, 14 Aug 2020 16:01:47 -0700 |
On Tue, Aug 11, 2020 at 03:41:30PM +0100, Stefan Hajnoczi wrote:
> On Fri, Jul 31, 2020 at 02:20:24PM -0400, Jagannathan Raman wrote:
> > @@ -343,3 +349,49 @@ static void probe_pci_info(PCIDevice *dev, Error
> > **errp)
> > }
> > }
> > }
> > +
> > +static void hb_msg(PCIProxyDev *dev)
> > +{
> > + DeviceState *ds = DEVICE(dev);
> > + Error *local_err = NULL;
> > + MPQemuMsg msg = { 0 };
> > +
> > + msg.cmd = PROXY_PING;
> > + msg.bytestream = 0;
> > + msg.size = 0;
> > +
> > + (void)mpqemu_msg_send_and_await_reply(&msg, dev->ioc, &local_err);
> > + if (local_err) {
> > + error_report_err(local_err);
> > + qio_channel_close(dev->ioc, &local_err);
> > + error_setg(&error_fatal, "Lost contact with device %s", ds->id);
> > + }
> > +}
>
> Here is my feedback from the last revision. Was this addressed?
>
Hi Stefan,
Thank you for reviewing the patchset. In this version we decided to
shutdown the guest when the heartbeat did not get a reply from the
remote by setting the error_fatal.
Should we approach it differently or you prefer us to get rid of the
heartbeat in this form?
Thank you,
Elena
> This patch seems incomplete since no action is taken when the device
> fails to respond. vCPU threads that access the device will still get
> stuck.
>
> The simplest way to make this useful is to close the connection when a
> timeout occurs. Then the G_IO_HUP handler for the UNIX domain socket
> should perform connection cleanup. At that point there are a few
> choices:
>
> 1. Stop guest execution and wait for the host admin to restore the
> mplink so execution can resume. This is similar to how -drive
> rerror=stop pauses the guest when a disk I/O error is encountered.
>
> 2. Stop guest execution but defer it until this stale device is actually
> accessed. This maximizes guest uptime. Guests that rarely access the
> device may not notice at all.
>
> 3. Return 0 from MemoryRegion read operations and ignore writes. The
> guest continues executing but the device is broken. This is risky
> because device drivers inside the guest may not be ready to deal with
> this. The result could be data loss or corruption.
>
> 4. Raise a bus-level event. Maybe PCI error reporting can be used to
> offline the device.
>
> 5. Terminate the guest with an error message.
>
> 6. ?
>
> Until the heartbeat is fully implemented and tested I suggest dropping
> it from this patch series. Remember the G_IO_HUP will happen anyway if
> the remote device process terminates.