Re: [Qemu-devel] [RFC 00/29] Migration: postcopy failure recovery
From: Peter Xu
Subject: Re: [Qemu-devel] [RFC 00/29] Migration: postcopy failure recovery
Date: Mon, 21 Aug 2017 15:47:44 +0800
User-agent: Mutt/1.5.24 (2015-08-30)

On Thu, Aug 03, 2017 at 04:57:54PM +0100, Dr. David Alan Gilbert wrote:
> * Peter Xu (address@hidden) wrote:
> > As we all know, postcopy migration carries the risk of losing the
> > VM if the network breaks during the migration. This series tries
> > to solve the problem by allowing the migration to pause at the
> > failure point and to recover after the link is reconnected.
> >
> > There was existing work on this issue from Md Haris Iqbal:
> >
> > https://lists.nongnu.org/archive/html/qemu-devel/2016-08/msg03468.html
> >
> > This series is a complete rework of the issue, based on Alexey
> > Perevalov's recved bitmap v8 series:
> >
> > https://lists.gnu.org/archive/html/qemu-devel/2017-07/msg06401.html
>
>
> Hi Peter,
> See my comments on the individual patches; but at a top level I think
> it looks pretty good.
>
> I still worry about two related things; one of them, I see, is
> similar to what you discussed with Dan.
>
> 1) What happens if we end up hanging on a missing page with the BQL
> taken and can't use the monitor?
> Checking my notes from when I was chatting to Haris last year,
> 'info cpu' was pretty good at triggering this because it needed the
> vcpus to come out of their loops, so if any vcpu was blocked on
> memory we'd block waiting. The other case is where an emulated IO
> device accesses the missing page, and that's easiest to trigger by
> doing a migrate with inbound network traffic.
> In this case, will your 'accept' still work?

It will not work.

To solve this problem, I posted the series:

  [RFC 0/6] monitor: allow per-monitor thread

Let's see whether that is acceptable.

>
> 2) Similar to Dan's question of what happens if the network just hangs
> as opposed to giving an error; it should eventually sort itself out
> with TCP timeouts - eventually. Perhaps the easiest way to test this
> is just to add an iptables -j DROP rule for the migration port - it's
> probably easier to trigger (1).

Yeah, so I think I'll just avoid considering this for now. Thanks,

--
Peter Xu
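
P.S. For anyone who wants to try the hang case from (2), here is a
minimal sketch of the iptables rules involved. It is only an
illustration: the port number 4444 is made up (nothing in this thread
names the migration port), and I am assuming the rule is applied on the
destination host's INPUT chain. The script only builds and prints the
commands; it does not run them, since that needs root.

```python
import shlex

# Hypothetical migration port; replace with the port the destination
# QEMU is actually listening on.
MIGRATION_PORT = 4444


def drop_rule(port, action="insert", chain="INPUT"):
    """Build the iptables argv that adds or removes a DROP rule for a TCP port.

    action="insert" silently drops packets to the port, so the migration
    stream hangs instead of failing with an error; action="delete"
    removes the same rule again to simulate the link coming back.
    """
    flags = {"insert": "-I", "delete": "-D"}
    if action not in flags:
        raise ValueError("action must be 'insert' or 'delete'")
    if not 0 < port < 65536:
        raise ValueError("port out of range")
    return ["iptables", flags[action], chain,
            "-p", "tcp", "--dport", str(port), "-j", "DROP"]


if __name__ == "__main__":
    # Print the two commands so they can be reviewed and run by hand.
    for action in ("insert", "delete"):
        print(shlex.join(drop_rule(MIGRATION_PORT, action)))
```

Insert the rule once postcopy has started, and delete it when you want
the link to "come back".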