Re: [Qemu-devel] [RFC 00/29] Migration: postcopy failure recovery


From: Peter Xu
Subject: Re: [Qemu-devel] [RFC 00/29] Migration: postcopy failure recovery
Date: Mon, 21 Aug 2017 15:47:44 +0800
User-agent: Mutt/1.5.24 (2015-08-30)

On Thu, Aug 03, 2017 at 04:57:54PM +0100, Dr. David Alan Gilbert wrote:
> * Peter Xu (address@hidden) wrote:
> > As we all know, postcopy migration carries a risk of losing the VM
> > if the network breaks during the migration. This series tries to
> > solve the problem by allowing the migration to pause at the failure
> > point and to recover after the link is reconnected.
> > 
> > There was existing work on this issue from Md Haris Iqbal:
> > 
> > https://lists.nongnu.org/archive/html/qemu-devel/2016-08/msg03468.html
> > 
> > This series is a total rework of the issue, based on Alexey
> > Perevalov's recved bitmap v8 series:
> > 
> > https://lists.gnu.org/archive/html/qemu-devel/2017-07/msg06401.html
> 
> 
> Hi Peter,
>   See my comments on the individual patches; but at a top level I think
> it looks pretty good.
> 
>   I still worry about two related things; one, I see, is similar to
> what you discussed with Dan.
> 
>   1) What happens if we end up hanging on a missing page with the BQL
>   taken and can't use the monitor?
>   Checking my notes from when I was chatting to Haris last year,
>   'info cpus' was pretty good at triggering this because it needed the
>   vcpus to come out of their loops, so if any vcpu was blocked on
>   memory we'd block waiting.  The other case is where an emulated IO
>   device accesses the missing page, and that's easiest to trigger by
>   doing a migration with inbound network traffic.
>   In this case, will your 'accept' still work?

It will not work.

To solve this problem, I posted the series:

  [RFC 0/6] monitor: allow per-monitor thread

Let's see whether that is acceptable.

> 
>   2) Similar to Dan's question of what happens if the network just hangs
>   as opposed to giving an error;  it should eventually sort itself out
>   with TCP timeouts - eventually.  Perhaps the easiest way to test this
>   is just to add an iptables -j DROP rule for the migration port - it's
>   probably easier to trigger (1).
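
For reference, a minimal sketch of the iptables test suggested above,
assuming the destination listens for migration on TCP port 4444 (a
hypothetical value, not taken from this thread).  It only prints the
commands rather than running them, since the real ones need root on
the destination host:

```shell
#!/bin/sh
# Hypothetical migration port; use whatever port the destination's
# -incoming option actually listens on.
MIGRATION_PORT=4444

# Print the iptables command for the given action:
#   -A appends the DROP rule (packets vanish silently, so the link hangs
#      instead of returning an error),
#   -D deletes the same rule again so the link can recover.
drop_rule() {
    echo "iptables $1 INPUT -p tcp --dport $MIGRATION_PORT -j DROP"
}

drop_rule -A
drop_rule -D
```

Running the printed -A command as root during the postcopy phase should
make the migration stream hang until a TCP timeout fires; the -D command
removes the rule afterwards.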

Yeah, so I think I'll just avoid considering this for now.  Thanks,

-- 
Peter Xu


