qemu-devel

Re: [Qemu-devel] When does live migration give up?


From: Paolo Bonzini
Subject: Re: [Qemu-devel] When does live migration give up?
Date: Wed, 04 Sep 2013 20:34:27 +0200
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:17.0) Gecko/20130805 Thunderbird/17.0.8

On 04/09/2013 20:05, Alex Bligh wrote:
> Paolo,
> 
> --On 4 September 2013 19:07:53 +0200 Paolo Bonzini <address@hidden>
> wrote:
> 
>> On 04/09/2013 17:24, Alex Bligh wrote:
>>> We have seen a situation when migrating about 50 VMs at once where some
>>> of them fail. I think this is because they are dirtying pages faster
>>> than they can be transmitted.
>>
>> No, migration never "gives up".  It may never converge, but it keeps
>> trying until cancelled.
>>
>> Could it be that you are choosing migration server ports from a small
>> range, and some of them are failing because two migrations pick the same
>> random port for the destination (which is where the server socket lies)?
> 
> Should not be that. We create FDs (which are sockets) and pass them in at
> both ends.

Do you mean something like this?

   destination
      socket()
      bind() to { sin_port = 0, sin_addr.s_addr = INADDR_ANY }
      listen()
      getsockname()
      send address to source
      accept()
      start QEMU with file descriptor returned by accept

   source
      read address
      socket()
      connect()
      pass socket file descriptor to QEMU and migrate to it

Anything that doesn't use sin_port = 0 and getsockname() is prone to
race conditions.
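The handshake outlined above can be sketched in Python, with both ends in one process purely for illustration. This is a minimal sketch, not QEMU code: the names `destination_listen` and `migration_channel` are hypothetical, the out-of-band step that sends the port to the source is reduced to a local variable, and handing the resulting descriptors to QEMU is not shown. Note that with INADDR_ANY, getsockname() reports 0.0.0.0, so only the port it returns is meaningful; the destination's routable address has to be known some other way.

```python
import socket


def destination_listen():
    """Destination side: socket(), bind() to port 0, listen(), getsockname()."""
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # Port 0 lets the kernel pick a free ephemeral port atomically, so two
    # concurrent migrations can never be handed the same port.
    srv.bind(("0.0.0.0", 0))  # INADDR_ANY
    srv.listen(1)
    # With INADDR_ANY the reported host is 0.0.0.0; only the port is useful.
    _host, port = srv.getsockname()
    return srv, port


def migration_channel(dest_host="127.0.0.1"):
    """Set up one source->destination connection the race-free way.

    Returns (port, source_socket, destination_socket). In a real setup the
    two sockets live on different hosts, and each side would pass its
    descriptor (sock.fileno()) to its own QEMU process.
    """
    srv, port = destination_listen()
    # "send address to source": a hypothetical out-of-band step, here just
    # a local variable. The source then does socket() + connect().
    src = socket.create_connection((dest_host, port))
    # The connection is already queued on the listen backlog, so accept()
    # returns immediately even though both ends run in one thread.
    dest, _peer = srv.accept()
    srv.close()  # the listener is no longer needed once accepted
    return port, src, dest
```

Because the kernel assigns the port at bind() time and the listening socket holds it, fifty such channels set up in parallel cannot collide, unlike a scheme that picks a random port and hopes it is still free when QEMU binds it.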

> Approx 10% of migrations die after many minutes on the
> customer's platform. This does not appear to happen if migrations are
> not carried out 50 at a time.

Dying after many minutes usually means that the destination is not set
up the same as the source, as you said below.
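The difference between "fails" and "never converges" can be illustrated with a deliberately simplified model of pre-copy migration. This is not QEMU's algorithm: it assumes a constant guest dirty rate, a fixed bandwidth share per migration, and a stop-and-copy threshold `max_downtime_s`; the function name and all numbers are hypothetical.

```python
def precopy_rounds(ram_mb, dirty_mb_s, bw_mb_s, max_downtime_s, max_rounds=1000):
    """Toy model of iterative pre-copy live migration.

    Round 1 sends all of RAM; every later round resends the pages the guest
    dirtied while the previous round was on the wire. Once the remainder can
    be sent within max_downtime_s, the guest is paused for the final copy.
    Returns the round at which that happens, or None if it has not happened
    after max_rounds (a real migration would just keep iterating).
    """
    remaining = ram_mb
    for rnd in range(1, max_rounds + 1):
        round_time = remaining / bw_mb_s
        if round_time <= max_downtime_s:
            return rnd  # converged: small enough for stop-and-copy
        # Pages dirtied during this round; can never exceed total RAM.
        remaining = min(ram_mb, dirty_mb_s * round_time)
    return None
```

In this model a 1024 MB guest dirtying 30 MB/s converges in 3 rounds with 1000 MB/s to itself, via `precopy_rounds(1024, 30, 1000, 0.03)`, but returns None when the same link is split 50 ways, via `precopy_rounds(1024, 30, 1000 / 50, 0.03)`, because each round then dirties more than it sends. Non-convergence shows up as a migration that stays active indefinitely, not as one that fails, which is why a migration that actually dies after many minutes points elsewhere.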

Paolo

> We appear to be getting something other than 'ms' returned through the
> monitoring system. Unhelpfully what that is is not logged.
> 
> Is there anything (apart from the socket closing prematurely) which can
> cause a failed migration after many minutes? We've seen problems where
> the destination is not set up the same as the source (e.g. different
> numbers of NICs) but IIRC that fails much earlier.
> 
> To make things easier (cough), this is qemu 1.0 (as shipped with Ubuntu
> Precise).
> 
