
Re: [Qemu-devel] Stalls on Live Migration of VMs with a lot of memory


From: Peter Lieven
Subject: Re: [Qemu-devel] Stalls on Live Migration of VMs with a lot of memory
Date: Wed, 04 Jan 2012 14:08:18 +0100
User-agent: Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.9.2.23) Gecko/20110921 Thunderbird/3.1.15

On 04.01.2012 13:28, Paolo Bonzini wrote:
> On 01/04/2012 12:42 PM, Peter Lieven wrote:

>> OK, then I misunderstood the RAM blocks thing; I thought the guest RAM
>> would consist of a collection of RAM blocks.
>> Let me put it differently, then: would it make sense to process bigger
>> portions of memory (e.g. 1 MB) in stage 2, so as to reduce the number
>> of calls to cpu_physical_memory_reset_dirty by running it on bigger
>> portions of memory? We might lose a few dirty pages, but they would be
>> tracked in the next iteration of stage 2, or in stage 3 at the latest.
>> What would be necessary is that nobody marks a page dirty while I copy
>> the dirty information for the portion of memory I want to process.
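
(For illustration only: a minimal, self-contained C sketch of the batching
idea quoted above, i.e. snapshot the dirty bits of a whole 1 MiB chunk,
clear them in one pass, and then send the pages that were dirty in the
snapshot. This is a toy model with invented names (dirty[], send_page,
process_chunk), not the actual QEMU dirty-tracking code.)

/* Toy model of the proposed batching: snapshot and clear the dirty bits
 * for a whole 1 MiB chunk in one pass instead of once per 4 KiB page.
 * All names are invented for this example; this is not QEMU code. */
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define PAGE_SIZE        4096
#define CHUNK_SIZE       (1024 * 1024)
#define PAGES_PER_CHUNK  (CHUNK_SIZE / PAGE_SIZE)
#define TOTAL_PAGES      (256 * PAGES_PER_CHUNK)    /* 256 MiB toy guest */

static uint8_t dirty[TOTAL_PAGES];                  /* one byte per page */

static void send_page(size_t page)                  /* stand-in for the  */
{                                                   /* migration stream  */
    (void)page;
}

static void process_chunk(size_t first_page)
{
    uint8_t snapshot[PAGES_PER_CHUNK];

    memcpy(snapshot, &dirty[first_page], PAGES_PER_CHUNK);  /* copy once  */
    memset(&dirty[first_page], 0, PAGES_PER_CHUNK);         /* reset once */

    for (size_t i = 0; i < PAGES_PER_CHUNK; i++) {
        if (snapshot[i]) {
            send_page(first_page + i);  /* pages dirtied after the snapshot
                                           are caught in a later pass      */
        }
    }
}

int main(void)
{
    dirty[3] = dirty[7] = 1;                  /* pretend the guest wrote */
    for (size_t p = 0; p < TOTAL_PAGES; p += PAGES_PER_CHUNK) {
        process_chunk(p);
    }
    return 0;
}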

> Dirty memory tracking is done by the hypervisor and must be done at page granularity.
OK, so this is unfortunately not an option.

Thus my only option at the moment is to limit the runtime of the while
loop in stage 2. Or are there any post-1.0 patches in git that might
already help?

I tried limiting it to migrate_max_downtime(), and that at least
resolves the VM stalls. However, with that limit the migration speed is
quite low (approx. 80 MB/s on a 10G link).
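
(Purely as a sketch of this kind of time limit: bound one stage-2
iteration by the maximum allowed downtime so control returns to the
guest once the budget is used up. The 30 ms figure, ram_save_iteration
and send_next_dirty_page are placeholders, not QEMU's actual
implementation; only the name migrate_max_downtime() comes from the
discussion above.)

/* Sketch only: cap one stage-2 iteration at roughly the configured
 * maximum downtime so the guest is never blocked longer than that.
 * migrate_max_downtime() and send_next_dirty_page() are stand-ins. */
#include <stdbool.h>
#include <stdint.h>
#include <time.h>

static int64_t now_ns(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (int64_t)ts.tv_sec * 1000000000LL + ts.tv_nsec;
}

static int64_t migrate_max_downtime(void)
{
    return 30 * 1000000LL;              /* assumed 30 ms budget, in ns */
}

static bool send_next_dirty_page(void)
{
    return false;                       /* placeholder: nothing to send */
}

/* One stage-2 iteration: send dirty pages until none are left or the
 * time budget is exhausted, then yield back so the guest can run. */
static void ram_save_iteration(void)
{
    int64_t start = now_ns();

    while (send_next_dirty_page()) {
        if (now_ns() - start >= migrate_max_downtime()) {
            break;
        }
    }
}

int main(void)
{
    ram_save_iteration();
    return 0;
}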



>> - in stage 3 the VM is stopped, right? So there can't be any more
>> dirty blocks after scanning the whole memory once?

> No, stage 3 is entered when there are very few dirty memory pages
> remaining. This may happen after scanning the whole memory many
> times. It may even never happen if migration does not converge
> because of low bandwidth or too strict downtime requirements.
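
(To make the convergence criterion concrete: stage 3 can only start once
the data that is still dirty could be transferred within the allowed
downtime at the observed bandwidth. The function ready_for_stage3 and the
numbers below are illustrative assumptions, not taken from the QEMU
source.)

/* Sketch of the convergence test described above: stage 3 starts only
 * when the remaining dirty memory is small enough to be sent within the
 * allowed downtime at the observed bandwidth. */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Decide whether to stop the guest and enter stage 3. */
static bool ready_for_stage3(uint64_t dirty_bytes,      /* still dirty      */
                             double bandwidth_bytes_s,  /* measured rate    */
                             double max_downtime_s)     /* allowed pause    */
{
    double expected_downtime_s = dirty_bytes / bandwidth_bytes_s;
    return expected_downtime_s <= max_downtime_s;
}

int main(void)
{
    /* 512 MiB still dirty on a ~10 Gbit/s link (~1.25 GB/s), 30 ms budget:
     * far from converged, so stage 2 keeps iterating. */
    printf("%d\n", ready_for_stage3(512ULL << 20, 1.25e9, 0.030));

    /* 16 MiB left: fits in the budget, so stage 3 can start. */
    printf("%d\n", ready_for_stage3(16ULL << 20, 1.25e9, 0.030));
    return 0;
}

If the guest keeps dirtying memory faster than the link can drain it,
the amount of dirty data never shrinks below this threshold and, as
noted above, stage 3 is never reached.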

>> OK, is there a chance that I lose one final page if it is modified
>> just after I have walked over it and I found no other dirty page
>> (so bytes_sent = 0)?

> No, of course not. Stage 3 will send all missing pages while the VM is
> stopped. There is a chance that the guest will go crazy and start
> touching lots of pages at exactly the wrong time, and thus the downtime
> will be longer than expected. However, that's a necessary evil; if you
> cannot accept that, post-copy migration would provide a completely
> different set of tradeoffs.
I don't suffer from long downtimes in stage 3; my issue is in stage 2.

> (BTW, bytes_sent = 0 is very rare).
I know, but when the VM is stopped there is no issue. I had simply understood your "No, stage 3 is entered ..." wrong ;-)

Thanks for your help and the explanations.

Peter

> Paolo



