Re: [Qemu-devel] [RFC] qmp interface for save vmstate to image


From: Wenchao Xia
Subject: Re: [Qemu-devel] [RFC] qmp interface for save vmstate to image
Date: Wed, 27 Mar 2013 11:35:24 +0800
User-agent: Mozilla/5.0 (Windows NT 5.1; rv:17.0) Gecko/20130307 Thunderbird/17.0.4

  With some deeper thinking, I'd like to share some more analysis:
Saving vmstate amounts to snapshotting memory. In theory the possible
methods boil down to:
1 Get a mirror of the memory at the moment the "snapshot" request is
issued, by having the kernel COW that region.
2 Get a mirror of it by gradually copying the region out, completing
once the clone is in sync with the original region; this is basically
how migration works.

  Taking a closer look:
1 COW the memory region:
Saving: block I/O and CPU, since nothing has to be copied twice.
Sacrifice: memory.
Industry improvement solution: NUMA; price: expensive.
Implementation: hard, needs quite some work.
Qemu code maintenance: easy.
Detail:
  This method is the closest to the real meaning of "snapshot", but it
carries a hidden requirement: reserved memory. On a server in real use
today it is unlikely that a large amount of memory is kept in reserve:
for example, a server with 4G of RAM will quite possibly run a 3.5G
guest, to get the benefits of easy deployment, hardware independence,
and whole-machine backup/restore. In that case there is not enough
memory left to take the snapshot. Take another, even more likely
example: a 4G server running two 1.5G guests; here one guest would
have to be migrated away first, which is obviously bad. A much better
solution is to add memory at snapshot time; to do that economically
and without hot-plugging hardware, it needs NUMA plus memory sharing:

Host1    Host2    Host3
|  |     |  |     |  |
|  mem   | mem    |  mem
|        |        |
|------------------
         |
      shared mem

  Several hosts share a pool of memory for snapshotting: a host takes
memory from the pool when doing a snapshot and returns it to the
cluster manager when the snapshot completes. This is feasible on
expensive architectures, but hard to do on x86, which prides itself on
being cheap.
  One unrelated question that occurred to me: does qemu support
migrating to a host device? If not, it should support migrating to a
block device of fixed size (different from a snapshot, where the two
mirrors need to stay in sync); when shared memory is present, guests
could then be migrated to a RAM block device quickly.

Implementation detail:
  It should be done by adding a kernel API, say mem_snapshot(), with
which the kernel can COW a region and write the snapshotted pages out
to the far slower shared memory (if that logic is added as an
optimization). fork() can do the COW part, but it brings a lot of
trouble and would not benefit from the NUMA architecture, since it
cannot move the snapshotted pages to the slower memory.
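
  For illustration only, here is a minimal C sketch of the fork()-based
approach mentioned above. pause_vm(), resume_vm() and dump_guest_ram()
are hypothetical placeholders, not existing qemu functions; real code
would use vm_stop()/vm_start() and walk the RAM blocks itself:

/* Sketch of a fork()-based copy-on-write vmstate dump.
 * The three extern helpers are hypothetical, used only to keep the
 * sketch short. */
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

extern void pause_vm(void);                  /* stop guest CPUs (hypothetical) */
extern void resume_vm(void);                 /* restart guest CPUs (hypothetical) */
extern int dump_guest_ram(const char *path); /* write RAM out (hypothetical) */

int snapshot_ram_with_fork(const char *path)
{
    pid_t pid;
    int status;

    pause_vm();          /* make the memory image consistent */
    pid = fork();        /* child gets a COW copy of all guest RAM */
    if (pid < 0) {
        resume_vm();
        return -1;
    }
    if (pid == 0) {
        /* Child: guest RAM is frozen here by COW; write it out slowly. */
        _exit(dump_guest_ram(path) == 0 ? 0 : 1);
    }
    /* Parent: the guest runs again right away; every page it dirties
     * while the child is still writing gets a COW copy from the
     * kernel, which is exactly where the extra memory cost comes from.
     * A real implementation would reap the child asynchronously
     * instead of blocking here. */
    resume_vm();
    waitpid(pid, &status, 0);
    return (WIFEXITED(status) && WEXITSTATUS(status) == 0) ? 0 : -1;
}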

2 Gradually copy out and sync the memory region; there are two ways
to do it:
2.1 Migrate to a block device (migrate to an fd, or migrate to an
image):
Saving: memory.
Sacrifice: CPU, block I/O.
Industry improvement solution: flash disk, cheap.
Implementation: easy, based on migration.
Qemu code maintenance: easy.
Detail:
  This is the relatively easy case; we just need to make the stream
size fixed. And flash disks are readily available on the x86
architecture. A sketch of the fd-based plumbing is shown below.
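
  As a rough illustration of what migrate-to-fd already allows, the C
sketch below opens a pre-sized block device and hands its file
descriptor to qemu through the existing QMP getfd command (the fd is
carried as SCM_RIGHTS ancillary data on the QMP unix socket); a
following migrate command with uri "fd:migfd" then writes the stream
straight into the device. The socket path /tmp/qmp.sock, the device
/dev/sdX and the connect_unix() helper are assumptions made for the
example, not part of qemu:

#include <fcntl.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>
#include <unistd.h>

extern int connect_unix(const char *path); /* hypothetical: connect + QMP handshake */

/* Send a QMP command with one file descriptor attached as SCM_RIGHTS. */
static int send_cmd_with_fd(int sock, int fd, const char *json)
{
    struct iovec iov = { .iov_base = (void *)json, .iov_len = strlen(json) };
    char cbuf[CMSG_SPACE(sizeof(int))];
    struct msghdr msg = { 0 };
    struct cmsghdr *cmsg;

    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = cbuf;
    msg.msg_controllen = sizeof(cbuf);

    cmsg = CMSG_FIRSTHDR(&msg);
    cmsg->cmsg_level = SOL_SOCKET;
    cmsg->cmsg_type = SCM_RIGHTS;
    cmsg->cmsg_len = CMSG_LEN(sizeof(int));
    memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));

    return sendmsg(sock, &msg, 0) < 0 ? -1 : 0;
}

int migrate_to_block_device(void)
{
    int devfd = open("/dev/sdX", O_WRONLY);   /* pre-sized block device */
    int qmp = connect_unix("/tmp/qmp.sock");  /* QMP monitor socket */
    const char *migrate =
        "{\"execute\":\"migrate\",\"arguments\":{\"uri\":\"fd:migfd\"}}";

    if (devfd < 0 || qmp < 0) {
        return -1;
    }
    /* Register the device fd with qemu under the name "migfd"... */
    send_cmd_with_fd(qmp, devfd,
        "{\"execute\":\"getfd\",\"arguments\":{\"fdname\":\"migfd\"}}");
    /* ...then start the migration into it. */
    return write(qmp, migrate, strlen(migrate)) < 0 ? -1 : 0;
}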

2.2 Migrate to a stream, and use another process to receive and
rearrange the data.
Saving: memory.
Sacrifice: CPU (very high), block I/O (unless a big buffer is used).
Industry improvement solution: let another host or CPU do it.
Implementation: hard, needs a new qemu tool.
Qemu code maintenance: hard; the data has to be encoded in qemu, then
decoded and rearranged in another process, and every change or newly
added device has to be handled on both sides.
Detail:
  This invokes a separate process to receive the data, or a fake qemu
to receive and save it (which needs a lot of memory). Since the code
would be hard to maintain, personally I think it is worse than 2.1.
A minimal receiver sketch follows below.
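
  For completeness, a minimal sketch of the receiver side, written
against qemu's existing "exec:" migration transport, e.g.
migrate "exec:/usr/local/bin/vmstate-recv /tmp/vmstate.img" (the
program name and output path are made up for the example). It only
shows the plumbing of receiving the stream and storing it; the
decoding and rearranging step, which is the part that is hard to
maintain, is deliberately left out:

/* Read a migration stream from stdin and store it into a file. */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    char buf[1 << 16];
    ssize_t n;
    int out;

    if (argc != 2) {
        fprintf(stderr, "usage: %s <output-file>\n", argv[0]);
        return 1;
    }
    out = open(argv[1], O_WRONLY | O_CREAT | O_TRUNC, 0600);
    if (out < 0) {
        perror("open");
        return 1;
    }
    while ((n = read(STDIN_FILENO, buf, sizeof(buf))) > 0) {
        char *p = buf;
        while (n > 0) {                /* handle short writes */
            ssize_t w = write(out, p, n);
            if (w < 0) {
                perror("write");
                return 1;
            }
            p += w;
            n -= w;
        }
    }
    close(out);
    return 0;
}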


Summary:
  Suggestions:
  1) Support both method 1 and method 2.1, treating 2.1 as an
improvement of migrate-to-fd. Add a new qmp interface such as
"vmstate snapshot" for method 1 to declare it a true snapshot. This
allows it to work on different architectures.
  2) Push an API into Linux to do method 1, instead of using fork().
I'd like to send an RFC to the Linux memory mailing list to get
feedback.



-- 
Best Regards

Wenchao Xia



