From: zhanghailiang
Subject: Re: [Qemu-devel] [Migration Bug? ] Occasionally, the content of VM's memory is inconsistent between Source and Destination of migration
Date: Fri, 3 Apr 2015 17:20:11 +0800
User-agent: Mozilla/5.0 (Windows NT 6.1; rv:31.0) Gecko/20100101 Thunderbird/31.1.1

On 2015/4/3 16:51, Jason Wang wrote:


On 04/02/2015 07:52 PM, zhanghailiang wrote:
On 2015/4/1 3:06, Dr. David Alan Gilbert wrote:
* zhanghailiang (address@hidden) wrote:
On 2015/3/30 15:59, Dr. David Alan Gilbert wrote:
* zhanghailiang (address@hidden) wrote:
On 2015/3/27 18:18, Dr. David Alan Gilbert wrote:
* zhanghailiang (address@hidden) wrote:
On 2015/3/26 11:52, Li Zhijian wrote:
On 03/26/2015 11:12 AM, Wen Congyang wrote:
On 03/25/2015 05:50 PM, Juan Quintela wrote:
zhanghailiang <address@hidden> wrote:
Hi all,

We found that, sometimes, the content of the VM's memory is
inconsistent between the source side and the destination side
when we check it just after finishing migration but before the
VM continues to run.

We used a patch like the one below to find this issue (you can
find it in the attachment). Steps to reproduce:

(1) Compile QEMU:
   ./configure --target-list=x86_64-softmmu
--extra-ldflags="-lssl" && make

(2) Command and output:
SRC: # x86_64-softmmu/qemu-system-x86_64 -enable-kvm -cpu
qemu64,-kvmclock -netdev tap,id=hn0 -device
virtio-net-pci,id=net-pci0,netdev=hn0 -boot c -drive
file=/mnt/sdb/pure_IMG/sles/sles11_sp3.img,if=none,id=drive-virtio-disk0,cache=unsafe
-device
virtio-blk-pci,bus=pci.0,addr=0x4,drive=drive-virtio-disk0,id=virtio-disk0
-vnc :7 -m 2048 -smp 2 -device piix3-usb-uhci -device
usb-tablet -monitor stdio
Could you try to reproduce:
- without vhost
- without virtio-net
- cache=unsafe is going to give you trouble, but trouble should only
   happen after migration of the pages has finished.
If I use an IDE disk, it doesn't happen.
Even if I use virtio-net with vhost=on, it still doesn't happen. I
guess it is because I migrate the guest while it is booting; the
virtio-net device is not used in this case.
Er...
It reproduces with my IDE disk; there is no virtio device at all.
My command line is like below:

x86_64-softmmu/qemu-system-x86_64 -enable-kvm -cpu
qemu64,-kvmclock -net none
-boot c -drive file=/home/lizj/ubuntu.raw -vnc :7 -m 2048 -smp
2 -machine
usb=off -no-user-config -nodefaults -monitor stdio -vga std

It seems easy to reproduce this issue with the following steps in an
_ubuntu_ guest:
1. on the source side, choose memtest in grub
2. do live migration
3. exit memtest (press Esc while the memory test is running)
4. wait for migration to complete


Yes, it is a thorny problem. It is indeed easy to reproduce, just by
following your steps above.

This is my test result (I also tested accel=tcg; it can be reproduced
there as well):
Source side:
# x86_64-softmmu/qemu-system-x86_64 -machine
pc-i440fx-2.3,accel=kvm,usb=off -no-user-config -nodefaults
-cpu qemu64,-kvmclock -boot c -drive
file=/mnt/sdb/pure_IMG/ubuntu/ubuntu_14.04_server_64_2U_raw
-device cirrus-vga,id=video0,vgamem_mb=8 -vnc :7 -m 2048 -smp 2
-monitor stdio
(qemu) ACPI_BUILD: init ACPI tables
ACPI_BUILD: init ACPI tables
migrate tcp:9.61.1.8:3004
ACPI_BUILD: init ACPI tables
before cpu_synchronize_all_states
5a8f72d66732cac80d6a0d5713654c0e
md_host : before saving ram complete
5a8f72d66732cac80d6a0d5713654c0e
md_host : after saving ram complete
5a8f72d66732cac80d6a0d5713654c0e
(qemu)

Destination side:
# x86_64-softmmu/qemu-system-x86_64 -machine
pc-i440fx-2.3,accel=kvm,usb=off -no-user-config -nodefaults
-cpu qemu64,-kvmclock -boot c -drive
file=/mnt/sdb/pure_IMG/ubuntu/ubuntu_14.04_server_64_2U_raw
-device cirrus-vga,id=video0,vgamem_mb=8 -vnc :7 -m 2048 -smp 2
-monitor stdio -incoming tcp:0:3004
(qemu) QEMU_VM_SECTION_END, after loading ram
d7cb0d8a4bdd1557fb0e78baee50c986
md_host : after loading all vmstate
d7cb0d8a4bdd1557fb0e78baee50c986
md_host : after cpu_synchronize_all_post_init
d7cb0d8a4bdd1557fb0e78baee50c986

Hmm, that's not good.  I suggest you md5 each of the RAMBlocks
individually, to see if it's main RAM that's different or something
more subtle like video RAM.
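
(A minimal sketch of such a per-RAMBlock checksum, purely as a debug
aid and not from any posted patch; the helper name md5_each_ramblock is
made up, and it assumes the 2.3-era RAMBlock fields
idstr/host/used_length, that ram_list is visible where you call it, and
glib's checksum helper.  You would call it at the same points as the
existing md_host prints:)

    #include "qemu-common.h"
    #include "exec/cpu-all.h"      /* RAMBlock, ram_list in the 2.3 tree */
    #include "qemu/rcu_queue.h"
    #include <glib.h>

    /* hypothetical debug helper: checksum every RAMBlock separately so a
     * mismatch can be pinned to pc.ram, vga.vram, a ROM, etc. */
    static void md5_each_ramblock(const char *stage)
    {
        RAMBlock *block;

        rcu_read_lock();
        QLIST_FOREACH_RCU(block, &ram_list.blocks, next) {
            gchar *sum = g_compute_checksum_for_data(G_CHECKSUM_MD5,
                                                     block->host,
                                                     block->used_length);
            fprintf(stderr, "md_host: %s: block %s md5 %s\n",
                    stage, block->idstr, sum);
            g_free(sum);
        }
        rcu_read_unlock();
    }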


Er, all my previous tests md5'd the 'pc.ram' block only.

But then maybe it's easier just to dump the whole of RAM to file
and byte compare it (hexdump the two dumps and diff ?)

Hmm, we also used memcmp to compare every page, but the addresses of
the differing pages seem to be random.
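
(For what it's worth, a throwaway host-side tool along these lines can
at least show whether the differing pages cluster anywhere; it is only
a sketch, assuming you have two raw dumps of guest RAM taken at the
same point on both sides, e.g. with the monitor's pmemsave command,
and that both dumps start at guest physical address 0 and have the
same length:)

    /* compare two raw RAM dumps page by page and list differing pages */
    #include <stdio.h>
    #include <string.h>

    #define PAGE_SIZE 4096

    int main(int argc, char **argv)
    {
        unsigned char pa[PAGE_SIZE], pb[PAGE_SIZE];
        unsigned long page = 0, diff = 0;
        size_t ra, rb;
        FILE *a, *b;

        if (argc != 3) {
            fprintf(stderr, "usage: %s src.dump dst.dump\n", argv[0]);
            return 1;
        }
        a = fopen(argv[1], "rb");
        b = fopen(argv[2], "rb");
        if (!a || !b) {
            perror("fopen");
            return 1;
        }
        while ((ra = fread(pa, 1, PAGE_SIZE, a)) > 0 &&
               (rb = fread(pb, 1, PAGE_SIZE, b)) > 0) {
            if (ra != rb || memcmp(pa, pb, ra) != 0) {
                /* offset is only a gpa if the dump starts at gpa 0 */
                printf("page %lu differs (offset 0x%lx)\n",
                       page, page * (unsigned long)PAGE_SIZE);
                diff++;
            }
            page++;
        }
        printf("%lu of %lu pages differ\n", diff, page);
        fclose(a);
        fclose(b);
        return diff != 0;
    }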

Besides, in our previous tests we found it seems easier to reproduce
when migration occurs during the VM's start-up or reboot process.

Is it possible that some devices get special treatment during VM
start-up which may miss setting the dirty bitmap?

I don't think there should be, but the code paths used during startup
are probably much less tested with migration.  I'm sure the startup
code uses different parts of device emulation.  I do know we have some
bugs

Er, maybe there is a special case:

During the VM's start-up, I found that the KVMSlots changed many times;
it was a process of smashing the total memory space into smaller slots.

If some pages were dirtied and their bits were set in the KVM module's
dirty bitmap, but we didn't sync that bitmap to QEMU user space before
the slot was smashed (its previous bitmap being destroyed), then the
dirty pages recorded in the previous KVMSlot may be missed.

What's your opinion? Can the situation I described above happen?

The log below was grabbed when I tried to figure out a quite similar
problem (some pages missing their dirty-bitmap setting) that we found
in COLO. Occasionally, there will be an error report on the SLAVE side:

      qemu: warning: error while loading state for instance 0x0 of
      device 'kvm-tpr-opt'
      qemu-system-x86_64: loadvm failed

We found that it is related to three addresses (gpa: 0xca000, 0xcb000,
0xcc000, which are the addresses of 'kvmvapic.rom'?), and sometimes
their corresponding dirty bitmap will be missed on the master side,
because their KVMSlot is destroyed before we sync its dirty bitmap to
QEMU.

(I'm still not quite sure whether this can also happen in common
migration; I will try to test it in normal migration.)

Hi,

We have found two bugs (places) that miss setting the migration bitmap
for dirty pages.
The virtio-blk related one can be fixed by Wen Congyang's patch; you
can find his reply on the list.
The 'kvm-tpr-opt' related one can be fixed by the patch below.

Thanks,
zhang

From 0c63687d0f14f928d6eb4903022a7981db6ba59f Mon Sep 17 00:00:00 2001
From: zhanghailiang <address@hidden>
Date: Thu, 2 Apr 2015 19:26:31 +0000
Subject: [PATCH] kvm-all: Sync dirty-bitmap from kvm before kvm destroys
 the corresponding dirty_bitmap

Sometimes we destroy the dirty_bitmap in kvm_memory_slot before any
sync action occurs; the bits in that dirty_bitmap are then lost, which
leads to the corresponding dirty pages being missed in migration.

This usually happens when migrating during the VM's start-up or reboot.

Signed-off-by: zhanghailiang <address@hidden>
---
  exec.c    | 2 +-
  kvm-all.c | 4 +++-
  2 files changed, 4 insertions(+), 2 deletions(-)

diff --git a/exec.c b/exec.c
index 874ecfc..4b1b39b 100644
--- a/exec.c
+++ b/exec.c
@@ -59,7 +59,7 @@
  //#define DEBUG_SUBPAGE

  #if !defined(CONFIG_USER_ONLY)
-static bool in_migration;
+bool in_migration;

  /* ram_list is read under rcu_read_lock()/rcu_read_unlock().  Writes
   * are protected by the ramlist lock.
diff --git a/kvm-all.c b/kvm-all.c
index 335438a..dd75eff 100644
--- a/kvm-all.c
+++ b/kvm-all.c
@@ -128,6 +128,8 @@ bool kvm_allowed;
  bool kvm_readonly_mem_allowed;
  bool kvm_vm_attributes_allowed;

+extern bool in_migration;
+
  static const KVMCapabilityInfo kvm_required_capabilites[] = {
      KVM_CAP_INFO(USER_MEMORY),
      KVM_CAP_INFO(DESTROY_MEMORY_REGION_WORKS),
@@ -715,7 +717,7 @@ static void kvm_set_phys_mem(MemoryRegionSection *section, bool add)

          old = *mem;

-        if (mem->flags & KVM_MEM_LOG_DIRTY_PAGES) {
+        if (mem->flags & KVM_MEM_LOG_DIRTY_PAGES || in_migration) {
              kvm_physical_sync_dirty_bitmap(section);
          }

--

I can still see an XFS panic complaining "Corruption of in-memory data
detected." in the guest after migration, even with this patch and an
IDE disk.


What's your qemu command line?

Thanks,
zhanghailiang




