From: Juan Quintela
Subject: Re: [Qemu-devel] [Migration Bug? ] Occasionally, the content of VM's memory is inconsistent between Source and Destination of migration
Date: Wed, 25 Mar 2015 10:50:26 +0100
User-agent: Gnus/5.13 (Gnus v5.13) Emacs/24.4 (gnu/linux)

zhanghailiang <address@hidden> wrote:
> Hi all,
>
> We found that, sometimes, the content of the VM's memory is inconsistent
> between the Source side and the Destination side when we check it just
> after finishing migration but before the VM continues to run.
>
> We use a patch like the one appended below to find this issue, and the
> steps to reproduce are:
>
> (1) Compile QEMU:
>  ./configure --target-list=x86_64-softmmu  --extra-ldflags="-lssl" && make
>
> (2) Command and output:
> SRC: # x86_64-softmmu/qemu-system-x86_64 -enable-kvm -cpu qemu64,-kvmclock 
> -netdev tap,id=hn0 -device virtio-net-pci,id=net-pci0,netdev=hn0 -boot c 
> -drive 
> file=/mnt/sdb/pure_IMG/sles/sles11_sp3.img,if=none,id=drive-virtio-disk0,cache=unsafe
>  -device 
> virtio-blk-pci,bus=pci.0,addr=0x4,drive=drive-virtio-disk0,id=virtio-disk0 
> -vnc :7 -m 2048 -smp 2 -device piix3-usb-uhci -device usb-tablet -monitor 
> stdio

Could you try to reproduce:
- without vhost
- without virtio-net
- cache=unsafe is going to give you trouble, but trouble should only
  happen after the migration of pages has finished.

What kind of load was the guest running when you reproduced this issue?
Just to confirm: you have been able to reproduce this without the COLO
patches, right?
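
For instance, a source-side command line without vhost and without virtio-net
(a hypothetical variant, substituting an e1000 NIC and cache=writeback,
everything else unchanged) could look like:

  # x86_64-softmmu/qemu-system-x86_64 -enable-kvm -cpu qemu64,-kvmclock \
      -netdev tap,id=hn0 -device e1000,netdev=hn0 -boot c \
      -drive file=/mnt/sdb/pure_IMG/sles/sles11_sp3.img,if=none,id=drive-virtio-disk0,cache=writeback \
      -device virtio-blk-pci,bus=pci.0,addr=0x4,drive=drive-virtio-disk0,id=virtio-disk0 \
      -vnc :7 -m 2048 -smp 2 -device piix3-usb-uhci -device usb-tablet -monitor stdio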

> (qemu) migrate tcp:192.168.3.8:3004
> before saving ram complete
> ff703f6889ab8701e4e040872d079a28
> md_host : after saving ram complete
> ff703f6889ab8701e4e040872d079a28
>
> DST: # x86_64-softmmu/qemu-system-x86_64 -enable-kvm -cpu qemu64,-kvmclock 
> -netdev tap,id=hn0,vhost=on -device virtio-net-pci,id=net-pci0,netdev=hn0 
> -boot c -drive 
> file=/mnt/sdb/pure_IMG/sles/sles11_sp3.img,if=none,id=drive-virtio-disk0,cache=unsafe
>  -device 
> virtio-blk-pci,bus=pci.0,addr=0x4,drive=drive-virtio-disk0,id=virtio-disk0 
> -vnc :7 -m 2048 -smp 2 -device piix3-usb-uhci -device usb-tablet -monitor 
> stdio -incoming tcp:0:3004
> (qemu) QEMU_VM_SECTION_END, after loading ram
> 230e1e68ece9cd4e769630e1bcb5ddfb
> md_host : after loading all vmstate
> 230e1e68ece9cd4e769630e1bcb5ddfb
> md_host : after cpu_synchronize_all_post_init
> 230e1e68ece9cd4e769630e1bcb5ddfb
>
> This happens occasionally, and it is easier to reproduce when the migrate
> command is issued during the VM's startup time.

OK, a couple of things.  Memory doesn't have to be exactly identical.
Virtio devices in particular do funny things on "post-load".  There
are no guarantees about that as far as I know; we should just end up
with an equivalent device state in memory.

> We have done further tests and found that some pages have been dirtied but
> their corresponding migration_bitmap bits are not set.
> We can't figure out which module of QEMU missed setting the bitmap when
> dirtying the VM's pages;
> it is very difficult for us to trace all the actions that dirty the VM's memory.

This seems to point to a bug in one of the devices.
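
To narrow that down, one option is to hash the guest RAM page by page instead
of in one go, and diff the per-page output between source and destination;
the pages whose hashes differ tell you which memory the missing dirty-bitmap
updates belong to.  A minimal sketch, reusing the same RAMBlock fields the
debug patch quoted below already touches (host, used_length) and assuming
TARGET_PAGE_SIZE and the MIN macro are visible in savevm.c:

    static void check_host_md5_per_page(void)
    {
        unsigned char md[MD5_DIGEST_LENGTH];
        ram_addr_t offset;
        int i;
        /* Only check the first block ('pc.ram'), as in check_host_md5() */
        RAMBlock *block = QLIST_FIRST_RCU(&ram_list.blocks);

        for (offset = 0; offset < block->used_length; offset += TARGET_PAGE_SIZE) {
            MD5_CTX ctx;
            size_t len = MIN((ram_addr_t)TARGET_PAGE_SIZE,
                             block->used_length - offset);

            MD5_Init(&ctx);
            MD5_Update(&ctx, block->host + offset, len);
            MD5_Final(md, &ctx);

            /* One line per page: offset followed by its MD5 */
            fprintf(stderr, "page 0x%08" PRIx64 " ", (uint64_t)offset);
            for (i = 0; i < MD5_DIGEST_LENGTH; i++) {
                fprintf(stderr, "%02x", md[i]);
            }
            fprintf(stderr, "\n");
        }
    }

Redirecting stderr to a file on both sides and diffing the two files should
point at the exact offending pages.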

> Actually, the first time we found this problem was during COLO FT
> development, and it triggered some strange issues in the VM which all
> pointed to inconsistency of the VM's memory. (If we save all of the VM's
> memory to the slave side every time we do a checkpoint in COLO FT,
> everything is OK.)
>
> Is it OK for some pages not to be transferred to the destination when
> doing migration?  Or is it a bug?

The transferred pages should be the same; it is after the device state
transmission that things could change.
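
If you want to catch the device responsible, one approach is to checksum the
RAM right before and right after each section is loaded on the destination;
whichever section changes the hash is the one writing to guest memory.  A
rough sketch on top of your debug patch (same check_host_md5() helper, placed
around the vmstate_load() call in qemu_loadvm_state()):

    printf("before loading %s section\n", le->se->idstr);
    check_host_md5();

    ret = vmstate_load(f, le->se, le->version_id);

    printf("after loading %s section\n", le->se->idstr);
    check_host_md5();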

> This issue has blocked our COLO development... :(
>
> Any help will be greatly appreciated!

Later, Juan.

>
> Thanks,
> zhanghailiang
>
> --- a/savevm.c
> +++ b/savevm.c
> @@ -51,6 +51,26 @@
>  #define ARP_PTYPE_IP 0x0800
>  #define ARP_OP_REQUEST_REV 0x3
>
> +#include "qemu/rcu_queue.h"
> +#include <openssl/md5.h>
> +
> +static void check_host_md5(void)
> +{
> +    int i;
> +    unsigned char md[MD5_DIGEST_LENGTH];
> +    MD5_CTX ctx;
> +    RAMBlock *block = QLIST_FIRST_RCU(&ram_list.blocks); /* Only check 'pc.ram' block */
> +
> +    MD5_Init(&ctx);
> +    MD5_Update(&ctx, (void *)block->host, block->used_length);
> +    MD5_Final(md, &ctx);
> +    fprintf(stderr, "md_host : ");
> +    for (i = 0; i < MD5_DIGEST_LENGTH; i++) {
> +        fprintf(stderr, "%02x", md[i]);
> +    }
> +    fprintf(stderr, "\n");
> +}
> +
>  static int announce_self_create(uint8_t *buf,
>                                  uint8_t *mac_addr)
>  {
> @@ -741,7 +761,13 @@ void qemu_savevm_state_complete(QEMUFile *f)
>          qemu_put_byte(f, QEMU_VM_SECTION_END);
>          qemu_put_be32(f, se->section_id);
>
> +        printf("before saving %s complete\n", se->idstr);
> +        check_host_md5();
> +
>          ret = se->ops->save_live_complete(f, se->opaque);
> +        printf("after saving %s complete\n", se->idstr);
> +        check_host_md5();
> +
>          trace_savevm_section_end(se->idstr, se->section_id, ret);
>          if (ret < 0) {
>              qemu_file_set_error(f, ret);
> @@ -1030,6 +1063,11 @@ int qemu_loadvm_state(QEMUFile *f)
>              }
>
>              ret = vmstate_load(f, le->se, le->version_id);
> +            if (section_type == QEMU_VM_SECTION_END) {
> +                printf("QEMU_VM_SECTION_END, after loading %s\n", le->se->idstr);
> +                check_host_md5();
> +            }
> +
>              if (ret < 0) {
>                  error_report("error while loading state section id %d(%s)",
>                               section_id, le->se->idstr);
> @@ -1061,7 +1099,11 @@ int qemu_loadvm_state(QEMUFile *f)
>          g_free(buf);
>      }
>
> +    printf("after loading all vmstate\n");
> +    check_host_md5();
>      cpu_synchronize_all_post_init();
> +    printf("after cpu_synchronize_all_post_init\n");
> +    check_host_md5();
>
>      ret = 0;
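
(As a side note, the helper above hashes only the first RAMBlock.  If you
ever need to cover all of guest RAM, a sketch that walks every block
(assuming the RAMBlock list link is named 'next', and using the
QLIST_FOREACH_RCU macro from the qemu/rcu_queue.h header the patch already
includes) would be:

    static void check_host_md5_all_blocks(void)
    {
        unsigned char md[MD5_DIGEST_LENGTH];
        RAMBlock *block;
        int i;

        rcu_read_lock();
        QLIST_FOREACH_RCU(block, &ram_list.blocks, next) {
            MD5_CTX ctx;

            MD5_Init(&ctx);
            MD5_Update(&ctx, (void *)block->host, block->used_length);
            MD5_Final(md, &ctx);

            /* Print one MD5 per RAMBlock, tagged with its idstr */
            fprintf(stderr, "md_host %s: ", block->idstr);
            for (i = 0; i < MD5_DIGEST_LENGTH; i++) {
                fprintf(stderr, "%02x", md[i]);
            }
            fprintf(stderr, "\n");
        }
        rcu_read_unlock();
    }

That would also show whether the mismatch is confined to pc.ram or shows up
in other blocks such as the VGA or ROM regions.)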


