From: Wen Congyang
Subject: Re: [Qemu-devel] [Migration Bug? ] Occasionally, the content of VM's memory is inconsistent between Source and Destination of migration
Date: Thu, 2 Apr 2015 17:14:48 +0800
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:31.0) Gecko/20100101 Thunderbird/31.4.0

On 03/26/2015 06:29 PM, Juan Quintela wrote:
> Wen Congyang <address@hidden> wrote:
>> On 03/25/2015 05:50 PM, Juan Quintela wrote:
>>> zhanghailiang <address@hidden> wrote:
>>>> Hi all,
>>>>
>>>> We found that, sometimes, the content of the VM's memory is
>>>> inconsistent between the source side and the destination side
>>>> when we check it just after migration finishes but before the VM
>>>> continues to run.
>>>>
>>>> We use a patch like the one below to find this issue (see the attachment).
>>>> Steps to reproduce:
>>>>
>>>> (1) Compile QEMU:
>>>> ./configure --target-list=x86_64-softmmu --extra-ldflags="-lssl" && make
>>>>
>>>> (2) Command and output:
>>>> SRC: # x86_64-softmmu/qemu-system-x86_64 -enable-kvm -cpu
>>>> qemu64,-kvmclock -netdev tap,id=hn0 -device
>>>> virtio-net-pci,id=net-pci0,netdev=hn0 -boot c -drive
>>>> file=/mnt/sdb/pure_IMG/sles/sles11_sp3.img,if=none,id=drive-virtio-disk0,cache=unsafe
>>>> -device
>>>> virtio-blk-pci,bus=pci.0,addr=0x4,drive=drive-virtio-disk0,id=virtio-disk0
>>>> -vnc :7 -m 2048 -smp 2 -device piix3-usb-uhci -device usb-tablet
>>>> -monitor stdio
>>>
>>> Could you try to reproduce:
>>> - without vhost
>>> - without virtio-net
>>> - cache=unsafe is going to give you trouble, but trouble should only
>>> happen after migration of pages has finished.
>>
>> If I use ide disk, it doesn't happen.
>> Even if I use virtio-net with vhost=on, it still doesn't happen. I guess
>> it is because I migrate the guest when it is booting. The virtio net
>> device is not used in this case.
>
> Kevin, Stefan, Michael, any great idea?
The following patch can fix this problem (vhost=off):

From ebc024702dd3147e0cbdfd173c599103dc87796c Mon Sep 17 00:00:00 2001
From: Wen Congyang <address@hidden>
Date: Thu, 2 Apr 2015 16:28:17 +0800
Subject: [PATCH] fix qiov size
Signed-off-by: Wen Congyang <address@hidden>
---
hw/block/virtio-blk.c | 15 +++++++++++++++
include/hw/virtio/virtio-blk.h | 1 +
2 files changed, 16 insertions(+)
diff --git a/hw/block/virtio-blk.c b/hw/block/virtio-blk.c
index 000c38d..13967bc 100644
--- a/hw/block/virtio-blk.c
+++ b/hw/block/virtio-blk.c
@@ -33,6 +33,7 @@ VirtIOBlockReq *virtio_blk_alloc_request(VirtIOBlock *s)
VirtIOBlockReq *req = g_slice_new(VirtIOBlockReq);
req->dev = s;
req->qiov.size = 0;
+ req->size = 0;
req->next = NULL;
req->mr_next = NULL;
return req;
@@ -97,12 +98,20 @@ static void virtio_blk_rw_complete(void *opaque, int ret)
* external iovec. It was allocated in submit_merged_requests
* to be able to merge requests. */
qemu_iovec_destroy(&req->qiov);
+
+ /* Restore qiov->size here */
+ req->qiov.size = req->size;
}
if (ret) {
int p = virtio_ldl_p(VIRTIO_DEVICE(req->dev), &req->out.type);
bool is_read = !(p & VIRTIO_BLK_T_OUT);
if (virtio_blk_handle_rw_error(req, -ret, is_read)) {
+ /*
+ * FIXME:
+ * The memory may be dirtied on read failure, it will
+ * break live migration.
+ */
continue;
}
}
@@ -323,6 +332,12 @@ static inline void submit_requests(BlockBackend *blk, MultiReqBuffer *mrb,
struct iovec *tmp_iov = qiov->iov;
int tmp_niov = qiov->niov;
+ /*
+ * Save old qiov->size, which will be used in
+ * virtio_blk_complete_request()
+ */
+ mrb->reqs[start]->size = qiov->size;
+
/* mrb->reqs[start]->qiov was initialized from external so we can't
* modifiy it here. We need to initialize it locally and then add the
* external iovecs. */
diff --git a/include/hw/virtio/virtio-blk.h b/include/hw/virtio/virtio-blk.h
index b3ffcd9..7d47310 100644
--- a/include/hw/virtio/virtio-blk.h
+++ b/include/hw/virtio/virtio-blk.h
@@ -67,6 +67,7 @@ typedef struct VirtIOBlockReq {
struct virtio_blk_inhdr *in;
struct virtio_blk_outhdr out;
QEMUIOVector qiov;
+ size_t size;
struct VirtIOBlockReq *next;
struct VirtIOBlockReq *mr_next;
BlockAcctCookie acct;
--
2.1.0
PS: I have not checked whether virtio-scsi, virtio-net, etc. have a similar problem.
If vhost=on, we can also reproduce this problem.
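
As a rough illustration of the save/restore idea in the patch above, here is a minimal standalone sketch. It is not QEMU code: the toy_* types and functions are hypothetical stand-ins, and it assumes that destroying the locally allocated vector resets its size to 0, so that the completion path would otherwise report (and mark dirty) zero bytes for a merged request.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for illustration only; NOT the QEMU types. */
struct toy_qiov {
    size_t size;   /* total bytes described by the vector */
    int nalloc;    /* -1 means "external", anything else "locally allocated" */
};

struct toy_req {
    struct toy_qiov qiov;
    size_t size;   /* saved copy of qiov.size, like the new field in the patch */
};

/* Models the merge step: remember the original size before the request's
 * vector is replaced by a locally allocated, merged one. */
static void toy_merge(struct toy_req *req, size_t merged_size)
{
    assert(req != NULL);
    req->size = req->qiov.size;
    req->qiov.size = merged_size;
    req->qiov.nalloc = 1;
}

/* Models destroying a local vector; assumed to reset the size to 0. */
static void toy_destroy(struct toy_qiov *qiov)
{
    qiov->size = 0;
    qiov->nalloc = 0;
}

/* Models completion: returns the number of bytes that would be reported
 * as written to guest memory for this request. With restore == 0 the
 * saved size is ignored, which mimics the behaviour before the patch. */
static size_t toy_complete(struct toy_req *req, int restore)
{
    assert(req != NULL);
    if (req->qiov.nalloc != -1) {
        toy_destroy(&req->qiov);
        if (restore) {
            req->qiov.size = req->size;
        }
    }
    return req->qiov.size;
}
```

With a 4096-byte request that has gone through toy_merge(), toy_complete(&req, 0) reports 0 bytes while toy_complete(&req, 1) reports the original 4096, which is the difference the patch is after.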
>
> Thanks, Juan.
>
>>
>> Thanks
>> Wen Congyang
>>
>>>
>>> What kind of load were you having when reproducing this issue?
>>> Just to confirm, you have been able to reproduce this without COLO
>>> patches, right?
>>>
>>>> (qemu) migrate tcp:192.168.3.8:3004
>>>> before saving ram complete
>>>> ff703f6889ab8701e4e040872d079a28
>>>> md_host : after saving ram complete
>>>> ff703f6889ab8701e4e040872d079a28
>>>>
>>>> DST: # x86_64-softmmu/qemu-system-x86_64 -enable-kvm -cpu
>>>> qemu64,-kvmclock -netdev tap,id=hn0,vhost=on -device
>>>> virtio-net-pci,id=net-pci0,netdev=hn0 -boot c -drive
>>>> file=/mnt/sdb/pure_IMG/sles/sles11_sp3.img,if=none,id=drive-virtio-disk0,cache=unsafe
>>>> -device
>>>> virtio-blk-pci,bus=pci.0,addr=0x4,drive=drive-virtio-disk0,id=virtio-disk0
>>>> -vnc :7 -m 2048 -smp 2 -device piix3-usb-uhci -device usb-tablet
>>>> -monitor stdio -incoming tcp:0:3004
>>>> (qemu) QEMU_VM_SECTION_END, after loading ram
>>>> 230e1e68ece9cd4e769630e1bcb5ddfb
>>>> md_host : after loading all vmstate
>>>> 230e1e68ece9cd4e769630e1bcb5ddfb
>>>> md_host : after cpu_synchronize_all_post_init
>>>> 230e1e68ece9cd4e769630e1bcb5ddfb
>>>>
>>>> This happens occasionally, and it is easier to reproduce when the
>>>> migration command is issued during the VM's startup.
>>>
>>> OK, a couple of things. Memory doesn't have to be exactly identical.
>>> Virtio devices in particular do funny things on "post-load". There
>>> aren't guarantees for that as far as I know; we should end with an
>>> equivalent device state in memory.
>>>
>>>> We have done further tests and found that some pages have been
>>>> dirtied but their corresponding bits in migration_bitmap are not set.
>>>> We can't figure out which module of QEMU fails to set the bitmap
>>>> when it dirties a page of the VM;
>>>> it is very difficult for us to trace all the actions that dirty the VM's
>>>> pages.
>>>
>>> This seems to point to a bug in one of the devices.
>>>
>>>> Actually, the first time we found this problem was during COLO FT
>>>> development, and it triggered some strange issues in the
>>>> VM which all pointed to inconsistency of the VM's
>>>> memory. (We have tried saving all of the VM's memory to the slave side every
>>>> time
>>>> we do a checkpoint in COLO FT, and then everything is OK.)
>>>>
>>>> Is it OK for some pages not to be transferred to the destination during
>>>> migration? Or is it a bug?
>>>
>>> Pages transferred should be the same, after device state transmission is
>>> when things could change.
>>>
>>>> This issue has blocked our COLO development... :(
>>>>
>>>> Any help will be greatly appreciated!
>>>
>>> Later, Juan.
>>>
> .
>
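
The observation quoted above, that some pages were dirtied without their bit being set in the migration bitmap, can be checked offline with something like the following. This is only a sketch under simplifying assumptions: two flat memory images of equal size, a fixed 4096-byte page, and a plain array of unsigned long as the dirty bitmap. The toy_* names are hypothetical and do not correspond to QEMU functions.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define TOY_PAGE_SIZE 4096u
#define TOY_BITS_PER_LONG (sizeof(unsigned long) * CHAR_BIT)

/* Returns 1 if bit nr is set in the bitmap, 0 otherwise. */
static int toy_test_bit(const unsigned long *bitmap, size_t nr)
{
    return (int)((bitmap[nr / TOY_BITS_PER_LONG] >>
                  (nr % TOY_BITS_PER_LONG)) & 1UL);
}

/* Counts pages whose contents differ between the two images although the
 * corresponding dirty bit is clear, i.e. pages a bitmap-driven transfer
 * would have skipped. */
static size_t toy_count_missed_pages(const uint8_t *src, const uint8_t *dst,
                                     size_t npages,
                                     const unsigned long *dirty_bitmap)
{
    size_t missed = 0;

    assert(src != NULL && dst != NULL && dirty_bitmap != NULL);
    for (size_t i = 0; i < npages; i++) {
        const size_t off = i * TOY_PAGE_SIZE;

        if (memcmp(src + off, dst + off, TOY_PAGE_SIZE) != 0 &&
            !toy_test_bit(dirty_bitmap, i)) {
            missed++;
        }
    }
    return missed;
}
```

A non-zero result means at least one page changed without being recorded as dirty, which matches the symptom described in the original report.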