Re: [Qemu-devel] Live migration results in non-working virtio-net device
From: Neil Skrypuch
Subject: Re: [Qemu-devel] Live migration results in non-working virtio-net device (sometimes)
Date: Mon, 03 Mar 2014 15:15:06 -0500
User-agent: KMail/4.10.2 (Linux/3.9.1-gentoo-r1; KDE/4.10.2; x86_64; ; )
On Saturday 01 March 2014 10:34:03 陈梁 wrote:
> > On Thursday 30 January 2014 13:23:04 Neil Skrypuch wrote:
> >> First, let me briefly outline the way we use live migration, as it is
> >> probably not typical. We use live migration (with block migration) to
> >> make
> >> backups of VMs with zero downtime. The basic process goes like this:
> >>
> >> 1) migrate src VM -> dest VM
> >> 2) migration completes
> >> 3) cont src VM
> >> 4) gracefully shut down dest VM
> >> 5) dest VM's disk image is now a valid backup
> >>
> >> In general, this works very well.
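[Editorial note: the five-step cycle above can be sketched as a small shell helper. The monitor-socket path matches the commands quoted later in this post; the BACKUP path, the polling loop, and the monitor()/backup_vm() helper names are illustrative assumptions, not part of the original setup.]

```shell
#!/bin/sh
# Minimal sketch of the zero-downtime backup cycle (steps 1-5 above).
# Socket path is from the post; everything else is a hypothetical example.
MON=/var/lib/kvm/monitor/testbackup
BACKUP=/backup/testbackup.img.bak"$(date +%Y%m%d)"

monitor() {
    # send one human-monitor command to the source VM
    echo "$1" | socat STDIO UNIX-CONNECT:"$MON"
}

backup_vm() {
    monitor "migrate -b tcp:localhost:4444"   # 1) migrate with block migration
    until monitor "info migrate" | grep -q completed; do sleep 1; done  # 2)
    monitor "cont"                            # 3) resume the source VM
    # 4) gracefully shut down the dest VM (e.g. "system_powerdown" on its
    #    own monitor socket); 5) its disk image is now a consistent backup
    echo "backup image: $BACKUP"
}
```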
> >>
> >> Up until now we have been using qemu-kvm 1.1.2 and have not had any
> >> issues
> >> with the above process. I am now attempting to upgrade us to a newer
> >> version of qemu, but all newer versions I've tried occasionally result
> >> in the virtio-net device ceasing to function on the src VM after step 3.
> >>
> >> I am able to reproduce this reliably (given enough iterations); it
> >> happens in roughly 2% of all migrations.
> >>
> >> Here is the complete qemu command line for the src VM:
> >>
> >> /usr/bin/qemu-system-x86_64 -machine accel=kvm -drive
> >> file=/var/lib/kvm/testbackup.polldev.com.img,if=virtio -m 2048 -smp
> >> 4,cores=4,sockets=1,threads=1 -net
> >> nic,macaddr=52:54:98:00:00:00,model=virtio -net
> >> tap,script=/etc/qemu-ifup-br2,downscript=no -curses -name
> >> "testbackup.polldev.com",process=testbackup.polldev.com -monitor
> >> unix:/var/lib/kvm/monitor/testbackup,server,nowait
> >>
> >> The dest VM:
> >>
> >> /usr/bin/qemu-system-x86_64 -machine accel=kvm -drive
> >> file=/backup/testbackup.polldev.com.img.bak20140129,if=virtio -m 2048
> >> -smp
> >> 4,cores=4,sockets=1,threads=1 -net
> >> nic,macaddr=52:54:98:00:00:00,model=virtio -net
> >> tap,script=no,downscript=no
> >> -curses -name "testbackup.polldev.com",process=testbackup.polldev.com
> >> -monitor unix:/var/lib/kvm/monitor/testbackup.bak,server,nowait -incoming
> >> tcp:0:4444
> >>
> >> The migration is performed like so:
> >>
> >> echo "migrate -b tcp:localhost:4444" | socat STDIO UNIX-CONNECT:/var/lib/kvm/monitor/testbackup
> >> echo "migrate_set_speed 1G" | socat STDIO UNIX-CONNECT:/var/lib/kvm/monitor/testbackup
> >> #wait
> >> echo cont | socat STDIO UNIX-CONNECT:/var/lib/kvm/monitor/testbackup
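[Editorial note: given the ~2% failure rate, reproducing takes many iterations. A sketch of such a loop, built from the monitor commands above; GUEST_IP, the settle delay, and the hmp()/one_pass()/repro() helpers are assumptions for illustration, not from the report.]

```shell
#!/bin/sh
# Hypothetical reproduction loop: repeat the migrate/cont cycle and ping
# the guest after each pass until networking breaks.
MON=/var/lib/kvm/monitor/testbackup
GUEST_IP=192.168.1.50    # assumed guest address

hmp() { echo "$1" | socat STDIO UNIX-CONNECT:"$MON"; }

one_pass() {
    hmp "migrate_set_speed 1G"
    hmp "migrate -b tcp:localhost:4444"
    until hmp "info migrate" | grep -q completed; do sleep 1; done
    hmp "cont"
    sleep 5                                  # let the guest settle
    ping -c 3 -W 2 "$GUEST_IP" >/dev/null    # fails once RX has stalled
}

repro() {
    n=1
    while one_pass; do
        echo "pass $n OK"; n=$((n + 1))      # restart the dest VM here
    done
    echo "virtio-net stopped receiving after pass $n"
}
```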
> >>
> >> The guest in question is a minimal install of CentOS 6.5.
> >>
> >> I have observed this issue across the following qemu versions:
> >>
> >> qemu 1.4.2
> >> qemu 1.6.0
> >> qemu 1.6.1
> >> qemu 1.7.0
> >>
> >> I also attempted to test qemu 1.5.3, but live migration flat out crashed
> >> there (totally different issue).
> >>
> >> I have also tested a number of other scenarios with qemu 1.6.0, all of
> >> which exhibit the same failure mode:
> >>
> >> qemu 1.6.0 + host kernel 3.1.0
> >> qemu 1.6.0 + host kernel 3.10.7
> >> qemu 1.6.0 + host kernel 3.10.17
> >> qemu 1.6.0 + virtio with -netdev/-device syntax
> >> qemu 1.6.0 + accel=tcg
> >>
> >> The one case I have found that works properly is the following:
> >>
> >> qemu 1.6.0 + e1000
> >>
> >> It is worth noting that when the virtio-net device ceases to function in
> >> the guest that removing and reinserting the virtio-net kernel module
> >> results in the device working again (except in 1.4.2, this had no effect
> >> there).
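[Editorial note: the module-reload workaround mentioned above amounts to the following inside the guest, run as root. The ifup/dhclient step is an assumption for a DHCP-configured CentOS 6 guest.]

```shell
#!/bin/sh
# In-guest workaround sketch: reload the virtio-net driver to recover RX.
# As noted above, this helps on 1.6.x/1.7.0 but had no effect on 1.4.2.
recover_virtio_net() {
    modprobe -r virtio_net    # detach the driver; the interface disappears
    modprobe virtio_net       # re-probe; the device comes back working
    ifup eth0 2>/dev/null || dhclient eth0   # re-acquire the address (assumed DHCP)
}
```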
> >>
> >> As mentioned above I can reproduce this with minimal effort, and am
> >> willing
> >> to test out any patches or provide further details as necessary.
> >>
> >> - Neil
> >
> > Ok, I was able to narrow this down to somewhere in between 1.2.2 (or
> > rather, 1.2.0) and 1.3.0. Migration in 1.3.0 is broken, however, I was
> > able to cherry pick d7cd369, d5f1f28, and 9ee0cb2 on top of 1.3.0 to fix
> > the unrelated migration bug and confirm that the bug from this thread is
> > still present in 1.3.0.
> >
> > I started a git bisect on 1.2.2..1.3.0 but didn't get very far before
> > running into several unrelated bugs which kept migration from working.
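[Editorial note: a sketch of the stalled bisect over that range. test-migration.sh is a hypothetical script that exits 0 when repeated migrations leave networking intact; the three commits are the migration fixes named above.]

```shell
#!/bin/sh
# Bisect sketch for the v1.2.2..v1.3.0 range described above.
bisect_virtio_bug() {
    git bisect start v1.3.0 v1.2.2      # bad endpoint, then good endpoint
    # Some steps won't migrate at all without the unrelated fixes; cherry-pick
    # d7cd369, d5f1f28 and 9ee0cb2 onto each step before testing it.
    git bisect run ./test-migration.sh  # exit 0 = good, 1-124 = bad, 125 = skip
}
```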
> >
> > I also tested out the latest master code (d844a7b) and it fails in the
> > same
> > way as 1.7.0.
> >
> > - Neil
>
> Hi, have you tried pinging from the VM to another host after the migration?
Yes, pings from the VM to anywhere result in "Destination Host Unreachable";
it's not the usual moved-MAC-address problem with migration. Note that the
problem occurs on the *source* VM, not the destination VM; the destination VM
is intentionally configured with an unconnected network interface (script=no).
Also, I had a closer look at the source VM's state after the network stops
working. If I initiate a ping from inside the VM, via tcpdump I can see ARP
traffic on the host's corresponding tap and bridge adaptors (both the request
and response), however, tcpdump from inside the guest does not see either of
these.
I can see the TX count on eth0 inside the guest is increasing, but the RX
count is not moving. On the host, I can see the RX count on the tap is
increasing, but the TX is not. Similarly, the dropped count on the tap is
rising rapidly:
tap0      Link encap:Ethernet  HWaddr fe:19:99:0a:9b:07
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:452 errors:0 dropped:0 overruns:0 frame:0
          TX packets:62626 errors:0 dropped:954919 overruns:0 carrier:0
          collisions:0 txqueuelen:500
          RX bytes:29104 (28.4 KiB)  TX bytes:8726592 (8.3 MiB)
If I try to ping the guest from an external host, I can see the ICMP request
reach the tap adaptor on the host, but never a response and nothing in the
guest.
It seems like the TX side is working properly. Is it possible that the RX side
of the virtio-net adaptor is in a confused state, resulting in dropped
packets?
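[Editorial note: the host-side observations above can be gathered with commands along these lines. tap0 is the interface name from the post; the sysfs statistics paths are standard Linux, and the helper names are illustrative.]

```shell
#!/bin/sh
# Host-side diagnostic sketch for the stalled-RX state described above.
tap_stats() {
    dev=${1:-tap0}
    # packets the host received from the guest (keeps increasing)
    cat /sys/class/net/"$dev"/statistics/rx_packets
    # packets the host delivered to the guest (stalls when the bug hits)
    cat /sys/class/net/"$dev"/statistics/tx_packets
    # drops on the host-to-guest path (climbs rapidly when the bug hits)
    cat /sys/class/net/"$dev"/statistics/tx_dropped
}

watch_arp() {
    # ARP request *and* reply are visible here, but never inside the guest
    tcpdump -ni "${1:-tap0}" -c 10 arp
}
```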
- Neil