From: Neil Skrypuch
Subject: Re: [Qemu-devel] Live migration results in non-working virtio-net device (sometimes)
Date: Mon, 03 Mar 2014 15:15:06 -0500
User-agent: KMail/4.10.2 (Linux/3.9.1-gentoo-r1; KDE/4.10.2; x86_64; ; )

On Saturday 01 March 2014 10:34:03 陈梁 wrote:
> > On Thursday 30 January 2014 13:23:04 Neil Skrypuch wrote:
> >> First, let me briefly outline the way we use live migration, as it is
> >> probably not typical. We use live migration (with block migration) to
> >> make
> >> backups of VMs with zero downtime. The basic process goes like this:
> >> 
> >> 1) migrate src VM -> dest VM
> >> 2) migration completes
> >> 3) cont src VM
> >> 4) gracefully shut down dest VM
> >> 5) dest VM's disk image is now a valid backup
> >> 
> >> In general, this works very well.
> >> 
> >> Up until now we have been using qemu-kvm 1.1.2 and have not had any
> >> issues
> >> with the above process. I am now attempting to upgrade us to a newer
> >> version of qemu, but all newer versions I've tried occasionally result
> >> in the virtio-net device ceasing to function on the src VM after step
> >> 3.
> >> 
> >> I am able to reproduce this reliably (given enough iterations); it
> >> happens in roughly 2% of all migrations.
> >> 
> >> Here is the complete qemu command line for the src VM:
> >> 
> >> /usr/bin/qemu-system-x86_64 -machine accel=kvm \
> >>   -drive file=/var/lib/kvm/testbackup.polldev.com.img,if=virtio \
> >>   -m 2048 -smp 4,cores=4,sockets=1,threads=1 \
> >>   -net nic,macaddr=52:54:98:00:00:00,model=virtio \
> >>   -net tap,script=/etc/qemu-ifup-br2,downscript=no \
> >>   -curses -name "testbackup.polldev.com",process=testbackup.polldev.com \
> >>   -monitor unix:/var/lib/kvm/monitor/testbackup,server,nowait
> >> 
> >> The dest VM:
> >> 
> >> /usr/bin/qemu-system-x86_64 -machine accel=kvm \
> >>   -drive file=/backup/testbackup.polldev.com.img.bak20140129,if=virtio \
> >>   -m 2048 -smp 4,cores=4,sockets=1,threads=1 \
> >>   -net nic,macaddr=52:54:98:00:00:00,model=virtio \
> >>   -net tap,script=no,downscript=no \
> >>   -curses -name "testbackup.polldev.com",process=testbackup.polldev.com \
> >>   -monitor unix:/var/lib/kvm/monitor/testbackup.bak,server,nowait \
> >>   -incoming tcp:0:4444
> >> 
> >> The migration is performed like so:
> >> 
> >> echo "migrate -b tcp:localhost:4444" | socat STDIO UNIX-
> >> CONNECT:/var/lib/kvm/monitor/testbackup
> >> echo "migrate_set_speed 1G" | socat STDIO UNIX-
> >> CONNECT:/var/lib/kvm/monitor/testbackup
> >> #wait
> >> echo cont | socat STDIO UNIX-CONNECT:/var/lib/kvm/monitor/testbackup
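
(The #wait step just needs to block until the migration has finished; one way
to do that with the same monitor socket is to poll "info migrate", roughly:

    while ! echo "info migrate" | \
            socat STDIO UNIX-CONNECT:/var/lib/kvm/monitor/testbackup | \
            grep -q "Migration status: completed"; do
        sleep 1
    done

The exact mechanism doesn't matter for reproducing the problem; any way of
waiting for the migration to complete works.)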
> >> 
> >> The guest in question is a minimal install of CentOS 6.5.
> >> 
> >> I have observed this issue across the following qemu versions:
> >> 
> >> qemu 1.4.2
> >> qemu 1.6.0
> >> qemu 1.6.1
> >> qemu 1.7.0
> >> 
> >> I also attempted to test qemu 1.5.3, but live migration flat out crashed
> >> there (totally different issue).
> >> 
> >> I have also tested a number of other scenarios with qemu 1.6.0, all of
> >> which exhibit the same failure mode:
> >> 
> >> qemu 1.6.0 + host kernel 3.1.0
> >> qemu 1.6.0 + host kernel 3.10.7
> >> qemu 1.6.0 + host kernel 3.10.17
> >> qemu 1.6.0 + virtio with -netdev/-device syntax
> >> qemu 1.6.0 + accel=tcg
> >> 
> >> The one case I have found that works properly is the following:
> >> 
> >> qemu 1.6.0 + e1000
> >> 
> >> It is worth noting that when the virtio-net device ceases to function in
> >> the guest, removing and reinserting the virtio-net kernel module results
> >> in the device working again (except in 1.4.2, where this had no effect).
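
(For anyone trying the workaround: "removing and reinserting" here means
something along these lines inside the guest, with the interface brought back
up afterwards; the module name is virtio_net on the CentOS guest:

    # unload and reload the guest's virtio-net driver
    modprobe -r virtio_net
    modprobe virtio_net
    # bring eth0 back up afterwards
    ifup eth0

After reloading, the device passes traffic normally again, as noted above.)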
> >> 
> >> As mentioned above, I can reproduce this with minimal effort, and am
> >> willing to test out any patches or provide further details as necessary.
> >> 
> >> - Neil
> > 
> > OK, I was able to narrow this down to somewhere between 1.2.2 (or
> > rather, 1.2.0) and 1.3.0. Migration in 1.3.0 is broken; however, I was
> > able to cherry-pick d7cd369, d5f1f28, and 9ee0cb2 on top of 1.3.0 to fix
> > the unrelated migration bug and confirm that the bug from this thread is
> > still present in 1.3.0.
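
(For reference, that baseline was built roughly as follows; the branch name is
arbitrary:

    git checkout -b v1.3.0-migration-fix v1.3.0
    git cherry-pick d7cd369 d5f1f28 9ee0cb2

i.e. plain 1.3.0 plus just those three commits.)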
> > 
> > I started a git bisect on 1.2.2..1.3.0 but didn't get very far before
> > running into several unrelated bugs which kept migration from working.
> > 
> > I also tested out the latest master code (d844a7b) and it fails in the
> > same way as 1.7.0.
> > 
> > - Neil
> 
> Hi, have you tried to ping from the VM to another host after the migration?

Yes, pings from the VM to anywhere result in "Destination Host Unreachable";
it's not the usual "MAC address moved" problem seen with migration. Note that
the problem occurs on the *source* VM, not the destination VM; the destination
VM is intentionally configured with an unconnected network interface
(script=no).

Also, I had a closer look at the source VM's state after the network stops
working. If I initiate a ping from inside the VM, tcpdump shows the ARP
traffic (both the request and the response) on the host's corresponding tap
and bridge adaptors; however, tcpdump from inside the guest sees neither.
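
For reference, the comparison was roughly of the following form; tap0 is the
tap shown below, eth0 is the guest's interface, and the bridge name here is
taken from the ifup script (the exact capture filter isn't important):

    # on the host
    tcpdump -ni tap0 arp or icmp
    tcpdump -ni br2 arp or icmp
    # inside the guest
    tcpdump -ni eth0 arp or icmp

The ARP reply shows up in both host captures but never in the guest one.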

Inside the guest, the TX count on eth0 is increasing, but the RX count is not
moving. On the host, the RX count on the tap is increasing, but the TX count
is not. Meanwhile, the TX dropped count on the tap is rising rapidly:

tap0      Link encap:Ethernet  HWaddr fe:19:99:0a:9b:07
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:452 errors:0 dropped:0 overruns:0 frame:0
          TX packets:62626 errors:0 dropped:954919 overruns:0 carrier:0
          collisions:0 txqueuelen:500
          RX bytes:29104 (28.4 KiB)  TX bytes:8726592 (8.3 MiB)
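
(The output above is from ifconfig; the same counters can be watched directly
from sysfs if anyone wants to script this on the host, e.g.:

    for f in rx_packets tx_packets tx_dropped; do
        echo "$f: $(cat /sys/class/net/tap0/statistics/$f)"
    done

tx_dropped keeps climbing for as long as the guest is in this state.)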

If I try to ping the guest from an external host, I can see the ICMP request
reach the tap adaptor on the host, but there is never a response, and nothing
shows up inside the guest.

It seems like the TX side is working properly. Is it possible that the RX
side of the virtio-net adaptor is in a confused state and is therefore
dropping packets?

- Neil


