qemu-discuss

[Qemu-discuss] KVM guest gets aborted if blockcommit is called


From: Christian Rößner
Subject: [Qemu-discuss] KVM guest gets aborted if blockcommit is called
Date: Mon, 24 Aug 2015 22:45:16 +0200

Hello,

I have now spent five full days debugging a major problem with backing up VMs.
I run an HP ProLiant server (SE316M1-R2, aka DL160 G6) with two Xeon L5520 CPUs
and 48GB of triple-channel RAM. On this server I do monitoring and
Qemu/libvirt. I run 7 guests on this server, which itself runs Gentoo Linux
(hardened; Grsecurity-patched kernel, PaX, no RBAC).

All guests use raw images as disks (I also tested QED and QCOW2). The guest
systems are all Gentoo or Ubuntu, and all have qemu-guest-agent running.

app-emulation/libvirt-1.2.18-r1::gentoo was built with the following:
USE="caps fuse iscsi libvirtd lvm lxc macvtap nfs nls parted pcap qemu sasl 
systemd udev vepa -apparmor -audit -avahi -firewalld -glusterfs -numa -openvz 
-phyp -policykit -rbd (-selinux) -uml -virt-network -virtualbox 
(-wireshark-plugins) -xen"

app-emulation/qemu-2.4.0::gentoo was built with the following:
USE="aio caps curl fdt filecaps jpeg ncurses nls pin-upstream-blobs png python 
sasl seccomp spice ssh threads tls uuid vhost-net vnc xattr -accessibility 
-alsa -bluetooth -debug -glusterfs -gtk -gtk2 -infiniband -iscsi -lzo -nfs 
-numa -opengl -pulseaudio -rbd -sdl -sdl2 (-selinux) -smartcard -snappy -static 
-static-softmmu -static-user -systemtap -tci -test -usb -usbredir -vde -virtfs 
-vte -xen -xfs" PYTHON_TARGETS="python2_7" QEMU_SOFTMMU_TARGETS="i386 x86_64 
-aarch64 (-alpha) (-arm) -cris -lm32 (-m68k) -microblaze -microblazeel (-mips) 
-mips64 -mips64el -mipsel -moxie -or32 (-ppc) (-ppc64) -ppcemb -s390x -sh4 
-sh4eb (-sparc) -sparc64 -unicore32 -xtensa -xtensaeb" QEMU_USER_TARGETS="i386 
x86_64 -aarch64 (-alpha) (-arm) -armeb -cris (-m68k) -microblaze -microblazeel 
(-mips) -mips64 -mips64el -mipsel -mipsn32 -mipsn32el -or32 (-ppc) (-ppc64) 
-ppc64abi32 -s390x -sh4 -sh4eb (-sparc) -sparc32plus -sparc64 -unicore32"

I wrote a bash script that is supposed to back up all guests. It works like this:

1. Create external snapshot
2. Copy/rsync away the image
3. blockcommit snapshot
4. blockjob pivot
5. Copy/rsync away the XML description for the guest
6. Remove Snapshot file
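
For readers who want to reproduce the cycle, the six steps above might look
roughly like the following. This is only a sketch, not the original script:
the backup target, the disk target name "vda", the snapshot name and the
location of the domain XML are assumptions, and the run() helper is
hypothetical. It prints the commands by default (DRY_RUN=1) instead of
executing them.

```shell
#!/usr/bin/env bash
# Sketch of the six-step backup cycle. Paths and names are assumptions,
# not taken from the original script.
set -eu

DRY_RUN="${DRY_RUN:-1}"               # 1 = only print the commands
IMAGE_DIR="/var/lib/libvirt/images"   # where the guest images live
SNAP_DIR="/var/backups/snapshots"     # where the snapshot overlays go
XML_DIR="/etc/libvirt/qemu"           # assumed location of the domain XML
DEST="${DEST:-/mnt/backup}"           # hypothetical backup target
DISK="${DISK:-vda}"                   # assumed disk target name in the guest

# Hypothetical helper: print the command in dry-run mode, else execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

backup_guest() {
    local dom="$1"
    local snap="$SNAP_DIR/backup-snapshot-$dom.qcow2"

    # 1. Create an external, disk-only snapshot (quiesced via the guest agent).
    run virsh snapshot-create-as "$dom" backup-snapshot \
        --disk-only --atomic --quiesce --no-metadata \
        --diskspec "$DISK,file=$snap"
    # 2. Copy the base image away; it receives no guest writes while the overlay is active.
    run rsync -a --sparse "$IMAGE_DIR/$dom.img" "$DEST/"
    # 3. Commit the overlay back into the base image and wait for the sync.
    run virsh blockcommit "$dom" "$DISK" --active --wait --verbose
    # 4. Pivot the guest back onto the base image.
    run virsh blockjob "$dom" "$DISK" --pivot
    # 5. Copy the XML description of the guest.
    run rsync -a "$XML_DIR/$dom.xml" "$DEST/"
    # 6. Remove the snapshot file.
    run rm -f -- "$snap"
}
```

Calling "backup_guest mx.roessner-net.de-TESTING" with DRY_RUN=1 only lists the
commands; with DRY_RUN=0 it would run them against the live guest.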

I did some tests, running the script from a cron job. I found that copying the
image file takes roughly 15 minutes, so I set up a 30-minute cycle for the
script.

Four or five cycles work perfectly. Steps (1) and (2) succeed, but when it
comes to blockcommit, the guest is sometimes (randomly) aborted, and the
command fails to continue because the guest is no longer running. After
starting the guest again, I found two situations:

1. I can directly call blockjob … --pivot, because the last blockcommit that
failed had reached 100%, or
2. I have to run a blockjob abort, then re-sync and pivot on the command line,
which might work.
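
The two recovery paths above could be sketched as follows. Again this is only
an illustration under assumptions: the function name, the "synced" argument
and the run() helper are hypothetical, and the commands are printed rather
than executed unless DRY_RUN=0 is set.

```shell
#!/usr/bin/env bash
# Sketch of the manual recovery after the guest has been restarted.
set -eu

DRY_RUN="${DRY_RUN:-1}"   # 1 = only print the commands

# Hypothetical helper: print the command in dry-run mode, else execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# recover_commit DOMAIN DISK SYNCED
#   SYNCED=yes: the failed blockcommit had already reached 100%.
#   anything else: abort the old job and re-sync before pivoting.
recover_commit() {
    local dom="$1" disk="$2" synced="$3"
    if [ "$synced" != "yes" ]; then
        run virsh blockjob "$dom" "$disk" --abort
        run virsh blockcommit "$dom" "$disk" --active --wait --verbose
    fi
    run virsh blockjob "$dom" "$disk" --pivot
}
```

For example, "recover_commit mx.roessner-net.de-TESTING vda yes" corresponds
to situation 1, and passing "no" corresponds to situation 2.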

Anyway, blockcommit is not stable here. I tested this on qemu-2.3.0 and 2.4.0.

In the logs I only get this:

…
2015-08-24 18:38:13.077+0000: starting up libvirt version: 1.2.18, qemu 
version: 2.4.0
LC_ALL=C PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin 
QEMU_AUDIO_DRV=none /usr/bin/qemu-system-x86_64 -name 
mx.roessner-net.de-TESTING -S -machine pc-i440fx-2.1,accel=kvm,usb=off -cpu 
qemu64,+kvm_pv_eoi -m 4096 -realtime mlock=off -smp 
4,sockets=4,cores=1,threads=1 -uuid d86b82d5-153f-4dd9-aa66-d98c2e65db8c 
-no-user-config -nodefaults -device sga -chardev 
socket,id=charmonitor,path=/var/lib/libvirt/qemu/mx.roessner-net.de-TESTING.monitor,server,nowait
 -mon chardev=charmonitor,id=monitor,mode=control -rtc base=utc,driftfix=slew 
-global kvm-pit.lost_tick_policy=discard -no-shutdown -boot 
order=cd,menu=on,strict=on -device piix3-usb-uhci,id=usb,bus=pci.0,addr=0x1.0x2 
-device virtio-serial-pci,id=virtio-serial0,bus=pci.0,addr=0x8 -drive 
file=/var/lib/libvirt/images/mx.roessner-net.de-TESTING.img,if=none,id=drive-virtio-disk0,format=raw,cache=writeback
 -device 
virtio-blk-pci,scsi=off,bus=pci.0,addr=0x4,drive=drive-virtio-disk0,id=virtio-disk0
 -drive if=none,id=drive-ide0-1-0,readonly=on,format=raw -device 
ide-cd,bus=ide.1,unit=0,drive=drive-ide0-1-0,id=ide0-1-0 -netdev 
tap,fd=34,id=hostnet0,vhost=on,vhostfd=35 -device 
virtio-net-pci,netdev=hostnet0,id=net0,mac=54:52:00:27:ac:8d,bus=pci.0,addr=0x3 
-chardev pty,id=charserial0 -device isa-serial,chardev=charserial0,id=serial0 
-chardev 
socket,id=charchannel0,path=/var/lib/libvirt/qemu/channel/target/mx.roessner-net.de-TESTING.org.qemu.guest_agent.0,server,nowait
 -device 
virtserialport,bus=virtio-serial0.0,nr=1,chardev=charchannel0,id=channel0,name=org.qemu.guest_agent.0
 -vnc 127.0.0.1:7 -device cirrus-vga,id=video0,bus=pci.0,addr=0x2 -device 
i6300esb,id=watchdog0,bus=pci.0,addr=0x7 -watchdog-action reset -device 
virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x5 -object 
rng-random,id=objrng0,filename=/dev/random -device 
virtio-rng-pci,rng=objrng0,id=rng0,bus=pci.0,addr=0x6 -msg timestamp=on
char device redirected to /dev/pts/8 (label charserial0)
Formatting 
'/var/backups/snapshots/backup-snapshot-mx.roessner-net.de-TESTING.qcow2', 
fmt=qcow2 size=107374182400 
backing_file='/var/lib/libvirt/images/mx.roessner-net.de-TESTING.img' 
backing_fmt='raw' encryption=off cluster_size=65536 lazy_refcounts=off 
refcount_bits=16
Formatting 
'/var/backups/snapshots/backup-snapshot-mx.roessner-net.de-TESTING.qcow2', 
fmt=qcow2 size=107374182400 
backing_file='/var/lib/libvirt/images/mx.roessner-net.de-TESTING.img' 
backing_fmt='raw' encryption=off cluster_size=65536 lazy_refcounts=off 
refcount_bits=16
Co-routine re-entered recursively
2015-08-24 19:43:17.700+0000: shutting down

I tried to find out what the error "Co-routine re-entered recursively" means,
but I have no idea. I only know that it comes from qemu-coroutine.c, line 111.
But what causes this error? What am I missing?

I tried different Linux kernels: pure vanilla sources with NUMA balancing on
and off, and several Grsecurity kernels. The kernel makes no difference. The
Qemu version makes no difference. If I clean memory, I have roughly 36GB of
free memory. Storage is also fine, because it is a BBU-backed P410i RAID
controller with RAID1+0 on 15k SAS disks. Even though this server is 6 years
old, it has enough power. So I don't think it is a resource or hardware
problem. Everything else on the server runs perfectly without any issues.

So if you have any idea what could cause these aborts, please let me know :-)

The only thing I found on the web is someone saying that this co-routine code
is ugly and probably not thread-safe. I don't remember where I found this
message. But could this be a threading problem?

Many, many thanks in advance

Christian
