[Qemu-devel] vfio-pci freezes host


From: Harald Braumann
Subject: [Qemu-devel] vfio-pci freezes host
Date: Sat, 9 Nov 2013 02:33:59 +0100
User-agent: Mutt/1.5.21 (2010-09-15)

(please CC as I'm not subscribed)

Hi,

I'm passing through a GPU using vfio-pci. This regularly freezes the
host completely. I'm hoping the attached files give some clue as to
what the problem might be.

Specs:
Chipset: AMD 990FX
Kernel: 3.12.0
QEMU: latest as of today (commit 964668b03d26f0b5baa5e5aff0c966f4fcb76e9e)
GPU:
06:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Juniper XT [Radeon HD 5770]
06:00.1 Audio device: Advanced Micro Devices, Inc. [AMD/ATI] Juniper HDMI Audio [Radeon HD 5700 Series]

QEMU command line:
/home/harry/dev/kvm-gpu-passthrough/qemu/x86_64-softmmu/qemu-system-x86_64 \
-runas spielzeug \
-monitor unix:monitor,server,nowait \
-L /home/harry/dev/kvm-gpu-passthrough/qemu/pc-bios \
-drive file=spielzeug_tmp.qcow2,if=virtio,cache=none,media=disk \
-boot order=c \
-smp 4 \
-cpu host  \
-m 4096M  \
-net nic,model=virtio,macaddr=52:54:00:12:34:57  \
-net tap,ifname=tap0,script=no,downscript=no \
-localtime  \
-enable-kvm  \
-M q35  \
-vga none \
-nographic \
-device ioh3420,bus=pcie.0,addr=1c.0,multifunction=on,port=1,chassis=1,id=root.1,romfile=radeon-hd-5770.rom \
-device vfio-pci,host=0000:06:00.0,bus=root.1,addr=00.0,multifunction=on,x-vga=on \
-device vfio-pci,host=0000:06:00.1,bus=root.1,addr=00.1 \
-usbdevice tablet
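
Because vfio works on whole IOMMU groups, it may help to see which devices
share a group with 0000:06:00.0 and which host driver each one is bound to.
The following is only a sketch: the function name iommu_group_peers is made
up for illustration, and it assumes the usual sysfs layout
(/sys/bus/pci/devices/<addr>/iommu_group/devices). The optional second
argument is an alternative sysfs root, which is an addition purely for testing.

```shell
# iommu_group_peers ADDR [SYSFS_ROOT]
# Hypothetical helper: list every device in ADDR's IOMMU group together with
# the name of the host driver it is bound to ("none" if unbound).
iommu_group_peers() {
    dev=$1
    root=${2:-/sys}
    group="$root/bus/pci/devices/$dev/iommu_group/devices"
    if [ ! -d "$group" ]; then
        echo "no IOMMU group found for $dev" >&2
        return 1
    fi
    for peer in "$group"/*; do
        name=${peer##*/}
        # The "driver" entry is a symlink to the bound driver, if any.
        drv=$(readlink "$root/bus/pci/devices/$name/driver") || drv=none
        echo "$name ${drv##*/}"
    done
}
```

It would be called as "iommu_group_peers 0000:06:00.0". Any device listed
with a driver other than vfio-pci or pci-stub would be worth a closer look.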

QEMU starts up and after a few seconds the host freezes completely.
Sometimes I'm still able to get some dmesg output or a kernel
panic. In those cases it is always some other PCI device that
reports an error.

Example:
[  179.998189] ------------[ cut here ]------------
[  179.998211] WARNING: CPU: 3 PID: 0 at net/sched/sch_generic.c:264 dev_watchdog+0xd9/0x13f()
[  179.998228] NETDEV WATCHDOG: eth0 (r8169): transmit queue 0 timed out
[  179.998229] Modules linked in: tun vfio_pci vfio_iommu_type1 vfio vboxpci(O) 
vboxnetadp(O) binfmt_misc vboxnetflt(O) vboxdrv(O) deflate ctr twofish_generic 
twofish_avx_x86_64 twofish_x86_64_3way twofish_x86_64 twofish_common 
camellia_generic camellia_aesni_avx_x86_64 camellia_x86_64 serpent_avx_x86_64 
serpent_sse2_x86_64 xts serpent_generic blowfish_generic blowfish_x86_64 
blowfish_common cast5_avx_x86_64 cast5_generic cast_common des_generic cbc cmac 
xcbc rmd160 sha512_ssse3 sha512_generic sha256_ssse3 sha256_generic crypto_null 
af_key xfrm_algo bridge stp llc iptable_mangle iptable_nat nf_conntrack_ipv4 
nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack iptable_filter ip_tables 
x_tables ext2 it87 hwmon_vid fuse joydev hid_generic radeon snd_hda_codec_hdmi 
usbhid snd_hda_codec_realtek hid snd_hda_intel snd_hda_codec snd_hwdep 
snd_pcm_oss ttm snd_mixer_oss drm_kms_helper kvm_amd kvm snd_pcm drm 
snd_page_alloc snd_seq_dummy snd_seq_midi snd_seq_oss snd_seq_midi_event 
snd_rawmidi sp5100_tco mxm_wmi agpgart snd_seq i2c_piix4 i2c_algo_bit i2c_core 
fam15h_power microcode pcspkr evdev snd_seq_device wmi k10temp snd_timer button 
snd processor soundcore edac_core ohci_pci thermal_sys ohci_hcd ext4 crc16 jbd2 
mbcache dm_crypt dm_mod md_mod pci_stub sg sr_mod cdrom sd_mod crc_t10dif 
crct10dif_pclmul crct10dif_common crc32_pclmul crc32c_intel ghash_clmulni_intel 
aesni_intel aes_x86_64 lrw gf128mul glue_helper ablk_helper cryptd 
firewire_ohci firewire_core crc_itu_t r8169 mii ehci_pci ehci_hcd xhci_hcd 
usbcore usb_common ahci libahci libata scsi_mod
[  179.998304] CPU: 3 PID: 0 Comm: swapper/3 Tainted: G           O 3.12.0-hb #1
[  179.998306] Hardware name: To be filled by O.E.M. To be filled by O.E.M./SABERTOOTH 990FX, BIOS 1208 04/18/2012
[  179.998313]  0000000000000000 ffffffff81390b45 ffff88024ecc3e30 ffffffff81036e55
[  179.998316]  ffffffff812efdbe ffff880241240000 ffff88024ecc3e80 ffffffff812efce5
[  179.998318]  ffff880241240348 ffffffff81036eb1 ffffffff81526cee 0000000000000030
[  179.998324] Call Trace:
[  179.998326]  <IRQ>  [<ffffffff81390b45>] ? dump_stack+0x41/0x51
[  179.998333]  [<ffffffff81036e55>] ? warn_slowpath_common+0x74/0x89
[  179.998336]  [<ffffffff812efdbe>] ? dev_watchdog+0xd9/0x13f
[  179.998338]  [<ffffffff812efce5>] ? dev_deactivate_queue+0x54/0x54
[  179.998340]  [<ffffffff81036eb1>] ? warn_slowpath_fmt+0x47/0x49
[  179.998341]  [<ffffffff812ef9e8>] ? netif_tx_lock+0x47/0x72
[  179.998345]  [<ffffffff812efdbe>] ? dev_watchdog+0xd9/0x13f
[  179.998347]  [<ffffffff8103fd35>] ? call_timer_fn+0x2d/0xdc
[  179.998350]  [<ffffffff81040677>] ? run_timer_softirq+0x18c/0x1b0
[  179.998351]  [<ffffffff812efce5>] ? dev_deactivate_queue+0x54/0x54
[  179.998353]  [<ffffffff8103a68a>] ? __do_softirq+0xc3/0x1df
[  179.998355]  [<ffffffff81396cdc>] ? call_softirq+0x1c/0x30
[  179.998357]  [<ffffffff8100422a>] ? do_softirq+0x2a/0x64
[  179.998359]  [<ffffffff8103a866>] ? irq_exit+0x3a/0x7a
[  179.998361]  [<ffffffff81024111>] ? smp_apic_timer_interrupt+0x2c/0x37
[  179.998363]  [<ffffffff8139620a>] ? apic_timer_interrupt+0x6a/0x70
[  179.998365]  <EOI>  [<ffffffff81077257>] ? clockevents_program_event+0x98/0xb4
[  179.998368]  [<ffffffff812af2f7>] ? cpuidle_enter_state+0x4d/0x9e
[  179.998376]  [<ffffffff812af421>] ? cpuidle_idle_call+0xd9/0x12e
[  179.998379]  [<ffffffff81009e72>] ? arch_cpu_idle+0x5/0x14
[  179.998382]  [<ffffffff8106c212>] ? cpu_startup_entry+0x102/0x152
[  179.998385]  [<ffffffff81022d2d>] ? start_secondary+0x1d9/0x1dd
[  179.998387] ---[ end trace 206ceb71b6aa3a0a ]---
[  180.023699] r8169 0000:09:00.0 eth0: link up
[  230.425196] kvm: zapping shadow pages for mmio generation wraparound
[  240.028560] SysRq : Emergency Sync
[  240.632371] Emergency Sync complete
[  244.008964] br0: port 2(tap0) entered disabled state

Another example:

[  165.586276] usb 9-3: USB disconnect, device number 2
[  165.596353] r8169 0000:09:00.0 eth0: rtl_chipcmd_cond == 1 (loop: 100, delay: 100).
[  165.597471] r8169 0000:09:00.0 eth0: link up
[  165.622236] r8169 0000:09:00.0 eth0: rtl_chipcmd_cond == 1 (loop: 100, delay: 100).
[  165.627619] r8169 0000:09:00.0 eth0: link down
[  165.627765] br0: port 1(eth0) entered disabled state
[  165.712495] ohci-pci 0000:00:13.0: leak ed ffff880243a010a0 (#81) state 0 (has tds)
[  165.712498] ohci-pci 0000:00:13.0: leak ed ffff880243a01050 (#82) state 0 (has tds)
[  166.205984] irq 20: nobody cared (try booting with the "irqpoll" option)
[  166.205988] CPU: 4 PID: 0 Comm: swapper/4 Tainted: G           O 3.12.0-hb #1
[  166.205989] Hardware name: To be filled by O.E.M. To be filled by O.E.M./SABERTOOTH 990FX, BIOS 1208 04/18/2012
[  166.205991]  0000000000000000 ffffffff81390b45 ffff880244c90d00 ffffffff8106e18c
[  166.205993]  ffff880244c90d00 0000000000000000 ffff880244c90d00 ffffffff8106e4ed
[  166.205995]  0000000000000000 0000000000000014 ffff880244c90d00 0000000000000000
[  166.205997] Call Trace:
[  166.205998]  <IRQ>  [<ffffffff81390b45>] ? dump_stack+0x41/0x51
[  166.206005]  [<ffffffff8106e18c>] ? __report_bad_irq+0x2c/0xb4
[  166.206008]  [<ffffffff8106e4ed>] ? note_interrupt+0x136/0x1b3
[  166.206010]  [<ffffffff8106c9af>] ? handle_irq_event_percpu+0x105/0x16c
[  166.206012]  [<ffffffff8106ca41>] ? handle_irq_event+0x2b/0x46
[  166.206014]  [<ffffffff8106ece9>] ? handle_fasteoi_irq+0x71/0xa1
[  166.206016]  [<ffffffff810041f8>] ? handle_irq+0x15/0x1d
[  166.206018]  [<ffffffff81003e8e>] ? do_IRQ+0x40/0x95
[  166.206020]  [<ffffffff81394e2a>] ? common_interrupt+0x6a/0x6a
[  166.206021]  <EOI>  [<ffffffff812af2f7>] ? cpuidle_enter_state+0x4d/0x9e
[  166.206040]  [<ffffffff812af421>] ? cpuidle_idle_call+0xd9/0x12e
[  166.206042]  [<ffffffff81009e72>] ? arch_cpu_idle+0x5/0x14
[  166.206044]  [<ffffffff8106c212>] ? cpu_startup_entry+0x102/0x152
[  166.206047]  [<ffffffff81022d2d>] ? start_secondary+0x1d9/0x1dd
[  166.206048] handlers:
[  166.206059] [<ffffffffa0093fa6>] usb_hcd_irq [usbcore]
[  166.206060] Disabling IRQ #20
[  304.730601] br0: port 2(tap0) entered disabled state
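
The "irq 20: nobody cared" trace lists only usb_hcd_irq as a handler, so it
may be useful to see everything registered on that line while the guest is
running. Below is a rough sketch for reading that out of /proc/interrupts
(or a saved copy of it). The name irq_users is made up, and the parsing
assumes the usual layout: a header row with one column per CPU, followed by
"IRQ:", the per-CPU counts, the controller type and the handler names.

```shell
# irq_users IRQ [FILE]
# Hypothetical helper: print the controller type and the handlers registered
# on the given IRQ line. FILE defaults to /proc/interrupts.
irq_users() {
    irq=$1
    file=${2:-/proc/interrupts}
    awk -v irq="$irq" '
        NR == 1 { ncpu = NF; next }            # header row: one column per CPU
        $1 == (irq ":") {
            out = ""
            # Skip the "IRQ:" label and the per-CPU counters.
            for (i = ncpu + 2; i <= NF; i++)
                out = out (out == "" ? "" : " ") $i
            print out
        }
    ' "$file"
}
```

For example, "irq_users 20" on the live host, or "irq_users 20 interrupts"
against a saved copy. If a vfio handler shows up next to the USB controller
there, the two devices share the interrupt line.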

Quite often the SATA controller reports an error (see ahci-error.jpg).

Another symptom was a flood of "[R600] flush TLB failed" messages in
dmesg for some time, after which the host froze.

Attached is a tgz with the following files:
- ahci-error.jpg
- dmesg
- interrupts: copy of /proc/interrupts
- pci-dump: produced with lspci -vvvxxx
- qemu-config.log: config.log from QEMU source
- vfio.log: output of QEMU with vfio debugging enabled
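
To get a quick overview of which PCI devices show up in a dmesg capture,
something along these lines can tally the addresses. This is only a sketch:
pci_mentions is a made-up name, it counts every line that names a device in
domain:bus:slot.function form (not just error lines), and it relies on
"grep -o", which GNU and BSD grep provide but POSIX does not require.

```shell
# pci_mentions FILE
# Hypothetical helper: count how often each PCI address (dddd:bb:ss.f)
# appears in FILE and print "address count", sorted by address.
pci_mentions() {
    grep -oE '[0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-7]' "$1" |
        sort | uniq -c | awk '{ print $2, $1 }'
}
```

Running "pci_mentions dmesg" on the extracted file prints one line per
device; the addresses can then be matched against pci-dump.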

Cheers,
harry

Attachment: vfio-freeze-dumps.tgz
Description: application/gtar-compressed