Re: [Qemu-devel] sda abort with virtio-scsi


From: Denis V. Lunev
Subject: Re: [Qemu-devel] sda abort with virtio-scsi
Date: Thu, 4 Feb 2016 14:00:24 +0300
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:38.0) Gecko/20100101 Thunderbird/38.5.1

On 02/04/2016 01:23 PM, Paolo Bonzini wrote:

On 04/02/2016 00:34, Jim Minter wrote:
I was worried there was
some way in which the contention could cause an abort and perhaps thence
the lockup (which does not seem to recover when the host load goes down).
I don't know... It's not the most tested code, but it is not very
complicated either.

The points that can be established with certainty from the kernel messages are:
1) there was a cancellation request that took a long time, >20 seconds;
2) despite taking a long time, it _did_ recover sooner or later, because
otherwise you would not have gotten the lockup splat either.

Paolo

Firing the NMI watchdog is fixed in more recent QEMU, which has
asynchronous cancellation, assuming you're running RHEL's QEMU 1.5.3
(try /usr/libexec/qemu-kvm --version, or rpm -qf /usr/libexec/qemu-kvm).
/usr/libexec/qemu-kvm --version reports QEMU emulator version 1.5.3
(qemu-kvm-1.5.3-105.el7_2.3)

My $0.02: this could be related, or it could be unrelated.

Speaking of NMIs, we do observe similar problems on older
AMD hosts when the guest has a large enough number of vCPUs.
On a simple boot we see something like the following (the probability of
hitting the problem is around 1/10). RHEV 2.3 QEMU is used, and the same
kernel is running on the host node:

[   72.189005] Kernel panic - not syncing: softlockup: hung tasks
[   72.189005] CPU: 5 PID: 593 Comm: systemd-udevd Tainted: G             L ------------   3.10.0-327.4.4.el7.x86_64 #1
[   72.189005] Hardware name: Red Hat KVM, BIOS seabios-1.7.5-11.vz7.2 04/01/2014
[   72.189005]  ffffffff81871a03 0000000050291887 ffff88007fd43e18 ffffffff8163515c
[   72.189005]  ffff88007fd43e98 ffffffff8162e9d7 0000000000000008 ffff88007fd43ea8
[   72.189005]  ffff88007fd43e48 0000000050291887 0000000000002710 0000000000000000
[   72.189005] Call Trace:
[   72.189005]  <IRQ>  [<ffffffff8163515c>] dump_stack+0x19/0x1b
[   72.189005]  [<ffffffff8162e9d7>] panic+0xd8/0x1e7
[   72.189005]  [<ffffffff8111b376>] watchdog_timer_fn+0x1b6/0x1c0
[   72.189005]  [<ffffffff8111b1c0>] ? watchdog_enable+0xc0/0xc0
[   72.189005]  [<ffffffff810a9d42>] __hrtimer_run_queues+0xd2/0x260
[   72.189005]  [<ffffffff810aa2e0>] hrtimer_interrupt+0xb0/0x1e0
[   72.189005]  [<ffffffff816471dc>] ? call_softirq+0x1c/0x30
[   72.189005]  [<ffffffff81065d40>] ? flush_tlb_func+0xb0/0xb0
[   72.189005]  [<ffffffff81049537>] local_apic_timer_interrupt+0x37/0x60
[   72.189005]  [<ffffffff81647e4f>] smp_apic_timer_interrupt+0x3f/0x60
[   72.189005]  [<ffffffff8164651d>] apic_timer_interrupt+0x6d/0x80
[   72.189005]  <EOI>  [<ffffffff812f27e9>] ? free_cpumask_var+0x9/0x10
[   72.189005]  [<ffffffff810e6de2>] ? smp_call_function_many+0x202/0x260
[   72.189005]  [<ffffffff81065d40>] ? flush_tlb_func+0xb0/0xb0
[   72.189005]  [<ffffffff810e6e9d>] on_each_cpu+0x2d/0x60
[   72.189005]  [<ffffffff81066119>] flush_tlb_kernel_range+0x59/0xa0
[   72.189005]  [<ffffffff811a4730>] __purge_vmap_area_lazy+0x1a0/0x210
[   72.189005]  [<ffffffff811a4927>] vm_unmap_aliases+0x187/0x1b0
[   72.189005]  [<ffffffff810620e8>] change_page_attr_set_clr+0xe8/0x4d0
[   72.189005]  [<ffffffff81131127>] ? ring_buffer_time_stamp+0x7/0x10
[   72.189005]  [<ffffffff8106299f>] set_memory_ro+0x2f/0x40
[   72.189005]  [<ffffffff8162f7c4>] set_section_ro_nx+0x3a/0x71
[   72.189005]  [<ffffffff810ed19c>] load_module+0x103c/0x1b50
[   72.189005]  [<ffffffff810e9743>] ? copy_module_from_fd.isra.42+0x53/0x150
[   72.189005]  [<ffffffff810ede66>] SyS_finit_module+0xa6/0xd0
[   72.189005]  [<ffffffff816458c9>] system_call_fastpath+0x16/0x1b
[   72.189005] Shutting down cpus with NMI


Changing the time source does not help. There is also strange behaviour like this:

Having configured the guest serial port to point at a unix socket and timestamping
the messages on the host side, we observe the following:

guest timestamp | host timestamp
       0.000000 | 20:09:26.805461
       2.587056 | 20:09:30.000993
       7.607329 | 20:09:35.062367
      12.645539 | 20:09:40.057634
      22.608054 | 20:09:50.028727
      32.395499 | 20:10:00.041215
      42.571265 | 20:10:10.041960
      47.606661 | 20:10:15.028973
      48.627059 | 20:10:20.022359
      49.029059 | 20:10:25.047857
      49.399065 | 20:10:30.066884
      49.809077 | 20:10:35.036467
      58.159132 | 20:10:40.013387
      68.043371 | 20:10:40.266714

Note the anomaly around 47 seconds from boot: between guest timestamps 47.6 and
49.8 (about two seconds of guest time), the host clock advances by roughly 20 seconds.

Anyway, this story is far from over and we are not yet able to provide more
details. Roma is digging into it. Den


