From: Denis V. Lunev
Subject: Re: [Qemu-devel] sda abort with virtio-scsi
Date: Thu, 4 Feb 2016 14:00:24 +0300
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:38.0) Gecko/20100101 Thunderbird/38.5.1
On 02/04/2016 01:23 PM, Paolo Bonzini wrote:
> On 04/02/2016 00:34, Jim Minter wrote:
>> I was worried there was some way in which the contention could cause an abort and perhaps thence the lockup (which does not seem to recover when the host load goes down).
>
> I don't know... It's not the most tested code, but it is not very complicated either. The certain points that can be extracted from the kernel messages are:
> 1) there was a cancellation request that took a long time, >20 seconds;
> 2) despite taking a long time, it _did_ recover sooner or later, because otherwise you'd not have the lockup splat either.
>
> Paolo
>
>>> Firing the NMI watchdog is fixed in more recent QEMU, which has asynchronous cancellation, assuming you're running RHEL's QEMU 1.5.3 (try /usr/libexec/qemu-kvm --version, or rpm -qf /usr/libexec/qemu-kvm).
>>
>> /usr/libexec/qemu-kvm --version reports QEMU emulator version 1.5.3 (qemu-kvm-1.5.3-105.el7_2.3)
My $0.02 to the account. This could be related, or it could be not related. Speaking about NMIs, we observe similar problems on older AMD hosts with a big enough number of VCPUs in the guest. On a simple boot we see something like the following (the probability of hitting the problem is around 1/10). RHEV 2.3 QEMU is used, and the same kernel is running on the host node:

[   72.189005] Kernel panic - not syncing: softlockup: hung tasks
[   72.189005] CPU: 5 PID: 593 Comm: systemd-udevd Tainted: G             L ------------   3.10.0-327.4.4.el7.x86_64 #1
[   72.189005] Hardware name: Red Hat KVM, BIOS seabios-1.7.5-11.vz7.2 04/01/2014
[   72.189005]  ffffffff81871a03 0000000050291887 ffff88007fd43e18 ffffffff8163515c
[   72.189005]  ffff88007fd43e98 ffffffff8162e9d7 0000000000000008 ffff88007fd43ea8
[   72.189005]  ffff88007fd43e48 0000000050291887 0000000000002710 0000000000000000
[   72.189005] Call Trace:
[   72.189005]  <IRQ>  [<ffffffff8163515c>] dump_stack+0x19/0x1b
[   72.189005]  [<ffffffff8162e9d7>] panic+0xd8/0x1e7
[   72.189005]  [<ffffffff8111b376>] watchdog_timer_fn+0x1b6/0x1c0
[   72.189005]  [<ffffffff8111b1c0>] ? watchdog_enable+0xc0/0xc0
[   72.189005]  [<ffffffff810a9d42>] __hrtimer_run_queues+0xd2/0x260
[   72.189005]  [<ffffffff810aa2e0>] hrtimer_interrupt+0xb0/0x1e0
[   72.189005]  [<ffffffff816471dc>] ? call_softirq+0x1c/0x30
[   72.189005]  [<ffffffff81065d40>] ? flush_tlb_func+0xb0/0xb0
[   72.189005]  [<ffffffff81049537>] local_apic_timer_interrupt+0x37/0x60
[   72.189005]  [<ffffffff81647e4f>] smp_apic_timer_interrupt+0x3f/0x60
[   72.189005]  [<ffffffff8164651d>] apic_timer_interrupt+0x6d/0x80
[   72.189005]  <EOI>  [<ffffffff812f27e9>] ? free_cpumask_var+0x9/0x10
[   72.189005]  [<ffffffff810e6de2>] ? smp_call_function_many+0x202/0x260
[   72.189005]  [<ffffffff81065d40>] ? flush_tlb_func+0xb0/0xb0
[   72.189005]  [<ffffffff810e6e9d>] on_each_cpu+0x2d/0x60
[   72.189005]  [<ffffffff81066119>] flush_tlb_kernel_range+0x59/0xa0
[   72.189005]  [<ffffffff811a4730>] __purge_vmap_area_lazy+0x1a0/0x210
[   72.189005]  [<ffffffff811a4927>] vm_unmap_aliases+0x187/0x1b0
[   72.189005]  [<ffffffff810620e8>] change_page_attr_set_clr+0xe8/0x4d0
[   72.189005]  [<ffffffff81131127>] ? ring_buffer_time_stamp+0x7/0x10
[   72.189005]  [<ffffffff8106299f>] set_memory_ro+0x2f/0x40
[   72.189005]  [<ffffffff8162f7c4>] set_section_ro_nx+0x3a/0x71
[   72.189005]  [<ffffffff810ed19c>] load_module+0x103c/0x1b50
[   72.189005]  [<ffffffff810e9743>] ? copy_module_from_fd.isra.42+0x53/0x150
[   72.189005]  [<ffffffff810ede66>] SyS_finit_module+0xa6/0xd0
[   72.189005]  [<ffffffff816458c9>] system_call_fastpath+0x16/0x1b
[   72.189005] Shutting down cpus with NMI

Changing the timesource does not help. Also, there is strange stuff like this: having configured the guest serial port to point at a unix socket, and timestamping the messages on the host side, we observe

guest timestamp | host timestamp
      0.000000  | 20:09:26.805461
      2.587056  | 20:09:30.000993
      7.607329  | 20:09:35.062367
     12.645539  | 20:09:40.057634
     22.608054  | 20:09:50.028727
     32.395499  | 20:10:00.041215
     42.571265  | 20:10:10.041960
     47.606661  | 20:10:15.028973
     48.627059  | 20:10:20.022359
     49.029059  | 20:10:25.047857
     49.399065  | 20:10:30.066884
     49.809077  | 20:10:35.036467
     58.159132  | 20:10:40.013387
     68.043371  | 20:10:40.266714

Note the anomaly around 47 seconds from boot: the guest clock advances by only ~2 seconds while ~20 seconds of host time pass. Anyway, this story is far from its end and we are unable to provide more details yet. Roma is digging into it.

Den
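The host-side timestamping described above can be sketched roughly like this. This is a minimal illustration, not the script actually used; the socket path in the usage comment and the newline framing of serial messages are assumptions:

```python
import socket
from datetime import datetime

def timestamp_lines(conn, out):
    """Read newline-terminated messages from a connected socket and
    prefix each with the host-side wall-clock time at receipt."""
    buf = b""
    while True:
        chunk = conn.recv(4096)
        if not chunk:  # peer closed the socket
            break
        buf += chunk
        while b"\n" in buf:
            line, buf = buf.split(b"\n", 1)
            stamp = datetime.now().strftime("%H:%M:%S.%f")
            out.append(f"{stamp} | {line.decode(errors='replace')}")

# Usage sketch: connect to the unix socket the guest serial port is
# bound to, e.g. qemu ... -serial unix:/tmp/guest-serial.sock,server,nowait
# (the path is hypothetical, for illustration only):
#
#   s = socket.socket(socket.AF_UNIX)
#   s.connect("/tmp/guest-serial.sock")
#   lines = []
#   timestamp_lines(s, lines)
```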
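To make the anomaly explicit, here is a small sketch (data transcribed from the table above) that computes how many guest seconds elapse per host second in each interval; a healthy guest clock should stay close to 1.0:

```python
from datetime import datetime

def drift_ratios(pairs):
    """For each consecutive (guest_ts, host_ts) pair, return the number
    of guest seconds elapsed per host second."""
    out = []
    for (g0, h0), (g1, h1) in zip(pairs, pairs[1:]):
        host_dt = (h1 - h0).total_seconds()
        out.append((g1 - g0) / host_dt)
    return out

# Data transcribed from the serial-port log above.
raw = [
    (0.000000, "20:09:26.805461"), (2.587056, "20:09:30.000993"),
    (7.607329, "20:09:35.062367"), (12.645539, "20:09:40.057634"),
    (22.608054, "20:09:50.028727"), (32.395499, "20:10:00.041215"),
    (42.571265, "20:10:10.041960"), (47.606661, "20:10:15.028973"),
    (48.627059, "20:10:20.022359"), (49.029059, "20:10:25.047857"),
    (49.399065, "20:10:30.066884"), (49.809077, "20:10:35.036467"),
    (58.159132, "20:10:40.013387"), (68.043371, "20:10:40.266714"),
]
pairs = [(g, datetime.strptime(h, "%H:%M:%S.%f")) for g, h in raw]

for i, r in enumerate(drift_ratios(pairs)):
    flag = "  <-- guest clock stalled" if r < 0.5 else ""
    print(f"interval {i}: guest advanced {r:.2f}s per host second{flag}")
```

Running this flags the four intervals between guest times 47.6s and 49.8s (the guest clock crawls at well under a tenth of real time), followed by a burst where the guest clock catches up almost 10 seconds in a quarter second of host time.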