
From: Igor Mammedov
Subject: Re: [Qemu-devel] [BUG/RFC] Two cpus are not brought up normally in SLES11 sp3 VM after reboot
Date: Tue, 7 Jul 2015 14:21:32 +0200

On Tue, 7 Jul 2015 19:43:35 +0800
zhanghailiang <address@hidden> wrote:

> On 2015/7/7 19:23, Igor Mammedov wrote:
> > On Mon, 6 Jul 2015 17:59:10 +0800
> > zhanghailiang <address@hidden> wrote:
> >
> >> On 2015/7/6 16:45, Paolo Bonzini wrote:
> >>>
> >>>
> >>> On 06/07/2015 09:54, zhanghailiang wrote:
> >>>>
> >>>> From the host, we found that the QEMU vcpu1 and vcpu7 threads were not
> >>>> consuming any CPU (they should be in the idle state).
> >>>> All of the vCPUs' stacks on the host look like the following:
> >>>>
> >>>> [<ffffffffa07089b5>] kvm_vcpu_block+0x65/0xa0 [kvm]
> >>>> [<ffffffffa071c7c1>] __vcpu_run+0xd1/0x260 [kvm]
> >>>> [<ffffffffa071d508>] kvm_arch_vcpu_ioctl_run+0x68/0x1a0 [kvm]
> >>>> [<ffffffffa0709cee>] kvm_vcpu_ioctl+0x38e/0x580 [kvm]
> >>>> [<ffffffff8116be8b>] do_vfs_ioctl+0x8b/0x3b0
> >>>> [<ffffffff8116c251>] sys_ioctl+0xa1/0xb0
> >>>> [<ffffffff81468092>] system_call_fastpath+0x16/0x1b
> >>>> [<00002ab9fe1f99a7>] 0x2ab9fe1f99a7
> >>>> [<ffffffffffffffff>] 0xffffffffffffffff
> >>>>
> >>>> We looked into the kernel code that could lead to the above 'Stuck'
> >>>> warning,
> > in current upstream there isn't any printk(...Stuck...) left since that 
> > code path
> > has been reworked.
> > I've often seen this on an over-committed host during guest CPU up/down
> > torture tests.
> > Could you update guest kernel to upstream and see if issue reproduces?
> >
> 
> Hmm, unfortunately it is very hard to reproduce, and we are still trying to
> reproduce it.
> 
> For your test case, is it a kernel bug?
> Or has any related patch that could solve your test problem been merged
> upstream?
I don't remember all prerequisite patches but you should be able to find
  http://marc.info/?l=linux-kernel&m=140326703108009&w=2
  "x86/smpboot: Initialize secondary CPU only if master CPU will wait for it"
and then look for dependencies.


> 
> Thanks,
> zhanghailiang
> 
> >>>> and found that the only possibility is that the emulation of the 'cpuid'
> >>>> instruction in KVM/QEMU has something wrong.
> >>>> But since we can't reproduce this problem, we are not quite sure.
> >>>> Is it possible that the cpuid emulation in KVM/QEMU has a bug?
> >>>
> >>> Can you explain the relationship to the cpuid emulation?  What do the
> >>> traces say about vcpus 1 and 7?
> >>
> >> OK, we searched the VM's kernel code for the 'Stuck' message, and it is
> >> located in do_boot_cpu(). It runs in BSP context; the call chain is:
> >> BSP executes start_kernel() -> smp_init() -> smp_boot_cpus() ->
> >> do_boot_cpu() -> wakeup_secondary_via_INIT() to trigger the APs.
> >> It will wait 5s for the APs to start up; if some AP does not start up
> >> normally, it will print 'CPU%d Stuck' or 'CPU%d: Not responding'.
> >>
> >> If it prints 'Stuck', it means the AP received the SIPI interrupt and
> >> began to execute the code at 'ENTRY(trampoline_data)' (trampoline_64.S),
> >> but got stuck somewhere before smp_callin() (smpboot.c).
> >> The following is the startup process of the BSP and the APs.
> >> BSP:
> >> start_kernel()
> >>     ->smp_init()
> >>        ->smp_boot_cpus()
> >>          ->do_boot_cpu()
> >>              ->start_ip = trampoline_address(); //set the address that AP 
> >> will go to execute
> >>              ->wakeup_secondary_cpu_via_init(); // kick the secondary CPU
> >>              ->for (timeout = 0; timeout < 50000; timeout++)
> >>                  if (cpumask_test_cpu(cpu, cpu_callin_mask)) break;// 
> >> check if AP startup or not
> >>
> >> APs:
> >> ENTRY(trampoline_data) (trampoline_64.S)
> >>         ->ENTRY(secondary_startup_64) (head_64.S)
> >>            ->start_secondary() (smpboot.c)
> >>               ->cpu_init();
> >>               ->smp_callin();
> >>                   ->cpumask_set_cpu(cpuid, cpu_callin_mask); // Note: if
> >> the AP gets here, the BSP will not print the error message.
> >>
> >> From the above call chain, we can be sure that the AP got stuck between
> >> trampoline_data and the cpumask_set_cpu() in smp_callin(). We looked
> >> through these code paths carefully, and the only thing we found that could
> >> block the process is a 'hlt' instruction. It is located in
> >> trampoline_data():
> >>
> >> ENTRY(trampoline_data)
> >>           ...
> >>
> >>    call    verify_cpu              # Verify the cpu supports long mode
> >>    testl   %eax, %eax              # Check for return code
> >>    jnz     no_longmode
> >>
> >>           ...
> >>
> >> no_longmode:
> >>    hlt
> >>    jmp no_longmode
> >>
> >> In verify_cpu(),
> >> the only sensitive instruction we can find that could cause a VM exit from
> >> non-root mode is 'cpuid'.
> >> This is why we suspect that the cpuid emulation in KVM/QEMU is wrong,
> >> leading to the failure in verify_cpu.
> >>
> >> From the messages in the VM, we know something is wrong with vcpu1 and
> >> vcpu7.
> >> [    5.060042] CPU1: Stuck ??
> >> [   10.170815] CPU7: Stuck ??
> >> [   10.171648] Brought up 6 CPUs
> >>
> >> Besides, the following is the CPU state obtained from the host.
> >> 80FF72F5-FF6D-E411-A8C8-000000821800:/home/fsp/hrg # virsh 
> >> qemu-monitor-command instance-0000000
> >> * CPU #0: pc=0x00007f64160c683d thread_id=68570
> >>     CPU #1: pc=0xffffffff810301f1 (halted) thread_id=68573
> >>     CPU #2: pc=0xffffffff810301e2 (halted) thread_id=68575
> >>     CPU #3: pc=0xffffffff810301e2 (halted) thread_id=68576
> >>     CPU #4: pc=0xffffffff810301e2 (halted) thread_id=68577
> >>     CPU #5: pc=0xffffffff810301e2 (halted) thread_id=68578
> >>     CPU #6: pc=0xffffffff810301e2 (halted) thread_id=68583
> >>     CPU #7: pc=0xffffffff810301f1 (halted) thread_id=68584
> >>
> >> Oh, I also forgot to mention in the above message that we have bound each
> >> vCPU to a different physical CPU on the host.
> >>
> >> Thanks,
> >> zhanghailiang
> >>
> >>
> >>
> >>
