qemu-devel
Re: [Qemu-devel] E5-2620v2 - emulation stop error


From: Andrey Korolyov
Subject: Re: [Qemu-devel] E5-2620v2 - emulation stop error
Date: Tue, 10 Mar 2015 21:21:30 +0300

On Tue, Mar 10, 2015 at 9:16 PM, Dr. David Alan Gilbert
<address@hidden> wrote:
> * Andrey Korolyov (address@hidden) wrote:
>> On Tue, Mar 10, 2015 at 7:57 PM, Dr. David Alan Gilbert
>> <address@hidden> wrote:
>> > * Andrey Korolyov (address@hidden) wrote:
>> >> On Sat, Mar 7, 2015 at 3:00 AM, Andrey Korolyov <address@hidden> wrote:
>> >> > On Fri, Mar 6, 2015 at 7:57 PM, Bandan Das <address@hidden> wrote:
>> >> >> Andrey Korolyov <address@hidden> writes:
>> >> >>
>> >> >>> On Fri, Mar 6, 2015 at 1:14 AM, Andrey Korolyov <address@hidden> wrote:
>> >> >>>> Hello,
>> >> >>>>
>> >> >>>> I recently got a couple of shiny new Intel 2620v2s as a future
>> >> >>>> replacement for the E5-2620v1, but I am seeing relatively many
>> >> >>>> emulation errors; all traces look similar to the one below. I am
>> >> >>>> running qemu-2.1 on x86 on top of the 3.10 branch for testing
>> >> >>>> purposes but can switch to other versions if necessary. Most of the
>> >> >>>> crashes happened during a reboot cycle or at the end of an ACPI-based
>> >> >>>> shutdown, if this helps. I have no clue what could introduce such
>> >> >>>> a mess within the same processor family using identical software, as
>> >> >>>> the 2620v1 never had a similar problem. Please let me know if there
>> >> >>>> are any additional measures that would make the picture clearer.
>> >> >>>>
>> >> >>>> Thanks!
>> >> >>>>
>> >> >>>> KVM internal error. Suberror: 2
>> >> >>>> extra data[0]: 800000d1
>> >> >>>> extra data[1]: 80000b0d
>> >> >>>> EAX=00000003 EBX=00000000 ECX=00000000 EDX=00000000
>> >> >>>> ESI=00000000 EDI=00000000 EBP=00000000 ESP=00006cd4
>> >> >>>> EIP=0000d3f9 EFL=00010202 [-------] CPL=0 II=0 A20=1 SMM=0 HLT=0
>> >> >>>> ES =0000 00000000 0000ffff 00009300
>> >> >>>> CS =f000 000f0000 0000ffff 00009b00
>> >> >>>> SS =0000 00000000 0000ffff 00009300
>> >> >>>> DS =0000 00000000 0000ffff 00009300
>> >> >>>> FS =0000 00000000 0000ffff 00009300
>> >> >>>> GS =0000 00000000 0000ffff 00009300
>> >> >>>> LDT=0000 00000000 0000ffff 00008200
>> >> >>>> TR =0000 00000000 0000ffff 00008b00
>> >> >>>> GDT=     000f6e98 00000037
>> >> >>>> IDT=     00000000 000003ff
>> >> >>>> CR0=00000010 CR2=00000000 CR3=00000000 CR4=00000000
>> >> >>>> DR0=0000000000000000 DR1=0000000000000000 DR2=0000000000000000
>> >> >>>> DR3=0000000000000000
>> >> >>>> DR6=00000000ffff0ff0 DR7=0000000000000400
>> >> >>>> EFER=0000000000000000
>> >> >>>> Code=48 18 67 8c 00 8c d1 8e d9 66 5a 66 58 66 5d 66 c3 cd 02 cb <cd>
>> >> >>>> 10 cb cd 13 cb cd 15 cb cd 16 cb cd 18 cb cd 19 cb cd 1c cb fa fc 66
>> >> >>>> b8 00 e0 00 00 8e
>> >> >>>
>> >> >>>
>> >> >>> It turns out that those errors are introduced by APICv, which gets
>> >> >>> enabled due to the different feature set. If anyone is interested in
>> >> >>> reproducing/fixing this exactly on 3.10, it takes about one hundred
>> >> >>> migrations/power state changes for the issue to appear; the guest OS
>> >> >>> can be Linux or Windows.
>> >> >>
>> >> >> Are you able to reproduce this on a more recent upstream kernel as well?
>> >> >>
>> >> >> Bandan
>> >> >
>> >> > I'll go through a test cycle with 3.18 and the 2603v2 around tomorrow
>> >> > and follow up with any reproducible results.
>> >>
>> >> Heh.. the issue is not triggered on the 2603v2 at all, at least I am not
>> >> able to hit it. The only difference from the 2620v2, apart from the lower
>> >> frequency, is the Intel Dynamic Acceleration feature. I'd appreciate any
>> >> testing with higher CPU models with the same or a richer feature set. The
>> >> testing itself can be done on either generic 3.10 or RH7 kernels, as both
>> >> of them are experiencing this issue. I ran all tests with C-states
>> >> disabled, so I advise doing the same as a first reproduction step.
>> >>
>> >> Thanks!
>> >>
>> >> model name      : Intel(R) Xeon(R) CPU E5-2620 v2 @ 2.10GHz
>> >> stepping        : 4
>> >> microcode       : 0x416
>> >> cpu MHz         : 2100.039
>> >> cache size      : 15360 KB
>> >> siblings        : 12
>> >> apicid          : 43
>> >> initial apicid  : 43
>> >> fpu             : yes
>> >> fpu_exception   : yes
>> >> cpuid level     : 13
>> >> wp              : yes
>> >> flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge
>> >> mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe
>> >> syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts
>> >> rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq
>> >> dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm pcid dca
>> >> sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx f16c
>> >> rdrand lahf_lm ida arat epb xsaveopt pln pts dtherm tpr_shadow vnmi
>> >> flexpriority ept vpid fsgsbase smep erms
>> >
>> > I'm seeing something similar; it's very intermittent and generally
>> > happening right at boot of the guest; I'm running this on qemu
>> > head+my postcopy world (but it's happening right at boot before postcopy
>> > gets a chance), and I'm using a 3.19ish kernel. Xeon E5-2407 in my case
>> > but hey maybe I'm seeing a different bug.
>> >
>> > Dave
>>
>> Yep, looks like we are hitting the same bug - two thirds of my failure
>> events occurred during a boot/reboot cycle and approx. one third
>> happened in the middle of runtime. Which CPU are you using, v0 or v2
>> (in other words, is APICv enabled)?
>
> processor       : 7
> vendor_id       : GenuineIntel
> cpu family      : 6
> model           : 45
> model name      : Intel(R) Xeon(R) CPU E5-2407 0 @ 2.20GHz
> stepping        : 7
> microcode       : 0x70d
> cpu MHz         : 2200.000
> cache size      : 10240 KB
> physical id     : 1
> siblings        : 4
> core id         : 3
> cpu cores       : 4
> apicid          : 38
> initial apicid  : 38
> fpu             : yes
> fpu_exception   : yes
> cpuid level     : 13
> wp              : yes
> flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca 
> cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx 
> pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology 
> nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx smx 
> est tm2 ssse3 cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic popcnt 
> tsc_deadline_timer aes xsave avx lahf_lm arat pln pts dtherm tpr_shadow vnmi 
> flexpriority ept vpid xsaveopt
> bugs            :
> bogomips        : 4409.23
> clflush size    : 64
> cache_alignment : 64
> address sizes   : 46 bits physical, 48 bits virtual
> power management:
>
> It's really random as well; I had two within half an hour yesterday, and then
> it survived overnight with no change.
>
> KVM internal error. Suberror: 1
> emulation failure
> EAX=00000000 EBX=00000000 ECX=00000000 EDX=000fd2bc
> ESI=00000000 EDI=00000000 EBP=00000000 ESP=00000000
> EIP=000fd2c5 EFL=00010007 [-----PC] CPL=0 II=0 A20=1 SMM=0 HLT=0
> ES =0010 00000000 ffffffff 00c09300 DPL=0 DS   [-WA]
> CS =0008 00000000 ffffffff 00c09b00 DPL=0 CS32 [-RA]
> SS =0010 00000000 ffffffff 00c09300 DPL=0 DS   [-WA]
> DS =0010 00000000 ffffffff 00c09300 DPL=0 DS   [-WA]
> FS =0010 00000000 ffffffff 00c09300 DPL=0 DS   [-WA]
> GS =0010 00000000 ffffffff 00c09300 DPL=0 DS   [-WA]
> LDT=0000 00000000 0000ffff 00008200 DPL=0 LDT
> TR =0000 00000000 0000ffff 00008b00 DPL=0 TSS32-busy
> GDT=     000f6a80 00000037
> IDT=     000f6abe 00000000
> CR0=60000011 CR2=00000000 CR3=00000000 CR4=00000000
> DR0=0000000000000000 DR1=0000000000000000 DR2=0000000000000000 DR3=0000000000000000
> DR6=00000000ffff0ff0 DR7=0000000000000400
> EFER=0000000000000000
> Code=66 ba bc d2 0f 00 e9 a2 fe f3 90 f0 0f ba 2d 04 ff fb bf 00 <72> f3 8b 25 00 ff fb bf e8 44 66 ff ff c7 05 04 ff fb bf 00 00 00 00 f4 eb fd fa fc 66 b8
> KVM internal error. Suberror: 1
> emulation failure
>
> and
>
> 11:37:49 INFO | [qemu output] KVM internal error. Suberror: 1
> 11:37:49 INFO | [qemu output] emulation failure
> 11:37:49 INFO | [qemu output] EAX=00000000 EBX=00000000 ECX=00000000 EDX=000fd2bc
> 11:37:49 INFO | [qemu output] ESI=00000000 EDI=00000000 EBP=00000000 ESP=00000000
> 11:37:49 INFO | [qemu output] EIP=000fd2bc EFL=00010007 [-----PC] CPL=0 II=0 A20=1 SMM=0 HLT=0
> 11:37:49 INFO | [qemu output] ES =0010 00000000 ffffffff 00c09300 DPL=0 DS   [-WA]
> 11:37:49 INFO | [qemu output] CS =0008 00000000 ffffffff 00c09b00 DPL=0 CS32 [-RA]
> 11:37:49 INFO | [qemu output] SS =0010 00000000 ffffffff 00c09300 DPL=0 DS   [-WA]
> 11:37:49 INFO | [qemu output] DS =0010 00000000 ffffffff 00c09300 DPL=0 DS   [-WA]
> 11:37:49 INFO | [qemu output] FS =0010 00000000 ffffffff 00c09300 DPL=0 DS   [-WA]
> 11:37:49 INFO | [qemu output] GS =0010 00000000 ffffffff 00c09300 DPL=0 DS   [-WA]
> 11:37:49 INFO | [qemu output] LDT=0000 00000000 0000ffff 00008200 DPL=0 LDT
> 11:37:49 INFO | [qemu output] TR =0000 00000000 0000ffff 00008b00 DPL=0 TSS32-busy
> 11:37:49 INFO | [qemu output] GDT=     000f6a80 00000037
> 11:37:49 INFO | [qemu output] IDT=     000f6abe 00000000
> 11:37:49 INFO | [qemu output] CR0=60000011 CR2=00000000 CR3=00000000 CR4=00000000
> 11:37:49 INFO | [qemu output] DR0=0000000000000000 DR1=0000000000000000 DR2=0000000000000000 DR3=0000000000000000
> 11:37:49 INFO | [qemu output] DR6=00000000ffff0ff0 DR7=0000000000000400
> 11:37:49 INFO | [qemu output] EFER=0000000000000000
> 11:37:49 INFO | [qemu output] Code=0a 00 e8 a0 64 ff ff 0f aa 66 ba bc d2 0f 00 e9 a2 fe f3 90 <f0> 0f ba 2d 04 ff fb 3f 00 72 f3 8b 25 00 ff fb 3f e8 44 66 ff ff c7 05 04 ff fb 3f 00 00
>
> note the code in that second one is in the middle of the bios,
> but the code has a few bytes different from what an objdump gets,
> so I'm not quite sure if something is stamping on the bios or
> if that's separate.
>
> Dave

Thanks. AFAIU suberror 1 and suberror 2 are completely different in
nature, so this is a different bug. What is interesting is that you've got
the same reproduction pattern as in my case; it may point to a single
userspace issue triggering two independent KVM bugs...
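
For anyone reading the suberror 2 dump at the top of the thread, here is a
minimal sketch of how its two "extra data" words could be decoded. It rests
on an assumption not confirmed in this thread: that for suberror 2 KVM
reports the VMX IDT-vectoring information field in extra data[0] and the
VM-exit interruption information field in extra data[1], both using the bit
layout from the Intel SDM (vector in bits 7:0, type in bits 10:8, error-code
valid in bit 11, valid in bit 31). The function and table names are made up
for illustration.

```python
# Assumed layout (Intel SDM, VMX event information fields):
#   bits 7:0  vector
#   bits 10:8 event type
#   bit  11   error code valid
#   bit  31   field valid
EVENT_TYPES = {
    0: "external interrupt",
    2: "NMI",
    3: "hardware exception",
    4: "software interrupt",
    5: "privileged software exception",
    6: "software exception",
}


def decode_event_info(value):
    """Split a 32-bit VMX event information word into its fields."""
    return {
        "valid": bool(value & (1 << 31)),
        "vector": value & 0xFF,
        "type": EVENT_TYPES.get((value >> 8) & 0x7, "reserved"),
        "error_code_valid": bool(value & (1 << 11)),
    }


if __name__ == "__main__":
    # Values copied from the "extra data" lines of the suberror 2 dump.
    for label, raw in (("extra data[0]", 0x800000D1),
                       ("extra data[1]", 0x80000B0D)):
        print(label, hex(raw), decode_event_info(raw))
```

Under that assumption, the dump would read as a hardware exception with
vector 13 (#GP) raised while external interrupt vector 0xd1 was being
delivered, which is at least consistent with an interrupt-delivery (APICv)
angle.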

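To compare the two hosts discussed here, a small sketch that diffs the
`flags` lines from the two /proc/cpuinfo listings quoted above (E5-2620 v2
and E5-2407). The flag strings are copied from this thread; the helper name
`flag_diff` is hypothetical. It only shows which flags the kernel lists for
one host and not the other, and makes no claim about which of them matter
for the bug.

```python
E5_2620V2_FLAGS = """
fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36
clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm
constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc
aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3
cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes
xsave avx f16c rdrand lahf_lm ida arat epb xsaveopt pln pts dtherm tpr_shadow
vnmi flexpriority ept vpid fsgsbase smep erms
"""

E5_2407_FLAGS = """
fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36
clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm
constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc
aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3
cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes
xsave avx lahf_lm arat pln pts dtherm tpr_shadow vnmi flexpriority ept vpid
xsaveopt
"""


def flag_diff(flags_a, flags_b):
    """Return (flags only in a, flags only in b), each sorted by name."""
    set_a, set_b = set(flags_a.split()), set(flags_b.split())
    return sorted(set_a - set_b), sorted(set_b - set_a)


if __name__ == "__main__":
    only_2620v2, only_2407 = flag_diff(E5_2620V2_FLAGS, E5_2407_FLAGS)
    print("only on E5-2620 v2:", " ".join(only_2620v2))
    print("only on E5-2407:   ", " ".join(only_2407))
```

With the two lists from this thread, the E5-2620 v2 has seven flags the
E5-2407 lacks (epb erms f16c fsgsbase ida rdrand smep) and the E5-2407 has
none of its own.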

