Re: [Qemu-devel] [Patch V0] x86, mce: Basic support to add LMCE support


From: Raj, Ashok
Subject: Re: [Qemu-devel] [Patch V0] x86, mce: Basic support to add LMCE support to QEMU
Date: Wed, 9 Dec 2015 18:05:04 -0500
User-agent: Mutt/1.5.23 (2014-03-12)

On Wed, Dec 09, 2015 at 10:07:48PM +0100, Paolo Bonzini wrote:
> 
> 
> On 09/12/2015 20:57, Ashok Raj wrote:
> > +    /*
> > +     * We need to read back the value of MSREXT_MCG_CTL that was set by the
> > +     * guest kernel back into Qemu
> > +     */
> > +    cs->kvm_vcpu_dirty = false;
> > +    cpu_synchronize_state(cs);

This wasn't in my original patch, but it turned out to be required.

I will have Gong check this and report back.
> 
> This should not be necessary.  I've only skimmed the patches but, apart
> from this, the patches look good.  Eduardo knows more than me about
> machine types and backwards compatibility to older kernels, however, and
> I'm deferring to him on this aspect.
> 
> How was this tested?  (In general, how do you test MCE? :))

We tested on real hardware that supports error injection via EINJ.

One additional patch is required for the testing: it translates a GPA
to an HPA. Perhaps we could include it as well, to make testing easy
and so that we don't have to maintain it out of tree?

Here are the logs from Gong's testing. He has a pretty elaborate setup
to test this.  :-)

Compare MCGCAP and MCGSTATUS on the host and in the guest to see the
values introduced by this change set.
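
For readers who want to check those values by hand, here is a minimal
sketch that decodes the MCGCAP and MCGSTATUS numbers printed by mcelog in
the logs below. The bit positions follow the IA32_MCG_CAP and
IA32_MCG_STATUS layout in the Intel SDM as I understand it. The macro and
function names are made up for this illustration and are not taken from
the patch or from QEMU.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* IA32_MCG_CAP fields (illustrative names, not QEMU's). */
#define MCG_CAP_BANK_COUNT_MASK 0xffULL      /* bits 7:0: number of MC banks */
#define MCG_CAP_CTL_P           (1ULL << 8)  /* IA32_MCG_CTL is present */
#define MCG_CAP_SER_P           (1ULL << 24) /* software error recovery */
#define MCG_CAP_LMCE_P          (1ULL << 27) /* local machine check supported */

/* IA32_MCG_STATUS flags. */
#define MCG_STATUS_RIPV (1ULL << 0) /* restart IP valid */
#define MCG_STATUS_EIPV (1ULL << 1) /* error IP valid */
#define MCG_STATUS_MCIP (1ULL << 2) /* machine check in progress */
#define MCG_STATUS_LMCE (1ULL << 3) /* MCE was delivered locally */

static inline unsigned mcg_cap_bank_count(uint64_t cap)
{
    return (unsigned)(cap & MCG_CAP_BANK_COUNT_MASK);
}

static inline bool mcg_cap_has_lmce(uint64_t cap)
{
    return (cap & MCG_CAP_LMCE_P) != 0;
}

static inline bool mcg_status_lmce_signaled(uint64_t status)
{
    return (status & MCG_STATUS_LMCE) != 0;
}

/* Decode the MCGCAP/MCGSTATUS values that appear in the mcelog output. */
static inline void check_values_from_logs(void)
{
    const uint64_t host_cap = 0x7000c16ULL, guest_cap = 0x900010aULL;
    const uint64_t host_status = 0x7ULL, guest_status = 0xeULL;

    assert(mcg_cap_bank_count(host_cap) == 22);
    assert(mcg_cap_bank_count(guest_cap) == 10);
    assert(!mcg_cap_has_lmce(host_cap));
    assert(mcg_cap_has_lmce(guest_cap));
    assert((guest_cap & MCG_CAP_CTL_P) && (guest_cap & MCG_CAP_SER_P));
    assert(host_status ==
           (MCG_STATUS_RIPV | MCG_STATUS_EIPV | MCG_STATUS_MCIP));
    assert(guest_status ==
           (MCG_STATUS_EIPV | MCG_STATUS_MCIP | MCG_STATUS_LMCE));
}
```

Under this layout the guest MCGCAP of 900010a has bit 27 set, and the
guest MCGSTATUS of e decodes to EIPV | MCIP | LMCE, which matches the
"MCG status:EIPV MCIP LMCE" line in the guest mcelog.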

===================================================================================================

dmesg on guest system:
…
[   35.294009] mce: [Hardware Error]: Machine check events logged
[   35.294009] mce: Uncorrected hardware memory error in user-access at 7451b000
[   35.334006] MCE 0x7451b: Killing victim:1822 due to hardware memory corruption
[   35.334515] MCE 0x7451b: dirty mlocked LRU page still referenced by 1 users
[   35.334930] MCE 0x7451b: recovery action for dirty mlocked LRU page: Failed
[   35.335372] mce: Memory error not recovered
…

------------------------------------------------------------------------------------------------------------------------

dmesg on host system:
…
[57629.858659] kvm: zapping shadow pages for mmio generation wraparound
[57629.859592] kvm: zapping shadow pages for mmio generation wraparound
[57637.023199] kvm [46095]: vcpu0 disabled perfctr wrmsr: 0xc2 data 0xffff
[57637.116429] kvm [46095]: vcpu0 unhandled rdmsr: 0x570
[57637.122112] kvm [46095]: vcpu1 unhandled rdmsr: 0x570
[57672.381651] mce: [Hardware Error]: Machine check events logged
[57672.388178] mce: Uncorrected hardware memory error in user-access at 1da71b000
[57672.396057] mce: [Hardware Error]: Machine check events logged
[57672.403345] MCE 0x1da71b: Killing qemu-system-x86:46095 due to hardware memory corruption
[57672.412499] MCE 0x1da71b: recovery action for dirty LRU page: Recovered

===================================================================================================
Mcelog on host system:

address@hidden host]# mcelog
Hardware event. This is not a software error.
MCE 0
CPU 68 BANK 1 TSC 835ad3e00dfe
MISC 86 ADDR 1da71b000
TIME 1449669775 Wed Dec  9 09:02:55 2015
MCG status:RIPV EIPV MCIP
MCi status:
Uncorrected error
Error enabled
MCi_MISC register valid
MCi_ADDR register valid
SRAR
MCA: Data CACHE Level-0 Data-Read Error
STATUS bd80000000100134 MCGSTATUS 7
MCGCAP 7000c16 APICID f0 SOCKETID 3
CPUID Vendor Intel Family 6 Model 63
Hardware event. This is not a software error.
MCE 1
CPU 0 BANK 7
MISC 146588a86 ADDR 1da71b000
TIME 1449669775 Wed Dec  9 09:02:55 2015
MCG status:
MCi status:
Uncorrected error
MCi_MISC register valid
MCi_ADDR register valid
MCA: MEMORY CONTROLLER RD_CHANNEL2_ERR
Transaction: Memory read error
STATUS ac00000000010092 MCGSTATUS 0
MCGCAP 7000c16 APICID 0 SOCKETID 0
CPUID Vendor Intel Family 6 Model 63
address@hidden host]#

----------------------------------------------------------------------------

GUEST system mcelog:

address@hidden ~]# cat /var/log/mcelog
mcelog: mcelog server already running
mcelog: mcelog server already running
Hardware event. This is not a software error.
MCE 0
CPU 0 BANK 9 TSC 18ce71469a
RIP 33:401535
MISC 8c ADDR 7451b000
TIME 1449669775 Wed Dec  9 09:02:55 2015
MCG status:EIPV MCIP LMCE
MCi status:
Uncorrected error
Error enabled
MCi_MISC register valid
MCi_ADDR register valid
SRAR
MCA: Data CACHE Level-0 Data-Read Error
STATUS bd80000000000134 MCGSTATUS e
MCGCAP 900010a APICID 0 SOCKETID 0
CPUID Vendor Intel Family 6 Model 6
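
The STATUS words in the mcelog output above can be cross-checked the same
way. This is a rough sketch, assuming the IA32_MCi_STATUS flag positions
from the Intel SDM (VAL 63, UC 61, EN 60, PCC 57, S 56, AR 55) and the SDM
definition of SRAR as an uncorrected, enabled error with S and AR set and
PCC clear. The names are hypothetical and do not come from the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* IA32_MCi_STATUS flags (illustrative names, not QEMU's). */
#define MCI_STATUS_VAL (1ULL << 63) /* register contents are valid */
#define MCI_STATUS_UC  (1ULL << 61) /* uncorrected error */
#define MCI_STATUS_EN  (1ULL << 60) /* error reporting enabled */
#define MCI_STATUS_PCC (1ULL << 57) /* processor context corrupt */
#define MCI_STATUS_S   (1ULL << 56) /* signaled via machine check */
#define MCI_STATUS_AR  (1ULL << 55) /* action required */

/* Low 16 bits hold the architectural MCA error code. */
static inline uint16_t mci_status_mca_code(uint64_t status)
{
    return (uint16_t)(status & 0xffffULL);
}

/* SRAR: valid, uncorrected, enabled, S and AR set, PCC clear. */
static inline bool mci_status_is_srar(uint64_t status)
{
    const uint64_t must_set = MCI_STATUS_VAL | MCI_STATUS_UC |
                              MCI_STATUS_EN | MCI_STATUS_S | MCI_STATUS_AR;

    return (status & must_set) == must_set && !(status & MCI_STATUS_PCC);
}

/* Check the three STATUS values printed by mcelog on host and guest. */
static inline void check_status_from_logs(void)
{
    const uint64_t host_bank1  = 0xbd80000000100134ULL;
    const uint64_t host_bank7  = 0xac00000000010092ULL;
    const uint64_t guest_bank9 = 0xbd80000000000134ULL;

    assert(mci_status_is_srar(host_bank1));
    assert(mci_status_is_srar(guest_bank9));
    assert(!mci_status_is_srar(host_bank7));
    /* Host bank 1 and guest bank 9 carry the same MCA error code. */
    assert(mci_status_mca_code(host_bank1) == 0x0134);
    assert(mci_status_mca_code(guest_bank9) == 0x0134);
    assert(mci_status_mca_code(host_bank7) == 0x0092);
}
```

With these definitions the host bank 1 record and the guest bank 9 record
both classify as SRAR with MCA code 0134, while the host memory controller
record in bank 7 does not. That agrees with the SRAR lines mcelog printed
for the first two records only.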

Attachment: qemu-add-monitor.patch
Description: Text document

