[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
[PATCH v4 0/6] Add ACPI CPER firmware first error injection for Arm Proc
From: |
Mauro Carvalho Chehab |
Subject: |
[PATCH v4 0/6] Add ACPI CPER firmware first error injection for Arm Processor |
Date: |
Mon, 29 Jul 2024 15:21:04 +0200 |
Testing OS kernel ACPI APEI CPER support is tricky, as one depends on
having hardware with special-purpose BIOS and/or hardware.
With QEMU, it becomes a lot easier, as it can be done via QMP.
This series add support for ARM Processor CPER error injection,
according with ACPI 6.x and UEFI 2.9A/2.10 specs.
This series consists of:
- one patch using a define for ARM virt GPIO power pin
(requested during last review);
- three patches from Jonathan (one coauthored with Shiju) with basic
EINJ features, already submitted as RFC (but not merged yet) at:
https://lore.kernel.org/qemu-devel/20240628090605.529-1-shiju.jose@huawei.com/
- three patches from me extending it to optionally allow to
generate all sorts of possible valid combinations for
ARM Processor CPER record.
I've been using it to test a Linux Kernel patch series fixing
UEFI 2.9A errata and ARM processor trace event:
https://lore.kernel.org/linux-edac/3853853f820a666253ca8ed6c7c724dc3d50044a.1720679234.git.mchehab+huawei@kernel.org/T/#t
I also wrote some Wiki pages for rasdaemon (a Linux daemon
widely used to monitor and react to RAS events):
https://github.com/mchehab/rasdaemon/wiki/error-injection
Being really helpful to test the Linux Kernel behavior when
firmware-first RAS events for ARM processor arrives there,
helping to validate how CPER and GHES driver handles them
(and further testing userspace apps like rasdaemon):
Sending this command to QMP:
{ "execute": "qmp_capabilities" }
{ "execute": "arm-inject-error", "arguments": {"error": [{"type":
["cache-error"]}]} }
Produces a simple CPER register, properly handled by the Linux
Kernel:
[ 839.952678] {4}[Hardware Error]: Hardware error from APEI Generic Hardware
Error Source: 1
[ 839.953145] {4}[Hardware Error]: event severity: recoverable
[ 839.953451] {4}[Hardware Error]: Error 0, type: recoverable
[ 839.953763] {4}[Hardware Error]: section_type: ARM processor error
[ 839.954094] {4}[Hardware Error]: MIDR: 0x0000000000000000
[ 839.954383] {4}[Hardware Error]: Multiprocessor Affinity Register (MPIDR):
0x0000000080000000
[ 839.954802] {4}[Hardware Error]: running state: 0x0
[ 839.955066] {4}[Hardware Error]: Power State Coordination Interface state: 0
[ 839.955424] {4}[Hardware Error]: Error info structure 0:
[ 839.955712] {4}[Hardware Error]: num errors: 1
[ 839.955983] {4}[Hardware Error]: first error captured
[ 839.956260] {4}[Hardware Error]: propagated error captured
[ 839.956561] {4}[Hardware Error]: error_type: 0x02: cache error
[ 839.956882] {4}[Hardware Error]: error_info: 0x000000000054007f
[ 839.957192] {4}[Hardware Error]: transaction type: Instruction
[ 839.957495] {4}[Hardware Error]: cache error, operation type:
Instruction fetch
[ 839.957888] {4}[Hardware Error]: cache level: 1
[ 839.958166] {4}[Hardware Error]: processor context not corrupted
[ 839.958459] {4}[Hardware Error]: the error has not been corrected
[ 839.958771] {4}[Hardware Error]: PC is imprecise
[ 839.959074] [Firmware Warn]: GHES: Unhandled processor error type 0x02:
cache error
rasdaemon output (rasdaemon still needs to be patched for
UEFI 2.9A errata):
<...>-211 [002] d..1. 0.000129 arm_event 2024-07-11 09:50:45
+0000 affinity: -1 MPIDR: 0x80000000 MIDR: 0x0 running_state: 0 psci_state: 0
ARM Processor Err Info data len: 32
<CANT FIND FIELD buf>cpu: 0; error: 2; affinity level: 255; MPIDR:
0000000080000000; MIDR: 0000000000000000; running state: 0; PSCI state: 0; ARM
Processor Err Info data len: 32; ARM Processor Err Info raw data: 00 20 06 00
02 00 00 05 7f 00 54 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00 00; ARM Processor Err Context Info data len: 0; ARM Processor Err Context
Info raw data: ; Vendor Specific Err Info data len: 0; Vendor Specific Err Info
raw data:
More complex events with multiple Processor Error Information structures
can be produced like:
{ "execute": "arm-inject-error", "arguments": {
"validation": ["mpidr-valid", "affinity-valid", "running-state-valid",
"vendor-specific-valid"],
"running-state": [], "psci-state": 1229279264,
"error": [{
"validation": ["multiple-error-valid", "flags-valid"],
"type": ["tlb-error", "bus-error", "micro-arch-error"],
"multiple-error": 3, "phy-addr": 57005,
"virt-addr": 48879},
{"type": ["micro-arch-error"]},
{"type": ["tlb-error"]},
{"type": ["bus-error"]},
{"type": ["cache-error"]}],
"context": [{"register": [57005, 48879, 43962, 47787]}],
"vendor-specific": [12, 23, 53, 52, 3, 123, 243, 255]} }
[ 925.340284] {5}[Hardware Error]: Hardware error from APEI Generic Hardware
Error Source: 1
[ 925.340662] {5}[Hardware Error]: event severity: recoverable
[ 925.340924] {5}[Hardware Error]: Error 0, type: recoverable
[ 925.341280] {5}[Hardware Error]: section_type: ARM processor error
[ 925.341631] {5}[Hardware Error]: MIDR: 0x0000000000000000
[ 925.341893] {5}[Hardware Error]: Multiprocessor Affinity Register (MPIDR):
0x0000000080000000
[ 925.342278] {5}[Hardware Error]: error affinity level: 0
[ 925.342571] {5}[Hardware Error]: running state: 0x0
[ 925.342835] {5}[Hardware Error]: Power State Coordination Interface state:
1229279264
[ 925.343157] {5}[Hardware Error]: Error info structure 0:
[ 925.343388] {5}[Hardware Error]: num errors: 4
[ 925.343602] {5}[Hardware Error]: error_type: 0x1c: TLB error|bus
error|micro-architectural error
[ 925.343960] {5}[Hardware Error]: virtual fault address: 0x000000000000beef
[ 925.344241] {5}[Hardware Error]: physical fault address:
0x000000000000dead
[ 925.344526] {5}[Hardware Error]: Error info structure 1:
[ 925.344757] {5}[Hardware Error]: num errors: 1
[ 925.344965] {5}[Hardware Error]: first error captured
[ 925.345183] {5}[Hardware Error]: propagated error captured
[ 925.345416] {5}[Hardware Error]: error_type: 0x10: micro-architectural
error
[ 925.345714] {5}[Hardware Error]: Error info structure 2:
[ 925.345946] {5}[Hardware Error]: num errors: 1
[ 925.346148] {5}[Hardware Error]: first error captured
[ 925.346413] {5}[Hardware Error]: propagated error captured
[ 925.346719] {5}[Hardware Error]: error_type: 0x04: TLB error
[ 925.346988] {5}[Hardware Error]: error_info: 0x00000080d6460fff
[ 925.347248] {5}[Hardware Error]: transaction type: Generic
[ 925.347492] {5}[Hardware Error]: TLB error, operation type: Generic read
(type of instruction or data request cannot be determined)
[ 925.347945] {5}[Hardware Error]: TLB level: 1
[ 925.348153] {5}[Hardware Error]: processor context corrupted
[ 925.348392] {5}[Hardware Error]: the error has been corrected
[ 925.348635] {5}[Hardware Error]: PC is imprecise
[ 925.348848] {5}[Hardware Error]: Program execution can be restarted
reliably at the PC associated with the error.
[ 925.349232] {5}[Hardware Error]: Error info structure 3:
[ 925.349459] {5}[Hardware Error]: num errors: 1
[ 925.349662] {5}[Hardware Error]: first error captured
[ 925.349884] {5}[Hardware Error]: propagated error captured
[ 925.350115] {5}[Hardware Error]: error_type: 0x08: bus error
[ 925.350371] {5}[Hardware Error]: error_info: 0x0000000078da03ff
[ 925.350629] {5}[Hardware Error]: transaction type: Generic
[ 925.350878] {5}[Hardware Error]: bus error, operation type: Prefetch
[ 925.351144] {5}[Hardware Error]: affinity level at which the bus error
occurred: 3
[ 925.351451] {5}[Hardware Error]: processor context not corrupted
[ 925.351702] {5}[Hardware Error]: the error has not been corrected
[ 925.351960] {5}[Hardware Error]: PC is precise
[ 925.352164] {5}[Hardware Error]: Program execution can be restarted
reliably at the PC associated with the error.
[ 925.352546] {5}[Hardware Error]: participation type: Generic
[ 925.352801] {5}[Hardware Error]: address space: External Memory Access
[ 925.353071] {5}[Hardware Error]: Error info structure 4:
[ 925.353299] {5}[Hardware Error]: num errors: 1
[ 925.353502] {5}[Hardware Error]: first error captured
[ 925.353720] {5}[Hardware Error]: propagated error captured
[ 925.353963] {5}[Hardware Error]: error_type: 0x02: cache error
[ 925.354222] {5}[Hardware Error]: error_info: 0x000000000054007f
[ 925.354478] {5}[Hardware Error]: transaction type: Instruction
[ 925.354782] {5}[Hardware Error]: cache error, operation type:
Instruction fetch
[ 925.355203] {5}[Hardware Error]: cache level: 1
[ 925.355495] {5}[Hardware Error]: processor context not corrupted
[ 925.355848] {5}[Hardware Error]: the error has not been corrected
[ 925.356206] {5}[Hardware Error]: PC is imprecise
[ 925.356493] {5}[Hardware Error]: Context info structure 0:
[ 925.356809] {5}[Hardware Error]: register context type: AArch64 EL1
context registers
[ 925.357282] {5}[Hardware Error]: 00000000: 0000dead 00000000 0000beef
00000000
[ 925.357800] {5}[Hardware Error]: 00000010: 0000abba 00000000 0000baab
00000000
[ 925.358267] {5}[Hardware Error]: 00000020: 00000000 00000000
[ 925.358523] {5}[Hardware Error]: Vendor specific error info has 8 bytes:
[ 925.358822] {5}[Hardware Error]: 00000000: 3435170c fff37b03
..54.{..
[ 925.359192] [Firmware Warn]: GHES: Unhandled processor error type 0x1c: TLB
error|bus error|micro-architectural error
[ 925.359590] [Firmware Warn]: GHES: Unhandled processor error type 0x10:
micro-architectural error
[ 925.359935] [Firmware Warn]: GHES: Unhandled processor error type 0x04: TLB
error
[ 925.360235] [Firmware Warn]: GHES: Unhandled processor error type 0x08: bus
error
[ 925.360534] [Firmware Warn]: GHES: Unhandled processor error type 0x02:
cache error
---
v4:
- patches 4 and 7 folded;
- Several improvements at QAPI documentation: now, both man-page and html
outputs look nicer and have tables to better define some fields;
- QAPI updates are now changed to QEMU version 9.2 and upper;
- Minor coding style improvements;
- added a MAINTAINERS entry for arm error inject;
- Generic Error Device notify callback renamed to generic_error_device_notify();
- GED patch description fixed;
- running_state/psci logic fixed.
v3:
- patch 1 cleanups with some comment changes and adding another place where
the poweroff GPIO define should be used. No changes on other patches (except
due to conflict resolution).
v2:
- added a new patch using a define for GPIO power pin;
- patch 2 changed to also use a define for generic error GPIO pin;
- a couple cleanups at patch 2 removing uneeded else clauses.
Jonathan Cameron (2):
arm/virt: Wire up GPIO error source for ACPI / GHES
acpi/ghes: Support GPIO error source.
Mauro Carvalho Chehab (4):
arm/virt: place power button pin number on a define
target/arm: preserve mpidr value
acpi/ghes: update comments to point to newer ACPI specs
acpi/ghes: Add a logic to inject ARM processor CPER
MAINTAINERS | 7 +
configs/targets/aarch64-softmmu.mak | 1 +
hw/acpi/ghes.c | 339 +++++++++++++++++++---
hw/arm/Kconfig | 4 +
hw/arm/arm_error_inject.c | 420 ++++++++++++++++++++++++++++
hw/arm/arm_error_inject_stubs.c | 34 +++
hw/arm/meson.build | 3 +
hw/arm/virt-acpi-build.c | 34 ++-
hw/arm/virt.c | 21 +-
include/hw/acpi/ghes.h | 41 +++
include/hw/arm/virt.h | 4 +
include/hw/boards.h | 1 +
qapi/arm-error-inject.json | 284 +++++++++++++++++++
qapi/meson.build | 1 +
qapi/qapi-schema.json | 1 +
target/arm/cpu.h | 1 +
target/arm/helper.c | 10 +-
17 files changed, 1160 insertions(+), 46 deletions(-)
create mode 100644 hw/arm/arm_error_inject.c
create mode 100644 hw/arm/arm_error_inject_stubs.c
create mode 100644 qapi/arm-error-inject.json
--
2.45.2