[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
[PATCH 00/11] hvf and APIC fixes, improvements, and optimisations
From: |
phil |
Subject: |
[PATCH 00/11] hvf and APIC fixes, improvements, and optimisations |
Date: |
Mon, 9 Dec 2024 21:36:18 +0100 |
From: Phil Dennis-Jordan <phil@philjordan.eu>
These patches are a loosely related series of changes that I've
collected while working on the VMApple/PVG patch set and my
doomed project of integrating Hypervisor.framework's APIC
implementation. (The latter turns out to suffer from a number of
serious bugs which prevent it from being usable with
level-triggered interrupts, at least as of macOS 15.1.)
The motivations behind the patches are:
* Correctness. A bunch of these fix bugs or improve compliance
with the appropriate specification.
* Performance. The biggest performance problem on macOS hosts
and the hvf accel is BQL contention. Depending on workload
and configuration, busy vCPU threads seem to spend between
10% and 30% of their time in bql_lock(). A number of these
patches either reduce total work done, or move work out of
the BQL in certain configurations.
* Code hygiene. I've encountered a few instances of incorrect
types used, outdated comments, magic numbers, etc. which
were easy to fix.
There are some dependencies between the patches:
* Patch 1 is required for both patches 2 and 7. This is why
I have not split the aarch64 (patch 2) and the x86 (3-10)
specific changes into separate patch sets.
* Patch 6 will not apply cleanly without 5 in its current state
but can be modified to work.
* Patches 6 and 7 are only useful in combination.
Beyond that, we can mix and match - i.e. pull
reviewed-and-approved patches while further iterating on any
that might need refinement or dropping unwanted ones entirely.
The patches, in brief:
1. Adds a function for architecture-specific vCPU setup
before the vCPU's first VM entry, but after QEMU init has
concluded. This is useful for correctly setting the aarch64
GIC flag (patch 2) and for telling Hypervisor.framework the
initial APICBASE address. (patch 7)
2. Set the GICv3 flag in the PFR0_EL1 system register after
the GIC has actually been set up. Previously the flag was
computed before the GIC was ready, and thus always 0.
3. Don't send signal to thread when kicking. cpus_kick_thread
calls pthread_kill() to send a SIG_IPI to the target vCPU's
thread. There does not appear to be any need to do this
with HVF, so let's not call cpus_kick_thread() when kicking
a vCPU there.
4. Prefetch all bytes of an instruction when emulating it in
software. The x86_decode system in the HVF accel read guest
memory in 1-, 2-, 4-, or 8-byte parcels when decoding
instructions. The VM exit tells us how big the instruction
is, so let's read all of it in one go for efficiency.
5. x86 instruction decoding is somewhat expensive, and if we
know that it must definitely be done before we acquire the
BQL, we can actually move decoding to before the lock. The
simplest case for this are APIC accesses, so we start with
those.
6. APIC access fast paths. EPT faults are somewhat expensive
to process. This includes xAPIC MMIO. We can skip walking
the memory graph and fully software-emulating the faulting
instruction by obtaining the access type and MMIO offset
provided by VMX/HVF's virtualised APIC access feature, and
detecting common MOV instructions to call the APIC register
read/write functions directly.
7. To enable the APIC_ACCESS VM exit type, both the APIC base
and the relevant ctl must be set in the VMCS; this change
implements the former and toggles the latter according to
APIC state, thus enabling the fast path implemented in patch
6 and also adding better support for APIC relocation.
8. This fixes a technically incorrect variable type. No change
in functionality.
9. The HVF "failed to decode instruction" error prints
instruction bytes in variable-length hex; it's easier to
read when the hex digits are always paired, even if the
first one is 0.
10. This APIC change replaces a magic number with a symbolic
constant and removes a comment which is no longer accurate.
11. This change causes the software APIC to raise an exception
when the guest attempts to write to the reserved bits of the
APICBASE MSR.
In combination, the patches improve QEMU/hvf's relevant
kvm-unit-tests results in the APIC/xAPIC/x2APIC suites (after
disabling the crashing test_pv_ipi test) from:
FAIL apic-split (56 tests, 9 unexpected failures, 2 skipped)
FAIL x2apic (56 tests, 9 unexpected failures, 2 skipped)
FAIL xapic (45 tests, 7 unexpected failures, 3 skipped)
To:
FAIL apic-split (56 tests, 2 unexpected failures, 2 skipped)
FAIL x2apic (56 tests, 2 unexpected failures, 2 skipped)
FAIL xapic (45 tests, 2 unexpected failures, 3 skipped)
The remaining test failures are, in each case:
FAIL: multiple nmi (received: 1)
FAIL: TMCCT should stay at zero
These are not regressions, they were part of the original
test failures; APIC logic relevant to them was not touched
in this patch set.
Performance improvements vary a lot depending on workload.
MMIO-based APIC accesses are most affected, so APIC-heavy
loads with x2apic disabled benefit most.
macOS guests seem to use the APIC very heavily, so it comes
as no surprise that the improvement is noticeable, with
the guest OS booting in about 33 seconds with xAPIC, down
from about 43 with unmodified QEMU 9.2-rc3. Even with x2APIC
enabled, the changes shaved about 1-2 seconds off boot times.
"Warm" builds of QEMU 9.2-rc3 with
make clean ; time ( ./configure --target-list=x86_64-softmmu ; make -j4)
inside an OpenSUSE 15.6 VM with -smp 4 and -cpu host,-x2apic
on my 8-core MacBook Pro 2019 with macOS 14 repeatably went from
2m 58s (+/- 1.2s) real, 8m 39s (+/- 2s) user, 1m 14s (+/- <1s) sys
with 9.2-rc3 to
2m 56s (+/- 1.2s) real, 8m 32s (+/- 2s) user, 1m 13s (+/- <1s) sys
with this patch set applied. It's not a huge improvement
(>1% for real/user times, 2% for sys) but the runs are very
consistently in favour of the modifications.
My wall-clock measurements with boot times for Windows 10,
Linux, and FreeBSD guests were all within margin of error,
though profiling the QEMU process did indicate a reduction of
vCPU thread time spent in bql_lock().
I did not spot any performance regressions, but I can't
entirely rule them out either.
There is scope for similar optimisations that yield more widely
applicable improvements. In particular EPT fault MMIO handling
can be split into non-BQL and BQL portions. I measured around 1-2%
additional improvement with a naive implementation of this, but the
hvf memory slot data structure needs to be made RCU-safe before such
an optimisation can be upstreamed.
I have done my best to do a reasonable amount of testing with
these patches, in particular I have tested Linux, FreeBSD,
macOS, Windows 10, and Windows XP guests on x86-64, each with
x2APIC alternately enabled and disabled. On aarch64, I have
checked Linux and macOS (VMApple) guests. All of this on
macOS hosts.
Phil Dennis-Jordan (11):
hvf: Add facility for initialisation code prior to first vCPU run
arm/hvf: Initialise GICv3 state just before first vCPU run
i386/hvf: Don't send signal to thread when kicking
i386/hvf: Pre-fetch emulated instructions
i386/hvf: Decode APIC access x86 instruction outside BQL
i386/hvf: APIC access exit with fast-path for common mov cases
i386/hvf: Enables APIC_ACCESS VM exits by setting APICBASE
i386/hvf: Variable type fixup in decoder
i386/hvf: Print hex pairs for each opcode byte in decode error
hw/intc/apic: Fixes magic number use, removes outdated comment
hw/intc/apic: Raise exception when setting reserved APICBASE bits
accel/hvf/hvf-accel-ops.c | 5 ++
hw/intc/apic.c | 12 +++--
include/hw/i386/apic.h | 2 +
include/sysemu/hvf_int.h | 1 +
meson.build | 1 +
target/arm/hvf/hvf.c | 21 +++++---
target/i386/hvf/hvf.c | 50 +++++++++++++----
target/i386/hvf/trace-events | 9 ++++
target/i386/hvf/trace.h | 1 +
target/i386/hvf/x86_decode.c | 24 ++++++---
target/i386/hvf/x86_decode.h | 5 +-
target/i386/hvf/x86_emu.c | 102 +++++++++++++++++++++++++++++++++++
target/i386/hvf/x86_emu.h | 2 +
13 files changed, 206 insertions(+), 29 deletions(-)
create mode 100644 target/i386/hvf/trace-events
create mode 100644 target/i386/hvf/trace.h
--
2.39.3 (Apple Git-146)