[Qemu-devel] [RFC v4 0/9] Slow-path for atomic instruction translation
From: Alvise Rigo
Subject: [Qemu-devel] [RFC v4 0/9] Slow-path for atomic instruction translation
Date: Fri, 7 Aug 2015 19:03:06 +0200
This is the fourth iteration of the patch series which applies to the
upstream branch of QEMU (v2.4.0-rc0).
Changes versus previous versions are at the bottom of this cover letter.
The code is also available at the following repository:
https://git.virtualopensystems.com/dev/qemu-mt.git
branch:
slowpath-for-atomic-v4-no-mttcg
(To enable the new slow path, configure QEMU with:
./configure --enable-tcg-ldst-excl ...)
This patch series provides an infrastructure for atomic instruction
implementation in QEMU, thus offering a 'legacy' solution for
translating guest atomic instructions. Moreover, it can be considered as
a first step toward a multi-thread TCG.
The underlying idea is to provide new TCG instructions that guarantee
atomicity to some memory accesses or in general a way to define memory
transactions. More specifically, new versions of the TCG
qemu_{ld,st}_{i32,i64} instructions have been implemented that behave as
a pair of LoadLink and StoreConditional atomic instructions. The
implementation heavily uses the software TLB together with a new bitmap
that has been added to the ram_list structure. The bitmap flags, on a
per-CPU basis, all the memory pages that are in the middle of a
LoadLink (LL)/StoreConditional (SC) operation.
Since all these pages could otherwise be accessed directly through the
fast-path, altering a vCPU's linked value, the new bitmap has been
coupled with a new TLB flag for the TLB virtual address which forces the
slow-path execution for all the accesses to a page containing a linked
address.
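
As a rough illustration of how the bitmap and the TLB flag fit together,
here is a minimal sketch in plain C. It is not the code from the series:
every name and size in it (ExclBitmap, SKETCH_TLB_EXCL, the sketch_*
helpers) is hypothetical, the accessors are not atomic, and the TLB flush
of the other vCPUs is left out. It only assumes the convention stated in
the changelog (1 = dirty, 0 = exclusive) and the rule that a page gets the
exclusive TLB flag as soon as at least one vCPU has its bit cleared.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical names and sizes, chosen only for this sketch. */
#define SKETCH_NUM_CPUS   4u
#define SKETCH_NUM_PAGES  16u
#define SKETCH_ALL_CPUS   ((1u << SKETCH_NUM_CPUS) - 1u)
#define SKETCH_TLB_EXCL   (1u << 0)   /* stand-in for the new TLB flag */

/* One bit per (page, vCPU): 1 = dirty, 0 = exclusive. */
typedef struct {
    uint8_t bits[SKETCH_NUM_PAGES];   /* bit n of bits[p] belongs to vCPU n */
} ExclBitmap;

/* Start with every page dirty for every vCPU (no LL in flight). */
static void sketch_bitmap_init(ExclBitmap *bm)
{
    for (unsigned p = 0; p < SKETCH_NUM_PAGES; p++) {
        bm->bits[p] = (uint8_t)SKETCH_ALL_CPUS;
    }
}

/* What an LL would do: clear the dirty bit of this vCPU for the page. */
static void sketch_set_exclusive(ExclBitmap *bm, unsigned page, unsigned cpu)
{
    assert(page < SKETCH_NUM_PAGES && cpu < SKETCH_NUM_CPUS);
    bm->bits[page] = (uint8_t)(bm->bits[page] & ~(1u << cpu));
}

/* Mark the page dirty again for this vCPU. */
static void sketch_set_dirty(ExclBitmap *bm, unsigned page, unsigned cpu)
{
    assert(page < SKETCH_NUM_PAGES && cpu < SKETCH_NUM_CPUS);
    bm->bits[page] = (uint8_t)(bm->bits[page] | (1u << cpu));
}

/* True if at least one vCPU has its bit cleared for the page. */
static bool sketch_page_is_exclusive(const ExclBitmap *bm, unsigned page)
{
    assert(page < SKETCH_NUM_PAGES);
    return (bm->bits[page] & SKETCH_ALL_CPUS) != SKETCH_ALL_CPUS;
}

/* Flags a newly generated TLB entry for the page would carry. */
static unsigned sketch_tlb_flags(const ExclBitmap *bm, unsigned page)
{
    return sketch_page_is_exclusive(bm, page) ? SKETCH_TLB_EXCL : 0u;
}
```

In this model, any access that finds SKETCH_TLB_EXCL in the TLB entry
would be diverted to the slow path, where the linked address can be
checked.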
The new slow-path is implemented such that:
- the LL behaves as a normal load slow-path, except for clearing the
dirty flag in the bitmap. The cputlb.c code, while generating a TLB
entry, checks whether at least one vCPU has the bit cleared in the
exclusive bitmap; in that case the TLB entry will have the EXCL
flag set, thus forcing the slow-path. In order to ensure that all the
vCPUs will follow the slow-path for that page, we flush the TLB cache
of all the other vCPUs.
The LL will also set the linked address and size of the access in a
vCPU's private variable. After the corresponding SC, this address will
be set to a reset value.
- the SC can fail, returning 1, or succeed, returning 0. It must always
come after an LL and must access the same address 'linked' by the
previous LL, otherwise it will fail. If, in the time window delimited
by a legitimate pair of LL/SC operations, another write access happens
to the linked address, the SC will fail.
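
The LL/SC rules above can be modelled with a small single-threaded C
sketch. This is only a toy model under stated assumptions, not the helper
code of the series: the names (SketchCPU, sketch_ll, sketch_sc,
sketch_store, SKETCH_EXCL_RESET) are hypothetical, guest memory is a small
word array, accesses are 4 bytes wide, and a conflicting write breaks the
link by scanning the vCPUs directly instead of going through the bitmap
and the TLB flag.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical names and sizes, chosen only for this sketch. */
#define SKETCH_NUM_CPUS    2u
#define SKETCH_MEM_WORDS   64u
#define SKETCH_EXCL_RESET  UINT64_MAX   /* "no linked address" value */

typedef struct {
    uint64_t excl_addr;   /* address linked by the last LL, or reset value */
    unsigned excl_size;   /* size in bytes of the linked access */
} SketchCPU;

static SketchCPU sketch_cpus[SKETCH_NUM_CPUS];
static uint32_t  sketch_mem[SKETCH_MEM_WORDS];   /* toy guest memory */

static void sketch_init(void)
{
    for (unsigned c = 0; c < SKETCH_NUM_CPUS; c++) {
        sketch_cpus[c].excl_addr = SKETCH_EXCL_RESET;
        sketch_cpus[c].excl_size = 0;
    }
    for (unsigned w = 0; w < SKETCH_MEM_WORDS; w++) {
        sketch_mem[w] = 0;
    }
}

/* LL: a normal load that also records the linked address and size. */
static uint32_t sketch_ll(unsigned cpu, uint64_t addr)
{
    assert(cpu < SKETCH_NUM_CPUS);
    assert(addr % 4 == 0 && addr / 4 < SKETCH_MEM_WORDS);
    sketch_cpus[cpu].excl_addr = addr;
    sketch_cpus[cpu].excl_size = 4;
    return sketch_mem[addr / 4];
}

/* Any write to an address resets every link to that address. */
static void sketch_store(uint64_t addr, uint32_t val)
{
    assert(addr % 4 == 0 && addr / 4 < SKETCH_MEM_WORDS);
    for (unsigned c = 0; c < SKETCH_NUM_CPUS; c++) {
        if (sketch_cpus[c].excl_addr == addr) {
            sketch_cpus[c].excl_addr = SKETCH_EXCL_RESET;
        }
    }
    sketch_mem[addr / 4] = val;
}

/* SC: returns 0 on success, 1 on failure; the link is reset either way. */
static int sketch_sc(unsigned cpu, uint64_t addr, uint32_t val)
{
    assert(cpu < SKETCH_NUM_CPUS);
    if (sketch_cpus[cpu].excl_addr != addr) {
        sketch_cpus[cpu].excl_addr = SKETCH_EXCL_RESET;
        return 1;   /* no LL before, wrong address, or link was broken */
    }
    sketch_store(addr, val);   /* also resets this vCPU's own link */
    return 0;
}

/* A guest-style atomic increment: retry until the SC succeeds. */
static uint32_t sketch_atomic_inc(unsigned cpu, uint64_t addr)
{
    uint32_t old;
    do {
        old = sketch_ll(cpu, addr);
    } while (sketch_sc(cpu, addr, old + 1) != 0);
    return old;
}
```

In this model, an SC without a preceding LL, an SC to a different
address, and an SC after an intervening sketch_store() to the linked
address all return 1, while an undisturbed LL/SC pair returns 0 and
performs the store.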
In theory, the provided implementation of TCG LoadLink/StoreConditional
can be used to properly handle atomic instructions on any architecture.
In this series the ARM ldrex/strex instructions (all flavours) are
implemented for arm, aarch64 and i386 hosts.
The code has been tested with bare-metal test cases and by booting Linux.
* Performance considerations
The new slow-path adds some overhead to the translation of the ARM
atomic instructions, since their emulation no longer happens only
in the guest (by means of pure TCG generated code), but requires the
execution of two helper functions. Despite this, the additional time
required to boot an ARM Linux kernel on an i7 clocked at 2.5GHz is
negligible.
On an LL/SC-bound test scenario, however, such as:
https://git.virtualopensystems.com/dev/tcg_baremetal_tests.git - this
solution requires 30% (1 million iterations) and 70% (10 million
iterations) additional time for the test to complete.
Changes from v3:
- based on upstream QEMU
- addressed comments from Alex Bennée
- the slow path can be enabled by the user with
./configure --enable-tcg-ldst-excl, but only if the backend supports it
- all the ARM ldex/stex instructions now make use of the slow path
- added aarch64 TCG backend support
- part of the code has been rewritten
Changes from v2:
- the bitmap accessors are now atomic
- a rendezvous between vCPUs and a simple callback support before executing
a TB have been added to handle the TLB flush support
- the softmmu_template and softmmu_llsc_template have been adapted to work
on real multi-threading
Changes from v1:
- The ram bitmap is not reversed anymore, 1 = dirty, 0 = exclusive
- The way the offset to access the bitmap is calculated has
been improved and fixed
- A page to be set as dirty requires a vCPU to target the protected address
and not just an address in the page
- Addressed comments from Richard Henderson to improve the logic in
softmmu_template.h and to simplify the methods generation through
softmmu_llsc_template.h
- Added initial implementation of qemu_{ldlink,stcond}_i32 for tcg/i386
This work has been sponsored by Huawei Technologies Duesseldorf GmbH.
Alvise Rigo (9):
exec.c: Add new exclusive bitmap to ram_list
softmmu: Add new TLB_EXCL flag
softmmu: Add helpers for a new slowpath
tcg-op: create new TCG qemu_{ld,st} excl variants
configure: Enable/disable new qemu_{ld,st} excl insns
tcg-i386: Implement excl variants of qemu_{ld,st}
tcg-arm: Implement excl variants of qemu_{ld,st}
tcg-aarch64: Implement excl variants of qemu_{ld,st}
target-arm: translate: Use ld/st excl for atomic insns
configure | 21 ++++++
cputlb.c | 42 ++++++++++-
exec.c | 7 +-
include/exec/cpu-all.h | 8 ++
include/exec/cpu-defs.h | 12 +++
include/exec/memory.h | 3 +-
include/exec/ram_addr.h | 68 +++++++++++++++++
softmmu_llsc_template.h | 124 ++++++++++++++++++++++++++++++
softmmu_template.h | 124 ++++++++++++++++++++++++------
target-arm/translate.c | 191 +++++++++++++++++++++++++++++++++++++++++++++--
tcg/aarch64/tcg-target.c | 99 ++++++++++++++++++++++--
tcg/arm/tcg-target.c | 152 +++++++++++++++++++++++++++++--------
tcg/i386/tcg-target.c | 148 +++++++++++++++++++++++++++++++-----
tcg/tcg-be-ldst.h | 1 +
tcg/tcg-op.c | 65 ++++++++++++++++
tcg/tcg-op.h | 4 +
tcg/tcg-opc.h | 8 ++
tcg/tcg.c | 2 +
tcg/tcg.h | 32 ++++++++
19 files changed, 1018 insertions(+), 93 deletions(-)
create mode 100644 softmmu_llsc_template.h
--
2.5.0