Re: [Qemu-devel] [RFC v3 00/13] Slow-path for atomic instruction transla

From:	Frederic Konrad
Subject:	Re: [Qemu-devel] [RFC v3 00/13] Slow-path for atomic instruction translation
Date:	Fri, 10 Jul 2015 10:39:30 +0200
User-agent:	Mozilla/5.0 (X11; Linux x86_64; rv:24.0) Gecko/20100101 Thunderbird/24.5.0

On 10/07/2015 10:23, Alvise Rigo wrote:

This is the third iteration of the patch series; the changes that move
the whole work to multi-threading start at PATCH 007.
Changes versus previous versions are at the bottom of this cover letter.

This patch series provides an infrastructure for atomic
instruction implementation in QEMU, paving the way for TCG multi-threading.
The adopted design does not rely on host atomic
instructions and is intended to propose a 'legacy' solution for
translating guest atomic instructions.

The underlying idea is to provide new TCG instructions that guarantee
atomicity for some memory accesses or, more generally, a way to define
memory transactions. More specifically, a new pair of TCG instructions is
implemented, qemu_ldlink_i32 and qemu_stcond_i32, which behave as
LoadLink and StoreConditional primitives (only the 32-bit variant is
implemented). To achieve this, a new bitmap is added to the
ram_list structure (always unique) which flags all memory pages that
cannot be accessed directly through the fast-path because of previous
exclusive operations. This new bitmap is coupled with a new TLB flag
which forces slow-path execution. Any store performed by another vCPU
to the same (protected) address between a LoadLink and its
StoreConditional makes that StoreConditional fail.
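
To make the bitmap idea concrete, here is a minimal standalone sketch of a
per-page exclusive bitmap with atomic accessors. It is not the code from the
series: the names (excl_bitmap, page_set_exclusive, page_set_dirty,
page_is_exclusive), the 4 KiB page size and the toy RAM size are all
assumptions made for illustration. It follows the encoding stated in the
changelog (1 = dirty, 0 = exclusive) and uses C11 atomics rather than QEMU's
own helpers.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define PAGE_BITS 12u   /* assumption: 4 KiB guest pages */
#define NUM_PAGES 1024u /* assumption: size of the toy guest RAM, in pages */
#define WORD_BITS 64u

/* One bit per page: 1 = dirty (fast-path allowed), 0 = exclusive. */
static _Atomic uint64_t excl_bitmap[NUM_PAGES / WORD_BITS];

/* Map a guest address to its page index (hypothetical helper). */
static size_t page_of(uint64_t addr)
{
    size_t page = (size_t)(addr >> PAGE_BITS);
    assert(page < NUM_PAGES);
    return page;
}

/* Start with every page dirty, i.e. no page is protected. */
void excl_bitmap_init(void)
{
    for (size_t i = 0; i < NUM_PAGES / WORD_BITS; i++) {
        atomic_store(&excl_bitmap[i], UINT64_MAX);
    }
}

/* Clear the dirty bit: accesses to this page must take the slow-path. */
void page_set_exclusive(uint64_t addr)
{
    size_t page = page_of(addr);
    uint64_t mask = UINT64_C(1) << (page % WORD_BITS);
    atomic_fetch_and(&excl_bitmap[page / WORD_BITS], ~mask);
}

/* Set the dirty bit: the page may be accessed through the fast-path again. */
void page_set_dirty(uint64_t addr)
{
    size_t page = page_of(addr);
    uint64_t mask = UINT64_C(1) << (page % WORD_BITS);
    atomic_fetch_or(&excl_bitmap[page / WORD_BITS], mask);
}

/* True when the page holding addr is flagged as exclusive. */
bool page_is_exclusive(uint64_t addr)
{
    size_t page = page_of(addr);
    uint64_t word = atomic_load(&excl_bitmap[page / WORD_BITS]);
    return ((word >> (page % WORD_BITS)) & 1u) == 0;
}
```

In this sketch a TLB fill would consult page_is_exclusive() to decide whether
the new entry needs the slow-path flag; how the series actually wires the
bitmap to the TLB flag is not shown here.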

In theory, the provided implementation of TCG LoadLink/StoreConditional
can be used to properly handle atomic instructions on any architecture.

The new slow-path is implemented such that:
- the LoadLink behaves as a normal load slow-path, except that it clears
   the dirty flag in the bitmap. The TLB entries created from now on will
   force the slow-path. To ensure this, we flush the TLB cache of the
   other vCPUs. The vCPU also records the accessed address in a private
   variable, in order to make it visible to the other vCPUs
- the StoreConditional behaves as a normal store slow-path, except that
   it checks whether other vCPUs have set the same exclusive address
All write accesses that are forced onto the 'legacy' slow-path set the
accessed memory page to dirty.
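
The LL/SC semantics described above can be illustrated with a small
single-threaded model. This is only a sketch under stated assumptions, not the
softmmu code from the series: the VCpu struct, the NO_EXCL marker and the
functions load_link, plain_store and store_cond are hypothetical, memory is a
small word array, and the model simply drops another vCPU's reservation when a
store hits its protected address. It omits the TLB flush and the bitmap update
entirely. store_cond returns 0 on success and 1 on failure, mirroring the ARM
strex status convention.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define NUM_VCPUS 4
#define RAM_WORDS 16
#define NO_EXCL   UINT32_MAX /* marker: no pending exclusive access */

/* Per-vCPU private variable holding the address protected by the last LL. */
typedef struct {
    uint32_t excl_addr;
} VCpu;

static VCpu vcpus[NUM_VCPUS];
static uint32_t ram[RAM_WORDS];

/* Reset the model: zeroed memory, no reservations. */
void model_reset(void)
{
    for (int i = 0; i < NUM_VCPUS; i++) {
        vcpus[i].excl_addr = NO_EXCL;
    }
    for (int i = 0; i < RAM_WORDS; i++) {
        ram[i] = 0;
    }
}

/* Drop the reservations that other vCPUs hold on word idx. */
static void break_other_reservations(int cpu, uint32_t idx)
{
    for (int i = 0; i < NUM_VCPUS; i++) {
        if (i != cpu && vcpus[i].excl_addr == idx) {
            vcpus[i].excl_addr = NO_EXCL;
        }
    }
}

/* LoadLink: a normal load that also records the protected address. */
uint32_t load_link(int cpu, uint32_t idx)
{
    assert(cpu >= 0 && cpu < NUM_VCPUS && idx < RAM_WORDS);
    vcpus[cpu].excl_addr = idx;
    return ram[idx];
}

/* Ordinary store taking the slow-path: it invalidates competing LLs. */
void plain_store(int cpu, uint32_t idx, uint32_t val)
{
    assert(cpu >= 0 && cpu < NUM_VCPUS && idx < RAM_WORDS);
    ram[idx] = val;
    break_other_reservations(cpu, idx);
}

/* StoreConditional: stores only if this vCPU's reservation is intact. */
int store_cond(int cpu, uint32_t idx, uint32_t val)
{
    assert(cpu >= 0 && cpu < NUM_VCPUS && idx < RAM_WORDS);
    bool ok = vcpus[cpu].excl_addr == idx;
    vcpus[cpu].excl_addr = NO_EXCL; /* the reservation is consumed either way */
    if (!ok) {
        return 1;
    }
    ram[idx] = val;
    break_other_reservations(cpu, idx);
    return 0;
}
```

For example, if vCPU 0 does load_link on a word, vCPU 1 then stores to the
same word with plain_store, and vCPU 0 finally calls store_cond, the
StoreConditional reports failure and the value written by vCPU 1 is kept.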

In this series only the ARM ldrex/strex instructions are implemented
for ARM and i386 hosts.
The code has been tested with bare-metal test cases and by booting Linux,
using the latest mttcg QEMU branch available at
http://git.greensocs.com/fkonrad/mttcg.git.

branch multi_tcg_v6 at this time.


* Performance considerations
This implementation shows good results while booting a Linux kernel,
where the large number of flushes affects the overall performance. A
complete ARM Linux boot, without any filesystem, takes 30% longer than
with the mttcg implementation, but it has the benefit of offering the
infrastructure to handle atomic instructions on any architecture.
Compared to the current TCG upstream, instead, it is 40% faster with four
vCPUs and 2.1 times faster with 8 vCPUs.
In addition, there is still room to improve this performance, since at
the moment the TLB is flushed quite often, probably more than required.

On the other hand, the test case
https://git.virtualopensystems.com/dev/tcg_baremetal_tests.git
that heavily stresses the LL/SC mechanism, but not so much the TLB-related
part, performs up to 1.9 times faster with 8 cores and one million
iterations when compared with the mttcg implementation.

Changes from v2:
- the bitmap accessors are now atomic
- a rendezvous between vCPUs and a simple callback support before executing
   a TB have been added to handle the TLB flush support

Isn't this exactly what my async_safe_work is supposed to do?

- the softmmu_template and softmmu_llsc_template have been adapted to work
   on real multi-threading

Changes from v1:
- The ram bitmap is not reversed anymore, 1 = dirty, 0 = exclusive
- The way the offset to access the bitmap is calculated has
   been improved and fixed
- Setting a page as dirty requires a vCPU to target the protected address
   and not just any address in the page
- Addressed comments from Richard Henderson to improve the logic in
   softmmu_template.h and to simplify the methods generation through
   softmmu_llsc_template.h
- Added initial implementation of qemu_{ldlink,stcond}_i32 for tcg/i386

This work has been sponsored by Huawei Technologies Duesseldorf GmbH.

Alvise Rigo (13):
   exec: Add new exclusive bitmap to ram_list
   cputlb: Add new TLB_EXCL flag
   softmmu: Add helpers for a new slow-path
   tcg-op: create new TCG qemu_ldlink and qemu_stcond instructions
   target-arm: translate: implement qemu_ldlink and qemu_stcond ops
   target-i386: translate: implement qemu_ldlink and qemu_stcond ops
   ram_addr.h: Make exclusive bitmap accessors atomic
   exec.c: introduce a simple rendezvous support
   cpus.c: introduce simple callback support
   Simple TLB flush wrap to use as exit callback
   Introduce exit_flush_req and tcg_excl_access_lock
   softmmu_llsc_template.h: move to multithreading
   softmmu_template.h: move to multithreading

  cpus.c                  |  39 ++++++++
  cputlb.c                |  33 +++++-
  exec.c                  |  46 +++++++++
  include/exec/cpu-all.h  |   2 +
  include/exec/cpu-defs.h |   8 ++
  include/exec/memory.h   |   3 +-
  include/exec/ram_addr.h |  22 ++++
  include/qom/cpu.h       |  37 +++++++
  softmmu_llsc_template.h | 184 ++++++++++++++++++++++++++++++++++
  softmmu_template.h      | 261 +++++++++++++++++++++++++++++++++++-------------
  target-arm/translate.c  |  87 +++++++++++++++-
  tcg/arm/tcg-target.c    | 121 ++++++++++++++++------
  tcg/i386/tcg-target.c   | 136 +++++++++++++++++++++----
  tcg/tcg-be-ldst.h       |   1 +
  tcg/tcg-op.c            |  23 +++++
  tcg/tcg-op.h            |   3 +
  tcg/tcg-opc.h           |   4 +
  tcg/tcg.c               |   2 +
  tcg/tcg.h               |  20 ++++
  19 files changed, 910 insertions(+), 122 deletions(-)
  create mode 100644 softmmu_llsc_template.h
