Re: [Qemu-devel] [PATCH] aarch64: use TSX for ldrex/strex
From: Emilio G. Cota
Subject: Re: [Qemu-devel] [PATCH] aarch64: use TSX for ldrex/strex
Date: Wed, 24 Aug 2016 17:12:40 -0400
User-agent: Mutt/1.5.23 (2014-03-12)
On Thu, Aug 18, 2016 at 08:38:47 -0700, Richard Henderson wrote:
> A couple of other notes, as I've thought about this some more.
Thanks for spending time on this.
I have a new patchset (will send as a reply to this e-mail in a few
minutes) that has good performance. Its main ideas:
- Use transactions that start on ldrex and finish on strex. On
an exception, end (instead of abort) the ongoing transaction,
if any. There's little point in aborting, since the subsequent
retries will end up in the same exception anyway. This means
the translation of the corresponding blocks might happen via
the fallback path. That's OK, given that subsequent executions
of the TBs will (likely) complete via HTM.
- For the fallback path, add a stop-the-world primitive that stops
all other CPUs, without requiring the calling CPU to exit the CPU loop.
Not breaking from the loop keeps the code simple--we can just
keep translating/executing normally, with the guarantee that
no other CPU can run until we're done.
- The fallback path of the transaction stops the world and then
continues execution (from ldrex) as the only running CPU.
- Only retry when the hardware hints that we may do so. This
ends up being rare (I can only get dozens of retries under
heavy contention, for instance with 'atomic_add-bench -r 1').
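To make the flow above concrete, here is a minimal sketch of the decision taken at ldrex: try to start a transaction, retry only while the abort status carries the retry hint and a retry budget remains, otherwise stop the world and continue alone. This is not code from the patchset. The names excl_begin, struct htm_ops and struct fake_htm are made up for illustration, and the hardware is replaced by a scripted stand-in so the logic can be exercised on any machine. The two status constants mirror the values of _XBEGIN_STARTED and _XABORT_RETRY from Intel's <immintrin.h>.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Same values as Intel RTM's _XBEGIN_STARTED and _XABORT_RETRY. */
#define HTM_STARTED     (~0u)
#define HTM_ABORT_RETRY (1u << 1)

/* Hypothetical hooks: a real build would wrap _xbegin() and the
 * stop-the-world primitive behind these. */
struct htm_ops {
    unsigned (*begin)(void *opaque);       /* HTM_STARTED or an abort status */
    void (*stop_the_world)(void *opaque);  /* returns once other CPUs are stopped */
};

enum excl_path { EXCL_PATH_HTM, EXCL_PATH_FALLBACK };

/* Decide how the ldrex..strex sequence will run. *retries reports how
 * many times the transaction was restarted. */
enum excl_path excl_begin(const struct htm_ops *ops, void *opaque,
                          int max_retries, int *retries)
{
    assert(max_retries >= 0);
    *retries = 0;
    for (;;) {
        unsigned status = ops->begin(opaque);
        if (status == HTM_STARTED) {
            return EXCL_PATH_HTM;       /* strex would commit the transaction */
        }
        /* Retry only if the hardware hints it may help, and only up to the limit. */
        if (!(status & HTM_ABORT_RETRY) || *retries >= max_retries) {
            break;
        }
        (*retries)++;
    }
    ops->stop_the_world(opaque);
    return EXCL_PATH_FALLBACK;          /* continue from ldrex as the only running CPU */
}

/* Scripted stand-in for the hardware (illustration only). */
struct fake_htm {
    const unsigned *script;   /* statuses returned by successive begin() calls */
    size_t len;
    size_t pos;
    bool world_stopped;
};

static unsigned fake_begin(void *opaque)
{
    struct fake_htm *f = opaque;
    /* When the script runs out, report an abort without the retry hint. */
    return f->pos < f->len ? f->script[f->pos++] : 0u;
}

static void fake_stop_the_world(void *opaque)
{
    struct fake_htm *f = opaque;
    f->world_stopped = true;
}

const struct htm_ops fake_ops = { fake_begin, fake_stop_the_world };
```

With a script of two retryable aborts followed by a successful start, excl_begin() takes the HTM path after two retries and never stops the world. An abort without the retry hint, or an exhausted retry budget, goes straight to the fallback path.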
Limitations: for now user-mode only, and I have paid no attention
to paired atomics. Also, I'm making no checks for unusual (undefined?)
guest code, such as stray ldrex/strex thrown in there.
Performance optimizations like you suggest (e.g. starting a TB
on ldrex, or using TCG ops for beginning/ending the transaction)
could be implemented, but at least on Intel TSX (the only one I've
tried so far[*]), the transaction buffer seems big enough to not
make these optimizations a necessity.
[*] I tried running HTM primitives on the gcc compile farm's Power8,
but I get an illegal instruction fault on tbegin. I've filed
an issue to report it: https://gna.org/support/?3369
Some observations:
- The peak number of retries I see is for atomic_add-bench -r 1 -n 16
(on an 8-thread machine), at about 90 retries. So I set the limit
to 100.
- The lowest success rate I've seen is ~98%, again for atomic_add-bench
under high contention.
Some numbers:
- atomic_add's performance is lower for HTM vs cmpxchg, although under
contention performance gets very similar. The reason for the perf
gap is that xbegin/xend take more cycles than cmpxchg, especially
under little or no contention; this explains the large difference
for threads=1.
http://imgur.com/5kiT027
As a side note, contended transactions seem to scale worse than contended
cmpxchg when exploiting SMT. But anyway I wouldn't read much into
that.
- For more realistic workloads that gap goes away, as the relative impact
of cmpxchg or transaction delays is lower. For QHT, 1000 keys:
http://imgur.com/l6vcowu
And for SPEC (note that despite being single-threaded, SPEC executes
a lot of atomics, e.g. from mutexes and from forking):
http://imgur.com/W49YMhJ
Performance is essentially identical to that of cmpxchg, but of course
with HTM we get correct emulation.
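For readers wondering what "correct emulation" refers to: as far as I understand it, the cmpxchg scheme remembers the value read by ldrex and lets strex succeed if memory still holds that value. It therefore cannot notice a location that was changed and then changed back (the ABA case), where a real exclusive monitor would make strex fail. The sketch below shows this with C11 atomics. It is illustrative only and not QEMU's code. The names emu_ldrex, emu_strex and struct excl_state are made up, and the "other CPU" is simulated by plain stores on the same thread.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-CPU state for the cmpxchg-based scheme. */
struct excl_state {
    uint32_t val;   /* value observed by the emulated ldrex */
};

/* Emulated ldrex: load and remember the value. */
uint32_t emu_ldrex(struct excl_state *st, _Atomic uint32_t *addr)
{
    st->val = atomic_load(addr);
    return st->val;
}

/* Emulated strex: succeeds iff memory still holds the remembered value. */
bool emu_strex(struct excl_state *st, _Atomic uint32_t *addr, uint32_t newval)
{
    uint32_t expected = st->val;
    return atomic_compare_exchange_strong(addr, &expected, newval);
}

/* The usual guest pattern: an atomic add built from a ldrex/strex loop.
 * Returns the value seen before the add. */
uint32_t emu_atomic_add(_Atomic uint32_t *addr, uint32_t inc)
{
    struct excl_state st;
    uint32_t old;
    do {
        old = emu_ldrex(&st, addr);
    } while (!emu_strex(&st, addr, old + inc));
    return old;
}

/* ABA: the location goes 1 -> 2 -> 1 between ldrex and strex. The
 * value comparison cannot tell, so the store is (wrongly) allowed. */
bool aba_goes_unnoticed(void)
{
    _Atomic uint32_t mem = 1;
    struct excl_state st;
    uint32_t seen = emu_ldrex(&st, &mem);
    assert(seen == 1);
    atomic_store(&mem, 2);   /* simulated write by another CPU */
    atomic_store(&mem, 1);   /* ...which then restores the old value */
    return emu_strex(&st, &mem, 5);
}
```

With a transaction spanning ldrex to strex, the intervening writes would conflict with the transaction's read set and abort it, so the store would not go through unnoticed.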
Thanks for reading this far!
Emilio