Re: [Qemu-devel] [PATCH] aarch64: use TSX for ldrex/strex

From:	Emilio G. Cota
Subject:	Re: [Qemu-devel] [PATCH] aarch64: use TSX for ldrex/strex
Date:	Wed, 17 Aug 2016 13:58:00 -0400
User-agent:	Mutt/1.5.23 (2014-03-12)

On Wed, Aug 17, 2016 at 10:22:05 -0700, Richard Henderson wrote:
> On 08/15/2016 08:49 AM, Emilio G. Cota wrote:
> >+void HELPER(xbegin)(CPUARMState *env)
> >+{
> >+    uintptr_t ra = GETPC();
> >+    int status;
> >+    int retries = 100;
> >+
> >+ retry:
> >+    status = _xbegin();
> >+    if (status != _XBEGIN_STARTED) {
> >+        if (status && retries) {
> >+            retries--;
> >+            goto retry;
> >+        }
> >+        if (parallel_cpus) {
> >+            cpu_loop_exit_atomic(ENV_GET_CPU(env), ra);
> >+        }
> >+    }
> >+}
> >+
> >+void HELPER(xend)(void)
> >+{
> >+    if (_xtest()) {
> >+        _xend();
> >+    } else {
> >+        assert(!parallel_cpus);
> >+        parallel_cpus = true;
> >+    }
> >+}
> >+
> 
> Interesting idea.
> 
> FWIW, there are two other extant HTM implementations: ppc64 and s390x.  As I
> recall, the s390 (but not the ppc64) transactions do not roll back the fp
> registers.  Which suggests that we need special support within the TCG
> prologue.  Perhaps folding these operations into special TCG opcodes.

I'm not familiar with s390, but as long as the hardware implements 'strong
atomicity' ["strong atomicity guarantees atomicity between transactions and
non-transactional code", see
http://acg.cis.upenn.edu/papers/cal06_atomic_semantics.pdf ] then this
approach would work, in the sense that stores wouldn't have to be
instrumented.

Of course, architecture-specific issues such as saving the fp registers, as
you mention for s390, would have to be taken into account.

> I believe that power8 has HTM, and there's one of those in the gcc compile
> farm, so this should be relatively easy to try out.

Good point! I had forgotten about power8. So far my tests have been on a
4-core Skylake. I have an account on the gcc compile farm so I will make use
of it. The power8 machine in the farm has a lot of cores, so this is
pretty exciting.

> We increase the chances of success of the transaction if we minimize the
> amount of non-target code that's executed while the transaction is running.
> That suggests two things:
> 
> (1) that it would be doubly helpful to incorporate the transaction start
> directly into TCG code generation rather than as a helper and

This (and leaving the fallback path in a helper) is simple enough that even
I could do it :-)

> (2) that we should start a new TB upon encountering a load-exclusive, so
> that we maximize the chance of the store-exclusive being a part of the same
> TB and thus have *nothing* extra between the beginning and commit of the
> transaction.

I don't know how to do this. If it's easy to do, please let me know how
(for aarch64 at least, since that's the target I'm using).

I've run some more tests on the Intel machine, and noticed that failed
transactions are very common (up to 50% abort rate for some SPEC workloads,
and I count these aborts as "retrying doesn't help" kind of aborts), so
bringing that down should definitely help.
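
To make the "retrying doesn't help" distinction concrete, here is a minimal
sketch of how an abort status could be classified. The bit values mirror the
_XABORT_* macros documented for Intel RTM in <immintrin.h>; the HTM_* names,
struct htm_stats and htm_should_retry() are hypothetical and exist only for
illustration. Note that this policy is stricter than the `status && retries`
check in the quoted patch, which retries on any non-zero status.

```c
#include <stdbool.h>

/*
 * Abort-status bits as documented for Intel RTM; they mirror the
 * _XABORT_* macros in <immintrin.h>. They are redefined here under
 * hypothetical HTM_* names so the sketch builds without -mrtm.
 */
#define HTM_ABORT_EXPLICIT (1u << 0)  /* aborted by an explicit xabort */
#define HTM_ABORT_RETRY    (1u << 1)  /* hardware hints a retry may succeed */
#define HTM_ABORT_CONFLICT (1u << 2)  /* memory conflict with another thread */
#define HTM_ABORT_CAPACITY (1u << 3)  /* read/write set overflowed */

/* Hypothetical per-vCPU counters for the two kinds of abort. */
struct htm_stats {
    unsigned long transient;  /* retrying might help */
    unsigned long permanent;  /* retrying is not expected to help */
};

/*
 * Decide whether to retry after an abort with the given status word.
 * Retries only when the hardware sets the retry hint and the retry
 * budget is not exhausted; the counters are updated either way.
 */
static bool htm_should_retry(unsigned status, int retries_left,
                             struct htm_stats *stats)
{
    if (status & HTM_ABORT_RETRY) {
        stats->transient++;
        return retries_left > 0;
    }
    stats->permanent++;
    return false;
}
```

With counters like these, the ratio of permanent aborts to started
transactions would give the abort rate quoted above, under the assumption
that only aborts without the retry hint are counted.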

Another thing I found out is that abusing tcg_exec_step (as the patch does
right now) for the fallback path is a bad idea: when there are many failed
transactions, performance drops dramatically (up to 5x overall slowdown). It
turns out that all this overhead comes from re-translating the code between
ldrex/strex.
Would it be possible to cache this step-by-step code? If not, then an
alternative would be to have a way to stop the world *without* leaving
the CPU loop for the calling thread. I'm more comfortable doing the latter
due to my glaring lack of TCG competence.
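
To illustrate the caching question, here is a rough sketch of a small
direct-mapped cache keyed by guest PC, so that a repeated fallback over the
same ldrex/strex sequence would not translate the same instructions again.
Everything in it (struct step_block, struct step_cache, step_cache_get(),
fake_translate()) is hypothetical and is not an existing QEMU interface; a
real version would also have to invalidate entries when guest code changes.

```c
#include <stddef.h>
#include <stdint.h>

#define STEP_CACHE_BITS 8
#define STEP_CACHE_SIZE (1u << STEP_CACHE_BITS)

/* Hypothetical stand-in for a translated single-instruction block. */
struct step_block {
    uint64_t pc;   /* guest address the block was translated from */
    int valid;     /* non-zero once the slot holds a translation */
    void *code;    /* opaque pointer to the generated host code */
};

struct step_cache {
    struct step_block slots[STEP_CACHE_SIZE];
    unsigned long hits;
    unsigned long misses;
};

/* Translator callback; assumed to be expensive, hence the cache. */
typedef void *(*translate_fn)(uint64_t pc);

/* AArch64 instructions are 4 bytes wide, so drop the two low bits. */
static unsigned step_cache_index(uint64_t pc)
{
    return (unsigned)((pc >> 2) & (STEP_CACHE_SIZE - 1));
}

/* Return the cached translation for pc, translating only on a miss. */
static void *step_cache_get(struct step_cache *c, uint64_t pc,
                            translate_fn translate)
{
    struct step_block *b = &c->slots[step_cache_index(pc)];

    if (b->valid && b->pc == pc) {
        c->hits++;
        return b->code;
    }
    c->misses++;
    b->pc = pc;
    b->valid = 1;
    b->code = translate(pc);  /* evicts whatever used this slot before */
    return b->code;
}

/* Dummy translator for illustration: counts calls, returns pc as a tag. */
static unsigned long fake_translate_calls;

static void *fake_translate(uint64_t pc)
{
    fake_translate_calls++;
    return (void *)(uintptr_t)pc;
}
```

A direct-mapped table keeps the lookup to one index computation and one
compare, which matters if the fallback path is hit as often as the abort
rates above suggest; the price is that two PCs mapping to the same slot
evict each other.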

Thanks,

                Emilio
