
Re: [Qemu-devel] [PATCH] aarch64: use TSX for ldrex/strex


From: Emilio G. Cota
Subject: Re: [Qemu-devel] [PATCH] aarch64: use TSX for ldrex/strex
Date: Wed, 17 Aug 2016 13:58:00 -0400
User-agent: Mutt/1.5.23 (2014-03-12)

On Wed, Aug 17, 2016 at 10:22:05 -0700, Richard Henderson wrote:
> On 08/15/2016 08:49 AM, Emilio G. Cota wrote:
> >+void HELPER(xbegin)(CPUARMState *env)
> >+{
> >+    uintptr_t ra = GETPC();
> >+    int status;
> >+    int retries = 100;
> >+
> >+ retry:
> >+    status = _xbegin();
> >+    if (status != _XBEGIN_STARTED) {
> >+        if (status && retries) {
> >+            retries--;
> >+            goto retry;
> >+        }
> >+        if (parallel_cpus) {
> >+            cpu_loop_exit_atomic(ENV_GET_CPU(env), ra);
> >+        }
> >+    }
> >+}
> >+
> >+void HELPER(xend)(void)
> >+{
> >+    if (_xtest()) {
> >+        _xend();
> >+    } else {
> >+        assert(!parallel_cpus);
> >+        parallel_cpus = true;
> >+    }
> >+}
> >+
> 
> Interesting idea.
> 
> FWIW, there are two other extant HTM implementations: ppc64 and s390x.  As I
> recall, the s390 (but not the ppc64) transactions do not roll back the fp
> registers.  Which suggests that we need special support within the TCG
> prologue.  Perhaps folding these operations into special TCG opcodes.

I'm not familiar with s390, but as long as the hardware implements
'strong atomicity' ["strong atomicity guarantees atomicity between
transactions and non-transactional code", see
http://acg.cis.upenn.edu/papers/cal06_atomic_semantics.pdf ] then
this approach would work, in the sense that stores wouldn't have to
be instrumented.

Of course, architecture-specific issues such as saving the fp registers,
as you mention for s390, would have to be taken into account.

> I believe that power8 has HTM, and there's one of those in the gcc compile
> farm, so this should be relatively easy to try out.

Good point! I had forgotten about power8. So far my tests have been on a
4-core Skylake. I have an account on the gcc compile farm so I will make use
of it. The power8 machine in the farm has a lot of cores, so this is
pretty exciting.

> We increase the chances of success of the transaction if we minimize the
> amount of non-target code that's executed while the transaction is running.
> That suggests two things:
> 
> (1) that it would be doubly helpful to incorporate the transaction start
> directly into TCG code generation rather than as a helper and

This (and leaving the fallback path in a helper) is simple enough that even
I could do it :-)

> (2) that we should start a new TB upon encountering a load-exclusive, so
> that we maximize the chance of the store-exclusive being a part of the same
> TB and thus have *nothing* extra between the beginning and commit of the
> transaction.

I don't know how to do this. If it's easy to do, please let me know how
(for aarch64 at least, since that's the target I'm using).

I've run some more tests on the Intel machine, and noticed that failed
transactions are very common: up to a 50% abort rate for some SPEC
workloads, counting only the aborts of the "retrying doesn't help" kind.
Bringing that down should definitely help.
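
To make the "retrying doesn't help" distinction concrete, here is a
minimal sketch of how the abort status could be classified. It is not
what the patch does (the helper above retries on any nonzero status).
The ABORT_* constants are local mirrors of the _XABORT_* bits from
<immintrin.h>, redefined so the sketch builds without -mrtm; the
abort_stats struct and should_retry() are hypothetical names made up
for illustration. Whether this would lower the abort rate is untested.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Local mirrors of the RTM abort-status bits (_XABORT_* in <immintrin.h>).
 * Redefined here, under different names, so the sketch compiles on any host.
 */
#define ABORT_EXPLICIT  (1u << 0)   /* aborted by an explicit xabort */
#define ABORT_RETRY     (1u << 1)   /* hardware hints a retry may succeed */
#define ABORT_CONFLICT  (1u << 2)   /* memory conflict with another thread */
#define ABORT_CAPACITY  (1u << 3)   /* transaction footprint overflowed */

/* Hypothetical counters, only to tell the two kinds of abort apart. */
struct abort_stats {
    unsigned long retryable;   /* aborts with the retry hint set */
    unsigned long hopeless;    /* aborts where retrying is not suggested */
};

/*
 * Decide whether a failed transaction is worth another attempt.
 * 'status' is the value returned by a failed _xbegin().
 */
bool should_retry(unsigned status, int retries_left, struct abort_stats *st)
{
    assert(retries_left >= 0);

    if (status & ABORT_RETRY) {
        st->retryable++;
        return retries_left > 0;
    }
    /* No retry hint (e.g. capacity abort, or status == 0): fall back. */
    st->hopeless++;
    return false;
}
```

In the xbegin helper, this would stand in for the "status && retries"
test, with the fallback path taken as soon as should_retry() returns
false.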

Another thing I found out is that abusing tcg_exec_step (as the patch does
right now) for the fallback path is a bad idea: when there are many failed
transactions, performance drops dramatically (up to 5x overall slowdown).
It turns out that all of this overhead comes from re-translating the code
between ldrex/strex.
Would it be possible to cache this step-by-step code? If not, then an
alternative would be to have a way to stop the world *without* leaving
the CPU loop for the calling thread. I'm more comfortable doing the latter
due to my glaring lack of TCG competence.

Thanks,

                Emilio


