
Re: [Qemu-devel] RFC Multi-threaded TCG design document


From: Alex Bennée
Subject: Re: [Qemu-devel] RFC Multi-threaded TCG design document
Date: Mon, 15 Jun 2015 15:25:02 +0100

alvise rigo <address@hidden> writes:

> Hi Alex,
>
> Let me just add one comment.
>
<snip>
>>
>> Memory Barriers
>> ---------------
>>
>> Barriers (sometimes known as fences) provide a mechanism for software
>> to enforce a particular ordering of memory operations from the point
>> of view of external observers (e.g. another processor core). They can
>> apply to all memory operations, or to just loads or just stores.
>>
>> The Linux kernel has an excellent write-up on the various forms of
>> memory barrier and the guarantees they can provide [1].
>>
>> Barriers are often wrapped around synchronisation primitives to
>> provide explicit memory ordering semantics. However they can be used
>> by themselves to provide safe lockless access by ensuring for example
>> a signal flag will always be set after a payload.
>>
>> DESIGN REQUIREMENT: Add a new tcg_memory_barrier op
>>
>> This would enforce a strong load/store ordering so all loads/stores
>> complete at the memory barrier. On single-core non-SMP strongly
>> ordered backends this could become a NOP.
>
> I believe the main problem here is not just about translating guest
> barriers to host barriers, but also about adding barriers in the TCG
> generated code where they are needed i.e. when, in the guest code, the
> synchronization/memory barriers don't wrap atomic instructions.

Not all atomic instructions imply memory barriers. AIUI on ARMv8 you
only get explicit memory barriers if you use the
load-acquire/store-release variants of load/store exclusive.

>
> To give a concrete example, let's suppose a case where we emulate an x86
> guest on ARM (on ARMv8 the situation should not be so complicated).  At
> some point TCG will be asked to translate a Linux spin_lock(), that
> eventually uses arch_spin_lock(). Simplifying a bit, what happens is
> along the lines of:
>
> - barrier() // meaning a compiler barrier
>   - atomic update of the (spin)lock value
> - barrier()
>
> The architecture dependent part is of course the "atomic update of the
> spinlock" implementation, which, on ARM, relies on ldrex/strex
> instructions and eventually issues a full hardware memory barrier (dmb).
> On the other hand, on x86, only the cmpxchg instruction is used
> (coupled with a memory compiler clobber), but no hardware full memory
> barrier is required because of a stronger memory model.  I'm pretty sure
> that the TCG code generated from spin_lock() will not be the same as the
> one present in an ARM kernel binary compiled with the latest
> GCC, but still, that full memory barrier is likely to be required also
> in the TCG generated code.
>
> Now the question could be: looking at the bare flow of asm x86
> instructions used to implement spin_lock(), how can we deduce that a dmb
> instruction has to be added after the atomic instructions?  Should we
> pair every guest atomic instruction with a dmb?

I don't think so. We should follow the guest processor's semantics,
which AIUI for x86 means that cmpxchg does enforce memory ordering
across cores when prefixed with the LOCK prefix. At that point we can
prefix the cmpxchg TCG ops with our new tcg_dmb barrier.

Without the LOCK prefix we still guarantee an atomic update but without
any explicit synchronisation between the cores.

In practice Linux at least uses LOCK prefixed cmpxchg instructions in
its synchronisation code.

x86 code will still emit s/m/lfence instructions to ensure external
devices see memory accesses in the right order. These should certainly
cause memory barrier TCG ops to be emitted.

>
> Regards,
> alvise
>

-- 
Alex Bennée


