qemu-devel
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [Qemu-devel] [RFC v5 0/6] Slow-path for atomic instruction translati


From: Paolo Bonzini
Subject: Re: [Qemu-devel] [RFC v5 0/6] Slow-path for atomic instruction translation
Date: Wed, 30 Sep 2015 15:20:59 +0200
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:38.0) Gecko/20100101 Thunderbird/38.2.0


On 30/09/2015 10:14, alvise rigo wrote:
>> From 10000ft, both approaches rely on checking a flag during stores.
>> This is split between the TLB and the CPUState for Alvise's patches (in
>> order to exploit the existing fast-path checks), and entirely in the
>> radix tree for Emilio's.  However, the idea is the same.
>>
>> Now, the patch are okay for serial emulation, but I am not sure if it's
>> possible to do lock-free ll/sc emulation, because there is a race.
> 
> Do you mean to not use any locking mechanism at all at the emulation side?

Not using it in the fast path of the store, at least.

>>
>> If we check the flag before the store, the race is as follows:
>>
>>    CPU0        CPU1
>>    -------------------------------------------------------
>>    check flag
>>                load locked:
>>                   set flag
>>                   load value (normal load on CPU)
>>    store
>>                store conditional (normal store on CPU)
>>
>> where the sc doesn't fail.  For completeness, if we check it afterwards
> 
> Shouldn't this be prevented by the tcg_excl_access_lock in the
> patchseries based on mttcg (branch slowpath-for-atomic-v5-mttcg)?

No, the issue happens already in the fast path, i.e. when checking the TLB.

Have you ever heard of consensus numbers?  Basically, it's a tool to
prove that it is impossible to implement an algorithm X in a wait-free
manner.

For LL/SC/store, consider a simple case with two CPUs, one executing LL
and one executing a store.  This does not require any lock on the LL/SC
side, because there is only one CPU running that code.  It is also okay
if it requires a lock to synchronize between LL/store in the slow path.
 However, we want a wait-free fast path.  If we can formalize the fast
path as a consensus problem, consensus numbers let us prove whether it
can be done or not.

In fact, it's easy for the CPUs to use consensus to decide the outcome
of a subsequent SC instruction.  If the LL comes before the store, the
SC fails.  If the LL comes after the store, the SC succeeds.  Because
there's two CPUs, this consensus problem can be solved with any
primitive whose consensus number is >= 2.

Unfortunately atomic registers (i.e. memory cells, like QEMU's TLBs)
have consensus number 1.  You need test-and-set, compare-and-swap,
atomic writes to two memory locations or something like that.  All very
expensive stuff.

I attach a PDF with some pseudocode examples, and a Promela model of the
same.  (Yes, I was nerd-sniped).

>> If I'm right, we can still keep the opcodes and implement them with a
>> simple cmpxchg.  It would provide a nice generic tool to implement
>> atomic operations, and it will work correctly if the target has ll/sc.
>> However, ll/sc-on-cmpxchg (e.g., ARM-on-x86) would be susceptible to the
>> ABA problem.
> 
> This was one of my fears that led me to the ll/sc approach. I think it
> could be even more probable in emulation since we can't assume the
> distance in time between LLs and SCs to be small to avoid "aba"
> accesses.

In practice cmpxchg works because most primitives and lock-free data
structures are written against cmpxchg, not LL/SC.  ABA is avoided
through garbage collection, RCU or hazard pointers, without relying on
LL/SC semantics.

Again, your patches are still very useful to provide the abstraction.

Paolo

Attachment: llsc.pdf
Description: Adobe PDF document

Attachment: llsc.promela
Description: Text document


reply via email to

[Prev in Thread] Current Thread [Next in Thread]