[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
Re: [Qemu-devel] [RFC v5 0/6] Slow-path for atomic instruction translati
From: |
Paolo Bonzini |
Subject: |
Re: [Qemu-devel] [RFC v5 0/6] Slow-path for atomic instruction translation |
Date: |
Wed, 30 Sep 2015 06:44:32 +0200 |
User-agent: |
Mozilla/5.0 (X11; Linux x86_64; rv:38.0) Gecko/20100101 Thunderbird/38.2.0 |
On 24/09/2015 10:32, Alvise Rigo wrote:
> The implementation heavily uses the software TLB together with a new
> bitmap that has been added to the ram_list structure which flags, on a
> per-CPU basis, all the memory pages that are in the middle of a LoadLink
> (LL), StoreConditional (SC) operation. Since all these pages can be
> accessed directly through the fast-path and alter a vCPU's linked value,
> the new bitmap has been coupled with a new TLB flag for the TLB virtual
> address which forces the slow-path execution for all the accesses to a
> page containing a linked address.
Alvise, Emilio,
I have a doubt about your patches for ll/sc emulation, that I hope you
can clarify.
>From 10000ft, both approaches rely on checking a flag during stores.
This is split between the TLB and the CPUState for Alvise's patches (in
order to exploit the existing fast-path checks), and entirely in the
radix tree for Emilio's. However, the idea is the same.
Now, the patch are okay for serial emulation, but I am not sure if it's
possible to do lock-free ll/sc emulation, because there is a race.
If we check the flag before the store, the race is as follows:
CPU0 CPU1
-------------------------------------------------------
check flag
load locked:
set flag
load value (normal load on CPU)
store
store conditional (normal store on CPU)
where the sc doesn't fail. For completeness, if we check it afterwards
(which would be possible with Emilio's approach, though not for the
TLB-based one):
CPU0 CPU1
------------------------------------------------------
load locked
set bit
load value (normal load on CPU)
store
store conditional (normal store on CPU)
check flag
and again the sc doesn't fail.
Most solutions I can think of are impractical:
- hardware ll/sc in CPU1. x86 doesn't have it.
- hardware transactional memory in CPU0, checking the bit after the
store and abort the transaction (I think). HTM just doesn't exist.
- some kind of store-in-progress (SIP) flag that ll can test and force
failure of the corresponding sc. For example, each CPU could store a
(last_store_address, last_store_value) tuple. If the value that LL loads
disagrees with any CPU, the LL would direct the SC to fail. A store
would look like:
store value to last_store_value
smp_wmb()
store address to last_store_address
smp_mb()
load TLB or radix tree
The memory barrier orders the store to the SIP flag and the load from
the TLB, and is probably too expensive. :(
- some array of atomic global generation counts, incremented
unconditionally on every store and checked between ll and sc. Cacheline
bounce fiesta, hence extremely slow. :(
Tell me I'm wrong. :)
If I'm right, we can still keep the opcodes and implement them with a
simple cmpxchg. It would provide a nice generic tool to implement
atomic operations, and it will work correctly if the target has ll/sc.
However, ll/sc-on-cmpxchg (e.g., ARM-on-x86) would be susceptible to the
ABA problem.
Paolo
- Re: [Qemu-devel] [RFC v5 0/6] Slow-path for atomic instruction translation,
Paolo Bonzini <=