qemu-devel
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [Qemu-devel] [RFC v5 0/6] Slow-path for atomic instruction translati


From: alvise rigo
Subject: Re: [Qemu-devel] [RFC v5 0/6] Slow-path for atomic instruction translation
Date: Wed, 30 Sep 2015 10:14:21 +0200

Hi Paolo,

On Wed, Sep 30, 2015 at 6:44 AM, Paolo Bonzini <address@hidden> wrote:
>
>
> On 24/09/2015 10:32, Alvise Rigo wrote:
>> The implementation heavily uses the software TLB together with a new
>> bitmap that has been added to the ram_list structure which flags, on a
>> per-CPU basis, all the memory pages that are in the middle of a LoadLink
>> (LL), StoreConditional (SC) operation.  Since all these pages can be
>> accessed directly through the fast-path and alter a vCPU's linked value,
>> the new bitmap has been coupled with a new TLB flag for the TLB virtual
>> address which forces the slow-path execution for all the accesses to a
>> page containing a linked address.
>
> Alvise, Emilio,
>
> I have a doubt about your patches for ll/sc emulation, that I hope you
> can clarify.
>
> From 10000ft, both approaches rely on checking a flag during stores.
> This is split between the TLB and the CPUState for Alvise's patches (in
> order to exploit the existing fast-path checks), and entirely in the
> radix tree for Emilio's.  However, the idea is the same.
>
> Now, the patch are okay for serial emulation, but I am not sure if it's
> possible to do lock-free ll/sc emulation, because there is a race.

Do you mean to not use any locking mechanism at all at the emulation side?

>
> If we check the flag before the store, the race is as follows:
>
>    CPU0        CPU1
>    -------------------------------------------------------
>    check flag
>                load locked:
>                   set flag
>                   load value (normal load on CPU)
>    store
>                store conditional (normal store on CPU)
>
> where the sc doesn't fail.  For completeness, if we check it afterwards

Shouldn't this be prevented by the tcg_excl_access_lock in the
patchseries based on mttcg (branch slowpath-for-atomic-v5-mttcg)?
Consider also that CPU0 will always finish its store operation before
the transition "flag not set -> flag set" finishes.

> (which would be possible with Emilio's approach, though not for the
> TLB-based one):
>
>    CPU0        CPU1
>    ------------------------------------------------------
>                load locked
>                   set bit
>                   load value (normal load on CPU)
>    store
>                store conditional (normal store on CPU)
>    check flag
>
> and again the sc doesn't fail.
>
> Most solutions I can think of are impractical:
>
> - hardware ll/sc in CPU1. x86 doesn't have it.
>
> - hardware transactional memory in CPU0, checking the bit after the
> store and abort the transaction (I think).  HTM just doesn't exist.
>
> - some kind of store-in-progress (SIP) flag that ll can test and force
> failure of the corresponding sc.  For example, each CPU could store a
> (last_store_address, last_store_value) tuple. If the value that LL loads
> disagrees with any CPU, the LL would direct the SC to fail.  A store
> would look like:
>
>          store value to last_store_value
>          smp_wmb()
>          store address to last_store_address
>          smp_mb()
>          load TLB or radix tree
>
> The memory barrier orders the store to the SIP flag and the load from
> the TLB, and is probably too expensive. :(

Umm, I agree with you that this could be too expensive.

>
> - some array of atomic global generation counts, incremented
> unconditionally on every store and checked between ll and sc.  Cacheline
> bounce fiesta, hence extremely slow. :(
>
> Tell me I'm wrong. :)
>
> If I'm right, we can still keep the opcodes and implement them with a
> simple cmpxchg.  It would provide a nice generic tool to implement
> atomic operations, and it will work correctly if the target has ll/sc.
> However, ll/sc-on-cmpxchg (e.g., ARM-on-x86) would be susceptible to the
> ABA problem.

This was one of my fears that led me to the ll/sc approach. I think it
could be even more probable in emulation since we can't assume the
distance in time between LLs and SCs to be small to avoid "aba"
accesses.
We could solve this issue once again with an access counter, but this
would require to increment it in the fast path, which will be kill the
performance.

Regards,
alvise

>
> Paolo



reply via email to

[Prev in Thread] Current Thread [Next in Thread]