From: Alex Bennée
Subject: Re: [Qemu-devel] [RFC v4 1/9] exec.c: Add new exclusive bitmap to ram_list
Date: Thu, 10 Sep 2015 17:19:38 +0100

alvise rigo <address@hidden> writes:

> Hi Paolo,
>
> A brief update on this. I have a first implementation of the idea you
> proposed, though it's not working very well: the failure rate of SCs
> is, for some reason, very high.

Due to high memory contention on the EXCL page?

> Instead of trying to fix it, I came up with this alternative design:
> we still use 8 bits per page, and we group the smp_cpus in clusters of
> maximum size ceiling(smp_cpus / 8).
> Given a byte describing a certain page, its first bit will represent
> the vCPUs of index 0, 9, 17, ... and so on for the other bits.
>
> Note that systems with eight vCPUs or less will not be affected by
> this simplified model.
>
> Now, unless I'm missing some advantage of your proposal beyond its
> lower memory consumption, I would go with this design.

BTW have you had a look at Emilio's series? I'm about half way through
reviewing it but I'm looking forward to seeing how his implementation
differs from yours. Maybe there are some ideas in there that will
inspire?

>
> Regards,
> alvise
>
> On Wed, Aug 12, 2015 at 2:36 PM, Paolo Bonzini <address@hidden> wrote:
>>
>>
>> On 12/08/2015 09:31, alvise rigo wrote:
>>> I think that tlb_flush_entry is not enough, since in theory another
>>> vCPU could have a different TLB address referring to the same phys
>>> address.
>>
>> You're right, this is a TLB so it's virtually-indexed. :(  I'm not sure
>> what happens on ARM, since it has a virtually indexed (VIVT or VIPT)
>> cache, but indeed it would be a problem when implementing e.g. CMPXCHG
>> using the TCG ll/sc ops.
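>>
>> [For context, a CMPXCHG built on ll/sc is typically a retry loop. A
>> hedged sketch with stand-in ll/sc helpers, not the actual TCG ops:]

```c
#include <stdint.h>
#include <stdbool.h>

/* Stand-in ll/sc primitives for illustration only; the real TCG ops
 * would go through the TLB_EXCL machinery discussed in this thread. */
static uint32_t ll(volatile uint32_t *p) { return *p; }
static bool sc(volatile uint32_t *p, uint32_t v) { *p = v; return true; }
static void clear_exclusive(void) { }

/* cmpxchg emulated as an ll/sc retry loop */
static uint32_t cmpxchg_llsc(volatile uint32_t *p,
                             uint32_t expected, uint32_t desired)
{
    uint32_t old;
    do {
        old = ll(p);              /* load-linked: open exclusive access */
        if (old != expected) {
            clear_exclusive();    /* mismatch: drop the reservation */
            return old;
        }
    } while (!sc(p, desired));    /* store-conditional may fail; retry */
    return old;
}
```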
>>
>> I'm a bit worried about adding such a big bitmap.  It's only used on
>> TCG, but it is also allocated on KVM and on KVM you can have hundreds
>> of VCPUs.  Wasting 200 bits per guest memory page (i.e. ~0.6% of guest
>> memory) is obviously not a great idea. :(
>>
>> Perhaps we can use a bytemap instead:
>>
>> - 0..253 = TLB_EXCL must be set in all VCPUs except CPU n.  A VCPU that
>> loads the TLB for this vaddr does not have to set it.
>>
>> - 254 = TLB_EXCL must be set in all VCPUs.  A VCPU that
>> loads the TLB for this vaddr has to set it.
>>
>> - 255 = TLB_EXCL not set in at least two VCPUs
>>
>> Transitions:
>>
>> - ll transitions: anything -> 254
>>
>> - sc transitions: 254 -> current CPU_ID
>>
>> - TLB_EXCL store transitions: 254 -> current CPU_ID
>>
>> - tlb_st_page transitions: CPU_ID other than current -> 255
>>
>> The initial value is 255 on SMP guests, 0 on UP guests.
>>
>> The algorithms are very similar to yours, just using this approximate
>> representation.
>>
>> ll algorithm:
>>   llsc_value = bytemap[vaddr]
>>   if llsc_value == CPU_ID
>>      do nothing
>>   elseif llsc_value < 254
>>      flush TLB of CPU llsc_value
>>   elseif llsc_value == 255
>>      flush all TLBs
>>   set TLB_EXCL
>>   bytemap[vaddr] = 254
>>   load
>>
>> tlb_set_page algorithm:
>>   llsc_value = bytemap[vaddr]
>>   if llsc_value == CPU_ID or llsc_value == 255
>>      do nothing
>>   else if llsc_value == 254
>>      set TLB_EXCL
>>   else
>>      # two CPUs without TLB_EXCL
>>      bytemap[vaddr] = 255
>>
>> TLB_EXCL slow path algorithm:
>>    if bytemap[vaddr] == 254
>>       bytemap[vaddr] = CPU_ID
>>    else
>>       # two CPUs without TLB_EXCL
>>       bytemap[vaddr] = 255
>>    clear TLB_EXCL in this CPU
>>    store
>>
>> sc algorithm:
>>    if bytemap[vaddr] == CPU_ID or bytemap[vaddr] == 254
>>       bytemap[vaddr] = CPU_ID
>>       clear TLB_EXCL in this CPU
>>       store
>>       succeed
>>    else
>>       fail
>>
>> clear algorithm:
>>    if bytemap[vaddr] == 254
>>       bytemap[vaddr] = CPU_ID
>>
>> The UP case is optimized because bytemap[vaddr] will always be 0 or 254.
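
[The transitions above can be condensed into a small C state machine. This is my sketch of the proposal, with stub TLB-flush hooks standing in for the real QEMU functions:]

```c
#include <stdint.h>
#include <stdbool.h>

/* Bytemap values, per the scheme above: 0..253 = TLB_EXCL set on all
 * vCPUs except CPU n; 254 = set on all vCPUs; 255 = not set on at
 * least two vCPUs.  The flush hooks are stand-ins, not QEMU APIs. */
enum { EXCL_ALL = 254, EXCL_NONE = 255 };

static void flush_tlb_cpu(unsigned cpu) { (void)cpu; }
static void flush_tlb_all(void) { }

static void llsc_ll(uint8_t *bytemap, uint64_t page, unsigned cpu)
{
    uint8_t v = bytemap[page];
    if (v != cpu && v != EXCL_ALL) {
        if (v < EXCL_ALL) {
            flush_tlb_cpu(v);     /* exactly one CPU lacks TLB_EXCL */
        } else {
            flush_tlb_all();      /* EXCL_NONE: unknown set of CPUs */
        }
    }
    bytemap[page] = EXCL_ALL;     /* ll: anything -> 254 */
}

static bool llsc_sc(uint8_t *bytemap, uint64_t page, unsigned cpu)
{
    uint8_t v = bytemap[page];
    if (v == cpu || v == EXCL_ALL) {
        bytemap[page] = cpu;      /* sc: 254 -> current CPU_ID */
        return true;              /* reservation held: store succeeds */
    }
    return false;                 /* an intervening store broke it */
}

static void llsc_excl_store(uint8_t *bytemap, uint64_t page, unsigned cpu)
{
    /* TLB_EXCL slow-path store: 254 -> CPU_ID, else -> 255 */
    bytemap[page] = (bytemap[page] == EXCL_ALL) ? cpu : EXCL_NONE;
}
```

[With this shape, a UP guest only ever sees values 0 and 254, so llsc_ll never flushes, matching the optimization noted above.]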
>>
>> The algorithm requires the LL to be cleared e.g. on exceptions.
>> Paolo
>>
>>> alvise
>>>
>>> On Tue, Aug 11, 2015 at 6:32 PM, Paolo Bonzini <address@hidden> wrote:
>>>>
>>>>
>>>> On 11/08/2015 18:11, alvise rigo wrote:
>>>>>>> Why flush the entire cache (I understand you mean TLB)?
>>>>> Sorry, I meant the TLB.
>>>>> If, for each removal of an exclusive entry, we also set the bit to 1,
>>>>> we force the following LL to do a tlb_flush() on every vCPU.
>>>>
>>>> What if you only flush one entry with tlb_flush_entry (on every vCPU)?
>>>>
>>>> Paolo

-- 
Alex Bennée


