Re: [Qemu-devel] outlined TLB lookup on x86


From: Xin Tong
Subject: Re: [Qemu-devel] outlined TLB lookup on x86
Date: Wed, 22 Jan 2014 09:28:48 -0600

On Wed, Nov 27, 2013 at 8:12 PM, Richard Henderson <address@hidden> wrote:
> On 11/27/2013 08:41 PM, Xin Tong wrote:
>> I am trying to implement an out-of-line TLB lookup for QEMU softmmu-x86-64 on
>> an x86-64 machine, potentially for better instruction cache performance. I
>> have a few questions.
>>
>> 1. I see that tcg_out_qemu_ld_slow_path/tcg_out_qemu_st_slow_path are
>> generated when tcg_out_tb_finalize is called. When a TLB lookup misses, the
>> generated code jumps to the slow path, the slow path refills the TLB, performs
>> the load/store, and jumps to the next emulated instruction. I am wondering
>> whether it is easy to outline the code for the slow path.
>
> Hard.  There's quite a bit of code on that slow path that's unique to the
> surrounding code context -- which registers contain inputs and outputs, where
> to continue after slow path.
>
> The amount of code that's in the TB slow path now is approximately minimal, as
> far as I can see.  If you've got an idea for improvement, please share.  ;-)
>
>
>> I am thinking that when the TLB misses, the outlined TLB
>> lookup code should generate a call out to the qemu_ld/st_helpers[opc &
>> ~MO_SIGN] and rewalk the TLB after it is refilled? This code is off the
>> critical path, so it is not as important as the code for the TLB hit case.
>
> That would work for true TLB misses to RAM, but does not work for
> memory-mapped I/O.
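
(For concreteness, a simplified, self-contained model of what the inlined fast
path and the slow-path hand-off do -- my own standalone C, not QEMU's actual
generated code or its CPUTLBEntry/tlb_table structures:)

    /* Direct-mapped softmmu TLB, one entry per index, as QEMU uses today. */
    #include <stdint.h>
    #include <stddef.h>

    #define TLB_BITS  8                      /* 1 << 8 = 256 entries */
    #define TLB_SIZE  (1 << TLB_BITS)
    #define PAGE_BITS 12
    #define PAGE_MASK (~(((uint64_t)1 << PAGE_BITS) - 1))

    typedef struct {
        uint64_t  tag;      /* page-aligned guest vaddr of the cached mapping */
        uintptr_t addend;   /* host address = guest vaddr + addend on a hit */
    } tlb_entry;

    static tlb_entry tlb[TLB_SIZE];

    static void *guest_load_addr(uint64_t vaddr)
    {
        size_t idx = (vaddr >> PAGE_BITS) & (TLB_SIZE - 1);   /* shift + mask */
        if ((vaddr & PAGE_MASK) == tlb[idx].tag) {            /* tag compare */
            return (void *)(uintptr_t)(vaddr + tlb[idx].addend);   /* hit */
        }
        /* Miss: the generated code jumps to the per-site slow path, which
         * calls the load/store helper; the helper refills the entry or takes
         * the memory-mapped I/O path, then control returns after the access. */
        return NULL;
    }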
>
>> 2. Why not use a TLB of bigger size? Currently the TLB has 1<<8 entries. The
>> TLB lookup is 10 x86 instructions, but every miss needs ~450 instructions (I
>> measured this using Intel PIN), so even if the miss rate is low (say 3%), the
>> overall time spent in cpu_x86_handle_mmu_fault is still significant.
>
> I'd be interested to experiment with different TLB sizes, to see what effect
> that has on performance.  But I suspect that the lack of TLB contexts means that we
> wind up flushing the TLB more often than real hardware does, and therefore a
> larger TLB merely takes longer to flush.
>
> But be aware that we can't simply make the change universally.  E.g. ARM can
> use an immediate 8-bit operand during the TLB lookup, but would have to use
> several insns to perform a 9-bit mask.
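
(The entry count leaks into every inlined fast path because the index mask is
emitted as an immediate; roughly, per lookup:)

    /* Index step of the fast path.  With 256 entries the mask is 0xff, which
     * fits ARM's 8-bit rotated immediates; a 512- or 4096-entry table needs a
     * 9- or 12-bit mask, which costs extra instructions on such hosts. */
    static inline unsigned tlb_index(uint64_t vaddr, unsigned tlb_bits)
    {
        return (unsigned)((vaddr >> 12) & ((1u << tlb_bits) - 1));
    }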


Hi Richard

I've done some experiments on increasing the size of the TLB.
Increasing it from 256 entries to 4096 entries gives a
significant performance improvement on the SPECint2006 benchmarks on
qemu-system-x86_64 running on an x86_64 Linux machine. I am in the
process of exploring more TLB sizes and will post the data after I am
done.
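
(A naive version of that experiment is roughly the one-liner below, assuming
the entry count is still controlled by the CPU_TLB_BITS define in cpu-defs.h;
backends that bake the old mask or table offsets into generated code would
also need adjusting, per the ARM point above.)

    /* Sketch only: grow the softmmu TLB from 2^8 to 2^12 entries per mmu_idx. */
    #define CPU_TLB_BITS 12                   /* was 8 */
    #define CPU_TLB_SIZE (1 << CPU_TLB_BITS)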

Can you tell me whether ARM is the only architecture that requires
special treatment for increasing the TLB size beyond 256 entries, so that I
can whip up a patch for the QEMU mainline?

Thank you,
Xin
>
>>  I am
>> thinking the TLB may need to be organized in a set-associative fashion to
>> reduce conflict misses, e.g. 2-way set-associative to reduce the miss rate, or
>> have a victim TLB that is 4-way associative and use x86 SIMD instructions to
>> do the lookup once the direct-mapped TLB misses. Has anybody done any work on
>> this front?
>
> Even with SIMD, I don't believe you could make the fast path of a
> set-associative lookup fast.  This is the sort of thing for which you really
> need the dedicated hardware of the real TLB.  Feel free to prove me wrong
> with code, of course.
>
>
> r~
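
(For what it's worth, the victim-TLB idea could be prototyped on the C slow
path before touching the generated code: on a fast-path miss, probe a small
set-associative side buffer before doing the full refill. A rough standalone
sketch with my own simplified structures, not QEMU's; a SIMD variant would
compare the four tags of a set with one packed compare instead of the loop:)

    #include <stdint.h>

    #define PAGE_BITS 12
    #define PAGE_MASK (~(((uint64_t)1 << PAGE_BITS) - 1))
    #define VTLB_WAYS 4
    #define VTLB_SETS 16                     /* 64 victim entries in total */

    typedef struct {
        uint64_t  tag;                       /* page-aligned guest vaddr */
        uintptr_t addend;                    /* host = guest + addend */
    } vtlb_entry;

    static vtlb_entry vtlb[VTLB_SETS][VTLB_WAYS];
    static unsigned   vtlb_rr[VTLB_SETS];    /* round-robin victim pointer */

    /* Probe on a primary-TLB miss; returns 1 and sets *addend on a hit.
     * (A real version needs an "invalid" marker so an all-zero entry cannot
     * spuriously match guest page 0.) */
    static int vtlb_lookup(uint64_t vaddr, uintptr_t *addend)
    {
        uint64_t tag = vaddr & PAGE_MASK;
        unsigned set = (vaddr >> PAGE_BITS) & (VTLB_SETS - 1);
        for (int way = 0; way < VTLB_WAYS; way++) {
            if (vtlb[set][way].tag == tag) {
                *addend = vtlb[set][way].addend;
                return 1;
            }
        }
        return 0;
    }

    /* Called when the direct-mapped primary TLB evicts an entry. */
    static void vtlb_insert(uint64_t tag, uintptr_t addend)
    {
        unsigned set = (tag >> PAGE_BITS) & (VTLB_SETS - 1);
        unsigned way = vtlb_rr[set]++ & (VTLB_WAYS - 1);
        vtlb[set][way].tag = tag;
        vtlb[set][way].addend = addend;
    }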


