From: Xin Tong
Subject: Re: [Qemu-devel] outlined TLB lookup on x86
Date: Wed, 27 Nov 2013 19:56:05 -0800

On 11/27/2013 08:41 PM, Xin Tong wrote:
> I am trying to implement an out-of-line TLB lookup for QEMU softmmu-x86-64 on
> x86-64 machine, potentially for better instruction cache performance, I have a
> few questions.
>
> 1. I see that tcg_out_qemu_ld_slow_path/tcg_out_qemu_st_slow_path are generated
> when tcg_out_tb_finalize is called. And when a TLB lookup misses, it jumps to
> the generated slow path and slow path refills the TLB, then load/store and
> jumps to the next emulated instruction. I am wondering is it easy to outline
> the code for the slow path.

Hard. There's quite a bit of code on that slow path that's unique to the
surrounding code context -- which registers contain inputs and outputs, where
to continue after the slow path.

The amount of code that's in the TB slow path now is approximately minimal, as
far as I can see. If you've got an idea for improvement, please share. ;-)
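
As a rough illustration of the fast-path/slow-path split discussed above, here is a minimal sketch in plain C. It is not QEMU's code: every name (`ToyTLBEntry`, `toy_ldub`, `toy_ldub_slow`, `toy_ram`) is hypothetical, the "guest" is just a small identity-mapped array, and the valid bit kept in the low bit of the tag is an assumption made so that a zeroed table never matches. The slow path performs the access itself instead of re-walking the table, which is one way to leave room for accesses that cannot be satisfied from RAM.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define PAGE_BITS 12
#define PAGE_MASK (~(((uint64_t)1 << PAGE_BITS) - 1))
#define TLB_BITS  8                       /* 1 << 8 entries, as in the thread */
#define TLB_SIZE  (1u << TLB_BITS)
#define TAG_VALID ((uint64_t)1)           /* hypothetical: valid bit in the tag */

typedef struct {
    uint64_t tag;        /* page-aligned guest address | TAG_VALID, 0 if empty */
    uint8_t *host_page;  /* host address of the start of that guest page */
} ToyTLBEntry;

static ToyTLBEntry toy_tlb[TLB_SIZE];
static uint8_t toy_ram[4u << PAGE_BITS];  /* four guest pages mapped at 0 */
static unsigned toy_slow_calls;           /* counts trips through the slow path */

/* Flushing touches every entry, so its cost grows with TLB_SIZE. */
static void toy_tlb_flush(void)
{
    memset(toy_tlb, 0, sizeof(toy_tlb));
}

static ToyTLBEntry *toy_entry(uint64_t vaddr)
{
    return &toy_tlb[(vaddr >> PAGE_BITS) & (TLB_SIZE - 1)];
}

/* Slow path: refill the entry, then perform the load itself. */
static uint8_t toy_ldub_slow(uint64_t vaddr)
{
    ToyTLBEntry *e = toy_entry(vaddr);

    toy_slow_calls++;
    assert(vaddr < sizeof(toy_ram));      /* toy model: RAM only, no I/O */
    e->tag = (vaddr & PAGE_MASK) | TAG_VALID;
    e->host_page = &toy_ram[vaddr & PAGE_MASK];
    return e->host_page[vaddr & ~PAGE_MASK];
}

/* Fast path: one index computation, one compare, one load on a hit. */
static uint8_t toy_ldub(uint64_t vaddr)
{
    ToyTLBEntry *e = toy_entry(vaddr);

    if (e->tag == ((vaddr & PAGE_MASK) | TAG_VALID)) {
        return e->host_page[vaddr & ~PAGE_MASK];
    }
    return toy_ldub_slow(vaddr);
}
```

In generated code the slow path additionally has to know which registers hold the address and the result and where to resume, which is the context-specific part that makes sharing it between call sites difficult.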

> I am thinking when a TLB misses, the outlined TLB
> lookup code should generate a call out to the qemu_ld/st_helpers[opc &
> ~MO_SIGN] and rewalk the TLB after it's refilled ? This code is off the critical
> path, so it's not as important as the code when TLB hits.

That would work for true TLB misses to RAM, but does not work for memory mapped
I/O.

> 2. why not use a TLB of bigger size? currently the TLB has 1<<8 entries. the
> TLB lookup is 10 x86 instructions , but every miss needs ~450 instructions, i
> measured this using Intel PIN. so even the miss rate is low (say 3%) the
> overall time spent in the cpu_x86_handle_mmu_fault is still significant.

I'd be interested to experiment with different TLB sizes, to see what effect
that has on performance. But I suspect that lack of TLB contexts means that we
wind up flushing the TLB more often than real hardware does, and therefore a
larger TLB merely takes longer to flush.

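
For a sense of scale, the figures from the quoted question (about 10 instructions per lookup, roughly 450 per miss, a 3% miss rate) can be put into a back-of-the-envelope model. The sketch below assumes every access pays the lookup and a miss adds the full miss cost on top. That cost model and the function names are assumptions for illustration, and it counts instructions only, not cycles or flush overhead.

```c
#include <assert.h>

/* Expected instructions per guest memory access under the assumed model:
 * every access pays lookup_cost, and a fraction miss_rate also pays
 * miss_penalty. */
static double expected_insns_per_access(double lookup_cost,
                                        double miss_penalty,
                                        double miss_rate)
{
    assert(miss_rate >= 0.0 && miss_rate <= 1.0);
    return lookup_cost + miss_rate * miss_penalty;
}

/* Fraction of those instructions that is spent handling misses. */
static double miss_share(double lookup_cost, double miss_penalty,
                         double miss_rate)
{
    double total = expected_insns_per_access(lookup_cost, miss_penalty,
                                             miss_rate);
    return total > 0.0 ? (miss_rate * miss_penalty) / total : 0.0;
}
```

With the quoted numbers this gives 10 + 0.03 * 450 = 23.5 instructions per access, of which a bit more than half are miss handling. A larger table only helps if the misses are capacity or conflict misses rather than the result of frequent flushes.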
But be aware that we can't simply make the change universally. E.g. ARM can
use an immediate 8-bit operand during the TLB lookup, but would have to use
several insns to perform a 9-bit mask.
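
To make the immediate-operand point concrete: classic 32-bit ARM data-processing instructions encode an immediate as an 8-bit value rotated right by an even amount. The check below is a small sketch of that encoding rule. The helper names are hypothetical, and treating this rule as the exact constraint in the ARM backend's TLB lookup is an assumption.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Index mask for a TLB with (1 << bits) entries. */
static uint32_t tlb_index_mask(unsigned bits)
{
    assert(bits >= 1 && bits < 32);
    return ((uint32_t)1 << bits) - 1;
}

static uint32_t rotl32(uint32_t v, unsigned r)
{
    r &= 31;
    return r ? (v << r) | (v >> (32 - r)) : v;
}

/* True if v can be written as an 8-bit value rotated right by an even
 * amount, i.e. some even left-rotation of v fits in 8 bits. */
static bool arm_imm_encodable(uint32_t v)
{
    for (unsigned rot = 0; rot < 32; rot += 2) {
        if (rotl32(v, rot) <= 0xFF) {
            return true;
        }
    }
    return false;
}
```

Under this rule the 8-bit mask `0xFF` fits in a single immediate, while the 9-bit mask `0x1FF` does not, so it would have to be built or applied with extra instructions.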

> I am
> thinking the tlb may need to be organized in a set associative fashion to
> reduce conflict miss, e.g. 2 way set associative to reduce the miss rate. or
> have a victim tlb that is 4 way associative and use x86 simd instructions to do
> the lookup once the direct-mapped tlb misses. Has anybody done any work on this
> front ?

Even with SIMD, I don't believe you could make the fast-path of a set
associative lookup fast. This is the sort of thing for which you really need
the dedicated hardware of the real TLB. Feel free to prove me wrong with code,
of course.

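
For reference, a 2-way set-associative lookup of the kind proposed in the question might look like the sketch below. It is a toy model with hypothetical names (`sa_lookup`, `sa_fill`, `TlbSet`) and a simple alternating replacement policy chosen for brevity. It makes no performance claim; it only shows that the hit path needs a second tag compare and branch compared with a direct-mapped table.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PAGE_BITS 12
#define SET_BITS  7                       /* 128 sets x 2 ways = 256 entries */
#define NUM_SETS  (1u << SET_BITS)
#define NUM_WAYS  2
#define TAG_VALID ((uint64_t)1)           /* hypothetical: valid bit in the tag */

typedef struct {
    uint64_t tag;        /* page-aligned address | TAG_VALID, 0 if empty */
    uint32_t payload;    /* stand-in for the translation data */
} WayEntry;

typedef struct {
    WayEntry way[NUM_WAYS];
    uint8_t next_victim; /* way to overwrite on the next fill */
} TlbSet;

static TlbSet sa_tlb[NUM_SETS];

static uint64_t sa_tag(uint64_t vaddr)
{
    return ((vaddr >> PAGE_BITS) << PAGE_BITS) | TAG_VALID;
}

static TlbSet *sa_set(uint64_t vaddr)
{
    return &sa_tlb[(vaddr >> PAGE_BITS) & (NUM_SETS - 1)];
}

/* Hit path: two tag compares instead of one. Returns NULL on a miss. */
static WayEntry *sa_lookup(uint64_t vaddr)
{
    TlbSet *s = sa_set(vaddr);
    uint64_t tag = sa_tag(vaddr);

    if (s->way[0].tag == tag) {
        return &s->way[0];
    }
    if (s->way[1].tag == tag) {
        return &s->way[1];
    }
    return NULL;
}

/* Refill: overwrite the current victim way, then alternate. */
static void sa_fill(uint64_t vaddr, uint32_t payload)
{
    TlbSet *s = sa_set(vaddr);
    WayEntry *e;

    assert(s->next_victim < NUM_WAYS);
    e = &s->way[s->next_victim];
    s->next_victim ^= 1;
    e->tag = sa_tag(vaddr);
    e->payload = payload;
}
```

Two pages that would collide in a direct-mapped table of the same total size can coexist here, at the price of the extra compare on every access.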
r~