
Re: [Qemu-devel] outlined TLB lookup on x86


From: Xin Tong
Subject: Re: [Qemu-devel] outlined TLB lookup on x86
Date: Tue, 21 Jan 2014 08:22:27 -0600

Hi,

I have found that adding a small (8-entry) fully associative victim
TLB (http://en.wikipedia.org/wiki/Victim_Cache) in front of the refill
path (the page table walk) significantly improves the performance of
QEMU x86_64 system emulation on the SPECint2006 benchmarks. This is
primarily because the primary TLB is direct-mapped and suffers from
conflict misses. I have implemented this on QEMU trunk and would like
to contribute it back to QEMU. Where should I start?

Xin
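
To make the idea above concrete, here is a minimal sketch of a
direct-mapped TLB backed by a small fully associative victim TLB. It is
not QEMU's code and not the patch described in this message: the names
SketchTLB, sketch_tlb_lookup, sketch_tlb_fill and sketch_tlb_flush, the
sizes, and the round-robin replacement policy are all assumptions made
for illustration.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* All names and sizes below are hypothetical, chosen for illustration. */
#define PAGE_BITS   12
#define TLB_SIZE    256          /* direct-mapped entries, power of two */
#define VTLB_SIZE   8            /* fully associative victim entries */
#define INVALID_TAG UINT64_MAX   /* marks an empty entry */

typedef struct {
    uint64_t vpage;              /* virtual page number (the tag) */
    uint64_t ppage;              /* translated page number */
} TLBEntry;

typedef struct {
    TLBEntry primary[TLB_SIZE];
    TLBEntry victim[VTLB_SIZE];
    unsigned victim_next;        /* round-robin replacement cursor */
} SketchTLB;

/* Full flush: cost grows with the number of entries in both tables. */
static void sketch_tlb_flush(SketchTLB *t)
{
    for (size_t i = 0; i < TLB_SIZE; i++) {
        t->primary[i].vpage = INVALID_TAG;
    }
    for (size_t i = 0; i < VTLB_SIZE; i++) {
        t->victim[i].vpage = INVALID_TAG;
    }
    t->victim_next = 0;
}

/* Returns true on a hit; false means the slow refill path is needed. */
static bool sketch_tlb_lookup(SketchTLB *t, uint64_t vaddr, uint64_t *ppage)
{
    uint64_t vpage = vaddr >> PAGE_BITS;
    TLBEntry *e = &t->primary[vpage & (TLB_SIZE - 1)];

    if (e->vpage == vpage) {     /* fast path: direct-mapped hit */
        *ppage = e->ppage;
        return true;
    }
    for (size_t i = 0; i < VTLB_SIZE; i++) {
        if (t->victim[i].vpage == vpage) {
            /* Victim hit: swap it with the conflicting primary entry. */
            TLBEntry tmp = *e;
            *e = t->victim[i];
            t->victim[i] = tmp;
            *ppage = e->ppage;
            return true;
        }
    }
    return false;
}

/* Called after a refill: the evicted primary entry moves to the victim TLB. */
static void sketch_tlb_fill(SketchTLB *t, uint64_t vaddr, uint64_t ppage)
{
    uint64_t vpage = vaddr >> PAGE_BITS;
    TLBEntry *e = &t->primary[vpage & (TLB_SIZE - 1)];

    if (e->vpage != INVALID_TAG) {
        t->victim[t->victim_next] = *e;
        t->victim_next = (t->victim_next + 1) % VTLB_SIZE;
    }
    e->vpage = vpage;
    e->ppage = ppage;
}
```

In this sketch, two pages whose page numbers differ by a multiple of
TLB_SIZE map to the same primary slot. Without the victim table they
would evict each other on every alternating access; with it, the loser
of the conflict is found in the 8-entry scan and swapped back, so the
refill path is skipped.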

On Tue, Dec 17, 2013 at 8:22 PM, Xin Tong <address@hidden> wrote:
> Why is the QEMU TLB organized by mode, e.g. on x86 there are 3
> modes? My guess is that there may be conflicts between virtual
> addresses and physical addresses, and organizing the TLB by mode
> guarantees that QEMU does not hit a physical address translation
> entry when in user mode, and vice versa?
>
> Thank you,
> Xin
>
> On Tue, Dec 17, 2013 at 10:52 PM, Xin Tong <address@hidden> wrote:
>> On Sun, Dec 8, 2013 at 2:54 AM, Xin Tong <address@hidden> wrote:
>>>
>>>
>>>
>>> On Thu, Nov 28, 2013 at 8:12 AM, Lluís Vilanova <address@hidden> wrote:
>>>>
>>>> Xin Tong writes:
>>>>
>>>> > Hi Lluis,
>>>> > we can probably generate vector intrinsics using TCG, e.g. add
>>>> > support to TCG to emit vector instructions directly into the
>>>> > code cache.
>>>>
>>>> There was some discussion long ago about adding vector instructions to
>>>> TCG, but I don't remember what the conclusion was.
>>>>
>>>> Also remember that using vector instructions will "emulate" a
>>>> low-associativity TLB; I don't know how much better than a 1-way TLB
>>>> that will be, though.
>>>>
>>>>
>>>> > Why would a larger TLB make some operations slower? The TLB is a
>>>> > direct-mapped hash, and lookup should be O(1) there. In cputlb,
>>>> > CPU_TLB_SIZE is always used to index into the TLB, i.e.
>>>> > (X & (CPU_TLB_SIZE - 1)).
>>>>
>>>> It would make TLB invalidations slower (e.g., see 'tlb_flush' in
>>>> "cputlb.c"). And right now QEMU performs full TLB invalidations more
>>>> frequently than the equivalent HW needs to, although I suppose that
>>>> should be quantified too.
>>
>> I see QEMU executing ~1M instructions per context switch for
>> qemu-system-x86_64. Is this because the periodic timer interrupt is
>> delivered in real time while QEMU is significantly slower than
>> real hw?
>>
>> Xin
>>
>>>>
>>> You are right, Lluis. QEMU context switches considerably more often than
>>> real hw; this is probably primarily because QEMU is an order of magnitude
>>> slower than real hw. I am wondering where the timer is emulated in
>>> qemu-system-x86_64. I imagine the guest OS must program the timers to
>>> raise interrupts for context switches.
>>>
>>> Another question: what happens when a vcpu is stuck in an infinite loop?
>>> QEMU must need a timer interrupt somewhere as well?
>>>
>>> Is my understanding correct?
>>>
>>> Xin
>>>>
>>>>
>>>> Lluis
>>>>
>>>> --
>>>>  "And it's much the same thing with knowledge, for whenever you learn
>>>>  something new, the whole world becomes that much richer."
>>>>  -- The Princess of Pure Reason, as told by Norton Juster in The Phantom
>>>>  Tollbooth
>>>
>>>


