
Re: [Qemu-devel] [PATCH 0/4] target-ppc: create TCG slots for registers


From: Aurelien Jarno
Subject: Re: [Qemu-devel] [PATCH 0/4] target-ppc: create TCG slots for registers based on CPU
Date: Sun, 29 Mar 2009 15:34:53 +0200
User-agent: Mutt/1.5.18 (2008-05-17)

On Sat, Mar 28, 2009 at 05:18:34PM -0700, Nathan Froyd wrote:
> On Sat, Mar 28, 2009 at 11:54:43PM +0100, Aurelien Jarno wrote:
> > On Sat, Mar 28, 2009 at 02:30:13PM -0700, Nathan Froyd wrote:
> > > I am not a TCG expert, but there are several loops in TCG over all
> > > globals and it seems like those loops would go faster if they didn't
> > > have to consider registers that would never be touched.  If this patch
> > > series makes no difference in TCG's performance, then I'd be glad to
> > > have an explanation of why that's the case.
> > 
> > Have you actually run a benchmark with those changes? TCG is
> > sometimes a bit strange: some optimizations do not change the
> > execution speed, while others improve it a lot. It is very difficult to
> > predict what will give a gain and what will not.
> > 
> > Suggestions of benchmarks: gzip/bzip2 on a big file using user emulation
> > or a compilation in system emulation.
> 
> Benchmarking?  Pffft. ;)
> 
> A benchmarking session with qemu-ppc and bzip2/bunzip2 on ~400MB files
> and a 603e emulated CPU suggests that these changes are not terribly
> beneficial (maybe 1% improvement, if that).  I don't imagine that a
> similarly stressful benchmark in system emulation would be much
> different.  Consider the patch series withdrawn.
> 

I have done some profiling on qemu-system-ppc and qemu-system-mips. You
are actually right that the loops over the TCG variable lists take time.
This is mainly due to the calls to save_globals() for TCG ops marked
as TCG_OPF_CALL_CLOBBER.

However, it would probably be better to address the following comment
first, before trying to reduce the number of TCG variables:

            /* XXX: for load/store we could do that only for the slow path
               (i.e. when a memory callback is called) */
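
As a toy illustration of the idea in that comment (this is not QEMU code;
ToyStats and the toy_* functions are hypothetical names, and the model
only counts per-global work), compare saving the globals on every
load/store with saving them only when the slow path runs:

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical counters for the sketch; not QEMU's real data structures. */
typedef struct {
    size_t nb_globals;      /* number of TCG-style globals to scan */
    size_t globals_scanned; /* total per-global work done so far */
} ToyStats;

/* Stand-in for a save_globals()-style pass: visits every global once. */
static void toy_save_globals(ToyStats *s)
{
    s->globals_scanned += s->nb_globals;
}

/* Behaviour described above: the save happens for every load/store,
 * whether or not the access hits in the TLB. */
static void toy_access_always(ToyStats *s, bool tlb_hit)
{
    (void)tlb_hit;
    toy_save_globals(s);
}

/* Behaviour suggested by the XXX comment: only the slow path
 * (the memory callback) pays for the save. */
static void toy_access_slow_only(ToyStats *s, bool tlb_hit)
{
    if (!tlb_hit) {
        toy_save_globals(s);
    }
}
```

In this model the cost of the first variant grows with the number of
globals times the number of memory accesses, while the second only pays
on TLB misses. Whether the real register allocator can defer the save
this way is a separate question that the sketch does not answer.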

For the PowerPC target, though, what really kills performance is the
call to ppc_store_sr(), which the Linux kernel basically triggers on each
context switch. On real hardware the SR (segment register) selection is
done before the TLB, whereas we emulate both the SR and the TLB with the
QEMU TLB, which means we have to do a tlb_flush(env, 1) each time. The
flush itself is expensive, and it also hurts performance afterwards
because the TLB has to be refilled.
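
A minimal sketch of that effect, assuming a tiny direct-mapped software
TLB (ToyCpu, TOY_TLB_SIZE and the toy_* functions are hypothetical and
only loosely modelled on the behaviour described above, not taken from
QEMU):

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_TLB_SIZE 8  /* deliberately tiny; an assumption of the sketch */

typedef struct {
    bool     valid;
    uint32_t vpage;
} ToyTlbEntry;

typedef struct {
    ToyTlbEntry tlb[TOY_TLB_SIZE];
    uint32_t    sr[16];   /* 32-bit PowerPC has 16 segment registers */
    size_t      refills;  /* slow-path TLB fills */
    size_t      flushes;  /* full TLB flushes */
} ToyCpu;

/* Invalidate every cached translation. */
static void toy_tlb_flush(ToyCpu *cpu)
{
    for (size_t i = 0; i < TOY_TLB_SIZE; i++) {
        cpu->tlb[i].valid = false;
    }
    cpu->flushes++;
}

/* Models a segment register write: cached translations were computed
 * with the old SR value, so a change forces a full flush. */
static void toy_store_sr(ToyCpu *cpu, unsigned n, uint32_t value)
{
    if (cpu->sr[n & 15] != value) {
        cpu->sr[n & 15] = value;
        toy_tlb_flush(cpu);
    }
}

/* Memory access: a miss refills the entry and counts as slow-path work. */
static void toy_access(ToyCpu *cpu, uint32_t vpage)
{
    ToyTlbEntry *e = &cpu->tlb[vpage % TOY_TLB_SIZE];
    if (!e->valid || e->vpage != vpage) {
        e->valid = true;
        e->vpage = vpage;
        cpu->refills++;
    }
}
```

In this model a working set that was fully cached has to be refilled
entry by entry after each segment register change, which is the double
cost described above: the flush itself, then the misses that follow.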

-- 
Aurelien Jarno                          GPG: 1024D/F1BCDB73
address@hidden                 http://www.aurel32.net



