Re: [Qemu-devel] [PATCH 0/5] tcg conditional set, round 4


From: Laurent Desnogues
Subject: Re: [Qemu-devel] [PATCH 0/5] tcg conditional set, round 4
Date: Wed, 23 Dec 2009 11:28:58 +0100

On Tue, Dec 22, 2009 at 3:46 PM, Laurent Desnogues
<address@hidden> wrote:
> On Tue, Dec 22, 2009 at 1:02 AM, Richard Henderson <address@hidden> wrote:
>> On 12/21/2009 03:08 PM, Laurent Desnogues wrote:
>>>
>>> If you wanted to use movcond, you'd have to make
>>> cond + move a special case...
>>
>> You'd certainly want the ARM front-end to use movcond more often than that.
>>  For instance:
>>
>>  addeq r1,r2,r3
>> -->
>>  add_i32 tmp,r2,r3
>>  movcond_i32 r1,ZF,0,tmp,r1,eq
>>
>> You'd want to continue to use a branch around if the instruction has side
>> effects like cpu fault (e.g. load, store) or updating flags.
>>
>> It ought not be very hard to arrange for something like
>>
>>  if (cond != 0xe) {
>>    if (may_use_movcond(insn)) {
>>      s->condlabel = -1;
>>      /* Save the true destination register.  */
>>      s->conddest = cpu_R[dest];
>>      /* Implement the instruction into a temporary.  */
>>      cpu_R[dest] = tcg_temp_new();
>>    } else {
>>      s->condlabel = gen_new_label();
>>      ArmConditional cmp = gen_test_cc(cond ^ 1);
>>      tcg_gen_brcondi_i32(cmp.cond, cmp.reg, 0, s->condlabel);
>>    }
>>    s->condjmp = 1;
>>  }
>>
>>  // ... implement the instruction as we currently do.
>>
>>  if (s->condjmp) {
>>    if (s->condlabel == -1) {
>>      /* Conditionally move the temporary result into the
>>         true destination register.  */
>>      ArmConditional cmp = gen_test_cc(cond);
>>      tcg_gen_movcond_i32(cmp.cond, s->conddest, cmp.reg, 0,
>>                          cpu_R[dest], s->conddest);
>>      tcg_temp_free(cpu_R[dest]);
>>      /* Restore the true destination register.  */
>>      cpu_R[dest] = s->conddest;
>>    } else {
>>      tcg_set_label(s->condlabel);
>>    }
>>  }
>
> I agree, that looks nice.  But I'll let you dig into ARM instruction
> encoding and see how to implement may_use_movcond and
> getting the correct dest to save is not that cheap (and before
> you get back to me, yes, you could only consider a small
> subset of the instructions for which you want to do that :-).
>
> There's a point I have kept on insisting on that you keep on
> not answering :-)  How does all of that perform in practice?
> We can discuss forever, as long as it isn't measured, we are
> just guessing.

So I did measure it.  Your code isn't correct:  you can't replace
the dest reg with a new temp since that would break instructions
such as:

    addeq r0,r0,#1

In my implementation, all conditional data-processing instructions
use movcond.

Note my version of QEMU is ARM specific and contains
several things that aren't in mainline:

   - Aurelien TCG optimizations (constant propagation and
     copy analysis)
   - lazy block context flags update
   - no temp flush on ld/st
   - most helpers for non-SIMD/VFP instructions are
     replaced with TCG code (using setcond for flag setting)
   - no signal handling.

This version of qemu-arm is about 2x faster than mainline.

Env:
 - HW: E6400
 - OS: CentOS 5.4 64-bit
 - gcc: 4.1.2
 - bench: SPEC2k gcc with expr.i input set

With movcond:
Translation buffer state:
gen code size       4262752/33449984
TB count            35084/524288
TB avg target size  18 max=592 bytes
TB avg host size    121 bytes (expansion ratio: 6.7)
cross page TB count 0 (0%)
direct jump count   20388 (58%) (2 jumps=17211 49%)

Statistics:
TB invalidate count 0
JIT cycles          628700508 (0.262 s at 2.4 GHz)
translated TBs      35084
avg ops/TB          28.2 max=554
deleted ops/TB      5.09 (178672)
avg temps/TB        28.88 max=54
total in  TB size   639924 avg 18.2
total out TB size   4009329 avg 114.3
cycles/op           635.1
cycles/in byte      982.5
cycles/out byte     156.8
  gen_interm time   13.2%
  gen_code time     86.8%
const/code time     9.9%
liveness/code time  13.9%

real    0m15.944s
user    0m15.512s
sys     0m0.070s

Without movcond:
Translation buffer state:
gen code size       4308640/33449984
TB count            35093/524288
TB avg target size  18 max=592 bytes
TB avg host size    122 bytes (expansion ratio: 6.7)
cross page TB count 0 (0%)
direct jump count   20388 (58%) (2 jumps=17211 49%)

Statistics:
TB invalidate count 0
JIT cycles          673085430 (0.280 s at 2.4 GHz)
translated TBs      35093
avg ops/TB          27.9 max=556
deleted ops/TB      4.79 (168080)
avg temps/TB        28.77 max=34
total in  TB size   640804 avg 18.3
total out TB size   4056125 avg 115.6
cycles/op           686.8
cycles/in byte      1050.4
cycles/out byte     165.9
  gen_interm time   12.8%
  gen_code time     87.2%
const/code time     9.3%
liveness/code time  11.5%

real    0m15.974s
user    0m15.586s
sys     0m0.060s

Notes:
  - the change in number of TBs is expected for gcc
    since it outputs timing stats at the end
  - the cycles spent in TCG are very inaccurate (I never
    found them very useful...)
  - I slightly changed the movcond x86_64 generation
    not to generate a useless mov when dest = vtrue = vfalse,
    which happened at least once (due to copy analysis).

All in all not much gain.

For the sake of completeness some stats about movcond usage:
  number of TB with movcond        : 2919
  max number of movcond in a TB    : 24
  total number of movcond generated: 5738
  total number of movcond executed : 163892368

Of course that's a single data point, but one that spends a rather
big percentage of its time in generated code;  oprofile output:

21403    60.4570  anon (tgid:17348 range:0x601be000-0x621bf000)  qemu-arm-32  anon (tgid:17348 range:0x601be000-0x621bf000)
6751     19.0695  qemu-arm-32              qemu-arm-32              cpu_arm_exec
2230      6.2991  anon (tgid:17348 range:0x6224a000-0x6224b000)  qemu-arm-32  anon (tgid:17348 range:0x6224a000-0x6224b000)
303       0.8559  qemu-arm-32              qemu-arm-32              tcg_gen_code
65        0.1836  qemu-arm-32              qemu-arm-32              cpu_loop
61        0.1723  qemu-arm-32              qemu-arm-32              page_check_range
53        0.1497  libpthread-2.5.so        qemu-arm-32              __pthread_cleanup_upto
49        0.1384  qemu-arm-32              qemu-arm-32              temp_save
36        0.1017  qemu-arm-32              qemu-arm-32              gen_intermediate_code

Doesn't look encouraging, but I like the reduction in generated
code size.

Of course, before drawing any conclusion, we need more
measurements, especially one with QEMU system emulation.


Laurent



