qemu-devel

Re: [Qemu-devel] [PATCH 00/12] tcg: Improve register allocation for calls


From: Emilio G. Cota
Subject: Re: [Qemu-devel] [PATCH 00/12] tcg: Improve register allocation for calls
Date: Wed, 28 Nov 2018 17:15:30 -0500
User-agent: Mutt/1.9.4 (2018-02-28)

On Tue, Nov 27, 2018 at 21:38:22 -0800, Richard Henderson wrote:
> The intent here is to remove several move insns putting the
> function arguments into the proper place.  I'm hoping that
> this will solve the skylake regression with spec2006, as
> seen with the ool softmmu patch set.
> 
> Emilio, all of this is present on my tcg-next-for-4.0 branch.

Thanks for this.

Unfortunately, it doesn't seem to help, performance-wise.

I've benchmarked this on three different machines: Sandy
Bridge, Haswell and Skylake. The average slowdown vs.
the baseline is ~0%, ~5%, and ~10%, respectively.

So it seems the more modern the microarchitecture, the more
severe the slowdown (this is consistent with the assumption
that processors are getting better at caching over time).

Here are all the bar charts:

  https://imgur.com/a/k7vmjVd

- baseline: tcg-next-for-4.0's parent from master, i.e.
  4822f1e ("Merge remote-tracking branch
  'remotes/kraxel/tags/fixes-31-20181127-pull-request'
  into staging", 2018-11-27)

- ool: dc93c4a ("tcg/ppc: Use TCG_TARGET_NEED_LDST_OOL_LABELS",
  2018-11-27)

- ool-regs: a9bac58 ("tcg: Record register preferences during
  liveness", 2018-11-27)

I've also looked at hardware event counts on Skylake for
the above three commits. It seems that the indirection of
the (very) frequent ool calls/rets is what causes the large
reduction in IPC (results for bootup + hmmer):

- baseline:
   291,451,142,426      instructions              #    2.94  insn per cycle           (71.45%)
    99,050,829,190      cycles                                                        (71.49%)
     2,678,751,743      br_inst_retired.near_call                                     (71.43%)
     2,674,367,278      br_inst_retired.near_return                                   (71.42%)
    34,065,079,963      branches                                                      (57.09%)
       161,441,496      branch-misses             #    0.47% of all branches          (57.17%)
      29.916874137 seconds time elapsed

- ool:
   312,368,465,806      instructions              #    2.79  insn per cycle           (71.45%)
   111,863,014,212      cycles                                                        (71.31%)
    11,751,151,140      br_inst_retired.near_call                                     (71.30%)
    11,736,770,191      br_inst_retired.near_return                                   (71.41%)
        24,660,597      br_misp_retired.near_call                                     (71.49%)
    52,096,512,558      branches                                                      (57.28%)
       176,951,727      branch-misses             #    0.34% of all branches          (57.20%)
      33.285149773 seconds time elapsed

- ool-regs:
   309,253,149,588      instructions              #    2.71  insn per cycle           (71.47%)
   113,938,069,597      cycles                                                        (71.50%)
    11,735,199,530      br_inst_retired.near_call                                     (71.51%)
    11,725,686,909      br_inst_retired.near_return                                   (71.54%)
        24,885,204      br_misp_retired.near_call                                     (71.46%)
    52,768,150,694      branches                                                      (56.97%)
       184,421,824      branch-misses             #    0.35% of all branches          (57.03%)
      33.867122498 seconds time elapsed
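
As a quick sanity check, the IPC figures perf reports can be recomputed from the raw counts (a minimal Python sketch; the numbers are copied from the three runs above, keeping in mind that multiplexed counters are scaled estimates):

```python
# Recompute IPC = instructions / cycles for the three runs above.
runs = {
    "baseline": (291_451_142_426, 99_050_829_190),
    "ool":      (312_368_465_806, 111_863_014_212),
    "ool-regs": (309_253_149_588, 113_938_069_597),
}
for name, (insns, cycles) in runs.items():
    print(f"{name}: {insns / cycles:.2f} insn per cycle")
# baseline: 2.94, ool: 2.79, ool-regs: 2.71 -- matching the perf output
```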

The additional branches are all from call/ret. I double-checked the generated
code and these are all well-matched (no jmp's instead of ret's), so
I don't think we can optimize anything there; it seems to me that this
is just a code size vs. speed trade-off.
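
For what it's worth, the claim that the extra branches are essentially all call/ret can be checked arithmetically (a sketch in Python; counts are taken from the baseline and ool perf output above, so the usual multiplexing-estimate caveat applies):

```python
# Extra branches in ool vs. baseline, compared against the growth in
# retired near-call and near-return events.
base_calls, base_rets = 2_678_751_743, 2_674_367_278
ool_calls,  ool_rets  = 11_751_151_140, 11_736_770_191
base_branches, ool_branches = 34_065_079_963, 52_096_512_558

extra_callret  = (ool_calls - base_calls) + (ool_rets - base_rets)
extra_branches = ool_branches - base_branches
print(f"extra call/ret:  {extra_callret:,}")   # ~18.1e9
print(f"extra branches:  {extra_branches:,}")  # ~18.0e9
# The two agree to within ~1%, so call/ret accounts for the growth.
```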

ool-regs has even lower IPC, but it also executes fewer instructions,
which mitigates the slowdown from the lower IPC. The bottleneck in the
ool calls/rets remains, which explains why there isn't much to be
gained from the lower instruction count.

Let me know if you want me to do any other data collection.

Thanks,

                Emilio


