[Qemu-devel] [PATCH 09/10] target/i386: optimize indirect branches with

qemu-devel

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

[Qemu-devel] [PATCH 09/10] target/i386: optimize indirect branches with

From:	Emilio G. Cota
Subject:	[Qemu-devel] [PATCH 09/10] target/i386: optimize indirect branches with TCG's jr op
Date:	Tue, 11 Apr 2017 21:17:29 -0400

Speed up indirect branches by adding a helper to look for the
TB in tb_jmp_cache. The helper returns either the corresponding
host address or NULL.

Measurements:

-             NBench, x86_64-linux-user. Host: Intel i7-4790K @ 4.00GHz
                             Y axis: Speedup over 95b31d70

     1.1x+-+-------------------------------------------------------------+-+
         |          jr             $$                                      |
    1.08x+-+......  jr+inline      %%  ..................................+-+
         |                                                                 |
         | $$$                                                             |
    1.06x+-$.$............................%%%............................+-+
         | $ $%%                          % %                              |
    1.04x+-$.$.%..........................%.%............................+-+
         | $ $ %                        $$$ %                  $$$         |
         | $ $ %              %%%       $ $ %                  $ $%%       |
    1.02x+-$.$.%.........%%%.$$.%.......$.$.%...%%%...%%.......$.$.%.$$$%%-+
         | $ $ %         % % $$ % $$$   $ $ % $$$ %   %% $$$%% $ $ % $ $ % |
       1x+-$.$B%R$$$ARGRA%H%T$$P%j$+$%%i$e$.%.$.$.%.$$$%.$.$.%.$.$.%.$.$.%-+
         | $ $ % $ $%% $$$ % $$ % $ $ % $ $ % $ $ % $ $% $ $ % $ $ % $ $ % |
    0.98x+-$.$.%.$.$.%.$.$.%.$$.%.$.$.%.$.$.%.$.$.%.$.$%.$.$.%.$.$.%.$.$.%-+
         | $ $ % $ $ % $ $ % $$ % $ $ % $ $ % $ $ % $ $% $ $ % $ $ % $ $ % |
         | $ $ % $ $ % $ $ % $$ % $ $ % $ $ % $ $ % $ $% $ $ % $ $ % $ $ % |
    0.96x+-$.$.%.$.$.%.$.$.%.$$.%.$.$.%.$.$.%.$.$.%.$.$%.$.$.%.$.$.%.$.$.%-+
         +-$$$%%-$$$%%-$$$%%-$$%%-$$$%%-$$$%%-$$$%%-$$$%-$$$%%-$$$%%-$$$%%-+
        ASSIGNMBITFIELFOFP_EMULATHUFFMANLU_DECOMPNEURNUMERICSTRING_SOhmean
  png: http://imgur.com/Jxj4hBd

The fact that NBench is not very sensitive to changes here is a
little surprising, especially given the significant improvements for
ARM shown in the previous commit. I wonder whether the compiler is doing
a better job compiling the x86_64 version (I'm using gcc 5.4.0), or I'm simply
missing some i386 instructions to which the jr optimization should
be applied.

     specINT 2006 (test set), x86_64-linux-user. Host: Intel i7-4790K @ 4.00GHz
                             Y axis: Speedup over 95b31d70

     1.3x+-+-------------------------------------------------------------+-+
         |          jr+inline $$                                           |
    1.25x+-+.............................................................+-+
         |                                                                 |
     1.2x+-+.............................................................+-+
         |                                                                 |
         |                     +++                 +++                     |
    1.15x+-+...................$$$.................$$$...................+-+
         |                     $ $                 $:$                     |
     1.1x+-+...................$.$.................$.$...........$$$$....+-+
         |           +++       $ $                 $ $       +++ $++$      |
    1.05x+-+.........$$$$......$.$.................$.$...........$..$....+-+
         |           $  $      $ $  $$$            $ $ $$$$ $$$$ $  $ $$$$ |
         | $$$$  +++ $  $ +++  $ $  $ $  +++  $$$  $ $ $  $ $++$ $  $ $  $ |
       1x+-$BA$G$$$$_$EM$_$$$$.$.$..$.$..$$$..$.$..$.$.$..$.$..$.$..$.$..$-+
         | $  $ $  $ $  $ $  $ $ $  $ $  $ $  $ $  $ $ $  $ $  $ $  $ $  $ |
    0.95x+-$..$.$..$.$..$.$..$.$.$..$.$..$.$..$.$..$.$.$..$.$..$.$..$.$..$-+
         | $  $ $  $ $  $ $  $ $ $  $ $  $ $  $ $  $ $ $  $ $  $ $  $ $  $ |
     0.9x+-$$$$-$$$$-$$$$-$$$$-$$$--$$$--$$$--$$$--$$$-$$$$-$$$$-$$$$-$$$$-+
           astarbzip2gcc gobmh264rehmlibquantumcfomneperlbensjxalancbhmean
  png: http://imgur.com/63Ncmx8

That is a 4.4% hmean perf improvement.

-  specINT 2006 (train set), x86_64-linux-user. Host: Intel i7-4790K @ 4.00GHz
                             Y axis: Speedup over 95b31d70

    1.4x+-+--------------------------------------------------------------+-+
        |        jr  $$                                                    |
        |                                                                  |
    1.3x+-+..............................................................+-+
        |                                                                  |
        |                                                                  |
    1.2x+-+......................................................$$$$....+-+
        |                      +++                     $$$$  :   $++$      |
        |                     $$$$                $$$$ $  $  :   $  $      |
    1.1x+-+...................$..$................$..$.$..$.$$$$.$..$....+-+
        |                     $  $                $  $ $  $ $: $ $  $ +++  |
        |  +++       +++  +++ $  $ $$$$  +++      $  $ $  $ $: $ $  $ $$$$ |
      1x+-$$$$GRAPH_$$$$_$$$$.$..$.$..$.$$$$......$..$.$..$.$..$.$..$.$..$-+
        | $++$ $$$$ $  $ $++$ $  $ $  $ $  $      $  $ $  $ $  $ $  $ $  $ |
        | $  $ $  $ $  $ $  $ $  $ $  $ $  $      $  $ $  $ $  $ $  $ $  $ |
    0.9x+-$..$.$..$.$..$.$..$.$..$.$..$.$..$......$..$.$..$.$..$.$..$.$..$-+
        | $  $ $  $ $  $ $  $ $  $ $  $ $  $ $$$$ $  $ $  $ $  $ $  $ $  $ |
        | $  $ $  $ $  $ $  $ $  $ $  $ $  $ $  $ $  $ $  $ $  $ $  $ $  $ |
    0.8x+-$$$$-$$$$-$$$$-$$$$-$$$$-$$$$-$$$$-$$$$-$$$$-$$$$-$$$$-$$$$-$$$$-+
          astarbzip2 gcc gobmh264rehmlibquantmcfomneperlbensjexalancbhmean
  png: http://imgur.com/hd0BhU6

That is, a 4.39 % hmean improvement for jr+inline, i.e. this commit.
(4.5% for noinline). Peak improvement is 20% for xalancbmk.

-    specINT 2006 (test set), x86_64-softmmu. Host: Intel i7-4790K @ 4.00GHz
                             Y axis: Speedup over 95b31d70

     1.3x+-+-------------------------------------------------------------+-+
         |         cross    $$                                             |
    1.25x+-+.....  jr       %%  .........................................+-+
         |         cross+jr @@                                      :      |
     1.2x+-+.............................................................+-+
         |                                                       :  :      |
         |             +++                                       :  :      |
    1.15x+-+...........@@................................................+-+
         |           $$@@ $$++                    +++            : @@      |
     1.1x+-+.........$$@@.$$@@.....................................@@....+-+
         |           $$@@ $$@@                    $$ :   @@@  +++$$@@      |
    1.05x+-+.........$$@@.$$@@...@@address@hidden@.....$$@@....+-+
         |        +++$$%@ $$@@  address@hidden: $$@ @++@@ address@hidden@@+|
         |  +@@+++@@+$$%@ $$@@address@hidden  ::@@ ::@@$$@@@$$% @$$@@ 
address@hidden@@ |
       
address@hidden@address@hidden@address@hidden@address@hidden@address@hidden@address@hidden@@address@hidden@-+
         |+$$%@ $$%@ $$%@ address@hidden address@hidden@ 
address@hidden@address@hidden@$$% @$$@@ address@hidden@ |
    
address@hidden@address@hidden@address@hidden@address@hidden@address@hidden@address@hidden@address@hidden@-+
         | $$%@ $$%@ $$%@ address@hidden address@hidden address@hidden 
address@hidden address@hidden@$$% @$$%@ $$%@ $$%@ |
     
address@hidden@address@hidden@address@hidden@address@hidden@$$%@@$$%@@address@hidden@address@hidden
           astabzip2 gcc gobmh264rehmlibquantumcfomneperlbensjexalanchmean
  png: http://imgur.com/IV9UtSa

Here we see how jr works best when combined with cross -- jr by itself is
disappointingly around baseline performance. I attribute this to the frequent
page invalidations and/or TLB flushes (I'm running Ubuntu 16.04 as the guest,
so there are many processes), which lowers the maximum attainable hit rate in
tb_jmp_cache.

Overall the greatest hmean improvement comes from cross+jr though.

-      specINT 2006 (train set), x86_64-softmmu. Host: Intel i7-4790K @ 4.00GHz
                             Y axis: Speedup over 95b31d70

    1.25x+-+-------------------------------------------------------------+-+
         |        cross+inline    $$                                       |
         |        cross+jr+inline %%                     +++      +++      |
     1.2x+-+.............................................................+-+
         |                                         :      :      +++       |
    1.15x+-+.......................................................%%....+-+
         |            ::   +++                    $$$  $$$%      $$$%      |
         |           $$%%++%%%                    $:$  $+$% +++  $:$%      |
     1.1x+-+.........$$.%.$$.%....................$.$..$.$%......$.$%....+-+
         |      +++  $$+%+$$ %+++++ :+++          $ $: $ $%  :%% $+$% +++  |
    1.05x+-+....$$...$$.%.$$.%......$$............$.$%.$.$%.$$$%.$.$%.$$%%-+
         |      $$%% $$ % $$ % $$%% $$:   +++     $ $% $ $% $:$% $ $% $$+% |
         |      $$+% $$ % $$ % $$:%+$$%%+++: +++  $ $%+$ $% $:$% $ $% $$ % |
       1x+-$$$AR$$A%G$$P%_$$M%_$$o%s$$r%$$$%%e....$.$%.$.$%.$.$%.$.$%.$$.%-+
         | $+$% $$ % $$ % $$ %+$$+% $$:%$:$+%$$$++$ $% $ $% $ $% $ $% $$ % |
    0.95x+-$.$%.$$.%.$$.%.$$.%.$$.%.$$.%$.$.%$.$..$.$%.$.$%.$.$%.$.$%.$$.%-+
         | $ $% $$ % $$ % $$ % $$ % $$ %$ $ %$+$% $ $% $ $% $ $% $ $% $$ % |
         | $ $% $$ % $$ % $$ % $$ % $$ %$ $ %$ $% $ $% $ $% $ $% $ $% $$ % |
     0.9x+-$$$%-$$%%-$$%%-$$%%-$$%%-$$%%$$$%%$$$%-$$$%-$$$%-$$$%-$$$%-$$%%-+
           astabzip2 gcc gobmh264rehmlibquantumcfomneperlbensjexalanchmean
  png: http://imgur.com/CBMxrBH

This is the larger "train" set of SPECint06. Here cross+jr comes slightly
below cross, but it's within the noise margins (I didn't run this many
times, since it takes several hours).

Signed-off-by: Emilio G. Cota <address@hidden>
---
 target/i386/helper.h      |  1 +
 target/i386/misc_helper.c | 11 +++++++++++
 target/i386/translate.c   | 42 +++++++++++++++++++++++++++++++++---------
 3 files changed, 45 insertions(+), 9 deletions(-)

diff --git a/target/i386/helper.h b/target/i386/helper.h
index dceb343..f7e9f9c 100644
--- a/target/i386/helper.h
+++ b/target/i386/helper.h
@@ -2,6 +2,7 @@ DEF_HELPER_FLAGS_4(cc_compute_all, TCG_CALL_NO_RWG_SE, tl, tl, 
tl, tl, int)
 DEF_HELPER_FLAGS_4(cc_compute_c, TCG_CALL_NO_RWG_SE, tl, tl, tl, tl, int)
 
 DEF_HELPER_2(cross_page_check, i32, env, tl)
+DEF_HELPER_2(get_hostptr, ptr, env, tl)
 
 DEF_HELPER_3(write_eflags, void, env, tl, i32)
 DEF_HELPER_1(read_eflags, tl, env)
diff --git a/target/i386/misc_helper.c b/target/i386/misc_helper.c
index a41daed..5d50ab0 100644
--- a/target/i386/misc_helper.c
+++ b/target/i386/misc_helper.c
@@ -642,3 +642,14 @@ uint32_t helper_cross_page_check(CPUX86State *env, 
target_ulong vaddr)
 {
     return !!tb_from_jmp_cache(env, vaddr);
 }
+
+void *helper_get_hostptr(CPUX86State *env, target_ulong vaddr)
+{
+    TranslationBlock *tb;
+
+    tb = tb_from_jmp_cache(env, vaddr);
+    if (unlikely(tb == NULL)) {
+        return NULL;
+    }
+    return tb->tc_ptr;
+}
diff --git a/target/i386/translate.c b/target/i386/translate.c
index ffc8ccc..aab5c13 100644
--- a/target/i386/translate.c
+++ b/target/i386/translate.c
@@ -2521,7 +2521,8 @@ static void gen_bnd_jmp(DisasContext *s)
    If INHIBIT, set HF_INHIBIT_IRQ_MASK if it isn't already set.
    If RECHECK_TF, emit a rechecking helper for #DB, ignoring the state of
    S->TF.  This is used by the syscall/sysret insns.  */
-static void gen_eob_worker(DisasContext *s, bool inhibit, bool recheck_tf)
+static void
+gen_eob_worker(DisasContext *s, bool inhibit, bool recheck_tf, TCGv jr)
 {
     gen_update_cc_op(s);
 
@@ -2542,6 +2543,22 @@ static void gen_eob_worker(DisasContext *s, bool 
inhibit, bool recheck_tf)
         tcg_gen_exit_tb(0);
     } else if (s->tf) {
         gen_helper_single_step(cpu_env);
+    } else if (jr) {
+#ifdef TCG_TARGET_HAS_JR
+        TCGLabel *label = gen_new_label();
+        TCGv_ptr ptr = tcg_temp_local_new_ptr();
+        TCGv vaddr = tcg_temp_new();
+
+        tcg_gen_ld_tl(vaddr, cpu_env, offsetof(CPUX86State, segs[R_CS].base));
+        tcg_gen_add_tl(vaddr, vaddr, jr);
+        gen_helper_get_hostptr(ptr, cpu_env, vaddr);
+        tcg_temp_free(vaddr);
+        tcg_gen_brcondi_ptr(TCG_COND_EQ, ptr, NULL, label);
+        tcg_gen_jr(ptr);
+        tcg_temp_free_ptr(ptr);
+        gen_set_label(label);
+#endif
+        tcg_gen_exit_tb(0);
     } else {
         tcg_gen_exit_tb(0);
     }
@@ -2552,13 +2569,18 @@ static void gen_eob_worker(DisasContext *s, bool 
inhibit, bool recheck_tf)
    If INHIBIT, set HF_INHIBIT_IRQ_MASK if it isn't already set.  */
 static void gen_eob_inhibit_irq(DisasContext *s, bool inhibit)
 {
-    gen_eob_worker(s, inhibit, false);
+    gen_eob_worker(s, inhibit, false, NULL);
 }
 
 /* End of block, resetting the inhibit irq flag.  */
 static void gen_eob(DisasContext *s)
 {
-    gen_eob_worker(s, false, false);
+    gen_eob_worker(s, false, false, NULL);
+}
+
+static void gen_jr(DisasContext *s, TCGv dest)
+{
+    gen_eob_worker(s, false, false, dest);
 }
 
 /* generate a jump to eip. No segment change must happen before as a
@@ -4985,7 +5007,7 @@ static target_ulong disas_insn(CPUX86State *env, 
DisasContext *s,
             gen_push_v(s, cpu_T1);
             gen_op_jmp_v(cpu_T0);
             gen_bnd_jmp(s);
-            gen_eob(s);
+            gen_jr(s, cpu_T0);
             break;
         case 3: /* lcall Ev */
             gen_op_ld_v(s, ot, cpu_T1, cpu_A0);
@@ -5003,7 +5025,8 @@ static target_ulong disas_insn(CPUX86State *env, 
DisasContext *s,
                                       tcg_const_i32(dflag - 1),
                                       tcg_const_i32(s->pc - s->cs_base));
             }
-            gen_eob(s);
+            tcg_gen_ld_tl(cpu_tmp4, cpu_env, offsetof(CPUX86State, eip));
+            gen_jr(s, cpu_tmp4);
             break;
         case 4: /* jmp Ev */
             if (dflag == MO_16) {
@@ -5011,7 +5034,7 @@ static target_ulong disas_insn(CPUX86State *env, 
DisasContext *s,
             }
             gen_op_jmp_v(cpu_T0);
             gen_bnd_jmp(s);
-            gen_eob(s);
+            gen_jr(s, cpu_T0);
             break;
         case 5: /* ljmp Ev */
             gen_op_ld_v(s, ot, cpu_T1, cpu_A0);
@@ -5026,7 +5049,8 @@ static target_ulong disas_insn(CPUX86State *env, 
DisasContext *s,
                 gen_op_movl_seg_T0_vm(R_CS);
                 gen_op_jmp_v(cpu_T1);
             }
-            gen_eob(s);
+            tcg_gen_ld_tl(cpu_tmp4, cpu_env, offsetof(CPUX86State, eip));
+            gen_jr(s, cpu_tmp4);
             break;
         case 6: /* push Ev */
             gen_push_v(s, cpu_T0);
@@ -7143,7 +7167,7 @@ static target_ulong disas_insn(CPUX86State *env, 
DisasContext *s,
         /* TF handling for the syscall insn is different. The TF bit is  
checked
            after the syscall insn completes. This allows #DB to not be
            generated after one has entered CPL0 if TF is set in FMASK.  */
-        gen_eob_worker(s, false, true);
+        gen_eob_worker(s, false, true, NULL);
         break;
     case 0x107: /* sysret */
         if (!s->pe) {
@@ -7158,7 +7182,7 @@ static target_ulong disas_insn(CPUX86State *env, 
DisasContext *s,
                checked after the sysret insn completes. This allows #DB to be
                generated "as if" the syscall insn in userspace has just
                completed.  */
-            gen_eob_worker(s, false, true);
+            gen_eob_worker(s, false, true, NULL);
         }
         break;
 #endif
-- 
2.7.4

[Prev in Thread]

Current Thread

[Next in Thread]

[Qemu-devel] [PATCH 02/10] exec-all: inline tb_from_jmp_cache, (continued)
- [Qemu-devel] [PATCH 02/10] exec-all: inline tb_from_jmp_cache, Emilio G. Cota, 2017/04/11
- [Qemu-devel] [PATCH 06/10] tcg: add brcondi_ptr, Emilio G. Cota, 2017/04/11
- [Qemu-devel] [PATCH 03/10] target/arm: optimize cross-page block chaining in softmmu, Emilio G. Cota, 2017/04/11
- [Qemu-devel] [PATCH 04/10] target/i386: optimize cross-page block chaining in softmmu, Emilio G. Cota, 2017/04/11
- [Qemu-devel] [PATCH 05/10] tcg: add jr opcode, Emilio G. Cota, 2017/04/11
  - Re: [Qemu-devel] [PATCH 05/10] tcg: add jr opcode, Paolo Bonzini, 2017/04/13
- [Qemu-devel] [PATCH 10/10] tb-hash: improve tb_jmp_cache hash function in user mode, Emilio G. Cota, 2017/04/11
  - Re: [Qemu-devel] [PATCH 10/10] tb-hash: improve tb_jmp_cache hash function in user mode, Paolo Bonzini, 2017/04/11
    - Re: [Qemu-devel] [PATCH 10/10] tb-hash: improve tb_jmp_cache hash function in user mode, Emilio G. Cota, 2017/04/12
- [Qemu-devel] [PATCH 08/10] target/arm: optimize indirect branches with TCG's jr op, Emilio G. Cota, 2017/04/11
- [Qemu-devel] [PATCH 09/10] target/i386: optimize indirect branches with TCG's jr op, Emilio G. Cota <=
  - Re: [Qemu-devel] [PATCH 09/10] target/i386: optimize indirect branches with TCG's jr op, Paolo Bonzini, 2017/04/11
    - Re: [Qemu-devel] [PATCH 09/10] target/i386: optimize indirect branches with TCG's jr op, Emilio G. Cota, 2017/04/12
    - Re: [Qemu-devel] [PATCH 09/10] target/i386: optimize indirect branches with TCG's jr op, Paolo Bonzini, 2017/04/14
- Re: [Qemu-devel] [PATCH 00/10] TCG optimizations for 2.10, Alex Bennée, 2017/04/12

Prev by Date: [Qemu-devel] [PATCH 08/10] target/arm: optimize indirect branches with TCG's jr op
Next by Date: Re: [Qemu-devel] [PATCH v4] migration/block: use blk_pwrite_zeroes for each zero cluster
Previous by thread: [Qemu-devel] [PATCH 08/10] target/arm: optimize indirect branches with TCG's jr op
Next by thread: Re: [Qemu-devel] [PATCH 09/10] target/i386: optimize indirect branches with TCG's jr op
Index(es):
- Date
- Thread