[Qemu-devel] [PATCH v1 14/14] hostfloat: support float32_to_float64


From: Emilio G. Cota
Subject: [Qemu-devel] [PATCH v1 14/14] hostfloat: support float32_to_float64
Date: Wed, 21 Mar 2018 16:11:49 -0400

Performance improvement for SPEC06fp for the last few commits:

  [Bar chart] qemu-aarch64 SPEC06fp (test set) speedup over QEMU f6d81cdec8
  Host: Intel(R) Core(TM) i7-6700K CPU @ 4.00GHz
  Error bars: 95% confidence interval
  X axis: one group of bars per SPEC06fp benchmark, plus the geomean
  Y axis: speedup
  png: https://imgur.com/5BErNz7

That is, a final geomean speedup of 2.21X.

The floating point workloads from nbench show similar improvements:

  [Bar chart] qemu-aarch64 NBench score; higher is better
  Host: Intel(R) Core(TM) i7-6700K CPU @ 4.00GHz
  X axis: FOURIER, NEURAL NET, LU DECOMPOSITION, gmean
  Y axis: NBench score
  png: https://imgur.com/KjLHumh

That is, a ~2.6X speedup. [Error bars here are the standard deviation of
only a few measurements, which explains the noisy results.]

Results for the i386 target are very similar; the only major
difference is that they're much more sensitive to the multiplication
optimization, since the i386 target does not currently use floatX_muladd
(aka fma).

Below are the x86_64 SPEC06fp results. Note that they come from a
development branch, so the bars do not map one-to-one to the patches in
this series, and the final numbers might differ slightly from those you'd
get with these patches.

  [Bar chart] qemu-x86_64 SPEC06fp (train set) speedup over QEMU f6d81cdec8
  Host: Intel(R) Core(TM) i7-6700K CPU @ 4.00GHz
  Error bars: 95% confidence interval
  X axis: one group of bars per SPEC06fp benchmark, plus the geomean
  Y axis: speedup
  png: https://imgur.com/MfvTb3H

Two points are worth mentioning:

- Special-casing 0-inputs for multiplication pays off handsomely (the same
  thing happens for FMA for targets that use it). I was surprised to
  see that some benchmarks (e.g. GemsFDTD) compute >99% of their
  multiplications with at least one operand being zero (and this is
  without flush-to-zero!).

- Avoiding comparisons via the host FPU (i.e. using soft_t ## _is_normal()
  instead of glibc's isnormal()) gives a small speedup.
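
To illustrate both points, here is a minimal standalone sketch. It is not
the code from this series: the helper names are hypothetical, it works on
raw IEEE 754 binary32 bit patterns, and it ignores exception flags and
rounding modes, which a real implementation must handle. It classifies
inputs with integer operations only and short-circuits multiplications
that have a zero operand.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* All names below are hypothetical; none of them is QEMU API. */

/* Normal number: biased exponent is neither 0 (zero/subnormal)
 * nor 0xff (infinity/NaN). Integer-only, no host FPU compare. */
static bool f32_bits_is_normal(uint32_t bits)
{
    uint32_t exp = (bits >> 23) & 0xff;
    return exp != 0 && exp != 0xff;
}

/* +0 or -0: everything except the sign bit is clear. */
static bool f32_bits_is_zero(uint32_t bits)
{
    return (bits & 0x7fffffffu) == 0;
}

/* Reinterpret bits as a host float and back, via memcpy. */
static float f32_from_bits(uint32_t bits)
{
    float f;
    memcpy(&f, &bits, sizeof(f));
    return f;
}

static uint32_t f32_to_bits(float f)
{
    uint32_t bits;
    memcpy(&bits, &f, sizeof(bits));
    return bits;
}

/*
 * Try to multiply on the fast path. Returns true and stores the result
 * bits in *out when handled; returns false when the caller should fall
 * back to the soft-float implementation.
 */
static bool f32_mul_fast(uint32_t a, uint32_t b, uint32_t *out)
{
    bool a_ok = f32_bits_is_zero(a) || f32_bits_is_normal(a);
    bool b_ok = f32_bits_is_zero(b) || f32_bits_is_normal(b);

    if (!a_ok || !b_ok) {
        return false;           /* subnormal, infinity or NaN input */
    }
    if (f32_bits_is_zero(a) || f32_bits_is_zero(b)) {
        /* zero times a finite value: a zero with the XOR of the signs */
        *out = (a ^ b) & 0x80000000u;
        return true;
    }
    uint32_t r = f32_to_bits(f32_from_bits(a) * f32_from_bits(b));
    if (!f32_bits_is_normal(r)) {
        return false;           /* overflow/underflow: let soft-float decide */
    }
    *out = r;
    return true;
}
```

A caller would try f32_mul_fast() first and invoke the soft-float
multiplication only when it returns false.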

Finally, the same results using native execution time as the baseline,
where we plot the slowdown instead of the speedup.
We bring down the slowdown of SPEC06fp w.r.t. native from ~21X to ~10X:

  [Bar chart] qemu-x86_64 SPEC06fp (train set) slowdown over native (lower is better)
  Host: Intel(R) Core(TM) i7-6700K CPU @ 4.00GHz
  Error bars: 95% confidence interval
  X axis: one group of bars per SPEC06fp benchmark, plus the geomean
  Y axis: slowdown
  png: https://imgur.com/iTmVkJL

All PNGs shown above can be found here: https://imgur.com/a/YSxxR
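
For readers who want to try the float32-to-float64 fast path outside the
tree, the following is a rough standalone sketch of the same idea. It is
an illustration under assumptions, not the patch itself: the function name
is hypothetical, values are passed as raw bit patterns, memcpy is used for
the reinterpretation, and a false return stands in for the call to the
soft-float routine.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical helper, not QEMU API. Widens an IEEE 754 binary32 bit
 * pattern to binary64. Returns true and stores the result in *out when
 * the input is normal or zero; returns false for subnormals, infinities
 * and NaNs, which the caller should hand to the soft-float path.
 */
static bool f32_to_f64_fast(uint32_t a, uint64_t *out)
{
    uint32_t exp = (a >> 23) & 0xff;

    if (exp != 0 && exp != 0xff) {
        float f;
        double d;

        memcpy(&f, &a, sizeof(f));
        d = f;                  /* exact: every normal float is a normal double */
        memcpy(out, &d, sizeof(*out));
        return true;
    }
    if ((a & 0x7fffffffu) == 0) {
        /* signed zero: move the sign bit from bit 31 to bit 63 */
        *out = (uint64_t)(a >> 31) << 63;
        return true;
    }
    return false;
}
```

Widening a normal or zero input is exact, so no rounding occurs on these
two paths.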

Signed-off-by: Emilio G. Cota <address@hidden>
---
 include/fpu/hostfloat.h |  2 ++
 include/fpu/softfloat.h |  2 +-
 fpu/hostfloat.c         | 14 ++++++++++++++
 fpu/softfloat.c         |  2 +-
 4 files changed, 18 insertions(+), 2 deletions(-)

diff --git a/include/fpu/hostfloat.h b/include/fpu/hostfloat.h
index aa555f6..79e9b6c 100644
--- a/include/fpu/hostfloat.h
+++ b/include/fpu/hostfloat.h
@@ -29,4 +29,6 @@ float64 float64_sqrt(float64 a, float_status *status);
 int float64_compare(float64 a, float64 b, float_status *s);
 int float64_compare_quiet(float64 a, float64 b, float_status *s);
 
+float64 float32_to_float64(float32, float_status *status);
+
 #endif /* HOSTFLOAT_H */
diff --git a/include/fpu/softfloat.h b/include/fpu/softfloat.h
index cb57942..b0a4d75 100644
--- a/include/fpu/softfloat.h
+++ b/include/fpu/softfloat.h
@@ -334,7 +334,7 @@ int64_t float32_to_int64(float32, float_status *status);
 uint64_t float32_to_uint64(float32, float_status *status);
 uint64_t float32_to_uint64_round_to_zero(float32, float_status *status);
 int64_t float32_to_int64_round_to_zero(float32, float_status *status);
-float64 float32_to_float64(float32, float_status *status);
+float64 soft_float32_to_float64(float32, float_status *status);
 floatx80 float32_to_floatx80(float32, float_status *status);
 float128 float32_to_float128(float32, float_status *status);
 
diff --git a/fpu/hostfloat.c b/fpu/hostfloat.c
index 139e419..b635839 100644
--- a/fpu/hostfloat.c
+++ b/fpu/hostfloat.c
@@ -326,3 +326,17 @@ GEN_FPU_SQRT(float64_sqrt, float64, double, sqrt)
 GEN_FPU_COMPARE(float32_compare, float32, float)
 GEN_FPU_COMPARE(float64_compare, float64, double)
 #undef GEN_FPU_COMPARE
+
+float64 float32_to_float64(float32 a, float_status *status)
+{
+    if (likely(float32_is_normal(a))) {
+        float f = *(float *)&a;
+        double r = f;
+
+        return *(float64 *)&r;
+    } else if (float32_is_zero(a)) {
+        return float64_set_sign(float64_zero, float32_is_neg(a));
+    } else {
+        return soft_float32_to_float64(a, status);
+    }
+}
diff --git a/fpu/softfloat.c b/fpu/softfloat.c
index 1a32216..cf8d6ec 100644
--- a/fpu/softfloat.c
+++ b/fpu/softfloat.c
@@ -3149,7 +3149,7 @@ float128 uint64_to_float128(uint64_t a, float_status *status)
 | Arithmetic.
 *----------------------------------------------------------------------------*/
 
-float64 float32_to_float64(float32 a, float_status *status)
+float64 soft_float32_to_float64(float32 a, float_status *status)
 {
     flag aSign;
     int aExp;
-- 
2.7.4



