
From: Vadim Zeitlin
Subject: Re[2]: [lmi] rounding unit test under Linux/x86-64
Date: Fri, 21 May 2010 18:12:08 +0200

On Fri, 26 Mar 2010 00:52:50 +0000 Greg Chicares <address@hidden> wrote:

GC> We could simply use '-mfpmath=387', couldn't we?

 No, unfortunately we can't do this, because the rounding functions are
still in libc (or libm) and still use SSE instructions, and so are not
affected by our tweaking of the x87 control word. IOW the rounding test
still fails with -mfpmath=387.
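
 (For reference, a minimal sketch, assuming glibc and C99/C++11 <cfenv>, of
why the standard fesetround() does work here: on x86 glibc updates both the
x87 control word and the SSE MXCSR register, so SSE-based libm code honors
it too. The pragma caveat is the one discussed under point 1 below.)

    #include <cfenv>
    #include <cmath>
    #include <cstdio>

    int main()
    {
        // Strictly this requires '#pragma STDC FENV_ACCESS ON', which
        // g++ doesn't implement; it works in practice nonetheless.
        std::fesetround(FE_DOWNWARD); // sets x87 CW *and* MXCSR on x86
        std::printf("%.1f\n", std::rint(2.7)); // prints 2.0, not 3.0
    }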

GC> However, with SSE, there's no extended precision anyway, so there'd
GC> be nothing useful for fesetprec() to control...so fegetprec() is all
GC> we can hope for, and it should be available in any new compiler. For
GC> legacy compilers, we can just keep the 80x87 asm.
GC> 
GC> > But I
GC> > really think that if we're interested in fixing this at all,
GC> 
GC> Yes, controlling rounding direction is imperative, and I would prefer
GC> to control precision as well.

 Considering that we can't control precision when using SSE, I wonder
whether we want to keep the fenv_precision() functions at all, and risk
getting different results on different machines for code that uses them, or
whether we should make them private and use them only in fenv_initialize()
and fenv_validate().

 Currently fenv_precision() is used only in math_functors_test.cpp, so
IMHO it wouldn't be a big loss to remove it. As usual, I prefer not having
an unnecessary/unused function at all to trying to understand how it should
behave in an SSE-and-not-x87 build. After all, removing a function is the
only sure way to guarantee that there are no bugs in it.


GC> Perhaps someday when you have the machine cycles to spare you could
GC> compile with '-mfpmath=sse2' and '-mfpmath=387' separately and post
GC> speed measurements.

 I ran the lmi_cli self test 20 times for each of the following builds of
LMI under Debian Linux using g++ 4.3.2:

- 387/32: 32-bit mode build with default flags.
- SSE/32: 32-bit mode build with -mfpmath=sse -msse2.
- 387/64: 64-bit mode build with -mfpmath=387.
- SSE/64: 64-bit mode build with default flags.

All builds also used -O2. The full results are in Appendix A below, but
the gist is that there are no significant differences between the 387 and
SSE versions. OTOH the 64-bit versions have ~20% better minimal times, so
it might be beneficial to produce 64-bit LMI versions in the future for
performance reasons. (I think the minimal times are the most relevant of
the numbers I measured, as higher numbers presumably indicate contention
for system resources: the machine used for the tests isn't completely
idle.) Also notice that the 32-bit x87 builds were more likely than the
others to take significantly longer than the minimal time, which explains
their poor averages. This presumably happens because other processes on the
system use the FPU but not SSE, but that is just a guess, of course.


 FWIW I also used test_math_functors for its speed tests, but they are
pretty useless because the compiler managed to optimize mete0() away
completely:

(gdb) disassemble mete0
Dump of assembler code for function _Z5mete0v:
0x000000000040abd0 <_Z5mete0v+0>:       mov    rax,0x3f6ad187a99ae58c
0x000000000040abda <_Z5mete0v+10>:      mov    QWORD PTR [rsp-0x8],rax
0x000000000040abdf <_Z5mete0v+15>:      mov    rax,0x3fe33ba7eb52e57c
0x000000000040abe9 <_Z5mete0v+25>:      mov    QWORD PTR [rsp-0x8],rax
0x000000000040abee <_Z5mete0v+30>:      mov    rax,0x3fa40c5871b7c8bf
0x000000000040abf8 <_Z5mete0v+40>:      mov    QWORD PTR [rsp-0x8],rax
0x000000000040abfd <_Z5mete0v+45>:      mov    rax,0x3f9e63dd8e598f27
0x000000000040ac07 <_Z5mete0v+55>:      mov    QWORD PTR [rsp-0x8],rax
0x000000000040ac0c <_Z5mete0v+60>:      ret
End of assembler dump.
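
(If we ever want meaningful numbers from mete0(), one standard trick,
sketched below with hypothetical names, is to route the input through a
volatile object so that the computation can't be constant-folded away; the
body shown is only a stand-in for the real one in math_functors_test.cpp.)

    #include <cmath>

    volatile double input = 0.04; // volatile: load can't be folded away
    volatile double sink;         // volatile: store must actually happen

    void mete0_guarded()
    {
        double i = input; // forced runtime load defeats constant folding
        sink = std::expm1(std::log1p(i) / 12.0); // computed at run time
    }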

So the "slow" method is much faster than the "fast" one: the numbers are
~65ns for mete0() and ~700ns for mete1(). But the really strange thing is
that mete0() takes longer to run (~72ns instead of ~65ns) with 387 FP math,
even though the disassembled code is identical in both builds. I really,
really don't know how this is possible, but I have checked and rechecked
everything, there doesn't seem to be anything wrong with my test procedure,
and I consistently get the same results. I see the same effect for mete1(),
which takes ~700ns with SSE and ~720ns without, but there it might be due
to a difference in the code generated in the two cases (although I'm not
even sure that it's different; I'm just not sure that it's the same). But
in the mete0() case I really don't know what could be going on. My only
hypothesis is that the FP initialization code, which works directly with
the x87 control word, somehow slows down all the rest of the code, but I
don't see how it could. Ah, and just to make this even more confusing, the
effect is present only in 64-bit builds, not 32-bit ones. OTOH the 32-bit
builds are much slower here because of all the pushing to and popping from
the stack, compared to just moving values directly into registers in the
64-bit build; see Appendix B.

 To finish with test_math_functors, I'd also like to note that while the
results are exactly the same whether or not SSE is used in the optimized
build, and also between the SSE and x87 versions of the unoptimized debug
build, they are *not* the same between the debug and optimized builds; see
Appendix C. The difference appears only for pow(), and it's in the 14th and
17th decimal digits for the double and long double precision results
respectively, so it's hardly significant, but it is present. The good news
is that there are at least no differences between the 32- and 64-bit
builds.

GC> Some errors might be reported with SSE (due to lower precision),
GC> but that ought not to affect the usefulness of the timings.

 No errors were reported in any of the 387/SSE, debug/optimized,
32/64-bit builds.


GC> >  Please let me know if you'd like me to enable LMI_IEC_559 and retest with
GC> > it.
GC> 
GC> Yes, would you please do that?

 The smallest set of changes I needed to make this compile was:

1. Disable the FENV_ACCESS pragma for g++ up to and including 4.5:
   otherwise an #error in fenv_lmi.hpp prevented me from continuing.

2. Remove the #error in fenv_initialize() in the non-MinGW case, as
   setting the precision is not supported by IEC 559 and we have no choice
   but to assume that we don't need it anyhow (see the sketch below this
   list). I wonder if we want to keep the fesetenv(FE_PC64_ENV) call for
   MinGW?

3. Move the declaration of e_ieee754_rounding, as this enum is not
   x87-specific, from fenv_lmi_x86.hpp (which is now not even included
   when using IEC 559) to fenv_lmi.hpp itself. Also remove the duplicate
   declaration of the same enum from round_test.cpp.

4. Disable all the tests using the x87 control word and fenv_validate()
   in fenv_lmi_test.cpp when using IEC 559.

5. Disable the call to fenv_validate() in fenv_guard.cpp when not using
   x87.

After doing this, everything compiles, and the round, fenv_lmi and
math_functors tests pass (the last required further changes, discussed in a
separate thread).
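
 Here is, for reference, a minimal sketch, assuming C++11's <cfenv>, of
what fenv_initialize() might reduce to after change 2; this is my reading
of that change, not the actual lmi code:

    #include <cfenv>

    void fenv_initialize()
    {
    #if defined __MINGW32__
        // MinGW extension: x87 default environment, 64-bit precision.
        std::fesetenv(FE_PC64_ENV);
    #else
        // IEC 559 exposes no precision control; the default environment
        // already rounds to nearest, so the explicit call below merely
        // documents the one setting we do rely on.
        std::fesetenv(FE_DFL_ENV);
        std::fesetround(FE_TONEAREST);
    #endif
    }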

 Now, this was the minimal set of changes, but not necessarily the best
one. I think we need to make the code clearer. First, I propose to add
LMI_FP_XXX preprocessor constants and use them instead of LMI_X86, as we
can now use either IEC 559 or x87-specific code under x86. I.e. I suggest
defining either LMI_FP_X86 or LMI_FP_IEC_559, which would be mutually
exclusive, unlike LMI_IEC_559 and LMI_X86 currently.
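
Something along these lines, with the placement in fenv_lmi.hpp being just
illustrative:

    // Derive mutually exclusive FP-environment macros from the existing
    // configuration macros (sketch only).
    #if defined LMI_IEC_559
    #   define LMI_FP_IEC_559
    #elif defined LMI_X86
    #   define LMI_FP_X86
    #else
    #   error Unsupported floating-point environment.
    #endif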

 Second, we need to decide what to do about fenv_precision() in the
non-x87 case, as discussed above.

 Third, the same question arises for fenv_validate(). Here I guess we do
need to keep it, but just always return true from it when not using x87. Or
should it check the rounding mode? To be honest, I'm not sure about this
function's utility outside of the specific circumstances that motivated its
introduction. The decision about what to do with fenv_guard() also depends
on this.
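
For concreteness, the trivial non-x87 version I have in mind, as a sketch
only, assuming the LMI_FP_XXX macros proposed above:

    #include <cfenv>

    #if !defined LMI_FP_X86
    // Whether to check the rounding mode at all is exactly the open
    // question; the alternative is an unconditional 'return true;'.
    bool fenv_validate()
    {
        return std::fegetround() == FE_TONEAREST;
    }
    #endif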


 I'd love to get answers to these questions before making the final
patch. I'd also like to know whether you're interested in testing a Windows
64-bit build or whether this should be postponed.

 Thanks,
VZ


Appendix A: Summary of results of lmi_cli --selftest:

        Min     Max     Avg     Median
387/32: 0.107   0.267   0.169   0.111
SSE/32: 0.109   0.146   0.121   0.118
387/64: 0.087   0.210   0.124   0.088
SSE/64: 0.084   0.227   0.138   0.086


Appendix B: Summary of test_math_functors assay_speed() (average, in ns,
for mete0() and mete1() respectively):

387/32:  430/1280
SSE/32:  430/1320
387/64:   72/ 720
SSE/64:   65/ 700


Appendix C: Summary of test_math_functors sample_results() output:

SSE2 -O2:
    long double precision, expm1 and log1p
      -0.0039841072324612947578
    long double precision, pow
      -0.0039841072324612800126
    double precision, expm1 and log1p
      -0.0039841072324612938904
    double precision, pow
      -0.0039841072324612800126
387 -O2:
    long double precision, expm1 and log1p
      -0.0039841072324612947578
    long double precision, pow
      -0.0039841072324612800126
    double precision, expm1 and log1p
      -0.0039841072324612938904
    double precision, pow
      -0.0039841072324612800126
SSE -O0:
    long double precision, expm1 and log1p
      -0.0039841072324612947578
    long double precision, pow
      -0.0039841072324612817473
    double precision, expm1 and log1p
      -0.0039841072324612938904
    double precision, pow
      -0.0039841072324439119612
387 -O0:
    long double precision, expm1 and log1p
      -0.0039841072324612947578
    long double precision, pow
      -0.0039841072324612817473
    double precision, expm1 and log1p
      -0.0039841072324612938904
    double precision, pow
      -0.0039841072324439119612

