
Re: [Qemu-devel] [PATCH v1 08/14] hostfloat: support float32/64 addition and subtraction


From: Emilio G. Cota
Subject: Re: [Qemu-devel] [PATCH v1 08/14] hostfloat: support float32/64 addition and subtraction
Date: Tue, 27 Mar 2018 14:08:53 -0400
User-agent: Mutt/1.5.24 (2015-08-30)

On Tue, Mar 27, 2018 at 12:41:18 +0100, Alex Bennée wrote:
> 
> Emilio G. Cota <address@hidden> writes:
> 
> > On Thu, Mar 22, 2018 at 14:41:05 +0800, Richard Henderson wrote:
> > (snip)
> >> Another thought re all of the soft_is_normal || soft_is_zero
> >> checks that you're performing.  I think it would be nice if we
> >> could work with float*_unpack_canonical so that we don't have to
> >> duplicate work.  E.g.
> >>
> >> /* Return true for float_class_normal or float_class_zero.  */
> >> static inline bool is_finite(FloatClass c)
> >> {
> >>     return c <= float_class_zero;
> >> }
> >>
> >> float32 float32_add(float32 a, float32 b, float_status *s)
> >> {
> >>   FloatClass a_cls = float32_classify(a);
> >>   FloatClass b_cls = float32_classify(b);
> >
> > Just looked at this. It can be done, although it comes at the
> > price of some performance for fp-bench -o add:
> > 180 MFlops vs. 196 MFlops, i.e. an 8% slowdown. That is with
> > adequate inlining etc.; otherwise performance is worse.
> >
> > I'm not convinced we'd gain enough in simplicity to justify
> > the perf impact. Yes, we'd simplify canonicalize(), but we'd
> > probably need a float_class_denormal [*], which would
> > complicate everything else.
> >
> > I think it makes sense to keep some inlines that work on
> > the float32/64's directly.
> >
> >>   if (is_finite(a_cls) && is_finite(b_cls) && ...) {
> >>       /* do hardfp thing */
> >>   }
> >
> > [*] Taking 0, denormals and normals would be fine for correctness,
> > but we really don't want to compute ops with denormal inputs on
> > the host: the output is very likely to be denormal as well, and
> > we'd end up deferring to soft-fp anyway, since computing whether
> > the underflow exception has occurred is expensive.
> >
> >>   pa = float32_unpack(a, ca, s);
> >>   pb = float32_unpack(b, cb, s);
> >>   pr = addsub_floats(pa, pb, s, false);
> >>   return float32_round_pack(pr, s);
> >> }
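
A minimal sketch of how the pieces above fit together, using the
names from this thread (FloatParts is assumed to be the type that
float32_unpack returns, and hard_f32_add is a hypothetical stand-in
for the host-FP fast path, which checks more preconditions than
shown here):

    float32 float32_add(float32 a, float32 b, float_status *s)
    {
        FloatClass ca = float32_classify(a);
        FloatClass cb = float32_classify(b);
        FloatParts pa, pb, pr;

        /* Fast path: both inputs are zero or normal, so the host FPU
         * can compute the result directly. */
        if (is_finite(ca) && is_finite(cb) /* && other preconditions */) {
            return hard_f32_add(a, b, s); /* hypothetical helper */
        }

        /* Slow path: defer to soft-float. */
        pa = float32_unpack(a, ca, s);
        pb = float32_unpack(b, cb, s);
        pr = addsub_floats(pa, pb, s, false);
        return float32_round_pack(pr, s);
    }
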
> >
> > It pays off to have two separate functions (add & sub) for the
> > slow path. With soft_f32_add/sub factored out:
> >
> > $ taskset -c 0 x86_64-linux-user/qemu-x86_64 tests/fp-bench -o add
> > 197.53 MFlops
> >
> > With the above four lines (pa...return) as an else branch:
> > 169.16 MFlops
> >
> > BTW flattening makes things worse (150.63 MFlops).
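
To make the measured difference concrete: the factoring above amounts
to giving each operation its own specialized slow path, roughly as
sketched below (names as used in this thread; signatures and the
FloatParts type are assumed). Once addsub_floats is inlined, its
constant subtract flag is folded away, so neither specialized path
pays for a runtime branch:

    static float32 soft_f32_add(float32 a, float32 b, float_status *s)
    {
        FloatParts pa = float32_unpack(a, float32_classify(a), s);
        FloatParts pb = float32_unpack(b, float32_classify(b), s);

        /* 'false' selects addition; constant-folded after inlining */
        return float32_round_pack(addsub_floats(pa, pb, s, false), s);
    }

    static float32 soft_f32_sub(float32 a, float32 b, float_status *s)
    {
        FloatParts pa = float32_unpack(a, float32_classify(a), s);
        FloatParts pb = float32_unpack(b, float32_classify(b), s);

        /* 'true' selects subtraction */
        return float32_round_pack(addsub_floats(pa, pb, s, true), s);
    }
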
> 
> That's disappointing. Did you look at the generated code? Given the
> way we are abusing __flatten__ to effectively make a compile-time
> template, you would hope the compiler could hoist the relevant
> classify bits above the hard-float branch and do the rest later if
> needed.
> 
> Everything was inlined or in softfloat.c for this test, right?

Yes. It's just that the classify bits are more expensive than the
alternative, which is fair enough once you look at classify().
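
For the archives: __flatten__ is presumably GCC/Clang's flatten
function attribute (__attribute__((flatten))), which inlines, where
possible, every call made inside the annotated function, effectively
turning a generic helper into a compile-time template. A minimal
illustration, with a placeholder helper name:

    #define __flatten__ __attribute__((flatten))  /* presumed definition */

    /* Every call in this body is inlined into this one function, so
     * the compiler can specialize and schedule across all of it. */
    static __flatten__ float32 float32_add(float32 a, float32 b,
                                           float_status *s)
    {
        return generic_addsub(a, b, s, false); /* placeholder helper */
    }
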

                E.


