Re: [RFC PATCH v2] target/ppc: Enable hardfloat for PPC

qemu-devel

From:	BALATON Zoltan
Subject:	Re: [RFC PATCH v2] target/ppc: Enable hardfloat for PPC
Date:	Wed, 19 Feb 2020 16:35:07 +0100 (CET)
User-agent:	Alpine 2.22 (BSF 395 2020-01-19)

Hello,

On Tue, 18 Feb 2020, Programmingkid wrote:

On Feb 18, 2020, at 12:10 PM, BALATON Zoltan <address@hidden> wrote:
While other targets take advantage of using host FPU to do floating
point computations, this was disabled for PPC target because always
clearing exception flags before every FP op made it slightly slower
than emulating everything with softfloat. To emulate some FPSCR bits,
clearing of fp_status may be necessary (unless these could be handled
e.g. using FP exceptions on host but there's no API for that in QEMU
yet) but preserving at least the inexact flag makes hardfloat usable
and faster than softfloat. Since most clients don't actually care
about this flag, we can gain some speed trading some emulation
accuracy.

This patch implements a simple way to keep the inexact flag set for
hardfloat while still allowing to revert to softfloat for workloads
that need more accurate albeit slower emulation. (Set hardfloat
property of CPU, i.e. -cpu name,hardfloat=false for that.) There may
still be room for further improvement but this seems to increase
floating point performance. Unfortunately the softfloat case is slower
than before this patch so this patch only makes sense if the default
is also set to enable hardfloat.

Because of the above this patch at the moment is mainly for testing
different workloads to evaluate how viable would this be in practice.
Thus, RFC and not ready for merge yet.

Signed-off-by: BALATON Zoltan <address@hidden>
---
v2: use different approach to avoid needing if () in
helper_reset_fpstatus() but this does not seem to change overhead
much, also make it a single patch as adding the hardfloat option is
only a few lines; with this we can use same value at other places where
float_status is reset and maybe enable hardfloat for a few more places
for a little more performance but not too much. With this I got:


<snip>

Thank you for working on this. It is about time we have a better FPU.

Thank you for testing it. I think it would be great if we could come up with some viable approach to improve this before the next freeze.

I applied your patch over David Gibson's ppc-for-5.0 branch. It applied cleanly 
and compiled easily.

I've heard some preliminary results from others that there's also a difference in performance between v1 and v2 of the patch, where v1 may be faster for some cases, so if you (or someone else) want and have time you could experiment with different versions and combinations as well to find the one that's best on all CPUs. Basically we have these parts:

1. Change target/ppc/fpu_helper.c::helper_reset_fpstatus() to force float_flag_inexact on in case hardfloat is enabled. I've tried two approaches for this:


a. In v1 added an if () in the function

b. In v2 used a variable from env set earlier (I'd hoped this may be faster but maybe it's not; testing and explanation are welcome)

2. Also change places where env->fp_status is copied to a local tstat and then reset (I think this is done to accumulate flags from multiple FP ops that would individually reset env->fp_status, or for some other reason; maybe this could be avoided if we reset fp_status less often, but that would need more understanding of the FP emulation than I have, so I did not try to clean that up yet).

If v2 is really slower than v1 then I'm not sure whether it is because of also changing the places with tstat or because of the different approach in helper_reset_fpstatus(), so you could try combinations of these as well.
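To make the two variants of item 1 concrete, here is a minimal standalone sketch. It is not the patch itself: MockEnv, FLAG_INEXACT, FLAG_OVERFLOW, default_flags and the function names are hypothetical stand-ins, while the real code works on env->fp_status inside QEMU. It only shows the shape of "if () on every reset" versus "store a value prepared earlier".

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-ins; these are NOT QEMU's real definitions. */
enum { FLAG_INEXACT = 1u << 0, FLAG_OVERFLOW = 1u << 1 };

typedef struct {
    uint8_t exception_flags; /* models the flags kept in fp_status */
    bool hardfloat;          /* models the hardfloat CPU property */
    uint8_t default_flags;   /* v2 only: value prepared when the property is set */
} MockEnv;

/* Set the property and precompute the reset value used by the v2 variant. */
static void set_hardfloat(MockEnv *env, bool on)
{
    env->hardfloat = on;
    env->default_flags = on ? FLAG_INEXACT : 0;
}

/* v1 style: clear the flags, then branch on the property at every reset. */
static void reset_fpstatus_v1(MockEnv *env)
{
    env->exception_flags = 0;
    if (env->hardfloat) {
        env->exception_flags |= FLAG_INEXACT;
    }
}

/* v2 style: no branch, the reset is a plain store of the prepared value. */
static void reset_fpstatus_v2(MockEnv *env)
{
    env->exception_flags = env->default_flags;
}

/* Both variants should leave only the inexact flag set with hardfloat on. */
static void run_example(void)
{
    MockEnv env = {0};

    set_hardfloat(&env, true);
    env.exception_flags = FLAG_OVERFLOW;
    reset_fpstatus_v1(&env);
    assert(env.exception_flags == FLAG_INEXACT);

    env.exception_flags = FLAG_OVERFLOW;
    reset_fpstatus_v2(&env);
    assert(env.exception_flags == FLAG_INEXACT);
}
```

The two functions are observably equivalent, so any speed difference between them would come from the branch versus the extra load, which is something only measurement on the host in question can settle.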

Tests were done on a Mac OS 10.4.3 VM. The CPU was set to G3.

What was the host CPU and OS this was tested on? Please always share CPU info and host OS when sharing benchmark results so they are somewhat comparable. It also depends on CPU features, for vector instructions at least, so without CPU info the results cannot be understood.

I think G3 does not have AltiVec/VMX so maybe testing with G4 would be better to also test those ops, unless there's a reason to only test G3. I've tested with G4 both FPU only and FPU+VMX code on Linux host with i7-9700K CPU @ 3.60GHz as was noted in the original cover letter, but maybe I've also forgotten some details so I list it here again.

I did several tests and here are my results:

With hard float:
- The USB audio device does not produce any sound.

I've heard this could also be due to some other problem not directly related to FPU; maybe there's a problem with USB/OHCI emulation as well, because problems with that were reported, but it's interesting that you get different results changing FPU related stuff. I think OSX uses float samples so probably does use FPU for processing sound and may rely on some peculiarity of the hardware as it was probably optimised for Apple hardware. It would be interesting to find out how FPU stuff is related to this, but since it's broken anyway it's probably not a show stopper at the moment.

- Converting a MIDI file to AAC in iTunes happens at 0.4x (faster than soft 
float :) ).

Does the resulting file match? As a simple test I've verified md5sum of the resulting mp3 with the lame benchmark I've tried, just to find any big errors. Even if it does not prove that nothing broke, it should detect if something breaks badly. However that was WAV->MP3 where results were the same, although the VMX build did produce a different result than FPU only but did so consistently for multiple runs. With MIDI there might be slight timing differences that could cause different audio results, so you should first verify whether doing the conversion multiple times produces the same result at all without any patch.

For my FPSCR test program, 21 tests failed. The high number is because the inexact exception is being set for situations it should not be set for.

Since we force the inexact flag to be set to enable hardfloat this is expected. More interesting is whether, apart from this, there are any differences in the results compared to the softfloat case (that may also be host CPU dependent I think). Do you have more detailed info on the errors and differences found?

Some of the problems with inexact may be fixed by not always forcing the flag on but just not clearing it. As I understood, other targets do that, so it starts with softfloat but the first time the inexact flag is set it will start using hardfloat as long as the guest does not clear this flag. Probably this is done to automatically detect code that needs the flag and assume it's not used when it's not touched. Since PPC also has an inexact flag just for the previous FP op (the FI bit) apart from the usual cumulative flag, the client could read that instead of clearing the cumulative flag, so we can't detect guest usage this way; therefore we might as well break inexact completely to always use hardfloat, and require manually enabling softfloat for guests that we know need it. I'm not sure however if forcing the inexact flag would lead to unwanted FP exceptions as well, so this may also need to be made conditional on the enabled/disabled status of inexact FP exceptions. Does anyone have more info on this?
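The "don't clear, just don't force" scheme and the FI problem can be illustrated with a small standalone model. This is only a sketch under assumptions: SketchStatus, SK_INEXACT, can_use_host_fpu() and sketch_add() are made-up names, and the caller passes in whether an operation is inexact because the model has no real softfloat behind it.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model; none of these names come from QEMU. */
enum { SK_INEXACT = 1u << 0 };

typedef struct {
    uint8_t sticky_flags;  /* cumulative flags, cleared only by the guest */
    bool last_op_inexact;  /* models a per-operation bit such as PPC's FI */
    unsigned soft_ops;     /* operations that took the slow, exact path */
    unsigned hard_ops;     /* operations that took the host FPU path */
} SketchStatus;

/* The fast path is allowed only while the cumulative inexact flag is set. */
static bool can_use_host_fpu(const SketchStatus *s)
{
    return (s->sticky_flags & SK_INEXACT) != 0;
}

/*
 * 'inexact' stands in for what a real softfloat routine would detect.
 * On the fast path it is ignored, which is exactly the information loss
 * that makes a per-operation bit unreliable.
 */
static double sketch_add(SketchStatus *s, double a, double b, bool inexact)
{
    if (can_use_host_fpu(s)) {
        s->hard_ops++;
        return a + b;      /* sticky flag already set, nothing to update */
    }
    s->soft_ops++;
    s->last_op_inexact = inexact;
    if (inexact) {
        s->sticky_flags |= SK_INEXACT;
    }
    return a + b;
}

/* Walk through: slow until first inexact result, fast afterwards. */
static void sketch_demo(void)
{
    SketchStatus s = {0};

    sketch_add(&s, 1.0, 2.0, false);  /* exact, slow path */
    sketch_add(&s, 0.1, 0.2, true);   /* inexact, slow path, sets sticky flag */
    sketch_add(&s, 1.0, 2.0, false);  /* exact, but now on the fast path */

    assert(s.soft_ops == 2 && s.hard_ops == 1);
    assert(s.last_op_inexact);        /* stale: last op was actually exact */
}
```

In this model a guest that only reads the cumulative flag never notices the switch to the fast path, but a guest that reads the per-operation bit gets a stale value without ever clearing anything, so there is no event that could send it back to the slow path.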

With soft float:
- Some sound can be heard from the USB audio device. It isn't good sounding. I had to force quit Quicktime player because it stopped working.
- Converting a MIDI file to AAC in iTunes happens at 0.3x (slower than hard 
float).
- 13 tests failed with my FPSCR test program.
This patch is a good start. I'm not worried about the Floating Point Status and Control Register flags being wrong since hardly any software bothers to check them. I think more optimizations can happen by

I don't know if guest code checks fpscr and what flags it cares about. I also don't know if it's a fact that these are not used, but maybe if we test with more guest code we can find out. That's why I'd like to at least have an option to test with hardfloat. Unfortunately enabling hardfloat without also making it the default would make it slower, so if we go this way we should make sure we can also enable hardfloat as default.

simplifying the FPU. As it is now it makes a lot of calls per operation.

The question is whether those calls are really needed to emulate the PPC FPU, and if not, why would they be there? If the FPU is really that much different so all these calls are needed then there's not much to simplify (but maybe there could be some optimisations possible). This would need someone to understand the current code in full first, which we probably don't currently (at least I don't for sure, so can't really make changes either). Another more viable approach is to pick a small part and follow through with that and try to clean up and optimise that small part only. The exception and fpscr handling is one such part; another could be round_canonical(), which seems to be high on profiles I've taken. Maybe this could be done by reading and understanding docs only on the small part picked, which may be easier than getting everything first. I wonder if such smaller tasks could be defined and given out as GSoC or other volunteer projects?


Regards,
BALATON Zoltan
