Hi,
Since we started working with QEMU on PPC, we've noticed that
emulating PPC64 VMs is faster on x86_64 hosts than on PPC64 itself, even when
compared with x86 machines that are slower in other workloads (such as building
QEMU or the Linux kernel).
We thought it might be related to the TCG backend, which is presumably better
optimized for x86. As a first approach to better understand the problem, I ran some
boot tests with Fedora Cloud Base 35-1.2[1] on both platforms. Using the command line
./qemu-system-ppc64 -name Fedora-Cloud-Base-35-1.2.ppc64le -smp 2 -m 2G -vga none
-nographic -serial pipe:Fedora-Cloud-Base-35-1.2.ppc64le -monitor
unix:Fedora-Cloud-Base-35-1.2.ppc64le.mon,server,nowait -device
virtio-net,netdev=vmnic -netdev user,id=vmnic -cdrom fedora-cloud-init.iso -cpu
POWER10 -accel tcg -device virtio-scsi-pci -drive
file=Fedora-Cloud-Base-35-1.2.ppc64le.temp.qcow2,if=none,format=qcow2,id=hd0 -device
scsi-hd,drive=hd0 -boot c
on a POWER9 DD2.2 and an Intel Xeon E5-2687W, a simple bash script reads the ".out"
pipe until the "fedora login:" string is found and then issues a "system_powerdown"
through the QEMU monitor. The ".temp.qcow2" file is backed by the original Fedora
image and deleted at the end of the test, so every boot starts fresh. Running the
test 10 times gave us 235.26 ± 6.27 s on PPC64 and 192.92 ± 4.53 s on x86_64, i.e.,
TCG is ~20% slower on the POWER9.
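For reference, each iteration of the test looks roughly like the sketch below
(simplified; run_qemu stands for the full qemu-system-ppc64 command line above, and
details such as the use of mkfifo and socat may differ slightly from the actual
script):

# Rough sketch of one test iteration (simplified).
NAME=Fedora-Cloud-Base-35-1.2.ppc64le

# throw-away overlay backed by the pristine cloud image
qemu-img create -f qcow2 -F qcow2 -b ${NAME}.qcow2 ${NAME}.temp.qcow2

# FIFOs used by "-serial pipe:${NAME}"
mkfifo ${NAME}.in ${NAME}.out

start=$(date +%s)
run_qemu &                             # the command line shown above

# block until the guest reaches the login prompt
grep -q "fedora login:" ${NAME}.out
echo "boot took $(( $(date +%s) - start )) s"

# ask the guest to shut down through the monitor socket
echo system_powerdown | socat - UNIX-CONNECT:${NAME}.mon
wait

rm -f ${NAME}.temp.qcow2 ${NAME}.in ${NAME}.out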
As a second step, I wondered if this gap would be the same when emulating other
architectures on PPC64, so I used the same version of Fedora Cloud for aarch64[2] and
s390x[3], using the following command lines:
./qemu-system-aarch64 -name Fedora-Cloud-Base-35-1.2.aarch64 -smp 2 -m 2G -vga none
-nographic -serial pipe:Fedora-Cloud-Base-35-1.2.aarch64 -monitor
unix:Fedora-Cloud-Base-35-1.2.aarch64.mon,server,nowait -device
virtio-net,netdev=vmnic -netdev user,id=vmnic -cdrom fedora-cloud-init.iso -machine
virt -cpu max -accel tcg -device virtio-scsi-pci -drive
file=Fedora-Cloud-Base-35-1.2.aarch64.temp.qcow2,if=none,format=qcow2,id=hd0 -device
scsi-hd,drive=hd0 -boot c -bios ./pc-bios/edk2-aarch64-code.fd
and
./qemu-system-s390x -name Fedora-Cloud-Base-35-1.2.s390x -smp 2 -m 2G -vga none
-nographic -serial pipe:Fedora-Cloud-Base-35-1.2.s390x -monitor
unix:Fedora-Cloud-Base-35-1.2.s390x.mon,server,nowait -device virtio-net,netdev=vmnic
-netdev user,id=vmnic -cdrom fedora-cloud-init.iso -machine s390-ccw-virtio -cpu max
-accel tcg -hda Fedora-Cloud-Base-35-1.2.s390x.temp.qcow2 -boot c
With 50 runs, we got (boot times in seconds, mean ± standard deviation):
+---------+---------------------------------+
| | Host |
| Guest +----------------+----------------+
| | PPC64 | x86_64 |
+---------+----------------+----------------+
| PPC64 | 194.72 ± 7.28 | 162.75 ± 8.75 |
| aarch64 | 501.89 ± 9.98 | 586.08 ± 10.55 |
| s390x | 294.10 ± 21.62 | 223.71 ± 85.30 |
+---------+----------------+----------------+
The difference with an s390x guest is around 30%, with greater variability on
x86_64 whose source I couldn't pin down. However, the POWER9 emulates aarch64 faster
than this Xeon.
The particular workload of the guest could distort this result, since on first boot
Cloud-Init will create user accounts, generate SSH keys, etc. If the aarch64 guest
uses many vector instructions for this initial setup, that might explain why an
older Xeon would be slower here.
As a final test, I changed the images to have a normal user account already created
and unlocked, disabled Cloud-Init, downloaded the bc-1.07 sources[4][5], installed
its build dependencies[6], and changed the test script to log in, extract, configure,
build, and shut down the guest (the in-guest steps are sketched after the table
below). I also added an aarch64-compatible machine (Apple M1 w/ 10 cores) to our
test setup. Running 100 iterations gave us the following results (times in seconds,
mean ± standard deviation):
+---------+----------------------------------------------------+
| | Host |
| Guest +-----------------+-----------------+----------------+
| | PPC64 | x86_64 | aarch64 |
+---------+-----------------+-----------------+----------------+
| PPC64 | 429.82 ± 11.57 | 352.34 ± 8.51 | 180.78 ± 42.02 |
| aarch64 | 1029.78 ± 46.01 | 1207.98 ± 80.49 | 487.50 ± 7.54 |
| s390x | 589.97 ± 86.67 | 411.83 ± 41.88 | 221.86 ± 79.85 |
+---------+-----------------+-----------------+----------------+
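For clarity, the in-guest part of that last test boils down to something like this
(the real script drives these commands over the serial console, and details such as
the tarball name and the -j value are illustrative):

# Sketch of the guest-side workload.
tar xf bc-1.07.tar.gz
cd bc-1.07
./configure
make -j2                # matches -smp 2; the actual job count may differ
cd ..
sudo shutdown -h now    # or system_powerdown via the monitor, as before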
The pattern of PPC64 vs. x86_64 remains: PPC64 and s390x guests are roughly 20% and
40% slower on the POWER9, respectively, but the aarch64 VM is still slower on this
Xeon. If the PPC backend can outperform the x86 one when emulating some
architectures, I guess that improving PPC64-on-PPC64 emulation isn't "just" a matter
of TCG backend optimization but a more complex problem to tackle.
What could be different about aarch64 emulation that yields better performance on
our POWER9?
- I suppose that aarch64 has more instructions with GVec implementations than PPC64
and s390x, so maybe aarch64 guests can make better use of host vector instructions?
- Looking at the flame graphs of each test (attached), I can see that tb_gen_code
takes proportionally less time for aarch64 emulation than for PPC64 and s390x, so
could it be that decodetree-based decoding is faster? (A recipe for capturing similar
flame graphs is sketched after this list.)
- There is more than TCG at play, so perhaps the differences can be better
explained by VirtIO performance or something else?
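For anyone who wants to reproduce the profiles, a flame graph of a running QEMU
instance can be captured on a Linux host with something along these lines (assuming
perf and the FlameGraph scripts, stackcollapse-perf.pl and flamegraph.pl, are in
$PATH; this is a generic recipe rather than our exact invocation):

# Sketch: host-side flame graph of a running qemu-system-ppc64 process.
perf record -F 99 -g -p "$(pgrep -f qemu-system-ppc64 | head -n1)" -- sleep 60
perf script | stackcollapse-perf.pl | flamegraph.pl > qemu-ppc64-tcg.svg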
Currently, Leandro Lupori is working on improving TLB invalidation[7], Victor Colombo
is working on enabling hardfpu in some scenarios, and I'm reviewing some older
helpers that could use GVec or be easily implemented inline. We're also planning to
add some Power ISA v3.1 instructions to the TCG backend, but it's probably better to
test on hardware whether our changes are doing any good, and we don't have access to
a POWER10 yet.
Are there any other known performance problems for TCG on PPC64 that we should
investigate?