Hi,
Since we started working with QEMU on PPC, we've noticed that
emulating PPC64 VMs is faster on x86_64 hosts than on PPC64 itself, even when
compared with x86 machines that are slower in other workloads (such as building
QEMU or the Linux kernel).
We thought it might be related to the TCG backend, which is presumably better
optimized for x86. As a first approach to better understand the problem, I ran some
boot tests with Fedora Cloud Base 35-1.2[1] on both platforms. Using the command line
./qemu-system-ppc64 -name Fedora-Cloud-Base-35-1.2.ppc64le -smp 2 -m 2G -vga none
-nographic -serial pipe:Fedora-Cloud-Base-35-1.2.ppc64le -monitor
unix:Fedora-Cloud-Base-35-1.2.ppc64le.mon,server,nowait -device
virtio-net,netdev=vmnic -netdev user,id=vmnic -cdrom fedora-cloud-init.iso -cpu
POWER10 -accel tcg -device virtio-scsi-pci -drive
file=Fedora-Cloud-Base-35-1.2.ppc64le.temp.qcow2,if=none,format=qcow2,id=hd0 -device
scsi-hd,drive=hd0 -boot c
on a POWER9 DD2.2 and an Intel Xeon E5-2687W, a simple bash script reads the ".out"
pipe until the "fedora login:" string is found and then issues a "system_powerdown"
through the QEMU monitor. The ".temp.qcow2" file is backed by the original Fedora
image and deleted at the end of the test, so every boot starts fresh. Running the
test 10 times gave us 235.26 ± 6.27 s on PPC64 and 192.92 ± 4.53 s on x86_64, i.e.,
TCG is ~20% slower on the POWER9.
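For reference, each iteration of the test looks roughly like the sketch below
(simplified; run_qemu stands for the full qemu-system-ppc64 command line above, and
details such as the use of mkfifo and socat may differ slightly from the actual
script):

# Rough sketch of one test iteration (simplified).
NAME=Fedora-Cloud-Base-35-1.2.ppc64le

# throw-away overlay backed by the pristine cloud image
qemu-img create -f qcow2 -F qcow2 -b ${NAME}.qcow2 ${NAME}.temp.qcow2

# FIFOs used by "-serial pipe:${NAME}"
mkfifo ${NAME}.in ${NAME}.out

start=$(date +%s)
run_qemu &                             # the command line shown above

# block until the guest reaches the login prompt
grep -q "fedora login:" ${NAME}.out
echo "boot took $(( $(date +%s) - start )) s"

# ask the guest to shut down through the monitor socket
echo system_powerdown | socat - UNIX-CONNECT:${NAME}.mon
wait

rm -f ${NAME}.temp.qcow2 ${NAME}.in ${NAME}.out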
As a second step, I wondered if this gap would be the same when emulating other
architectures on PPC64, so I used the same version of Fedora Cloud for aarch64[2] and
s390x[3], using the following command lines:
./qemu-system-aarch64 -name Fedora-Cloud-Base-35-1.2.aarch64 -smp 2 -m 2G -vga none
-nographic -serial pipe:Fedora-Cloud-Base-35-1.2.aarch64 -monitor
unix:Fedora-Cloud-Base-35-1.2.aarch64.mon,server,nowait -device
virtio-net,netdev=vmnic -netdev user,id=vmnic -cdrom fedora-cloud-init.iso -machine
virt -cpu max -accel tcg -device virtio-scsi-pci -drive
file=Fedora-Cloud-Base-35-1.2.aarch64.temp.qcow2,if=none,format=qcow2,id=hd0 -device
scsi-hd,drive=hd0 -boot c -bios ./pc-bios/edk2-aarch64-code.fd
and
./qemu-system-s390x -name Fedora-Cloud-Base-35-1.2.s390x -smp 2 -m 2G -vga none
-nographic -serial pipe:Fedora-Cloud-Base-35-1.2.s390x -monitor
unix:Fedora-Cloud-Base-35-1.2.s390x.mon,server,nowait -device virtio-net,netdev=vmnic
-netdev user,id=vmnic -cdrom fedora-cloud-init.iso -machine s390-ccw-virtio -cpu max
-accel tcg -hda Fedora-Cloud-Base-35-1.2.s390x.temp.qcow2 -boot c
With 50 runs, we got (boot times in seconds, mean ± standard deviation):
+---------+---------------------------------+
| | Host |
| Guest +----------------+----------------+
| | PPC64 | x86_64 |
+---------+----------------+----------------+
| PPC64 | 194.72 ± 7.28 | 162.75 ± 8.75 |
| aarch64 | 501.89 ± 9.98 | 586.08 ± 10.55 |
| s390x | 294.10 ± 21.62 | 223.71 ± 85.30 |
+---------+----------------+----------------+
The difference with an s390x guest is around 30%, with greater variability on
x86_64 whose source I couldn't pin down. However, the POWER9 emulates aarch64 faster
than this Xeon.
The particular workload of the guest could distort this result, since on first boot
Cloud-Init will create user accounts, generate SSH keys, etc. If the aarch64 guest
uses many vector instructions for this initial setup, that might explain why an
older Xeon would be slower here.
As a final test, I changed the images to have a normal user account already created
and unlocked, disabled Cloud-Init, downloaded the bc-1.07 sources[4][5], installed
its build dependencies[6], and changed the test script to log in, extract, configure,
build, and shut down the guest (the in-guest steps are sketched after the table
below). I also added an aarch64-compatible machine (Apple M1 w/ 10 cores) to our
test setup. Running 100 iterations gave us the following results (times in seconds,
mean ± standard deviation):
+---------+----------------------------------------------------+
| | Host |
| Guest +-----------------+-----------------+----------------+
| | PPC64 | x86_64 | aarch64 |
+---------+-----------------+-----------------+----------------+
| PPC64 | 429.82 ± 11.57 | 352.34 ± 8.51 | 180.78 ± 42.02 |
| aarch64 | 1029.78 ± 46.01 | 1207.98 ± 80.49 | 487.50 ± 7.54 |
| s390x | 589.97 ± 86.67 | 411.83 ± 41.88 | 221.86 ± 79.85 |
+---------+-----------------+-----------------+----------------+
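For clarity, the in-guest part of that last test boils down to something like this
(the real script drives these commands over the serial console, and details such as
the tarball name and the -j value are illustrative):

# Sketch of the guest-side workload.
tar xf bc-1.07.tar.gz
cd bc-1.07
./configure
make -j2                # matches -smp 2; the actual job count may differ
cd ..
sudo shutdown -h now    # or system_powerdown via the monitor, as before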
The pattern of PPC64 vs. x86_64 remains: PPC64 and s390x guests are roughly 20% and
40% slower on the POWER9, respectively, but the aarch64 VM is still slower on this
Xeon. If the PPC backend can outperform the x86 one when emulating some
architectures, I guess that improving PPC64-on-PPC64 emulation isn't "just" a matter
of TCG backend optimization but a more complex problem to tackle.
What could be different about aarch64 emulation that yields better performance on
our POWER9?
- I suppose that aarch64 has more instructions with GVec implementations than PPC64
and s390x, so maybe aarch64 guests can make better use of host vector instructions?
- Looking at the flame graphs of each test (attached), I can see that tb_gen_code
takes proportionally less time for aarch64 emulation than for PPC64 and s390x, so
could it be that decodetree-based decoding is faster? (A recipe for capturing similar
flame graphs is sketched after this list.)
- There is more than TCG at play, so perhaps the differences can be better
explained by VirtIO performance or something else?
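For anyone who wants to reproduce the profiles, a flame graph of a running QEMU
instance can be captured on a Linux host with something along these lines (assuming
perf and the FlameGraph scripts, stackcollapse-perf.pl and flamegraph.pl, are in
$PATH; this is a generic recipe rather than our exact invocation):

# Sketch: host-side flame graph of a running qemu-system-ppc64 process.
perf record -F 99 -g -p "$(pgrep -f qemu-system-ppc64 | head -n1)" -- sleep 60
perf script | stackcollapse-perf.pl | flamegraph.pl > qemu-ppc64-tcg.svg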
Currently, Leandro Lupori is working on improving TLB invalidation[7], Victor Colombo
is working on enabling hardfpu in some scenarios, and I'm reviewing some older
helpers that could use GVec or be easily implemented inline. We're also planning to
add some Power ISA v3.1 instructions to the TCG backend, but it's probably better to
test on hardware whether our changes are doing any good, and we don't have access to
a POWER10 yet.
Are there any other known performance problems for TCG on PPC64 that we should
investigate?