qemu-ppc

Re: TCG performance on PPC64


From: Richard Henderson
Subject: Re: TCG performance on PPC64
Date: Wed, 18 May 2022 07:44:54 -0700
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:91.0) Gecko/20100101 Thunderbird/91.8.0

On 5/18/22 06:16, Matheus K. Ferst wrote:
> As a final test, I changed the images to have a normal user account already created and unlocked, disabled Cloud-Init, downloaded bc-1.07 sources[4][5], installed its build dependencies[6], and changed the test script to login, extract, configure, build, and shutdown the guest. I also added an aarch64 compatible machine (Apple M1 w/ 10 cores) to our test setup. Running 100 iterations gave us the following results:
>
> +---------+----------------------------------------------------+
> |         |                        Host                        |
> |  Guest  +-----------------+-----------------+----------------+
> |         |      PPC64      |     x86_64      |     aarch64    |
> +---------+-----------------+-----------------+----------------+
> | PPC64   |  429.82 ± 11.57 |   352.34 ± 8.51 | 180.78 ± 42.02 |
> | aarch64 | 1029.78 ± 46.01 | 1207.98 ± 80.49 |  487.50 ± 7.54 |
> | s390x   |  589.97 ± 86.67 |  411.83 ± 41.88 | 221.86 ± 79.85 |
> +---------+-----------------+-----------------+----------------+

These are some weird results. Particularly the aarch64 host ones -- I'm really surprised that it's that much faster than the x86_64 at anything. Oh, the E5-2687W was discontinued 7 years ago. So I'll just put that down to age.

> What would be different in aarch64 emulation that yields a better performance on our POWER9?

That is a very good question.
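
To make the question concrete, here is a minimal Python sketch that divides the PPC64-host mean by the x86_64-host mean for each guest, using only the means from the table above. The dictionary layout and the `host_ratio` helper are illustrative names, not part of any test script in this thread. The message does not state the units, so only ratios are computed, and the ± spreads are ignored.

```python
# Mean times copied from the table above, keyed by guest and then by host.
# Units are not stated in the message, so only ratios are meaningful here.
MEAN_TIME = {
    "PPC64":   {"PPC64": 429.82,  "x86_64": 352.34,  "aarch64": 180.78},
    "aarch64": {"PPC64": 1029.78, "x86_64": 1207.98, "aarch64": 487.50},
    "s390x":   {"PPC64": 589.97,  "x86_64": 411.83,  "aarch64": 221.86},
}


def host_ratio(guest, host_a="PPC64", host_b="x86_64"):
    """Return time(host_a) / time(host_b) for one guest.

    A value below 1.0 means host_a finished faster than host_b.
    """
    row = MEAN_TIME[guest]
    return row[host_a] / row[host_b]


if __name__ == "__main__":
    for guest in MEAN_TIME:
        print(f"{guest:8s} PPC64/x86_64 = {host_ratio(guest):.2f}")
```

The aarch64 guest is the only row where the ratio drops below 1.0, which is the anomaly being asked about. Given the size of some of the spreads, the ratios should be read as rough indications only.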

>  - I suppose that aarch64 has more instructions with GVec implementations than PPC64 and s390x, so maybe aarch64 guests can better use host-vector instructions?

No, there's very little gvec in a kernel boot cycle.  Not none, but very little.

>  - Looking at the flame graphs of each test (attached), I can see that tb_gen_code takes proportionally less time of aarch64 emulation than PPC64 and s390x, so it might be that decodetree is faster?

No. (1) aarch64 base instructions aren't using decodetree, (2) the existing ppc and s390 decode is pretty well architected; decodetree is not particularly optimized, it's simply meant to be more readable.

Looking at the aarch64-on-ppc64 graph, I see that PAC encryption is taking up a huge proportion of your runtime. Probably gcc has done a better job with those routines for ppc64 host. You may want to run the aarch64 guest tests again with -cpu max,pauth=off.
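
As a rough sketch of how a test script might toggle this, the Python helper below builds a `qemu-system-aarch64` argument list with the `pauth` CPU property set explicitly. The helper name `qemu_aarch64_argv` and the extra arguments in the example are hypothetical; the actual test script's machine, memory, and disk options are not shown in this thread, so they are passed through unchanged rather than guessed.

```python
import shlex


def qemu_aarch64_argv(extra_args, pauth=False):
    """Build an argv for qemu-system-aarch64 with pointer authentication set.

    `extra_args` stands for whatever the existing test script already
    passes (machine type, memory, disks, ...). Only the -cpu option is
    constructed here.
    """
    cpu = "max,pauth=on" if pauth else "max,pauth=off"
    return ["qemu-system-aarch64", "-cpu", cpu, *extra_args]


if __name__ == "__main__":
    # Hypothetical extra arguments, for illustration only.
    argv = qemu_aarch64_argv(["-M", "virt", "-m", "4G", "-nographic"])
    print(shlex.join(argv))
```

Running the same workload once with `pauth=True` and once with `pauth=False` would show how much of the aarch64 guest time is due to PAC emulation on each host.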

Otherwise, the flame graph columns are too narrow to actually read, for me.


r~


