Re: Slowness with multi-thread TCG?


From: Alex Bennée
Subject: Re: Slowness with multi-thread TCG?
Date: Tue, 28 Jun 2022 16:12:40 +0100
User-agent: mu4e 1.7.27; emacs 28.1.50

Frederic Barrat <fbarrat@linux.ibm.com> writes:

> On 28/06/2022 13:25, Matheus K. Ferst wrote:
>> On 27/06/2022 15:25, Frederic Barrat wrote:
>>> [ Resending as it was meant for the qemu-ppc list ]
>>>
>>> Hello,
>>>
>>> I've been looking at why our qemu powernv model is so slow when booting
>>> a compressed linux kernel, using multiple vcpus and multi-thread tcg.
>>> With only one vcpu, the decompression time of the kernel is what it is,
>>> but when using multiple vcpus, the decompression is actually slower. And
>>> worse: it degrades very fast with the number of vcpus!
>>>
>>> Rough measurement of the decompression time on an x86 laptop with
>>> multi-thread tcg and using the qemu powernv10 machine:
>>> 1 vcpu => 15 seconds
>>> 2 vcpus => 45 seconds
>>> 4 vcpus => 1 min 30 seconds
>>>
>>> Looking in detail, when the firmware (skiboot) hands over execution to
>>> the linux kernel, there's one main thread entering some bootstrap code
>>> and running the kernel decompression algorithm. All the other secondary
>>> threads are left spinning in skiboot (1 thread per vcpu). So on paper,
>>> with multi-thread tcg and assuming the system has enough available
>>> physical cpus, I would expect the decompression to hog one physical cpu
>>> and the time needed to be constant, no matter the number of vcpus.
<snip>
>>>
>>> Ironically, the behavior seen with single thread tcg is what I would
>>> expect: 1 thread decompressing in 15 seconds, all the other threads
>>> spinning for that same amount of time, all sharing the same physical
>>> cpu, so it all adds up nicely: I see 60 seconds decompression time with
>>> 4 vcpus (4x15), which means multi-thread tcg is slower by quite a bit.
>>> And single thread tcg hogs one physical cpu of the laptop vs. 4 physical
>>> cpus for the slower multi-thread tcg.
>>>
>>> Does anybody have an idea of what might be happening, or suggestions on
>>> how to keep investigating?
>>> Thanks for your help!
>>>
>>>    Fred
>>>
>>>
>> Hi Frederic,
>> I did some boot time tests recently and didn't notice this behavior.
>> Could you share your QEMU command line with us? Did you build QEMU
>> with any debug option or sanitizer enabled?
>
>
> You should be able to see it with:
>
> qemu-system-ppc64 -machine powernv10 -smp 4 -m 4G -nographic -bios
> <path to skiboot.lid> -kernel <path to compressed kernel> -initrd
> <path to initrd> -serial mon:stdio
>
>
> -smp is what matters.
>
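
As an aside, if you want to compare the two TCG modes with an otherwise
identical command line, the threading mode can be forced explicitly. An
untested sketch, reusing the invocation above:

  # multi-thread TCG (the default where supported)
  qemu-system-ppc64 -machine powernv10 -accel tcg,thread=multi -smp 4 \
      -m 4G -nographic -bios <path to skiboot.lid> \
      -kernel <path to compressed kernel> -initrd <path to initrd> \
      -serial mon:stdio

  # single-thread TCG, to reproduce the 4x15s baseline
  qemu-system-ppc64 -machine powernv10 -accel tcg,thread=single -smp 4 \
      -m 4G -nographic -bios <path to skiboot.lid> \
      -kernel <path to compressed kernel> -initrd <path to initrd> \
      -serial mon:stdio
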
> When simplifying the command line above, I noticed something
> interesting: the problem doesn't show up with the skiboot.lid shipped
> with qemu! I'm using something closer to the current upstream head and
> the idle code (the for loop in my initial mail) had been reworked in
> between. So, clearly, the way the guest code is written matters. But
> that doesn't explain it.
>
> I'm using a kernel built in debug mode, so it's pretty big, which is why
> I was using a compressed image. The compressed image is about 8 MB.

If the debug mode on PPC enables live patching of kernel functions for
instrumentation, that can certainly slow things down. You would see it as
tcg_optimize appearing in the perf output and "info jit" showing constantly
growing translation buffers.
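
Concretely, something like this (untested sketch, adjust paths as needed):

  # sample where the host time goes while the guest decompresses
  perf record -g -- qemu-system-ppc64 -machine powernv10 -smp 4 ...
  perf report

  # then from the HMP monitor (with -serial mon:stdio, Ctrl-a c
  # toggles between the serial console and the monitor)
  (qemu) info jit

If the translation buffer usage reported by "info jit" keeps growing while
the decompressor runs, code is being invalidated and retranslated over and
over.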

>
> The initrd shouldn't matter; the issue is seen during kernel
> decompression, before the initrd is used.
>
> I can share my binaries if you'd like. Especially a recent version of
> skiboot showing the problem.
>
>   Fred


-- 
Alex Bennée


