[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
Re: Slowness with multi-thread TCG?
From: |
Alex Bennée |
Subject: |
Re: Slowness with multi-thread TCG? |
Date: |
Tue, 28 Jun 2022 23:17:37 +0100 |
User-agent: |
mu4e 1.7.27; emacs 28.1.50 |
Frederic Barrat <fbarrat@linux.ibm.com> writes:
> On 28/06/2022 17:12, Alex Bennée wrote:
>> Frederic Barrat <fbarrat@linux.ibm.com> writes:
>>
>>> On 28/06/2022 13:25, Matheus K. Ferst wrote:
>>>> On 27/06/2022 15:25, Frederic Barrat wrote:
>>>>> [ Resending as it was meant for the qemu-ppc list ]
>>>>>
>>>>> Hello,
>>>>>
>>>>> I've been looking at why our qemu powernv model is so slow when booting
>>>>> a compressed linux kernel, using multiple vcpus and multi-thread tcg.
>>>>> With only one vcpu, the decompression time of the kernel is what it is,
>>>>> but when using multiple vcpus, the decompression is actually slower. And
>>>>> worse: it degrades very fast with the number of vcpus!
>>>>>
>>>>> Rough measurement of the decompression time on a x86 laptop with
>>>>> multi-thread tcg and using the qemu powernv10 machine:
>>>>> 1 vcpu => 15 seconds
>>>>> 2 vcpus => 45 seconds
>>>>> 4 vcpus => 1 min 30 seconds
>>>>>
>>>>> Looking in details, when the firmware (skiboot) hands over execution to
>>>>> the linux kernel, there's one main thread entering some bootstrap code
>>>>> and running the kernel decompression algorithm. All the other secondary
>>>>> threads are left spinning in skiboot (1 thread per vpcu). So on paper,
>>>>> with multi-thread tcg and assuming the system has enough available
>>>>> physical cpus, I would expect the decompression to hog one physical cpu
>>>>> and the time needed to be constant, no matter the number of vpcus.
>> <snip>
>>>>>
>>>>> Ironically, the behavior seen with single thread tcg is what I would
>>>>> expect: 1 thread decompressing in 15 seconds, all the other threads
>>>>> spinning for that same amount of time, all sharing the same physical
>>>>> cpu, so it all adds up nicely: I see 60 seconds decompression time with
>>>>> 4 vcpus (4x15). Which means multi-thread tcg is slower by quite a bit.
>>>>> And single thread tcg hogs one physical cpu of the laptop vs. 4 physical
>>>>> cpus for the slower multi-thread tcg.
>>>>>
>>>>> Does anybody have an idea of what might happen or have suggestion to
>>>>> keep investigating?
>>>>> Thanks for your help!
>>>>>
>>>>> Fred
>>>>>
>>>>>
>>>> Hi Frederic,
>>>> I did some boot time tests recently and didn't notice this behavior.
>>>> Could you share your QEMU command line with us? Did you build QEMU
>>>> with any debug option or sanitizer enabled?
>>>
>>>
>>> You should be able to see it with:
>>>
>>> qemu-system-ppc64 -machine powernv10 -smp 4 -m 4G -nographic -bios
>>> <path to skiboot.lid> -kernel <path to compresses kernel> -initrd
>>> <path to initd> -serial mon:stdio
>>>
>>>
>>> -smp is what matters.
>>>
>>> When simplifying the command line above, I noticed something
>>> interesting: the problem doesn't show using the skiboot.lid shipped
>>> with qemu! I'm using something closer to the current upstream head and
>>> the idle code (the for loop in my initial mail) had been reworked in
>>> between. So, clearly, the way the guest code is written matters. But
>>> that doesn't explain it.
>>>
>>> I'm using a kernel in debug mode, so it's pretty big and that's why I
>>> was using a compressed image. The compressed image is about 8 MB.
You can use split debug to avoid keeping the symbol in the final
vmimage. Or are there other debugging options turned on?
>> If the debug mode on PPC enables live patching of kernel functions
>> for
>> instrumentation that can certainly slow things down. You would see that
>> in tcg_optimize appearing in the perf log and "info jit" showing
>> constantly growing translation buffers.
>
>
> The part where I'm seeing the huge slowdown is not quite in kernel
> yet. Only one thread is in bootstrap code decompressing the real
> kernel. All the other threads are still spinning in firmware.
>
> Anyway, I've run perf. I couldn't figure out how to trigger the
> recording only around the decompression part with the slowdown. So I
> booted with 4 cpus to make it really slow, expecting the initial steps
> of the boot, which happen quickly enough, would be dwarfed by the time
> spent while one thread is decompressing the kernel (the part where I
> see the huge slowdown). I'd say the recording was taken with ~80% of
> the time in the interesting part. Here is what I got:
>
>
> 12,62% qemu-system-ppc [kernel.kallsyms] [k]
> syscall_exit_to_user_mode
> 6,93% qemu-system-ppc [kernel.kallsyms] [k]
> syscall_return_via_sysret
> 5,64% qemu-system-ppc [kernel.kallsyms] [k]
> __entry_text_start
> 3,93% qemu-system-ppc libc.so.6 [.]
> pthread_mutex_lock@@GLIBC_2.2.5
> 3,21% qemu-system-ppc libc.so.6 [.]
> __GI___pthread_mutex_unlock_usercnt
> 3,12% qemu-system-ppc libc.so.6 [.]
> __GI___lll_lock_wait
> 2,60% qemu-system-ppc qemu-system-ppc64 [.]
> cpu_handle_interrupt
> 2,55% qemu-system-ppc [kernel.kallsyms] [k] futex_wake
> 2,43% qemu-system-ppc [kernel.kallsyms] [k]
> native_queued_spin_lock_slowpath
> 1,97% qemu-system-ppc [kernel.kallsyms] [k] _raw_spin_lock
> 1,89% qemu-system-ppc qemu-system-ppc64 [.]
> qemu_mutex_lock_impl
> 1,83% qemu-system-ppc qemu-system-ppc64 [.] tb_lookup
> 1,71% qemu-system-ppc [kernel.kallsyms] [k]
> __get_user_nocheck_4
> 1,55% qemu-system-ppc qemu-system-ppc64 [.]
> hreg_compute_hflags_value
> 1,46% qemu-system-ppc [kernel.kallsyms] [k] futex_q_lock
> 1,39% qemu-system-ppc [kernel.kallsyms] [k] futex_q_unlock
> 1,23% qemu-system-ppc [kernel.kallsyms] [k]
> audit_reset_context.part.0.constprop.0
> 1,14% qemu-system-ppc qemu-system-ppc64 [.]
> object_class_dynamic_cast_assert
> 1,09% qemu-system-ppc qemu-system-ppc64 [.]
> qemu_mutex_unlock_impl
> 1,02% qemu-system-ppc qemu-system-ppc64 [.]
> object_dynamic_cast_assert
> 1,00% qemu-system-ppc [kernel.kallsyms] [k] __x64_sys_futex
>
>
> Any known pattern here? There seems to be some contention with the
> mutex/futex call, but it's not obvious to me what it is.
If you run the sync-profiler (via the HMP "sync-profile on") you can
then get a breakdown of which mutex's are being held and for how long
("info sync-profile").
> I was also pointed to enabling gprof in qemu. I'll look into it.
gprof will likely change the behaviour due to overhead.
>
> Thanks!
>
> Fred
>
>
>>>
>>> The initrd shouldn't matter, the issue is seen during kernel
>>> decompression, before the init ram is used.
>>>
>>> I can share my binaries if you'd like. Especially a recent version of
>>> skiboot showing the problem.
>>>
>>> Fred
>>
--
Alex Bennée
- Slowness with multi-thread TCG?, Frederic Barrat, 2022/06/27
- Re: Slowness with multi-thread TCG?, Alex Bennée, 2022/06/27
- Re: Slowness with multi-thread TCG?, Matheus K. Ferst, 2022/06/28
- Re: Slowness with multi-thread TCG?, Frederic Barrat, 2022/06/28
- Re: Slowness with multi-thread TCG?, Alex Bennée, 2022/06/28
- Re: Slowness with multi-thread TCG?, Frederic Barrat, 2022/06/28
- Re: Slowness with multi-thread TCG?,
Alex Bennée <=
- Re: Slowness with multi-thread TCG?, Frederic Barrat, 2022/06/29
- Re: Slowness with multi-thread TCG?, Alex Bennée, 2022/06/29
- Re: Slowness with multi-thread TCG?, Matheus K. Ferst, 2022/06/29
- Re: Slowness with multi-thread TCG?, Alex Bennée, 2022/06/29
- Re: Slowness with multi-thread TCG?, Cédric Le Goater, 2022/06/29