[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
Re: [External] : Re: [RFC PATCH 0/1] ACPI: Fix missing CPU hotplug/hotun
From: |
Igor Mammedov |
Subject: |
Re: [External] : Re: [RFC PATCH 0/1] ACPI: Fix missing CPU hotplug/hotunplug events with > 255 vCPUs |
Date: |
Tue, 10 Dec 2024 13:51:33 +0100 |
On Tue, 10 Dec 2024 01:22:24 +0000
Eric Mackay <eric.mackay@oracle.com> wrote:
> On Mon, 9 Dec 2024 15:36:06 +0100
> Igor Mammedov imammedo@redhat.com<mailto:imammedo@redhat.com> wrote:
>
> > On Tue, 3 Dec 2024 16:56:35 -0800
> > Eric Mackay eric.mackay@oracle.com<mailto:eric.mackay@oracle.com> wrote:
> >
> >> ACPI hotplug with 255 or less vCPUs can use the legacy CPU hotplug
> >> interface, which does
> >> not support hotunplug. If it's available, hotunplug will use the modern
> >> CPU hotplug interface.
> >> This creates a situation where hotplug and hotunplug are using different
> >> interfaces, but
> >> the end result is still functional. CPUs can be hotplugged and
> >> hotunplugged at will.
> >
> > only one kind of interface should be in use. And for quite a while firmware
> > shipped
> > with QEMU is using QEMU provided ACPI tables, which in turn switch
> > interface to
> > modern hotplug when guest OS loads ACPI tables, see:
> > if (opts.has_legacy_cphp) {
> > method = aml_method("_INI", 0, AML_SERIALIZED);
> > or firmware does the switch (OVMF usecase) before guest OS.
> >
> > if any hotplug action happened before guest switched hotplug favour,
> > it will trigger SCI, which guest OS should process it (incl. switching
> > to modern interface and sending necessary device check/remove events)
> > once ACPI tables are processed.
> >
> >> With > 255 vCPUs, both hotplug and hotunplug will use the modern ACPI
> >> interface.
> >> There is no priority or rules of mutual exclusion defined in this
> >> interface,
> >> and the behavior in the guest is implementation-defined.
> >>
> >> Unfortunately, it is possible to have both a hotplug and hotunplug event
> >> pending
> >> for the same vCPU. When the guest processes its pending events, it may see
> >> the
> >> hotplug but ignore the hotunplug.
> >>
> >> The most recent event is likely to reflect the desired state of the
> >> system, so
> >> ignoring the hotunplug event in this scenario is unacceptable.
> >
> > QEMU has delivered both events as it was intended (and in order plug then
> > unplug).
> > I'd say that ignoring events is a guest bug.
> >
> > With unplug, potentially if fixing guest is not an option
> > one can try retry device_del to trigger unplug event again.
> >
> > PS:
> > I vaguely recall linux kernel wiping GPE state on boot during
> > initialization, but that is guest problem and should be fixed
> > on guest side.
> > (unexpectedly enough, Windows handles GPE state as expected)
>
> Consider the scenario where a hotplug is requested, the hotplug bit is then
> set.
> ACPI event is sent to the guest. Before guest can act on it (either timing or
> perhaps the guest is paused), unplug is requested for that same CPU, and the
> unplug bit is now set. ACPI event is sent to guest again. The guest now wakes
> up or becomes able to perform work, sees there is a pending ACPI event. The
> guest now requests to read ACPI_CPU_FLAGS_OFFSET_RW, and sees both plug and
> unplug are set. How does it know which to do first?
looking at the code some more, it looks like 2 parties are at fault
1. linux kernel clearing GPE status bits on boot without processing events,
(if it had processed pending CPU hotplug event, that would have cleared all
insert flags which are NOPs at this point)
2. a bug in CPU_SCAN_METHOD AML code, which doesn't take remove event branch
if there is insert event on the same CPU. So #1 effectively hides unplug
until user triggers the next (un)plug event.
Unplug request is just a request and if it didn't happen for some reason
the user shall repeat it.
However it doesn't excuse QEMU behavior which effectively doesn't process
pending remove event in the same scan loop.
> >> Repro: This can be seen in practice when a q35 VM is started in qemu, with
> >> maxcpus=260 or something above 255, and a few statically allocated CPUs.
> >> Example smp line: "-smp 4,sockets=2,dies=1,cores=65,threads=2,maxcpus=260"
> >> Then use libvirt to add vCPUs up to the maximum before continuing the VM.
> >> After
> >> the guest starts up, delete one of the vCPUs. 'lscpu' in the guest should
> >> still
> >> show 260.
> >
> > above should work especially with unplug event happening after guest OS has
> > booted.
> > Lets simplify reproducer to QEMU only (i.e. getting rid of libvirt part)
> > and see
> > where it leads us.
>
> The guest has not processed the hotplug events that took place before the
> guest
> OS booted. There is nothing to tell the guest OS that it needs to check ACPI
> event flags, or to clear the hotplug bits since the CPUs are already present.
> When the unplug comes, this is the first time the guest OS knows to check for
> hotplug/unplug events. When it reads ACPI_CPU_FLAGS_OFFSET_RW it will see that
> hotplug is still set for all the CPUs that came online before it booted, and
> it
> will recognize they are already present. For the CPU that a unplug was
> requested, the hotplug bit was never cleared prior, so the
> ACPI_CPU_FLAGS_OFFSET_RW read will show both hotplug and unplug flags set.
> It's
> now implementation-defined in the guest what it wants to do, or whether it
> assigns priority to one event type or the other.
>
> Agreed, the libvirt part shouldn't be important. I can come up with
> instructions for QEMU-only repro.
one way to reproduce it would be use qmeu with uefi,
boot till grub prompt and then do device_del and let guest to finish boot.
The 2nd device_del works as workaround.
> >> The proposed solution is to enforce mutual exclusion between the hotplug
> >> and hotunplug
> >> bits in the modern ACPI interface. Setting a new pending event will clear
> >> a previously
> >> pending event of the opposite type, thus preserving only the most recently
> >> requested
> >> state.
> >
> > Once VM is in running state, clearing flags from QEMU side is not safe.
> > Effectively it would introduce a race with guest code.
>
> Yes, I see what you mean. Agreed this is not safe.
>
> > Also I'm not sure that flags are the issue here,
> > Capturing some CPUHP traces on QEMU side and logs from guest side
> > might help to clarify what's going on.
>
> Will work on providing those traces and logs
>
> >> Eric Mackay (1):
> >> ACPI: Enforce mutual exclusion betwen CPU insertion and removal events
> >>
> >> hw/acpi/cpu.c | 36 ++++++++++++++++++++++++++++++++----
> >> include/hw/acpi/cpu.h | 4 ++++
> >> 2 files changed, 36 insertions(+), 4 deletions(-)
> >>