On Mon, 9 Dec 2024 15:36:06 +0100
Igor Mammedov
imammedo@redhat.com wrote:
> On Tue, 3 Dec 2024 16:56:35 -0800
> Eric Mackay
eric.mackay@oracle.com wrote:
>
>> ACPI hotplug with 255 or less vCPUs can use the legacy CPU hotplug interface, which does
>> not support hotunplug. If it's available, hotunplug will use the modern CPU hotplug interface.
>> This creates a situation where hotplug and hotunplug are using different interfaces, but
>> the end result is still functional. CPUs can be hotplugged and hotunplugged at will.
>
> only one kind of interface should be in use. And for quite a while firmware shipped
> with QEMU is using QEMU provided ACPI tables, which in turn switch interface to
> modern hotplug when guest OS loads ACPI tables, see:
> if (opts.has_legacy_cphp) {
> method = aml_method("_INI", 0, AML_SERIALIZED);
> or firmware does the switch (OVMF usecase) before guest OS.
>
> if any hotplug action happened before guest switched hotplug favour,
> it will trigger SCI, which guest OS should process it (incl. switching
> to modern interface and sending necessary device check/remove events)
> once ACPI tables are processed.
>
>> With > 255 vCPUs, both hotplug and hotunplug will use the modern ACPI interface.
>> There is no priority or rules of mutual exclusion defined in this interface,
>> and the behavior in the guest is implementation-defined.
>>
>> Unfortunately, it is possible to have both a hotplug and hotunplug event pending
>> for the same vCPU. When the guest processes its pending events, it may see the
>> hotplug but ignore the hotunplug.
>>
>> The most recent event is likely to reflect the desired state of the system, so
>> ignoring the hotunplug event in this scenario is unacceptable.
>
> QEMU has delivered both events as it was intended (and in order plug then unplug).
> I'd say that ignoring events is a guest bug.
>
> With unplug, potentially if fixing guest is not an option
> one can try retry device_del to trigger unplug event again.
>
> PS:
> I vaguely recall linux kernel wiping GPE state on boot during
> initialization, but that is guest problem and should be fixed
> on guest side.
> (unexpectedly enough, Windows handles GPE state as expected)
Consider the scenario where a hotplug is requested, the hotplug bit is then set.
ACPI event is sent to the guest. Before guest can act on it (either timing or
perhaps the guest is paused), unplug is requested for that same CPU, and the
unplug bit is now set. ACPI event is sent to guest again. The guest now wakes
up or becomes able to perform work, sees there is a pending ACPI event. The
guest now requests to read ACPI_CPU_FLAGS_OFFSET_RW, and sees both plug and
unplug are set. How does it know which to do first?
>> Repro: This can be seen in practice when a q35 VM is started in qemu, with
>> maxcpus=260 or something above 255, and a few statically allocated CPUs.
>> Example smp line: "-smp 4,sockets=2,dies=1,cores=65,threads=2,maxcpus=260"
>> Then use libvirt to add vCPUs up to the maximum before continuing the VM. After
>> the guest starts up, delete one of the vCPUs. 'lscpu' in the guest should still
>> show 260.
>
> above should work especially with unplug event happening after guest OS has booted.
> Lets simplify reproducer to QEMU only (i.e. getting rid of libvirt part) and see
> where it leads us.
The guest has not processed the hotplug events that took place before the guest
OS booted. There is nothing to tell the guest OS that it needs to check ACPI
event flags, or to clear the hotplug bits since the CPUs are already present.
When the unplug comes, this is the first time the guest OS knows to check for
hotplug/unplug events. When it reads ACPI_CPU_FLAGS_OFFSET_RW it will see that
hotplug is still set for all the CPUs that came online before it booted, and it
will recognize they are already present. For the CPU that a unplug was
requested, the hotplug bit was never cleared prior, so the
ACPI_CPU_FLAGS_OFFSET_RW read will show both hotplug and unplug flags set. It's
now implementation-defined in the guest what it wants to do, or whether it
assigns priority to one event type or the other.
Agreed, the libvirt part shouldn't be important. I can come up with instructions for QEMU-only repro.
>> The proposed solution is to enforce mutual exclusion between the hotplug and hotunplug
>> bits in the modern ACPI interface. Setting a new pending event will clear a previously
>> pending event of the opposite type, thus preserving only the most recently requested
>> state.
>
> Once VM is in running state, clearing flags from QEMU side is not safe.
> Effectively it would introduce a race with guest code.
Yes, I see what you mean. Agreed this is not safe.
> Also I'm not sure that flags are the issue here,
> Capturing some CPUHP traces on QEMU side and logs from guest side
> might help to clarify what's going on.
Will work on providing those traces and logs
>> Eric Mackay (1):
>> ACPI: Enforce mutual exclusion betwen CPU insertion and removal events
>>
>> hw/acpi/cpu.c | 36 ++++++++++++++++++++++++++++++++----
>> include/hw/acpi/cpu.h | 4 ++++
>> 2 files changed, 36 insertions(+), 4 deletions(-)
>>