[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
Re: [PATCH V5 02/12] cpus: stop vm in suspended state
|
From: |
Steven Sistare |
|
Subject: |
Re: [PATCH V5 02/12] cpus: stop vm in suspended state |
|
Date: |
Tue, 28 Nov 2023 08:26:08 -0500 |
|
User-agent: |
Mozilla Thunderbird |
On 11/22/2023 11:21 AM, Peter Xu wrote:
> On Wed, Nov 22, 2023 at 09:38:06AM +0000, Daniel P. Berrangé wrote:
>> On Mon, Nov 20, 2023 at 04:44:50PM -0500, Peter Xu wrote:
>>> On Mon, Nov 20, 2023 at 03:55:54PM -0500, Steven Sistare wrote:
>>>> If we drop force, then all calls to vm_stop will completely stop the
>>>> suspended state, eg an hmp "stop" command. This causes two problems.
>>>> First, that is a change in user-visible behavior for something that
>>>> currently works,
>>>
>>> IMHO it depends on what should be the correct behavior. IOW, when VM is in
>>> SUSPENDED state and then the user sends "stop" QMP command, what should we
>>> expect?
>>
>> I would say that from a mgmtm app POV "stop" is initiating a state
>> transition, from RUN_STATE_RUNNING to RUN_STATE_PAUSED and "cont"
>> is doing the reverse from PAUSED to RUNNING.
>>
>> It is a little more complicated than that as there are some other
>> states like INMIGRATE that are conceptually equiv to RUNNING,
>> and states where the transition simply doesn't make sense.
>
> In the qemu impl, INMIGRATE is, imho, more equiv of PAUSED - do_vm_stop()
> mostly ignores every state except RUNNING (putting bdrv operations aside).
> IOW, anything besides "running" is treated as "not running".
>
> But then Paolo fixed that in 1e9981465f ("qmp: handle stop/cont in
> INMIGRATE state"), wiring that to autostart.
>
> Now we seem to find that "suspended" should actually fall within (where
> "vm" is running, but "vcpu" is not), and it seems we should treat "vm" and
> "vcpu" differently.
>
>>
>> So my question is if we're in "SUSPENDED" and someone issues "stop",
>> what state do we go into, and perhaps more importantly what state
>> do we go to in a subsequent "cont".
>
> I think we must stop the "vm", not only the "vcpu". I discussed this bit
> in the other thread more or less: it's because qemu_system_wakeup_request()
> can be called in many places, e.g. acpi_pm_tmr_timer().
>
> It means even after the VM is "stop"ped by the admin QMP (where qmp_stop()
> ignores SUSPENDED, keep the "vm" running), it can silently got waken up
> without admin even noticing it. I'm not sure what Libvirt will behave if
> it suddenly receives a QAPI_EVENT_WAKEUP randomly after a "stop".
>
>>
>> If you say SUSPENDED ---(stop)---> PAUSED ---(cont)---> SUSPENDED
>> then we create a problem, because the decision for the transition
>> out of PAUSED needs memory of the previous state.
>
> Right, that's why I think we at least need one more boolean to remember the
> suspended state, then when we switch from any STOP states into any RUN
> states, we know where to go. Here STOP states I meant anything except
> RUNNING and SUSPENDED, while RUN -> RUNNING or SUSPENDED.
>
>>
>>> My understanding is we should expect to fully stop the VM, including the
>>> ticks, for example. Keeping the ticks running even after QMP "stop"
>>> doesn't sound right, isn't it?
>>
>> The "stop" QMP command is documented as
>>
>> "Stop all guest VCPU execution"
>>
>> the devil is in the detail though, and we've not documented any detail.
>>
>> Whether or not timers keep running across stop/cont I think can be
>> argued to be an impl detail, as long as the headline goal "vcpus
>> don't execute" is satisfied.
>
> "stop" was since qemu v0.14, so I guess we can't blame the missing of
> details or any form of inaccuracy.. Obviously we do more than "stop vCPU
> executions" in the current implementation.
>
> But after we reach a consensus on how we should fix the current suspended
> problem, we may want to update the documentation to start containing more
> information.
>
>>
>>>> vs the migration code where we are fixing brokenness.
>>>
>>> This is not a migration-only bug if above holds, IMO.
>>>
>>>> Second, it does not quite work, because the state becomes
>>>> RUN_STATE_PAUSED, so the suspended state is forgotten, and the hmp "cont"
>>>> will try to set the running state. I could fix that by introducing a new
>>>> state RUN_STATE_SUSPENDED_STOPPED, but again it is a user-visible change
>>>> in existing behavior. (I even implemented that while developing, then I
>>>> realized it was not needed to fix the migration bugs.)
>>>
>>> Good point.
>>
>> We have added new guest states periodically. It is a user visible
>> change, but you could argue that it is implementing a new feature
>> ie the ability to "stop" a "suspended" guest, and so is justified.
>>
>> S3 is so little used in virt, so I'm not surprised we're finding
>> long standing edge cases that have never been thought about before.
>>
>>> Now with above comments, what's your thoughts on whether we should change
>>> the user behavior? My answer is still a yes.
>>>
>>> Maybe SUSPENDED should not be a RunState at all? SUSPENDED is guest visible
>>> behavior, while something like QMP "stop" is not guest visible. Maybe we
>>> should remember it separately?
>>
>> Yes, every time I look at this area I come away thinking that
>> the RunState enum is a mess, overloading too many different
>> concepts onto the same single field.
>>
>> Specifically "SUSPENDED" vs "RUNNING" is a reflection of guest
>> state (ie whether or not the VM is in S3), but pretty much all
>> the others are a reflection of QEMU host state. I kind of feel
>> that SUSPENDED (S3) probably shouldn't have been a RunState at
>> all. I'd probably put guest-panicked into a separate thing too.
>>
>> But we're stuck with what we have.
>
> IMO compatibility is only necessary if at least the existing code is
> running well. But now I see it a major flaw in suspended state and I can't
> see how it can go right if with current code base.. My concern is instead
> that after suspended will be used more (e.g., assuming CPR will rely on it)
> we can have more chance to confuse/oops a mgmt app like Libvirt, like I
> described above.
>
> In summary, I think a current solution to me is only to fix at least
> suspended state for good, by:
>
> - adding vm_suspended boolean to remember machine RUNNING / SUSPENDED
> state. When "cont" we need to switch to either RUNNING / SUSPENDED
> depending on the boolean.
>
> - keeping SUSPENDED state as RunState (for compatibility, otherwise we'll
> need another interface to fetch that boolean anyway), even though not
> query-able during any !RUNNING && !SUSPENDED state.. hopefully not a
> big deal
>
> - enrich ducumentation of qmp_stop/qmp_cont to describe what they really do
>
> - (with suspended working all right...) fix migration of SUSPENDED state
>
> I don't expect a lot of code changes is needed, maybe even less than the
> current series (because we don't need special knob to differenciate
> migration or non-migration callers of do_vm_stop()). IMHO this series is
> already doing some of that but just decided to ignore outside-migration
> states for suspeneded.
>
> We may want to add some test cases though to verify the suspended state
> transitions (maybe easier to put into migration-test with the new ACPI
> guest code), but optional.
FYI, here is a brief update before today's meeting. I have implemented this and
I am testing libvirt and its various save + restore commands, when the guest is
suspended running (RUN_STATE_SUSPENDED), and suspended stopped (RUN_STATE_PAUSED
with vm_was_suspended = true). There are a few failures, and I am still
investigating
to see whether they can be fixed in qemu, or need a fix in libvirt.
I will send more details later.
- Steve
- Re: [PATCH V5 02/12] cpus: stop vm in suspended state, (continued)
- Re: [PATCH V5 02/12] cpus: stop vm in suspended state, Peter Xu, 2023/11/20
- Re: [PATCH V5 02/12] cpus: stop vm in suspended state, Steven Sistare, 2023/11/20
- Re: [PATCH V5 02/12] cpus: stop vm in suspended state, Peter Xu, 2023/11/20
- Re: [PATCH V5 02/12] cpus: stop vm in suspended state, Steven Sistare, 2023/11/21
- Re: [PATCH V5 02/12] cpus: stop vm in suspended state, Peter Xu, 2023/11/21
- Re: [PATCH V5 02/12] cpus: stop vm in suspended state, Daniel P . Berrangé, 2023/11/22
- Re: [PATCH V5 02/12] cpus: stop vm in suspended state, Peter Xu, 2023/11/22
- Re: [PATCH V5 02/12] cpus: stop vm in suspended state,
Steven Sistare <=
[PATCH V5 01/12] cpus: refactor vm_stop, Steve Sistare, 2023/11/13
[PATCH V5 03/12] cpus: pass runstate to vm_prepare_start, Steve Sistare, 2023/11/13
[PATCH V5 05/12] migration: preserve suspended runstate, Steve Sistare, 2023/11/13
[PATCH V5 09/12] tests/qtest: option to suspend during migration, Steve Sistare, 2023/11/13