From: David Gibson
Subject: Re: [Qemu-ppc] [PATCH v3 2/5] ppc: spapr: Handle "ibm,nmi-register" and "ibm,nmi-interlock" RTAS calls
Date: Thu, 28 Sep 2017 13:58:15 +1000
User-agent: Mutt/1.9.0 (2017-09-02)
On Wed, Sep 27, 2017 at 05:23:51PM +0530, Aravinda Prasad wrote:
>
>
> On Wednesday 27 September 2017 12:45 PM, David Gibson wrote:
> > On Thu, Sep 21, 2017 at 02:39:06PM +0530, Aravinda Prasad wrote:
> >>
> >>
> >> On Tuesday 22 August 2017 07:38 AM, David Gibson wrote:
> >>
> >> [ . . . ]
> >>
> >>>>>> diff --git a/include/hw/ppc/spapr.h b/include/hw/ppc/spapr.h
> >>>>>> index 46012b3..eee8d33 100644
> >>>>>> --- a/include/hw/ppc/spapr.h
> >>>>>> +++ b/include/hw/ppc/spapr.h
> >>>>>> @@ -123,6 +123,12 @@ struct sPAPRMachineState {
> >>>>>> * occurs during the unplug process. */
> >>>>>> QTAILQ_HEAD(, sPAPRDIMMState) pending_dimm_unplugs;
> >>>>>>
> >>>>>> +    /* State related to "ibm,nmi-register" and "ibm,nmi-interlock" calls */
> >>>>>> + target_ulong guest_machine_check_addr;
> >>>>>> + bool mc_in_progress;
> >>>>>> + int mc_cpu;
> >>>>>
> >>>>> mc_cpu isn't actually used yet in this patch. In any case it and
> >>>>> mc_in_progress could probably be folded together, no?
> >>>>
> >>>> It is possible to fold mc_cpu and mc_in_progress together, with the
> >>>> convention that -1 means no machine check is in progress and any other
> >>>> value is the CPU currently handling the machine check.
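
A minimal sketch of that folded convention (hypothetical field and helper
names, not the actual patch) could look like:

    /* -1 when no machine check is in progress; otherwise the index of
     * the CPU currently handling the machine check. */
    int mc_handling_cpu;

    static bool spapr_mc_in_progress(sPAPRMachineState *spapr)
    {
        return spapr->mc_handling_cpu != -1;
    }

    /* on entry to the guest machine check path */
    spapr->mc_handling_cpu = cs->cpu_index;

    /* when the guest calls "ibm,nmi-interlock" */
    spapr->mc_handling_cpu = -1;
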
> >>>>
> >>>>>
> >>>>> These values will also need to be migrated, AFAICT.
> >>>>
> >>>> I am thinking about how to handle migration when machine check handling
> >>>> is in progress. We could probably wait for machine check handling to
> >>>> complete before migrating, as the error could be irrelevant once we have
> >>>> migrated to new hardware. In that case we don't need to migrate these
> >>>> values.
> >>>
> >>> Ok.
> >>
> >> This is what I think about handling machine checks during migration,
> >> based on my understanding of the VM migration code.
> >>
> >> There are two possibilities here. First, migration can be initiated
> >> while machine check handling is in progress. Second, a machine check
> >> error can happen while migration is in progress.
> >>
> >> To handle the first case we can add a migrate_add_blocker() call when we
> >> start handling the machine check error and issue migrate_del_blocker()
> >> when done. I think this should solve the issue.
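
A rough sketch of that blocker pattern (the field name and error message are
placeholders; it assumes the usual migrate_add_blocker()/migrate_del_blocker()
API, so treat it as illustrative rather than the actual patch):

    static Error *mc_migration_blocker;   /* hypothetical per-machine state */

    /* when we start delivering a machine check to the guest */
    Error *local_err = NULL;
    error_setg(&mc_migration_blocker, "Machine check handling in progress");
    if (migrate_add_blocker(mc_migration_blocker, &local_err) < 0) {
        /* a migration is already in flight, so the blocker is refused */
        error_report_err(local_err);
        error_free(mc_migration_blocker);
        mc_migration_blocker = NULL;
    }

    /* when the guest signals completion via "ibm,nmi-interlock" */
    if (mc_migration_blocker) {
        migrate_del_blocker(mc_migration_blocker);
        error_free(mc_migration_blocker);
        mc_migration_blocker = NULL;
    }
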
> >>
> >> The second case is a bit tricky. The migration has already started, so
> >> the migrate_add_blocker() call will fail. We also cannot wait until the
> >> migration completes to handle the machine check error, as the VM's data
> >> could be corrupt.
> >>
> >> Machine check errors should not be an issue while the migration is in
> >> the RAM copy phase, as the VM is still active with vCPUs running. The
> >> problem is when we hit a machine check just as the migration is about to
> >> complete. For example:
> >>
> >> 1. vCPU2 hits a machine check error during migration.
> >>
> >> 2. KVM causes a VM exit on vCPU2 and the NIP of vCPU2 is changed to the
> >> guest-registered machine check handler.
> >>
> >> 3. The migration_completion() issues vm_stop() and hence either vCPU2 is
> >> never scheduled again on the source hardware or vCPU2 is preempted while
> >> executing the machine check handler.
> >>
> >> 4. vCPU2 is resumed on the target hardware and either starts or
> >> continues processing the machine check error. This could be a problem,
> >> as these errors are specific to the source hardware. For instance, when
> >> the guest issues memory poisoning upon such an error, a clean page on the
> >> target hardware is poisoned while the corrupt page on the source hardware
> >> is not.
> >>
> >> The second case of hitting a machine check during the final phase of
> >> migration is rare, but I wanted to check what others think about it.
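
For step 2 above, the QEMU-side redirection presumably amounts to something
like the following sketch (using the guest_machine_check_addr field from the
struct earlier in the thread; setting up the RTAS error log and other state
is omitted, so this is illustrative only):

    PowerPCCPU *cpu = POWERPC_CPU(cs);
    CPUPPCState *env = &cpu->env;

    if (spapr->guest_machine_check_addr) {
        /* divert the vCPU to the handler the guest registered via
         * "ibm,nmi-register" */
        env->nip = spapr->guest_machine_check_addr;
    }
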
> >
> > So, I've had a bit of a think about this. I don't recall if these
> > fwnmi machine checks are expected on guest RAM, or guest IO addresses.
>
> It is expected on guest RAM. I am not sure about guest IO addresses.
>
> >
> > 1) If RAM
> >
> > What exactly is the guest's notification for? Even without
> > migration, the host's free to move guest memory around in host
> > memory, so it seems any hardware level poking should be done on the
> > host side.
>
> If the error is a correctable error, then the host takes care of it by
> moving the page to a different location; the guest need not be and will
> not be notified. The guest will be notified if the host is not able to
> fully recover. Hence we hit FWNMI in the guest when RAM errors are not
> recovered by the host.
Ok.
> > Is it just to notify the guest that we weren't able to fully recover
> > on the host side and that the page may contain corrupted data? If
> > that's so, then it seems resuming the handling on the destination is
> > still right. It may be new good RAM, but the contents we migrated
> > could still be corrupt from the machine check event on the source.
>
> Yes. This is what I am doing in my v5 patch set, which I am about to
> post. Additionally, I block migration while processing machine check errors.
>
> >
> > 2) If IO
> >
> > AFAICT this could only happen with VFIO passthrough devices.. but it
> > shouldn't be possible to migrate if there are any of those.
> >
>
> I am not very sure about IO errors.
Ok.  It sounds like that's not the primary case you're interested in, so
I guess we can ignore it for now.
--
David Gibson | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson