qemu-ppc

Re: [Qemu-ppc] [Qemu-devel] [PATCH v3 0/4] target-ppc: Add FWNMI support


From: Sam Bobroff
Subject: Re: [Qemu-ppc] [Qemu-devel] [PATCH v3 0/4] target-ppc: Add FWNMI support in qemu for powerKVM guests
Date: Wed, 2 Sep 2015 16:34:01 +1000
User-agent: Mutt/1.5.23 (2014-03-12)

On Tue, Sep 01, 2015 at 04:37:51PM +0530, Aravinda Prasad wrote:
> 
> 
> On Monday 10 August 2015 09:35 AM, Sam Bobroff wrote:
> > On Sun, Aug 09, 2015 at 03:53:02PM +0200, Alexander Graf wrote:
> >>
> >>
> >> On 07.08.15 05:37, Sam Bobroff wrote:
> >>> Hello Aravinda and all,
> >>>
> >>> On Wed, Jul 08, 2015 at 01:58:13PM +0530, Aravinda Prasad wrote:
> >>>> On Friday 03 July 2015 11:31 AM, David Gibson wrote:
> >>>>> On Thu, Jul 02, 2015 at 07:11:52PM +1000, Alexey Kardashevskiy wrote:
> >>>>>> On 04/02/2015 03:46 PM, David Gibson wrote:
> >>>>>>> On Thu, Apr 02, 2015 at 03:28:11PM +1100, Alexey Kardashevskiy wrote:
> >>>>>>>> On 11/19/2014 04:48 PM, Aravinda Prasad wrote:
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> On Tuesday 11 November 2014 08:54 AM, David Gibson wrote:
> >>>>>>>>>
> >>>>>>>>> [..]
> >>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> So, this may not still be possible depending on whether the KVM
> >>>>>>>>>> side of this is already merged, but it occurs to me that there's
> >>>>>>>>>> a simpler way.
> >>>>>>>>>>
> >>>>>>>>>> Rather than mucking about with having to update the hypervisor on
> >>>>>>>>>> the RTAS location, then have qemu copy the code out of RTAS, patch
> >>>>>>>>>> it and copy it back into the vector, you could instead do this:
> >>>>>>>>>>
> >>>>>>>>>>   1. Make KVM, instead of immediately delivering a 0x200 for a
> >>>>>>>>>> guest machine check, cause a special exit to qemu.
> >>>>>>>>>>
> >>>>>>>>>>   2. Have the register-nmi RTAS call store the guest side MC
> >>>>>>>>>> handler address in the spapr structure, but perform no actual
> >>>>>>>>>> guest code patching.
> >>>>>>>>>>
> >>>>>>>>>>   3. Allocate the error log buffer independently from the RTAS
> >>>>>>>>>> blob, so qemu always knows where it is.
> >>>>>>>>>>
> >>>>>>>>>>   4. When qemu gets the MC exit condition, instead of going via a
> >>>>>>>>>> patched 0x200 vector, just directly set the guest register state
> >>>>>>>>>> and jump straight into the guest side MC handler.
> >>>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> Before I proceed further I would like to know what others think
> >>>>>>>>> about the approach proposed above (except for step 3 - as per PAPR
> >>>>>>>>> the error log buffer should be part of the RTAS blob and hence we
> >>>>>>>>> cannot have the error log buffer independent of the RTAS blob).
> >>>>>>>>>
> >>>>>>>>> Alex, Alexey, Ben: Any thoughts?
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> Any updates about FWNMI? Thanks
> >>>>>>>
> >>>>>>> Huh.. I'd completely forgotten about this.  Aravinda, can you repost
> >>>>>>> your latest work on this?
> >>>>>>
> >>>>>>
> >>>>>> Aravinda disappeared...
> >>>>>
> >>>>> Ok, well someone who cares about FWNMI is going to have to start
> >>>>> sending something, or it won't happen.
> >>>>
> >>>> I am yet to work on the new approach proposed above. I will start
> >>>> looking into that this week.
> >>>
> >>> The RTAS call being discussed in this thread actually has two vectors
> >>> to patch (System Reset and Machine Check), and the patches so far only
> >>> address the Machine Check part. I've been looking at filling in the
> >>> System Reset part and that will mean basing my code on top of this set.
> >>> I would like to keep the same style of solution for both vectors, so
> >>> I'd like to get the discussion started again :-)
> >>>
> >>> So (1) do we use a trampoline in guest memory, and if so (2) how is the
> >>> trampoline code handled?
> >>>
> >>> (1) It does seem simpler to me to deliver directly to the handler, but I'm
> >>> worried about a few things:
> >>>
> >>> If a guest were to call ibm,nmi-register and then kexec to a new kernel
> >>> that does not call ibm,nmi-register, would the exception cause a jump
> >>> to a stale address?
> >>
> >> Probably - how does that get handled today with pHyp? Does pHyp just
> >> override the actual exception vector code and thus the kexec'ed code
> >> path gets overwritten?
> > 
> > Yes. According to PAPR, when ibm,nmi-register is called the guest
> > "relinquishes" the whole 256 bytes of the vectors at 0x100 and 0x200 to
> > the hypervisor. It never mentions a way to get them back, but it does
> > jump via the vector, so if a guest were to rewrite it, it should work
> > the way we expect.
> > 
> >> I don't remember the original patch set fully, but if all we need is to
> >> override 0x200, why can't we replace the code with
> >>
> >>   mtsprg scratch, r0
> >>   li r0, HCALL_KVM_MC
> >>   sc 1
> >>
> >> then there is no complexity in that code at all with dynamically patched
> >> bits. Or am I missing the obvious?
> > 
> > There is more complexity in the patches because PAPR requires the
> > hypervisor to do some work before invoking the guest's handler. The
> > patch set does this by writing a trampoline (roughly) like this:
> > * Call a new private hcall to set up the required state (re-trying if
> >   necessary).
> > * Return to the trampoline code.
> > * Jump via a patched branch instruction to the guest's handler.
> > 
> > So it's a bit roundabout but gets the job done.
> > 
> > If what you're suggesting is that we replace this by a single (new,
> > private) hcall that sets up the state and jumps the guest to the
> > handler, then I think it might be a good compromise. It simplifies the
> > trampoline code and doesn't suffer from the kexec problem. The only
> > issue would be that it would be an odd hcall: rather than returning to
> > the caller like a normal hcall, it would jump out to some other address,
> > and this jump would be done by QEMU manipulating the guest state.
> > 
> > If we followed this approach for both 0x100 and 0x200, maybe we should
> > re-use the hcall: it could either take a parameter or switch based on
> > where it was called from (since it's only going to be valid to call it
> > from either the 0x100 or 0x200 vectors).
> > 
> > Maybe call it "HCALL_KVM_FWNMI_TRAMPOLINE"?
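To make the "jump out instead of returning" behaviour concrete, here is a rough sketch of what the QEMU side of such an hcall might do. Everything below (the structure and function names, the fields) is a hypothetical simplification for illustration, not the real QEMU code:

```c
#include <stdint.h>

/*
 * Illustrative stand-ins for QEMU's real guest CPU and sPAPR machine
 * state; none of these names are the actual QEMU API.
 */
typedef struct {
    uint64_t nip;      /* guest next-instruction pointer */
    uint64_t gpr[32];  /* guest general-purpose registers */
} GuestCpu;

typedef struct {
    uint64_t guest_mc_handler;  /* stored by ibm,nmi-register */
    uint64_t guest_sr_handler;  /* stored by ibm,nmi-register */
    uint64_t rtas_error_log;    /* lives inside the RTAS blob, per PAPR */
} SpaprState;

/*
 * Unlike a normal hcall, this one never "returns" to the caller: it
 * redirects the guest straight into the handler registered via
 * ibm,nmi-register, setting up r3 for machine checks as PAPR requires.
 */
void h_fwnmi_trampoline(SpaprState *spapr, GuestCpu *cpu,
                        int is_machine_check)
{
    if (is_machine_check) {
        cpu->gpr[3] = spapr->rtas_error_log;  /* r3 -> error log */
        cpu->nip = spapr->guest_mc_handler;
    } else {
        cpu->nip = spapr->guest_sr_handler;   /* system reset: no log */
    }
}
```

Choosing between the two branches could just as well be done by looking at which vector the hcall was issued from, as suggested above.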
> > 
> >>> Because we're adding a new exit condition, presumably an upgraded KVM
> >>> would require an upgraded QEMU: is this much of a problem?
> >>
> >> Well, you would keep default behavior identical. On nmi-register QEMU
> >> would send an ioctl to KVM, telling it to route 0x200 to QEMU instead
> >> (just like with breakpoints). So old QEMU would still work the same way
> >> and new QEMU with old KVM would simply get non-working MC intercepts.
> > 
> > Great, doesn't sound like a problem :-)
> > 
> >>> From some investigation it looks like the current upstream KVM already
> >>> forwards (some) host machine checks to the guest by sending them
> >>> directly to 0x200, and that Linux guests expect this, regardless of
> >>> support in the host for ibm,nmi-register (although they do call
> >>> ibm,nmi-register if it's present).
> >>>
> >>> (2) If we are using trampolines:
> >>>
> >>> About the trampoline code in the v3 patches: I like producing the code
> >>> using the assembler, but I'm not sure that the spapr-rtas blob is the
> >>> right place to store it. The spapr-rtas blob is loaded into guest
> >>> memory but it's only QEMU that needs it. It seems messy to me and means
> >>> that the guest could corrupt it.
> >>
> >> If you like, rename the blob. My original proposal was to just use
> >> well-known offsets inside the blob that get indicated through a function
> >> pointer table at the beginning/end/known location.
> >>
> > 
> > OK.
> > 
> >>> Some other other options might be:
> >>>
> >>> (a) Create a new blob (spapr-rtas-trampoline?) just like the spapr-rtas
> >>> one but only load it when ibm,nmi-register is called, and only into
> >>> QEMU, not the guest memory. There would be another "BIOS" blob to
> >>> install, and it wouldn't really be BIOS, but it seems like it would
> >>> work easily. Since we need a second, different, trampoline for System
> >>> Reset, I would then need to add yet another blob for that... Still,
> >>> this doesn't seem so bad. I suppose we could add some structure to the
> >>> blob (e.g. a table of contents at the start) and fit both trampolines
> >>> in, but that's inventing yet another file format... ugh.
> >>
> >> Yes, I think inventing our own file format is the best way forward. It
> >> shouldn't be too bad. Just reserve, say, 10 64-bit values somewhere and
> >> use them as a function table.
> > 
> > OK.
> > 
> >>> (b) As above but assemble the trampoline code into an ELF dynamic
> >>> library rather than stripping it down to a raw binary: we could use
> >>> known symbols to find the trampolines, even the patch locations, so at
> >>> least we wouldn't be inventing our own format (using dlopen()/dlsym()...
> >>> I wonder if this would be OK for all platforms...).
> >>
> >> We have our own ELF loader in QEMU so it's workable, but I think it's
> >> actually more complicated and harder at the end of the day than (a).
> > 
> > OK, fair enough.
> > 
> >>> (c) Assemble it (as above) but include it directly in the QEMU binary
> >>> by objcopying it in or hexdumping it into a C string or something
> >>> similar. This seems fairly neat but I'm not sure how people would feel
> >>> about including "binaries" into QEMU this way. Although it would take
> >>> some work in the build system, it seems like a fairly neat solution to
> >>> me.
> >>
> >> We tried to move away from code as hex arrays in QEMU to make it easier
> >> for people to patch things when they want to. But then again if we're
> >> talking 3 instructions it might not be the worst option.
> > 
> > Sounds sensible.
> > 
> > So, in summary, it sounds like a decent approach would be:
> > * store the guest's handlers in QEMU's spapr structure,
> > * simplify the trampolines down to a single, non-returning, hcall,
> 
> However, other instructions, such as saving r3 and re-trying the hcall,
> are still required.

Ah yes, that's true. I was thinking that the retrying could happen inside the
hcall but it can't.

> > * implement the new hcall in QEMU, where it has the handler addresses,
> > * include the (now tiny) trampoline directly in the code,
> 
> This was my first approach...
> 
> If we simplify the trampoline, we can include it directly in the code.
> With the assumption of the new hcall, the trampoline should be reduced by
> half, but it will still be 7 to 8 instructions.

Hmm. That's not as small as I'd hoped.
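For reference, a simplified machine check trampoline along these lines might look roughly like the sketch below. The hcall number, the choice of SPRG scratch register, and the retry convention are all assumptions for illustration, not the actual patch:

```asm
/* Hypothetical simplified 0x200 trampoline, ~7-8 instructions */
0x200:  mtsprg  2, r3                    /* save r3 in a scratch SPRG */
retry:  li      r3, H_FWNMI_TRAMPOLINE   /* hypothetical private hcall */
        sc      1                        /* hcall out to QEMU */
        cmpwi   r3, H_BUSY               /* QEMU may ask the guest to retry */
        beq     retry
        /* On success QEMU never returns here: it sets the guest NIP to
         * the handler registered via ibm,nmi-register. */
```

The system reset trampoline would be the same minus the retry loop, since (as noted below) retrying is not required for system reset.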

> > * no new blob or blob code,
> 
> Yes, no new blob or blob code. However, we still need to put the error
> log in the RTAS blob as per PAPR.
> 
> > * no changes to KVM,
> > * no problems with kexec.
> 
> Not sure if this solves kexec. What if the kexec'd kernel does not call
> nmi-register? Upon system reset/machine check we would still jump to the
> old stale address.

I thought that the kexec would overwrite the trampoline code with the new
kernel's default handler, making it safe to use. Is that not the case?

> > Aravinda: what do you think?
> 
> Also, we need different trampolines for system reset and machine check
> as retrying is not required for system reset.

Yes.

> Apart from that, does system reset require any error log to be passed on
> to the guest?

No it doesn't. It just needs to jump to the handler :-)

> This approach sounds good. I will modify the code and post it soon.

Since we can't get the trampoline code as small as I'd hoped, I wonder if we
could go the other way and implement all of h_report_mc_err() directly in the
vector? (The RTAS blob address could be patched in like the handler address.)

We would obviously need to generate this code from the assembler and probably
store it in a new blob, but we could remove the private hcall. Perhaps this
would be better overall. Is there some reason this won't work, and we need it
to be an hcall?

> Regards,
> Aravinda
> 
> > 
> >> Alex
> > 
> > Thanks Alex,
> > Sam.
> > 
> 
> -- 
> Regards,
> Aravinda



