Re: [Qemu-devel] [RFC PATCH v3 3/3] fw/pci: Allocate IGD stolen memory


From: Michael S. Tsirkin
Subject: Re: [Qemu-devel] [RFC PATCH v3 3/3] fw/pci: Allocate IGD stolen memory
Date: Mon, 15 Feb 2016 21:29:26 +0200

On Mon, Feb 15, 2016 at 12:20:23PM -0700, Alex Williamson wrote:
> On Mon, 15 Feb 2016 10:54:51 +0100
> Igor Mammedov <address@hidden> wrote:
> 
> > On Sat, 13 Feb 2016 18:03:31 -0700
> > Alex Williamson <address@hidden> wrote:
> > 
> > > On Sat, 13 Feb 2016 19:20:32 -0500
> > > "Kevin O'Connor" <address@hidden> wrote:
> > >   
> > > > On Sat, Feb 13, 2016 at 01:57:09PM -0700, Alex Williamson wrote:    
> > > > > On Sat, 13 Feb 2016 15:05:09 -0500
> > > > > "Kevin O'Connor" <address@hidden> wrote:      
> > > > > > On Sat, Feb 13, 2016 at 11:51:51AM -0700, Alex Williamson wrote:    
> > > > > >   
> > > > > > > On Sat, 13 Feb 2016 13:18:39 -0500
> > > > > > > "Kevin O'Connor" <address@hidden> wrote:        
> > > > > > > > On Sat, Feb 13, 2016 at 08:12:09AM -0700, Alex Williamson 
> > > > > > > > wrote:        
> > > > > > > > > On Fri, 12 Feb 2016 21:49:04 -0500
> > > > > > > > > "Kevin O'Connor" <address@hidden> wrote:          
> > > > > > > > > > On Fri, Feb 12, 2016 at 05:23:18PM -0700, Alex Williamson 
> > > > > > > > > > wrote:          
> > > > > > > > > > > Intel IGD makes use of memory allocated and marked 
> > > > > > > > > > > reserved by the
> > > > > > > > > > > BIOS as a stolen memory range.  For the most part, guest 
> > > > > > > > > > > drivers don't
> > > > > > > > > > > make use of this, but our Achilles heel is the vBIOS.  
> > > > > > > > > > > The vBIOS
> > > > > > > > > > > programs the device to use the host stolen memory range 
> > > > > > > > > > > and it's used
> > > > > > > > > > > in the pre-boot environment.  Generally the guest won't 
> > > > > > > > > > > have access to
> > > > > > > > > > > the host stolen memory area, so these accesses either 
> > > > > > > > > > > land in VM
> > > > > > > > > > > memory or unassigned space and generate IOMMU faults.  By 
> > > > > > > > > > > allocating
> > > > > > > > > > > this range in SeaBIOS and programming it into the device, 
> > > > > > > > > > > QEMU (via
> > > > > > > > > > > vfio) can make sure this guest allocated stolen memory 
> > > > > > > > > > > range is used
> > > > > > > > > > > instead.            
> > > > > > > > > > 
> > > > > > > > > > What does "vBIOS" mean in this context?  Is it the video 
> > > > > > > > > > device option
> > > > > > > > > > rom or something else?          
> > > > > > > > > 
> > > > > > > > > vBIOS = video BIOS, you're correct, it's just the video 
> > > > > > > > > device option
> > > > > > > > > ROM.          
> > > > > > > > 
> > > > > > > > Is the problem from when the host runs the video option rom, or 
> > > > > > > > is the
> > > > > > > > problem from the guest (via SeaBIOS) running the video option 
> > > > > > > > rom?  If
> > > > > > > > the guest is running the option rom, is it the first time the 
> > > > > > > > option
> > > > > > > > rom has been run for the machine (ie, was the option rom not 
> > > > > > > > executed
> > > > > > > > on the host when the host machine first booted)?
> > > > > > > > 
> > > > > > > > FWIW, many of the chromebooks use coreboot with Intel graphics 
> > > > > > > > and a
> > > > > > > > number of users have installed SeaBIOS (running natively) on 
> > > > > > > > their
> > > > > > > > machines.  Running the intel video option rom more than once 
> > > > > > > > has been
> > > > > > > > known to cause issues.        
> > > > > > > 
> > > > > > > The issue is in the VM and it occurs every time the option ROM is
> > > > > > > executed.  Standard VGA text mode displays fine (ex. SeaBIOS 
> > > > > > > version
> > > > > > > string and boot menu), but any sort of extended graphics mode 
> > > > > > > (ex. live
> > > > > > > CD boot menu) tries to make use of the host memory area which
> > > > > > > corresponds to the stolen memory area of the physical device.  
> > > > > > > We're
> > > > > > > not really sure how the ROM execution arrives at these addresses 
> > > > > > > (not
> > > > > > > from the device according to access traces), but we can see when 
> > > > > > > the
> > > > > > > ROM is writing these addresses to the device and modify the
> > > > > > > addresses
> > > > > > > to point at a VM memory range which we've allocated.  That's what 
> > > > > > > this
> > > > > > > code attempts to do, allocate a buffer and tell QEMU about it via 
> > > > > > > the
> > > > > > > BDSM (Base Data of Stolen Memory) register.        
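For illustration only, a rough sketch of the SeaBIOS side being described here; igd_alloc_reserved() and the 8 MB size are assumptions rather than code from the actual patch:

    #define IGD_BDSM        0x5C              /* Base Data of Stolen Memory */
    #define IGD_STOLEN_SIZE (8 * 1024 * 1024) /* assumed size */

    /* Hypothetical helper: returns 1 MB-aligned guest memory already
     * marked reserved in the e820 map, or 0 on failure. */
    u32 igd_alloc_reserved(u32 size, u32 align);

    static void igd_setup_stolen(struct pci_device *dev)
    {
        u32 addr = igd_alloc_reserved(IGD_STOLEN_SIZE, 1024 * 1024);
        if (!addr)
            return;
        /* QEMU (via vfio) traps this config write and later rewrites the
         * host stolen-memory addresses programmed by the vBIOS to point
         * at this guest-allocated range instead. */
        pci_config_writel(dev->bdf, IGD_BDSM, addr);
    }
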
> > > > > > 
> > > > > > Forgive me if I'm not fully understanding this.  If I read what 
> > > > > > you're
> > > > > > saying then the sequence is something like:
> > > > > > 
> > > > > > 1 - the host system bios (or vgabios) programs the GTT/stolen memory
> > > > > >     base register at host system bootup time and reserves it in the
> > > > > >     host e820 map.
> > > > > > 
> > > > > > 2 - upon running qemu, the guest reruns the vga bios option rom 
> > > > > > which
> > > > > >     seems to work (ie, text mode works)
> > > > > > 
> > > > > > 3 - in the guest, upon running a bootloader that uses graphics mode,
> > > > > >     the bootloader calls the vgabios to switch to graphics mode, and
> > > > > >     the vgabios sends commands to the graphics hardware that somehow
> > > > > >     reference the host stolen memory      
> > > > > 
> > > > > What exactly happens here isn't clear to me, but this is a plausible
> > > > > explanation.  What we see in tracing access to the hardware is that a
> > > > > bunch of addresses are written to the device that fall within the host
> > > > > e820 reserved area and then the device starts generating IOMMU faults
> > > > > trying to access those addresses.
> > > > >       
> > > > > > 4 - your patch causes QEMU to catch these commands with references 
> > > > > > to
> > > > > >     the host stolen memory and replace them with references to the
> > > > > >     guest stolen memory (which seabios creates)
> > > > > > 
> > > > > > Am I understanding the above correctly?      
> > > > > 
> > > > > Yes.
> > > > >        
> > > > > > Is the only reason to run the intel option rom in the guest for
> > > > > > bootloader graphic mode support?  Last time I looked, the intel vga
> > > > > > hardware could fully emulate a legacy vga device - if the device is 
> > > > > > in
> > > > > > vga compatibility mode then it may be possible to have seavgabios
> > > > > > drive mode changes.      
> > > > > 
> > > > > I have a SandyBridge based laptop (Lenovo W520) where the LCD panel
> > > > > won't turn on without the vBIOS.      
> > > > 
> > > > This confuses me - why didn't the host system BIOS turn on the LCD
> > > > panel during host bootup?    
> > > 
> > > It turns off when we reset the device between VM instances or between
> > > VM boots.  IGD supports Function Level Reset (FLR).
> > >   
> > > > > Another desktop IvyBridge system
> > > > > doesn't really care about the vBIOS so long as we don't ask it to
> > > > > output anything before the guest native drivers are loaded.  If we
> > > > > could, I think we'd just enable vBIOS for laptop panel support, but
> > > > > that's really not an option, it's going to run as a boot option ROM as
> > > > > well, so we need to fix the issues that it generates there.      
> > > > 
> > > > From my experience with coreboot, running the vga option rom multiple
> > > > times during a given boot is very fragile.  (By multiple times, I mean
> > > > either the host running it and then a guest, or running it multiple
> > > > times from multiple guests.)  YMMV.    
> > > 
> > > We do this regularly for graphics assignment, Nvidia, AMD, and now
> > > Intel.  It generally works ok.  Perhaps you've seen issues with the
> > > option ROM being run multiple times without resetting the device.  I
> > > could certainly believe that.  We only have one blacklisted Broadcom
> > > ROM in vfio, probably due to missing or incomplete device reset method.
> > >    
> > > > > > > > [...]        
> > > > > > > > > The write to 0x5C is used by QEMU code that traps writes to 
> > > > > > > > > the
> > > > > > > > > device I/O port BAR and replaces the host stolen memory 
> > > > > > > > > address
> > > > > > > > > with the new guest address generated here.  0x5C is 
> > > > > > > > > initialized to
> > > > > > > > > 0x0 by kernel vfio code, so we can detect whether it has been
> > > > > > > > > written.  If not written, QEMU has no place to redirect to for
> > > > > > > > > stolen memory and it will either overlap VM memory or an 
> > > > > > > > > unassigned
> > > > > > > > > area.  The former may corrupt VM memory, the latter throws 
> > > > > > > > > host
> > > > > > > > > errors.  We could in QEMU halt with a hardware error if 0x5C 
> > > > > > > > > hasn't
> > > > > > > > > been programmed.          
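A rough sketch, in QEMU terms, of the substitution being described; the IGDQuirk structure, its field names, and the 1 MB alignment mask are assumptions, not the actual vfio quirk code:

    typedef struct {
        uint32_t host_bdsm;   /* stolen base on the physical machine */
        uint32_t guest_bdsm;  /* value the guest firmware wrote to 0x5C */
    } IGDQuirk;

    /* Called on trapped writes through the device's I/O port BAR;
     * rewrites references to the host stolen range into the guest range. */
    static uint64_t igd_rewrite_stolen(IGDQuirk *quirk, uint64_t data)
    {
        if ((data & ~0xfffffULL) == quirk->host_bdsm) {
            if (!quirk->guest_bdsm) {
                /* 0x5C was never programmed: no safe place to redirect to. */
                hw_error("vfio/igd: guest BDSM not set");
            }
            data = quirk->guest_bdsm | (data & 0xfffff);
        }
        return data;
    }
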
> > > > > > > > 
> > > > > > > > So, if I understand correctly, 0x5C is not a "real" register on 
> > > > > > > > the
> > > > > > > > hardware, but is instead just a mechanism to give QEMU the 
> > > > > > > > address of
> > > > > > > > some guest visible ram?        
> > > > > > > 
> > > > > > > It is a real register, BDSM that is virtualized by vfio turning it
> > > > > > > effectively into a scratch register.  On physical hardware the
> > > > > > > register is read-only.
> > > > > > >          
> > > > > > > > BTW, is 0xFC a "real" register in the hardware?  How does the 
> > > > > > > > guest
> > > > > > > > find the location of the "OpRegion" if SeaBIOS allocates it?    
> > > > > > > >     
> > > > > > > 
> > > > > > > 0xFC is the ASL Storage register, the guest finds the location of 
> > > > > > > the
> > > > > > > OpRegion using this register.  This is another register that is
> > > > > > > read-only on real hardware but virtualized through vfio so we can
> > > > > > > relocate the OpRegion into the VM address space.
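For comparison, a sketch of what the firmware side of that could look like; it reuses the hypothetical igd_alloc_reserved() helper from above, while romfile_find(), the romfile copy() callback, and pci_config_writel() are existing SeaBIOS interfaces (etc/igd-opregion is the fw_cfg file name mentioned later in this thread):

    #define IGD_ASLS 0xFC   /* ASL Storage register */

    static void igd_setup_opregion(struct pci_device *dev)
    {
        struct romfile_s *file = romfile_find("etc/igd-opregion");
        if (!file || !file->size)
            return;
        u32 opregion = igd_alloc_reserved(file->size, 4096);
        if (!opregion)
            return;
        /* Copy the OpRegion image supplied by QEMU and tell QEMU, via the
         * virtualized ASLS register, where the guest copy lives. */
        file->copy(file, (void*)opregion, file->size);
        pci_config_writel(dev->bdf, IGD_ASLS, opregion);
    }
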
> > > > > > > 
> > > > > > > I've found that allocating a dummy MMIO BAR does work as an 
> > > > > > > alternative
> > > > > > > for mapping space for this stolen memory into the VM address 
> > > > > > > space.
> > > > > > > For a Linux guest it works to allocate BAR5 on the IGD device.
> > > > > > > Windows10 is not so happy with this, but does work if I allocate 
> > > > > > > the
> > > > > > > BAR on something like the ISA bridge device.  My guess is that 
> > > > > > > the IGD
> > > > > > > driver in Windows freaks out at finding this strange new BAR on 
> > > > > > > its
> > > > > > > device.  So I'll need to come up with an algorithm for either 
> > > > > > > creating
> > > > > > > a dummy PCI device to host this BAR or trying to add it to other
> > > > > > > existing devices.  It's certainly a more self-contained solution 
> > > > > > > this
> > > > > > > way, so I expect we'll only need patch 1/3 from this series.  
> > > > > > > Thanks,        
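A minimal sketch of that dummy-BAR variant in QEMU terms, assuming BAR 5 is free on whichever device it gets attached to; memory_region_init_ram() and pci_register_bar() are existing QEMU APIs:

    static void igd_add_stolen_bar(PCIDevice *dev, uint64_t size, Error **errp)
    {
        MemoryRegion *mr = g_new0(MemoryRegion, 1);

        /* Back the BAR with plain guest RAM; the guest PCI code then
         * assigns it a GPA that can be handed to the device in place of
         * the host stolen-memory range. */
        memory_region_init_ram(mr, OBJECT(dev), "igd-stolen", size, errp);
        pci_register_bar(dev, 5, PCI_BASE_ADDRESS_SPACE_MEMORY, mr);
    }
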
> > > > > > 
> > > > > > Okay.  (I'm not saying patch 3 is bad, but okay.)
> > > > > > 
> > > > > > If you go through the trouble of mapping the BDSM through a pci bar,
> > > > > > then why not do the same with ASLS then too?      
> > > > > 
> > > > > I suppose we could do that.  There are a few nuances to the fake BAR
> > > > > solution:
> > > > > 
> > > > > 1) The BAR needs to get mapped and not remapped while in use - usually
> > > > >    not a problem.
> > > > > 
> > > > > 2) The guest needs to not disable the device we attach the BARs to,
> > > > >    which it might do if it doesn't recognize the device.
> > > > > 
> > > > > 3) We need to be careful about adding BARs to devices the guest does
> > > > >    have drivers for or we might overlap real functionality.
> > > > > 
> > > > > 4) If we create a dummy device with bogus IDs, it will show up with an
> > > > >    exclamation mark in device manager, which makes people unhappy.
> > > > > 
> > > > > So from a perspective of being self contained, the fake BAR solution 
> > > > > is
> > > > > very good, but it's not without issue.  I'll try to think of what sort
> > > > > of dummy device we could create that would always have a guest driver,
> > > > > but nothing that a couple extra BARs would interfere with.  Maybe a
> > > > > generic PCI bridge.  Thanks,      
> > > > 
> > > > Okay.  Again, I'm not stating a preferred direction.
> > > > 
> > > > BTW, I wonder if the recent discussion between Michael and Igor is
> > > > relevant here:
> > > >   https://lists.gnu.org/archive/html/qemu-devel/2016-01/msg05602.html   
> > > >  
> > > 
> > > I'm certainly open to rebuttals against this approach, but I do have it
> > > working.  Being entirely self contained is pretty intriguing.
> > > Theoretically this would allow us to work with OVMF with no
> > > modifications.  Linux guests enable and disable devices several times
> > > during boot (per the spec, any time the BAR is sized it should be
> > > disabled); Windows never seems to disable the device.  The LPC/ISA
> > > bridge seems to be the best place to put these BARs, we need to create
> > > that anyway for pre-Broadwell/Skylake and the device itself has no
> > > implemented BARs. The ISA bridge is just a shell device to keep the
> > > driver happy on those older chips, so squatting on a couple BARs
> > > doesn't seem too terrible. Thanks,  
> > I'm not sure how Windows would behave wrt rebalancing
> > if that fake BAR is added to the LPC/ISA bridge, and I can't think of
> > a way to test and verify it either.
> > 
> > In the above discussion Michael dislikes allocating the GPA in QEMU,
> > while I dislike stealing guest RAM just to get the same GPA.
> > Perhaps it would be better if we teach SeaBIOS/OVMF to allocate a GPA
> > somewhere in free address space without stealing guest RAM, and
> > tell QEMU about it.
> > Michael suggested doing something similar for stolen RAM via the
> > bios_linker interface only, which is a part of SeaBIOS/OVMF,
> > but maybe making it a more generic PV interface might be even
> > more useful. That way we won't have to implement fake PCI
> > devices/BARs nor hijack existing PCI devices.
> 
> Well, I was hoping that your requirement that the address cannot move
> or be disabled was more strict than mine, but I think I've just proven
> it's not.  I prevented the native drivers from loading in the guest
> (pci-stub.ids=8086:0126), reserved the memory that SeaBIOS was
> programming for the ISA bridge BARs (memmap=2M#0xf1000000), told PCI to
> reassign devices (pci=realloc), and booted with a VESA mode (vga=ask).
> If I get all those things to align then VESA mode makes use of the
> stolen memory assigned at vBIOS init and when the guest moves it we get
> garbage on the screen and a spew of IOMMU faults in the host.  Even if
> I don't do the reserved/realloc thing, I get a single IOMMU fault in
> the host, which I assume is an access that occurs while the BAR is
> being sized and therefore disabled.  That would be a more realistic
> scenario of an install program not having native support for the
> GPU.  Darn, this PCI hack was growing on me.
> 
> I don't mind the allocation of guest memory issue, the fact that guest
> memory is consumed by built-in devices is exactly what happens on bare
> metal.

Right, I keep saying that.

> The only part I don't like about the BIOS assigning the GPA is
> that we need support in each BIOS to do that.

But then if we assign it in QEMU, do we still need
each BIOS to read it?

>  I think we had already
> come up with a pretty good way to handle the opregion using
> etc/igd-opregion to fill the space; for stolen memory, I think we're
> just going to need to do something similar, maybe even via a dummy
> fw_cfg file so that SeaBIOS not only knows the size, but if we needed
> to pre-populate it, we could.  It at least puts QEMU in control, but
> relaying back the address via a device specific register is still a bit
> ugly.  Thanks,
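A sketch of that dummy fw_cfg file idea; the etc/igd-bdsm name and the plumbing around it are assumptions:

    /* QEMU side: publish a file whose size tells the firmware how much
     * stolen memory to set aside, and whose contents could be
     * pre-populated if that ever became necessary. */
    static void igd_add_bdsm_fwcfg(FWCfgState *fw_cfg, uint64_t stolen_size)
    {
        void *blob = g_malloc0(stolen_size);
        fw_cfg_add_file(fw_cfg, "etc/igd-bdsm", blob, stolen_size);
    }
    /* Firmware side: romfile_find("etc/igd-bdsm")->size gives the amount
     * to allocate before reporting the result back through 0x5C. */
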
> 
> Alex

I can build a generic interface to pass addresses
allocated by bios back to QEMU. It looks like this would
be useful for other purposes as well.  Interested?

-- 
MST


