Re: [Qemu-devel] Multi GPU passthrough via VFIO

From:	Alex Williamson
Subject:	Re: [Qemu-devel] Multi GPU passthrough via VFIO
Date:	Mon, 19 Jan 2015 10:43:07 -0700
On Fri, 2015-01-16 at 13:21 +0100, Maik Broemme wrote:
> Hi Alex,
> 
> Maik Broemme <address@hidden> wrote:
> > Hi Alex,
> > 
> > Maik Broemme <address@hidden> wrote:
> > > Hi Alex,
> > > 
> > > Alex Williamson <address@hidden> wrote:
> > > > On Fri, 2014-02-14 at 01:01 +0100, Maik Broemme wrote:
> > > > > Hi Alex,
> > > > > 
> > > > > Maik Broemme <address@hidden> wrote:
> > > > > > Hi Alex,
> > > > > > 
> > > > > > Alex Williamson <address@hidden> wrote:
> > > > > > > On Fri, 2014-02-07 at 01:22 +0100, Maik Broemme wrote:
> > > > > > > > The diff between the 1st and 2nd boot is interesting, if I run lspci
> > > > > > > > prior to booting. The only differences between the 1st start and the
> > > > > > > > 2nd start are:
> > > > > > > > 
> > > > > > > > --- 001-lspci.290x.before.1st.log       2014-02-07 01:13:41.498827928 +0100
> > > > > > > > +++ 004-lspci.290x.before.2nd.log       2014-02-07 01:16:50.966611282 +0100
> > > > > > > > @@ -24,7 +24,7 @@
> > > > > > > >                         ClockPM- Surprise- LLActRep- BwNot-
> > > > > > > >                 LnkCtl: ASPM L0s L1 Enabled; RCB 64 bytes Disabled- CommClk+
> > > > > > > >                         ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
> > > > > > > > -               LnkSta: Speed 5GT/s, Width x16, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
> > > > > > > > +               LnkSta: Speed 2.5GT/s, Width x1, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
> > > > > > > >                 DevCap2: Completion Timeout: Not Supported, TimeoutDis-, LTR-, OBFF Not Supported
> > > > > > > >                 DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-, LTR-, OBFF Disabled
> > > > > > > >                 LnkCtl2: Target Link Speed: 8GT/s, EnterCompliance- SpeedDis-
> > > > > > > > @@ -33,13 +33,13 @@
> > > > > > > >                 LnkSta2: Current De-emphasis Level: -3.5dB, EqualizationComplete-, EqualizationPhase1-
> > > > > > > >                          EqualizationPhase2-, EqualizationPhase3-, LinkEqualizationRequest-
> > > > > > > >         Capabilities: [a0] MSI: Enable- Count=1/1 Maskable- 64bit+
> > > > > > > > -               Address: 0000000000000000  Data: 0000
> > > > > > > > +               Address: 00000000fee00000  Data: 0000
> > > > > > > >         Capabilities: [100 v1] Vendor Specific Information: ID=0001 Rev=1 Len=010 <?>
> > > > > > > >         Capabilities: [150 v2] Advanced Error Reporting
> > > > > > > >                 UESta:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
> > > > > > > >                 UEMsk:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
> > > > > > > >                 UESvrt: DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
> > > > > > > > -               CESta:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr-
> > > > > > > > +               CESta:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+
> > > > > > > >                 CEMsk:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+
> > > > > > > >                 AERCap: First Error Pointer: 00, GenCap+ CGenEn- ChkCap+ ChkEn-
> > > > > > > >         Capabilities: [270 v1] #19
> > > > > > > > 
> > > > > > > > After that, if I do the suspend-to-ram / resume trick, I again get
> > > > > > > > the lspci output from before the 1st boot.
> > > > > > > 
> > > > > > > The Link Status change after X is stopped seems the most interesting to
> > > > > > > me.  The MSI change is probably explained by the MSI save/restore of the
> > > > > > > device, but should be harmless since MSI is disabled.  I'm a bit
> > > > > > > surprised the Correctable Error Status in the AER capability didn't get
> > > > > > > cleared.  I would have thought that a bus reset would have caused the
> > > > > > > link to retrain back to the original speed/width as well.  Let's check
> > > > > > > that we're actually getting a bus reset; try this in addition to the
> > > > > > > previous qemu patch.  This just enables debug logging for the bus reset
> > > > > > > function.  Thanks,
> > > > > > > 
> > > > > > 
> > > > > > Below are the outputs from 2 boots, VGA, load fglrx and start X.
> > > > > > (2nd time X gets killed and oops happened)
> > > > > > 
> > > > > > - 1st boot:
> > > > > > 
> > > > > > vfio: vfio_pci_hot_reset(0000:01:00.1) multi
> > > > > > vfio: 0000:01:00.1: hot reset dependent devices:
> > > > > > vfio:       0000:01:00.0 group 1
> > > > > > vfio:       0000:01:00.1 group 1
> > > > > > vfio: 0000:01:00.1 hot reset: Success
> > > > > > vfio: vfio_pci_hot_reset(0000:01:00.1) one
> > > > > > vfio: 0000:01:00.1: hot reset dependent devices:
> > > > > > vfio:       0000:01:00.0 group 1
> > > > > > vfio: vfio: found another in-use device 0000:01:00.0
> > > > > > vfio: vfio_pci_hot_reset(0000:01:00.0) one
> > > > > > vfio: 0000:01:00.0: hot reset dependent devices:
> > > > > > vfio:       0000:01:00.0 group 1
> > > > > > vfio:       0000:01:00.1 group 1
> > > > > > vfio: vfio: found another in-use device 0000:01:00.1
> > > > > > 
> > > > > > - 2nd boot:
> > > > > > 
> > > > > > vfio: vfio_pci_hot_reset(0000:01:00.1) multi
> > > > > > vfio: 0000:01:00.1: hot reset dependent devices:
> > > > > > vfio:       0000:01:00.0 group 1
> > > > > > vfio:       0000:01:00.1 group 1
> > > > > > vfio: 0000:01:00.1 hot reset: Success
> > > > > > vfio: vfio_pci_hot_reset(0000:01:00.1) one
> > > > > > vfio: 0000:01:00.1: hot reset dependent devices:
> > > > > > vfio:       0000:01:00.0 group 1
> > > > > > vfio: vfio: found another in-use device 0000:01:00.0
> > > > > > vfio: vfio_pci_hot_reset(0000:01:00.0) one
> > > > > > vfio: 0000:01:00.0: hot reset dependent devices:
> > > > > > vfio:       0000:01:00.0 group 1
> > > > > > vfio:       0000:01:00.1 group 1
> > > > > > vfio: vfio: found another in-use device 0000:01:00.1
> > > > > > 
> > > > > 
> > > > > Did you already have a chance to look into it, or is there anything
> > > > > else I can help with?
> > > > 
> > > > According to the log we're doing the bus reset on both the first and 2nd
> > > > boot (it's expected that only the "multi" call gets to success).  I'm
> > > > surprised then that the link doesn't retrain back to the original width.
> > > > You could try forcing the link to retrain.  Look at the root port
> > > > upstream from the GPU, lspci -t is handy for this.  Run lspci on the
> > > > root port to get the PCI express capability offset, then use setpci to
> > > > set the link retrain bit.  For example:
> > > > 
> > > > # lspci -tv | grep NVIDIA
> > > >            +-07.0-[03]--+-00.0  NVIDIA Corporation GK106GL [Quadro K4000]
> > > >            |            \-00.1  NVIDIA Corporation GK106 HDMI Audio Controller
> > > > 
> > > > (upstream root port is 00:07.0)
> > > > 
> > > > # lspci -v -s 7.0 | grep Capabilities
> > > >         Capabilities: [40] Subsystem: Intel Corporation 5520/5500/X58 I/O Hub PCI Express Root Port 7
> > > >         Capabilities: [60] MSI: Enable+ Count=1/2 Maskable+ 64bit-
> > > >         Capabilities: [90] Express Root Port (Slot+), MSI 00
> > > >         Capabilities: [e0] Power Management version 3
> > > >         Capabilities: [100] Advanced Error Reporting
> > > >         Capabilities: [150] Access Control Services
> > > >         Capabilities: [160] Vendor Specific Information: ID=0002 Rev=0 Len=00c <?>
> > > > 
> > > > (PCI express capability is offset 0x90, Link Control is 0x10 off that)
> > > > 
> > > > # setpci -s 7.0 a0.w
> > > > 0040
> > > > 
> > > > (retrain is bit 5, 0x20, OR'd with read value is 0x60)
> > > > 
> > > > # setpci -s 7.0 a0.w=60
> > > > 
> > > > # lspci... did it work?
> > > > 
> > > > Try doing that after the first boot to see if you can get back to a x16
> > > > link.  If that works, we may need to add something in the kernel to do
> > > > it automatically around a bus reset.  Thanks,
> > > > 
> > > 
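The register arithmetic in the retrain example above can be sketched in a few lines of Python. This is only an illustration of the offset and bit math, not part of any tool: the function names are made up here, and the constants mirror the values stated in the example (Link Control at 0x10 from the PCI Express capability, retrain at bit 5).

```python
# Hypothetical helper names; only the arithmetic comes from the example above.
PCI_EXP_LNKCTL = 0x10       # Link Control offset within the PCIe capability
PCI_EXP_LNKCTL_RL = 0x0020  # Retrain Link, bit 5


def link_control_offset(cap_offset: int) -> int:
    """Config-space offset of Link Control for a given PCIe capability offset."""
    return cap_offset + PCI_EXP_LNKCTL


def retrain_value(current: int) -> int:
    """Current Link Control value with the retrain bit OR'd in (16-bit register)."""
    return (current | PCI_EXP_LNKCTL_RL) & 0xFFFF


def setpci_retrain_command(device: str, cap_offset: int, current: int) -> str:
    """Build the setpci write that requests a link retrain."""
    offset = link_control_offset(cap_offset)
    return f"setpci -s {device} {offset:x}.w={retrain_value(current):04x}"


if __name__ == "__main__":
    # Root port 00:07.0, capability at 0x90, Link Control currently 0x0040
    print(setpci_retrain_command("7.0", 0x90, 0x0040))
```

With a capability at 0x58, as on the root port discussed further down, the same math puts Link Control at 0x68.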
> > > Well this doesn't help either and it looks like VFIO reset is setting it
> > > already back to original width. For example:
> > > 
> > >            +-02.0-[01]--+-00.0  Advanced Micro Devices, Inc. [AMD/ATI] Hawaii XT [Radeon HD 8970]
> > >            |            \-00.1  Advanced Micro Devices, Inc. [AMD/ATI] Device aac8
> > > 
> > > Before 1st run:
> > > 
> > > address@hidden:~# lspci -vvv -s 00:02.0 | grep LnkSta:
> > >           LnkSta: Speed 5GT/s, Width x16, TrErr- Train- SlotClk+ DLActive+ BWMgmt- ABWMgmt-
> > > address@hidden:~# lspci -vvv -s 01:00.0 | grep LnkSta:
> > >           LnkSta: Speed 5GT/s, Width x16, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
> > > 
> > > After power down of VM:
> > > 
> > > address@hidden:~# lspci -vvv -s 00:02.0 | grep LnkSta:
> > >           LnkSta: Speed 2.5GT/s, Width x1, TrErr- Train- SlotClk+ DLActive+ BWMgmt- ABWMgmt+
> > > address@hidden:~# lspci -vvv -s 01:00.0 | grep LnkSta:
> > >           LnkSta: Speed 2.5GT/s, Width x1, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
> > > 
> > > After 2nd start once VFIO did reset:
> > > 
> > > address@hidden:~# lspci -vvv -s 00:02.0 | grep LnkSta:
> > >           LnkSta: Speed 5GT/s, Width x16, TrErr- Train- SlotClk+ DLActive+ BWMgmt- ABWMgmt+
> > > address@hidden:~# lspci -vvv -s 01:00.0 | grep LnkSta:
> > >           LnkSta: Speed 5GT/s, Width x16, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
> > > 
> > > The only difference on the bus I see here is ABWMgmt- vs ABWMgmt+, but it
> > > shouldn't be relevant, as it is the same if I unload the fglrx module
> > > before shutting down the VM, which is the only case where I can run
> > > multiple VM reboot cycles.
> > > 
> > > So the only difference on bus is the following:
> > > 
> > > -60: 10 08 00 00 02 cd 31 00 40 00 02 b1 80 25 14 00
> > > +60: 10 08 00 00 02 cd 31 00 40 00 11 b0 80 25 14 00
> > > 
> > > 6a (before 02, after 11)
> > > 6b (before b1, after b0)
> > > 
> > > But I cannot write these parameters using setpci. My PCI express
> > > capability is at offset 0x58, plus 0x10 for Link Control, which is
> > > already set back to 40.
> > > 
> > > address@hidden:~# lspci -vvv -s 00:02.0 | grep Capa
> > >   Capabilities: [50] Power Management version 3
> > >   Capabilities: [58] Express (v2) Root Port (Slot+), MSI 00
> > >   Capabilities: [a0] MSI: Enable- Count=1/1 Maskable- 64bit-
> > >   Capabilities: [b0] Subsystem: Gigabyte Technology Co., Ltd Device 5000
> > >   Capabilities: [b8] HyperTransport: MSI Mapping Enable+ Fixed+
> > >   Capabilities: [100 v1] Vendor Specific Information: ID=0001 Rev=1 Len=010 <?>
> > >   Capabilities: [190 v1] Access Control Services
> > > 
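For reference, the two bytes that differ in the dump above (0x6a and 0x6b) sit at capability offset 0x58 + 0x12, which is the Link Status register. Its speed and width fields are read-only, which would be consistent with setpci writes having no effect there. The following is a minimal sketch that decodes those bytes using the standard Link Status bit layout; the function name and the dictionary keys are invented for this illustration.

```python
# Hypothetical decoder; field positions follow the PCIe Link Status layout.
SPEEDS = {1: "2.5GT/s", 2: "5GT/s", 3: "8GT/s"}


def decode_lnksta(low_byte: int, high_byte: int) -> dict:
    """Decode the 16-bit Link Status register from its two config-space bytes."""
    value = (high_byte << 8) | low_byte  # config space is little-endian
    return {
        "speed": SPEEDS.get(value & 0xF, "unknown"),  # bits 3:0
        "width": (value >> 4) & 0x3F,                 # bits 9:4
        "training": bool(value & (1 << 11)),
        "slot_clk": bool(value & (1 << 12)),
        "dl_active": bool(value & (1 << 13)),
        "bw_mgmt": bool(value & (1 << 14)),
        "abw_mgmt": bool(value & (1 << 15)),
    }


if __name__ == "__main__":
    # Bytes 6a/6b from the two dump lines: "02 b1" and "11 b0"
    for label, low, high in (("before", 0x02, 0xB1), ("after", 0x11, 0xB0)):
        print(label, decode_lnksta(low, high))
```

Decoded this way, the "before" bytes correspond to 5GT/s x16 and the "after" bytes to 2.5GT/s x1, matching the LnkSta lines reported by lspci.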
> > 
> > Wouldn't it be a possible solution to do a D0 -> D3 -> D0 transition for
> > devices which don't support FLR? The setpci way doesn't help me at all.
> > 
> 
> I want to renew the thread a bit as with latest slot/bus reset some
> things have changed but it still doesn't work in all cases.
> 
> #1 QEMU+OVMF (UEFI):
> 
> I've flashed my R9 290X with a UEFI-compatible BIOS, and QEMU+OVMF
> (without CSM) boots Windows 8.1 fine. Catalyst 14.12 drivers can be
> installed without issues and work fine. However, an attempt to reboot the
> VM results in the typical Windows 8.1 "Something went wrong :(" screen. The
> suspend/resume trick still works between VM reboots.
> 
> #2 QEMU (BIOS):
> 
> In this scenario I use secondary GPU passthrough (no VGA as primary
> adapter) with Windows 7. Catalyst 14.12 drivers can be installed
> without issues and work fine. I was also surprised that an attempt to
> reboot the VM worked. Windows 7 restarts fine, I see the login
> screen and there are no performance issues. But it doesn't always work:
> sometimes it works for 3-4 reboots and the next one fails with just a
> black screen (but the Windows VM is pingable and ACPI shutdown still
> works), sometimes it works for only one reboot. In all cases the
> suspend/resume trick still works.
> 
> So I would like to narrow down the problem. Is there anything I can try,
> Alex, such as collecting QEMU debug logs?
> 
> Used QEMU version is 2.2.0, kernel is 3.18.2.

There's a small change queued for v3.20 that will exclude PM reset as
an option for AMD GPUs (because it doesn't do anything), but I don't
expect this will change anything for you.  It mostly just enables reset
on release for cards like my HD8570 that report they support PM reset.

Cards like your R9 290X (if I'm remembering correctly) and my R7790
simply don't seem to reset their internal components like they're
supposed to during a bus reset.  I've reached out to AMD developers
regarding this problem; it has theoretically been passed to the
appropriate teams, but I haven't heard of any progress or resolution.

Assignment as a secondary GPU requires driver support and while AMD
seems interested in supporting GPU assignment, I haven't seen any
evidence that they're willing to do anything to make it happen.
Guessing what might be wrong in case #2 is not fun, so it's not a very
interesting case unless AMD wants to make an effort there.  Have you
tried reporting the bug to AMD?  Perhaps you can install a VNC server in
the guest so you can interact and collect data in the failure case.
Case #1 is stuck at the reset problem, and again AMD isn't offering much
help there and I'm out of ideas short of dissecting datasheets for
various root ports to figure out if we can toggle power to the slot.
Thanks,

Alex