[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
Re: [Qemu-devel] [PATCH v2 5/5] vfio/pci: Allow relocating MSI-X MMIO
From: |
Auger Eric |
Subject: |
Re: [Qemu-devel] [PATCH v2 5/5] vfio/pci: Allow relocating MSI-X MMIO |
Date: |
Fri, 19 Jan 2018 16:27:42 +0100 |
User-agent: |
Mozilla/5.0 (X11; Linux x86_64; rv:45.0) Gecko/20100101 Thunderbird/45.4.0 |
Hi Alex,
On 19/01/18 16:23, Alex Williamson wrote:
> Hi Eric,
>
> On Fri, 19 Jan 2018 10:50:53 +0100
> Auger Eric <address@hidden> wrote:
>
>> Hi Alex,
>>
>> On 10/01/18 20:02, Alex Williamson wrote:
>>> Recently proposed vfio-pci kernel changes (v4.16) remove the
>>> restriction preventing userspace from mmap'ing PCI BARs in areas
>>> overlapping the MSI-X vector table. This change is primarily intended
>>> to benefit host platforms which make use of system page sizes larger
>>> than the PCI spec recommendation for alignment of MSI-X data
>>> structures (ie. not x86_64). In the case of POWER systems, the SPAPR
>>> spec requires the VM to program MSI-X using hypercalls, rendering the
>>> MSI-X vector table unused in the VM view of the device. However,
>>> ARM64 platforms also support 64KB pages and rely on QEMU emulation of
>>> MSI-X. Regardless of the kernel driver allowing mmaps overlapping
>>> the MSI-X vector table, emulation of the MSI-X vector table also
>>> prevents direct mapping of device MMIO spaces overlapping this page.
>>> Thanks to the fact that PCI devices have a standard self discovery
>>> mechanism, we can try to resolve this by relocating the MSI-X data
>>> structures, either by creating a new PCI BAR or extending an existing
>>> BAR and updating the MSI-X capability for the new location. There's
>>> even a very slim chance that this could benefit devices which do not
>>> adhere to the PCI spec alignment guidelines on x86_64 systems.
>>>
>>> This new x-msix-relocation option accepts the following choices:
>>>
>>> off: Disable MSI-X relocation, use native device config (default)
>>> auto: Use a known good combination for the platform/device (none yet)
>>> bar0..bar5: Specify the target BAR for MSI-X data structures
>>>
>>> If compatible, the target BAR will either be created or extended and
>>> the new portion will be used for MSI-X emulation.
>>>
>>> The first obvious user question with this option is how to determine
>>> whether a given platform and device might benefit from this option.
>>> In most cases, the answer is that it won't, especially on x86_64.
>>> Devices often dedicate an entire BAR to MSI-X and therefore no
>>> performance sensitive registers overlap the MSI-X area. Take for
>>> example:
>>>
>>> # lspci -vvvs 0a:00.0
>>> 0a:00.0 Ethernet controller: Intel Corporation I350 Gigabit Network
>>> Connection
>>> ...
>>> Region 0: Memory at db680000 (32-bit, non-prefetchable) [size=512K]
>>> Region 3: Memory at db7f8000 (32-bit, non-prefetchable) [size=16K]
>>> ...
>>> Capabilities: [70] MSI-X: Enable+ Count=10 Masked-
>>> Vector table: BAR=3 offset=00000000
>>> PBA: BAR=3 offset=00002000
>>>
>>> This device uses the 16K bar3 for MSI-X with the vector table at
>>> offset zero and the pending bits arrary at offset 8K, fully honoring
>> array
>>> the PCI spec alignment guidance. The data sheet specifically refers
>>> to this as an MSI-X BAR. This device would not see a benefit from
>>> MSI-X relocation regardless of the platform, regardless of the page
>>> size.
>>>
>>> However, here's another example:
>>>
>>> # lspci -vvvs 02:00.0
>>> 02:00.0 Serial Attached SCSI controller: xxxxxxxx
>>> ...
>>> Region 0: I/O ports at c000 [size=256]
>>> Region 1: Memory at ef640000 (64-bit, non-prefetchable) [size=64K]
>>> Region 3: Memory at ef600000 (64-bit, non-prefetchable) [size=256K]
>>> ...
>>> Capabilities: [c0] MSI-X: Enable+ Count=16 Masked-
>>> Vector table: BAR=1 offset=0000e000
>>> PBA: BAR=1 offset=0000f000
>>>
>>> Here the MSI-X data structures are placed on separate 4K pages at the
>>> end of a 64KB BAR. If our host page size is 4K, we're likely fine,
>>> but at 64KB page size, MSI-X emulation at that location prevents the
>>> entire BAR from being directly mapped into the VM address space.
>>> Overlapping performance sensitive registers then starts to be a very
>>> likely scenario on such a platform. At this point, the user could
>>> enable tracing on vfio_region_read and vfio_region_write to determine
>>> more conclusively if device accesses are being trapped through QEMU.
>>>
>>> Upon finding a device and platform in need of MSI-X relocation, the
>>> next problem is how to choose target PCI BAR to host the MSI-X data
>>> structures. A few key rules to keep in mind for this selection
>>> include:
>>>
>>> * There are only 6 BAR slots, bar0..bar5
>>> * 64-bit BARs occupy two BAR slots, 'lspci -vvv' lists the first slot
>>> * PCI BARs are always a power of 2 in size, extending == doubling
>>> * The maximum size of a 32-bit BAR is 2GB
>>> * MSI-X data structures must reside in an MMIO BAR
>>>
>>> Using these rules, we can evaluate each BAR of the second example
>>> device above as follows:
>>>
>>> bar0: I/O port BAR, incompatible with MSI-X tables
>>> bar1: BAR could be extended, incurring another 64KB of MMIO
>>> bar2: Unavailable, bar1 is 64-bit, this register is used by bar1
>>> bar3: BAR could be extended, incurring another 256KB of MMIO
>>> bar4: Unavailable, bar3 is 64bit, this register is used by bar3
>>> bar5: Available, empty BAR, minimum additional MMIO
>>>
>>> A secondary optimization we might wish to make in relocating MSI-X
>>> is to minimize the additional MMIO required for the device, therefore
>>> we might test the available choices in order of preference as bar5,
>>> bar1, and finally bar3. The original proposal for this feature
>>> included an 'auto' option which would choose bar5 in this case, but
>>> various drivers have been found that make assumptions about the
>>> properties of the "first" BAR or the size of BARs such that there
>>> appears to be no foolproof automatic selection available, requiring
>>> known good combinations to be sourced from users. This patch is
>>> pre-enabled for an 'auto' selection making use of a validated lookup
>>> table, but no entries are yet identified.
>>>
>>> Signed-off-by: Alex Williamson <address@hidden>
>>> ---
>>> hw/vfio/pci.c | 101
>>> ++++++++++++++++++++++++++++++++++++++++++++++++++
>>> hw/vfio/pci.h | 1
>>> hw/vfio/trace-events | 2 +
>>> 3 files changed, 103 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
>>> index 20252ea7aeb7..7171ba18213c 100644
>>> --- a/hw/vfio/pci.c
>>> +++ b/hw/vfio/pci.c
>>> @@ -1352,6 +1352,100 @@ static void
>>> vfio_pci_fixup_msix_region(VFIOPCIDevice *vdev)
>>> }
>>> }
>>>
>>> +static void vfio_pci_relocate_msix(VFIOPCIDevice *vdev, Error **errp)
>>> +{
>>> + int target_bar = -1;
>>> + size_t msix_sz;
>>> +
>>> + if (!vdev->msix || vdev->msix_relo == OFF_AUTOPCIBAR_OFF) {
>>> + return;
>>> + }
>>> +
>>> + /* The actual minimum size of MSI-X structures */
>>> + msix_sz = (vdev->msix->entries * PCI_MSIX_ENTRY_SIZE) +
>>> + (QEMU_ALIGN_UP(vdev->msix->entries, 64) / 8);
>>> + /* Round up to host pages, we don't want to share a page */
>>> + msix_sz = REAL_HOST_PAGE_ALIGN(msix_sz);
>>> + /* PCI BARs must be a power of 2 */
>>> + msix_sz = pow2ceil(msix_sz);
>>> +
>>> + if (vdev->msix_relo == OFF_AUTOPCIBAR_AUTO) {
>>> + /*
>>> + * TODO: Lookup table for known devices.
>>> + *
>>> + * Logically we might use an algorithm here to select the BAR
>>> adding
>>> + * the least additional MMIO space, but we cannot programatically
>>> + * predict the driver dependency on BAR ordering or sizing,
>>> therefore
>>> + * 'auto' becomes a lookup for combinations reported to work.
>>> + */
>>> + if (target_bar < 0) {
>>> + error_setg_errno(errp, EINVAL, "No automatic MSI-X relocation "
>>> + "available for device %04x:%04x",
>>> + vdev->vendor_id, vdev->device_id);
>> don't you want error_setg here and below?
>
> What's the benefit of one vs the other? I wasn't sure which to use.
well that's nit-picking: using the _errno with EINVAL arg will output
"Invalid Argument" at the end of the error message which does not bring
much value to the end-user.
>
>>> + return;
>>> + }
>>> + } else {
>>> + target_bar = (int)(vdev->msix_relo - OFF_AUTOPCIBAR_BAR0);
>>> + }
>>> +
>>> + /* I/O port BARs cannot host MSI-X structures */
>>> + if (vdev->bars[target_bar].ioport) {
>>> + error_setg_errno(errp, EINVAL, "Invalid MSI-X relocation BAR %d, "
>>> + "I/O port BAR", target_bar);
>>
>>> + return;
>>> + }
>>> +
>>> + /* Cannot use a BAR in the "shadow" of a 64-bit BAR */
>>> + if (!vdev->bars[target_bar].size &&
>>> + target_bar > 0 && vdev->bars[target_bar - 1].mem64) {
>>> + error_setg_errno(errp, EINVAL, "Invalid MSI-X relocation BAR %d, "
>>> + "consumed by 64-bit BAR %d", target_bar,
>>> + target_bar - 1);
>>> + return;
>>> + }
>>> +
>>> + /* 2GB max size for 32-bit BARs */
>>> + if (vdev->bars[target_bar].size > (1 * 1024 * 1024 * 1024) &&
>> nit: the comment versus the check is a bit misleading. If I understand
>> correctly, the x2 size would be gt 2GB.
>
> Right, if the BAR is >1G already (ie. 2G), there's no room to make it
> bigger. I'll add "... cannot double if already > 1G".
>
>>> + !vdev->bars[target_bar].mem64) {
>>> + error_setg_errno(errp, EINVAL, "Invalid MSI-X relocation BAR %d, "
>>> + "no space to extend 32-bit BAR", target_bar);
>>> + return;
>>> + }
>>> +
>>> + /*
>>> + * If adding a new BAR, test if we can make it 64bit. We make it
>>> + * prefetchable since QEMU MSI-X emulation has no read side effects
>>> + * and doing so makes mapping more flexible.
>>> + */
>>> + if (!vdev->bars[target_bar].size) {
>>> + if (target_bar < (PCI_ROM_SLOT - 1) &&
>>> + !vdev->bars[target_bar + 1].size) {
>>> + vdev->bars[target_bar].mem64 = true;
>>> + vdev->bars[target_bar].type = PCI_BASE_ADDRESS_MEM_TYPE_64;
>>> + }
>>> + vdev->bars[target_bar].type |= PCI_BASE_ADDRESS_MEM_PREFETCH;
>>> + vdev->bars[target_bar].size = msix_sz;
>>> + vdev->msix->table_offset = 0;
>>> + } else {
>>> + vdev->bars[target_bar].size = MAX(vdev->bars[target_bar].size * 2,
>>> + msix_sz * 2);
>>> + /*
>>> + * Due to above size calc, MSI-X always starts halfway into the
>>> BAR,
>>> + * which will always be a separate host page.
>>
>> nit: the spec gives this recommendation.
>>
>> "If a dedicated Base Address register is not feasible, it is recommended
>> that a function isolate the MSI-X structures from the non-MSI-X
>> structures with aligned 8 KB ranges rather than
>> the mandatory aligned 4 KB ranges."
>>
>> In some corner circumstances, with 4kB - which is not our main use case
>> - this may not be enforced here.
>
> I think this was an attempt to future proof hardware designs for larger
> page sizes, but 8K turned out to be mostly (entirely?) skipped as a
> common page size. We're not building real hardware and I can't think
> of any advantage to using a minimum of 8K alignment if the host page
> size is 4K, can you? Thanks,
OK no worries
Thanks
Eric
>
> Alex
>
>
>>> + */
>>> + vdev->msix->table_offset = vdev->bars[target_bar].size / 2;
>>> + }
>>> +
>>> + vdev->msix->table_bar = target_bar;
>>> + vdev->msix->pba_bar = target_bar;
>>> + /* Requires 8-byte alignment, but PCI_MSIX_ENTRY_SIZE guarantees that
>>> */
>>> + vdev->msix->pba_offset = vdev->msix->table_offset +
>>> + (vdev->msix->entries *
>>> PCI_MSIX_ENTRY_SIZE);
>>> +
>>> + trace_vfio_msix_relo(vdev->vbasedev.name,
>>> + vdev->msix->table_bar, vdev->msix->table_offset);
>>> +}
>>> +
>>> /*
>>> * We don't have any control over how pci_add_capability() inserts
>>> * capabilities into the chain. In order to setup MSI-X we need a
>>> @@ -1430,6 +1524,8 @@ static void vfio_msix_early_setup(VFIOPCIDevice
>>> *vdev, Error **errp)
>>> vdev->msix = msix;
>>>
>>> vfio_pci_fixup_msix_region(vdev);
>>> +
>>> + vfio_pci_relocate_msix(vdev, errp);
>>> }
>>>
>>> static int vfio_msix_setup(VFIOPCIDevice *vdev, int pos, Error **errp)
>>> @@ -2845,13 +2941,14 @@ static void vfio_realize(PCIDevice *pdev, Error
>>> **errp)
>>>
>>> vfio_pci_size_rom(vdev);
>>>
>>> + vfio_bars_prepare(vdev);
>>> +
>>> vfio_msix_early_setup(vdev, &err);
>>> if (err) {
>>> error_propagate(errp, err);
>>> goto error;
>>> }
>>>
>>> - vfio_bars_prepare(vdev);
>>> vfio_bars_register(vdev);
>>>
>>> ret = vfio_add_capabilities(vdev, errp);
>>> @@ -3041,6 +3138,8 @@ static Property vfio_pci_dev_properties[] = {
>>> DEFINE_PROP_UNSIGNED_NODEFAULT("x-nv-gpudirect-clique", VFIOPCIDevice,
>>> nv_gpudirect_clique,
>>> qdev_prop_nv_gpudirect_clique, uint8_t),
>>> + DEFINE_PROP_OFF_AUTO_PCIBAR("x-msix-relocation", VFIOPCIDevice,
>>> msix_relo,
>>> + OFF_AUTOPCIBAR_OFF),
>>> /*
>>> * TODO - support passed fds... is this necessary?
>>> * DEFINE_PROP_STRING("vfiofd", VFIOPCIDevice, vfiofd_name),
>>> diff --git a/hw/vfio/pci.h b/hw/vfio/pci.h
>>> index dcdb1a806769..588381f201b4 100644
>>> --- a/hw/vfio/pci.h
>>> +++ b/hw/vfio/pci.h
>>> @@ -135,6 +135,7 @@ typedef struct VFIOPCIDevice {
>>> (1 << VFIO_FEATURE_ENABLE_IGD_OPREGION_BIT)
>>> int32_t bootindex;
>>> uint32_t igd_gms;
>>> + OffAutoPCIBAR msix_relo;
>>> uint8_t pm_cap;
>>> uint8_t nv_gpudirect_clique;
>>> bool pci_aer;
>>> diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events
>>> index fae096c0724f..437ccdd29053 100644
>>> --- a/hw/vfio/trace-events
>>> +++ b/hw/vfio/trace-events
>>> @@ -16,6 +16,8 @@ vfio_msix_pba_disable(const char *name) " (%s)"
>>> vfio_msix_pba_enable(const char *name) " (%s)"
>>> vfio_msix_disable(const char *name) " (%s)"
>>> vfio_msix_fixup(const char *name, int bar, uint64_t start, uint64_t end) "
>>> (%s) MSI-X region %d mmap fixup [0x%"PRIx64" - 0x%"PRIx64"]"
>>> +vfio_msix_relo_cost(const char *name, int bar, uint64_t cost) " (%s) BAR
>>> %d cost 0x%"PRIx64""
>>> +vfio_msix_relo(const char *name, int bar, uint64_t offset) " (%s) BAR %d
>>> offset 0x%"PRIx64""
>>> vfio_msi_enable(const char *name, int nr_vectors) " (%s) Enabled %d MSI
>>> vectors"
>>> vfio_msi_disable(const char *name) " (%s)"
>>> vfio_pci_load_rom(const char *name, unsigned long size, unsigned long
>>> offset, unsigned long flags) "Device %s ROM:\n size: 0x%lx, offset: 0x%lx,
>>> flags: 0x%lx"
>>>
>
- [Qemu-devel] [PATCH v2 0/5] vfio/pci: MSI-X MMIO relocation, Alex Williamson, 2018/01/10
- [Qemu-devel] [PATCH v2 1/5] vfio/pci: Fixup VFIOMSIXInfo comment, Alex Williamson, 2018/01/10
- [Qemu-devel] [PATCH v2 2/5] vfio/pci: Add base BAR MemoryRegion, Alex Williamson, 2018/01/10
- [Qemu-devel] [PATCH v2 3/5] vfio/pci: Emulate BARs, Alex Williamson, 2018/01/10
- [Qemu-devel] [PATCH v2 4/5] qapi: Create DEFINE_PROP_OFF_AUTO_PCIBAR, Alex Williamson, 2018/01/10
- [Qemu-devel] [PATCH v2 5/5] vfio/pci: Allow relocating MSI-X MMIO, Alex Williamson, 2018/01/10
- Re: [Qemu-devel] [PATCH v2 0/5] vfio/pci: MSI-X MMIO relocation, Alexey Kardashevskiy, 2018/01/14
- Re: [Qemu-devel] [PATCH v2 0/5] vfio/pci: MSI-X MMIO relocation, Auger Eric, 2018/01/19