From: Joao Martins
Subject: Re: [PATCH v5 0/5] i386/pc: Fix creation of >= 1010G guests on AMD systems with IOMMU
Date: Thu, 23 Jun 2022 00:18:06 +0100

On 6/22/22 23:37, Alex Williamson wrote:
> On Fri, 20 May 2022 11:45:27 +0100
> Joao Martins <joao.m.martins@oracle.com> wrote:
>> v4[5] -> v5:
>> * Fixed the 32-bit build(s) (patch 1, Michael Tsirkin)
>> * Fixed wrong reference (patch 4) to TCG_PHYS_BITS in code comment and
>>   commit message
>>
>> ---
>>
>> This series lets Qemu spawn i386 guests with >= 1010G with VFIO,
>> particularly when running on AMD systems with an IOMMU.
>>
>> Since Linux v5.4, VFIO validates whether the IOVA in the DMA_MAP ioctl is
>> valid and returns -EINVAL when it is not. On x86, Intel hosts aren't
>> particularly affected by this extra validation. But AMD systems with an
>> IOMMU have a hole at the 1TB boundary which is *reserved* for
>> HyperTransport I/O addresses, located here: FD_0000_0000h - FF_FFFF_FFFFh.
>> See the IOMMU manual [1], specifically section '2.1.2 IOMMU Logical
>> Topology', Table 3, for what those addresses mean.
>>
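For orientation, the quoted range works out to the top 12 GiB of the first
TiB, which is where the ~1010G figure below comes from once the usual hole
below 4G is accounted for. A quick standalone sketch (the constants are just
the window quoted above) prints 1012, 1024 and 12:

    #include <inttypes.h>
    #include <stdint.h>
    #include <stdio.h>

    int main(void)
    {
        /* HyperTransport reserved window quoted above (AMD IOMMU spec, 2.1.2). */
        uint64_t ht_start = 0xFD00000000ULL;   /* FD_0000_0000h */
        uint64_t ht_end   = 0xFFFFFFFFFFULL;   /* FF_FFFF_FFFFh */

        printf("start: %" PRIu64 " GiB\n", ht_start >> 30);                /* 1012 */
        printf("end:   %" PRIu64 " GiB\n", (ht_end + 1) >> 30);            /* 1024 */
        printf("size:  %" PRIu64 " GiB\n", (ht_end - ht_start + 1) >> 30); /*   12 */
        return 0;
    }
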
>> VFIO DMA_MAP calls in this IOVA address range fall through this check and
>> hence return -EINVAL, consequently failing the creation of guests bigger
>> than 1010G. Example of the failure:
>>
>> qemu-system-x86_64: -device vfio-pci,host=0000:41:10.1,bootindex=-1: VFIO_MAP_DMA: -22
>> qemu-system-x86_64: -device vfio-pci,host=0000:41:10.1,bootindex=-1: vfio 0000:41:10.1:
>>      failed to setup container for group 258: memory listener initialization failed:
>>              Region pc.ram: vfio_dma_map(0x55ba53e7a9d0, 0x100000000, 0xff30000000, 0x7ed243e00000) = -22 (Invalid argument)
>>
>> Prior to v5.4, we could map to these IOVAs *but* that's still not the right
>> thing to do and could trigger certain IOMMU events (e.g.
>> INVALID_DEVICE_REQUEST), or spurious guest VF failures from the resultant
>> IOMMU target abort (see Errata 1155[2]), as documented in the links below.
>>
>> This small series tries to address that by dealing with this AMD-specific
>> 1TB hole, but rather than handling it like the 4G hole, it instead relocates
>> RAM above 4G to start above 1TB if the maximum RAM range crosses the HT
>> reserved range. It is organized as follows:
>>
>> patch 1: Introduce @above_4g_mem_start, which defaults to 4 GiB, as the
>>          starting address of RAM above the 4G boundary.
>>
>> patches 2-3: Move pci-host qdev creation to before pc_memory_init(), to get
>>           access to pci_hole64_size. The actual pci-host initialization is
>>           kept as is; only the qdev_new is moved.
>>
>> patch 4: Change @above_4g_mem_start to 1TiB if we are on AMD and the max
>> possible address crosses the HT region. Errors out if phys-bits is too low,
>> which is only the case for >= 1010G configurations or anything else that
>> crosses the HT region.
>>
>> patch 5: Ensure valid IOVAs only on new machine types, but not on older
>> ones (<= v7.0.0).
>>
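A minimal standalone sketch of the relocation decision described in patches 1
and 4 above; the helper name, the AMD check, and the exact crossing condition
are illustrative assumptions here, not the series' actual code:

    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    #define GiB              (1ULL << 30)
    #define DEFAULT_4G_START (4 * GiB)
    #define AMD_HT_START     0xFD00000000ULL   /* 1012 GiB, start of the HT hole */
    #define ABOVE_1TB_START  (1ULL << 40)      /* 1 TiB */

    /*
     * Pick the start of RAM above 4G: if the highest guest-physical address
     * (above-4G RAM plus the 64-bit PCI hole) would cross the HyperTransport
     * reserved window on an AMD host, relocate that RAM to start above 1 TiB.
     */
    static uint64_t above_4g_mem_start(bool host_is_amd,
                                       uint64_t above_4g_mem_size,
                                       uint64_t pci_hole64_size)
    {
        uint64_t maxaddr = DEFAULT_4G_START + above_4g_mem_size + pci_hole64_size;

        if (host_is_amd && maxaddr >= AMD_HT_START) {
            return ABOVE_1TB_START;
        }
        return DEFAULT_4G_START;
    }

    int main(void)
    {
        /* e.g. ~1.5 TiB of above-4G RAM and a 64 GiB PCI hole on an AMD host */
        uint64_t start = above_4g_mem_start(true, 1536 * GiB, 64 * GiB);

        printf("above-4G RAM starts at %llu GiB\n",
               (unsigned long long)(start / GiB));   /* prints 1024 */
        return 0;
    }
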
>> The 'consequence' of this approach is that we may need more than the default
>> phys-bits: e.g. a guest with > 1010G will have most of its RAM after the 1TB
>> address, consequently needing 41 phys-bits as opposed to the default of 40
>> (TCG_PHYS_ADDR_BITS). Today there's already a precedent of depending on the
>> user to pick the right value of phys-bits (regardless of this series), so we
>> warn in case phys-bits aren't enough. Finally, the CMOS loses its meaning for
>> the above-4G RAM blocks, but it was mentioned over the RFC that CMOS is only
>> useful for very old SeaBIOS.
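To make the 41-bit figure concrete, a small illustrative calculation (again,
not the series' code) of the phys-bits needed once RAM starts at 1 TiB:

    #include <stdint.h>
    #include <stdio.h>

    /* Smallest n such that 2^n covers every address strictly below `limit`. */
    static unsigned phys_bits_needed(uint64_t limit)
    {
        unsigned bits = 0;

        while (bits < 64 && (1ULL << bits) < limit) {
            bits++;
        }
        return bits;
    }

    int main(void)
    {
        /* RAM relocated to start at 1 TiB, with e.g. 16 GiB placed above it. */
        uint64_t limit = (1ULL << 40) + (16ULL << 30);

        /* 40 bits only cover addresses below 1 TiB; anything at or above it needs 41. */
        printf("phys-bits needed: %u\n", phys_bits_needed(limit));   /* 41 */
        return 0;
    }
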
>>
>> Additionally, the reserved region is added to E820 if the relocation is done.
> 
> I was helping a user on irc yesterday who was assigning a bunch of GPUs
> on an AMD system and was not specifying an increased PCI hole and
> therefore was not triggering the relocation.  The result was that the
> VM doesn't know about this special range and given their guest RAM
> size, firmware was mapping GPU BARs overlapping this reserved range
> anyway.  I didn't see any evidence that this user was doing anything
> like booting with pci=nocrs to blatantly ignore the firmware provided
> bus resources.
> 
> To avoid this sort of thing, shouldn't this hypertransport range always
> be marked reserved regardless of whether the relocation is done?
> 
Yeap, I think that's the right thing to do. We were alluding to that in patch 4.

I can switch said patch to use IS_AMD() together with a phys-bits check to add
the range to e820.
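
For illustration only, that could look roughly like the fragment below; it
assumes QEMU's e820_add_entry()/E820_RESERVED helpers, and host_is_amd /
phys_bits are placeholder names for whatever gating ends up in the patch:

    /*
     * Sketch: unconditionally reserve the HyperTransport window in the
     * guest's E820 on AMD hosts, independent of whether RAM was relocated.
     * host_is_amd and phys_bits are placeholders for the checks above.
     */
    if (host_is_amd && phys_bits >= 40) {   /* the window needs 40 bits to address */
        e820_add_entry(0xFD00000000ULL,     /* FD_0000_0000h */
                       0x300000000ULL,      /* 12 GiB, up to FF_FFFF_FFFFh */
                       E820_RESERVED);
    }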

But in practice, right now, this is going to be merely informative and doesn't
change the outcome, as OVMF ignores reserved ranges if I understood that code
correctly.

Relocation is most effective at avoiding this reserved-range overlap issue on
guests with less than 1010GiB.

> vfio-pci won't generate a fatal error when MMIO mappings fail, so this
> scenario can be rather subtle.  NB, it also did not resolve this user's
> problem to specify the PCI hole size and activate the relocation, so
> this was not necessarily the issue they were fighting, but I noted it
> as an apparent gap in this series.  Thanks,

So I take it that even after the user expanded the PCI hole64 size, and thus
the GPU BARs were placed in a non-reserved range... they still saw the MMIO
mappings fail?

        Joao


