
Re: [Qemu-devel] [PATCH qemu v7 2/4] vfio/pci: Relax DMA map errors for MMIO regions


From: Alexey Kardashevskiy
Subject: Re: [Qemu-devel] [PATCH qemu v7 2/4] vfio/pci: Relax DMA map errors for MMIO regions
Date: Wed, 7 Mar 2018 13:17:50 +1100
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:52.0) Gecko/20100101 Thunderbird/52.4.0

On 26/02/18 19:36, Alexey Kardashevskiy wrote:
> On 19/02/18 13:46, Alexey Kardashevskiy wrote:
>> On 16/02/18 16:28, David Gibson wrote:
>>> On Wed, Feb 14, 2018 at 08:55:41AM -0700, Alex Williamson wrote:
>>>> On Wed, 14 Feb 2018 19:09:16 +1100
>>>> Alexey Kardashevskiy <address@hidden> wrote:
>>>>
>>>>> On 14/02/18 12:33, David Gibson wrote:
>>>>>> On Tue, Feb 13, 2018 at 07:20:56PM +1100, Alexey Kardashevskiy wrote:  
>>>>>>> On 13/02/18 16:41, David Gibson wrote:  
>>>>>>>> On Tue, Feb 13, 2018 at 04:36:30PM +1100, David Gibson wrote:  
>>>>>>>>> On Tue, Feb 13, 2018 at 12:15:52PM +1100, Alexey Kardashevskiy wrote: 
>>>>>>>>>  
>>>>>>>>>> On 13/02/18 03:06, Alex Williamson wrote:  
>>>>>>>>>>> On Mon, 12 Feb 2018 18:05:54 +1100
>>>>>>>>>>> Alexey Kardashevskiy <address@hidden> wrote:
>>>>>>>>>>>  
>>>>>>>>>>>> On 12/02/18 16:19, David Gibson wrote:  
>>>>>>>>>>>>> On Fri, Feb 09, 2018 at 06:55:01PM +1100, Alexey Kardashevskiy wrote:
>>>>>>>>>>>>>> At the moment, if vfio_memory_listener is registered in the
>>>>>>>>>>>>>> system memory address space, it maps/unmaps every RAM memory
>>>>>>>>>>>>>> region for DMA.  It expects system page size aligned memory
>>>>>>>>>>>>>> sections so vfio_dma_map would not fail, and so far this has
>>>>>>>>>>>>>> been the case.  A mapping failure would be fatal.  A side effect
>>>>>>>>>>>>>> of such behavior is that some MMIO pages would silently not be
>>>>>>>>>>>>>> mapped.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> However we are going to change MSIX BAR handling so we will end
>>>>>>>>>>>>>> up having non-aligned sections in vfio_memory_listener (more
>>>>>>>>>>>>>> details are in the next patch) and vfio_dma_map will exit QEMU.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> In order to avoid fatal failures on what previously was not a
>>>>>>>>>>>>>> failure and was just silently ignored, this checks the section
>>>>>>>>>>>>>> alignment against the smallest supported IOMMU page size and
>>>>>>>>>>>>>> prints an error if it is not aligned; it also prints an error if
>>>>>>>>>>>>>> vfio_dma_map failed despite the page size check.  Neither error
>>>>>>>>>>>>>> is fatal; only MMIO RAM regions (aka "RAM device" regions) are
>>>>>>>>>>>>>> checked.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> If the number of errors printed is overwhelming, MSIX relocation
>>>>>>>>>>>>>> could be used to avoid excessive error output.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> This is unlikely to cause any behavioral change.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Signed-off-by: Alexey Kardashevskiy <address@hidden>    
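For reference, the alignment check described in that commit message amounts to
roughly the sketch below. It is only an illustration under assumptions: the
container->pgsizes field, the helper name and the header list are
approximations of QEMU internals rather than the exact code in the patch.

    #include "qemu/osdep.h"
    #include "qemu/host-utils.h"        /* ctz64() */
    #include "qemu/int128.h"            /* int128_get64() */
    #include "exec/memory.h"            /* MemoryRegionSection */
    #include "hw/vfio/vfio-common.h"    /* VFIOContainer */

    /*
     * Sketch only: true if a "RAM device" (MMIO) section is aligned to the
     * smallest IOMMU page size supported by the container, i.e. vfio_dma_map()
     * has a chance of succeeding.  "pgsizes" is assumed to be a bitmap of the
     * host IOMMU's supported page sizes.
     */
    static bool vfio_ram_device_section_aligned(VFIOContainer *container,
                                                MemoryRegionSection *section)
    {
        hwaddr pgmask = ((hwaddr)1 << ctz64(container->pgsizes)) - 1;

        return !((section->offset_within_address_space & pgmask) ||
                 (section->offset_within_region & pgmask) ||
                 (int128_get64(section->size) & pgmask));
    }

A misaligned RAM device section would then be reported and skipped in the
region_add hook, and a vfio_dma_map() failure for such a section would likewise
be reported rather than treated as fatal, which is the non-fatal behavior the
commit message describes.
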
>>>>>>>>>>>>>
>>>>>>>>>>>>> There are some relatively superficial problems noted below.
>>>>>>>>>>>>>
>>>>>>>>>>>>> But more fundamentally, this feels like it's extending an existing
>>>>>>>>>>>>> hack past the point of usefulness.
>>>>>>>>>>>>>
>>>>>>>>>>>>> The explicit check for is_ram_device() here has always bothered 
>>>>>>>>>>>>> me -
>>>>>>>>>>>>> it's not like a real bus bridge magically knows whether a target
>>>>>>>>>>>>> address maps to RAM or not.
>>>>>>>>>>>>>
>>>>>>>>>>>>> What I think is really going on is that even for systems without 
>>>>>>>>>>>>> an
>>>>>>>>>>>>> IOMMU, it's not really true to say that the PCI address space maps
>>>>>>>>>>>>> directly onto address_space_memory.  Instead, there's a large, but
>>>>>>>>>>>>> much less than 2^64 sized, "upstream window" at address 0 on the 
>>>>>>>>>>>>> PCI
>>>>>>>>>>>>> bus, which is identity mapped to the system bus.  Details will 
>>>>>>>>>>>>> vary
>>>>>>>>>>>>> with the system, but in practice we expect nothing but RAM to be 
>>>>>>>>>>>>> in
>>>>>>>>>>>>> that window.  Addresses not within that window won't be mapped to 
>>>>>>>>>>>>> the
>>>>>>>>>>>>> system bus but will just be broadcast on the PCI bus and might be
>>>>>>>>>>>>> picked up as a p2p transaction.    
>>>>>>>>>>>>
>>>>>>>>>>>> Currently this p2p works only via the IOMMU; direct p2p is not
>>>>>>>>>>>> possible as the guest needs to know physical MMIO addresses to
>>>>>>>>>>>> make p2p work, and it does not.
>>>>>>>>>>>
>>>>>>>>>>> /me points to the Direct Translated P2P section of the ACS spec,
>>>>>>>>>>> though it's as prone to spoofing by the device as ATS.  In any
>>>>>>>>>>> case, p2p reflected from the IOMMU is still p2p and offloads the
>>>>>>>>>>> CPU even if bandwidth suffers vs bare metal, depending on whether
>>>>>>>>>>> the data doubles back over any links.  Thanks,
>>>>>>>>>>
>>>>>>>>>> Sure, I was just saying that p2p via the IOMMU won't be as simple
>>>>>>>>>> as a broadcast on the PCI bus; the IOMMU needs to be programmed in
>>>>>>>>>> advance to make this work, and currently that broadcast won't work
>>>>>>>>>> for the passed-through devices.
>>>>>>>>>
>>>>>>>>> Well, sure, p2p in a guest with passthrough devices clearly needs to
>>>>>>>>> be translated through the IOMMU (and p2p from a passthrough to an
>>>>>>>>> emulated device is essentially impossible).
>>>>>>>>>
>>>>>>>>> But.. what does that have to do with this code?  This is the memory
>>>>>>>>> area watcher, looking for memory regions being mapped directly into
>>>>>>>>> the PCI space.  NOT IOMMU regions, since those are handled separately
>>>>>>>>> by wiring up the IOMMU notifier.  This will only trigger if RAM-like,
>>>>>>>>> non-RAM regions are put into PCI space *not* behind an IOMMU.
>>>>>>>>
>>>>>>>> Duh, sorry, realised I was mixing up host and guest IOMMU.  I guess
>>>>>>>> the point here is that this will map RAM-like devices into the host
>>>>>>>> IOMMU when there is no guest IOMMU, allowing p2p transactions between
>>>>>>>> passthrough devices (though not from passthrough to emulated devices). 
>>>>>>>>  
>>>>>>>
>>>>>>> Correct.
>>>>>>>  
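(For context, the behavior David describes looks roughly like the sketch
below; it is an approximation of QEMU internals rather than a quote of
vfio_listener_region_add(), and it reuses the same assumed headers and types
as the earlier sketch. The point is that guest IOMMU regions get a notifier,
while RAM-like regions, including RAM device BARs, are DMA-mapped straight
into the host IOMMU, which is what enables p2p between passthrough devices.)

    /*
     * Sketch, not the actual listener code: how a section that appears in
     * the PCI address space is handled when there is no guest IOMMU.
     */
    static void vfio_region_add_sketch(VFIOContainer *container,
                                       MemoryRegionSection *section)
    {
        MemoryRegion *mr = section->mr;

        if (memory_region_is_iommu(mr)) {
            /* Guest IOMMU region: wire up a notifier and replay mappings */
            /* memory_region_register_iommu_notifier(mr, ...); */
            return;
        }

        if (!memory_region_is_ram(mr)) {
            return;    /* MMIO emulated by QEMU: nothing to DMA-map */
        }

        /* RAM or a RAM device (e.g. an assigned device's BAR): map it for
         * DMA in the host IOMMU so other passthrough devices can reach it. */
        void *vaddr = memory_region_get_ram_ptr(mr) +
                      section->offset_within_region;
        vfio_dma_map(container, section->offset_within_address_space,
                     int128_get64(section->size), vaddr, mr->readonly);
    }

The patch under discussion only changes what happens when that last
vfio_dma_map() call cannot succeed for a misaligned RAM device section.
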
>>>>>>>>
>>>>>>>> The conditions still seem kind of awkward to me, but I guess it makes
>>>>>>>> sense.  
>>>>>>>
>>>>>>> Is it time to split this listener into a RAM listener and a PCI bus
>>>>>>> listener?
>>>>>>
>>>>>> I'm not really sure what you mean by that.
>>>>>>   
>>>>>>> On x86 it listens on the "memory" AS, on spapr on the
>>>>>>> "address@hidden" AS; this will just create more confusion over time...
>>>>>>
>>>>>> Hm, it's still logically the same AS in each case - the PCI address
>>>>>> space.  It's just that in the x86 case that happens to be the same as
>>>>>> the memory AS (at least when there isn't a guest-side IOMMU).
>>>>>
>>>>> Hm. Ok.
>>>>>
>>>>>> I do wonder if that's really right even for x86 without IOMMU.  The
>>>>>> PCI address space is identity mapped to the "main" memory address
>>>>>> space there, but I'm not sure it's mapped th *all* of the main memory  
>>>>>
>>>>> What should have been in the place of "th" above? :)
>>>>>
>>>>>> address space, probably just certain ranges of it.  That's what I was
>>>>>> suggesting earlier in the thread.  
>>>>>
>>>>> afaict vfio is trying to mmap every RAM MR == KVM memory slot, no ranges
>>>>> or anything like that. I am trying to figure out what is not correct
>>>>> with the patch. Thanks,
>>>>
>>>> It is possible for x86 systems to have translation applied to MMIO vs
>>>> RAM such that the processor view and device view of MMIO are different,
>>>> putting them into separate address spaces, but this is not typical and
>>>> not available on the class of chipsets that QEMU emulates for PC.
>>>> Therefore I think it's correct that MMIO and RAM all live in one big
>>>> happy, flat address space as they do now (assuming the guest IOMMU is
>>>> not present or disabled).  Thanks,
>>>
>>> Ah.. I think the thing I was missing is that on PC (at least with
>>> typical chipsets) *all* MMIO essentially comes from PCI space.  Even
>>> the legacy devices are essentially ISA which is treated as being on a
>>> bridge under the PCI space.  On non-x86 there are often at least a
>>> handful of MMIO devices that aren't within the PCI space - say, the
>>> PCI host bridge itself at least.  x86 avoids that by using the
>>> separate IO space, which is also a bit weird in that it's
>>> simultaneously directly attached to the CPU (and so doesn't need bridge
>>> configuration), but also identified with the legacy IO space treated as
>>> being under two bridges (host->PCI, PCI->ISA) for other purposes.
>>>
>>> Maybe it's just me, but I find it makes more sense if I think of
>>> it as the two ASes being equal because on PC the system address map
>>> (address_space_memory) is made equal to the PCI address map, rather
>>> than the PCI address map being made equal to the system one.
>>
>> It makes more sense to me too; we just do not exploit/expose this on SPAPR
>> - the PCI address space only has 2xIOMMU and 1xMSI windows and that's it,
>> no MMIO which is mapped to the system AS only (adding those to the PCI AS
>> is a little tricky as mentioned elsewhere).
>>
>> Anyway, I still wonder - what needs to be addressed in the 2/4 patch in
>> order to proceed?
> 
> Ping?
> 


Ping?


-- 
Alexey


