RE: [PATCH rfcv1 00/23] intel_iommu: Enable stage-1 translation
From: Duan, Zhenzhong
Subject: RE: [PATCH rfcv1 00/23] intel_iommu: Enable stage-1 translation
Date: Mon, 22 Jan 2024 05:59:29 +0000
>-----Original Message-----
>From: Jason Wang <jasowang@redhat.com>
>Subject: Re: [PATCH rfcv1 00/23] intel_iommu: Enable stage-1 translation
>
>On Mon, Jan 15, 2024 at 6:39 PM Zhenzhong Duan
><zhenzhong.duan@intel.com> wrote:
>>
>> Hi,
>>
>> This series enables stage-1 translation support in intel_iommu, which
>> we call "modern" mode. In this mode, we don't shadow the guest page
>> table for passthrough devices but pass the stage-1 page table to the
>> host side to construct a nested domain; we also support emulated
>> devices by translating the stage-1 page table. There was an earlier
>> effort to enable this feature, see [1] for details.
>>
>> The key design is to utilize the dual-stage IOMMU translation
>> (also known as IOMMU nested translation) capability in host IOMMU.
>> As the diagram below shows, the guest I/O page table pointer in GPA
>> (guest physical address) is passed to the host and used to perform
>> the stage-1 address translation. Along with it, modifications to
>> present mappings in the guest I/O page table should be followed
>> by an IOTLB invalidation.
>>
>>         .-------------.  .---------------------------.
>>         |   vIOMMU    |  | Guest I/O page table      |
>>         |             |  '---------------------------'
>>         .----------------/
>>         | PASID Entry |--- PASID cache flush --+
>>         '-------------'                        |
>>         |             |                        V
>>         |             |           I/O page table pointer in GPA
>>         '-------------'
>>     Guest
>>     ------| Shadow |---------------------------|--------
>>           v        v                           v
>>     Host
>>         .-------------.  .------------------------.
>>         |   pIOMMU    |  |  FS for GIOVA->GPA     |
>>         |             |  '------------------------'
>>         .----------------/  |
>>         | PASID Entry |     V (Nested xlate)
>>         '----------------\.----------------------------------.
>>         |             |   | SS for GPA->HPA, unmanaged domain|
>>         |             |   '----------------------------------'
>>         '-------------'
>> Where:
>> - FS = First stage page tables
>> - SS = Second stage page tables
>> <Intel VT-d Nested translation>
>>
>> There are some interactions between VFIO and vIOMMU.
>> * vIOMMU registers PCIIOMMUOps with the PCI subsystem, which VFIO can
>>   use to register/unregister an IOMMUDevice object.
>> * VFIO registers an IOMMUFDDevice object with the vIOMMU at the vfio
>>   device realize stage; this is implemented in a prerequisite series[2].
>> * vIOMMU calls the IOMMUFDDevice interface callbacks (IOMMUFDDeviceOps)
>>   to bind/unbind the device to/from IOMMUFD backed domains, either
>>   nested or not.
>>
>> See below diagram:
>>
>>         VFIO Device                                 Intel IOMMU
>>     .-----------------.                         .-------------------.
>>     |                 |                         |                   |
>>     |       .---------|PCIIOMMUOps              |.-------------.    |
>>     |       | IOMMUFD |(set_iommu_device)       || IOMMUFD     |    |
>>     |       | Device  |------------------------>|| Device list |    |
>>     |       .---------|(unset_iommu_device)     |.-------------.    |
>>     |                 |                         |       |           |
>>     |                 |                         |       V           |
>>     |       .---------|  IOMMUFDDeviceOps       |  .---------.      |
>>     |       | IOMMUFD |            (attach_hwpt)|  | IOMMUFD |      |
>>     |       | link    |<------------------------|  | Device  |      |
>>     |       .---------|            (detach_hwpt)|  .---------.      |
>>     |                 |                         |       |           |
>>     |                 |                         |       ...         |
>>     .-----------------.                         .-------------------.
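
[Editor's illustration, not part of the original message] The two callback directions in the diagram above can be sketched in C roughly as follows. This is only a toy model: the `Demo*` types, their fields, and every function signature are assumptions made for illustration. Only the callback names (set_iommu_device, unset_iommu_device, attach_hwpt, detach_hwpt) come from the cover letter, and QEMU's real definitions may differ.

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_MAX_DEVS 8

typedef struct DemoIOMMUFDDevice DemoIOMMUFDDevice;

/* Device-side callbacks the vIOMMU invokes (hypothetical signatures). */
typedef struct DemoIOMMUFDDeviceOps {
    int (*attach_hwpt)(DemoIOMMUFDDevice *idev, unsigned hwpt_id);
    int (*detach_hwpt)(DemoIOMMUFDDevice *idev);
} DemoIOMMUFDDeviceOps;

struct DemoIOMMUFDDevice {
    const DemoIOMMUFDDeviceOps *ops;
    unsigned hwpt_id;                       /* 0 means "not attached" */
};

/* Toy vIOMMU state: just the list of registered devices. */
typedef struct DemoVIOMMU {
    DemoIOMMUFDDevice *devs[DEMO_MAX_DEVS];
} DemoVIOMMU;

/* vIOMMU-side callbacks VFIO invokes (hypothetical signatures). */
typedef struct DemoPCIIOMMUOps {
    int (*set_iommu_device)(DemoVIOMMU *s, DemoIOMMUFDDevice *idev);
    void (*unset_iommu_device)(DemoVIOMMU *s, DemoIOMMUFDDevice *idev);
} DemoPCIIOMMUOps;

int demo_attach(DemoIOMMUFDDevice *idev, unsigned hwpt_id)
{
    assert(idev && hwpt_id != 0);
    idev->hwpt_id = hwpt_id;                /* record the bound page table */
    return 0;
}

int demo_detach(DemoIOMMUFDDevice *idev)
{
    assert(idev);
    idev->hwpt_id = 0;
    return 0;
}

const DemoIOMMUFDDeviceOps demo_dev_ops = {
    .attach_hwpt = demo_attach,
    .detach_hwpt = demo_detach,
};

/* Store the device in the first free slot; -1 if the list is full. */
int demo_set_iommu_device(DemoVIOMMU *s, DemoIOMMUFDDevice *idev)
{
    for (size_t i = 0; i < DEMO_MAX_DEVS; i++) {
        if (!s->devs[i]) {
            s->devs[i] = idev;
            return 0;
        }
    }
    return -1;
}

void demo_unset_iommu_device(DemoVIOMMU *s, DemoIOMMUFDDevice *idev)
{
    for (size_t i = 0; i < DEMO_MAX_DEVS; i++) {
        if (s->devs[i] == idev) {
            s->devs[i] = NULL;
        }
    }
}

const DemoPCIIOMMUOps demo_pci_ops = {
    .set_iommu_device = demo_set_iommu_device,
    .unset_iommu_device = demo_unset_iommu_device,
};

/* vIOMMU side: attach every registered device; returns how many succeeded. */
int demo_bind_all(DemoVIOMMU *s, unsigned hwpt_id)
{
    int bound = 0;
    for (size_t i = 0; i < DEMO_MAX_DEVS; i++) {
        DemoIOMMUFDDevice *d = s->devs[i];
        if (d && d->ops->attach_hwpt(d, hwpt_id) == 0) {
            bound++;
        }
    }
    return bound;
}
```

In this model the device registers itself through `demo_pci_ops.set_iommu_device()`, and the vIOMMU later reaches back through `idev->ops` to attach or detach a page table, which mirrors the two arrows in the diagram.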
>>
>> Based on Yi's suggestion, we reworked the design for managing ioas and
>> hwpt so that it supports multiple iommufd objects and the ERRATA_772415
>> case, while still trying to share ioas and hwpt whenever possible.
>>
>> A stage-2 page table can be shared by different devices if there is
>> no conflict and the devices link to the same iommufd object, i.e. devices
>> under the same host IOMMU can share the same stage-2 page table. If there
>> is a conflict, i.e. one device is in non cache coherency mode while
>> the others are not, it requires a separate stage-2 page table in
>> non-CC mode.
>>
>> The SPR platform has ERRATA_772415, which requires no readonly mappings
>> in the stage-2 page table. This series supports creating a VTDIOASContainer
>> with no readonly mappings. I'm not sure whether there is a rare case where
>> only some IOMMUs on a multi-IOMMU host have ERRATA_772415, but this
>> design can survive even in that case.
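
[Editor's illustration, not part of the original message] The sharing rule described above (reuse a stage-2 page table when the iommufd object, the errata constraint, and the cache-coherency mode all match, otherwise create a new one) could be sketched as the lookup below. All struct and function names are hypothetical stand-ins rather than the series' actual VTDIOASContainer/VTDS2Hwpt code. As a simplifying assumption, an errata device only ever uses an RW-only container and a non-errata device only a RW&RO one.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for a stage-2 hwpt. */
typedef struct DemoS2Hwpt {
    bool cc;                        /* true: cache-coherent page table */
    int users;                      /* number of devices sharing it */
    struct DemoS2Hwpt *next;
} DemoS2Hwpt;

/* Hypothetical stand-in for an IOAS container. */
typedef struct DemoContainer {
    int iommufd_id;                 /* iommufd object backing this container */
    bool rw_only;                   /* true: no readonly mappings (errata) */
    DemoS2Hwpt *hwpts;
    struct DemoContainer *next;
} DemoContainer;

typedef struct DemoDev {
    int iommufd_id;
    bool cc;
    bool errata;
} DemoDev;

/* Find a container matching (iommufd, errata), or create one. */
DemoContainer *demo_get_container(DemoContainer **head, const DemoDev *d)
{
    for (DemoContainer *c = *head; c; c = c->next) {
        if (c->iommufd_id == d->iommufd_id && c->rw_only == d->errata) {
            return c;
        }
    }
    DemoContainer *c = calloc(1, sizeof(*c));
    if (!c) {
        return NULL;
    }
    c->iommufd_id = d->iommufd_id;
    c->rw_only = d->errata;
    c->next = *head;
    *head = c;
    return c;
}

/* Share an existing stage-2 hwpt when CC mode matches, else create one. */
DemoS2Hwpt *demo_get_s2_hwpt(DemoContainer **head, const DemoDev *d)
{
    assert(head && d);
    DemoContainer *c = demo_get_container(head, d);
    if (!c) {
        return NULL;
    }
    for (DemoS2Hwpt *h = c->hwpts; h; h = h->next) {
        if (h->cc == d->cc) {
            h->users++;
            return h;
        }
    }
    DemoS2Hwpt *h = calloc(1, sizeof(*h));
    if (!h) {
        return NULL;
    }
    h->cc = d->cc;
    h->users = 1;
    h->next = c->hwpts;
    c->hwpts = h;
    return h;
}
```

With this lookup, two CC devices on the same iommufd end up sharing one page table, while a non-CC device, an errata device, or a device on a different iommufd each get their own.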
>>
>> See below example diagram for a full view:
>>
>>       IntelIOMMUState
>>              |
>>              V
>>     .------------------.    .------------------.    .-------------------.
>>     | VTDIOASContainer |--->| VTDIOASContainer |--->| VTDIOASContainer  |-->...
>>     | (iommufd0,RW&RO) |    | (iommufd1,RW&RO) |    | (iommufd0,RW only)|
>>     .------------------.    .------------------.    .-------------------.
>>              |                       |                              |
>>              |                       .-->...                        |
>>              V                                                      V
>>       .-------------------.    .-------------------.          .---------------.
>>       |   VTDS2Hwpt(CC)   |--->| VTDS2Hwpt(non-CC) |-->...    | VTDS2Hwpt(CC) |-->...
>>       .-------------------.    .-------------------.          .---------------.
>>           |            |               |                            |
>>           |            |               |                            |
>>     .-----------.  .-----------.  .------------.              .------------.
>>     | IOMMUFD   |  | IOMMUFD   |  | IOMMUFD    |              | IOMMUFD    |
>>     | Device(CC)|  | Device(CC)|  | Device     |              | Device(CC) |
>>     | (iommufd0)|  | (iommufd0)|  | (non-CC)   |              | (errata)   |
>>     |           |  |           |  | (iommufd0) |              | (iommufd0) |
>>     .-----------.  .-----------.  .------------.              .------------.
>>
>> This series is also prerequisite work for vSVA, i.e. sharing a
>> guest application address space with passthrough devices.
>>
>> To enable "modern" mode, you only need to add "x-scalable-mode=modern",
>> i.e. -device intel-iommu,x-scalable-mode=modern,...
>>
>> A passthrough device should use the iommufd backend to work in "modern" mode,
>> i.e. -object iommufd,id=iommufd0 -device vfio-pci,iommufd=iommufd0,...
>>
>> If the host doesn't support nested translation, qemu will fail
>> with an unsupported report.
>>
>> Test done:
>> - devices hotplug/unplug
>> - different devices linked to different iommufds
>>
>> PATCH1-2: Some preparing work to update header and IOMMUFD uAPI
>> PATCH3-4: Initialize vfio IOMMUFDDevice interface and pass to vIOMMU
>> PATCH5: Introduce a placeholder variable for scalable modern mode
>> PATCH6: Sync host cap/ecap with vIOMMU default cap/ecap in modern mode
>> PATCH7-22: Implement first stage page table for passthrough and emulated device
>
>Can we split the series and start from the emulated devices (and have
>a qtest for that)? This might help for reviewing.
Sure, will do in rfcv2.
Thanks
Zhenzhong