
From: Liu, Yi L
Subject: Re: [Qemu-devel] [RESEND PATCH 2/6] memory: introduce AddressSpaceOps and IOMMUObject
Date: Wed, 20 Dec 2017 14:32:42 +0800
User-agent: Mutt/1.5.21 (2010-09-15)

On Mon, Dec 18, 2017 at 10:22:18PM +1100, David Gibson wrote:
> On Mon, Dec 18, 2017 at 05:17:35PM +0800, Liu, Yi L wrote:
> > On Mon, Dec 18, 2017 at 05:14:42PM +1100, David Gibson wrote:
> > > On Thu, Nov 16, 2017 at 04:57:09PM +0800, Liu, Yi L wrote:
> > > > Hi David,
> > > > 
> > > > On Tue, Nov 14, 2017 at 11:59:34AM +1100, David Gibson wrote:
> > > > > On Mon, Nov 13, 2017 at 04:28:45PM +0800, Peter Xu wrote:
> > > > > > On Mon, Nov 13, 2017 at 04:56:01PM +1100, David Gibson wrote:
> > > > > > > On Fri, Nov 03, 2017 at 08:01:52PM +0800, Liu, Yi L wrote:
> > > > > > > > From: Peter Xu <address@hidden>
> > > > > > > > 
> > > > > > > > AddressSpaceOps is similar to MemoryRegionOps; it just lets
> > > > > > > > address spaces store arch-specific hooks.
> > > > > > > > 
> > > > > > > > The first hook I would like to introduce is iommu_get(). It
> > > > > > > > returns the IOMMUObject behind the AddressSpace.
> > > > > > > > 
> > > > > > > > For systems that have IOMMUs, we create a special address space
> > > > > > > > per device, different from the system default address space
> > > > > > > > (please refer to pci_device_iommu_address_space()). Normally
> > > > > > > > when that happens, one specific IOMMU (or, say, translation
> > > > > > > > unit) stands right behind that new address space.
> > > > > > > > 
> > > > > > > > iommu_get() fetches that unit behind the address space. Here,
> > > > > > > > the unit is defined as an IOMMUObject, which includes only a
> > > > > > > > notifier_list so far but may be extended in the future. Along
> > > > > > > > with IOMMUObject, a new IOMMU notifier mechanism is introduced;
> > > > > > > > it would be used for virt-SVM. IOMMUObject can further have an
> > > > > > > > IOMMUObjectOps, which is similar to MemoryRegionOps except that
> > > > > > > > IOMMUObjectOps does not rely on MemoryRegion.
> > > > > > > > 
> > > > > > > > Signed-off-by: Peter Xu <address@hidden>
> > > > > > > > Signed-off-by: Liu, Yi L <address@hidden>
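
[For concreteness, a minimal stand-alone sketch of the structures the commit
message above describes. The type and hook names (AddressSpaceOps, IOMMUObject,
iommu_get) come from the description; the backing pointer and the demo wiring
are assumptions for illustration, not the actual QEMU implementation.]

    /*
     * Minimal sketch of the proposed hooks.  Names follow the commit
     * message; the backing pointer and main() are invented for the demo.
     */
    #include <stdio.h>

    typedef struct AddressSpace AddressSpace;

    typedef struct IOMMUObject {
        void *notifier_list;        /* only member so far, per the description */
    } IOMMUObject;

    /* Like MemoryRegionOps, but hung off an AddressSpace. */
    typedef struct AddressSpaceOps {
        IOMMUObject *(*iommu_get)(AddressSpace *as);
    } AddressSpaceOps;

    struct AddressSpace {
        const char *name;
        const AddressSpaceOps *ops; /* arch-specific hooks */
        IOMMUObject *iommu;         /* assumed backing pointer for the demo */
    };

    static IOMMUObject *demo_iommu_get(AddressSpace *as)
    {
        return as->iommu;           /* the unit standing behind this AS */
    }

    int main(void)
    {
        IOMMUObject unit = { .notifier_list = NULL };
        const AddressSpaceOps ops = { .iommu_get = demo_iommu_get };
        AddressSpace dev_as = { .name = "pci-dev-as", .ops = &ops,
                                .iommu = &unit };

        IOMMUObject *obj = dev_as.ops->iommu_get(&dev_as);
        printf("address space %s -> IOMMUObject %p\n", dev_as.name, (void *)obj);
        return 0;
    }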
> > > > > > > 
> > > > > > > Hi, sorry I didn't reply to the earlier postings of this after our
> > > > > > > discussion in China.  I've been sick several times and very busy.
> > > > > > > 
> > > > > > > I still don't feel like there's an adequate explanation of exactly
> > > > > > > what an IOMMUObject represents.  Obviously it can represent more
> > > > > > > than a single translation window - since that's represented by the
> > > > > > > IOMMUMR.  But what exactly do all the MRs - or whatever else -
> > > > > > > that are represented by the IOMMUObject have in common, from a
> > > > > > > functional point of view?
> > > > > > > 
> > > > > > > Even understanding the SVM stuff better than I did, I don't really
> > > > > > > see why an AddressSpace is an obvious unit to have an IOMMUObject
> > > > > > > associated with it.
> > > > > > 
> > > > > > Here's what I thought about it: IOMMUObject was planned to be the
> > > > > > abstraction of the hardware translation unit, which sits at a higher
> > > > > > level than the translated address spaces.  Say, each PCI device can
> > > > > > have its own translated address space.  However, multiple PCI
> > > > > > devices can share the same translation unit, which handles the
> > > > > > translation requests from the different devices.  That's the case
> > > > > > for Intel platforms.  We introduced this IOMMUObject because
> > > > > > sometimes we want to do something with that translation unit rather
> > > > > > than a specific device, in which case we need a general IOMMU device
> > > > > > handle.
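
[To make that sharing concrete, a tiny stand-alone illustration; the device
names and the wiring are invented and this is not QEMU code. Each device gets
its own translated address space, but both resolve to the same
translation-unit handle, which is the role IOMMUObject is meant to play.]

    /* Several per-device (translated) address spaces sharing one
     * translation unit.  All names here are invented for illustration. */
    #include <stdio.h>

    typedef struct TranslationUnit {
        const char *name;           /* e.g. one vIOMMU on an Intel platform */
    } TranslationUnit;

    typedef struct DeviceAS {
        const char *dev;            /* the PCI device owning this AS */
        TranslationUnit *unit;      /* unit standing behind the AS */
    } DeviceAS;

    int main(void)
    {
        TranslationUnit intel_iommu = { .name = "intel-iommu" };

        /* Each device gets its own translated address space ... */
        DeviceAS as_nic  = { .dev = "nic",  .unit = &intel_iommu };
        DeviceAS as_nvme = { .dev = "nvme", .unit = &intel_iommu };

        /* ... but operations aimed at the translation unit itself need one
         * shared handle, which is what IOMMUObject is meant to provide. */
        printf("%s and %s share unit: %s\n", as_nic.dev, as_nvme.dev,
               (as_nic.unit == as_nvme.unit) ? as_nic.unit->name : "no");
        return 0;
    }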
> > > > > 
> > > > > Ok, but what does "hardware translation unit" mean in practice?  The
> > > > > guest neither knows nor cares which bits of IOMMU translation happen
> > > > > to be included in the same bundle of silicon.  It only cares what the
> > > > > behaviour is.  What behavioural characteristics does a single
> > > > > IOMMUObject have?
> > > > > 
> > > > > > IIRC one issue left over from last time's discussion was that there
> > > > > > could be more complicated IOMMU models.  E.g., one device's DMA
> > > > > > request can be translated in a nested fashion by two or more IOMMUs,
> > > > > > and the current proposal cannot really handle that complicated
> > > > > > hierarchy.  I'm just thinking whether we can start from a simple
> > > > > > model (say, we don't allow nested IOMMUs, and actually we don't even
> > > > > > allow multiple IOMMUs so far), then evolve from that point in the
> > > > > > future.
> > > > > > 
> > > > > > Also, I thought there was something you mentioned about this
> > > > > > approach not being correct for Power systems, but I can't really
> > > > > > remember the details...  Anyways, I think this is not the only
> > > > > > approach to solving the problem, and I believe any better new idea
> > > > > > would be greatly welcomed as well. :)
> > > > > 
> > > > > So, some of my initial comments were based on a misunderstanding of
> > > > > what was proposed here - since discussing this with Yi at LinuxCon
> > > > > Beijing, I have a better idea of what's going on.
> > > > > 
> > > > > On POWER - or rather the "pseries" platform, which is paravirtualized -
> > > > > we can have multiple vIOMMU windows (usually 2) for a single virtual
> > > > 
> > > > On POWER, the DMA isolation is done by allocating different DMA windows
> > > > to different isolation domains?  And a single isolation domain may
> > > > include multiple DMA windows?  So with or without an IOMMU, there is
> > > > only a single DMA address space shared by all the devices in the
> > > > system?  The isolation mechanism is as described above?
> > > 
> > > No, the multiple windows are completely unrelated to how things are
> > > isolated.
> > 
> > I'm afraid I chose the wrong word by using "DMA window".
> > Actually, by "DMA window" I mean address ranges in an IOVA
> > address space.
> 
> Yes, so did I.  By one window I mean one contiguous range of IOVA addresses.
> 
> > Anyhow, let me re-shape my understanding of the POWER IOMMU and
> > make sure we are on the same page.
> > 
> > > 
> > > Just like on x86, each IOMMU domain has independent IOMMU mappings.
> > > The only difference is that IBM calls the domains "partitionable
> > > endpoints" (PEs) and they tend to be statically created at boot time,
> > > rather than runtime generated.
> > 
> > Does the POWER IOMMU also have an IOVA concept?  A device can use an IOVA
> > to access memory, and the IOMMU translates the IOVA to an address within
> > the system physical address space?
> 
> Yes.  When I say the "PCI address space" I mean the IOVA space.
> 
> > > The windows are about what addresses in PCI space are translated by
> > > the IOMMU.  If the device generates a PCI cycle, only certain
> > > addresses will be mapped by the IOMMU to DMA - other addresses will
> > > correspond to other devices' MMIOs, MSI vectors, maybe other things.
> > 
> > I guess the windows you mentioned here are the address ranges within the
> > system physical address space, since you also mentioned MMIOs etc.
> 
> No.  I mean ranges within the PCI space == IOVA space. It's simplest
> to understand with traditional PCI.  A cycle on the bus doesn't know
> whether the destination is a device or memory, it just has an address
> - a PCI memory address.  Part of that address range is mapped to
> system RAM, optionally with an IOMMU translating it.  Other parts of
> that address space are used for devices.
> 
> With PCI-E things get more complicated, but the conceptual model is
> the same.
> 
> > > The set of addresses translated by the IOMMU need not be contiguous.
> > 
> > I suppose you mean the output addresses of the IOMMU need not be
> > contiguous?
> 
> No.  I mean the input addresses of the IOMMU.
> 
> > > Or, there could be two IOMMUs on the bus, each accepting different
> > > address ranges.  These two situations are not distinguishable from the
> > > guest's point of view.
> > > 
> > > So for a typical PAPR setup, the device can access system RAM either
> > > via DMA in the range 0..1GiB (the "32-bit window") or in the range
> > > 2^59..2^59+<something> (the "64-bit window").  Typically the 32-bit
> > > window has mappings dynamically created by drivers, and the 64-bit
> > > window has all of system RAM mapped 1:1, but that's entirely up to the
> > > OS, it can map each window however it wants.
> > > 
> > > 32-bit devices (or "64 bit" devices which don't actually implement
> > > enough of the address bits) will only be able to use the 32-bit window,
> > > of course.
> > > 
> > > MMIOs of other devices, the "magic" MSI-X addresses belonging to the
> > > host bridge and other things exist outside those ranges.  Those are
> > > just the ranges which are used to DMA to RAM.
> > > 
> > > Each PE (domain) can see a different version of what's in each
> > > window.
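
[A small stand-alone sketch of that PAPR layout may help. The 32-bit window
base and size match the description above; the 64-bit window size is an
assumption for the demo (real systems negotiate it), and the helper names are
invented.]

    /* Illustrative check of which PAPR DMA window (if any) an IOVA falls in.
     * The 32-bit window matches the description above; the 64-bit window
     * size is assumed for the demo. */
    #include <stddef.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <inttypes.h>

    typedef struct DMAWindow {
        const char *name;
        uint64_t base;
        uint64_t size;
    } DMAWindow;

    static const DMAWindow papr_windows[] = {
        { "32-bit window", 0x0ULL,     1ULL << 30 },  /* 0..1GiB, dynamic maps */
        { "64-bit window", 1ULL << 59, 1ULL << 40 },  /* size assumed: 1TiB    */
    };

    static const DMAWindow *window_for_iova(uint64_t iova)
    {
        for (size_t i = 0; i < sizeof(papr_windows) / sizeof(papr_windows[0]); i++) {
            const DMAWindow *w = &papr_windows[i];
            if (iova >= w->base && iova - w->base < w->size) {
                return w;
            }
        }
        return NULL;    /* not a DMA-to-RAM address: MMIO, MSI, or unmapped */
    }

    int main(void)
    {
        uint64_t test[] = { 0x1000, (1ULL << 59) + 0x1000, 2ULL << 30 };
        for (size_t i = 0; i < 3; i++) {
            const DMAWindow *w = window_for_iova(test[i]);
            printf("iova 0x%016" PRIx64 " -> %s\n", test[i],
                   w ? w->name : "outside IOMMU windows");
        }
        return 0;
    }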
> > 
> > If I'm correct so far, a PE actually defines a mapping between an address
> > range of an address space (aka the IOVA address space) and an address
> > range of the system physical address space.
> 
> No.  A PE means several things, but basically it is an isolation
> domain, like an Intel IOMMU domain.  Each PE has an independent set of
> IOMMU mappings which translate part of the PCI address space to system
> memory space.
> 
> > Then my question is: does each PE define a separate IOVA address space
> > which is flat from 0 to 2^AW - 1, where AW is the address width?  As a
> > reference, a VT-d domain defines a flat address space for each domain.
> 
> Partly.  Each PE has an address space which all devices in the PE see.
> Only some of that address space is mapped to system memory though,
> other parts are occupied by devices, others are unmapped.
> 
> Only the parts mapped by the IOMMU vary between PEs - the other parts
> of the address space will be identical for all PEs on the host

Thanks, this comment clears it up for me. This is different from what we have
on VT-d.

> bridge.  However for POWER guests (not for hosts) there is exactly one
> PE for each virtual host bridge.
> 
> > > In fact, if I understand the "IO hole" correctly, the situation on x86
> > > isn't very different.  It has a window below the IO hole and a second
> > > window above the IO hole.  The addresses within the IO hole go to
> > > (32-bit) devices on the PCI bus, rather than being translated by the
> > 
> > If you mean the "IO hole" within the system physical address space, I think
> > the answer is yes.
> 
> Well, really I mean the IO hole in PCI address space.  Because system
> address space and PCI memory space were traditionally identity mapped
> on x86 this is easy to confuse though.
> 
> > > IOMMU to RAM addresses.  Because the gap is smaller between the two
> > > windows, I think we get away without really modelling this detail in
> > > qemu though.
> > > 
> > > > > PCI host bridge.  Because of the paravirtualization, the mapping to
> > > > > hardware is fuzzy, but for passthrough devices they will both be
> > > > > implemented by the IOMMU built into the physical host bridge.  That
> > > > > isn't important to the guest, though - all operations happen at the
> > > > > window level.
> > > > 
> > > > On VT-d, with an IOMMU present, each isolation domain has its own address
> > > > space. That's why we talked more at the address space level, and the IOMMU
> > > > makes the difference. Those are the behavioural characteristics a single
> > > > IOMMU translation unit has, and thus what an IOMMUObject is going to have.
> > > 
> > > Right, that's the same on POWER.  But the IOMMU only translates *some*
> > > addresses within the address space, not all of them.  The rest will go
> > > to other PCI devices or be unmapped, but won't go to RAM.
> > > 
> > > That's why the IOMMU should really be associated with an MR (or
> > > several MRs), not an AddressSpace, it only translates some addresses.
> > 
> > If I'm correct so far, I believe the major difference between VT-d and the
> > POWER IOMMU is that a VT-d isolation domain is a flat address space, while
> > a PE on POWER is something different (I need your input here as I'm not
> > sure about it).  Maybe it's like there is one flat address space, and each
> > PE takes some address ranges and maps those ranges to different system
> > physical address ranges.
> 
> No, it's really not that different.  In both cases (without virt-SVM)
> there's a system memory address space, and a PCI address space for
> each domain / PE.  There are one or more "outbound" windows in system
> memory space that map system memory cycles to PCI cycles (used by the
> CPU to access MMIO) and one or more "inbound" (DMA) windows in PCI
> memory space which map PCI cycles onto system memory cycles (used by
> devices to access system memory).
> 
> On old-style PCs, both inbound and outbound windows were (mostly)
> identity maps.  On POWER they are not.
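
[As a toy model of the inbound direction only: an inbound DMA window maps a
range of PCI/IOVA addresses onto system memory addresses. All offsets and
sizes below are made up for illustration; on old-style PCs the mapping is
roughly identity, on POWER it is not.]

    /* Toy model of an "inbound" (DMA) window: a range of PCI/IOVA addresses
     * that maps onto system memory addresses. */
    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    typedef struct InboundWindow {
        uint64_t pci_base;  /* start of the window in PCI (IOVA) space */
        uint64_t size;
        uint64_t sys_base;  /* where it lands in system memory space   */
    } InboundWindow;

    static bool dma_translate(const InboundWindow *w, uint64_t pci_addr,
                              uint64_t *sys_addr)
    {
        if (pci_addr < w->pci_base || pci_addr - w->pci_base >= w->size) {
            return false;   /* cycle goes to another device, or nowhere */
        }
        *sys_addr = w->sys_base + (pci_addr - w->pci_base);
        return true;
    }

    int main(void)
    {
        /* Identity-ish mapping (old PC) vs. a relocated mapping (POWER-like). */
        InboundWindow pc    = { .pci_base = 0x0,        .size = 1ULL << 30,
                                .sys_base = 0x0 };
        InboundWindow power = { .pci_base = 1ULL << 59, .size = 1ULL << 30,
                                .sys_base = 0x80000000ULL };
        uint64_t out;

        if (dma_translate(&pc, 0x1000, &out)) {
            printf("PC:    PCI 0x1000 -> system 0x%llx\n", (unsigned long long)out);
        }
        if (dma_translate(&power, (1ULL << 59) + 0x1000, &out)) {
            printf("POWER: PCI 2^59+0x1000 -> system 0x%llx\n", (unsigned long long)out);
        }
        return 0;
    }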
> 
> > > > > The other thing that bothers me here is the way it's attached to an
> > > > > AddressSpace.
> > > > 
> > > > My consideration is that the IOMMU handles AddressSpaces.  The DMA
> > > > address space is also an address space managed by the IOMMU.
> > > 
> > > No, it's not.  It's a region (or several) within the overall PCI
> > > address space.  Other things in the address space, such as other
> > > devices' BARs, exist independently of the IOMMU.
> > > 
> > > It's not something that could really work with PCI-E, I think, but
> > > with a more traditional PCI bus there's no reason you couldn't have
> > > multiple IOMMUs listening on different regions of the PCI address
> > > space.
> > 
> > I think the point here is that on POWER, the input addresses of the IOMMUs
> > are actually in the same address space?
> 
> I'm not sure what you mean, but I don't think so.  Each PE has its own
> IOMMU input address space.
> 
> > What the IOMMU does is map the different ranges to
> > different system physical address ranges. So it's as you mentioned: multiple
> > IOMMUs listen on different regions of a PCI address space.
> 
> No.  That could be the case in theory, but it's not the usual case.
> 
> Or rather it depends what you mean by "an IOMMU".  For PAPR guests,
> both IOVA 0..1GiB and 2^59..(somewhere) are mapped to system memory,
> but with separate page tables.  You could consider that two IOMMUs (we
> mostly treat it that way in qemu).  However, all the mapping is
> handled by the same host bridge with 2 sets of page tables per PE, so
> you could also call it one IOMMU.
> 
> This is what I'm getting at when I say that "one IOMMU" is not a
> clearly defined unit.
> 
> > While for VT-d, that's not the case. The input addresses of the IOMMUs may
> > not be in the same address space. As I mentioned, each IOMMU domain on VT-d
> > is a separate address space. So for VT-d, the IOMMUs are actually listening
> > to different address spaces. That's why we want to have an address space
> > level abstraction of the IOMMU.
> > 
> > > 
> > > > That's why we believe it is fine to
> > > > associate the DMA address space with an IOMMUObject.
> > > 
> > > > >  IIUC how SVM works, the whole point is that the device
> > > > > no longer writes into a specific PCI address space.  Instead, it
> > > > > writes directly into a process address space.  So it seems to me more
> > > > > that SVM should operate at the PCI level, and disassociate the device
> > > > > from the normal PCI address space entirely, rather than hooking up
> > > > > something via that address space.

After thinking about it more, I agree that it is not suitable to hook up
something for the 1st level via the PCI address space. Once 1st- and 2nd-level
translation is exposed to the guest, a device would write to multiple address
spaces, and the PCI address space is only one of them. I think your reply in
another email is a good start; let me reply with my thoughts under that email.

Regards,
Yi L

> > > > 
> > > > As Peter replied, we still need the PCI address space; it would be used
> > > > to build up the 2nd-level page table which would be used in nested
> > > > translation.
> > > > 
> > > > Thanks,
> > > > Yi L
> > > > 
> > > > > 
> > > > 
> > > 
> > 
> > Regards,
> > Yi L
> > 
> 
> -- 
> David Gibson                  | I'll have my music baroque, and my code
> david AT gibson.dropbear.id.au        | minimalist, thank you.  NOT _the_ _other_
>                               | _way_ _around_!
> http://www.ozlabs.org/~dgibson




