
Re: [Qemu-devel] [RFC] QOMification of AXI streams


From: Benjamin Herrenschmidt
Subject: Re: [Qemu-devel] [RFC] QOMification of AXI streams
Date: Tue, 12 Jun 2012 09:46:40 +1000

On Mon, 2012-06-11 at 17:29 -0500, Anthony Liguori wrote:

> I don't know that we really have bit masking done right in the memory API.

That's not a big deal:

> When we add a subregion, it always removes the offset from the address when
> it dispatches.  This more often than not works out well but for what you're
> describing above, it sounds like you'd really want to get an adjusted size
> (that could be transformed).
> 
> Today we generate a linear dispatch table.  This prevents us from applying 
> device-level transforms.

subregions being relative to the parent would probably work for me.

Typically I have something like:

  0x1234_0000_0000 .. 0x1234_3fff_ffff : window to PCI memory 

Which generates PCI cycles into the range:

  0x0000_c000_0000 .. 0x0000_ffff_ffff

Which is a combination of bit masking & offset, ie, the HW algorithm is
to forward some bits (0x3fff_ffff in this case) and replace the other
ones (with 0xc000_0000 in this case).

Is that properly represented in the current scheme by a subregion ?
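
For illustration, here is that transform spelled out with the numbers
above (just a sketch of the HW algorithm, not actual qemu code):

  #include <stdint.h>

  #define PCI_WINDOW_MASK  0x3fffffffULL   /* bits forwarded as-is */
  #define PCI_WINDOW_BASE  0xc0000000ULL   /* bits substituted on the PCI side */

  /* A CPU access at 0x1234_0000_0000 + off comes out on PCI as
   * 0xc000_0000 + off, for off in [0, 0x3fff_ffff]. */
  static uint64_t cpu_to_pci(uint64_t cpu_addr)
  {
      return PCI_WINDOW_BASE | (cpu_addr & PCI_WINDOW_MASK);
  }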

> > We somewhat implement that in spapr_pci today since it works but I
> > don't quite understand how :-) Or rather, the terminology "alias" seems
> > to be fairly bogus, we aren't talking about aliases here...
> >
> > So today we create a memory region with an "alias" (whatever that means)
> > that is [B...B+S] and add a subregion which is [A...A+S]. That seems to
> > work but it's obscure.
> >
> > If I was to implement that, I would make it so that the struct
> > MemoryRegion used in that hierarchy contains the address in the local
> > domain -and- the transformed address in the CPU domain, so you can still
> > sort them by CPU addresses for quick access and make this offsetting a
> > standard property of any memory region since it's very common that
> > busses drop address bits along the way.
> >
> > Now, if you want to use that structure for DMA, what you need to do
> > first is when an access happens, walk up the region tree and scan for
> > all siblings at every level, which can be costly.
> 
> So if you stick with the notion of subregions, you would still have a single
> MemoryRegion at the PCI bus layer that has all of its children as
> subregions.  Presumably that "scan for all siblings" is a binary search
> which shouldn't really be that expensive considering that we're likely to
> have a shallow depth in the memory hierarchy.

Possibly but on every DMA access ...

> > Additionally to handle iommu's etc... you need the option for a given
> > memory region to have functions to perform the transformation in the
> > upstream direction.
> 
> I think that transformation function lives in the bus layer MemoryRegion.
> It's a bit tricky though because you need some sort of notion of "who is
> asking".  So you need:
> 
> dma_memory_write(MemoryRegion *parent, DeviceState *caller,
>                  const void *data, size_t size);

Why do you need the parent argument ? Can't it be implied from the
DeviceState ? Or do you envision devices sitting on multiple busses ? Or
it's just a matter that each device "class" will store that separately ?

Also you want the parent to be the direct parent P2P bridge, no ? Not the host
bridge ? Ie. to be completely correct, you want to look at every sibling
device under a given bridge, then go up if there is no match, etc...
until you reach the PHB at which point you hit the iommu and eventually
the system fabric.
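
To make that walk concrete, here is the kind of thing I picture, purely
as a sketch; the node structure and the helpers are made up, they are
not the real MemoryRegion API:

  #include <stdint.h>
  #include <stddef.h>

  /* Made-up node type, NOT the real MemoryRegion layout. */
  typedef struct DmaNode DmaNode;
  struct DmaNode {
      uint64_t addr, size;        /* decode window in the parent's space */
      DmaNode *parent;            /* upstream bridge, NULL at the PHB */
      DmaNode **children;         /* devices/bridges below this one */
      int nchildren;
      uint64_t (*upstream_xform)(DmaNode *n, uint64_t a); /* iommu/mask/offset */
  };

  /* Provided elsewhere in this sketch. */
  void peer_write(DmaNode *dev, uint64_t off, const void *buf, size_t len);
  void fabric_write(uint64_t addr, const void *buf, size_t len);

  static void dma_write_upstream(DmaNode *bus, uint64_t addr,
                                 const void *buf, size_t len)
  {
      /* 1. Does a sibling under this bridge decode the address ?
       * (in real life you'd skip the child the cycle came from) */
      for (int i = 0; i < bus->nchildren; i++) {
          DmaNode *c = bus->children[i];
          if (addr >= c->addr && addr - c->addr < c->size) {
              peer_write(c, addr - c->addr, buf, len);   /* peer-to-peer hit */
              return;
          }
      }
      /* 2. Nobody claimed it: apply the bridge transform, go one level up. */
      if (bus->parent) {
          uint64_t up = bus->upstream_xform ? bus->upstream_xform(bus, addr)
                                            : addr;
          dma_write_upstream(bus->parent, up, buf, len);
          return;
      }
      /* 3. We reached the PHB: hand the cycle to the system fabric. */
      fabric_write(addr, buf, len);
  }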

> This could be simplified at each layer via:
> 
> void pci_device_write(PCIDevice *dev, const void *data, size_t size) {
>      dma_memory_write(dev->bus->mr, DEVICE(dev), data, size);
> }

Ok.
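
So from a device emulation's point of view it ends up being no more
than, say (with a hypothetical device state "d"):

  pci_device_write(&d->pci_dev, buf, len);

and all the bus/bridge business stays hidden behind that.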

> > To be true to the HW, each bridge should have its memory region, so a
> > setup with
> >
> >        /pci-host
> >            |
> >            |--/p2p
> >                 |
> >                 |--/device
> >
> > Any DMA done by the device would walk through the p2p region to the host
> > which would contain a region with transform ops.
> >
> > However, at each level, you'd have to search for sibling regions that
> > may decode the address at that level before moving up, ie implement
> > essentially the equivalent of the PCI subtractive decoding scheme.
> 
> Not quite...  subtractive decoding only happens for very specific devices
> IIUC.  For instance, a PCI-ISA bridge.  Normally, it's positive decoding
> and a bridge has to describe the full region of MMIO/PIO that it handles.

That's for downstream. Upstream is essentially subtractive as far as I
can tell. IE. If no sibling device decodes the cycle, then it goes up
no ? I don't remember off hand (actually I wonder how it works if a
device does a cycle to address A and it's both below a P2P bridge and
there's a subtractive decoding bridge next door... who gets the cycle ?
upstream or subtractive ?).

The problem is non-existent on PCIe of course.

> So it's only necessary to traverse down the tree again for the very special
> case of PCI-ISA bridges.  Normally you can tell just by looking at siblings.
> 
> > That will be a significant overhead for your DMA ops I believe, though
> > doable.
> 
> Worst case scenario, 256 devices with what, a 3 level deep hierarchy?  We're
> still talking about 24 simple address compares.  That shouldn't be so bad.

Per DMA access...
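
(Presumably that 24 is log2(256) = 8 compares per binary search times
3 levels; cheap in isolation, but it's paid on every single DMA
transaction.)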

> > Then we'd have to add map/unmap to MemoryRegion as well, with the
> > understanding that they may not be supported at every level...
> 
> map/unmap can always fall back to bounce buffers.

Right.
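
For completeness, the shape of that fallback as I picture it; the
helper names here are invented, only the idea matters:

  #include <stdbool.h>
  #include <stddef.h>
  #include <stdint.h>
  #include <glib.h>

  /* Invented helpers for the sketch, not the real API. */
  void *region_direct_ptr(void *region, uint64_t addr, size_t len);
  void region_read(void *region, uint64_t addr, void *buf, size_t len);

  /* Hand out a direct pointer when we can, otherwise fall back to a
   * bounce buffer.  'is_write' means the device will write to the
   * mapping: for a device read we fill the bounce buffer up front,
   * for a device write the unmap side pushes the buffer back out. */
  static void *dma_map_sketch(void *region, uint64_t addr, size_t len,
                              bool is_write, bool *bounced)
  {
      void *p = region_direct_ptr(region, addr, len);
      if (p) {
          *bounced = false;
          return p;
      }
      void *bounce = g_malloc(len);
      if (!is_write) {
          region_read(region, addr, bounce, len);
      }
      *bounced = true;
      return bounce;
  }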

> > So yeah, it sounds doable and it would handle what DMAContext doesn't
> > handle which is access to peer devices without going all the way back to
> > the "top level", but it's complex and ... I need something in qemu
> > 1.2 :-)
> 
> I think we need a longer term vision here.  We can find incremental solutions
> for the short term but I'm pretty nervous about having two parallel APIs only
> to discover that we need to converge in 2 years.

Can we agree that what we need to avoid having to change every second
day is the driver API ?

In which case, how about I come up with some pci_* DMA APIs such as the
ones you suggested above and fix up a handful of devices we care about
to use them.

Under the hood, we can make it use the DMAContext for now and have that
working for us, but we can easily "merge" DMAContext and MemoryRegion
later since it's not directly exposed to devices.

IE. The devices would only use:

  pci_device_read/write/map/unmap(PCIDevice,...)

And DMAContext remains buried in the implementation.
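
Roughly, the wrappers could look like this for now; a sketch only, the
exact signatures and the pci_dma_context() style accessor for fishing
the DMAContext out of the device are assumptions up for discussion:

  /* Device-facing wrappers; the DMAContext hookup stays internal. */
  static inline int pci_device_read(PCIDevice *dev, dma_addr_t addr,
                                    void *buf, dma_addr_t len)
  {
      return dma_memory_read(pci_dma_context(dev), addr, buf, len);
  }

  static inline int pci_device_write(PCIDevice *dev, dma_addr_t addr,
                                     const void *buf, dma_addr_t len)
  {
      return dma_memory_write(pci_dma_context(dev), addr, buf, len);
  }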

Would that be ok with you ?

Cheers,
Ben.

> Regards,
> 
> Anthony Liguori
> 
> 
> > In addition there's the memory barrier business so we probably want to
> > keep the idea of having DMA specific accessors ...
> >
> > Could we keep the DMAContext for now and just rename it to MemoryRegion
> > (keeping the accessors) when we go for a more in depth transformation ?
> >
> > Cheers,
> > Ben.
> >
> >
> >




