From: Michael S. Tsirkin
Subject: Re: [Qemu-devel] [PATCH] Add a memory barrier to guest memory access functions
Date: Mon, 21 May 2012 13:31:02 +0300

On Mon, May 21, 2012 at 07:53:23PM +1000, Benjamin Herrenschmidt wrote:
> On Mon, 2012-05-21 at 12:34 +0300, Michael S. Tsirkin wrote:
> > On Mon, May 21, 2012 at 07:16:27PM +1000, Benjamin Herrenschmidt wrote:
> > > On Mon, 2012-05-21 at 19:07 +1000, Benjamin Herrenschmidt wrote:
> > > 
> > > > One thing that might alleviate some of your concerns would possibly be
> > > > to "remember" in a global (to be replaced with a thread var eventually)
> > > > the last transfer direction and use a simple test to choose the barrier,
> > > > i.e., store + store -> wmb, load + load -> rmb, other -> mb.
> > 
> > But how do you know the guest did a store?
> 
> This isn't vs. guest access, but vs. DMA access, i.e. we are ordering DMA
> accesses vs. each other. The guest is still responsible for doing its own
> side of the barriers as usual.
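
Just so I'm sure I'm reading the proposal right, I assume the direction
tracking would look roughly like this. A sketch only: the enum, the
global and the helper name are made up, and it assumes smp_mb()/smp_wmb()/
smp_rmb() helpers along the lines of the kernel ones.

typedef enum {
    DMA_LAST_NONE,
    DMA_LAST_READ,   /* device load from guest memory */
    DMA_LAST_WRITE,  /* device store to guest memory */
} DMALastDir;

/* Last DMA direction seen: a global for now, a thread var eventually. */
static DMALastDir last_dma_dir = DMA_LAST_NONE;

static inline void dma_barrier_for(DMALastDir dir)
{
    if (dir == DMA_LAST_WRITE && last_dma_dir == DMA_LAST_WRITE) {
        smp_wmb();          /* store + store */
    } else if (dir == DMA_LAST_READ && last_dma_dir == DMA_LAST_READ) {
        smp_rmb();          /* load + load */
    } else {
        smp_mb();           /* anything else: full barrier */
    }
    last_dma_dir = dir;
}
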
> > > > But first I'd be curious if some x86 folks could actually measure the
> > > > impact of the patch as I proposed it. That would give us an idea of how
> > > > bad the performance problem is and how far we need to go to address it.
> > > 
> > > Another option: go back to something more like the original patch,
> > > i.e., put the barrier in the new dma_* accessors (and provide a
> > > non-barrier variant while at it) rather than the low-level cpu_physical_*
> > > accessors.
> > > 
> > > That makes it a lot easier for selected drivers to be converted to avoid
> > > the barrier in things like code running in the vcpu context. It also
> > > means that virtio doesn't get any added barriers, which is what we want
> > > as well.
> > > 
> > > I.e., have something along these lines (based on the accessors added by
> > > the iommu series), using __ kernel-style naming (feel free to provide
> > > better naming):
> > > 
> > > static inline int __dma_memory_rw( ... args ... )
> > > {
> > >     if (!dma_has_iommu(dma)) {
> > >         /* Fast-path for no IOMMU */
> > >         cpu_physical_memory_rw( ... args ...);
> > >         return 0;
> > >     } else {
> > >         return iommu_dma_memory_rw( ... args ...);
> > >     }
> > > }
> > > 
> > > static inline int dma_memory_rw( ... args ... )
> > > {
> > >   smp_mb(); /* Or use finer-grained barriers as discussed earlier */
> > > 
> > >   return __dma_memory_rw( ... args ... );
> > 
> > Heh. But don't we need an mb afterwards too?
> 
> Not really, no, but we can discuss the finer points. I'm pretty sure
> one barrier before is enough as long as we ensure MSIs are properly ordered.

Hmm. MSI injection causes an IPI, so that implies an SMP
barrier, I think. But see below about the use of
write-combining in the guest.

> > > }
> > > 
> > > And corresponding __dma_memory_read/__dma_memory_write (again, feel
> > > free to suggest a more "qemu'ish" naming if you don't like __, it's
> > > a kernel habit, not sure what you guys do in qemu land).
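
Presumably those would just mirror the rw variant above, e.g. (sketch
only: I'm guessing at the argument list and the direction constant from
the iommu series, so treat the signatures as illustrative):

static inline int __dma_memory_read(DMAContext *dma, dma_addr_t addr,
                                    void *buf, dma_addr_t len)
{
    /* Device read from guest memory; no barrier on this path. */
    return __dma_memory_rw(dma, addr, buf, len, DMA_DIRECTION_TO_DEVICE);
}

static inline int dma_memory_read(DMAContext *dma, dma_addr_t addr,
                                  void *buf, dma_addr_t len)
{
    smp_mb(); /* Same ordering rule as dma_memory_rw() above */
    return __dma_memory_read(dma, addr, buf, len);
}

/* ...and the same pattern for __dma_memory_write()/dma_memory_write(). */
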
> > > 
> > > Cheers,
> > > Ben.
> > 
> > And my preference is to first convert everyone to __ variants and
> > carefully switch devices to the barrier version after a bit of
> > consideration.
> 
> I very strongly disagree. This is exactly the wrong approach. In pretty
> much -all- cases the ordered versions are going to be safer, since they
> basically provide ordering semantics similar to what a PCI bus would
> provide.
> 
> I.e., just making the default accessors ordered means that all devices
> written with the assumption that the guest will see accesses in the
> order the emulated device issues them will be correct, which
> means pretty much all of them (well, almost).
> 
>  --> It actually fixes a real bug that affects almost all devices
>      that do DMA today in qemu

In theory, fine, but are there practical examples that affect x86?
We might want to at least document some of them.

On x86, wmb and rmb are nops, so there's no bug in practice.
So the only actual rule which might be violated by qemu is that
a read flushes out earlier writes.
It's unlikely you will find real examples where this matters,
but I'm interested to hear otherwise.
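
For concreteness, on x86 the barriers can collapse to roughly the
following (a sketch of what the qemu barrier macros amount to there,
not the exact definitions; a locked add could stand in for mfence):

#define barrier()   asm volatile("" ::: "memory")

#if defined(__i386__) || defined(__x86_64__)
/* x86 does not reorder store-store or load-load, so wmb/rmb only have
 * to stop the compiler.  Only the full barrier needs a real fence,
 * because a later read may otherwise pass an earlier write. */
#define smp_wmb()   barrier()
#define smp_rmb()   barrier()
#define smp_mb()    asm volatile("mfence" ::: "memory")
#endif
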


I also note that guests do use write-combining, e.g. for vga.
One wonders whether stronger barriers are needed
because of that.


> Then, fine-tuning performance-critical ones by selectively removing
> barriers allows us to improve performance where it would otherwise be
> harmed.

So that breaks attempts to bisect performance regressions.
Not good.

> So on that I will not compromise.
> 
> However, I think it might be better to leave the barrier in the dma
> accessors, since that's also how you get iommu transparency etc., so it's
> not a bad place to put them, and leave the cpu_physical_* accessors for
> use by lower-level device drivers, which are thus also responsible for
> dealing with ordering if they have to.
> 
> Cheers,
> Ben.

You claim to understand what matters for all devices; I doubt that.

Why don't we add safe APIs, then go over devices and switch over?
I counted 97 pci_dma_ accesses:
33 in rtl, 32 in eepro100, 12 in lsi, 7 in e1000.

Let maintainers decide where speed matters.


-- 
MST


