From: Benjamin Herrenschmidt
Subject: Re: [Qemu-devel] [PATCH] Add a memory barrier to guest memory access function
Date: Tue, 22 May 2012 07:58:17 +1000

On Mon, 2012-05-21 at 15:18 +0300, Michael S. Tsirkin wrote:

> > mb is more than just "read flush writes" (besides it's not a statement
> > about flushing, it's a statement about ordering. whether it has a
> > flushing side effect on x86 is a separate issue, it doesn't on power for
> > example).
> 
> I referred to reads not bypassing writes on PCI.

Again, from which originator? From a given initiator, nothing bypasses
anything, so the right thing to do here is a full mb(). However, I
suspect what you are talking about here is read -responses- not
bypassing writes in the direction of the response (i.e., the "flushing"
semantic of reads), which is a different matter. Also don't forget that
this is only a semantic of PCI, not of the system fabric: a device
DMA read doesn't flush a CPU write that is still sitting in that CPU's
store queue.

> This is the real argument in my eyes: that we
> should behave the way real hardware does.

But that doesn't really make much sense since we don't actually have a
non-coherent bus sitting in the middle :-)

However, we should as much as possible be observed to behave as such, I
agree, though I don't think we need to bother too much about timings,
since we don't really have a way to enforce the immediate visibility of
stores within the coherent domain without a bunch of arch-specific, very
heavy hammers which we really don't want to wield at this point.

> > Real flushing out writes matters very much in real life in two very
> > different contexts that tend to not affect emulation in qemu as much.
> > 
> > One is flushing write in the opposite direction (or rather, having the
> > read response queued up behind those writes) which is critical to
> > ensuring proper completion of DMAs after an LSI from a guest driver
> > perspective on real HW typically.
> > 
> > The other classic case is to flush posted MMIO writes in order to ensure
> > that a subsequent delay is respected.
> > 
> > Most of those don't actually matter when doing emulation. Besides a
> > barrier won't provide you the second guarantee, you need a nastier
> > construct at least on some architectures like power.
> 
> Exactly. This is what I was saying too.

Right, and I'm reasonably sure that none of the above is our problem.

As I said, at this point, what I want to sort out is purely the
observable ordering of DMA transactions. The side effect of reads in one
direction on writes in the other direction is an orthogonal problem
which, as I wrote above, is probably not hurting us.
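
(As an aside, the second "classic case" quoted above is just the usual
guest-driver idiom of reading a register back to push out a posted MMIO
write before a timed delay; a Linux-driver-style sketch with invented
register names, purely for illustration:

/* Guest-driver side, Linux style -- not QEMU code, names invented. */
writel(MYDEV_CTRL_RESET, dev->regs + MYDEV_CTRL);  /* posted MMIO write            */
(void)readl(dev->regs + MYDEV_CTRL);               /* read-back forces it out      */
udelay(10);                                        /* delay now follows the write  */

and as noted above, a plain barrier doesn't give you that guarantee; the
read-back does.)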

> > However, we do need to ensure that reads and writes are properly ordered
> > vs. each other (regardless of any "flush" semantic) or things could go
> > very wrong on OO architectures (here too, my understanding on x86 is
> > limited).
> 
> Right. Here's a compromise:
> - add smp_rmb() on any DMA read
> - add smp_wmb() on any DMA write
> This is almost zero cost on x86 at least.
> So we are not regressing existing setups.

And it's not correct. With that setup, DMA writes can pass DMA reads
(and vice-versa), which doesn't correspond to the guarantees of the PCI
spec. The question, I suppose, is whether this is a problem in practice...
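
Concretely, that proposal amounts to something like this (sketch only:
invented names and simplified signatures, not the actual accessors):

static void dma_read_weak(target_phys_addr_t addr, void *buf, int len)
{
    smp_rmb();      /* orders this read against earlier reads   */
    cpu_physical_memory_read(addr, buf, len);
}

static void dma_write_weak(target_phys_addr_t addr, const void *buf, int len)
{
    smp_wmb();      /* orders this write against earlier writes */
    cpu_physical_memory_write(addr, buf, len);
}

/*
 * The hole: a DMA write followed by a DMA read.  Neither smp_wmb() nor
 * smp_rmb() orders a store against a later load, so on a weakly ordered
 * host the read can be satisfied before the write is visible, which is
 * exactly what PCI forbids.  Only a full smp_mb() closes that.
 */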

> Are there any places where devices do read after write?

It's possible: things like the update of a descriptor followed by reading
of the next one, etc.  I don't have an example in mind right now of
a device that would be hurt, but I'm a bit nervous, as this would be a
violation of PCI's guaranteed ordering.
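
Something like this, say (invented device model, not an existing one):

/* Invented descriptor-ring device, for illustration only. */
#define MYDEV_DESC_DONE 0x1

struct mydev_desc {
    uint64_t buf;
    uint32_t len;
    uint32_t status;
};

typedef struct {
    PCIDevice  pci_dev;
    dma_addr_t cur_desc;     /* guest address of the descriptor just serviced */
    dma_addr_t next_desc;    /* guest address of the following descriptor     */
} MyDevState;

static void mydev_complete_and_fetch_next(MyDevState *s)
{
    uint32_t done = cpu_to_le32(MYDEV_DESC_DONE);
    struct mydev_desc next;

    /* DMA write: hand the serviced descriptor back to the guest driver. */
    pci_dma_write(&s->pci_dev, s->cur_desc + offsetof(struct mydev_desc, status),
                  &done, sizeof(done));

    /* DMA read: fetch the next descriptor the driver may have queued.
     * On real PCI this read cannot pass the write above; with only
     * wmb/rmb in the accessors, nothing enforces that on the host side. */
    pci_dma_read(&s->pci_dev, s->next_desc, &next, sizeof(next));
}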

> My preferred way is to find them and do pci_dma_flush() invoking
> smp_mb(). If there is such a case it's likely on datapath anyway
> so we do care.
> 
> But I can also live with a global flag "latest_dma_read"
> and on read we could do
>       if (unlikely(latest_dma_read))
>               smp_mb();
> 
> if you really insist on it
> though I do think it's inelegant.

Again, why do you object to simply making the default accessors fully
ordered? Do you think it will make a measurable difference in most cases?

Shouldn't we measure it first?
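
For what it's worth, the two variants on the table look roughly like this
(sketch, invented names; I'm reading the flag idea as "pay the full
barrier only when a read follows a write", the one case wmb/rmb don't
cover):

/* Variant A: the default accessor is always fully ordered. */
static void dma_read_ordered(target_phys_addr_t addr, void *buf, int len)
{
    smp_mb();                          /* PCI-like ordering, always */
    cpu_physical_memory_read(addr, buf, len);
}

/* Variant B: track the direction of the previous access and only issue
 * the full barrier on a write-then-read transition. */
static bool last_dma_was_write;

static void dma_read_tracked(target_phys_addr_t addr, void *buf, int len)
{
    if (unlikely(last_dma_was_write)) {
        smp_mb();                      /* order this read after the prior write */
        last_dma_was_write = false;
    } else {
        smp_rmb();
    }
    cpu_physical_memory_read(addr, buf, len);
}

static void dma_write_tracked(target_phys_addr_t addr, const void *buf, int len)
{
    smp_wmb();
    cpu_physical_memory_write(addr, buf, len);
    last_dma_was_write = true;
}

Variant B saves the barrier on read-heavy paths at the cost of a branch
and a global, which is why measuring Variant A first seems like the sane
starting point.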

> You said above x86 is unaffected. This is portability, not safety.

x86 is unaffected by the missing wmb/rmb; it might not be unaffected by
the missing ordering between loads and stores, I don't know. As I said,
I don't fully know the x86 memory model.

In any case, opposing "portability" to "safety" the way you do means
you are basically assuming that "qemu is written for x86 and nothing
else matters".

If that's your point of view, so be it and be clear about it, but I will
disagree :-) And while I can understand that powerpc might not be
considered the most important arch around at this point in time,
these problems are going to affect ARM as well.

> > Almost all our
> > devices were written without any thought given to ordering, so they
> > basically can and should be considered as all broken.
> 
> Problem is, a lot of code is likely broken even after you sprinkle
> barriers around. For example, qemu might write A then B where the guest
> driver expects to see B written before A.

No, those are totally unrelated bugs, nothing to do with barriers. You are
mixing up two completely different problems and using one as an excuse
not to fix the other one :-)

A device with the above problem would be broken today on x86 regardless.
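
To make the distinction concrete (reusing the invented mydev_desc layout
from the sketch above; this is a logic bug, not an ordering one):

/* Invented fragment: wrong on every host, barriers or not. */
static void mydev_publish_broken(PCIDevice *pci_dev, dma_addr_t desc_addr,
                                 dma_addr_t data_addr, const void *data, int len)
{
    struct mydev_desc d = { .buf = data_addr, .len = len,
                            .status = cpu_to_le32(MYDEV_DESC_DONE) };

    /* Bug: the descriptor is made visible to the guest driver ... */
    pci_dma_write(pci_dev, desc_addr, &d, sizeof(d));

    /* ... before the data it points to has been written.  Swapping the two
     * writes is a device-logic fix; memory barriers don't enter into it. */
    pci_dma_write(pci_dev, data_addr, data, len);
}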

> > Since thinking
> > about ordering is something that, by experience, very few programmers can
> > do and get right, the default should absolutely be fully ordered.
> 
> Give it bus ordering. That is not fully ordered.

It pretty much is actually, look at your PCI spec :-)

> > Performance regressions aren't a big deal to bisect in that case: If
> > there's a regression for a given driver and it points to this specific
> > patch adding the barriers then we know precisely where the regression
> > come from, and we can get some insight about how this specific driver
> > can be improved to use more relaxed accessors.
> > 
> > I don't see the problem.
> > 
> > One thing that might be worth looking at is if indeed mb() is so much
> > more costly than just wmb/rmb, in which case we could have some
> > smarts in the accessors to make them skip the full mb based on knowledge
> > of the previous access direction, though here too I would be tempted to only
> > do that if absolutely necessary (ie if we can't instead just fix the
> > sensitive driver to use explicitly relaxed accessors).
> 
> We did this in virtio and yes it is measurable.

You did it in virtio in a very hot spot of a performance-critical
driver. My argument is that:

 - We can do it in a way that doesn't affect virtio at all (by using the
dma accessors instead of cpu_*), roughly as sketched below

 - Only a few drivers have that kind of performance criticality, and they
can easily be fixed by hand.
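
Roughly, the split I'm arguing for would look like this (a sketch of the
idea, not an actual patch; the _ordered name is made up):

/* Hand-tuned code (virtio) keeps calling cpu_physical_memory_read/write
 * directly and keeps placing its own wmb()/rmb(), so its hot path does
 * not change at all. */

/* Ordinary device models go through the DMA-level accessors, which grow
 * the full barrier (and are already where IOMMU translation hooks in). */
static int pci_dma_read_ordered(PCIDevice *dev, dma_addr_t addr,
                                void *buf, dma_addr_t len)
{
    smp_mb();                          /* behave like PCI bus ordering */
    return pci_dma_read(dev, addr, buf, len);
}

Whether the barrier sits in a wrapper like this or directly inside the
existing pci_dma_*/dma_* helpers is a detail; the point is that
cpu_physical_* stays raw.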

> branches are pretty cheap though.

Depends, not always but yes, cheaper than barriers in many cases.

> > > > So on that I will not compromise.
> > > > 
> > > > However, I think it might be better to leave the barrier in the dma
> > > > accessor since that's how you also get iommu transparency etc... so it's
> > > > not a bad place to put them, and leave the cpu_physical_* for use by
> > > > lower level device drivers which are thus responsible also for dealing
> > > > with ordering if they have to.
> > > > 
> > > > Cheers,
> > > > Ben.
> > > 
> > > You claim to understand what matters for all devices; I doubt that.
> > 
> > It's pretty obvious that anything that does DMA using a classic
> > descriptor + buffers structure is broken without appropriate ordering.
> > 
> > And yes, I claim to have a fairly good idea of the problem, but I don't
> > think throwing credentials around is going to be helpful.
> >  
> > > Why don't we add safe APIs, then go over devices and switch over?
> > > I counted 97 pci_dma_ accesses.
> > > 33 in rtl, 32 in eepro100, 12 in lsi, 7 in e1000.
> > > 
> > > Let maintainers make a decision where does speed matter.
> > 
> > No. Let's fix the general bug first. Then let people who know the
> > individual drivers intimately and understand their access patterns make
> > the call as to when things can/should be relaxed.
> > 
> > Ben.
> 
> As a maintainer of a device, if you send me a patch I can review.
> If you change core APIs creating performance regressions
> I don't even know what to review without wasting time debugging
> and bisecting.

Well, if you don't know, then you ask on the list, and others (such as
myself or a certain mst) who happen to know those issues will help that
lone clueless maintainer. Seriously, it's not like it's hard, and it's a
lot easier than just keeping everything broken and hoping we get around
to auditing them all properly.

> According to what you said you want to fix kvm on powerpc.
> Good. Find a way that looks non-intrusive on x86, please.

And ARM. And any other OO arch.

Ben.




