From: Benjamin Herrenschmidt
Subject: Re: [Qemu-devel] [PATCH 13/13] iommu: Add a memory barrier to DMA RW function
Date: Thu, 17 May 2012 07:10:45 +1000

On Wed, 2012-05-16 at 14:39 -0500, Anthony Liguori wrote:

> I must confess, I have no idea what PCI et al guarantee with respect to
> ordering.  What's nasty about this patch is that you're not just ordering wrt
> device writes/reads, but also with the other VCPUs.  I don't suspect this
> would be prohibitively expensive but it still worries me.

So the precise ordering rules of various busses can vary slightly.

We could try to be as precise & fine-grained as those busses are in HW
or ... my belief is that it makes sense to simply guarantee that the
DMA accesses done by emulated devices always appear to other VCPUs in
the order in which the device emulation code issued them.

IE. if we can show that the cost of doing so is negligible, then it's
also the simplest approach, since just sticking that one barrier here
provides the ordering guarantee (at least for anything using the
dma_* accessors).
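
For concreteness, here's a minimal, self-contained sketch of that idea.
It is not the actual patch and not QEMU code; the names (guest_ram,
dma_memory_rw, DMADirection) are simplified stand-ins, and
atomic_thread_fence stands in for a full hardware barrier like QEMU's
smp_mb():

#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

typedef enum { DMA_TO_GUEST, DMA_FROM_GUEST } DMADirection;

static uint8_t guest_ram[4096];      /* stand-in for guest memory */

/* One full barrier in the common DMA accessor: every guest-memory
 * access made through it is ordered against everything the device
 * emulation did before it, loads and stores alike, so there is no
 * per-call-site auditing to do. */
static void dma_memory_rw(uint64_t addr, void *buf, size_t len,
                          DMADirection dir)
{
    atomic_thread_fence(memory_order_seq_cst);

    if (dir == DMA_TO_GUEST) {
        memcpy(&guest_ram[addr], buf, len);   /* no bounds checks, sketch only */
    } else {
        memcpy(buf, &guest_ram[addr], len);
    }
}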

Ordering problems can be really sneaky & nasty to debug and so I'm
really tempted to use that big hammer approach here, provided there is
no problematic performance loss.

> > So by putting the barrier right in the dma_* accessor we kill all the
> > birds with one stone without having to audit all drivers for use of the
> > right accessors and all bus types.
> >
> > Also while the goal of using more targeted barriers might be worthwhile
> > in the long run, it's not totally trivial because we do want to order
> > store vs. subsequent loads in all cases and load vs. loads, and we don't
> > want to have to keep track of what the previous access was, so at this
> > stage it's simply easier to just use a full barrier.
> >
> > So my suggestion is to see if that patch introduces a measurable
> > performance regression anywhere we care about (ie on x86) and if not,
> > just go for it, it will solve a very real problem and we can ponder ways
> > to do it better as a second step if it's worthwhile.
> >
> > Anthony, how do you usually benchmark these things ? Any chance you can
> > run a few tests to see if there's any visible loss ?
> 
> My concern would really be limited to virtio ring processing. It all depends
> on where you place the barriers in the end.

So virtio doesn't use the dma_* interface since it bypasses the iommu
(on purpose).

> I really don't want to just conservatively stick barriers everywhere either.
> I'd like to have a specific ordering guarantee and then implement that and
> deal with the performance consequences.

Well, my idea is to provide a well defined ordering semantic for all
DMA accesses issued by a device :-) IE. all DMAs done by the device
emulation appear to other VCPUs in the order the emulation code issued
them. Making the storage accesses visible in the right order to "other
VCPUs" is the whole point of the exercise.

This is well defined, though possibly broader than strictly necessary,
but it's a cost/benefit game here. If the cost is low enough, the
benefit is that it's going to be safe: we won't have subtle cases of
things passing each other etc... and it's also simpler to implement and
maintain since it's basically one barrier in the right place.

I have long experience dealing with ordering issues on large SMP
systems and believe me, anything "fine grained" is really hard to get
right in general, and the resulting bugs are really nasty to track down
and even identify. So I have a strong bias toward the big hammer
approach that is guaranteed to avoid the problem for anything using the
right DMA accessors.

> I also wonder if the "fix" that you see from this is papering around a bigger
> problem.  Can you explain the ohci problem that led you to do this in the
> first place?

Well, we did an audit of OHCI and we discovered several bugs there which
have been fixed since then, mostly cases where the emulated device would
incorrectly read/modify/write entire data structures in guest memory
rather than just updating the fields it's supposed to update, causing
simultaneous updates of other fields by the guest driver to be lost.

The result was that we still had occasional mild instability: every now
and then the host driver would seem to get errors or miss completions,
which the barrier appeared to fix.

On the other hand, we -know- that not having the barrier is incorrect so
that was enough for me to be happy about the diagnosis.

IE. the OHCI -will- update fields that must be visible in the right
order to the host driver (such as a link pointer in a TD followed by
the done list pointer pointing to that TD), and we know that POWER CPUs
are very good at shooting stores out of order, so with missing TDs on
completion being one of our symptoms, I think we pretty much nailed it.
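
As a concrete illustration of that hazard, here's a simplified sketch
of the pattern (not the real OHCI emulation code; the structures and
names are illustrative only, and atomic_thread_fence again stands in
for a full hardware barrier like smp_mb()):

#include <stdatomic.h>
#include <stdint.h>

struct td {                          /* simplified transfer descriptor */
    uint32_t next;                   /* link to the previous done TD */
    uint32_t status;                 /* completion code */
};

static struct td tds[16];            /* stand-in for TDs in guest memory */
static _Atomic uint32_t done_head;   /* stand-in for the HCCA done head */

/* Device emulation side: retire a TD onto the done list. */
static void complete_td(uint32_t idx, uint32_t cc)
{
    tds[idx].status = cc;
    tds[idx].next = atomic_load_explicit(&done_head, memory_order_relaxed);

    /* Without this barrier a weakly ordered CPU such as POWER can make
     * the done_head update visible before the TD stores above, so the
     * guest's host controller driver would walk a stale TD and miss
     * the completion. */
    atomic_thread_fence(memory_order_seq_cst);

    atomic_store_explicit(&done_head, idx, memory_order_relaxed);
}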

I'm going to try on some fast x86 using things like AHCI to see if I can
show a performance issue.

Cheers,
Ben.

> Regards,
> 
> Anthony Liguori
> 
> >
> > Cheers,
> > Ben.
> >
> >
> >




