Re: [Qemu-devel] [Qemu-ppc] [PATCH 0/3 QEMU v4] vfio on ppc64


From: David Gibson
Subject: Re: [Qemu-devel] [Qemu-ppc] [PATCH 0/3 QEMU v4] vfio on ppc64
Date: Tue, 23 Apr 2013 13:14:11 +1000
User-agent: Mutt/1.5.21 (2010-09-15)

On Mon, Apr 22, 2013 at 07:58:12PM -0600, Alex Williamson wrote:
> On Tue, 2013-04-23 at 09:44 +1000, David Gibson wrote:
> > On Mon, Apr 22, 2013 at 12:39:26PM -0600, Alex Williamson wrote:
> > > On Mon, 2013-04-22 at 18:02 +1000, Alexey Kardashevskiy wrote:
> > > > Yes, we are still tuning this stuff for us :)
> > > 
> > > Yay!
> > > 
> > > > Changes:
> > > > 
> > > > * new "spapr-pci-vfio-host-bridge" device is introduced
> > > > (inherits from spapr-pci-host-bridge);
> > > > 
> > > > * device#1->group->container->device#2,.. creation scheme is changed to
> > > > group->container->devices scheme (on ppc64);
> > > > 
> > > > * removed vfio_iommu_spapr_tce_dma_map/vfio_iommu_spapr_tce_dma_unmap
> > > > as they are not really necessary now and it does not look like we will
> > > > need them in the near future.
> > > > 
> > > > Comments?
> > > 
> > > Not happy with this... :(
> > 
> > So, part of that will be my fault, not Alexey's, because of how I told
> > him to do this.
> > 
> > > > Alexey Kardashevskiy (3):
> > > >   vfio: make some of VFIO API public
> > > 
> > > Exports and conditionalizes vfio_get_group for no good reason other than
> > > to get a pointer to a group that's not connected to a container.  Not
> > > only is the bool parameter ugly, it also makes for a poor interface:
> > > call it once with "false" and get an unconnected group, call it again
> > > for the same group with "true", and you still get back an unconnected
> > > group.  I'd say
> > > vfio_get_group should be split to have group and container connect
> > > separate, but I don't think you even need that...
> > 
> > The boolean is ugly, yes.  I was suggesting splitting a group_create()
> > function from get_group().
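
To be concrete about what I meant there: something like the split below.
It is a sketch only; every function name is invented for illustration,
and the real code would also want the VFIO_GROUP_GET_STATUS viability
check, which I have omitted.

/* Look up or open a group, without touching any container. */
VFIOGroup *vfio_get_group_only(int groupid)
{
    VFIOGroup *group;
    char path[32];

    /* Re-use the group if it is already open. */
    QLIST_FOREACH(group, &group_list, next) {
        if (group->groupid == groupid) {
            return group;
        }
    }

    group = g_malloc0(sizeof(*group));
    snprintf(path, sizeof(path), "/dev/vfio/%d", groupid);
    group->fd = qemu_open(path, O_RDWR);
    if (group->fd < 0) {
        error_report("vfio: error opening %s: %m", path);
        g_free(group);
        return NULL;
    }
    group->groupid = groupid;
    QLIST_INSERT_HEAD(&group_list, group, next);

    return group;   /* deliberately not attached to any container */
}

/* Separate step: the caller decides which container the group joins. */
int vfio_group_attach_container(VFIOGroup *group, VFIOContainer *container)
{
    if (ioctl(group->fd, VFIO_GROUP_SET_CONTAINER, &container->fd)) {
        return -errno;
    }
    group->container = container;
    QLIST_INSERT_HEAD(&container->group_list, group, container_next);

    return 0;
}

The x86 path can keep calling the two back to back; the spapr host
bridge gets to decide which container it passes in.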
> > 
> > > >   vfio: add support for spapr platform (powerpc64 server)
> > > 
> > > ...because this patch ignores the extensibility of
> > > vfio_connect_container.  There is absolutely no reason for
> > > vfio_container_spapr_alloc to exist.  It duplicates most of
> > > vfio_connect_container, apparently using the attempt to reuse containers
> > > as an excuse, even though this should happily fall through unless the
> > > host code is broken.  vfio supports containers capable of multiple
> > > groups; it does not assume or require that.
> > 
> > So, initially Alexey did extend vfio_connect_container(), but I think
> > this is wrong.  vfio_connect_container() does something different from
> > what we want on pseries, and not just because of details of the IOMMU
> > interface.
> > 
> > Apart from the construction of the container object (which could be
> > done by a shared helper), the guts of vfio_connect_container() does
> > two things:
> >     a) Finds a container for the group.  But it assumes any
> > container is suitable, if it will accept the group.  That works for
> > x86 because every container has the same mappings - all of RAM.  Even
> > if we were able to extend the kernel interface to support multiple
> > groups per container, it would still be wrong for pseries: the PAPR
> > interface requires that each guest host bridge have independent DMA
> > mappings.  So the group *must* be bound to the container associated
> > with this device's guest host bridge; no other choice can be correct.
> 
> This is not true.

It is absolutely true.

>  vfio_connect_container _tests_ whether any of the
> other containers are compatible; it makes no assumptions.  The
> VFIO_GROUP_SET_CONTAINER ioctl must fail if the group is not compatible
> with the existing container.  I'm not asking you to change that, I'm
> asking that you implement exactly that, meaning that this loop:

I know what the loop does.  But the loop's "compatibility" test is
based on whether the kernel can add the group to the container.  That
only tests, and can only test, whether the hardware and driver support
having both groups in a container.  What it *can't* test is what set
of IO address spaces we wish to present to the guest.  That's a
constraint that only qemu has the information to apply correctly.

Now it happens that because we only support one group per container
this will always create a new container, which is what we need.  But
that's basically just getting the right answer by accident.

>     QLIST_FOREACH(container, &container_list, next) {
>         if (!ioctl(group->fd, VFIO_GROUP_SET_CONTAINER, &container->fd)) {
>             group->container = container;
>             QLIST_INSERT_HEAD(&container->group_list, group, container_next);
>             return 0;
>         }
>     }
> 
> never reaches the inner branch on spapr.  If the spapr vfio iommu driver
> doesn't do this, and instead allows, but does not support, multiple
> groups to be connected to a container, that's a bug.  A group is only
> going to do this once and it's only at setup, so I don't think there's a
> valid performance argument here either.
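
To make the distinction concrete: in the guest visible model the
container choice is made by the guest host bridge code, not discovered
by probing other containers.  Very roughly, and with every name below
(sPAPRPHBVFIOState, vfio_new_container, plus the illustrative helpers
from the sketch above) invented for illustration:

static int spapr_phb_vfio_attach_group(sPAPRPHBVFIOState *phb, int groupid)
{
    VFIOGroup *group = vfio_get_group_only(groupid);

    if (!group) {
        return -1;
    }

    if (group->container) {
        /* Already attached somewhere: it must be *this* bridge's
         * container; anything else is an error, not a fallback. */
        return group->container == phb->container ? 0 : -EBUSY;
    }

    if (!phb->container) {
        /* One container per guest host bridge, i.e. per guest visible
         * DMA window. */
        phb->container = vfio_new_container();
    }

    return vfio_group_attach_container(group, phb->container);
}

The kernel's acceptance of the group is still the final check, but it
is a check, not the policy.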

> >     b) It wires up the container to the MemoryListener, so that it
> > maps all RAM.  Our model doesn't do that.
> 
> This is done within the type1 specific setup, ignore it.

Yes, but my point is that connect_container at present has no clear
semantic, other than the type 1 semantic.  What needs to be done
depends on the IOMMU usage model, and only qemu can choose that.
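
For comparison, the one semantic a container does get today, once a
group is connected, is the Type1 one.  Roughly, and with field and
function names only approximating the real code:

/* Type1 / "map all": mirror every guest RAM region into the container
 * as it appears, via a MemoryListener. */
static void vfio_container_mapall_setup(VFIOContainer *container)
{
    container->iommu_data.listener = vfio_memory_listener;
    memory_listener_register(&container->iommu_data.listener,
                             &address_space_memory);
}

The guest visible model does none of this: mappings appear only when
the guest issues H_PUT_TCE, so the container has to correspond to
exactly one guest-visible DMA window, and only the host bridge code
knows which one that is.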

> > Note that both these considerations would become relevant for x86,
> > too, if and when a guest-visible IOMMU interface is implemented: you
> > wouldn't want to wire up a memory listener, and the (guest) device
> > would have to be wired to the container associated with its guest
> > IOMMU domain, not any accepting container.
> 
> At that point the problem is actually different because we potentially
> have multiple viable iommu backends on the same host that will need to
> behave a little differently.  That's not the case for spapr.  There's
> only one viable iommu backend and it exactly fits the model that
> vfio_connect_container was designed for.

We shouldn't design the code around the coincidence that at present we
only support one backend per platform and only one usage model per
backend.

Perhaps you can explain the model that vfio_connect_container() is
supposed to implement, because the only model I can figure out for it
is the Type1 one, and that absolutely does not fit the spapr model.

> > The fundamental point is that we have two very different models of
> > using the IOMMU, the "map all" and "guest visible" models.  And while it
> > happens that the PAPR IOMMU has constraints which mean it can't be
> > used in map all mode, and we never currently use the x86 IOMMUs in
> > guest visible mode, there's nothing to prevent a future IOMMU being
> > usable in either mode.  In fact there's nothing preventing using Type1
> > in either mode right now, except for the exercise of coding the guest
> > IOMMU interface into qemu.  So the model of IOMMU usage should not be
> > treated as simply a detail of the IOMMU type.
> 
> If and when we have multiple viable iommu backends on a single platform,
> I agree, we may need to be smarter about when groups get merged into the
> same container.  We're not there yet.  spapr doesn't plan to support
> multiple groups within a container, so whether we test it or not should
> be a non-issue.  The current kernel implementation needs to disallow it.
> The memory listener setup is completely contained within type1 setup, so
> that should also be a non-issue.
> 
> > Now, what I do wonder is if we ought to have two different qemu device
> > types for a vfio device in map all mode, which will find and prepare
> > its own container and one in guest visible mode which must be told
> > which container to use, and fail if it can't add itself to that
> > container.
> 
> A different vfio- device type is an option, but I'm not seeing a
> sufficient argument above for why the existing model doesn't work.

It works, but only by accident.  It is conceptually broken for the
guest visible usage model.  Well, or for the map everything model; you
can pick either one, depending on how you define what the semantics
should be.  But no set of coherent semantics makes it work for both
cases.

-- 
David Gibson                    | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
                                | _way_ _around_!
http://www.ozlabs.org/~dgibson
