Re: [Qemu-devel] [RFC PATCH] vfio: VFIO Driver core framework
From: Konrad Rzeszutek Wilk
Subject: Re: [Qemu-devel] [RFC PATCH] vfio: VFIO Driver core framework
Date: Fri, 11 Nov 2011 12:51:18 -0500
User-agent: Mutt/1.5.21 (2010-09-15)
On Thu, Nov 03, 2011 at 02:12:24PM -0600, Alex Williamson wrote:
> VFIO provides a secure, IOMMU based interface for user space
> drivers, including device assignment to virtual machines.
> This provides the base management of IOMMU groups, devices,
> and IOMMU objects. See Documentation/vfio.txt included in
> this patch for user and kernel API description.
>
> Note, this implements the new API discussed at KVM Forum
> 2011, as represented by the driver version 0.2. It's hoped
> that this provides a modular enough interface to support PCI
> and non-PCI userspace drivers across various architectures
> and IOMMU implementations.
>
> Signed-off-by: Alex Williamson <address@hidden>
> ---
>
> Fingers crossed, this is the last RFC for VFIO, but we need
> the iommu group support before this can go upstream
> (http://lkml.indiana.edu/hypermail/linux/kernel/1110.2/02303.html),
> hoping this helps push that along.
>
> Since the last posting, this version completely modularizes
> the device backends and better defines the APIs between the
> core VFIO code and the device backends. I expect that we
> might also adopt a modular IOMMU interface as iommu_ops learns
> about different types of hardware. Also many, many cleanups.
> Check the complete git history for details:
>
> git://github.com/awilliam/linux-vfio.git vfio-ng
>
> (matching qemu tree: git://github.com/awilliam/qemu-vfio.git)
>
> This version, along with the supporting VFIO PCI backend can
> be found here:
>
> git://github.com/awilliam/linux-vfio.git vfio-next-20111103
>
> I've held off on implementing a kernel->user signaling
> mechanism for now since the previous netlink version produced
> too many gag reflexes. It's easy enough to set a bit in the
> group flags to indicate such support in the future, so I
> think we can move ahead without it.
>
> Appreciate any feedback or suggestions. Thanks,
>
> Alex
>
> Documentation/ioctl/ioctl-number.txt | 1
> Documentation/vfio.txt | 304 +++++++++
> MAINTAINERS | 8
> drivers/Kconfig | 2
> drivers/Makefile | 1
> drivers/vfio/Kconfig | 8
> drivers/vfio/Makefile | 3
> drivers/vfio/vfio_iommu.c | 530 ++++++++++++++++
> drivers/vfio/vfio_main.c | 1151 ++++++++++++++++++++++++++++++++++
> drivers/vfio/vfio_private.h | 34 +
> include/linux/vfio.h | 155 +++++
> 11 files changed, 2197 insertions(+), 0 deletions(-)
> create mode 100644 Documentation/vfio.txt
> create mode 100644 drivers/vfio/Kconfig
> create mode 100644 drivers/vfio/Makefile
> create mode 100644 drivers/vfio/vfio_iommu.c
> create mode 100644 drivers/vfio/vfio_main.c
> create mode 100644 drivers/vfio/vfio_private.h
> create mode 100644 include/linux/vfio.h
>
> diff --git a/Documentation/ioctl/ioctl-number.txt b/Documentation/ioctl/ioctl-number.txt
> index 54078ed..59d01e4 100644
> --- a/Documentation/ioctl/ioctl-number.txt
> +++ b/Documentation/ioctl/ioctl-number.txt
> @@ -88,6 +88,7 @@ Code Seq#(hex) Include File Comments
> and kernel/power/user.c
> '8' all SNP8023 advanced NIC card
> <mailto:address@hidden>
> +';' 64-76 linux/vfio.h
> '@' 00-0F linux/radeonfb.h conflict!
> '@' 00-0F drivers/video/aty/aty128fb.c conflict!
> 'A' 00-1F linux/apm_bios.h conflict!
> diff --git a/Documentation/vfio.txt b/Documentation/vfio.txt
> new file mode 100644
> index 0000000..5866896
> --- /dev/null
> +++ b/Documentation/vfio.txt
> @@ -0,0 +1,304 @@
> +VFIO - "Virtual Function I/O"[1]
> +-------------------------------------------------------------------------------
> +Many modern systems now provide DMA and interrupt remapping facilities
> +to help ensure I/O devices behave within the boundaries they've been
> +allotted. This includes x86 hardware with AMD-Vi and Intel VT-d as
> +well as POWER systems with Partitionable Endpoints (PEs) and even
> +embedded powerpc systems (technology name unknown). The VFIO driver
> +is an IOMMU/device agnostic framework for exposing direct device
> +access to userspace, in a secure, IOMMU protected environment. In
> +other words, this allows safe, non-privileged, userspace drivers.
> +
> +Why do we want that? Virtual machines often make use of direct device
> +access ("device assignment") when configured for the highest possible
> +I/O performance. From a device and host perspective, this simply turns
> +the VM into a userspace driver, with the benefits of significantly
> +reduced latency, higher bandwidth, and direct use of bare-metal device
> +drivers[2].
Are there any constraints on running a 32-bit userspace with
a 64-bit kernel and with 32-bit userspace drivers?
> +
> +Some applications, particularly in the high performance computing
> +field, also benefit from low-overhead, direct device access from
> +userspace. Examples include network adapters (often non-TCP/IP based)
> +and compute accelerators. Previous to VFIO, these drivers needed to
> +go through the full development cycle to become a proper upstream driver,
> +be maintained out of tree, or make use of the UIO framework, which
> +has no notion of IOMMU protection, limited interrupt support, and
> +requires root privileges to access things like PCI configuration space.
> +
> +The VFIO driver framework intends to unify these, replacing both the
> +KVM PCI-specific device assignment currently in use and providing
> +a more secure, more featureful userspace driver environment than UIO.
> +
> +Groups, Devices, IOMMUs, oh my
<chuckles> oh my, eh?
> +-------------------------------------------------------------------------------
> +
> +A fundamental component of VFIO is the notion of IOMMU groups. IOMMUs
> +can't always distinguish transactions from each individual device in
> +the system. Sometimes this is because of the IOMMU design, such as with
> +PEs, other times it's caused by the I/O topology, for instance a
> +PCIe-to-PCI bridge masking all devices behind it. We call the sets of
> +devices created by these restrictions IOMMU groups (or just "groups" for
> +this document).
> +
> +The IOMMU cannot distinguish transactions between the individual devices
> +within the group, therefore the group is the basic unit of ownership for
> +a userspace process. Because of this, groups are also the primary
> +interface to both devices and IOMMU domains in VFIO.
> +
> +The VFIO representation of groups is created as devices are added into
> +the framework by a VFIO bus driver. The vfio-pci module is an example
> +of a bus driver. This module registers devices along with a set of bus
> +specific callbacks with the VFIO core. These callbacks provide the
> +interfaces later used for device access. As each new group is created,
> +as determined by iommu_device_group(), VFIO creates a /dev/vfio/$GROUP
> +character device.
> +
> +In addition to the device enumeration and callbacks, the VFIO bus driver
> +also provides a traditional device driver and is able to bind to devices
> +on its bus. When a device is bound to the bus driver, it's available to
> +VFIO. When all the devices within a group are bound to their bus drivers,
> +the group becomes "viable" and a user with sufficient access to the VFIO
> +group chardev can obtain exclusive access to the set of group devices.
> +
> +As documented in linux/vfio.h, several ioctls are provided on the
> +group chardev:
> +
> +#define VFIO_GROUP_GET_FLAGS _IOR(';', 100, __u64)
> + #define VFIO_GROUP_FLAGS_VIABLE (1 << 0)
> + #define VFIO_GROUP_FLAGS_MM_LOCKED (1 << 1)
> +#define VFIO_GROUP_MERGE _IOW(';', 101, int)
> +#define VFIO_GROUP_UNMERGE _IOW(';', 102, int)
> +#define VFIO_GROUP_GET_IOMMU_FD _IO(';', 103)
> +#define VFIO_GROUP_GET_DEVICE_FD _IOW(';', 104, char *)
> +
> +The last two ioctls return new file descriptors for accessing
> +individual devices within the group and programming the IOMMU. Each of
> +these new file descriptors provide their own set of file interfaces.
> +These ioctls will fail if any of the devices within the group are not
> +bound to their VFIO bus driver. Additionally, when either of these
> +interfaces are used, the group is then bound to the struct_mm of the
> +caller. The GET_FLAGS ioctl can be used to view the state of the group.
> +
> +When either the GET_IOMMU_FD or GET_DEVICE_FD ioctls are invoked, a
> +new IOMMU domain is created and all of the devices in the group are
> +attached to it. This is the only way to ensure full IOMMU isolation
> +of the group, but potentially wastes resources and cycles if the user
> +intends to manage multiple groups with the same set of IOMMU mappings.
> +VFIO therefore provides a group MERGE and UNMERGE interface, which
> +allows multiple groups to share an IOMMU domain. Not all IOMMUs allow
> +arbitrary groups to be merged, so the user should assume merging is
> +opportunistic. A new group, with no open device or IOMMU file
> +descriptors, can be merged into an existing, in-use, group using the
> +MERGE ioctl. A merged group can be unmerged using the UNMERGE ioctl
> +once all of the device file descriptors for the group being merged
> +"out" are closed.
> +
> +When groups are merged, the GET_IOMMU_FD and GET_DEVICE_FD ioctls are
> +essentially fungible between group file descriptors (i.e. if device A
> +is in group X, and X is merged with Y, a file descriptor for A can be
> +retrieved using GET_DEVICE_FD on Y. Likewise, GET_IOMMU_FD returns a
> +file descriptor referencing the same internal IOMMU object from either
> +X or Y). Merged groups can be dissolved either explicitly with UNMERGE
> +or automatically when ALL file descriptors for the merged group are
> +closed (all IOMMUs, all devices, all groups).
> +
> +The IOMMU file descriptor provides this set of ioctls:
> +
> +#define VFIO_IOMMU_GET_FLAGS _IOR(';', 105, __u64)
> + #define VFIO_IOMMU_FLAGS_MAP_ANY (1 << 0)
> +#define VFIO_IOMMU_MAP_DMA _IOWR(';', 106, struct vfio_dma_map)
> +#define VFIO_IOMMU_UNMAP_DMA _IOWR(';', 107, struct vfio_dma_map)
Coherency support is not going to be addressed, right? What about sync?
Say you need to sync a CPU address to a device address?
> +
> +The GET_FLAGS ioctl returns basic information about the IOMMU domain.
> +We currently only support IOMMU domains that are able to map any
> +virtual address to any IOVA. This is indicated by the MAP_ANY flag.
> +
> +The (UN)MAP_DMA commands make use of struct vfio_dma_map for mapping
> +and unmapping IOVAs to process virtual addresses:
> +
> +struct vfio_dma_map {
> + __u64 len; /* length of structure */
What is the purpose of the 'len' field? Is it to guard against future
version changes?
> + __u64 vaddr; /* process virtual addr */
> + __u64 dmaaddr; /* desired and/or returned dma address */
> + __u64 size; /* size in bytes */
> + __u64 flags;
> +#define VFIO_DMA_MAP_FLAG_WRITE (1 << 0) /* req writeable DMA mem */
> +};
> +
> +Current users of VFIO use relatively static DMA mappings, not requiring
> +high frequency turnover. As new users are added, it's expected that the
Is there a limit to how many DMA mappings can be created?
> +IOMMU file descriptor will evolve to support new mapping interfaces, this
> +will be reflected in the flags and may present new ioctls and file
> +interfaces.
> +
> +The device GET_FLAGS ioctl is intended to return basic device type and
> +indicate support for optional capabilities. Flags currently include whether
> +the device is PCI or described by Device Tree, and whether the RESET ioctl
> +is supported:
And reset in terms of PCIe spec is the FLR?
> +
> +#define VFIO_DEVICE_GET_FLAGS _IOR(';', 108, __u64)
> + #define VFIO_DEVICE_FLAGS_PCI (1 << 0)
> + #define VFIO_DEVICE_FLAGS_DT (1 << 1)
> + #define VFIO_DEVICE_FLAGS_RESET (1 << 2)
> +
> +The MMIO and IOP resources used by a device are described by regions.
IOP?
> +The GET_NUM_REGIONS ioctl tells us how many regions the device supports:
> +
> +#define VFIO_DEVICE_GET_NUM_REGIONS _IOR(';', 109, int)
Don't want __u32?
> +
> +Regions are described by a struct vfio_region_info, which is retrieved by
> +using the GET_REGION_INFO ioctl with vfio_region_info.index field set to
> +the desired region (0 based index). Note that devices may implement zero
> +sized regions (vfio-pci does this to provide a 1:1 BAR to region index
> +mapping).
Huh?
> +
> +struct vfio_region_info {
> + __u32 len; /* length of structure */
> + __u32 index; /* region number */
> + __u64 size; /* size in bytes of region */
> + __u64 offset; /* start offset of region */
> + __u64 flags;
> +#define VFIO_REGION_INFO_FLAG_MMAP (1 << 0)
> +#define VFIO_REGION_INFO_FLAG_RO (1 << 1)
> +#define VFIO_REGION_INFO_FLAG_PHYS_VALID (1 << 2)
What is FLAG_MMAP? Does it mean: 1) it can be mmaped, or 2) it is mmaped?
FLAG_RO is pretty obvious - presumably this is for firmware regions and such.
And PHYS_VALID is if the region is disabled for some reasons? If so
would the name FLAG_DISABLED be better?
> + __u64 phys; /* physical address of region */
> +};
> +
> +#define VFIO_DEVICE_GET_REGION_INFO _IOWR(';', 110, struct vfio_region_info)
> +
> +The offset indicates the offset into the device file descriptor which
> +accesses the given range (for read/write/mmap/seek). Flags indicate the
> +available access types and validity of optional fields. For instance
> +the phys field may only be valid for certain device types.
> +
> +Interrupts are described using a similar interface. GET_NUM_IRQS
> +reports the number of IRQ indexes for the device.
> +
> +#define VFIO_DEVICE_GET_NUM_IRQS _IOR(';', 111, int)
_u32?
> +
> +struct vfio_irq_info {
> + __u32 len; /* length of structure */
> + __u32 index; /* IRQ number */
> + __u32 count; /* number of individual IRQs */
> + __u64 flags;
> +#define VFIO_IRQ_INFO_FLAG_LEVEL (1 << 0)
> +};
> +
> +Again, zero count entries are allowed (vfio-pci uses a static interrupt
> +type to index mapping).
I am not really sure what that means.
> +
> +Information about each index can be retrieved using the GET_IRQ_INFO
> +ioctl, used much like GET_REGION_INFO.
> +
> +#define VFIO_DEVICE_GET_IRQ_INFO _IOWR(';', 112, struct vfio_irq_info)
> +
> +Individual indexes can describe single or sets of IRQs. This provides the
> +flexibility to describe PCI INTx, MSI, and MSI-X using a single interface.
> +
> +All VFIO interrupts are signaled to userspace via eventfds. Integer arrays,
> +as shown below, are used to pass the IRQ info index, the number of eventfds,
> +and each eventfd to be signaled. Using a count of 0 disables the interrupt.
> +
> +/* Set IRQ eventfds, arg[0] = index, arg[1] = count, arg[2-n] = eventfds */
Are eventfds u64 or u32?
Why not just define a structure?
struct vfio_irq_eventfds {
__u32 index;
__u32 count;
__u64 eventfds[0];
};
How do you get an eventfd to feed in here?
> +#define VFIO_DEVICE_SET_IRQ_EVENTFDS _IOW(';', 113, int)
u32?
> +
> +When a level triggered interrupt is signaled, the interrupt is masked
> +on the host. This prevents an unresponsive userspace driver from
> +continuing to interrupt the host system. After servicing the interrupt,
> +UNMASK_IRQ is used to allow the interrupt to retrigger. Note that level
> +triggered interrupts implicitly have a count of 1 per index.
So they are enabled automatically? Meaning you don't even have to do
SET_IRQ_EVENTFDS because the count is set to 1?
> +
> +/* Unmask IRQ index, arg[0] = index */
> +#define VFIO_DEVICE_UNMASK_IRQ _IOW(';', 114, int)
So this is for MSI as well? If I have index = 1 with count = 4,
will doing UNMASK_IRQ enable all the MSI events at once?
I guess there is not much point in enabling/disabling selective MSI
IRQs..
> +
> +Level triggered interrupts can also be unmasked using an irqfd. Use
irqfd or eventfd?
> +SET_UNMASK_IRQ_EVENTFD to set the file descriptor for this.
So only level triggered? Hmm, how do I know whether the device is
level or edge triggered? Or can edge (MSI) also be unmasked using the
eventfd?
> +
> +/* Set unmask eventfd, arg[0] = index, arg[1] = eventfd */
> +#define VFIO_DEVICE_SET_UNMASK_IRQ_EVENTFD _IOW(';', 115, int)
> +
> +When supported, as indicated by the device flags, reset the device.
> +
> +#define VFIO_DEVICE_RESET _IO(';', 116)
Does it disable the 'count'? Err, does it disable the IRQ on the
device after this and one should call VFIO_DEVICE_SET_IRQ_EVENTFDS
to set new eventfds? Or does it re-use the eventfds and the device
is enabled after this?
> +
> +Device tree devices also invlude ioctls for further defining the
include
> +device tree properties of the device:
> +
> +struct vfio_dtpath {
> + __u32 len; /* length of structure */
> + __u32 index;
0 based I presume?
> + __u64 flags;
> +#define VFIO_DTPATH_FLAGS_REGION (1 << 0)
What is region in this context?? Or would this make much more sense
if I knew what Device Tree actually is.
> +#define VFIO_DTPATH_FLAGS_IRQ (1 << 1)
> + char *path;
Ah, now I see why you want 'len' here.. But I am still at loss
why you want that with the other structures.
> +};
> +#define VFIO_DEVICE_GET_DTPATH _IOWR(';', 117, struct vfio_dtpath)
> +
> +struct vfio_dtindex {
> + __u32 len; /* length of structure */
> + __u32 index;
> + __u32 prop_type;
Is that an enum type? Is this defined somewhere?
> + __u32 prop_index;
What is the purpose of this field?
> + __u64 flags;
> +#define VFIO_DTINDEX_FLAGS_REGION (1 << 0)
> +#define VFIO_DTINDEX_FLAGS_IRQ (1 << 1)
> +};
> +#define VFIO_DEVICE_GET_DTINDEX _IOWR(';', 118, struct vfio_dtindex)
> +
> +
> +VFIO bus driver API
> +-------------------------------------------------------------------------------
> +
> +Bus drivers, such as PCI, have three jobs:
> + 1) Add/remove devices from vfio
> + 2) Provide vfio_device_ops for device access
> + 3) Device binding and unbinding
suspend/resume?
> +
> +When initialized, the bus driver should enumerate the devices on its
> +bus and call vfio_group_add_dev() for each device. If the bus supports
> +hotplug, notifiers should be enabled to track devices being added and
> +removed. vfio_group_del_dev() removes a previously added device from
> +vfio.
> +
> +Adding a device registers a vfio_device_ops function pointer structure
> +for the device:
Huh? So this gets created for _every_ 'struct device' that is added to
the VFIO bus? Is this structure exposed? Or is this an internal one?
> +
> +struct vfio_device_ops {
> + bool (*match)(struct device *, char *);
> + int (*get)(void *);
> + void (*put)(void *);
> + ssize_t (*read)(void *, char __user *,
> + size_t, loff_t *);
> + ssize_t (*write)(void *, const char __user *,
> + size_t, loff_t *);
> + long (*ioctl)(void *, unsigned int, unsigned long);
> + int (*mmap)(void *, struct vm_area_struct *);
> +};
> +
> +When a device is bound to the bus driver, the bus driver indicates this
> +to vfio using the vfio_bind_dev() interface. The device_data parameter
Might want to paste the function declaration for it, because I am not sure
where the 'device_data' parameter is on the argument list.
> +is a pointer to an opaque data structure for use only by the bus driver.
> +The get, put, read, write, ioctl, and mmap vfio_device_ops all pass
> +this data structure back to the bus driver. When a device is unbound
Oh, so it is on the 'void *'.
> +from the bus driver, the vfio_unbind_dev() interface signals this to
> +vfio. This function returns the pointer to the device_data structure
That function
> +registered for the device.
I am not really sure what this section's purpose is. Could this be part
of the header file or the code? It does not look to be part of the
ioctl API.
> +
> +As noted previously, a group contains one or more devices, so
> +GROUP_GET_DEVICE_FD needs to identify the specific device being requested.
> +The vfio_device_ops.match callback is used to allow bus drivers to determine
> +the match. For drivers like vfio-pci, it's a simple match to dev_name(),
> +which is unique in the system due to the PCI bus topology; other bus drivers
> +may need to include parent devices to create a unique match, so this is
> +left as a bus driver interface.
> +
> +-------------------------------------------------------------------------------
> +
> +[1] VFIO was originally an acronym for "Virtual Function I/O" in its
> +initial implementation by Tom Lyon while at Cisco. We've since outgrown
> +the acronym, but it's catchy.
> +
> +[2] As always there are trade-offs to virtual machine device
> +assignment that are beyond the scope of VFIO. It's expected that
> +future IOMMU technologies will reduce some, but maybe not all, of
> +these trade-offs.
> diff --git a/MAINTAINERS b/MAINTAINERS
> index f05f5f6..4bd5aa0 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -7106,6 +7106,14 @@ S: Maintained
> F: Documentation/filesystems/vfat.txt
> F: fs/fat/
>
> +VFIO DRIVER
> +M: Alex Williamson <address@hidden>
> +L: address@hidden
No vfio mailing list? Or a vfio-mailing list?
> +S: Maintained
> +F: Documentation/vfio.txt
> +F: drivers/vfio/
> +F: include/linux/vfio.h
> +
> VIDEOBUF2 FRAMEWORK
> M: Pawel Osciak <address@hidden>
> M: Marek Szyprowski <address@hidden>
> diff --git a/drivers/Kconfig b/drivers/Kconfig
> index b5e6f24..e15578b 100644
> --- a/drivers/Kconfig
> +++ b/drivers/Kconfig
> @@ -112,6 +112,8 @@ source "drivers/auxdisplay/Kconfig"
>
> source "drivers/uio/Kconfig"
>
> +source "drivers/vfio/Kconfig"
> +
> source "drivers/vlynq/Kconfig"
>
> source "drivers/virtio/Kconfig"
> diff --git a/drivers/Makefile b/drivers/Makefile
> index 1b31421..5f138b5 100644
> --- a/drivers/Makefile
> +++ b/drivers/Makefile
> @@ -58,6 +58,7 @@ obj-$(CONFIG_ATM) += atm/
> obj-$(CONFIG_FUSION) += message/
> obj-y += firewire/
> obj-$(CONFIG_UIO) += uio/
> +obj-$(CONFIG_VFIO) += vfio/
> obj-y += cdrom/
> obj-y += auxdisplay/
> obj-$(CONFIG_PCCARD) += pcmcia/
> diff --git a/drivers/vfio/Kconfig b/drivers/vfio/Kconfig
> new file mode 100644
> index 0000000..9acb1e7
> --- /dev/null
> +++ b/drivers/vfio/Kconfig
> @@ -0,0 +1,8 @@
> +menuconfig VFIO
> + tristate "VFIO Non-Privileged userspace driver framework"
> + depends on IOMMU_API
> + help
> + VFIO provides a framework for secure userspace device drivers.
> + See Documentation/vfio.txt for more details.
> +
> + If you don't know what to do here, say N.
> diff --git a/drivers/vfio/Makefile b/drivers/vfio/Makefile
> new file mode 100644
> index 0000000..088faf1
> --- /dev/null
> +++ b/drivers/vfio/Makefile
> @@ -0,0 +1,3 @@
> +vfio-y := vfio_main.o vfio_iommu.o
> +
> +obj-$(CONFIG_VFIO) := vfio.o
> diff --git a/drivers/vfio/vfio_iommu.c b/drivers/vfio/vfio_iommu.c
> new file mode 100644
> index 0000000..029dae3
> --- /dev/null
> +++ b/drivers/vfio/vfio_iommu.c
> @@ -0,0 +1,530 @@
> +/*
> + * VFIO: IOMMU DMA mapping support
> + *
> + * Copyright (C) 2011 Red Hat, Inc. All rights reserved.
> + * Author: Alex Williamson <address@hidden>
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License version 2 as
> + * published by the Free Software Foundation.
> + *
> + * Derived from original vfio:
> + * Copyright 2010 Cisco Systems, Inc. All rights reserved.
> + * Author: Tom Lyon, address@hidden
> + */
> +
> +#include <linux/compat.h>
> +#include <linux/device.h>
> +#include <linux/fs.h>
> +#include <linux/iommu.h>
> +#include <linux/module.h>
> +#include <linux/mm.h>
> +#include <linux/sched.h>
> +#include <linux/slab.h>
> +#include <linux/uaccess.h>
> +#include <linux/vfio.h>
> +#include <linux/workqueue.h>
> +
> +#include "vfio_private.h"
> +
> +struct dma_map_page {
> + struct list_head list;
> + dma_addr_t daddr;
> + unsigned long vaddr;
> + int npage;
> + int rdwr;
rdwr? Is this a flag thing? Could it be made in an enum?
> +};
> +
> +/*
> + * This code handles mapping and unmapping of user data buffers
> + * into DMA'ble space using the IOMMU
> + */
> +
> +#define NPAGE_TO_SIZE(npage) ((size_t)(npage) << PAGE_SHIFT)
> +
> +struct vwork {
> + struct mm_struct *mm;
> + int npage;
> + struct work_struct work;
> +};
> +
> +/* delayed decrement for locked_vm */
> +static void vfio_lock_acct_bg(struct work_struct *work)
> +{
> + struct vwork *vwork = container_of(work, struct vwork, work);
> + struct mm_struct *mm;
> +
> + mm = vwork->mm;
> + down_write(&mm->mmap_sem);
> + mm->locked_vm += vwork->npage;
> + up_write(&mm->mmap_sem);
> + mmput(mm); /* unref mm */
> + kfree(vwork);
> +}
> +
> +static void vfio_lock_acct(int npage)
> +{
> + struct vwork *vwork;
> + struct mm_struct *mm;
> +
> + if (!current->mm) {
> + /* process exited */
> + return;
> + }
> + if (down_write_trylock(&current->mm->mmap_sem)) {
> + current->mm->locked_vm += npage;
> + up_write(&current->mm->mmap_sem);
> + return;
> + }
> + /*
> + * Couldn't get mmap_sem lock, so must setup to decrement
> + * mm->locked_vm later. If locked_vm were atomic, we wouldn't
> + * need this silliness
> + */
> + vwork = kmalloc(sizeof(struct vwork), GFP_KERNEL);
> + if (!vwork)
> + return;
> + mm = get_task_mm(current); /* take ref mm */
> + if (!mm) {
> + kfree(vwork);
> + return;
> + }
> + INIT_WORK(&vwork->work, vfio_lock_acct_bg);
> + vwork->mm = mm;
> + vwork->npage = npage;
> + schedule_work(&vwork->work);
> +}
> +
> +/* Some mappings aren't backed by a struct page, for example an mmap'd
> + * MMIO range for our own or another device. These use a different
> + * pfn conversion and shouldn't be tracked as locked pages. */
> +static int is_invalid_reserved_pfn(unsigned long pfn)
static bool
> +{
> + if (pfn_valid(pfn)) {
> + int reserved;
> + struct page *tail = pfn_to_page(pfn);
> + struct page *head = compound_trans_head(tail);
> + reserved = PageReserved(head);
bool reserved = PageReserved(head);
> + if (head != tail) {
> + /* "head" is not a dangling pointer
> + * (compound_trans_head takes care of that)
> + * but the hugepage may have been split
> + * from under us (and we may not hold a
> + * reference count on the head page so it can
> + * be reused before we run PageReferenced), so
> + * we've to check PageTail before returning
> + * what we just read.
> + */
> + smp_rmb();
> + if (PageTail(tail))
> + return reserved;
> + }
> + return PageReserved(tail);
> + }
> +
> + return true;
> +}
> +
> +static int put_pfn(unsigned long pfn, int rdwr)
> +{
> + if (!is_invalid_reserved_pfn(pfn)) {
> + struct page *page = pfn_to_page(pfn);
> + if (rdwr)
> + SetPageDirty(page);
> + put_page(page);
> + return 1;
> + }
> + return 0;
> +}
> +
> +/* Unmap DMA region */
> +/* dgate must be held */
dgate?
> +static int __vfio_dma_unmap(struct vfio_iommu *iommu, unsigned long iova,
> + int npage, int rdwr)
> +{
> + int i, unlocked = 0;
> +
> + for (i = 0; i < npage; i++, iova += PAGE_SIZE) {
> + unsigned long pfn;
> +
> + pfn = iommu_iova_to_phys(iommu->domain, iova) >> PAGE_SHIFT;
> + if (pfn) {
> + iommu_unmap(iommu->domain, iova, 0);
What is the '0' for? Perhaps a comment: /* We only do zero order */
> + unlocked += put_pfn(pfn, rdwr);
> + }
> + }
> + return unlocked;
> +}
> +
> +static void vfio_dma_unmap(struct vfio_iommu *iommu, unsigned long iova,
> + unsigned long npage, int rdwr)
> +{
> + int unlocked;
> +
> + unlocked = __vfio_dma_unmap(iommu, iova, npage, rdwr);
> + vfio_lock_acct(-unlocked);
> +}
> +
> +/* Unmap ALL DMA regions */
> +void vfio_iommu_unmapall(struct vfio_iommu *iommu)
> +{
> + struct list_head *pos, *pos2;
pos2 should probably be just called 'tmp'
> + struct dma_map_page *mlp;
What does 'mlp' stand for?
mlp -> dma_page ?
> +
> + mutex_lock(&iommu->dgate);
> + list_for_each_safe(pos, pos2, &iommu->dm_list) {
> + mlp = list_entry(pos, struct dma_map_page, list);
> + vfio_dma_unmap(iommu, mlp->daddr, mlp->npage, mlp->rdwr);
Uh, so if it did not get put_page(), would we still try to delete it?
Couldn't that lead to corruption as the 'mlp' is returned to the pool?
Ah wait, the put_page is on the DMA page, so it is OK to
delete the tracking structure. It will be just a leaked page.
> + list_del(&mlp->list);
> + kfree(mlp);
> + }
> + mutex_unlock(&iommu->dgate);
> +}
> +
> +static int vaddr_get_pfn(unsigned long vaddr, int rdwr, unsigned long *pfn)
> +{
> + struct page *page[1];
> + struct vm_area_struct *vma;
> + int ret = -EFAULT;
> +
> + if (get_user_pages_fast(vaddr, 1, rdwr, page) == 1) {
> + *pfn = page_to_pfn(page[0]);
> + return 0;
> + }
> +
> + down_read(&current->mm->mmap_sem);
> +
> + vma = find_vma_intersection(current->mm, vaddr, vaddr + 1);
> +
> + if (vma && vma->vm_flags & VM_PFNMAP) {
> + *pfn = ((vaddr - vma->vm_start) >> PAGE_SHIFT) + vma->vm_pgoff;
> + if (is_invalid_reserved_pfn(*pfn))
> + ret = 0;
Did you mean to break here?
> + }
> +
> + up_read(&current->mm->mmap_sem);
> +
> + return ret;
> +}
> +
> +/* Map DMA region */
> +/* dgate must be held */
> +static int vfio_dma_map(struct vfio_iommu *iommu, unsigned long iova,
> + unsigned long vaddr, int npage, int rdwr)
> +{
> + unsigned long start = iova;
> + int i, ret, locked = 0, prot = IOMMU_READ;
> +
> + /* Verify pages are not already mapped */
I think a 'that' is missing above.
> + for (i = 0; i < npage; i++, iova += PAGE_SIZE)
> + if (iommu_iova_to_phys(iommu->domain, iova))
> + return -EBUSY;
> +
> + iova = start;
> +
> + if (rdwr)
> + prot |= IOMMU_WRITE;
> + if (iommu->cache)
> + prot |= IOMMU_CACHE;
> +
> + for (i = 0; i < npage; i++, iova += PAGE_SIZE, vaddr += PAGE_SIZE) {
> + unsigned long pfn = 0;
> +
> + ret = vaddr_get_pfn(vaddr, rdwr, &pfn);
> + if (ret) {
> + __vfio_dma_unmap(iommu, start, i, rdwr);
> + return ret;
> + }
> +
> + /* Only add actual locked pages to accounting */
> + if (!is_invalid_reserved_pfn(pfn))
> + locked++;
> +
> + ret = iommu_map(iommu->domain, iova,
> + (phys_addr_t)pfn << PAGE_SHIFT, 0, prot);
Put a comment by the 0 saying /* order 0 pages only! */
> + if (ret) {
> + /* Back out mappings on error */
> + put_pfn(pfn, rdwr);
> + __vfio_dma_unmap(iommu, start, i, rdwr);
> + return ret;
> + }
> + }
> + vfio_lock_acct(locked);
> + return 0;
> +}
> +
> +static inline int ranges_overlap(unsigned long start1, size_t size1,
Perhaps a bool?
> + unsigned long start2, size_t size2)
> +{
> + return !(start1 + size1 <= start2 || start2 + size2 <= start1);
> +}
> +
> +static struct dma_map_page *vfio_find_dma(struct vfio_iommu *iommu,
> + dma_addr_t start, size_t size)
> +{
> + struct list_head *pos;
> + struct dma_map_page *mlp;
> +
> + list_for_each(pos, &iommu->dm_list) {
> + mlp = list_entry(pos, struct dma_map_page, list);
> + if (ranges_overlap(mlp->daddr, NPAGE_TO_SIZE(mlp->npage),
> + start, size))
> + return mlp;
> + }
> + return NULL;
> +}
> +
> +int vfio_remove_dma_overlap(struct vfio_iommu *iommu, dma_addr_t start,
> + size_t size, struct dma_map_page *mlp)
> +{
> + struct dma_map_page *split;
> + int npage_lo, npage_hi;
> +
> + /* Existing dma region is completely covered, unmap all */
> + if (start <= mlp->daddr &&
> + start + size >= mlp->daddr + NPAGE_TO_SIZE(mlp->npage)) {
> + vfio_dma_unmap(iommu, mlp->daddr, mlp->npage, mlp->rdwr);
> + list_del(&mlp->list);
> + npage_lo = mlp->npage;
> + kfree(mlp);
> + return npage_lo;
> + }
> +
> + /* Overlap low address of existing range */
> + if (start <= mlp->daddr) {
> + size_t overlap;
> +
> + overlap = start + size - mlp->daddr;
> + npage_lo = overlap >> PAGE_SHIFT;
> + npage_hi = mlp->npage - npage_lo;
> +
> + vfio_dma_unmap(iommu, mlp->daddr, npage_lo, mlp->rdwr);
> + mlp->daddr += overlap;
> + mlp->vaddr += overlap;
> + mlp->npage -= npage_lo;
> + return npage_lo;
> + }
> +
> + /* Overlap high address of existing range */
> + if (start + size >= mlp->daddr + NPAGE_TO_SIZE(mlp->npage)) {
> + size_t overlap;
> +
> + overlap = mlp->daddr + NPAGE_TO_SIZE(mlp->npage) - start;
> + npage_hi = overlap >> PAGE_SHIFT;
> + npage_lo = mlp->npage - npage_hi;
> +
> + vfio_dma_unmap(iommu, start, npage_hi, mlp->rdwr);
> + mlp->npage -= npage_hi;
> + return npage_hi;
> + }
> +
> + /* Split existing */
> + npage_lo = (start - mlp->daddr) >> PAGE_SHIFT;
> + npage_hi = mlp->npage - (size >> PAGE_SHIFT) - npage_lo;
> +
> + split = kzalloc(sizeof *split, GFP_KERNEL);
> + if (!split)
> + return -ENOMEM;
> +
> + vfio_dma_unmap(iommu, start, size >> PAGE_SHIFT, mlp->rdwr);
> +
> + mlp->npage = npage_lo;
> +
> + split->npage = npage_hi;
> + split->daddr = start + size;
> + split->vaddr = mlp->vaddr + NPAGE_TO_SIZE(npage_lo) + size;
> + split->rdwr = mlp->rdwr;
> + list_add(&split->list, &iommu->dm_list);
> + return size >> PAGE_SHIFT;
> +}
> +
> +int vfio_dma_unmap_dm(struct vfio_iommu *iommu, struct vfio_dma_map *dmp)
> +{
> + int ret = 0;
> + size_t npage = dmp->size >> PAGE_SHIFT;
> + struct list_head *pos, *n;
> +
> + if (dmp->dmaaddr & ~PAGE_MASK)
> + return -EINVAL;
> + if (dmp->size & ~PAGE_MASK)
> + return -EINVAL;
> +
> + mutex_lock(&iommu->dgate);
> +
> + list_for_each_safe(pos, n, &iommu->dm_list) {
> + struct dma_map_page *mlp;
> +
> + mlp = list_entry(pos, struct dma_map_page, list);
> + if (ranges_overlap(mlp->daddr, NPAGE_TO_SIZE(mlp->npage),
> + dmp->dmaaddr, dmp->size)) {
> + ret = vfio_remove_dma_overlap(iommu, dmp->dmaaddr,
> + dmp->size, mlp);
> + if (ret > 0)
> + npage -= NPAGE_TO_SIZE(ret);
> + if (ret < 0 || npage == 0)
> + break;
> + }
> + }
> + mutex_unlock(&iommu->dgate);
> + return ret > 0 ? 0 : ret;
> +}
> +
> +int vfio_dma_map_dm(struct vfio_iommu *iommu, struct vfio_dma_map *dmp)
> +{
> + int npage;
> + struct dma_map_page *mlp, *mmlp = NULL;
> + dma_addr_t daddr = dmp->dmaaddr;
> + unsigned long locked, lock_limit, vaddr = dmp->vaddr;
> + size_t size = dmp->size;
> + int ret = 0, rdwr = dmp->flags & VFIO_DMA_MAP_FLAG_WRITE;
> +
> + if (vaddr & (PAGE_SIZE-1))
> + return -EINVAL;
> + if (daddr & (PAGE_SIZE-1))
> + return -EINVAL;
> + if (size & (PAGE_SIZE-1))
> + return -EINVAL;
> +
> + npage = size >> PAGE_SHIFT;
> + if (!npage)
> + return -EINVAL;
> +
> + if (!iommu)
> + return -EINVAL;
> +
> + mutex_lock(&iommu->dgate);
> +
> + if (vfio_find_dma(iommu, daddr, size)) {
> + ret = -EBUSY;
> + goto out_lock;
> + }
> +
> + /* account for locked pages */
> + locked = current->mm->locked_vm + npage;
> + lock_limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
> + if (locked > lock_limit && !capable(CAP_IPC_LOCK)) {
> + printk(KERN_WARNING "%s: RLIMIT_MEMLOCK (%ld) exceeded\n",
> + __func__, rlimit(RLIMIT_MEMLOCK));
> + ret = -ENOMEM;
> + goto out_lock;
> + }
> +
> + ret = vfio_dma_map(iommu, daddr, vaddr, npage, rdwr);
> + if (ret)
> + goto out_lock;
> +
> + /* Check if we abut a region below */
> + if (daddr) {
> + mlp = vfio_find_dma(iommu, daddr - 1, 1);
> + if (mlp && mlp->rdwr == rdwr &&
> + mlp->vaddr + NPAGE_TO_SIZE(mlp->npage) == vaddr) {
> +
> + mlp->npage += npage;
> + daddr = mlp->daddr;
> + vaddr = mlp->vaddr;
> + npage = mlp->npage;
> + size = NPAGE_TO_SIZE(npage);
> +
> + mmlp = mlp;
> + }
> + }
> +
> + if (daddr + size) {
> + mlp = vfio_find_dma(iommu, daddr + size, 1);
> + if (mlp && mlp->rdwr == rdwr && mlp->vaddr == vaddr + size) {
> +
> + mlp->npage += npage;
> + mlp->daddr = daddr;
> + mlp->vaddr = vaddr;
> +
> + /* If merged above and below, remove previously
> + * merged entry. New entry covers it. */
> + if (mmlp) {
> + list_del(&mmlp->list);
> + kfree(mmlp);
> + }
> + mmlp = mlp;
> + }
> + }
> +
> + if (!mmlp) {
> + mlp = kzalloc(sizeof *mlp, GFP_KERNEL);
> + if (!mlp) {
> + ret = -ENOMEM;
> + vfio_dma_unmap(iommu, daddr, npage, rdwr);
> + goto out_lock;
> + }
> +
> + mlp->npage = npage;
> + mlp->daddr = daddr;
> + mlp->vaddr = vaddr;
> + mlp->rdwr = rdwr;
> + list_add(&mlp->list, &iommu->dm_list);
> + }
> +
> +out_lock:
> + mutex_unlock(&iommu->dgate);
> + return ret;
> +}
> +
> +static int vfio_iommu_release(struct inode *inode, struct file *filep)
> +{
> + struct vfio_iommu *iommu = filep->private_data;
> +
> + vfio_release_iommu(iommu);
> + return 0;
> +}
> +
> +static long vfio_iommu_unl_ioctl(struct file *filep,
> + unsigned int cmd, unsigned long arg)
> +{
> + struct vfio_iommu *iommu = filep->private_data;
> + int ret = -ENOSYS;
> +
> + if (cmd == VFIO_IOMMU_GET_FLAGS) {
Something looks weird with the indentation here.
> + u64 flags = VFIO_IOMMU_FLAGS_MAP_ANY;
> +
> + ret = put_user(flags, (u64 __user *)arg);
> +
> + } else if (cmd == VFIO_IOMMU_MAP_DMA) {
> + struct vfio_dma_map dm;
> +
> + if (copy_from_user(&dm, (void __user *)arg, sizeof dm))
> + return -EFAULT;
> +
> + ret = vfio_dma_map_dm(iommu, &dm);
> +
> + if (!ret && copy_to_user((void __user *)arg, &dm, sizeof dm))
> + ret = -EFAULT;
> +
> + } else if (cmd == VFIO_IOMMU_UNMAP_DMA) {
> + struct vfio_dma_map dm;
> +
> + if (copy_from_user(&dm, (void __user *)arg, sizeof dm))
> + return -EFAULT;
> +
> + ret = vfio_dma_unmap_dm(iommu, &dm);
> +
> + if (!ret && copy_to_user((void __user *)arg, &dm, sizeof dm))
> + ret = -EFAULT;
> + }
> + return ret;
> +}
> +
> +#ifdef CONFIG_COMPAT
> +static long vfio_iommu_compat_ioctl(struct file *filep,
> + unsigned int cmd, unsigned long arg)
> +{
> + arg = (unsigned long)compat_ptr(arg);
> + return vfio_iommu_unl_ioctl(filep, cmd, arg);
> +}
> +#endif /* CONFIG_COMPAT */
> +
> +const struct file_operations vfio_iommu_fops = {
> + .owner = THIS_MODULE,
> + .release = vfio_iommu_release,
> + .unlocked_ioctl = vfio_iommu_unl_ioctl,
> +#ifdef CONFIG_COMPAT
> + .compat_ioctl = vfio_iommu_compat_ioctl,
> +#endif
> +};
> diff --git a/drivers/vfio/vfio_main.c b/drivers/vfio/vfio_main.c
> new file mode 100644
> index 0000000..6169356
> --- /dev/null
> +++ b/drivers/vfio/vfio_main.c
> @@ -0,0 +1,1151 @@
> +/*
> + * VFIO framework
> + *
> + * Copyright (C) 2011 Red Hat, Inc. All rights reserved.
> + * Author: Alex Williamson <address@hidden>
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License version 2 as
> + * published by the Free Software Foundation.
> + *
> + * Derived from original vfio:
> + * Copyright 2010 Cisco Systems, Inc. All rights reserved.
> + * Author: Tom Lyon, address@hidden
> + */
> +
> +#include <linux/cdev.h>
> +#include <linux/compat.h>
> +#include <linux/device.h>
> +#include <linux/file.h>
> +#include <linux/anon_inodes.h>
> +#include <linux/fs.h>
> +#include <linux/idr.h>
> +#include <linux/iommu.h>
> +#include <linux/mm.h>
> +#include <linux/module.h>
> +#include <linux/slab.h>
> +#include <linux/string.h>
> +#include <linux/uaccess.h>
> +#include <linux/vfio.h>
> +#include <linux/wait.h>
> +
> +#include "vfio_private.h"
> +
> +#define DRIVER_VERSION "0.2"
> +#define DRIVER_AUTHOR "Alex Williamson <address@hidden>"
> +#define DRIVER_DESC "VFIO - User Level meta-driver"
> +
> +static int allow_unsafe_intrs;
Perhaps mark this __read_mostly, since it is only set at module load time?
> +module_param(allow_unsafe_intrs, int, 0);
Perhaps S_IRUGO for the permissions, so the value is readable via sysfs?
> +MODULE_PARM_DESC(allow_unsafe_intrs,
> + "Allow use of IOMMUs which do not support interrupt remapping");
> +
> +static struct vfio {
> + dev_t devt;
> + struct cdev cdev;
> + struct list_head group_list;
> + struct mutex lock;
> + struct kref kref;
> + struct class *class;
> + struct idr idr;
> + wait_queue_head_t release_q;
> +} vfio;
You probably want to move this below 'struct vfio_group', as vfio
contains a list of vfio_group structures.
> +
> +static const struct file_operations vfio_group_fops;
> +extern const struct file_operations vfio_iommu_fops;
> +
> +struct vfio_group {
> + dev_t devt;
> + unsigned int groupid;
> + struct bus_type *bus;
> + struct vfio_iommu *iommu;
> + struct list_head device_list;
> + struct list_head iommu_next;
> + struct list_head group_next;
> + int refcnt;
> +};
> +
> +struct vfio_device {
> + struct device *dev;
> + const struct vfio_device_ops *ops;
> + struct vfio_iommu *iommu;
> + struct vfio_group *group;
> + struct list_head device_next;
> + bool attached;
> + int refcnt;
> + void *device_data;
> +};
And perhaps move this above 'struct vfio_group', as vfio_group
contains a list of these structures?
> +
> +/*
> + * Helper functions called under vfio.lock
> + */
> +
> +/* Return true if any devices within a group are opened */
> +static bool __vfio_group_devs_inuse(struct vfio_group *group)
> +{
> + struct list_head *pos;
> +
> + list_for_each(pos, &group->device_list) {
> + struct vfio_device *device;
> +
> + device = list_entry(pos, struct vfio_device, device_next);
> + if (device->refcnt)
> + return true;
> + }
> + return false;
> +}
> +
> +/* Return true if any of the groups attached to an iommu are opened.
> + * We can only tear apart merged groups when nothing is left open. */
> +static bool __vfio_iommu_groups_inuse(struct vfio_iommu *iommu)
> +{
> + struct list_head *pos;
> +
> + list_for_each(pos, &iommu->group_list) {
> + struct vfio_group *group;
> +
> + group = list_entry(pos, struct vfio_group, iommu_next);
> + if (group->refcnt)
> + return true;
> + }
> + return false;
> +}
> +
> +/* An iommu is "in use" if it has a file descriptor open or if any of
> + * the groups assigned to the iommu have devices open. */
> +static bool __vfio_iommu_inuse(struct vfio_iommu *iommu)
> +{
> + struct list_head *pos;
> +
> + if (iommu->refcnt)
> + return true;
> +
> + list_for_each(pos, &iommu->group_list) {
> + struct vfio_group *group;
> +
> + group = list_entry(pos, struct vfio_group, iommu_next);
> +
> + if (__vfio_group_devs_inuse(group))
> + return true;
> + }
> + return false;
> +}
> +
> +static void __vfio_group_set_iommu(struct vfio_group *group,
> + struct vfio_iommu *iommu)
> +{
> + struct list_head *pos;
> +
> + if (group->iommu)
> + list_del(&group->iommu_next);
> + if (iommu)
> + list_add(&group->iommu_next, &iommu->group_list);
> +
> + group->iommu = iommu;
> +
> + list_for_each(pos, &group->device_list) {
> + struct vfio_device *device;
> +
> + device = list_entry(pos, struct vfio_device, device_next);
> + device->iommu = iommu;
> + }
> +}
> +
> +static void __vfio_iommu_detach_dev(struct vfio_iommu *iommu,
> + struct vfio_device *device)
> +{
> + BUG_ON(!iommu->domain && device->attached);
Whoa, heavy hammer there.
Perhaps WARN_ON instead, since you check the same condition right below anyway.
> +
> + if (!iommu->domain || !device->attached)
> + return;
> +
> + iommu_detach_device(iommu->domain, device->dev);
> + device->attached = false;
> +}
> +
> +static void __vfio_iommu_detach_group(struct vfio_iommu *iommu,
> + struct vfio_group *group)
> +{
> + struct list_head *pos;
> +
> + list_for_each(pos, &group->device_list) {
> + struct vfio_device *device;
> +
> + device = list_entry(pos, struct vfio_device, device_next);
> + __vfio_iommu_detach_dev(iommu, device);
> + }
> +}
> +
> +static int __vfio_iommu_attach_dev(struct vfio_iommu *iommu,
> + struct vfio_device *device)
> +{
> + int ret;
> +
> + BUG_ON(device->attached);
How about:

	WARN(device->attached,
	     "The engineer who wrote the user-space device driver is trying "
	     "to register the device again! Tell him/her to stop, please.\n");

(note it has to be WARN(), not WARN_ON() - WARN_ON() only takes the
condition; the variant that takes a message is WARN().)
> +
> + if (!iommu || !iommu->domain)
> + return -EINVAL;
> +
> + ret = iommu_attach_device(iommu->domain, device->dev);
> + if (!ret)
> + device->attached = true;
> +
> + return ret;
> +}
> +
> +static int __vfio_iommu_attach_group(struct vfio_iommu *iommu,
> + struct vfio_group *group)
> +{
> + struct list_head *pos;
> +
> + list_for_each(pos, &group->device_list) {
> + struct vfio_device *device;
> + int ret;
> +
> + device = list_entry(pos, struct vfio_device, device_next);
> + ret = __vfio_iommu_attach_dev(iommu, device);
> + if (ret) {
> + __vfio_iommu_detach_group(iommu, group);
> + return ret;
> + }
> + }
> + return 0;
> +}
> +
> +/* The iommu is viable, ie. ready to be configured, when all the devices
> + * for all the groups attached to the iommu are bound to their vfio device
> + * drivers (ex. vfio-pci). This sets the device_data private data pointer. */
> +static bool __vfio_iommu_viable(struct vfio_iommu *iommu)
> +{
> + struct list_head *gpos, *dpos;
> +
> + list_for_each(gpos, &iommu->group_list) {
> + struct vfio_group *group;
> + group = list_entry(gpos, struct vfio_group, iommu_next);
> +
> + list_for_each(dpos, &group->device_list) {
> + struct vfio_device *device;
> + device = list_entry(dpos,
> + struct vfio_device, device_next);
> +
> + if (!device->device_data)
> + return false;
> + }
> + }
> + return true;
> +}
> +
> +static void __vfio_close_iommu(struct vfio_iommu *iommu)
> +{
> + struct list_head *pos;
> +
> + if (!iommu->domain)
> + return;
> +
> + list_for_each(pos, &iommu->group_list) {
> + struct vfio_group *group;
> + group = list_entry(pos, struct vfio_group, iommu_next);
> +
> + __vfio_iommu_detach_group(iommu, group);
> + }
> +
> + vfio_iommu_unmapall(iommu);
> +
> + iommu_domain_free(iommu->domain);
> + iommu->domain = NULL;
> + iommu->mm = NULL;
> +}
> +
> +/* Open the IOMMU. This gates all access to the iommu or device file
> + * descriptors and sets current->mm as the exclusive user. */
> +static int __vfio_open_iommu(struct vfio_iommu *iommu)
> +{
> + struct list_head *pos;
> + int ret;
> +
> + if (!__vfio_iommu_viable(iommu))
> + return -EBUSY;
> +
> + if (iommu->domain)
> + return -EINVAL;
> +
> + iommu->domain = iommu_domain_alloc(iommu->bus);
> + if (!iommu->domain)
> + return -EFAULT;
Shouldn't that be -ENOMEM?
> +
> + list_for_each(pos, &iommu->group_list) {
> + struct vfio_group *group;
> + group = list_entry(pos, struct vfio_group, iommu_next);
> +
> + ret = __vfio_iommu_attach_group(iommu, group);
> + if (ret) {
> + __vfio_close_iommu(iommu);
> + return ret;
> + }
> + }
> +
> + if (!allow_unsafe_intrs &&
> + !iommu_domain_has_cap(iommu->domain, IOMMU_CAP_INTR_REMAP)) {
> + __vfio_close_iommu(iommu);
> + return -EFAULT;
> + }
> +
> + iommu->cache = (iommu_domain_has_cap(iommu->domain,
> + IOMMU_CAP_CACHE_COHERENCY) != 0);
> + iommu->mm = current->mm;
> +
> + return 0;
> +}
> +
> +/* Actively try to tear down the iommu and merged groups. If there are no
> + * open iommu or device fds, we close the iommu. If we close the iommu and
> + * there are also no open group fds, we can further dissolve the group to
> + * iommu association and free the iommu data structure. */
> +static int __vfio_try_dissolve_iommu(struct vfio_iommu *iommu)
> +{
> +
> + if (__vfio_iommu_inuse(iommu))
> + return -EBUSY;
> +
> + __vfio_close_iommu(iommu);
> +
> + if (!__vfio_iommu_groups_inuse(iommu)) {
> + struct list_head *pos, *ppos;
> +
> + list_for_each_safe(pos, ppos, &iommu->group_list) {
> + struct vfio_group *group;
> +
> + group = list_entry(pos, struct vfio_group, iommu_next);
> + __vfio_group_set_iommu(group, NULL);
> + }
> +
> +
> + kfree(iommu);
> + }
> +
> + return 0;
> +}
> +
> +static struct vfio_device *__vfio_lookup_dev(struct device *dev)
> +{
> + struct list_head *gpos;
> + unsigned int groupid;
> +
> + if (iommu_device_group(dev, &groupid))
Hmm, where is this defined? v3.2-rc1 does not seem to have it?
> + return NULL;
> +
> + list_for_each(gpos, &vfio.group_list) {
> + struct vfio_group *group;
> + struct list_head *dpos;
> +
> + group = list_entry(gpos, struct vfio_group, group_next);
> +
> + if (group->groupid != groupid)
> + continue;
> +
> + list_for_each(dpos, &group->device_list) {
> + struct vfio_device *device;
> +
> + device = list_entry(dpos,
> + struct vfio_device, device_next);
> +
> + if (device->dev == dev)
> + return device;
> + }
> + }
> + return NULL;
> +}
> +
> +/* All release paths simply decrement the refcnt, attempt to teardown
> + * the iommu and merged groups, and wakeup anything that might be
> + * waiting if we successfully dissolve anything. */
> +static int vfio_do_release(int *refcnt, struct vfio_iommu *iommu)
> +{
> + bool wake;
> +
> + mutex_lock(&vfio.lock);
> +
> + (*refcnt)--;
> + wake = (__vfio_try_dissolve_iommu(iommu) == 0);
> +
> + mutex_unlock(&vfio.lock);
> +
> + if (wake)
> + wake_up(&vfio.release_q);
> +
> + return 0;
> +}
> +
> +/*
> + * Device fops - passthrough to vfio device driver w/ device_data
> + */
> +static int vfio_device_release(struct inode *inode, struct file *filep)
> +{
> + struct vfio_device *device = filep->private_data;
> +
> + vfio_do_release(&device->refcnt, device->iommu);
> +
> + device->ops->put(device->device_data);
> +
> + return 0;
> +}
> +
> +static long vfio_device_unl_ioctl(struct file *filep,
> + unsigned int cmd, unsigned long arg)
> +{
> + struct vfio_device *device = filep->private_data;
> +
> + return device->ops->ioctl(device->device_data, cmd, arg);
> +}
> +
> +static ssize_t vfio_device_read(struct file *filep, char __user *buf,
> + size_t count, loff_t *ppos)
> +{
> + struct vfio_device *device = filep->private_data;
> +
> + return device->ops->read(device->device_data, buf, count, ppos);
> +}
> +
> +static ssize_t vfio_device_write(struct file *filep, const char __user *buf,
> + size_t count, loff_t *ppos)
> +{
> + struct vfio_device *device = filep->private_data;
> +
> + return device->ops->write(device->device_data, buf, count, ppos);
> +}
> +
> +static int vfio_device_mmap(struct file *filep, struct vm_area_struct *vma)
> +{
> + struct vfio_device *device = filep->private_data;
> +
> + return device->ops->mmap(device->device_data, vma);
> +}
> +
> +#ifdef CONFIG_COMPAT
> +static long vfio_device_compat_ioctl(struct file *filep,
> + unsigned int cmd, unsigned long arg)
> +{
> + arg = (unsigned long)compat_ptr(arg);
> + return vfio_device_unl_ioctl(filep, cmd, arg);
> +}
> +#endif /* CONFIG_COMPAT */
> +
> +const struct file_operations vfio_device_fops = {
> + .owner = THIS_MODULE,
> + .release = vfio_device_release,
> + .read = vfio_device_read,
> + .write = vfio_device_write,
> + .unlocked_ioctl = vfio_device_unl_ioctl,
> +#ifdef CONFIG_COMPAT
> + .compat_ioctl = vfio_device_compat_ioctl,
> +#endif
> + .mmap = vfio_device_mmap,
> +};
> +
> +/*
> + * Group fops
> + */
> +static int vfio_group_open(struct inode *inode, struct file *filep)
> +{
> + struct vfio_group *group;
> + int ret = 0;
> +
> + mutex_lock(&vfio.lock);
> +
> + group = idr_find(&vfio.idr, iminor(inode));
> +
> + if (!group) {
> + ret = -ENODEV;
> + goto out;
> + }
> +
> + filep->private_data = group;
> +
> + if (!group->iommu) {
> + struct vfio_iommu *iommu;
> +
> + iommu = kzalloc(sizeof(*iommu), GFP_KERNEL);
> + if (!iommu) {
> + ret = -ENOMEM;
> + goto out;
> + }
> + INIT_LIST_HEAD(&iommu->group_list);
> + INIT_LIST_HEAD(&iommu->dm_list);
> + mutex_init(&iommu->dgate);
> + iommu->bus = group->bus;
> + __vfio_group_set_iommu(group, iommu);
> + }
> + group->refcnt++;
> +
> +out:
> + mutex_unlock(&vfio.lock);
> +
> + return ret;
> +}
> +
> +static int vfio_group_release(struct inode *inode, struct file *filep)
> +{
> + struct vfio_group *group = filep->private_data;
> +
> + return vfio_do_release(&group->refcnt, group->iommu);
> +}
> +
> +/* Attempt to merge the group pointed to by fd into group. The merge-ee
> + * group must not have an iommu or any devices open because we cannot
> + * maintain that context across the merge. The merge-er group can be
> + * in use. */
> +static int vfio_group_merge(struct vfio_group *group, int fd)
> +{
> + struct vfio_group *new;
> + struct vfio_iommu *old_iommu;
> + struct file *file;
> + int ret = 0;
> + bool opened = false;
> +
> + mutex_lock(&vfio.lock);
> +
> + file = fget(fd);
> + if (!file) {
> + ret = -EBADF;
> + goto out_noput;
> + }
> +
> + /* Sanity check, is this really our fd? */
> + if (file->f_op != &vfio_group_fops) {
> + ret = -EINVAL;
> + goto out;
> + }
> +
> + new = file->private_data;
> +
> + if (!new || new == group || !new->iommu ||
> + new->iommu->domain || new->bus != group->bus) {
> + ret = -EINVAL;
> + goto out;
> + }
> +
> + /* We need to attach all the devices to each domain separately
> + * in order to validate that the capabilities match for both. */
> + ret = __vfio_open_iommu(new->iommu);
> + if (ret)
> + goto out;
> +
> + if (!group->iommu->domain) {
> + ret = __vfio_open_iommu(group->iommu);
> + if (ret)
> + goto out;
> + opened = true;
> + }
> +
> + /* If cache coherency doesn't match we'd potentially need to
> + * remap existing iommu mappings in the merge-er domain.
> + * Poor return to bother trying to allow this currently. */
> + if (iommu_domain_has_cap(group->iommu->domain,
> + IOMMU_CAP_CACHE_COHERENCY) !=
> + iommu_domain_has_cap(new->iommu->domain,
> + IOMMU_CAP_CACHE_COHERENCY)) {
> + __vfio_close_iommu(new->iommu);
> + if (opened)
> + __vfio_close_iommu(group->iommu);
> + ret = -EINVAL;
> + goto out;
> + }
> +
> + /* Close the iommu for the merge-ee and attach all its devices
> + * to the merge-er iommu. */
> + __vfio_close_iommu(new->iommu);
> +
> + ret = __vfio_iommu_attach_group(group->iommu, new);
> + if (ret)
> + goto out;
> +
> + /* set_iommu unlinks new from the iommu, so save a pointer to it */
> + old_iommu = new->iommu;
> + __vfio_group_set_iommu(new, group->iommu);
> + kfree(old_iommu);
> +
> +out:
> + fput(file);
> +out_noput:
> + mutex_unlock(&vfio.lock);
> + return ret;
> +}
> +
> +/* Unmerge the group pointed to by fd from group. */
> +static int vfio_group_unmerge(struct vfio_group *group, int fd)
> +{
> + struct vfio_group *new;
> + struct vfio_iommu *new_iommu;
> + struct file *file;
> + int ret = 0;
> +
> + /* Since the merge-out group is already opened, it needs to
> + * have an iommu struct associated with it. */
> + new_iommu = kzalloc(sizeof(*new_iommu), GFP_KERNEL);
> + if (!new_iommu)
> + return -ENOMEM;
> +
> + INIT_LIST_HEAD(&new_iommu->group_list);
> + INIT_LIST_HEAD(&new_iommu->dm_list);
> + mutex_init(&new_iommu->dgate);
> + new_iommu->bus = group->bus;
> +
> + mutex_lock(&vfio.lock);
> +
> + file = fget(fd);
> + if (!file) {
> + ret = -EBADF;
> + goto out_noput;
> + }
> +
> + /* Sanity check, is this really our fd? */
> + if (file->f_op != &vfio_group_fops) {
> + ret = -EINVAL;
> + goto out;
> + }
> +
> + new = file->private_data;
> + if (!new || new == group || new->iommu != group->iommu) {
> + ret = -EINVAL;
> + goto out;
> + }
> +
> + /* We can't merge-out a group with devices still in use. */
> + if (__vfio_group_devs_inuse(new)) {
> + ret = -EBUSY;
> + goto out;
> + }
> +
> + __vfio_iommu_detach_group(group->iommu, new);
> + __vfio_group_set_iommu(new, new_iommu);
> +
> +out:
> + fput(file);
> +out_noput:
> + if (ret)
> + kfree(new_iommu);
> + mutex_unlock(&vfio.lock);
> + return ret;
> +}
> +
> +/* Get a new iommu file descriptor. This will open the iommu, setting
> + * the current->mm ownership if it's not already set. */
> +static int vfio_group_get_iommu_fd(struct vfio_group *group)
> +{
> + int ret = 0;
> +
> + mutex_lock(&vfio.lock);
> +
> + if (!group->iommu->domain) {
> + ret = __vfio_open_iommu(group->iommu);
> + if (ret)
> + goto out;
> + }
> +
> + ret = anon_inode_getfd("[vfio-iommu]", &vfio_iommu_fops,
> + group->iommu, O_RDWR);
> + if (ret < 0)
> + goto out;
> +
> + group->iommu->refcnt++;
> +out:
> + mutex_unlock(&vfio.lock);
> + return ret;
> +}
> +
> +/* Get a new device file descriptor. This will open the iommu, setting
> + * the current->mm ownership if it's not already set. It's difficult to
> + * specify the requirements for matching a user supplied buffer to a
> + * device, so we use a vfio driver callback to test for a match. For
> + * PCI, dev_name(dev) is unique, but other drivers may require including
> + * a parent device string. */
> +static int vfio_group_get_device_fd(struct vfio_group *group, char *buf)
> +{
> + struct vfio_iommu *iommu = group->iommu;
> + struct list_head *gpos;
> + int ret = -ENODEV;
> +
> + mutex_lock(&vfio.lock);
> +
> + if (!iommu->domain) {
> + ret = __vfio_open_iommu(iommu);
> + if (ret)
> + goto out;
> + }
> +
> + list_for_each(gpos, &iommu->group_list) {
> + struct list_head *dpos;
> +
> + group = list_entry(gpos, struct vfio_group, iommu_next);
> +
> + list_for_each(dpos, &group->device_list) {
> + struct vfio_device *device;
> +
> + device = list_entry(dpos,
> + struct vfio_device, device_next);
> +
> + if (device->ops->match(device->dev, buf)) {
> + struct file *file;
> +
> + if (device->ops->get(device->device_data)) {
> + ret = -EFAULT;
> + goto out;
> + }
> +
> + /* We can't use anon_inode_getfd(), like above
> + * because we need to modify the f_mode flags
> + * directly to allow more than just ioctls */
> + ret = get_unused_fd();
> + if (ret < 0) {
> + device->ops->put(device->device_data);
> + goto out;
> + }
> +
> + file = anon_inode_getfile("[vfio-device]",
> + &vfio_device_fops,
> + device, O_RDWR);
> + if (IS_ERR(file)) {
> + put_unused_fd(ret);
> + ret = PTR_ERR(file);
> + device->ops->put(device->device_data);
> + goto out;
> + }
> +
> + /* Todo: add an anon_inode interface to do
> + * this. Appears to be missing by lack of
> + * need rather than explicitly prevented.
> + * Now there's need. */
> + file->f_mode |= (FMODE_LSEEK |
> + FMODE_PREAD |
> + FMODE_PWRITE);
> +
> + fd_install(ret, file);
> +
> + device->refcnt++;
> + goto out;
> + }
> + }
> + }
> +out:
> + mutex_unlock(&vfio.lock);
> + return ret;
> +}
> +
> +static long vfio_group_unl_ioctl(struct file *filep,
> + unsigned int cmd, unsigned long arg)
> +{
> + struct vfio_group *group = filep->private_data;
> +
> + if (cmd == VFIO_GROUP_GET_FLAGS) {
> + u64 flags = 0;
> +
> + mutex_lock(&vfio.lock);
> + if (__vfio_iommu_viable(group->iommu))
> + flags |= VFIO_GROUP_FLAGS_VIABLE;
> + mutex_unlock(&vfio.lock);
> +
> + if (group->iommu->mm)
> + flags |= VFIO_GROUP_FLAGS_MM_LOCKED;
> +
> + return put_user(flags, (u64 __user *)arg);
> + }
> +
> + /* Below commands are restricted once the mm is set */
> + if (group->iommu->mm && group->iommu->mm != current->mm)
> + return -EPERM;
> +
> + if (cmd == VFIO_GROUP_MERGE || cmd == VFIO_GROUP_UNMERGE) {
> + int fd;
> +
> + if (get_user(fd, (int __user *)arg))
> + return -EFAULT;
> + if (fd < 0)
> + return -EINVAL;
> +
> + if (cmd == VFIO_GROUP_MERGE)
> + return vfio_group_merge(group, fd);
> + else
> + return vfio_group_unmerge(group, fd);
> + } else if (cmd == VFIO_GROUP_GET_IOMMU_FD) {
> + return vfio_group_get_iommu_fd(group);
> + } else if (cmd == VFIO_GROUP_GET_DEVICE_FD) {
> + char *buf;
> + int ret;
> +
> + buf = strndup_user((const char __user *)arg, PAGE_SIZE);
> + if (IS_ERR(buf))
> + return PTR_ERR(buf);
> +
> + ret = vfio_group_get_device_fd(group, buf);
> + kfree(buf);
> + return ret;
> + }
> +
> + return -ENOSYS;
> +}
> +
> +#ifdef CONFIG_COMPAT
> +static long vfio_group_compat_ioctl(struct file *filep,
> + unsigned int cmd, unsigned long arg)
> +{
> + arg = (unsigned long)compat_ptr(arg);
> + return vfio_group_unl_ioctl(filep, cmd, arg);
> +}
> +#endif /* CONFIG_COMPAT */
> +
> +static const struct file_operations vfio_group_fops = {
> + .owner = THIS_MODULE,
> + .open = vfio_group_open,
> + .release = vfio_group_release,
> + .unlocked_ioctl = vfio_group_unl_ioctl,
> +#ifdef CONFIG_COMPAT
> + .compat_ioctl = vfio_group_compat_ioctl,
> +#endif
> +};
> +
> +/* iommu fd release hook */
> +int vfio_release_iommu(struct vfio_iommu *iommu)
> +{
> + return vfio_do_release(&iommu->refcnt, iommu);
> +}
> +
> +/*
> + * VFIO driver API
> + */
> +
> +/* Add a new device to the vfio framework with associated vfio driver
> + * callbacks. This is the entry point for vfio drivers to register devices. */
> +int vfio_group_add_dev(struct device *dev, const struct vfio_device_ops *ops)
> +{
> + struct list_head *pos;
> + struct vfio_group *group = NULL;
> + struct vfio_device *device = NULL;
> + unsigned int groupid;
> + int ret = 0;
> + bool new_group = false;
> +
> + if (!ops)
> + return -EINVAL;
> +
> + if (iommu_device_group(dev, &groupid))
> + return -ENODEV;
> +
> + mutex_lock(&vfio.lock);
> +
> + list_for_each(pos, &vfio.group_list) {
> + group = list_entry(pos, struct vfio_group, group_next);
> + if (group->groupid == groupid)
> + break;
> + group = NULL;
> + }
> +
> + if (!group) {
> + int minor;
> +
> + if (unlikely(idr_pre_get(&vfio.idr, GFP_KERNEL) == 0)) {
> + ret = -ENOMEM;
> + goto out;
> + }
> +
> + group = kzalloc(sizeof(*group), GFP_KERNEL);
> + if (!group) {
> + ret = -ENOMEM;
> + goto out;
> + }
> +
> + group->groupid = groupid;
> + INIT_LIST_HEAD(&group->device_list);
> +
> + ret = idr_get_new(&vfio.idr, group, &minor);
> + if (ret == 0 && minor > MINORMASK) {
> + idr_remove(&vfio.idr, minor);
> + kfree(group);
> + ret = -ENOSPC;
> + goto out;
> + }
> +
> + group->devt = MKDEV(MAJOR(vfio.devt), minor);
> + device_create(vfio.class, NULL, group->devt,
> + group, "%u", groupid);
> +
> + group->bus = dev->bus;
Oh, so that is how the IOMMU iommu_ops get copied! You might
want to mention that - I was not sure where the 'handoff' was
done to insert a device so that it can do iommu_ops properly.
Ok, so the point at which we detect whether a device can do
IOMMU is when we try to open it - that is when iommu_domain_alloc()
is called, which can return NULL if iommu_ops is not set.
So what about devices that don't have iommu_ops? Say they are
using SWIOTLB (like AMD-Vi sometimes does if the device is not
on its list)?
Can we use iommu_present()?
> + list_add(&group->group_next, &vfio.group_list);
> + new_group = true;
> + } else {
> + if (group->bus != dev->bus) {
> + printk(KERN_WARNING
> + "Error: IOMMU group ID conflict. Group ID %u "
> + "on both bus %s and %s\n", groupid,
> + group->bus->name, dev->bus->name);
> + ret = -EFAULT;
> + goto out;
> + }
> +
> + list_for_each(pos, &group->device_list) {
> + device = list_entry(pos,
> + struct vfio_device, device_next);
> + if (device->dev == dev)
> + break;
> + device = NULL;
> + }
> + }
> +
> + if (!device) {
> + if (__vfio_group_devs_inuse(group) ||
> + (group->iommu && group->iommu->refcnt)) {
> + printk(KERN_WARNING
> + "Adding device %s to group %u while group is already in use!!\n",
> + dev_name(dev), group->groupid);
> + /* XXX How to prevent other drivers from claiming? */
> + }
> +
> + device = kzalloc(sizeof(*device), GFP_KERNEL);
> + if (!device) {
> + /* If we just created this group, tear it down */
> + if (new_group) {
> + list_del(&group->group_next);
> + device_destroy(vfio.class, group->devt);
> + idr_remove(&vfio.idr, MINOR(group->devt));
> + kfree(group);
> + }
> + ret = -ENOMEM;
> + goto out;
> + }
> +
> + list_add(&device->device_next, &group->device_list);
> + device->dev = dev;
> + device->ops = ops;
> + device->iommu = group->iommu; /* NULL if new */
> + __vfio_iommu_attach_dev(group->iommu, device);
> + }
> +out:
> + mutex_unlock(&vfio.lock);
> + return ret;
> +}
> +EXPORT_SYMBOL_GPL(vfio_group_add_dev);
> +
> +/* Remove a device from the vfio framework */
> +void vfio_group_del_dev(struct device *dev)
> +{
> + struct list_head *pos;
> + struct vfio_group *group = NULL;
> + struct vfio_device *device = NULL;
> + unsigned int groupid;
> +
> + if (iommu_device_group(dev, &groupid))
> + return;
> +
> + mutex_lock(&vfio.lock);
> +
> + list_for_each(pos, &vfio.group_list) {
> + group = list_entry(pos, struct vfio_group, group_next);
> + if (group->groupid == groupid)
> + break;
> + group = NULL;
> + }
> +
> + if (!group)
> + goto out;
> +
> + list_for_each(pos, &group->device_list) {
> + device = list_entry(pos, struct vfio_device, device_next);
> + if (device->dev == dev)
> + break;
> + device = NULL;
> + }
> +
> + if (!device)
> + goto out;
> +
> + BUG_ON(device->refcnt);
> +
> + if (device->attached)
> + __vfio_iommu_detach_dev(group->iommu, device);
> +
> + list_del(&device->device_next);
> + kfree(device);
> +
> + /* If this was the only device in the group, remove the group.
> + * Note that we intentionally unmerge empty groups here if the
> + * group fd isn't opened. */
> + if (list_empty(&group->device_list) && group->refcnt == 0) {
> + struct vfio_iommu *iommu = group->iommu;
> +
> + if (iommu) {
> + __vfio_group_set_iommu(group, NULL);
> + __vfio_try_dissolve_iommu(iommu);
> + }
> +
> + device_destroy(vfio.class, group->devt);
> + idr_remove(&vfio.idr, MINOR(group->devt));
> + list_del(&group->group_next);
> + kfree(group);
> + }
> +out:
> + mutex_unlock(&vfio.lock);
> +}
> +EXPORT_SYMBOL_GPL(vfio_group_del_dev);
> +
> +/* When a device is bound to a vfio device driver (ex. vfio-pci), this
> + * entry point is used to mark the device usable (viable). The vfio
> + * device driver associates a private device_data struct with the device
> + * here, which will later be returned for vfio_device_fops callbacks. */
> +int vfio_bind_dev(struct device *dev, void *device_data)
> +{
> + struct vfio_device *device;
> + int ret = -EINVAL;
> +
> + BUG_ON(!device_data);
> +
> + mutex_lock(&vfio.lock);
> +
> + device = __vfio_lookup_dev(dev);
> +
> + BUG_ON(!device);
> +
> + ret = dev_set_drvdata(dev, device);
> + if (!ret)
> + device->device_data = device_data;
> +
> + mutex_unlock(&vfio.lock);
> + return ret;
> +}
> +EXPORT_SYMBOL_GPL(vfio_bind_dev);
> +
> +/* A device is only removable if the iommu for the group is not in use. */
> +static bool vfio_device_removeable(struct vfio_device *device)
> +{
> + bool ret = true;
> +
> + mutex_lock(&vfio.lock);
> +
> + if (device->iommu && __vfio_iommu_inuse(device->iommu))
> + ret = false;
> +
> + mutex_unlock(&vfio.lock);
> + return ret;
> +}
> +
> +/* Notify vfio that a device is being unbound from the vfio device driver
> + * and return the device private device_data pointer. If the group is
> + * in use, we need to block or take other measures to make it safe for
> + * the device to be removed from the iommu. */
> +void *vfio_unbind_dev(struct device *dev)
> +{
> + struct vfio_device *device = dev_get_drvdata(dev);
> + void *device_data;
> +
> + BUG_ON(!device);
> +
> +again:
> + if (!vfio_device_removeable(device)) {
> + /* XXX signal for all devices in group to be removed or
> + * resort to killing the process holding the device fds.
> + * For now just block waiting for releases to wake us. */
> + wait_event(vfio.release_q, vfio_device_removeable(device));
> + }
> +
> + mutex_lock(&vfio.lock);
> +
> + /* Need to re-check that the device is still removable under lock. */
> + if (device->iommu && __vfio_iommu_inuse(device->iommu)) {
> + mutex_unlock(&vfio.lock);
> + goto again;
> + }
> +
> + device_data = device->device_data;
> +
> + device->device_data = NULL;
> + dev_set_drvdata(dev, NULL);
> +
> + mutex_unlock(&vfio.lock);
> + return device_data;
> +}
> +EXPORT_SYMBOL_GPL(vfio_unbind_dev);
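The bind/unbind contract here is easy to get wrong from the backend side, so a tiny userspace model may help review — this is not the kernel code, just an illustration of the intended lifecycle: bind stashes the backend's private pointer exactly once, unbind hands back that same pointer and leaves the device with no private data (all `model_*` names are mine):

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative model of the vfio_bind_dev/vfio_unbind_dev contract:
 * the backend driver (e.g. vfio-pci) associates a private pointer at
 * bind time and must get exactly that pointer back at unbind time. */
struct model_device {
	void *device_data;
};

static int model_bind_dev(struct model_device *dev, void *device_data)
{
	/* NULL data is rejected, as is double-binding */
	if (!device_data || dev->device_data)
		return -1;		/* stand-in for -EINVAL */
	dev->device_data = device_data;
	return 0;
}

static void *model_unbind_dev(struct model_device *dev)
{
	void *data = dev->device_data;

	dev->device_data = NULL;	/* device no longer viable */
	return data;
}
```

In the real driver the backend would pass its per-device struct as device_data from its probe() path and retrieve it in remove() via vfio_unbind_dev(), which additionally blocks until the group iommu is no longer in use.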
> +
> +/*
> + * Module/class support
> + */
> +static void vfio_class_release(struct kref *kref)
> +{
> + class_destroy(vfio.class);
> + vfio.class = NULL;
> +}
> +
> +static char *vfio_devnode(struct device *dev, mode_t *mode)
> +{
> + return kasprintf(GFP_KERNEL, "vfio/%s", dev_name(dev));
> +}
> +
> +static int __init vfio_init(void)
> +{
> + int ret;
> +
> + idr_init(&vfio.idr);
> + mutex_init(&vfio.lock);
> + INIT_LIST_HEAD(&vfio.group_list);
> + init_waitqueue_head(&vfio.release_q);
> +
> + kref_init(&vfio.kref);
> + vfio.class = class_create(THIS_MODULE, "vfio");
> + if (IS_ERR(vfio.class)) {
> + ret = PTR_ERR(vfio.class);
> + goto err_class;
> + }
> +
> + vfio.class->devnode = vfio_devnode;
> +
> + /* FIXME - how many minors to allocate... all of them! */
> + ret = alloc_chrdev_region(&vfio.devt, 0, MINORMASK, "vfio");
> + if (ret)
> + goto err_chrdev;
> +
> + cdev_init(&vfio.cdev, &vfio_group_fops);
> + ret = cdev_add(&vfio.cdev, vfio.devt, MINORMASK);
> + if (ret)
> + goto err_cdev;
> +
> + pr_info(DRIVER_DESC " version: " DRIVER_VERSION "\n");
> +
> + return 0;
> +
> +err_cdev:
> + unregister_chrdev_region(vfio.devt, MINORMASK);
> +err_chrdev:
> + kref_put(&vfio.kref, vfio_class_release);
> +err_class:
> + return ret;
> +}
> +
> +static void __exit vfio_cleanup(void)
> +{
> + struct list_head *gpos, *gppos;
> +
> + list_for_each_safe(gpos, gppos, &vfio.group_list) {
> + struct vfio_group *group;
> + struct list_head *dpos, *dppos;
> +
> + group = list_entry(gpos, struct vfio_group, group_next);
> +
> + list_for_each_safe(dpos, dppos, &group->device_list) {
> + struct vfio_device *device;
> +
> + device = list_entry(dpos,
> + struct vfio_device, device_next);
> + vfio_group_del_dev(device->dev);
> + }
> + }
> +
> + idr_destroy(&vfio.idr);
> + cdev_del(&vfio.cdev);
> + unregister_chrdev_region(vfio.devt, MINORMASK);
> + kref_put(&vfio.kref, vfio_class_release);
> +}
> +
> +module_init(vfio_init);
> +module_exit(vfio_cleanup);
> +
> +MODULE_VERSION(DRIVER_VERSION);
> +MODULE_LICENSE("GPL v2");
> +MODULE_AUTHOR(DRIVER_AUTHOR);
> +MODULE_DESCRIPTION(DRIVER_DESC);
> diff --git a/drivers/vfio/vfio_private.h b/drivers/vfio/vfio_private.h
> new file mode 100644
> index 0000000..350ad67
> --- /dev/null
> +++ b/drivers/vfio/vfio_private.h
> @@ -0,0 +1,34 @@
> +/*
> + * Copyright (C) 2011 Red Hat, Inc. All rights reserved.
> + * Author: Alex Williamson <address@hidden>
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License version 2 as
> + * published by the Free Software Foundation.
> + *
> + * Derived from original vfio:
> + * Copyright 2010 Cisco Systems, Inc. All rights reserved.
> + * Author: Tom Lyon, address@hidden
> + */
> +
> +#include <linux/list.h>
> +#include <linux/mutex.h>
> +
> +#ifndef VFIO_PRIVATE_H
> +#define VFIO_PRIVATE_H
> +
> +struct vfio_iommu {
> + struct iommu_domain *domain;
> + struct bus_type *bus;
> + struct mutex dgate;
> + struct list_head dm_list;
> + struct mm_struct *mm;
> + struct list_head group_list;
> + int refcnt;
> + bool cache;
> +};
> +
> +extern int vfio_release_iommu(struct vfio_iommu *iommu);
> +extern void vfio_iommu_unmapall(struct vfio_iommu *iommu);
> +
> +#endif /* VFIO_PRIVATE_H */
> diff --git a/include/linux/vfio.h b/include/linux/vfio.h
> new file mode 100644
> index 0000000..4269b08
> --- /dev/null
> +++ b/include/linux/vfio.h
> @@ -0,0 +1,155 @@
> +/*
> + * Copyright 2010 Cisco Systems, Inc. All rights reserved.
> + * Author: Tom Lyon, address@hidden
> + *
> + * This program is free software; you may redistribute it and/or modify
> + * it under the terms of the GNU General Public License as published by
> + * the Free Software Foundation; version 2 of the License.
> + *
> + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
> + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
> + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
> + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
> + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
> + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
> + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
> + * SOFTWARE.
> + *
> + * Portions derived from drivers/uio/uio.c:
> + * Copyright(C) 2005, Benedikt Spranger <address@hidden>
> + * Copyright(C) 2005, Thomas Gleixner <address@hidden>
> + * Copyright(C) 2006, Hans J. Koch <address@hidden>
> + * Copyright(C) 2006, Greg Kroah-Hartman <address@hidden>
> + *
> + * Portions derived from drivers/uio/uio_pci_generic.c:
> + * Copyright (C) 2009 Red Hat, Inc.
> + * Author: Michael S. Tsirkin <address@hidden>
> + */
> +#include <linux/types.h>
> +
> +#ifndef VFIO_H
> +#define VFIO_H
> +
> +#ifdef __KERNEL__
> +
> +struct vfio_device_ops {
> + bool (*match)(struct device *, char *);
> + int (*get)(void *);
> + void (*put)(void *);
> + ssize_t (*read)(void *, char __user *,
> + size_t, loff_t *);
> + ssize_t (*write)(void *, const char __user *,
> + size_t, loff_t *);
> + long (*ioctl)(void *, unsigned int, unsigned long);
> + int (*mmap)(void *, struct vm_area_struct *);
> +};
> +
> +extern int vfio_group_add_dev(struct device *device,
> + const struct vfio_device_ops *ops);
> +extern void vfio_group_del_dev(struct device *device);
> +extern int vfio_bind_dev(struct device *device, void *device_data);
> +extern void *vfio_unbind_dev(struct device *device);
> +
> +#endif /* __KERNEL__ */
> +
> +/*
> + * VFIO driver - allow mapping and use of certain devices
> + * in unprivileged user processes. (If IOMMU is present)
> + * Especially useful for Virtual Function parts of SR-IOV devices
> + */
> +
> +
> +/* Kernel & User level defines for ioctls */
> +
> +#define VFIO_GROUP_GET_FLAGS _IOR(';', 100, __u64)
> + #define VFIO_GROUP_FLAGS_VIABLE (1 << 0)
> + #define VFIO_GROUP_FLAGS_MM_LOCKED (1 << 1)
> +#define VFIO_GROUP_MERGE _IOW(';', 101, int)
> +#define VFIO_GROUP_UNMERGE _IOW(';', 102, int)
> +#define VFIO_GROUP_GET_IOMMU_FD _IO(';', 103)
> +#define VFIO_GROUP_GET_DEVICE_FD _IOW(';', 104, char *)
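For review purposes, here is my understanding of how userspace is meant to drive these group ioctls — check viability first, then fetch the iommu and device fds. The group number and device name below are invented examples, and the open/ioctl sequence is left as a comment since it needs real hardware:

```c
#include <assert.h>
#include <fcntl.h>
#include <sys/ioctl.h>
#include <linux/types.h>

/* Mirrors of the group ioctl definitions above */
#define VFIO_GROUP_GET_FLAGS		_IOR(';', 100, __u64)
#define VFIO_GROUP_FLAGS_VIABLE		(1 << 0)
#define VFIO_GROUP_FLAGS_MM_LOCKED	(1 << 1)
#define VFIO_GROUP_GET_IOMMU_FD		_IO(';', 103)
#define VFIO_GROUP_GET_DEVICE_FD	_IOW(';', 104, char *)

/* A group is only usable once every device in it is bound to a vfio
 * backend driver, i.e. the VIABLE flag is set. */
static int group_is_viable(__u64 flags)
{
	return (flags & VFIO_GROUP_FLAGS_VIABLE) != 0;
}

/* Intended flow (hypothetical group number and device name):
 *
 *	int grp = open("/dev/vfio/240", O_RDWR);
 *	__u64 flags;
 *
 *	ioctl(grp, VFIO_GROUP_GET_FLAGS, &flags);
 *	if (group_is_viable(flags)) {
 *		int iommu = ioctl(grp, VFIO_GROUP_GET_IOMMU_FD);
 *		int dev = ioctl(grp, VFIO_GROUP_GET_DEVICE_FD,
 *				"0000:06:0d.0");
 *	}
 */
```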
> +
> +/*
> + * Structure for DMA mapping of user buffers
> + * vaddr, dmaaddr, and size must all be page aligned
> + */
> +struct vfio_dma_map {
> + __u64 len; /* length of structure */
> + __u64 vaddr; /* process virtual addr */
> + __u64 dmaaddr; /* desired and/or returned dma address */
> + __u64 size; /* size in bytes */
> + __u64 flags;
> +#define VFIO_DMA_MAP_FLAG_WRITE		(1 << 0) /* req writeable DMA mem */
> +};
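Since the header requires vaddr, dmaaddr and size to be page aligned, userspace probably wants a small helper to validate a request before issuing VFIO_IOMMU_MAP_DMA. A sketch (the `ex_` struct is a local mirror of vfio_dma_map, and the 4K page size is an assumption for illustration):

```c
#include <assert.h>
#include <string.h>
#include <linux/types.h>

#define EX_PAGE_SIZE 4096UL		/* assumed page size */

/* Local mirror of struct vfio_dma_map from the header above */
struct ex_vfio_dma_map {
	__u64 len;			/* length of structure */
	__u64 vaddr;			/* process virtual addr */
	__u64 dmaaddr;			/* desired/returned dma address */
	__u64 size;			/* size in bytes */
	__u64 flags;
};
#define EX_VFIO_DMA_MAP_FLAG_WRITE	(1 << 0)

/* vaddr, dmaaddr and size must all be page aligned */
static int ex_dma_map_valid(const struct ex_vfio_dma_map *map)
{
	return !(map->vaddr % EX_PAGE_SIZE) &&
	       !(map->dmaaddr % EX_PAGE_SIZE) &&
	       !(map->size % EX_PAGE_SIZE);
}

/* Fill a map request for a writable buffer; no ioctl issued here.
 * Setting len to sizeof(*map) lets the kernel version the ABI. */
static void ex_dma_map_init(struct ex_vfio_dma_map *map, __u64 vaddr,
			    __u64 iova, __u64 size)
{
	memset(map, 0, sizeof(*map));
	map->len = sizeof(*map);
	map->vaddr = vaddr;
	map->dmaaddr = iova;
	map->size = size;
	map->flags = EX_VFIO_DMA_MAP_FLAG_WRITE;
}
```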
> +
> +#define VFIO_IOMMU_GET_FLAGS _IOR(';', 105, __u64)
> + /* Does the IOMMU support mapping any IOVA to any virtual address? */
> + #define VFIO_IOMMU_FLAGS_MAP_ANY (1 << 0)
> +#define VFIO_IOMMU_MAP_DMA		_IOWR(';', 106, struct vfio_dma_map)
> +#define VFIO_IOMMU_UNMAP_DMA		_IOWR(';', 107, struct vfio_dma_map)
> +
> +#define VFIO_DEVICE_GET_FLAGS _IOR(';', 108, __u64)
> + #define VFIO_DEVICE_FLAGS_PCI (1 << 0)
> + #define VFIO_DEVICE_FLAGS_DT (1 << 1)
> + #define VFIO_DEVICE_FLAGS_RESET (1 << 2)
> +#define VFIO_DEVICE_GET_NUM_REGIONS _IOR(';', 109, int)
> +
> +struct vfio_region_info {
> + __u32 len; /* length of structure */
> + __u32 index; /* region number */
> + __u64 size; /* size in bytes of region */
> + __u64 offset; /* start offset of region */
> + __u64 flags;
> +#define VFIO_REGION_INFO_FLAG_MMAP (1 << 0)
> +#define VFIO_REGION_INFO_FLAG_RO (1 << 1)
> +#define VFIO_REGION_INFO_FLAG_PHYS_VALID (1 << 2)
> + __u64 phys; /* physical address of region */
> +};
> +
> +#define VFIO_DEVICE_GET_REGION_INFO _IOWR(';', 110, struct vfio_region_info)
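To make sure I'm reading the region flags right: after enumerating indexes 0..GET_NUM_REGIONS-1 with GET_REGION_INFO, userspace would mmap a region when the MMAP flag allows it and otherwise fall back to read/write at the region's offset on the device fd. A small model of that decision (the `ex_` struct mirrors vfio_region_info; helper names are mine):

```c
#include <assert.h>
#include <linux/types.h>

/* Local mirror of struct vfio_region_info from the header above */
struct ex_vfio_region_info {
	__u32 len;			/* length of structure */
	__u32 index;			/* region number */
	__u64 size;			/* size in bytes of region */
	__u64 offset;			/* start offset of region */
	__u64 flags;
	__u64 phys;			/* physical address of region */
};
#define EX_VFIO_REGION_INFO_FLAG_MMAP		(1 << 0)
#define EX_VFIO_REGION_INFO_FLAG_RO		(1 << 1)
#define EX_VFIO_REGION_INFO_FLAG_PHYS_VALID	(1 << 2)

/* mmap the region when the backend allows it and it is non-empty;
 * otherwise access goes through read/write at info->offset. */
static int ex_region_use_mmap(const struct ex_vfio_region_info *info)
{
	return (info->flags & EX_VFIO_REGION_INFO_FLAG_MMAP) &&
	       info->size > 0;
}

static int ex_region_writable(const struct ex_vfio_region_info *info)
{
	return !(info->flags & EX_VFIO_REGION_INFO_FLAG_RO);
}
```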
> +
> +#define VFIO_DEVICE_GET_NUM_IRQS _IOR(';', 111, int)
> +
> +struct vfio_irq_info {
> + __u32 len; /* length of structure */
> + __u32 index; /* IRQ number */
> + __u32 count; /* number of individual IRQs */
> + __u32 flags;
> +#define VFIO_IRQ_INFO_FLAG_LEVEL (1 << 0)
> +};
> +
> +#define VFIO_DEVICE_GET_IRQ_INFO _IOWR(';', 112, struct vfio_irq_info)
> +
> +/* Set IRQ eventfds, arg[0] = index, arg[1] = count, arg[2-n] = eventfds */
> +#define VFIO_DEVICE_SET_IRQ_EVENTFDS _IOW(';', 113, int)
> +
> +/* Unmask IRQ index, arg[0] = index */
> +#define VFIO_DEVICE_UNMASK_IRQ _IOW(';', 114, int)
> +
> +/* Set unmask eventfd, arg[0] = index, arg[1] = eventfd */
> +#define VFIO_DEVICE_SET_UNMASK_IRQ_EVENTFD _IOW(';', 115, int)
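The flat-int-array argument convention for these IRQ ioctls (arg[0] = index, arg[1] = count, arg[2..] = eventfds, per the comments above) is worth pinning down with a packing helper — this is my sketch, not part of the patch:

```c
#include <assert.h>

/* Pack the argument array for VFIO_DEVICE_SET_IRQ_EVENTFDS:
 * args[0] = index, args[1] = count, args[2..2+count-1] = eventfds.
 * Returns the number of ints written, or -1 if the buffer is too
 * small. */
static int ex_pack_irq_eventfds(int *args, int max_args,
				int index, int count, const int *eventfds)
{
	int i;

	if (count < 0 || max_args < 2 + count)
		return -1;
	args[0] = index;
	args[1] = count;
	for (i = 0; i < count; i++)
		args[2 + i] = eventfds[i];
	return 2 + count;
}
```

Passing count eventfds and then issuing the ioctl with this array would wire one eventfd per individual IRQ within the index, matching the count reported by VFIO_DEVICE_GET_IRQ_INFO.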
> +
> +#define VFIO_DEVICE_RESET _IO(';', 116)
> +
> +struct vfio_dtpath {
> + __u32 len; /* length of structure */
> + __u32 index;
> + __u64 flags;
> +#define VFIO_DTPATH_FLAGS_REGION (1 << 0)
> +#define VFIO_DTPATH_FLAGS_IRQ (1 << 1)
> + char *path;
> +};
> +#define VFIO_DEVICE_GET_DTPATH		_IOWR(';', 117, struct vfio_dtpath)
> +
> +struct vfio_dtindex {
> + __u32 len; /* length of structure */
> + __u32 index;
> + __u32 prop_type;
> + __u32 prop_index;
> + __u64 flags;
> +#define VFIO_DTINDEX_FLAGS_REGION (1 << 0)
> +#define VFIO_DTINDEX_FLAGS_IRQ (1 << 1)
> +};
> +#define VFIO_DEVICE_GET_DTINDEX		_IOWR(';', 118, struct vfio_dtindex)
> +
> +#endif /* VFIO_H */
So where is the vfio-pci? Is that a separate posting?