qemu-devel
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [Qemu-devel] VFIO based vGPU(was Re: [Announcement] 2015-Q3 release


From: Neo Jia
Subject: Re: [Qemu-devel] VFIO based vGPU(was Re: [Announcement] 2015-Q3 release of XenGT - a Mediated ...)
Date: Tue, 26 Jan 2016 02:20:39 -0800
User-agent: Mutt/1.5.24 (2015-08-30)

On Mon, Jan 25, 2016 at 09:45:14PM +0000, Tian, Kevin wrote:
> > From: Alex Williamson [mailto:address@hidden
> > Sent: Tuesday, January 26, 2016 5:30 AM
> > 
> > [cc +Neo @Nvidia]
> > 
> > Hi Jike,
> > 
> > On Mon, 2016-01-25 at 19:34 +0800, Jike Song wrote:
> > > On 01/20/2016 05:05 PM, Tian, Kevin wrote:
> > > > I would expect we can spell out next level tasks toward above
> > > > direction, upon which Alex can easily judge whether there are
> > > > some common VFIO framework changes that he can help :-)
> > >
> > > Hi Alex,
> > >
> > > Here is a draft task list after a short discussion w/ Kevin,
> > > would you please have a look?
> > >
> > >   Bus Driver
> > >
> > >           { in i915/vgt/xxx.c }
> > >
> > >           - define a subset of vfio_pci interfaces
> > >           - selective pass-through (say aperture)
> > >           - trap MMIO: interface w/ QEMU
> > 
> > What's included in the subset?  Certainly the bus reset ioctls really
> > don't apply, but you'll need to support the full device interface,
> > right?  That includes the region info ioctl and access through the vfio
> > device file descriptor as well as the interrupt info and setup ioctls.
> 
> That is the next level detail Jike will figure out and discuss soon.
> 
> yes, basic region info/access should be necessary. For interrupt, could
> you elaborate a bit what current interface is doing? If just about creating
> an eventfd for virtual interrupt injection, it applies to vgpu too.
> 
> > 
> > >   IOMMU
> > >
> > >           { in a new vfio_xxx.c }
> > >
> > >           - allocate: struct device & IOMMU group
> > 
> > It seems like the vgpu instance management would do this.
> > 
> > >           - map/unmap functions for vgpu
> > >           - rb-tree to maintain iova/hpa mappings
> > 
> > Yep, pretty much what type1 does now, but without mapping through the
> > IOMMU API.  Essentially just a database of the current userspace
> > mappings that can be accessed for page pinning and IOVA->HPA
> > translation.
> 
> The thought is to reuse iommu_type1.c, by abstracting several underlying
> operations and then put vgpu specific implementation in a vfio_vgpu.c (e.g.
> for map/unmap instead of using IOMMU API, an iova/hpa mapping is updated
> accordingly), etc.
> 
> This file will also connect between VFIO and vendor specific vgpu driver,
> e.g. exposing interfaces to allow the latter querying iova<->hpa and also 
> creating necessary VFIO structures like aforementioned device/IOMMUas...
> 
> > 
> > >           - interacts with kvmgt.c
> > >
> > >
> > >   vgpu instance management
> > >
> > >           { in i915 }
> > >
> > >           - path, create/destroy
> > >
> > 
> > Yes, and since you're creating and destroying the vgpu here, this is
> > where I'd expect a struct device to be created and added to an IOMMU
> > group.  The lifecycle management should really include links between
> > the vGPU and physical GPU, which would be much, much easier to do with
> > struct devices create here rather than at the point where we start
> > doing vfio "stuff".
> 
> It's invoked here, but expecting the function exposed by vfio_vgpu.c. It's
> not good to touch vfio internal structures from another module (such as
> i915.ko)
> 
> > 
> > Nvidia has also been looking at this and has some ideas how we might
> > standardize on some of the interfaces and create a vgpu framework to
> > help share code between vendors and hopefully make a more consistent
> > userspace interface for libvirt as well.  I'll let Neo provide some
> > details.  Thanks,
> > 
> 
> Nice to know that. Neo, please share your thought here.

Hi Alex, Kevin and Jike,

(Seems I shouldn't use attachment, resend it again to the list, patches are
inline at the end)

Thanks for adding me to this technical discussion, a great opportunity
for us to design together which can bring both Intel and NVIDIA vGPU solution to
KVM platform.

Instead of directly jumping to the proposal that we have been working on
recently for NVIDIA vGPU on KVM, I think it is better for me to put out couple
quick comments / thoughts regarding the existing discussions on this thread as
fundamentally I think we are solving the same problem, DMA, interrupt and MMIO.

Then we can look at what we have, hopefully we can reach some consensus soon.

> Yes, and since you're creating and destroying the vgpu here, this is
> where I'd expect a struct device to be created and added to an IOMMU
> group.  The lifecycle management should really include links between
> the vGPU and physical GPU, which would be much, much easier to do with
> struct devices create here rather than at the point where we start
> doing vfio "stuff".

Infact to keep vfio-vgpu to be more generic, vgpu device creation and management
can be centralized and done in vfio-vgpu. That also include adding to IOMMU
group and VFIO group.

Graphics driver can register with vfio-vgpu to get management and emulation call
backs to graphics driver.   

We already have struct vgpu_device in our proposal that keeps pointer to
physical device.  

> - vfio_pci will inject an IRQ to guest only when physical IRQ
> generated; whereas vfio_vgpu may inject an IRQ for emulation
> purpose. Anyway they can share the same injection interface;

eventfd to inject the interrupt is known to vfio-vgpu, that fd should be
available to graphics driver so that graphics driver can inject interrupts
directly when physical device triggers interrupt. 

Here is the proposal we have, please review.

Please note the patches we have put out here is mainly for POC purpose to
verify our understanding also can serve the purpose to reduce confusions and 
speed up 
our design, although we are very happy to refine that to something eventually
can be used for both parties and upstreamed.

Linux vGPU kernel design
==================================================================================

Here we are proposing a generic Linux kernel module based on VFIO framework
which allows different GPU vendors to plugin and provide their GPU 
virtualization
solution on KVM, the benefits of having such generic kernel module are:

1) Reuse QEMU VFIO driver, supporting VFIO UAPI

2) GPU HW agnostic management API for upper layer software such as libvirt

3) No duplicated VFIO kernel logic reimplemented by different GPU driver vendor

0. High level overview
==================================================================================

 
  user space:
                                +-----------+  VFIO IOMMU IOCTLs
                      +---------| QEMU VFIO |-------------------------+
        VFIO IOCTLs   |         +-----------+                         |
                      |                                               | 
 ---------------------|-----------------------------------------------|---------
                      |                                               |
  kernel space:       |  +--->----------->---+  (callback)            V
                      |  |                   v                 +------V-----+
  +----------+   +----V--^--+          +--+--+-----+           | VGPU       |
  |          |   |          |     +----| nvidia.ko +----->-----> TYPE1 IOMMU|
  | VFIO Bus <===| VGPU.ko  |<----|    +-----------+     |     +---++-------+ 
  |          |   |          |     | (register)           ^         ||
  +----------+   +-------+--+     |    +-----------+     |         ||
                         V        +----| i915.ko   +-----+     +---VV-------+ 
                         |             +-----^-----+           | TYPE1      |
                         |  (callback)       |                 | IOMMU      |
                         +-->------------>---+                 +------------+
 access flow:

  Guest MMIO / PCI config access
  |
  -------------------------------------------------
  |
  +-----> KVM VM_EXITs  (kernel)
          |
  -------------------------------------------------
          |
          +-----> QEMU VFIO driver (user)
                  | 
  -------------------------------------------------
                  |
                  +---->  VGPU kernel driver (kernel)
                          |  
                          | 
                          +----> vendor driver callback


1. VGPU management interface
==================================================================================

This is the interface allows upper layer software (mostly libvirt) to query and
configure virtual GPU device in a HW agnostic fashion. Also, this management
interface has provided flexibility to underlying GPU vendor to support virtual
device hotplug, multiple virtual devices per VM, multiple virtual devices from
different physical devices, etc.

1.1 Under per-physical device sysfs:
----------------------------------------------------------------------------------

vgpu_supported_types - RO, list the current supported virtual GPU types and its
VGPU_ID. VGPU_ID - a vGPU type identifier returned from reads of
"vgpu_supported_types".
                            
vgpu_create - WO, input syntax <VM_UUID:idx:VGPU_ID>, create a virtual
gpu device on a target physical GPU. idx: virtual device index inside a VM

vgpu_destroy - WO, input syntax <VM_UUID:idx>, destroy a virtual gpu device on a
target physical GPU

1.3 Under vgpu class sysfs:
----------------------------------------------------------------------------------

vgpu_start - WO, input syntax <VM_UUID>, this will trigger the registration
interface to notify the GPU vendor driver to commit virtual GPU resource for
this target VM. 

Also, the vgpu_start function is a synchronized call, the successful return of
this call will indicate all the requested vGPU resource has been fully
committed, the VMM should continue.

vgpu_shutdown - WO, input syntax <VM_UUID>, this will trigger the registration
interface to notify the GPU vendor driver to release virtual GPU resource of
this target VM.

1.4 Virtual device Hotplug
----------------------------------------------------------------------------------

To support virtual device hotplug, <vgpu_create> and <vgpu_destroy> can be
accessed during VM runtime, and the corresponding registration callback will be
invoked to allow GPU vendor support hotplug.

To support hotplug, vendor driver would take necessary action to handle the
situation when a vgpu_create is done on a VM_UUID after vgpu_start, and that
implies both create and start for that vgpu device.

Same, vgpu_destroy implies a vgpu_shudown on a running VM only if vendor driver
supports vgpu hotplug.

If hotplug is not supported and VM is still running, vendor driver can return
error code to indicate not supported.

Separate create from start gives flixibility to have:

- multiple vgpu instances for single VM and
- hotplug feature.

2. GPU driver vendor registration interface
==================================================================================

2.1 Registration interface definition (include/linux/vgpu.h)
----------------------------------------------------------------------------------

extern int vgpu_register_device(struct pci_dev *dev, 
                                const struct gpu_device_ops *ops);

extern void vgpu_unregister_device(struct pci_dev *dev);

/**
 * struct gpu_device_ops - Structure to be registered for each physical GPU to
 * register the device to vgpu module.
 *
 * @owner:                      The module owner.
 * @vgpu_supported_config:      Called to get information about supported vgpu
 * types.
 *                              @dev : pci device structure of physical GPU. 
 *                              @config: should return string listing supported
 *                              config
 *                              Returns integer: success (0) or error (< 0)
 * @vgpu_create:                Called to allocate basic resouces in graphics
 *                              driver for a particular vgpu.
 *                              @dev: physical pci device structure on which
 *                              vgpu 
 *                                    should be created
 *                              @vm_uuid: VM's uuid for which VM it is intended
 *                              to
 *                              @instance: vgpu instance in that VM
 *                              @vgpu_id: This represents the type of vgpu to be
 *                                        created
 *                              Returns integer: success (0) or error (< 0)
 * @vgpu_destroy:               Called to free resources in graphics driver for
 *                              a vgpu instance of that VM.
 *                              @dev: physical pci device structure to which
 *                              this vgpu points to.
 *                              @vm_uuid: VM's uuid for which the vgpu belongs
 *                              to.
 *                              @instance: vgpu instance in that VM
 *                              Returns integer: success (0) or error (< 0)
 *                              If VM is running and vgpu_destroy is called 
that 
 *                              means the vGPU is being hotunpluged. Return
 *                              error
 *                              if VM is running and graphics driver doesn't
 *                              support vgpu hotplug.
 * @vgpu_start:                 Called to do initiate vGPU initialization
 *                              process in graphics driver when VM boots before
 *                              qemu starts.
 *                              @vm_uuid: VM's UUID which is booting.
 *                              Returns integer: success (0) or error (< 0)
 * @vgpu_shutdown:              Called to teardown vGPU related resources for
 *                              the VM
 *                              @vm_uuid: VM's UUID which is shutting down .
 *                              Returns integer: success (0) or error (< 0)
 * @read:                       Read emulation callback
 *                              @vdev: vgpu device structure
 *                              @buf: read buffer
 *                              @count: number bytes to read 
 *                              @address_space: specifies for which address
 *                              space
 *                              the request is: pci_config_space, IO register
 *                              space or MMIO space.
 *                              Retuns number on bytes read on success or error.
 * @write:                      Write emulation callback
 *                              @vdev: vgpu device structure
 *                              @buf: write buffer
 *                              @count: number bytes to be written
 *                              @address_space: specifies for which address
 *                              space
 *                              the request is: pci_config_space, IO register
 *                              space or MMIO space.
 *                              Retuns number on bytes written on success or
 *                              error.
 * @vgpu_set_irqs:              Called to send about interrupts configuration
 *                              information that qemu set. 
 *                              @vdev: vgpu device structure
 *                              @flags, index, start, count and *data : same as
 *                              that of struct vfio_irq_set of
 *                              VFIO_DEVICE_SET_IRQS API. 
 *
 * Physical GPU that support vGPU should be register with vgpu module with 
 * gpu_device_ops structure.
 */

struct gpu_device_ops {
        struct module   *owner;
        int     (*vgpu_supported_config)(struct pci_dev *dev, char *config);
        int     (*vgpu_create)(struct pci_dev *dev, uuid_le vm_uuid,
                               uint32_t instance, uint32_t vgpu_id);
        int     (*vgpu_destroy)(struct pci_dev *dev, uuid_le vm_uuid,
                                uint32_t instance);
        int     (*vgpu_start)(uuid_le vm_uuid);
        int     (*vgpu_shutdown)(uuid_le vm_uuid);
        ssize_t (*read) (struct vgpu_device *vdev, char *buf, size_t count,
                         uint32_t address_space, loff_t pos);
        ssize_t (*write)(struct vgpu_device *vdev, char *buf, size_t count,
                         uint32_t address_space,loff_t pos);
        int     (*vgpu_set_irqs)(struct vgpu_device *vdev, uint32_t flags,
                                 unsigned index, unsigned start, unsigned count,
                                 void *data);

};

2.2 Details for callbacks we haven't mentioned above.
---------------------------------------------------------------------------------

vgpu_supported_config: allows the vendor driver to specify the supported vGPU
                       type/configuration

vgpu_create          : create a virtual GPU device, can be used for device 
hotplug.

vgpu_destroy         : destroy a virtual GPU device, can be used for device 
hotplug.

vgpu_start           : callback function to notify vendor driver vgpu device
                       come to live for a given virtual machine.

vgpu_shutdown        : callback function to notify vendor driver 

read                 : callback to vendor driver to handle virtual device config
                       space or MMIO read access

write                : callback to vendor driver to handle virtual device config
                       space or MMIO write access

vgpu_set_irqs        : callback to vendor driver to pass along the interrupt
                       information for the target virtual device, then vendor
                       driver can inject interrupt into virtual machine for this
                       device.

2.3 Potential additional virtual device configuration registration interface:
---------------------------------------------------------------------------------

callback function to describe the MMAP behavior of the virtual GPU 

callback function to allow GPU vendor driver to provide PCI config space backing
memory.

3. VGPU TYPE1 IOMMU
==================================================================================

Here we are providing a TYPE1 IOMMU for vGPU which will basically keep track 
the 
<iova, hva, size, flag> and save the QEMU mm for later reference.

You can find the quick/ugly implementation in the attached patch file, which is
actually just a simple version Alex's type1 IOMMU without actual real
mapping when IOMMU_MAP_DMA / IOMMU_UNMAP_DMA is called. 

We have thought about providing another vendor driver registration interface so
such tracking information will be sent to vendor driver and he will use the QEMU
mm to do the get_user_pages / remap_pfn_range when it is required. After doing a
quick implementation within our driver, I noticed following issues:

1) OS/VFIO logic into vendor driver which will be a maintenance issue.

2) Every driver vendor has to implement their own RB tree, instead of reusing
the common existing VFIO code (vfio_find/link/unlink_dma) 

3) IOMMU_UNMAP_DMA is expecting to get "unmapped bytes" back to the caller/QEMU,
better not have anything inside a vendor driver that the VFIO caller immediately
depends on.

Based on the above consideration, we decide to implement the DMA tracking logic
within VGPU TYPE1 IOMMU code (ideally, this should be merged into current TYPE1
IOMMU code) and expose two symbols to outside for MMIO mapping and page
translation and pinning. 

Also, with a mmap MMIO interface between virtual and physical, this allows
para-virtualized guest driver can access his virtual MMIO without taking a MMAP
fault hit, also we can support different MMIO size between virtual and physical
device.

int vgpu_map_virtual_bar
(
    uint64_t virt_bar_addr,
    uint64_t phys_bar_addr,
    uint32_t len,
    uint32_t flags
)

EXPORT_SYMBOL(vgpu_map_virtual_bar);

int vgpu_dma_do_translate(dma_addr_t *gfn_buffer, uint32_t count)

EXPORT_SYMBOL(vgpu_dma_do_translate);

Still a lot to be added and modified, such as supporting multiple VMs and 
multiple virtual devices, tracking the mapped / pinned region within VGPU IOMMU 
kernel driver, error handling, roll-back and locked memory size per user, etc. 

4. Modules
==================================================================================

Two new modules are introduced: vfio_iommu_type1_vgpu.ko and vgpu.ko

vfio_iommu_type1_vgpu.ko - IOMMU TYPE1 driver supporting the IOMMU 
                           TYPE1 v1 and v2 interface. 

vgpu.ko                  - provide registration interface and virtual device
                           VFIO access.

5. QEMU note
==================================================================================

To allow us focus on the VGPU kernel driver prototyping, we have introduced a 
new VFIO 
class - vgpu inside QEMU, so we don't have to change the existing vfio/pci.c 
file and 
use it as a reference for our implementation. It is basically just a quick c & p
from vfio/pci.c to quickly meet our needs.

Once this proposal is finalized, we will move to vfio/pci.c instead of a new
class, and probably the only thing required is to have a new way to discover the
device.

6. Examples
==================================================================================

On this server, we have two NVIDIA M60 GPUs.

address@hidden ~]# lspci -d 10de:13f2
86:00.0 VGA compatible controller: NVIDIA Corporation Device 13f2 (rev a1)
87:00.0 VGA compatible controller: NVIDIA Corporation Device 13f2 (rev a1)

After nvidia.ko gets initialized, we can query the supported vGPU type by
accessing the "vgpu_supported_types" like following:

address@hidden ~]# cat /sys/bus/pci/devices/0000\:86\:00.0/vgpu_supported_types 
11:GRID M60-0B
12:GRID M60-0Q
13:GRID M60-1B
14:GRID M60-1Q
15:GRID M60-2B
16:GRID M60-2Q
17:GRID M60-4Q
18:GRID M60-8Q

For example the VM_UUID is c0b26072-dd1b-4340-84fe-bf338c510818, and we would
like to create "GRID M60-4Q" VM on it.

echo "c0b26072-dd1b-4340-84fe-bf338c510818:0:17" > 
/sys/bus/pci/devices/0000\:86\:00.0/vgpu_create

Note: the number 0 here is for vGPU device index. So far the change is not 
tested
for multiple vgpu devices yet, but we will support it.

At this moment, if you query the "vgpu_supported_types" it will still show all
supported virtual GPU types as no virtual GPU resource is committed yet.

Starting VM:

echo "c0b26072-dd1b-4340-84fe-bf338c510818" > /sys/class/vgpu/vgpu_start

then, the supported vGPU type query will return:

address@hidden /home/cjia]$
> cat /sys/bus/pci/devices/0000\:86\:00.0/vgpu_supported_types
17:GRID M60-4Q

So vgpu_supported_config needs to be called whenever a new virtual device gets
created as the underlying HW might limit the supported types if there are
any existing VM runnings.

Then, VM gets shutdown, writes to /sys/class/vgpu/vgpu_shutdown will info the
GPU driver vendor to clean up resource.

Eventually, those virtual GPUs can be removed by writing to vgpu_destroy under
device sysfs.

7. What is not covered:
==================================================================================

7.1 QEMU console VNC

QEMU console VNC is not covered in this RFC as it is a pretty isolated module
and not impacting the basic vGPU functionality, also we already have a good
discussion about the new VFIO interface that Alex is going to introduce to 
allow us 
describe a region for VM surface.

8 Patches
==================================================================================

0001-Add-VGPU-VFIO-driver-class-support-in-QEMU.patch - against QEMU 2.5.0

0001-Add-VGPU-and-its-TYPE1-IOMMU-kernel-module-support.patch  - against 
4.4.0-rc5

Thanks,
Kirti and Neo

>From dc8ca387f7b06c6dfc85fb4bd79a760dca76e831 Mon Sep 17 00:00:00 2001
From: Neo Jia <address@hidden>
Date: Tue, 26 Jan 2016 01:21:11 -0800
Subject: [PATCH] Add VGPU and its TYPE1 IOMMU kernel module support

This is just a quick POV implementation to allow GPU driver vendor to plugin
into VFIO framework to provide their virtual GPU support. This kernel is
providing a registration interface for GPU vendor and generic DMA tracking APIs.

extern int vgpu_register_device(struct pci_dev *dev,
                                const struct gpu_device_ops *ops);

extern void vgpu_unregister_device(struct pci_dev *dev);

/**
 * struct gpu_device_ops - Structure to be registered for each physical GPU to
 * register the device to vgpu module.
 *
 * @owner:                      The module owner.
 * @vgpu_supported_config:      Called to get information about supported vgpu 
types.
 *                              @dev : pci device structure of physical GPU.
 *                              @config: should return string listing supported 
config
 *                              Returns integer: success (0) or error (< 0)
 * @vgpu_create:                Called to allocate basic resouces in graphics
 *                              driver for a particular vgpu.
 *                              @dev: physical pci device structure on which 
vgpu
 *                                    should be created
 *                              @vm_uuid: VM's uuid for which VM it is intended 
to
 *                              @instance: vgpu instance in that VM
 *                              @vgpu_id: This represents the type of vgpu to be
 *                                        created
 *                              Returns integer: success (0) or error (< 0)
 * @vgpu_destroy:               Called to free resources in graphics driver for
 *                              a vgpu instance of that VM.
 *                              @dev: physical pci device structure to which
 *                              this vgpu points to.
 *                              @vm_uuid: VM's uuid for which the vgpu belongs 
to.
 *                              @instance: vgpu instance in that VM
 *                              Returns integer: success (0) or error (< 0)
 *                              If VM is running and vgpu_destroy is called that
 *                              means the vGPU is being hotunpluged. Return 
error
 *                              if VM is running and graphics driver doesn't
 *                              support vgpu hotplug.
 * @vgpu_start:                 Called to do initiate vGPU initialization
 *                              process in graphics driver when VM boots before
 *                              qemu starts.
 *                              @vm_uuid: VM's UUID which is booting.
 *                              Returns integer: success (0) or error (< 0)
 * @vgpu_shutdown:              Called to teardown vGPU related resources for
 *                              the VM
 *                              @vm_uuid: VM's UUID which is shutting down .
 *                              Returns integer: success (0) or error (< 0)
 * @read:                       Read emulation callback
 *                              @vdev: vgpu device structure
 *                              @buf: read buffer
 *                              @count: number bytes to read
 *                              @address_space: specifies for which address 
space
 *                              the request is: pci_config_space, IO register
 *                              space or MMIO space.
 *                              Retuns number on bytes read on success or error.
 * @write:                      Write emulation callback
 *                              @vdev: vgpu device structure
 *                              @buf: write buffer
 *                              @count: number bytes to be written
 *                              @address_space: specifies for which address 
space
 *                              the request is: pci_config_space, IO register
 *                              space or MMIO space.
 *                              Retuns number on bytes written on success or 
error.
 * @vgpu_set_irqs:              Called to send about interrupts configuration
 *                              information that qemu set.
 *                              @vdev: vgpu device structure
 *                              @flags, index, start, count and *data : same as
 *                              that of struct vfio_irq_set of
 *                              VFIO_DEVICE_SET_IRQS API.
 *
 * Physical GPU that support vGPU should be register with vgpu module with
 * gpu_device_ops structure.
 */

struct gpu_device_ops {
        struct module   *owner;
        int     (*vgpu_supported_config)(struct pci_dev *dev, char *config);
        int     (*vgpu_create)(struct pci_dev *dev, uuid_le vm_uuid,
                               uint32_t instance, uint32_t vgpu_id);
        int     (*vgpu_destroy)(struct pci_dev *dev, uuid_le vm_uuid,
                                uint32_t instance);
        int     (*vgpu_start)(uuid_le vm_uuid);
        int     (*vgpu_shutdown)(uuid_le vm_uuid);
        ssize_t (*read) (struct vgpu_device *vdev, char *buf, size_t count,
                         uint32_t address_space, loff_t pos);
        ssize_t (*write)(struct vgpu_device *vdev, char *buf, size_t count,
                         uint32_t address_space,loff_t pos);
        int     (*vgpu_set_irqs)(struct vgpu_device *vdev, uint32_t flags,
                                 unsigned index, unsigned start, unsigned count,
                                 void *data);

};

int vgpu_map_virtual_bar
(
    uint64_t virt_bar_addr,
    uint64_t phys_bar_addr,
    uint32_t len,
    uint32_t flags
)

int vgpu_dma_do_translate(dma_addr_t *gfn_buffer, uint32_t count)

Change-Id: Ib70304d9a600c311d5107a94b3fffa938926275b
Signed-off-by: Kirti Wankhede <address@hidden>
Signed-off-by: Neo Jia <address@hidden>
---
 drivers/Kconfig                      |   2 +
 drivers/Makefile                     |   1 +
 drivers/vfio/vfio.c                  |   5 +-
 drivers/vgpu/Kconfig                 |  26 ++
 drivers/vgpu/Makefile                |   5 +
 drivers/vgpu/vfio_iommu_type1_vgpu.c | 511 ++++++++++++++++++++++++++++++++
 drivers/vgpu/vgpu_dev.c              | 550 +++++++++++++++++++++++++++++++++++
 drivers/vgpu/vgpu_private.h          |  47 +++
 drivers/vgpu/vgpu_sysfs.c            | 322 ++++++++++++++++++++
 drivers/vgpu/vgpu_vfio.c             | 521 +++++++++++++++++++++++++++++++++
 include/linux/vgpu.h                 | 157 ++++++++++
 11 files changed, 2144 insertions(+), 3 deletions(-)
 create mode 100644 drivers/vgpu/Kconfig
 create mode 100644 drivers/vgpu/Makefile
 create mode 100644 drivers/vgpu/vfio_iommu_type1_vgpu.c
 create mode 100644 drivers/vgpu/vgpu_dev.c
 create mode 100644 drivers/vgpu/vgpu_private.h
 create mode 100644 drivers/vgpu/vgpu_sysfs.c
 create mode 100644 drivers/vgpu/vgpu_vfio.c
 create mode 100644 include/linux/vgpu.h

diff --git a/drivers/Kconfig b/drivers/Kconfig
index d2ac339de85f..5fd9eae79914 100644
--- a/drivers/Kconfig
+++ b/drivers/Kconfig
@@ -122,6 +122,8 @@ source "drivers/uio/Kconfig"
 
 source "drivers/vfio/Kconfig"
 
+source "drivers/vgpu/Kconfig"
+
 source "drivers/vlynq/Kconfig"
 
 source "drivers/virt/Kconfig"
diff --git a/drivers/Makefile b/drivers/Makefile
index 795d0ca714bf..142256b4358b 100644
--- a/drivers/Makefile
+++ b/drivers/Makefile
@@ -84,6 +84,7 @@ obj-$(CONFIG_FUSION)          += message/
 obj-y                          += firewire/
 obj-$(CONFIG_UIO)              += uio/
 obj-$(CONFIG_VFIO)             += vfio/
+obj-$(CONFIG_VGPU)              += vgpu/
 obj-y                          += cdrom/
 obj-y                          += auxdisplay/
 obj-$(CONFIG_PCCARD)           += pcmcia/
diff --git a/drivers/vfio/vfio.c b/drivers/vfio/vfio.c
index 6070b793cbcb..af3ab413e119 100644
--- a/drivers/vfio/vfio.c
+++ b/drivers/vfio/vfio.c
@@ -947,19 +947,18 @@ static long vfio_ioctl_set_iommu(struct vfio_container 
*container,
                if (IS_ERR(data)) {
                        ret = PTR_ERR(data);
                        module_put(driver->ops->owner);
-                       goto skip_drivers_unlock;
+                       continue;
                }
 
                ret = __vfio_container_attach_groups(container, driver, data);
                if (!ret) {
                        container->iommu_driver = driver;
                        container->iommu_data = data;
+                       goto skip_drivers_unlock;
                } else {
                        driver->ops->release(data);
                        module_put(driver->ops->owner);
                }
-
-               goto skip_drivers_unlock;
        }
 
        mutex_unlock(&vfio.iommu_drivers_lock);
diff --git a/drivers/vgpu/Kconfig b/drivers/vgpu/Kconfig
new file mode 100644
index 000000000000..698ddf907a16
--- /dev/null
+++ b/drivers/vgpu/Kconfig
@@ -0,0 +1,26 @@
+
+menuconfig VGPU
+    tristate "VGPU driver framework"
+    depends on VFIO
+    select VGPU_VFIO
+    select VFIO_IOMMU_TYPE1_VGPU
+    help
+        VGPU provides a framework to virtualize GPU without SR-IOV cap
+        See Documentation/vgpu.txt for more details.
+
+        If you don't know what do here, say N.
+
+config VGPU
+    tristate
+    depends on VFIO
+    default n
+
+config VGPU_VFIO
+    tristate
+    depends on VGPU 
+    default n
+
+config VFIO_IOMMU_TYPE1_VGPU
+    tristate
+    depends on VGPU_VFIO
+    default n
diff --git a/drivers/vgpu/Makefile b/drivers/vgpu/Makefile
new file mode 100644
index 000000000000..098a3591a535
--- /dev/null
+++ b/drivers/vgpu/Makefile
@@ -0,0 +1,5 @@
+
+vgpu-y := vgpu_sysfs.o vgpu_dev.o vgpu_vfio.o
+
+obj-$(CONFIG_VGPU)     += vgpu.o
+obj-$(CONFIG_VFIO_IOMMU_TYPE1_VGPU) += vfio_iommu_type1_vgpu.o
diff --git a/drivers/vgpu/vfio_iommu_type1_vgpu.c 
b/drivers/vgpu/vfio_iommu_type1_vgpu.c
new file mode 100644
index 000000000000..6b20f1374b3b
--- /dev/null
+++ b/drivers/vgpu/vfio_iommu_type1_vgpu.c
@@ -0,0 +1,511 @@
+/*
+ * VGPU : IOMMU DMA mapping support for VGPU
+ *
+ * Copyright (c) 2016, NVIDIA CORPORATION. All rights reserved.
+ *     Author: Neo Jia <address@hidden>
+ *            Kirti Wankhede <address@hidden>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ */
+
+#include <linux/init.h>
+#include <linux/module.h>
+#include <linux/compat.h>
+#include <linux/device.h>
+#include <linux/kernel.h>
+#include <linux/fs.h>
+#include <linux/miscdevice.h>
+#include <linux/sched.h>
+#include <linux/wait.h>
+#include <linux/uuid.h>
+#include <linux/vfio.h>
+#include <linux/iommu.h>
+#include <linux/vgpu.h>
+
+#include "vgpu_private.h"
+
+#define DRIVER_VERSION "0.1"
+#define DRIVER_AUTHOR  "NVIDIA Corporation"
+#define DRIVER_DESC     "VGPU Type1 IOMMU driver for VFIO"
+
+// VFIO structures
+
+struct vfio_iommu_vgpu {
+       struct mutex lock;
+       struct iommu_group *group;
+       struct vgpu_device *vgpu_dev;
+       struct rb_root dma_list;
+       struct mm_struct * vm_mm;
+};
+
+struct vgpu_vfio_dma {
+       struct rb_node node;
+       dma_addr_t iova;
+       unsigned long vaddr;
+       size_t size;
+       int prot;
+};
+
+/*
+ * VGPU VFIO FOPs definition
+ *
+ */
+
+/*
+ * Duplicated from vfio_link_dma, just quick hack ... should
+ * reuse code later
+ */
+
+static void vgpu_link_dma(struct vfio_iommu_vgpu *iommu,
+                         struct vgpu_vfio_dma *new)
+{
+       struct rb_node **link = &iommu->dma_list.rb_node, *parent = NULL;
+       struct vgpu_vfio_dma *dma;
+
+       while (*link) {
+               parent = *link;
+               dma = rb_entry(parent, struct vgpu_vfio_dma, node);
+
+               if (new->iova + new->size <= dma->iova)
+                       link = &(*link)->rb_left;
+               else
+                       link = &(*link)->rb_right;
+       }
+
+       rb_link_node(&new->node, parent, link);
+       rb_insert_color(&new->node, &iommu->dma_list);
+}
+
+static struct vgpu_vfio_dma *vgpu_find_dma(struct vfio_iommu_vgpu *iommu,
+                                          dma_addr_t start, size_t size)
+{
+       struct rb_node *node = iommu->dma_list.rb_node;
+
+       while (node) {
+               struct vgpu_vfio_dma *dma = rb_entry(node, struct 
vgpu_vfio_dma, node);
+
+               if (start + size <= dma->iova)
+                       node = node->rb_left;
+               else if (start >= dma->iova + dma->size)
+                       node = node->rb_right;
+               else
+                       return dma;
+       }
+
+       return NULL;
+}
+
+static void vgpu_unlink_dma(struct vfio_iommu_vgpu *iommu, struct 
vgpu_vfio_dma *old)
+{
+       rb_erase(&old->node, &iommu->dma_list);
+}
+
+static void vgpu_dump_dma(struct vfio_iommu_vgpu *iommu)
+{
+       struct vgpu_vfio_dma *c, *n;
+       uint32_t i = 0;
+
+       rbtree_postorder_for_each_entry_safe(c, n, &iommu->dma_list, node)
+               printk(KERN_INFO "%s: dma[%d] iova:0x%llx, vaddr:0x%lx, 
size:0x%lx\n",
+                      __FUNCTION__, i++, c->iova, c->vaddr, c->size);
+}
+
+static int vgpu_dma_do_track(struct vfio_iommu_vgpu * vgpu_iommu,
+       struct vfio_iommu_type1_dma_map *map)
+{
+       dma_addr_t iova = map->iova;
+       unsigned long vaddr = map->vaddr;
+       int ret = 0, prot = 0;
+       struct vgpu_vfio_dma *vgpu_dma;
+
+       mutex_lock(&vgpu_iommu->lock);
+
+       if (vgpu_find_dma(vgpu_iommu, map->iova, map->size)) {
+               mutex_unlock(&vgpu_iommu->lock);
+               return -EEXIST;
+       }
+
+       vgpu_dma = kzalloc(sizeof(*vgpu_dma), GFP_KERNEL);
+
+       if (!vgpu_dma) {
+               mutex_unlock(&vgpu_iommu->lock);
+               return -ENOMEM;
+       }
+
+       vgpu_dma->iova = iova;
+       vgpu_dma->vaddr = vaddr;
+       vgpu_dma->prot = prot;
+       vgpu_dma->size = map->size;
+
+       vgpu_link_dma(vgpu_iommu, vgpu_dma);
+
+       mutex_unlock(&vgpu_iommu->lock);
+       return ret;
+}
+
+static int vgpu_dma_do_untrack(struct vfio_iommu_vgpu * vgpu_iommu,
+       struct vfio_iommu_type1_dma_unmap *unmap)
+{
+       struct vgpu_vfio_dma *vgpu_dma;
+       size_t unmapped = 0;
+       int ret = 0;
+
+       mutex_lock(&vgpu_iommu->lock);
+
+       vgpu_dma = vgpu_find_dma(vgpu_iommu, unmap->iova, 0);
+       if (vgpu_dma && vgpu_dma->iova != unmap->iova) {
+               ret = -EINVAL;
+               goto unlock;
+       }
+
+       vgpu_dma = vgpu_find_dma(vgpu_iommu, unmap->iova + unmap->size - 1, 0);
+       if (vgpu_dma && vgpu_dma->iova + vgpu_dma->size != unmap->iova + 
unmap->size) {
+               ret = -EINVAL;
+               goto unlock;
+       }
+
+       while (( vgpu_dma = vgpu_find_dma(vgpu_iommu, unmap->iova, 
unmap->size))) {
+               unmapped += vgpu_dma->size;
+               vgpu_unlink_dma(vgpu_iommu, vgpu_dma);
+       }
+
+unlock:
+       mutex_unlock(&vgpu_iommu->lock);
+       unmap->size = unmapped;
+
+       return ret;
+}
+
+/* Ugly hack to quickly test single deivce ... */
+
+static struct vfio_iommu_vgpu *_local_iommu = NULL;
+
+int vgpu_map_virtual_bar
+(
+       uint64_t virt_bar_addr,
+        uint64_t phys_bar_addr,
+       uint32_t len,
+       uint32_t flags
+)
+{
+       struct vfio_iommu_vgpu *vgpu_iommu = _local_iommu;
+       unsigned long remote_vaddr = 0;
+       struct vgpu_vfio_dma *vgpu_dma = NULL;
+       struct vm_area_struct *remote_vma = NULL;
+       struct mm_struct *mm = vgpu_iommu->vm_mm;
+       int ret = 0;
+
+       printk(KERN_INFO "%s: >>>>\n", __FUNCTION__);
+
+       mutex_lock(&vgpu_iommu->lock);
+
+       vgpu_dump_dma(vgpu_iommu);
+
+       down_write(&mm->mmap_sem);
+
+       vgpu_dma = vgpu_find_dma(vgpu_iommu, virt_bar_addr, len /*  size */);
+       if (!vgpu_dma) {
+               printk(KERN_INFO "%s: fail locate guest physical:0x%llx\n",
+                      __FUNCTION__, virt_bar_addr);
+               ret = -EINVAL;
+               goto unlock;
+       }
+
+       remote_vaddr = vgpu_dma->vaddr + virt_bar_addr - vgpu_dma->iova;
+
+        remote_vma = find_vma(mm, remote_vaddr);
+
+       if (remote_vma == NULL) {
+               printk(KERN_INFO "%s: fail locate vma, physical addr:0x%llx\n",
+                      __FUNCTION__, virt_bar_addr);
+               ret = -EINVAL;
+               goto unlock;
+       }
+       else {
+               printk(KERN_INFO "%s: locate vma, addr:0x%lx\n",
+                      __FUNCTION__, remote_vma->vm_start);
+       }
+
+       remote_vma->vm_page_prot = pgprot_noncached(remote_vma->vm_page_prot);
+
+       remote_vma->vm_pgoff = phys_bar_addr >> PAGE_SHIFT;
+
+       ret = remap_pfn_range(remote_vma, virt_bar_addr, remote_vma->vm_pgoff,
+                       len, remote_vma->vm_page_prot);
+
+       if (ret) {
+               printk(KERN_INFO "%s: fail to remap vma:%d\n", __FUNCTION__, 
ret);
+               goto unlock;
+       }
+
+unlock:
+
+       up_write(&mm->mmap_sem);
+       mutex_unlock(&vgpu_iommu->lock);
+       printk(KERN_INFO "%s: <<<<\n", __FUNCTION__);
+
+       return ret;
+}
+
+EXPORT_SYMBOL(vgpu_map_virtual_bar);
+
+int vgpu_dma_do_translate(dma_addr_t *gfn_buffer, uint32_t count)
+{
+       int i = 0, ret = 0, prot = 0;
+       unsigned long remote_vaddr = 0, pfn = 0;
+       struct vfio_iommu_vgpu *vgpu_iommu = _local_iommu;
+       struct vgpu_vfio_dma *vgpu_dma;
+       struct page *page[1];
+       // unsigned long * addr = NULL;
+       struct mm_struct *mm = vgpu_iommu->vm_mm;
+
+       prot = IOMMU_READ | IOMMU_WRITE;
+
+       printk(KERN_INFO "%s: >>>>\n", __FUNCTION__);
+
+       mutex_lock(&vgpu_iommu->lock);
+
+       vgpu_dump_dma(vgpu_iommu);
+
+       for (i = 0; i < count; i++) {
+               dma_addr_t iova = gfn_buffer[i] << PAGE_SHIFT;
+               vgpu_dma = vgpu_find_dma(vgpu_iommu, iova, 0 /*  size */);
+
+               if (!vgpu_dma) {
+                       printk(KERN_INFO "%s: fail locate iova[%d]:0x%llx\n", 
__FUNCTION__, i, iova);
+                       ret = -EINVAL;
+                       goto unlock;
+               }
+
+               remote_vaddr = vgpu_dma->vaddr + iova - vgpu_dma->iova;
+               printk(KERN_INFO "%s: find dma iova[%d]:0x%llx, vaddr:0x%lx, 
size:0x%lx, remote_vaddr:0x%lx\n",
+                       __FUNCTION__, i, vgpu_dma->iova,
+                       vgpu_dma->vaddr, vgpu_dma->size, remote_vaddr);
+
+               if (get_user_pages_unlocked(NULL, mm, remote_vaddr, 1, 1, 0, 
page) == 1) {
+                       pfn = page_to_pfn(page[0]);
+                       printk(KERN_INFO "%s: pfn[%d]:0x%lx\n", __FUNCTION__, 
i, pfn);
+                       // addr = vmap(page, 1, VM_MAP, PAGE_KERNEL);
+               }
+               else {
+                       printk(KERN_INFO "%s: fail to pin pfn[%d]\n", 
__FUNCTION__, i);
+                       ret = -ENOMEM;
+                       goto unlock;
+               }
+
+               gfn_buffer[i] = pfn;
+               // vunmap(addr);
+
+       }
+
+unlock:
+       mutex_unlock(&vgpu_iommu->lock);
+       printk(KERN_INFO "%s: <<<<\n", __FUNCTION__);
+       return ret;
+}
+
+EXPORT_SYMBOL(vgpu_dma_do_translate);
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+static void *vfio_iommu_vgpu_open(unsigned long arg)
+{
+       struct vfio_iommu_vgpu *iommu;
+
+       iommu = kzalloc(sizeof(*iommu), GFP_KERNEL);
+
+       if (!iommu)
+               return ERR_PTR(-ENOMEM);
+
+       mutex_init(&iommu->lock);
+
+       printk(KERN_INFO "%s", __FUNCTION__);
+
+       /* TODO: Keep track the v2 vs. v1, for now only assume
+        * we are v2 due to QEMU code */
+       _local_iommu = iommu;
+       return iommu;
+}
+
+static void vfio_iommu_vgpu_release(void *iommu_data)
+{
+       struct vfio_iommu_vgpu *iommu = iommu_data;
+       kfree(iommu);
+       printk(KERN_INFO "%s", __FUNCTION__);
+}
+
+static long vfio_iommu_vgpu_ioctl(void *iommu_data,
+               unsigned int cmd, unsigned long arg)
+{
+       int ret = 0;
+       unsigned long minsz;
+       struct vfio_iommu_vgpu *vgpu_iommu = iommu_data;
+
+       switch (cmd) {
+       case VFIO_CHECK_EXTENSION:
+       {
+               if ((arg == VFIO_TYPE1_IOMMU) || (arg == VFIO_TYPE1v2_IOMMU))
+                       return 1;
+               else
+                       return 0;
+       }
+
+       case VFIO_IOMMU_GET_INFO:
+       {
+               struct vfio_iommu_type1_info info;
+               minsz = offsetofend(struct vfio_iommu_type1_info, iova_pgsizes);
+
+               if (copy_from_user(&info, (void __user *)arg, minsz))
+                       return -EFAULT;
+
+               if (info.argsz < minsz)
+                       return -EINVAL;
+
+               info.flags = 0;
+
+               return copy_to_user((void __user *)arg, &info, minsz);
+       }
+       case VFIO_IOMMU_MAP_DMA:
+       {
+               // TODO
+               struct vfio_iommu_type1_dma_map map;
+               minsz = offsetofend(struct vfio_iommu_type1_dma_map, size);
+
+               if (copy_from_user(&map, (void __user *)arg, minsz))
+                       return -EFAULT;
+
+               if (map.argsz < minsz)
+                       return -EINVAL;
+
+               printk(KERN_INFO "VGPU-IOMMU:MAP_DMA flags:%d, vaddr:0x%llx, 
iova:0x%llx, size:0x%llx\n",
+                       map.flags, map.vaddr, map.iova, map.size);
+
+               /*
+                * TODO: Tracking code is mostly duplicated from TYPE1 IOMMU, 
ideally,
+                * this should be merged into one single file and reuse data
+                * structure
+                *
+                */
+               ret = vgpu_dma_do_track(vgpu_iommu, &map);
+               break;
+       }
+       case VFIO_IOMMU_UNMAP_DMA:
+       {
+               // TODO
+               struct vfio_iommu_type1_dma_unmap unmap;
+
+               minsz = offsetofend(struct vfio_iommu_type1_dma_unmap, size);
+
+               if (copy_from_user(&unmap, (void __user *)arg, minsz))
+                       return -EFAULT;
+
+               if (unmap.argsz < minsz)
+                       return -EINVAL;
+
+               ret = vgpu_dma_do_untrack(vgpu_iommu, &unmap);
+               break;
+       }
+       default:
+       {
+               printk(KERN_INFO "%s cmd default ", __FUNCTION__);
+               ret = -ENOTTY;
+               break;
+       }
+       }
+
+       return ret;
+}
+
+
+static int vfio_iommu_vgpu_attach_group(void *iommu_data,
+                                       struct iommu_group *iommu_group)
+{
+       struct vfio_iommu_vgpu *iommu = iommu_data;
+       struct vgpu_device *vgpu_dev = NULL;
+
+       printk(KERN_INFO "%s", __FUNCTION__);
+
+       vgpu_dev = get_vgpu_device_from_group(iommu_group);
+       if (vgpu_dev) {
+               iommu->vgpu_dev = vgpu_dev;
+               iommu->group = iommu_group;
+
+               /* IOMMU shares the same life cylce as VM MM */
+               iommu->vm_mm = current->mm;
+
+               printk(KERN_INFO "%s index %d", __FUNCTION__, vgpu_dev->minor);
+               return 0;
+       }
+       iommu->group = iommu_group;
+       return 1;
+}
+
+static void vfio_iommu_vgpu_detach_group(void *iommu_data,
+               struct iommu_group *iommu_group)
+{
+       struct vfio_iommu_vgpu *iommu = iommu_data;
+
+       printk(KERN_INFO "%s", __FUNCTION__);
+       iommu->vm_mm = NULL;
+       iommu->group = NULL;
+
+       return;
+}
+
+
+static const struct vfio_iommu_driver_ops vfio_iommu_vgpu_driver_ops = {
+       .name           = "vgpu_vfio",
+       .owner          = THIS_MODULE,
+       .open           = vfio_iommu_vgpu_open,
+       .release        = vfio_iommu_vgpu_release,
+       .ioctl          = vfio_iommu_vgpu_ioctl,
+       .attach_group   = vfio_iommu_vgpu_attach_group,
+       .detach_group   = vfio_iommu_vgpu_detach_group,
+};
+
+
+int vgpu_vfio_iommu_init(void)
+{
+       int rc = vfio_register_iommu_driver(&vfio_iommu_vgpu_driver_ops);
+
+       printk(KERN_INFO "%s\n", __FUNCTION__);
+       if (rc < 0) {
+               printk(KERN_ERR "Error: failed to register vfio iommu, 
err:%d\n", rc);
+       }
+
+       return rc;
+}
+
+void vgpu_vfio_iommu_exit(void)
+{
+       // unregister vgpu_vfio driver
+       vfio_unregister_iommu_driver(&vfio_iommu_vgpu_driver_ops);
+       printk(KERN_INFO "%s\n", __FUNCTION__);
+}
+
+
+module_init(vgpu_vfio_iommu_init);
+module_exit(vgpu_vfio_iommu_exit);
+
+MODULE_VERSION(DRIVER_VERSION);
+MODULE_LICENSE("GPL");
+MODULE_AUTHOR(DRIVER_AUTHOR);
+MODULE_DESCRIPTION(DRIVER_DESC);
+
diff --git a/drivers/vgpu/vgpu_dev.c b/drivers/vgpu/vgpu_dev.c
new file mode 100644
index 000000000000..1d4eb235122c
--- /dev/null
+++ b/drivers/vgpu/vgpu_dev.c
@@ -0,0 +1,550 @@
+/*
+ * VGPU core
+ *
+ * Copyright (c) 2016, NVIDIA CORPORATION. All rights reserved.
+ *     Author: Neo Jia <address@hidden>
+ *            Kirti Wankhede <address@hidden>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ */
+
+#include <linux/init.h>
+#include <linux/module.h>
+#include <linux/device.h>
+#include <linux/kernel.h>
+#include <linux/fs.h>
+#include <linux/poll.h>
+#include <linux/slab.h>
+#include <linux/cdev.h>
+#include <linux/sched.h>
+#include <linux/wait.h>
+#include <linux/uuid.h>
+#include <linux/vfio.h>
+#include <linux/iommu.h>
+#include <linux/sysfs.h>
+#include <linux/ctype.h>
+#include <linux/vgpu.h>
+
+#include "vgpu_private.h"
+
+#define DRIVER_VERSION "0.1"
+#define DRIVER_AUTHOR  "NVIDIA Corporation"
+#define DRIVER_DESC    "VGPU driver"
+
+/*
+ * #defines
+ */
+
+#define VGPU_CLASS_NAME                "vgpu"
+
+#define VGPU_DEV_NAME          "vgpu"
+
+// TODO remove these defines
+// minor number reserved for control device
+#define VGPU_CONTROL_DEVICE       0
+
+#define VGPU_CONTROL_DEVICE_NAME  "vgpuctl"
+
+/*
+ * Global Structures
+ */
+
+static struct vgpu {
+       dev_t               vgpu_devt;
+       struct class        *class;
+       struct cdev         vgpu_cdev;
+       struct list_head    vgpu_devices_list;  // Head entry for the doubly 
linked vgpu_device list
+       struct mutex        vgpu_devices_lock;
+       struct idr          vgpu_idr;
+       struct list_head    gpu_devices_list;
+       struct mutex        gpu_devices_lock;
+} vgpu;
+
+
+/*
+ * Function prototypes
+ */
+
+static void  vgpu_device_destroy(struct vgpu_device *vgpu_dev);
+
+unsigned int vgpu_poll(struct file *file, poll_table *wait);
+long vgpu_unlocked_ioctl(struct file *file, unsigned int cmd, unsigned long 
i_arg);
+int vgpu_mmap(struct file *file, struct vm_area_struct *vma);
+
+int vgpu_open(struct inode *inode, struct file *file);
+int vgpu_close(struct inode *inode, struct file *file);
+ssize_t vgpu_read(struct file *file, char __user * buf,
+                     size_t len, loff_t * ppos);
+ssize_t vgpu_write(struct file *file, const char __user *data,
+                      size_t len, loff_t *ppos);
+
+/*
+ * Functions
+ */
+
+struct vgpu_device *get_vgpu_device_from_group(struct iommu_group *group)
+{
+
+       struct vgpu_device *vdev = NULL;
+
+       mutex_lock(&vgpu.vgpu_devices_lock);
+       list_for_each_entry(vdev, &vgpu.vgpu_devices_list, list) {
+               if (vdev->group) {
+                       if (iommu_group_id(vdev->group) == 
iommu_group_id(group)) {
+                               mutex_unlock(&vgpu.vgpu_devices_lock);
+                               return vdev;
+                       }
+               }
+       }
+       mutex_unlock(&vgpu.vgpu_devices_lock);
+       return NULL;
+}
+
+EXPORT_SYMBOL_GPL(get_vgpu_device_from_group);
+
+int vgpu_register_device(struct pci_dev *dev, const struct gpu_device_ops *ops)
+{
+       int ret = 0;
+       struct gpu_device *gpu_dev, *tmp;
+
+       if (!dev)
+               return -EINVAL;
+
+        gpu_dev = kzalloc(sizeof(*gpu_dev), GFP_KERNEL);
+        if (!gpu_dev)
+                return -ENOMEM;
+
+       gpu_dev->dev = dev;
+        gpu_dev->ops = ops;
+
+        mutex_lock(&vgpu.gpu_devices_lock);
+
+        /* Check for duplicates */
+        list_for_each_entry(tmp, &vgpu.gpu_devices_list, gpu_next) {
+                if (tmp->dev == dev) {
+                        mutex_unlock(&vgpu.gpu_devices_lock);
+                        kfree(gpu_dev);
+                        return -EINVAL;
+                }
+        }
+
+       ret = vgpu_create_pci_device_files(dev);
+       if (ret) {
+               mutex_unlock(&vgpu.gpu_devices_lock);
+               kfree(gpu_dev);
+               return ret;
+       }
+        list_add(&gpu_dev->gpu_next, &vgpu.gpu_devices_list);
+
+       printk(KERN_INFO "VGPU: Registered dev 0x%x 0x%x, class 0x%x\n", 
dev->vendor, dev->device, dev->class);
+        mutex_unlock(&vgpu.gpu_devices_lock);
+
+        return 0;
+}
+EXPORT_SYMBOL(vgpu_register_device);
+
+void vgpu_unregister_device(struct pci_dev *dev)
+{
+        struct gpu_device *gpu_dev;
+
+        mutex_lock(&vgpu.gpu_devices_lock);
+        list_for_each_entry(gpu_dev, &vgpu.gpu_devices_list, gpu_next) {
+                if (gpu_dev->dev == dev) {
+                       printk(KERN_INFO "VGPU: Unregistered dev 0x%x 0x%x, 
class 0x%x\n", dev->vendor, dev->device, dev->class);
+                       vgpu_remove_pci_device_files(dev);
+                        list_del(&gpu_dev->gpu_next);
+                        mutex_unlock(&vgpu.gpu_devices_lock);
+                        kfree(gpu_dev);
+                        return;
+                }
+        }
+        mutex_unlock(&vgpu.gpu_devices_lock);
+}
+EXPORT_SYMBOL(vgpu_unregister_device);
+
+
+/*
+ *  Static functions
+ */
+
+static struct file_operations vgpu_fops = {
+       .owner          = THIS_MODULE,
+};
+
+static void  vgpu_device_destroy(struct vgpu_device *vgpu_dev)
+{
+       if (vgpu_dev->dev) {
+               device_destroy(vgpu.class, vgpu_dev->dev->devt);
+               vgpu_dev->dev = NULL;
+       }
+}
+
+/*
+ * Helper Functions
+ */
+
+static struct vgpu_device *vgpu_device_alloc(uuid_le uuid, int instance, char 
*name)
+{
+       struct vgpu_device *vgpu_dev = NULL;
+
+       vgpu_dev = kzalloc(sizeof(*vgpu_dev), GFP_KERNEL);
+       if (!vgpu_dev)
+               return ERR_PTR(-ENOMEM);
+
+       kref_init(&vgpu_dev->kref);
+       memcpy(&vgpu_dev->vm_uuid, &uuid, sizeof(uuid_le));
+       vgpu_dev->vgpu_instance = instance;
+       strcpy(vgpu_dev->dev_name, name);
+
+       mutex_lock(&vgpu.vgpu_devices_lock);
+       list_add(&vgpu_dev->list, &vgpu.vgpu_devices_list);
+       mutex_unlock(&vgpu.vgpu_devices_lock);
+
+       return vgpu_dev;
+}
+
+static void vgpu_device_free(struct vgpu_device *vgpu_dev)
+{
+       mutex_lock(&vgpu.vgpu_devices_lock);
+       list_del(&vgpu_dev->list);
+       mutex_unlock(&vgpu.vgpu_devices_lock);
+       kfree(vgpu_dev);
+}
+
+struct vgpu_device *vgpu_drv_get_vgpu_device_by_uuid(uuid_le uuid, int 
instance)
+{
+       struct vgpu_device *vdev = NULL;
+
+       mutex_lock(&vgpu.vgpu_devices_lock);
+       list_for_each_entry(vdev, &vgpu.vgpu_devices_list, list) {
+               if ((uuid_le_cmp(vdev->vm_uuid, uuid) == 0) &&
+                               (vdev->vgpu_instance == instance)) {
+                       mutex_unlock(&vgpu.vgpu_devices_lock);
+                       return vdev;
+               }
+       }
+       mutex_unlock(&vgpu.vgpu_devices_lock);
+       return NULL;
+}
+
+struct vgpu_device *find_vgpu_device(struct device *dev)
+{
+       struct vgpu_device *vdev = NULL;
+
+       mutex_lock(&vgpu.vgpu_devices_lock);
+       list_for_each_entry(vdev, &vgpu.vgpu_devices_list, list) {
+               if (vdev->dev == dev) {
+                       mutex_unlock(&vgpu.vgpu_devices_lock);
+                       return vdev;
+               }
+       }
+
+       mutex_unlock(&vgpu.vgpu_devices_lock);
+       return NULL;
+}
+
+int create_vgpu_device(struct pci_dev *pdev, uuid_le vm_uuid, uint32_t 
instance, uint32_t vgpu_id)
+{
+       int minor;
+       char name[64];
+       int numChar = 0;
+       int retval = 0;
+
+       struct iommu_group *group = NULL;
+       struct device *dev = NULL;
+       struct vgpu_device *vgpu_dev = NULL;
+
+       struct gpu_device *gpu_dev;
+
+       printk(KERN_INFO "VGPU: %s: device ", __FUNCTION__);
+
+       numChar = sprintf(name, "%pUb-%d", vm_uuid.b, instance);
+       name[numChar] = '\0';
+
+       vgpu_dev = vgpu_device_alloc(vm_uuid, instance, name);
+       if (IS_ERR(vgpu_dev)) {
+               return PTR_ERR(vgpu_dev);
+       }
+
+       // check if VM device is present
+       // if not present, create with devt=0 and parent=NULL
+       // create device for instance with devt= MKDEV(vgpu.major, minor)
+       // and parent=VM device
+
+       mutex_lock(&vgpu.vgpu_devices_lock);
+
+       vgpu_dev->vgpu_id = vgpu_id;
+
+       // TODO on removing control device change the 3rd parameter to 0
+       minor = idr_alloc(&vgpu.vgpu_idr, vgpu_dev, 1, MINORMASK + 1, 
GFP_KERNEL);
+       if (minor < 0) {
+               retval = minor;
+               goto create_failed;
+       }
+
+       dev = device_create(vgpu.class, NULL, MKDEV(MAJOR(vgpu.vgpu_devt), 
minor), NULL, "%s", name);
+       if (IS_ERR(dev)) {
+               retval = PTR_ERR(dev);
+               goto create_failed1;
+       }
+
+       vgpu_dev->dev = dev;
+       vgpu_dev->minor = minor;
+
+       mutex_lock(&vgpu.gpu_devices_lock);
+       list_for_each_entry(gpu_dev, &vgpu.gpu_devices_list, gpu_next) {
+               if (gpu_dev->dev == pdev) {
+                       vgpu_dev->gpu_dev = gpu_dev;
+                       if (gpu_dev->ops->vgpu_create) {
+                               retval = gpu_dev->ops->vgpu_create(pdev, 
vgpu_dev->vm_uuid,
+                                                                  instance, 
vgpu_id);
+                               if (retval)
+                               {
+                                       mutex_unlock(&vgpu.gpu_devices_lock);
+                                       goto create_failed2;
+                               }
+                       }
+                       break;
+               }
+       }
+       mutex_unlock(&vgpu.gpu_devices_lock);
+
+       if (!vgpu_dev->gpu_dev) {
+               retval = -EINVAL;
+               goto create_failed2;
+       }
+
+       mutex_lock(&vgpu.gpu_devices_lock);
+       mutex_unlock(&vgpu.gpu_devices_lock);
+
+       printk(KERN_INFO "UUID %pUb \n", vgpu_dev->vm_uuid.b);
+
+       group = iommu_group_alloc();
+       if (IS_ERR(group)) {
+               printk(KERN_ERR "VGPU: failed to allocate group!\n");
+               retval = PTR_ERR(group);
+               goto create_failed2;
+       }
+
+       retval = iommu_group_add_device(group, dev);
+       if (retval) {
+               printk(KERN_ERR "VGPU: failed to add dev to group!\n");
+               iommu_group_put(group);
+               goto create_failed2;
+       }
+
+       retval = vgpu_group_init(vgpu_dev, group);
+       if (retval) {
+               printk(KERN_ERR "VGPU: failed vgpu_group_init \n");
+               iommu_group_put(group);
+               iommu_group_remove_device(dev);
+               goto create_failed2;
+       }
+
+       vgpu_dev->group = group;
+       printk(KERN_INFO "VGPU: group_id = %d \n", iommu_group_id(group));
+
+       mutex_unlock(&vgpu.vgpu_devices_lock);
+       return retval;
+
+create_failed2:
+       vgpu_device_destroy(vgpu_dev);
+
+create_failed1:
+       idr_remove(&vgpu.vgpu_idr, minor);
+
+create_failed:
+       mutex_unlock(&vgpu.vgpu_devices_lock);
+       vgpu_device_free(vgpu_dev);
+
+       return retval;
+}
+
+void destroy_vgpu_device(struct vgpu_device *vgpu_dev)
+{
+       struct device *dev = vgpu_dev->dev;
+
+       if (!dev) {
+               return;
+       }
+
+       printk(KERN_INFO "VGPU: destroying device %s ", vgpu_dev->dev_name);
+       if (vgpu_dev->gpu_dev->ops->vgpu_destroy) {
+               int retval = 0;
+               retval = 
vgpu_dev->gpu_dev->ops->vgpu_destroy(vgpu_dev->gpu_dev->dev,
+                                                             vgpu_dev->vm_uuid,
+                                                             
vgpu_dev->vgpu_instance);
+       /* if vendor driver doesn't return success that means vendor driver 
doesn't
+        * support hot-unplug */
+               if (retval)
+                       return;
+       }
+
+       mutex_lock(&vgpu.vgpu_devices_lock);
+
+       vgpu_group_free(vgpu_dev);
+       iommu_group_put(dev->iommu_group);
+       iommu_group_remove_device(dev);
+       vgpu_device_destroy(vgpu_dev);
+       idr_remove(&vgpu.vgpu_idr, vgpu_dev->minor);
+
+       mutex_unlock(&vgpu.vgpu_devices_lock);
+       vgpu_device_free(vgpu_dev);
+}
+
+void destroy_vgpu_device_by_uuid(uuid_le uuid, int instance)
+{
+       struct vgpu_device *vdev, *vgpu_dev = NULL;
+
+       mutex_lock(&vgpu.vgpu_devices_lock);
+
+       // search VGPU device
+       list_for_each_entry(vdev, &vgpu.vgpu_devices_list, list) {
+               if ((uuid_le_cmp(vdev->vm_uuid, uuid) == 0) &&
+                               (vdev->vgpu_instance == instance)) {
+                       vgpu_dev = vdev;
+                       break;
+               }
+       }
+
+       mutex_unlock(&vgpu.vgpu_devices_lock);
+       if (vgpu_dev)
+               destroy_vgpu_device(vgpu_dev);
+}
+
+void get_vgpu_supported_types(struct device *dev, char *str)
+{
+       struct gpu_device *gpu_dev;
+
+       mutex_lock(&vgpu.gpu_devices_lock);
+       list_for_each_entry(gpu_dev, &vgpu.gpu_devices_list, gpu_next) {
+               if (&gpu_dev->dev->dev == dev) {
+                       if (gpu_dev->ops->vgpu_supported_config)
+                               
gpu_dev->ops->vgpu_supported_config(gpu_dev->dev, str);
+                       break;
+               }
+       }
+       mutex_unlock(&vgpu.gpu_devices_lock);
+}
+
+int vgpu_start_callback(struct vgpu_device *vgpu_dev)
+{
+       int ret = 0;
+
+       mutex_lock(&vgpu.gpu_devices_lock);
+       if (vgpu_dev->gpu_dev->ops->vgpu_start)
+               ret = vgpu_dev->gpu_dev->ops->vgpu_start(vgpu_dev->vm_uuid);
+       mutex_unlock(&vgpu.gpu_devices_lock);
+       return ret;
+}
+
+int vgpu_shutdown_callback(struct vgpu_device *vgpu_dev)
+{
+       int ret = 0;
+
+       mutex_lock(&vgpu.gpu_devices_lock);
+       if (vgpu_dev->gpu_dev->ops->vgpu_shutdown)
+               ret = vgpu_dev->gpu_dev->ops->vgpu_shutdown(vgpu_dev->vm_uuid);
+       mutex_unlock(&vgpu.gpu_devices_lock);
+       return ret;
+}
+
+int vgpu_set_irqs_callback(struct vgpu_device *vgpu_dev, uint32_t flags,
+                           unsigned index, unsigned start, unsigned count,
+                           void *data)
+{
+       int ret = 0;
+
+       mutex_lock(&vgpu.gpu_devices_lock);
+       if (vgpu_dev->gpu_dev->ops->vgpu_set_irqs)
+               ret = vgpu_dev->gpu_dev->ops->vgpu_set_irqs(vgpu_dev, flags,
+                                                          index, start, count, 
data);
+       mutex_unlock(&vgpu.gpu_devices_lock);
+       return ret;
+}
+
+char *vgpu_devnode(struct device *dev, umode_t *mode)
+{
+       return kasprintf(GFP_KERNEL, "vgpu/%s", dev_name(dev));
+}
+
+static struct class vgpu_class = {
+       .name           = VGPU_CLASS_NAME,
+       .owner          = THIS_MODULE,
+       .class_attrs    = vgpu_class_attrs,
+       .dev_groups     = vgpu_dev_groups,
+       .devnode        = vgpu_devnode,
+};
+
+static int __init vgpu_init(void)
+{
+       int rc = 0;
+
+       memset(&vgpu, 0 , sizeof(vgpu));
+
+       idr_init(&vgpu.vgpu_idr);
+       mutex_init(&vgpu.vgpu_devices_lock);
+       INIT_LIST_HEAD(&vgpu.vgpu_devices_list);
+       mutex_init(&vgpu.gpu_devices_lock);
+       INIT_LIST_HEAD(&vgpu.gpu_devices_list);
+
+       // get major number from kernel
+       rc = alloc_chrdev_region(&vgpu.vgpu_devt, 0, MINORMASK, VGPU_DEV_NAME);
+
+       if (rc < 0) {
+               printk(KERN_ERR "Error: failed to register vgpu drv, err:%d\n", 
rc);
+               return rc;
+       }
+
+       cdev_init(&vgpu.vgpu_cdev, &vgpu_fops);
+       cdev_add(&vgpu.vgpu_cdev, vgpu.vgpu_devt, MINORMASK);
+
+       printk(KERN_ALERT "major_number:%d is allocated for vgpu\n", 
MAJOR(vgpu.vgpu_devt));
+
+       rc = class_register(&vgpu_class);
+       if (rc < 0) {
+               printk(KERN_ERR "Error: failed to register vgpu class\n");
+               goto failed1;
+       }
+
+       vgpu.class = &vgpu_class;
+
+       return rc;
+
+failed1:
+       cdev_del(&vgpu.vgpu_cdev);
+       unregister_chrdev_region(vgpu.vgpu_devt, MINORMASK);
+
+       return rc;
+}
+
+static void __exit vgpu_exit(void)
+{
+       // TODO: Release all unclosed fd
+       struct vgpu_device *vdev = NULL, *tmp;
+
+       mutex_lock(&vgpu.vgpu_devices_lock);
+       list_for_each_entry_safe(vdev, tmp, &vgpu.vgpu_devices_list, list) {
+               printk(KERN_INFO "VGPU: exit destroying device %s ", 
vdev->dev_name);
+               mutex_unlock(&vgpu.vgpu_devices_lock);
+               destroy_vgpu_device(vdev);
+               mutex_lock(&vgpu.vgpu_devices_lock);
+       }
+       mutex_unlock(&vgpu.vgpu_devices_lock);
+
+       idr_destroy(&vgpu.vgpu_idr);
+       cdev_del(&vgpu.vgpu_cdev);
+       unregister_chrdev_region(vgpu.vgpu_devt, MINORMASK);
+       class_destroy(vgpu.class);
+       vgpu.class = NULL;
+}
+
+module_init(vgpu_init)
+module_exit(vgpu_exit)
+
+MODULE_VERSION(DRIVER_VERSION);
+MODULE_LICENSE("GPL");
+MODULE_AUTHOR(DRIVER_AUTHOR);
+MODULE_DESCRIPTION(DRIVER_DESC);
diff --git a/drivers/vgpu/vgpu_private.h b/drivers/vgpu/vgpu_private.h
new file mode 100644
index 000000000000..7e3c400d29f7
--- /dev/null
+++ b/drivers/vgpu/vgpu_private.h
@@ -0,0 +1,47 @@
+/*
+ * VGPU interal definition
+ *
+ * Copyright (c) 2016, NVIDIA CORPORATION. All rights reserved.
+ *     Author:
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ */
+
+#ifndef VGPU_PRIVATE_H
+#define VGPU_PRIVATE_H
+
+int vgpu_group_init(struct vgpu_device *vgpu_dev, struct iommu_group * group);
+
+int vgpu_group_free(struct vgpu_device *vgpu_dev);
+
+struct vgpu_device *find_vgpu_device(struct device *dev);
+
+struct vgpu_device *vgpu_drv_get_vgpu_device_by_uuid(uuid_le uuid, int 
instance);
+
+int create_vgpu_device(struct pci_dev *pdev, uuid_le vm_uuid, uint32_t 
instance, uint32_t vgpu_id);
+void destroy_vgpu_device(struct vgpu_device *vgpu_dev);
+void destroy_vgpu_device_by_uuid(uuid_le uuid, int instance);
+
+/* Function prototypes for vgpu_sysfs */
+
+extern struct class_attribute vgpu_class_attrs[];
+extern const struct attribute_group *vgpu_dev_groups[];
+
+int vgpu_create_status_file(struct vgpu_device *vgpu_dev);
+void vgpu_notify_status_file(struct vgpu_device *vgpu_dev);
+void vgpu_remove_status_file(struct vgpu_device *vgpu_dev);
+
+int vgpu_create_pci_device_files(struct pci_dev *dev);
+void vgpu_remove_pci_device_files(struct pci_dev *dev);
+
+void get_vgpu_supported_types(struct device *dev, char *str);
+int vgpu_start_callback(struct vgpu_device *vgpu_dev);
+int vgpu_shutdown_callback(struct vgpu_device *vgpu_dev);
+
+int vgpu_set_irqs_callback(struct vgpu_device *vgpu_dev, uint32_t flags,
+                           unsigned index, unsigned start, unsigned count,
+                           void *data);
+
+#endif /* VGPU_PRIVATE_H */
diff --git a/drivers/vgpu/vgpu_sysfs.c b/drivers/vgpu/vgpu_sysfs.c
new file mode 100644
index 000000000000..e48cbcd6948d
--- /dev/null
+++ b/drivers/vgpu/vgpu_sysfs.c
@@ -0,0 +1,322 @@
+/*
+ * File attributes for vGPU devices
+ *
+ * Copyright (c) 2016, NVIDIA CORPORATION. All rights reserved.
+ *     Author: Neo Jia <address@hidden>
+ *            Kirti Wankhede <address@hidden>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ */
+
+#include <linux/kernel.h>
+#include <linux/sched.h>
+#include <linux/fs.h>
+#include <linux/sysfs.h>
+#include <linux/ctype.h>
+#include <linux/uuid.h>
+#include <linux/vfio.h>
+#include <linux/vgpu.h>
+
+#include "vgpu_private.h"
+
+/* Prototypes */
+
+static ssize_t vgpu_supported_types_show(struct device *dev, struct 
device_attribute *attr, char *buf);
+static DEVICE_ATTR_RO(vgpu_supported_types);
+
+static ssize_t vgpu_create_store(struct device *dev, struct device_attribute 
*attr, const char *buf, size_t count);
+static DEVICE_ATTR_WO(vgpu_create);
+
+static ssize_t vgpu_destroy_store(struct device *dev, struct device_attribute 
*attr, const char *buf, size_t count);
+static DEVICE_ATTR_WO(vgpu_destroy);
+
+
+/* Static functions */
+
+static bool is_uuid_sep(char sep)
+{
+       if (sep == '\n' || sep == '-' || sep == ':' || sep == '\0')
+               return true;
+       return false;
+}
+
+
+static int uuid_parse(const char *str, uuid_le *uuid)
+{
+       int i;
+
+       if (strlen(str) < 36)
+               return -1;
+
+       for (i = 0; i < 16; i++) {
+               if (!isxdigit(str[0]) || !isxdigit(str[1])) {
+                       printk(KERN_ERR "%s err", __FUNCTION__);
+                       return -EINVAL;
+               }
+
+               uuid->b[i] = (hex_to_bin(str[0]) << 4) | hex_to_bin(str[1]);
+               str += 2;
+               if (is_uuid_sep(*str))
+                       str++;
+       }
+
+       return 0;
+}
+
+
+/* Functions */
+static ssize_t vgpu_supported_types_show(struct device *dev, struct 
device_attribute *attr, char *buf)
+{
+       char *str;
+       ssize_t n;
+
+        str = kzalloc(sizeof(*str) * 512, GFP_KERNEL);
+        if (!str)
+                return -ENOMEM;
+
+       get_vgpu_supported_types(dev, str);
+
+       n = sprintf(buf,"%s\n", str);
+       kfree(str);
+
+       return n;
+}
+
+static ssize_t vgpu_create_store(struct device *dev, struct device_attribute 
*attr, const char *buf, size_t count)
+{
+       char *vm_uuid_str, *instance_str, *str;
+       uuid_le vm_uuid;
+       uint32_t instance, vgpu_id;
+       struct pci_dev *pdev;
+
+       str = kstrndup(buf, count, GFP_KERNEL);
+
+       if (!str)
+               return -ENOMEM;
+
+       if ((vm_uuid_str = strsep(&str, ":")) == NULL) {
+               printk(KERN_ERR "%s Empty UUID or string %s \n", __FUNCTION__, 
buf);
+               return -EINVAL;
+       }
+
+       if (str == NULL) {
+               printk(KERN_ERR "%s vgpu type and instance not specified %s 
\n", __FUNCTION__, buf);
+               return -EINVAL;
+       }
+
+       if ((instance_str = strsep(&str, ":")) == NULL) {
+               printk(KERN_ERR "%s Empty instance or string %s \n", 
__FUNCTION__, buf);
+               return -EINVAL;
+       }
+
+       if (str == NULL) {
+               printk(KERN_ERR "%s vgpu type not specified %s \n", 
__FUNCTION__, buf);
+               return -EINVAL;
+
+       }
+
+       instance = (unsigned int)simple_strtoul(instance_str, NULL, 0);
+
+       vgpu_id = (unsigned int)simple_strtoul(str, NULL, 0);
+
+       if (uuid_parse(vm_uuid_str, &vm_uuid) < 0) {
+               printk(KERN_ERR "%s UUID parse error  %s \n", __FUNCTION__, 
buf);
+               return -EINVAL;
+       }
+
+       if (dev_is_pci(dev)) {
+               pdev = to_pci_dev(dev);
+
+               if (create_vgpu_device(pdev, vm_uuid, instance, vgpu_id) < 0) {
+                       printk(KERN_ERR "%s vgpu create error \n", 
__FUNCTION__);
+                       return -EINVAL;
+               }
+       }
+
+       return count;
+}
+
+static ssize_t vgpu_destroy_store(struct device *dev, struct device_attribute 
*attr, const char *buf, size_t count)
+{
+       char *vm_uuid_str, *str;
+       uuid_le vm_uuid;
+       unsigned int instance;
+
+       str = kstrndup(buf, count, GFP_KERNEL);
+
+       if (!str)
+               return -ENOMEM;
+
+       if ((vm_uuid_str = strsep(&str, ":")) == NULL) {
+               printk(KERN_ERR "%s Empty UUID or string %s \n", __FUNCTION__, 
buf);
+               return -EINVAL;
+       }
+
+       if (str == NULL) {
+               printk(KERN_ERR "%s instance not specified %s \n", 
__FUNCTION__, buf);
+               return -EINVAL;
+       }
+
+       instance = (unsigned int)simple_strtoul(str, NULL, 0);
+
+       if (uuid_parse(vm_uuid_str, &vm_uuid) < 0) {
+               printk(KERN_ERR "%s UUID parse error  %s \n", __FUNCTION__, 
buf);
+               return -EINVAL;
+       }
+
+       printk(KERN_INFO "%s UUID %pUb - %d \n", __FUNCTION__, vm_uuid.b, 
instance);
+
+       destroy_vgpu_device_by_uuid(vm_uuid, instance);
+
+       return count;
+}
+
+static ssize_t
+vgpu_vm_uuid_show(struct device *dev, struct device_attribute *attr, char *buf)
+{
+       struct vgpu_device *drv = find_vgpu_device(dev);
+
+       if (drv)
+               return sprintf(buf, "%pUb \n", drv->vm_uuid.b);
+
+       return sprintf(buf, " \n");
+}
+
+static DEVICE_ATTR_RO(vgpu_vm_uuid);
+
+static ssize_t
+vgpu_group_id_show(struct device *dev, struct device_attribute *attr, char 
*buf)
+{
+       struct vgpu_device *drv = find_vgpu_device(dev);
+
+       if (drv && drv->group)
+               return sprintf(buf, "%d \n", iommu_group_id(drv->group));
+
+       return sprintf(buf, " \n");
+}
+
+static DEVICE_ATTR_RO(vgpu_group_id);
+
+
+static struct attribute *vgpu_dev_attrs[] = {
+       &dev_attr_vgpu_vm_uuid.attr,
+       &dev_attr_vgpu_group_id.attr,
+       NULL,
+};
+
+static const struct attribute_group vgpu_dev_group = {
+       .attrs = vgpu_dev_attrs,
+};
+
+const struct attribute_group *vgpu_dev_groups[] = {
+       &vgpu_dev_group,
+       NULL,
+};
+
+
+ssize_t vgpu_start_store(struct class *class, struct class_attribute *attr,
+               const char *buf, size_t count)
+{
+       char *vm_uuid_str;
+       uuid_le vm_uuid;
+       struct vgpu_device *vgpu_dev = NULL;
+       int ret;
+
+       vm_uuid_str = kstrndup(buf, count, GFP_KERNEL);
+
+       if (!vm_uuid_str)
+               return -ENOMEM;
+
+       if (uuid_parse(vm_uuid_str, &vm_uuid) < 0) {
+               printk(KERN_ERR "%s UUID parse error  %s \n", __FUNCTION__, 
buf);
+               return -EINVAL;
+       }
+
+       vgpu_dev = vgpu_drv_get_vgpu_device_by_uuid(vm_uuid, 0);
+
+       if (vgpu_dev && vgpu_dev->dev) {
+               kobject_uevent(&vgpu_dev->dev->kobj, KOBJ_ONLINE);
+
+               ret = vgpu_start_callback(vgpu_dev);
+               if (ret < 0) {
+                       printk(KERN_ERR "%s vgpu_start callback failed  %d \n", 
__FUNCTION__, ret);
+                       return ret;
+               }
+       }
+
+       return count;
+}
+
+ssize_t vgpu_shutdown_store(struct class *class, struct class_attribute *attr,
+               const char *buf, size_t count)
+{
+       char *vm_uuid_str;
+       uuid_le vm_uuid;
+       struct vgpu_device *vgpu_dev = NULL;
+       int ret;
+
+       vm_uuid_str = kstrndup(buf, count, GFP_KERNEL);
+
+       if (!vm_uuid_str)
+               return -ENOMEM;
+
+       if (uuid_parse(vm_uuid_str, &vm_uuid) < 0) {
+               printk(KERN_ERR "%s UUID parse error  %s \n", __FUNCTION__, 
buf);
+               return -EINVAL;
+       }
+       vgpu_dev = vgpu_drv_get_vgpu_device_by_uuid(vm_uuid, 0);
+
+       if (vgpu_dev && vgpu_dev->dev) {
+               kobject_uevent(&vgpu_dev->dev->kobj, KOBJ_OFFLINE);
+
+               ret = vgpu_shutdown_callback(vgpu_dev);
+               if (ret < 0) {
+                       printk(KERN_ERR "%s vgpu_shutdown callback failed  %d 
\n", __FUNCTION__, ret);
+                       return ret;
+               }
+       }
+
+       return count;
+}
+
+struct class_attribute vgpu_class_attrs[] = {
+       __ATTR_WO(vgpu_start),
+       __ATTR_WO(vgpu_shutdown),
+       __ATTR_NULL
+};
+
+int vgpu_create_pci_device_files(struct pci_dev *dev)
+{
+       int retval;
+
+       retval = sysfs_create_file(&dev->dev.kobj, 
&dev_attr_vgpu_supported_types.attr);
+       if (retval) {
+               printk(KERN_ERR "VGPU-VFIO: failed to create 
vgpu_supported_types sysfs entry\n");
+               return retval;
+       }
+
+       retval = sysfs_create_file(&dev->dev.kobj, &dev_attr_vgpu_create.attr);
+       if (retval) {
+               printk(KERN_ERR "VGPU-VFIO: failed to create vgpu_create sysfs 
entry\n");
+               return retval;
+       }
+
+       retval = sysfs_create_file(&dev->dev.kobj, &dev_attr_vgpu_destroy.attr);
+       if (retval) {
+               printk(KERN_ERR "VGPU-VFIO: failed to create vgpu_destroy sysfs 
entry\n");
+               return retval;
+       }
+
+       return 0;
+}
+
+
+void vgpu_remove_pci_device_files(struct pci_dev *dev)
+{
+       sysfs_remove_file(&dev->dev.kobj, &dev_attr_vgpu_supported_types.attr);
+       sysfs_remove_file(&dev->dev.kobj, &dev_attr_vgpu_create.attr);
+       sysfs_remove_file(&dev->dev.kobj, &dev_attr_vgpu_destroy.attr);
+}
+
diff --git a/drivers/vgpu/vgpu_vfio.c b/drivers/vgpu/vgpu_vfio.c
new file mode 100644
index 000000000000..ef0833140d84
--- /dev/null
+++ b/drivers/vgpu/vgpu_vfio.c
@@ -0,0 +1,521 @@
+/*
+ * VGPU VFIO device
+ *
+ * Copyright (c) 2016, NVIDIA CORPORATION. All rights reserved.
+ *     Author: Neo Jia <address@hidden>
+ *            Kirti Wankhede <address@hidden>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ */
+
+#include <linux/init.h>
+#include <linux/module.h>
+#include <linux/device.h>
+#include <linux/kernel.h>
+#include <linux/fs.h>
+#include <linux/poll.h>
+#include <linux/slab.h>
+#include <linux/cdev.h>
+#include <linux/sched.h>
+#include <linux/wait.h>
+#include <linux/uuid.h>
+#include <linux/vfio.h>
+#include <linux/iommu.h>
+#include <linux/vgpu.h>
+
+#include "vgpu_private.h"
+
+#define VFIO_PCI_OFFSET_SHIFT   40
+
+#define VFIO_PCI_OFFSET_TO_INDEX(off)  (off >> VFIO_PCI_OFFSET_SHIFT)
+#define VFIO_PCI_INDEX_TO_OFFSET(index)        ((u64)(index) << 
VFIO_PCI_OFFSET_SHIFT)
+#define VFIO_PCI_OFFSET_MASK   (((u64)(1) << VFIO_PCI_OFFSET_SHIFT) - 1)
+
+struct vfio_vgpu_device {
+       struct iommu_group *group;
+       struct vgpu_device *vgpu_dev;
+};
+
+static int vgpu_dev_open(void *device_data)
+{
+       printk(KERN_INFO "%s ", __FUNCTION__);
+       return 0;
+}
+
+static void vgpu_dev_close(void *device_data)
+{
+
+}
+
+static uint64_t resource_len(struct vgpu_device *vgpu_dev, int bar_index)
+{
+       uint64_t size = 0;
+
+       switch (bar_index) {
+       case VFIO_PCI_BAR0_REGION_INDEX:
+               size = 16 * 1024 * 1024;
+               break;
+       case VFIO_PCI_BAR1_REGION_INDEX:
+               size = 256 * 1024 * 1024;
+               break;
+       case VFIO_PCI_BAR2_REGION_INDEX:
+               size = 32 * 1024 * 1024;
+               break;
+       case VFIO_PCI_BAR5_REGION_INDEX:
+               size = 128;
+               break;
+       default:
+               size = 0;
+               break;
+       }
+       return size;
+}
+
+static int vgpu_get_irq_count(struct vfio_vgpu_device *vdev, int irq_type)
+{
+       return 1;
+}
+
+static long vgpu_dev_unlocked_ioctl(void *device_data,
+               unsigned int cmd, unsigned long arg)
+{
+       int ret = 0;
+       struct vfio_vgpu_device *vdev = device_data;
+       unsigned long minsz;
+
+       switch (cmd)
+       {
+               case VFIO_DEVICE_GET_INFO:
+               {
+                       struct vfio_device_info info;
+                       printk(KERN_INFO "%s VFIO_DEVICE_GET_INFO cmd index = 
%d", __FUNCTION__, vdev->vgpu_dev->minor);
+                       minsz = offsetofend(struct vfio_device_info, num_irqs);
+
+                       if (copy_from_user(&info, (void __user *)arg, minsz))
+                               return -EFAULT;
+
+                       if (info.argsz < minsz)
+                               return -EINVAL;
+
+                       info.flags = VFIO_DEVICE_FLAGS_PCI;
+                       info.num_regions = VFIO_PCI_NUM_REGIONS;
+                       info.num_irqs = VFIO_PCI_NUM_IRQS;
+
+                       return copy_to_user((void __user *)arg, &info, minsz);
+               }
+
+               case VFIO_DEVICE_GET_REGION_INFO:
+               {
+                       struct vfio_region_info info;
+
+                       printk(KERN_INFO "%s VFIO_DEVICE_GET_REGION_INFO cmd", 
__FUNCTION__);
+
+                       minsz = offsetofend(struct vfio_region_info, offset);
+
+                       if (copy_from_user(&info, (void __user *)arg, minsz))
+                               return -EFAULT;
+
+                       if (info.argsz < minsz)
+                               return -EINVAL;
+
+                       switch (info.index) {
+                               case VFIO_PCI_CONFIG_REGION_INDEX:
+                                       info.offset = 
VFIO_PCI_INDEX_TO_OFFSET(info.index);
+                                       info.size = 0x100;     // 4K
+                                       //                    info.size = 
sizeof(vdev->vgpu_dev->config_space);
+                                       info.flags = VFIO_REGION_INFO_FLAG_READ 
|
+                                                       
VFIO_REGION_INFO_FLAG_WRITE;
+                                       break;
+                               case VFIO_PCI_BAR0_REGION_INDEX ... 
VFIO_PCI_BAR5_REGION_INDEX:
+                                       info.offset = 
VFIO_PCI_INDEX_TO_OFFSET(info.index);
+                                       info.size = 
resource_len(vdev->vgpu_dev, info.index);
+                                       if (!info.size) {
+                                               info.flags = 0;
+                                               break;
+                                       }
+
+                                       info.flags = VFIO_REGION_INFO_FLAG_READ 
|
+                                               VFIO_REGION_INFO_FLAG_WRITE;
+
+                                       if ((info.index == 
VFIO_PCI_BAR1_REGION_INDEX) ||
+                                            (info.index == 
VFIO_PCI_BAR2_REGION_INDEX)) {
+                                               info.flags |= 
PCI_BASE_ADDRESS_MEM_PREFETCH;
+                                       }
+
+                                       /* TODO: provides configurable setups to
+                                        * GPU vendor
+                                        */
+
+                                       if (info.index == 
VFIO_PCI_BAR1_REGION_INDEX)
+                                               info.flags = 
VFIO_REGION_INFO_FLAG_MMAP;
+
+                                       break;
+                               case VFIO_PCI_VGA_REGION_INDEX:
+                                       info.offset = 
VFIO_PCI_INDEX_TO_OFFSET(info.index);
+                                       info.size = 0xc0000;
+                                       info.flags = VFIO_REGION_INFO_FLAG_READ 
|
+                                               VFIO_REGION_INFO_FLAG_WRITE;
+                                       break;
+
+                               case VFIO_PCI_ROM_REGION_INDEX:
+                               default:
+                                       return -EINVAL;
+                       }
+
+                       return copy_to_user((void __user *)arg, &info, minsz);
+
+               }
+               case VFIO_DEVICE_GET_IRQ_INFO:
+               {
+                       struct vfio_irq_info info;
+
+                       printk(KERN_INFO "%s VFIO_DEVICE_GET_IRQ_INFO cmd", 
__FUNCTION__);
+                       minsz = offsetofend(struct vfio_irq_info, count);
+
+                       if (copy_from_user(&info, (void __user *)arg, minsz))
+                               return -EFAULT;
+
+                       if (info.argsz < minsz || info.index >= 
VFIO_PCI_NUM_IRQS)
+                               return -EINVAL;
+
+                       switch (info.index) {
+                               case VFIO_PCI_INTX_IRQ_INDEX ... 
VFIO_PCI_MSIX_IRQ_INDEX:
+                               case VFIO_PCI_REQ_IRQ_INDEX:
+                                       break;
+                                       /* pass thru to return error */
+                               default:
+                                       return -EINVAL;
+                       }
+
+                       info.count = VFIO_PCI_NUM_IRQS;
+
+                       info.flags = VFIO_IRQ_INFO_EVENTFD;
+                       info.count = vgpu_get_irq_count(vdev, info.index);
+
+                       if (info.index == VFIO_PCI_INTX_IRQ_INDEX)
+                               info.flags |= (VFIO_IRQ_INFO_MASKABLE |
+                                               VFIO_IRQ_INFO_AUTOMASKED);
+                       else
+                               info.flags |= VFIO_IRQ_INFO_NORESIZE;
+
+                       return copy_to_user((void __user *)arg, &info, minsz);
+               }
+
+               case VFIO_DEVICE_SET_IRQS:
+               {
+                       struct vfio_irq_set hdr;
+                       u8 *data = NULL;
+                       int ret = 0;
+
+                       minsz = offsetofend(struct vfio_irq_set, count);
+
+                       if (copy_from_user(&hdr, (void __user *)arg, minsz))
+                               return -EFAULT;
+
+                       if (hdr.argsz < minsz || hdr.index >= VFIO_PCI_NUM_IRQS 
||
+                                       hdr.flags & 
~(VFIO_IRQ_SET_DATA_TYPE_MASK |
+                                               VFIO_IRQ_SET_ACTION_TYPE_MASK))
+                               return -EINVAL;
+
+                       if (!(hdr.flags & VFIO_IRQ_SET_DATA_NONE)) {
+                               size_t size;
+                               int max = vgpu_get_irq_count(vdev, hdr.index);
+
+                               if (hdr.flags & VFIO_IRQ_SET_DATA_BOOL)
+                                       size = sizeof(uint8_t);
+                               else if (hdr.flags & VFIO_IRQ_SET_DATA_EVENTFD)
+                                       size = sizeof(int32_t);
+                               else
+                                       return -EINVAL;
+
+                               if (hdr.argsz - minsz < hdr.count * size ||
+                                   hdr.start >= max || hdr.start + hdr.count > 
max)
+                                       return -EINVAL;
+
+                               data = memdup_user((void __user *)(arg + minsz),
+                                               hdr.count * size);
+                               if (IS_ERR(data))
+                                       return PTR_ERR(data);
+
+                       }
+                       ret = vgpu_set_irqs_callback(vdev->vgpu_dev, hdr.flags, 
hdr.index,
+                                       hdr.start, hdr.count, data);
+                       kfree(data);
+
+
+                       return ret;
+               }
+
+               default:
+                       return -EINVAL;
+       }
+       return ret;
+}
+
+
+ssize_t vgpu_dev_config_rw(struct vfio_vgpu_device *vdev, char __user *buf,
+               size_t count, loff_t *ppos, bool iswrite)
+{
+       struct vgpu_device *vgpu_dev = vdev->vgpu_dev;
+       int cfg_size = sizeof(vgpu_dev->config_space);
+       int ret = 0;
+       uint64_t pos = *ppos & VFIO_PCI_OFFSET_MASK;
+
+       if (pos < 0 || pos >= cfg_size ||
+           pos + count > cfg_size) {
+               printk(KERN_ERR "%s pos 0x%llx out of range\n", __FUNCTION__, 
pos);
+               ret = -EFAULT;
+               goto config_rw_exit;
+       }
+
+       if (iswrite) {
+               char *user_data = kmalloc(count, GFP_KERNEL);
+
+               if (user_data == NULL) {
+                       ret = -ENOMEM;
+                       goto config_rw_exit;
+               }
+
+               if (copy_from_user(user_data, buf, count)) {
+                       ret = -EFAULT;
+                       kfree(user_data);
+                       goto config_rw_exit;
+               }
+
+               /* FIXME: Need to save the BAR value properly */
+               switch (pos) {
+               case PCI_BASE_ADDRESS_0:
+                       vgpu_dev->bar[0].start = *((uint32_t *)user_data);
+                       break;
+               case PCI_BASE_ADDRESS_1:
+                       vgpu_dev->bar[1].start = *((uint32_t *)user_data);
+                       break;
+               case PCI_BASE_ADDRESS_2:
+                       vgpu_dev->bar[2].start = *((uint32_t *)user_data);
+                       break;
+               }
+
+               if (vgpu_dev->gpu_dev->ops->write) {
+                       ret = vgpu_dev->gpu_dev->ops->write(vgpu_dev,
+                                                           user_data,
+                                                           count,
+                                                           
vgpu_emul_space_config,
+                                                           pos);
+               }
+
+               kfree(user_data);
+       }
+       else
+       {
+               char *ret_data = kmalloc(count, GFP_KERNEL);
+
+               if (ret_data == NULL) {
+                       ret = -ENOMEM;
+                       goto config_rw_exit;
+               }
+
+               memset(ret_data, 0, count);
+
+               if (vgpu_dev->gpu_dev->ops->read) {
+                       ret = vgpu_dev->gpu_dev->ops->read(vgpu_dev,
+                                                          ret_data,
+                                                          count,
+                                                          
vgpu_emul_space_config,
+                                                          pos);
+               }
+
+               if (ret > 0 ) {
+                       if (copy_to_user(buf, ret_data, ret)) {
+                               ret = -EFAULT;
+                               kfree(ret_data);
+                               goto config_rw_exit;
+                       }
+               }
+               kfree(ret_data);
+       }
+
+config_rw_exit:
+
+       return ret;
+}
+
+ssize_t vgpu_dev_bar_rw(struct vfio_vgpu_device *vdev, char __user *buf,
+               size_t count, loff_t *ppos, bool iswrite)
+{
+       struct vgpu_device *vgpu_dev = vdev->vgpu_dev;
+       loff_t offset = *ppos & VFIO_PCI_OFFSET_MASK;
+       loff_t pos;
+       int bar_index = VFIO_PCI_OFFSET_TO_INDEX(*ppos);
+       uint64_t end;
+       int ret = 0;
+
+       if (!vgpu_dev->bar[bar_index].start) {
+               ret = -EINVAL;
+               goto bar_rw_exit;
+       }
+
+       end = resource_len(vgpu_dev, bar_index);
+
+       if (offset >= end) {
+               ret = -EINVAL;
+               goto bar_rw_exit;
+       }
+
+       pos = vgpu_dev->bar[bar_index].start + offset;
+       if (iswrite) {
+               char *user_data = kmalloc(count, GFP_KERNEL);
+
+               if (user_data == NULL) {
+                       ret = -ENOMEM;
+                       goto bar_rw_exit;
+               }
+
+               if (copy_from_user(user_data, buf, count)) {
+                       ret = -EFAULT;
+                       kfree(user_data);
+                       goto bar_rw_exit;
+               }
+
+               if (vgpu_dev->gpu_dev->ops->write) {
+                       ret = vgpu_dev->gpu_dev->ops->write(vgpu_dev,
+                                                           user_data,
+                                                           count,
+                                                           
vgpu_emul_space_mmio,
+                                                           pos);
+               }
+
+               kfree(user_data);
+       }
+       else
+       {
+               char *ret_data = kmalloc(count, GFP_KERNEL);
+
+               if (ret_data == NULL) {
+                       ret = -ENOMEM;
+                       goto bar_rw_exit;
+               }
+
+               memset(ret_data, 0, count);
+
+               if (vgpu_dev->gpu_dev->ops->read) {
+                       ret = vgpu_dev->gpu_dev->ops->read(vgpu_dev,
+                                                          ret_data,
+                                                          count,
+                                                          vgpu_emul_space_mmio,
+                                                          pos);
+               }
+
+               if (ret > 0 ) {
+                       if (copy_to_user(buf, ret_data, ret)) {
+                               ret = -EFAULT;
+                       }
+               }
+               kfree(ret_data);
+       }
+
+bar_rw_exit:
+       return ret;
+}
+
+
+static ssize_t vgpu_dev_rw(void *device_data, char __user *buf,
+               size_t count, loff_t *ppos, bool iswrite)
+{
+       unsigned int index = VFIO_PCI_OFFSET_TO_INDEX(*ppos);
+       struct vfio_vgpu_device *vdev = device_data;
+
+       if (index >= VFIO_PCI_NUM_REGIONS)
+               return -EINVAL;
+
+       switch (index) {
+               case VFIO_PCI_CONFIG_REGION_INDEX:
+                       return vgpu_dev_config_rw(vdev, buf, count, ppos, 
iswrite);
+
+
+               case VFIO_PCI_BAR0_REGION_INDEX ... VFIO_PCI_BAR5_REGION_INDEX:
+                       return vgpu_dev_bar_rw(vdev, buf, count, ppos, iswrite);
+
+               case VFIO_PCI_ROM_REGION_INDEX:
+               case VFIO_PCI_VGA_REGION_INDEX:
+                       break;
+       }
+
+       return -EINVAL;
+}
+
+
+static ssize_t vgpu_dev_read(void *device_data, char __user *buf,
+               size_t count, loff_t *ppos)
+{
+       int ret = 0;
+
+       if (count)
+               ret = vgpu_dev_rw(device_data, buf, count, ppos, false);
+
+       return ret;
+}
+
+static ssize_t vgpu_dev_write(void *device_data, const char __user *buf,
+               size_t count, loff_t *ppos)
+{
+       int ret = 0;
+
+       if (count)
+               ret = vgpu_dev_rw(device_data, (char *)buf, count, ppos, true);
+
+       return ret;
+}
+
+/* Just create an invalid mapping without providing a fault handler */
+
+static int vgpu_dev_mmap(void *device_data, struct vm_area_struct *vma)
+{
+       printk(KERN_INFO "%s ", __FUNCTION__);
+       return 0;
+}
+
+static const struct vfio_device_ops vgpu_vfio_dev_ops = {
+       .name           = "vfio-vgpu-grp",
+       .open           = vgpu_dev_open,
+       .release        = vgpu_dev_close,
+       .ioctl          = vgpu_dev_unlocked_ioctl,
+       .read           = vgpu_dev_read,
+       .write          = vgpu_dev_write,
+       .mmap           = vgpu_dev_mmap,
+};
+
+int vgpu_group_init(struct vgpu_device *vgpu_dev, struct iommu_group *group)
+{
+       struct vfio_vgpu_device *vdev;
+       int ret = 0;
+
+       vdev = kzalloc(sizeof(*vdev), GFP_KERNEL);
+       if (!vdev) {
+               return -ENOMEM;
+       }
+
+       vdev->group = group;
+       vdev->vgpu_dev = vgpu_dev;
+
+       ret = vfio_add_group_dev(vgpu_dev->dev, &vgpu_vfio_dev_ops, vdev);
+       if (ret)
+               kfree(vdev);
+
+       return ret;
+}
+
+
+int vgpu_group_free(struct vgpu_device *vgpu_dev)
+{
+       struct vfio_vgpu_device *vdev;
+
+       vdev = vfio_del_group_dev(vgpu_dev->dev);
+       if (!vdev)
+               return -1;
+
+       kfree(vdev);
+       return 0;
+}
+
diff --git a/include/linux/vgpu.h b/include/linux/vgpu.h
new file mode 100644
index 000000000000..a2861c3f42e5
--- /dev/null
+++ b/include/linux/vgpu.h
@@ -0,0 +1,157 @@
+/*
+ * VGPU definition
+ *
+ * Copyright (c) 2016, NVIDIA CORPORATION. All rights reserved.
+ *     Author:
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ */
+
+#ifndef VGPU_H
+#define VGPU_H
+
+// Common Data structures
+
+struct pci_bar_info {
+       uint64_t start;
+       uint64_t end;
+       int flags;
+};
+
+enum vgpu_emul_space_e {
+       vgpu_emul_space_config = 0, /*!< PCI configuration space */
+       vgpu_emul_space_io = 1,     /*!< I/O register space */
+       vgpu_emul_space_mmio = 2    /*!< Memory-mapped I/O space */
+};
+
+struct gpu_device;
+
+/*
+ * VGPU device
+ */
+struct vgpu_device {
+       struct kref             kref;
+       struct device           *dev;
+       int minor;
+       struct gpu_device       *gpu_dev;
+       struct iommu_group      *group;
+#define DEVICE_NAME_LEN                (64)
+       char                    dev_name[DEVICE_NAME_LEN];
+       uuid_le                 vm_uuid;
+       uint32_t                vgpu_instance;
+       uint32_t                vgpu_id;
+       atomic_t                usage_count;
+       char                    config_space[0x100];          // 4KB PCI cfg 
space
+       struct pci_bar_info     bar[VFIO_PCI_NUM_REGIONS];
+       struct device_attribute *dev_attr_vgpu_status;
+       int                     vgpu_device_status;
+
+       struct list_head        list;
+};
+
+
+/**
+ * struct gpu_device_ops - Structure to be registered for each physical GPU to
+ * register the device to vgpu module.
+ *
+ * @owner:                     The module owner.
+ * @vgpu_supported_config:     Called to get information about supported vgpu 
types.
+ *                             @dev : pci device structure of physical GPU.
+ *                             @config: should return string listing supported 
config
+ *                             Returns integer: success (0) or error (< 0)
+ * @vgpu_create:               Called to allocate basic resouces in graphics
+ *                             driver for a particular vgpu.
+ *                             @dev: physical pci device structure on which 
vgpu
+ *                                   should be created
+ *                             @vm_uuid: VM's uuid for which VM it is intended 
to
+ *                             @instance: vgpu instance in that VM
+ *                             @vgpu_id: This represents the type of vgpu to be
+ *                                       created
+ *                             Returns integer: success (0) or error (< 0)
+ * @vgpu_destroy:              Called to free resources in graphics driver for
+ *                             a vgpu instance of that VM.
+ *                             @dev: physical pci device structure to which
+ *                             this vgpu points to.
+ *                             @vm_uuid: VM's uuid for which the vgpu belongs 
to.
+ *                             @instance: vgpu instance in that VM
+ *                             Returns integer: success (0) or error (< 0)
+ *                             If VM is running and vgpu_destroy is called that
+ *                             means the vGPU is being hotunpluged. Return 
error
+ *                             if VM is running and graphics driver doesn't
+ *                             support vgpu hotplug.
+ * @vgpu_start:                        Called to do initiate vGPU 
initialization
+ *                             process in graphics driver when VM boots before
+ *                             qemu starts.
+ *                             @vm_uuid: VM's UUID which is booting.
+ *                             Returns integer: success (0) or error (< 0)
+ * @vgpu_shutdown:             Called to teardown vGPU related resources for
+ *                             the VM
+ *                             @vm_uuid: VM's UUID which is shutting down .
+ *                             Returns integer: success (0) or error (< 0)
+ * @read:                      Read emulation callback
+ *                             @vdev: vgpu device structure
+ *                             @buf: read buffer
+ *                             @count: number bytes to read
+ *                             @address_space: specifies for which address 
space
+ *                             the request is: pci_config_space, IO register
+ *                             space or MMIO space.
+ *                             Retuns number on bytes read on success or error.
+ * @write:                     Write emulation callback
+ *                             @vdev: vgpu device structure
+ *                             @buf: write buffer
+ *                             @count: number bytes to be written
+ *                             @address_space: specifies for which address 
space
+ *                             the request is: pci_config_space, IO register
+ *                             space or MMIO space.
+ *                             Retuns number on bytes written on success or 
error.
+ * @vgpu_set_irqs:             Called to send about interrupts configuration
+ *                             information that qemu set.
+ *                             @vdev: vgpu device structure
+ *                             @flags, index, start, count and *data : same as
+ *                             that of struct vfio_irq_set of
+ *                             VFIO_DEVICE_SET_IRQS API.
+ *
+ * Physical GPU that support vGPU should be register with vgpu module with
+ * gpu_device_ops structure.
+ */
+
+struct gpu_device_ops {
+       struct module   *owner;
+       int     (*vgpu_supported_config)(struct pci_dev *dev, char *config);
+       int     (*vgpu_create)(struct pci_dev *dev, uuid_le vm_uuid,
+                              uint32_t instance, uint32_t vgpu_id);
+       int     (*vgpu_destroy)(struct pci_dev *dev, uuid_le vm_uuid,
+                               uint32_t instance);
+       int     (*vgpu_start)(uuid_le vm_uuid);
+       int     (*vgpu_shutdown)(uuid_le vm_uuid);
+       ssize_t (*read) (struct vgpu_device *vdev, char *buf, size_t count,
+                        uint32_t address_space, loff_t pos);
+       ssize_t (*write)(struct vgpu_device *vdev, char *buf, size_t count,
+                        uint32_t address_space,loff_t pos);
+       int     (*vgpu_set_irqs)(struct vgpu_device *vdev, uint32_t flags,
+                                unsigned index, unsigned start, unsigned count,
+                                void *data);
+
+};
+
+/*
+ * Physical GPU
+ */
+struct gpu_device {
+       struct pci_dev                  *dev;
+       const struct gpu_device_ops     *ops;
+       struct list_head                gpu_next;
+};
+
+extern int vgpu_register_device(struct pci_dev *dev, const struct 
gpu_device_ops *ops);
+extern void vgpu_unregister_device(struct pci_dev *dev);
+
+extern int vgpu_map_virtual_bar(uint64_t virt_bar_addr, uint64_t 
phys_bar_addr, uint32_t len, uint32_t flags);
+extern int vgpu_dma_do_translate(dma_addr_t * gfn_buffer, uint32_t count);
+
+struct vgpu_device *get_vgpu_device_from_group(struct iommu_group *group);
+
+#endif /* VGPU_H */
+
-- 
1.8.1.4

>From 380156ade7053664bdb318af0659708357f40050 Mon Sep 17 00:00:00 2001
From: Neo Jia <address@hidden>
Date: Sun, 24 Jan 2016 11:24:13 -0800
Subject: [PATCH] Add VGPU VFIO driver class support in QEMU

This is just a quick POV change to allow us experiment the VGPU VFIO support,
the next step is to merge this into the current vfio/pci.c which currently has a
physical backing devices.

Within current POC implementation, we have copy & paste lots function directly
from the vfio/pci.c code, we should merge them together later.

    - Basic MMIO and PCI config apccess are supported

    - MMAP'ed GPU bar is supported

    - INTx and MSI using eventfd is supported, don't think we should support
      interrupt when vector->kvm_interrupt is not enabled.

Change-Id: I99c34ac44524cd4d7d2abbcc4d43634297b96e80

Signed-off-by: Neo Jia <address@hidden>
Signed-off-by: Kirti Wankhede <address@hidden>
---
 hw/vfio/Makefile.objs |   1 +
 hw/vfio/vgpu.c        | 991 ++++++++++++++++++++++++++++++++++++++++++++++++++
 include/hw/pci/pci.h  |   3 +
 3 files changed, 995 insertions(+)
 create mode 100644 hw/vfio/vgpu.c

diff --git a/hw/vfio/Makefile.objs b/hw/vfio/Makefile.objs
index d324863..17f2ef1 100644
--- a/hw/vfio/Makefile.objs
+++ b/hw/vfio/Makefile.objs
@@ -1,6 +1,7 @@
 ifeq ($(CONFIG_LINUX), y)
 obj-$(CONFIG_SOFTMMU) += common.o
 obj-$(CONFIG_PCI) += pci.o pci-quirks.o
+obj-$(CONFIG_PCI) += vgpu.o
 obj-$(CONFIG_SOFTMMU) += platform.o
 obj-$(CONFIG_SOFTMMU) += calxeda-xgmac.o
 endif
diff --git a/hw/vfio/vgpu.c b/hw/vfio/vgpu.c
new file mode 100644
index 0000000..56ebce0
--- /dev/null
+++ b/hw/vfio/vgpu.c
@@ -0,0 +1,991 @@
+/*
+ * vGPU VFIO device
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2.  See
+ * the COPYING file in the top-level directory.
+ *
+ */
+
+#include <dirent.h>
+#include <linux/vfio.h>
+#include <sys/ioctl.h>
+#include <sys/mman.h>
+#include <sys/stat.h>
+#include <sys/types.h>
+#include <unistd.h>
+
+#include "config.h"
+#include "exec/address-spaces.h"
+#include "exec/memory.h"
+#include "hw/pci/msi.h"
+#include "hw/pci/msix.h"
+#include "hw/pci/pci.h"
+#include "qemu-common.h"
+#include "qemu/error-report.h"
+#include "qemu/event_notifier.h"
+#include "qemu/queue.h"
+#include "qemu/range.h"
+#include "sysemu/kvm.h"
+#include "sysemu/sysemu.h"
+#include "trace.h"
+#include "hw/vfio/vfio.h"
+#include "hw/vfio/pci.h"
+#include "hw/vfio/vfio-common.h"
+#include "qmp-commands.h"
+
+#define TYPE_VFIO_VGPU "vfio-vgpu"
+
+typedef struct VFIOvGPUDevice {
+    PCIDevice pdev;
+    VFIODevice vbasedev;
+    VFIOINTx intx;
+    VFIOBAR bars[PCI_NUM_REGIONS - 1]; /* No ROM */
+    uint8_t *emulated_config_bits; /* QEMU emulated bits, little-endian */
+    unsigned int config_size;
+    char  *vgpu_type;
+    char *vm_uuid;
+    off_t config_offset; /* Offset of config space region within device fd */
+    int msi_cap_size;
+    EventNotifier req_notifier;
+    int nr_vectors; /* Number of MSI/MSIX vectors currently in use */
+    int interrupt; /* Current interrupt type */
+    VFIOMSIVector *msi_vectors;
+} VFIOvGPUDevice;
+
+/*
+ * Local functions
+ */
+
+// function prototypes
+static void vfio_vgpu_disable_interrupts(VFIOvGPUDevice *vdev);
+static uint32_t vfio_vgpu_read_config(PCIDevice *pdev, uint32_t addr, int len);
+
+
+// INTx functions
+
+static void vfio_vgpu_intx_interrupt(void *opaque)
+{
+    VFIOvGPUDevice *vdev = opaque;
+
+    if (!event_notifier_test_and_clear(&vdev->intx.interrupt)) {
+        return;
+    }
+
+    vdev->intx.pending = true;
+    pci_irq_assert(&vdev->pdev);
+//    vfio_mmap_set_enabled(vdev, false);
+
+}
+
+static void vfio_vgpu_intx_eoi(VFIODevice *vbasedev)
+{
+    VFIOvGPUDevice *vdev = container_of(vbasedev, VFIOvGPUDevice, vbasedev);
+
+    if (!vdev->intx.pending) {
+        return;
+    }
+
+    trace_vfio_intx_eoi(vbasedev->name);
+
+    vdev->intx.pending = false;
+    pci_irq_deassert(&vdev->pdev);
+    vfio_unmask_single_irqindex(vbasedev, VFIO_PCI_INTX_IRQ_INDEX);
+}
+
+static void vfio_vgpu_intx_enable_kvm(VFIOvGPUDevice *vdev)
+{
+#ifdef CONFIG_KVM
+    struct kvm_irqfd irqfd = {
+        .fd = event_notifier_get_fd(&vdev->intx.interrupt),
+        .gsi = vdev->intx.route.irq,
+        .flags = KVM_IRQFD_FLAG_RESAMPLE,
+    };
+    struct vfio_irq_set *irq_set;
+    int ret, argsz;
+    int32_t *pfd;
+
+    if (!kvm_irqfds_enabled() ||
+        vdev->intx.route.mode != PCI_INTX_ENABLED ||
+        !kvm_resamplefds_enabled()) {
+        return;
+    }
+
+    /* Get to a known interrupt state */
+    qemu_set_fd_handler(irqfd.fd, NULL, NULL, vdev);
+    vfio_mask_single_irqindex(&vdev->vbasedev, VFIO_PCI_INTX_IRQ_INDEX);
+    vdev->intx.pending = false;
+    pci_irq_deassert(&vdev->pdev);
+
+    /* Get an eventfd for resample/unmask */
+    if (event_notifier_init(&vdev->intx.unmask, 0)) {
+        error_report("vfio: Error: event_notifier_init failed eoi");
+        goto fail;
+    }
+
+    /* KVM triggers it, VFIO listens for it */
+    irqfd.resamplefd = event_notifier_get_fd(&vdev->intx.unmask);
+
+    if (kvm_vm_ioctl(kvm_state, KVM_IRQFD, &irqfd)) {
+        error_report("vfio: Error: Failed to setup resample irqfd: %m");
+        goto fail_irqfd;
+    }
+
+    argsz = sizeof(*irq_set) + sizeof(*pfd);
+
+    irq_set = g_malloc0(argsz);
+    irq_set->argsz = argsz;
+    irq_set->flags = VFIO_IRQ_SET_DATA_EVENTFD | VFIO_IRQ_SET_ACTION_UNMASK;
+    irq_set->index = VFIO_PCI_INTX_IRQ_INDEX;
+    irq_set->start = 0;
+    irq_set->count = 1;
+    pfd = (int32_t *)&irq_set->data;
+
+    *pfd = irqfd.resamplefd;
+
+    ret = ioctl(vdev->vbasedev.fd, VFIO_DEVICE_SET_IRQS, irq_set);
+    g_free(irq_set);
+    if (ret) {
+        error_report("vfio: Error: Failed to setup INTx unmask fd: %m");
+        goto fail_vfio;
+    }
+
+    /* Let'em rip */
+    vfio_unmask_single_irqindex(&vdev->vbasedev, VFIO_PCI_INTX_IRQ_INDEX);
+
+    vdev->intx.kvm_accel = true;
+
+    trace_vfio_intx_enable_kvm(vdev->vbasedev.name);
+
+    return;
+
+fail_vfio:
+    irqfd.flags = KVM_IRQFD_FLAG_DEASSIGN;
+    kvm_vm_ioctl(kvm_state, KVM_IRQFD, &irqfd);
+fail_irqfd:
+    event_notifier_cleanup(&vdev->intx.unmask);
+fail:
+    qemu_set_fd_handler(irqfd.fd, vfio_vgpu_intx_interrupt, NULL, vdev);
+    vfio_unmask_single_irqindex(&vdev->vbasedev, VFIO_PCI_INTX_IRQ_INDEX);
+#endif
+}
+
+static void vfio_vgpu_intx_disable_kvm(VFIOvGPUDevice *vdev)
+{
+#ifdef CONFIG_KVM
+    struct kvm_irqfd irqfd = {
+        .fd = event_notifier_get_fd(&vdev->intx.interrupt),
+        .gsi = vdev->intx.route.irq,
+        .flags = KVM_IRQFD_FLAG_DEASSIGN,
+    };
+
+    if (!vdev->intx.kvm_accel) {
+        return;
+    }
+
+    /*
+     * Get to a known state, hardware masked, QEMU ready to accept new
+     * interrupts, QEMU IRQ de-asserted.
+     */
+    vfio_mask_single_irqindex(&vdev->vbasedev, VFIO_PCI_INTX_IRQ_INDEX);
+    vdev->intx.pending = false;
+    pci_irq_deassert(&vdev->pdev);
+
+    /* Tell KVM to stop listening for an INTx irqfd */
+    if (kvm_vm_ioctl(kvm_state, KVM_IRQFD, &irqfd)) {
+        error_report("vfio: Error: Failed to disable INTx irqfd: %m");
+    }
+
+    /* We only need to close the eventfd for VFIO to cleanup the kernel side */
+    event_notifier_cleanup(&vdev->intx.unmask);
+
+    /* QEMU starts listening for interrupt events. */
+    qemu_set_fd_handler(irqfd.fd, vfio_vgpu_intx_interrupt, NULL, vdev);
+
+    vdev->intx.kvm_accel = false;
+
+    /* If we've missed an event, let it re-fire through QEMU */
+    vfio_unmask_single_irqindex(&vdev->vbasedev, VFIO_PCI_INTX_IRQ_INDEX);
+
+    trace_vfio_intx_disable_kvm(vdev->vbasedev.name);
+#endif
+}
+
+static void vfio_vgpu_intx_update(PCIDevice *pdev)
+{
+    VFIOvGPUDevice *vdev = DO_UPCAST(VFIOvGPUDevice, pdev, pdev);
+    PCIINTxRoute route;
+
+    if (vdev->interrupt != VFIO_INT_INTx) {
+        return;
+    }
+
+    route = pci_device_route_intx_to_irq(&vdev->pdev, vdev->intx.pin);
+
+    if (!pci_intx_route_changed(&vdev->intx.route, &route)) {
+        return; /* Nothing changed */
+    }
+
+    trace_vfio_intx_update(vdev->vbasedev.name,
+                           vdev->intx.route.irq, route.irq);
+
+    vfio_vgpu_intx_disable_kvm(vdev);
+
+    vdev->intx.route = route;
+
+    if (route.mode != PCI_INTX_ENABLED) {
+        return;
+    }
+
+    vfio_vgpu_intx_enable_kvm(vdev);
+
+    /* Re-enable the interrupt in cased we missed an EOI */
+    vfio_vgpu_intx_eoi(&vdev->vbasedev);
+}
+
+static int vfio_vgpu_intx_enable(VFIOvGPUDevice *vdev)
+{
+    uint8_t pin = vfio_vgpu_read_config(&vdev->pdev, PCI_INTERRUPT_PIN, 1);
+    int ret, argsz;
+    struct vfio_irq_set *irq_set;
+    int32_t *pfd;
+
+    if (!pin) {
+        return 0;
+    }
+
+    vfio_vgpu_disable_interrupts(vdev);
+
+    vdev->intx.pin = pin - 1; /* Pin A (1) -> irq[0] */
+    pci_config_set_interrupt_pin(vdev->pdev.config, pin);
+
+#ifdef CONFIG_KVM
+    /*
+     * Only conditional to avoid generating error messages on platforms
+     * where we won't actually use the result anyway.
+     */
+    if (kvm_irqfds_enabled() && kvm_resamplefds_enabled()) {
+        vdev->intx.route = pci_device_route_intx_to_irq(&vdev->pdev,
+                                                        vdev->intx.pin);
+    }
+#endif
+
+    ret = event_notifier_init(&vdev->intx.interrupt, 0);
+    if (ret) {
+        error_report("vfio: Error: event_notifier_init failed");
+        return ret;
+    }
+
+    argsz = sizeof(*irq_set) + sizeof(*pfd);
+
+    irq_set = g_malloc0(argsz);
+    irq_set->argsz = argsz;
+    irq_set->flags = VFIO_IRQ_SET_DATA_EVENTFD | VFIO_IRQ_SET_ACTION_TRIGGER;
+    irq_set->index = VFIO_PCI_INTX_IRQ_INDEX;
+    irq_set->start = 0;
+    irq_set->count = 1;
+    pfd = (int32_t *)&irq_set->data;
+
+    *pfd = event_notifier_get_fd(&vdev->intx.interrupt);
+    qemu_set_fd_handler(*pfd, vfio_vgpu_intx_interrupt, NULL, vdev);
+
+    ret = ioctl(vdev->vbasedev.fd, VFIO_DEVICE_SET_IRQS, irq_set);
+    g_free(irq_set);
+    if (ret) {
+        error_report("vfio: Error: Failed to setup INTx fd: %m");
+        qemu_set_fd_handler(*pfd, NULL, NULL, vdev);
+        event_notifier_cleanup(&vdev->intx.interrupt);
+        return -errno;
+    }
+
+    vfio_vgpu_intx_enable_kvm(vdev);
+
+    vdev->interrupt = VFIO_INT_INTx;
+
+    trace_vfio_intx_enable(vdev->vbasedev.name);
+
+    return 0;
+}
+
+static void vfio_vgpu_intx_disable(VFIOvGPUDevice *vdev)
+{
+    int fd;
+
+    vfio_vgpu_intx_disable_kvm(vdev);
+    vfio_disable_irqindex(&vdev->vbasedev, VFIO_PCI_INTX_IRQ_INDEX);
+    vdev->intx.pending = false;
+    pci_irq_deassert(&vdev->pdev);
+//    vfio_mmap_set_enabled(vdev, true);
+
+    fd = event_notifier_get_fd(&vdev->intx.interrupt);
+    qemu_set_fd_handler(fd, NULL, NULL, vdev);
+    event_notifier_cleanup(&vdev->intx.interrupt);
+
+    vdev->interrupt = VFIO_INT_NONE;
+
+    trace_vfio_intx_disable(vdev->vbasedev.name);
+}
+
+//MSI functions
+static void vfio_vgpu_remove_kvm_msi_virq(VFIOMSIVector *vector)
+{
+    kvm_irqchip_remove_irqfd_notifier_gsi(kvm_state, &vector->kvm_interrupt,
+                                          vector->virq);
+    kvm_irqchip_release_virq(kvm_state, vector->virq);
+    vector->virq = -1;
+    event_notifier_cleanup(&vector->kvm_interrupt);
+}
+
+static void vfio_vgpu_msi_disable_common(VFIOvGPUDevice *vdev)
+{
+    int i;
+
+    for (i = 0; i < vdev->nr_vectors; i++) {
+        VFIOMSIVector *vector = &vdev->msi_vectors[i];
+        if (vdev->msi_vectors[i].use) {
+            if (vector->virq >= 0) {
+                vfio_vgpu_remove_kvm_msi_virq(vector);
+            }
+            qemu_set_fd_handler(event_notifier_get_fd(&vector->interrupt),
+                                NULL, NULL, NULL);
+            event_notifier_cleanup(&vector->interrupt);
+        }
+    }
+
+    g_free(vdev->msi_vectors);
+    vdev->msi_vectors = NULL;
+    vdev->nr_vectors = 0;
+    vdev->interrupt = VFIO_INT_NONE;
+
+   vfio_vgpu_intx_enable(vdev);
+}
+
+static void vfio_vgpu_msi_disable(VFIOvGPUDevice *vdev)
+{
+    vfio_disable_irqindex(&vdev->vbasedev, VFIO_PCI_MSI_IRQ_INDEX);
+    vfio_vgpu_msi_disable_common(vdev);
+}
+
+static void vfio_vgpu_disable_interrupts(VFIOvGPUDevice *vdev)
+{
+
+    if (vdev->interrupt == VFIO_INT_MSI) {
+        vfio_vgpu_msi_disable(vdev);
+    }
+
+    if (vdev->interrupt == VFIO_INT_INTx) {
+        vfio_vgpu_intx_disable(vdev);
+    }
+}
+
+
+static void vfio_vgpu_msi_interrupt(void *opaque)
+{
+    VFIOMSIVector *vector = opaque;
+    VFIOvGPUDevice *vdev = (VFIOvGPUDevice *)vector->vdev;
+    MSIMessage (*get_msg)(PCIDevice *dev, unsigned vector);
+    void (*notify)(PCIDevice *dev, unsigned vector);
+    MSIMessage msg;
+    int nr = vector - vdev->msi_vectors;
+
+    if (!event_notifier_test_and_clear(&vector->interrupt)) {
+        return;
+    }
+
+    if (vdev->interrupt == VFIO_INT_MSIX) {
+        get_msg = msix_get_message;
+        notify = msix_notify;
+    } else if (vdev->interrupt == VFIO_INT_MSI) {
+        get_msg = msi_get_message;
+        notify = msi_notify;
+    } else {
+        abort();
+    }
+
+    msg = get_msg(&vdev->pdev, nr);
+    trace_vfio_msi_interrupt(vdev->vbasedev.name, nr, msg.address, msg.data);
+    notify(&vdev->pdev, nr);
+}
+
+static int vfio_vgpu_enable_vectors(VFIOvGPUDevice *vdev, bool msix)
+{
+    struct vfio_irq_set *irq_set;
+    int ret = 0, i, argsz;
+    int32_t *fds;
+
+    argsz = sizeof(*irq_set) + (vdev->nr_vectors * sizeof(*fds));
+
+    irq_set = g_malloc0(argsz);
+    irq_set->argsz = argsz;
+    irq_set->flags = VFIO_IRQ_SET_DATA_EVENTFD | VFIO_IRQ_SET_ACTION_TRIGGER;
+    irq_set->index = msix ? VFIO_PCI_MSIX_IRQ_INDEX : VFIO_PCI_MSI_IRQ_INDEX;
+    irq_set->start = 0;
+    irq_set->count = vdev->nr_vectors;
+    fds = (int32_t *)&irq_set->data;
+
+    for (i = 0; i < vdev->nr_vectors; i++) {
+        int fd = -1;
+
+        /*
+         * MSI vs MSI-X - The guest has direct access to MSI mask and pending
+         * bits, therefore we always use the KVM signaling path when setup.
+         * MSI-X mask and pending bits are emulated, so we want to use the
+         * KVM signaling path only when configured and unmasked.
+         */
+        if (vdev->msi_vectors[i].use) {
+            if (vdev->msi_vectors[i].virq < 0 ||
+                (msix && msix_is_masked(&vdev->pdev, i))) {
+                fd = event_notifier_get_fd(&vdev->msi_vectors[i].interrupt);
+            } else {
+                fd = 
event_notifier_get_fd(&vdev->msi_vectors[i].kvm_interrupt);
+            }
+        }
+
+        fds[i] = fd;
+    }
+
+    ret = ioctl(vdev->vbasedev.fd, VFIO_DEVICE_SET_IRQS, irq_set);
+
+    g_free(irq_set);
+
+    return ret;
+}
+
+static void vfio_vgpu_add_kvm_msi_virq(VFIOvGPUDevice *vdev, VFIOMSIVector 
*vector,
+                                  MSIMessage *msg, bool msix)
+{
+    int virq;
+
+    if (!msg) {
+        return;
+    }
+
+    if (event_notifier_init(&vector->kvm_interrupt, 0)) {
+        return;
+    }
+
+    virq = kvm_irqchip_add_msi_route(kvm_state, *msg, &vdev->pdev);
+    if (virq < 0) {
+        event_notifier_cleanup(&vector->kvm_interrupt);
+        return;
+    }
+
+    if (kvm_irqchip_add_irqfd_notifier_gsi(kvm_state, &vector->kvm_interrupt,
+                                       NULL, virq) < 0) {
+        kvm_irqchip_release_virq(kvm_state, virq);
+        event_notifier_cleanup(&vector->kvm_interrupt);
+        return;
+    }
+
+    vector->virq = virq;
+}
+
+static void vfio_vgpu_update_kvm_msi_virq(VFIOMSIVector *vector, MSIMessage 
msg,
+                                     PCIDevice *pdev)
+{
+    kvm_irqchip_update_msi_route(kvm_state, vector->virq, msg, pdev);
+}
+
+
+static void vfio_vgpu_msi_enable(VFIOvGPUDevice *vdev)
+{
+   int ret, i;
+
+    vfio_vgpu_disable_interrupts(vdev);
+
+    vdev->nr_vectors = msi_nr_vectors_allocated(&vdev->pdev);
+retry:
+    vdev->msi_vectors = g_new0(VFIOMSIVector, vdev->nr_vectors);
+
+    for (i = 0; i < vdev->nr_vectors; i++) {
+        VFIOMSIVector *vector = &vdev->msi_vectors[i];
+        MSIMessage msg = msi_get_message(&vdev->pdev, i);
+
+        vector->vdev = (VFIOPCIDevice *)vdev;
+        vector->virq = -1;
+        vector->use = true;
+
+        if (event_notifier_init(&vector->interrupt, 0)) {
+            error_report("vfio: Error: event_notifier_init failed");
+        }
+        qemu_set_fd_handler(event_notifier_get_fd(&vector->interrupt),
+                            vfio_vgpu_msi_interrupt, NULL, vector);
+
+        /*
+         * Attempt to enable route through KVM irqchip,
+         * default to userspace handling if unavailable.
+         */
+        vfio_vgpu_add_kvm_msi_virq(vdev, vector, &msg, false);
+    }
+
+    /* Set interrupt type prior to possible interrupts */
+    vdev->interrupt = VFIO_INT_MSI;
+
+    ret = vfio_vgpu_enable_vectors(vdev, false);
+    if (ret) {
+        if (ret < 0) {
+            error_report("vfio: Error: Failed to setup MSI fds: %m");
+        } else if (ret != vdev->nr_vectors) {
+            error_report("vfio: Error: Failed to enable %d "
+                         "MSI vectors, retry with %d", vdev->nr_vectors, ret);
+        }
+
+        for (i = 0; i < vdev->nr_vectors; i++) {
+            VFIOMSIVector *vector = &vdev->msi_vectors[i];
+            if (vector->virq >= 0) {
+                vfio_vgpu_remove_kvm_msi_virq(vector);
+            }
+            qemu_set_fd_handler(event_notifier_get_fd(&vector->interrupt),
+                                NULL, NULL, NULL);
+            event_notifier_cleanup(&vector->interrupt);
+        }
+
+        g_free(vdev->msi_vectors);
+
+        if (ret > 0 && ret != vdev->nr_vectors) {
+            vdev->nr_vectors = ret;
+            goto retry;
+        }
+        vdev->nr_vectors = 0;
+
+        /*
+         * Failing to setup MSI doesn't really fall within any specification.
+         * Let's try leaving interrupts disabled and hope the guest figures
+         * out to fall back to INTx for this device.
+         */
+        error_report("vfio: Error: Failed to enable MSI");
+        vdev->interrupt = VFIO_INT_NONE;
+
+        return;
+    }
+}
+
+static void vfio_vgpu_update_msi(VFIOvGPUDevice *vdev)
+{
+    int i;
+
+    for (i = 0; i < vdev->nr_vectors; i++) {
+        VFIOMSIVector *vector = &vdev->msi_vectors[i];
+        MSIMessage msg;
+
+        if (!vector->use || vector->virq < 0) {
+            continue;
+        }
+
+        msg = msi_get_message(&vdev->pdev, i);
+        vfio_vgpu_update_kvm_msi_virq(vector, msg, &vdev->pdev);
+    }
+}
+
+static int vfio_vgpu_msi_setup(VFIOvGPUDevice *vdev, int pos)
+{
+    uint16_t ctrl;
+    bool msi_64bit, msi_maskbit;
+    int ret, entries;
+
+    if (pread(vdev->vbasedev.fd, &ctrl, sizeof(ctrl),
+              vdev->config_offset + pos + PCI_CAP_FLAGS) != sizeof(ctrl)) {
+        return -errno;
+    }
+    ctrl = le16_to_cpu(ctrl);
+
+    msi_64bit = !!(ctrl & PCI_MSI_FLAGS_64BIT);
+    msi_maskbit = !!(ctrl & PCI_MSI_FLAGS_MASKBIT);
+    entries = 1 << ((ctrl & PCI_MSI_FLAGS_QMASK) >> 1);
+
+    ret = msi_init(&vdev->pdev, pos, entries, msi_64bit, msi_maskbit);
+    if (ret < 0) {
+        if (ret == -ENOTSUP) {
+            return 0;
+        }
+        error_report("vfio: msi_init failed");
+        return ret;
+    }
+    vdev->msi_cap_size = 0xa + (msi_maskbit ? 0xa : 0) + (msi_64bit ? 0x4 : 0);
+
+    return 0;
+}
+
+
+static int vfio_vgpu_msi_init(VFIOvGPUDevice *vdev)
+{
+    uint8_t pos;
+    int ret;
+
+    pos = pci_find_capability(&vdev->pdev, PCI_CAP_ID_MSI);
+    if (!pos) {
+        return 0;
+    }
+
+    ret = vfio_vgpu_msi_setup(vdev, pos);
+    if (ret < 0) {
+        error_report("vgpu: Error setting address@hidden: %d", pos, ret);
+        return ret;
+    }
+
+    return 0;
+}
+
+/*
+ * VGPU device class functions
+ */
+
+static void vfio_vgpu_reset(DeviceState *dev)
+{
+
+
+}
+
+static void vfio_vgpu_eoi(VFIODevice *vbasedev)
+{
+    return;
+}
+
+static int vfio_vgpu_hot_reset_multi(VFIODevice *vbasedev)
+{
+    // Nothing to be reset 
+    return 0;
+}
+
+static void vfio_vgpu_compute_needs_reset(VFIODevice *vbasedev)
+{
+    vbasedev->needs_reset = false;
+}
+
+static VFIODeviceOps vfio_vgpu_ops = {
+    .vfio_compute_needs_reset = vfio_vgpu_compute_needs_reset,
+    .vfio_hot_reset_multi = vfio_vgpu_hot_reset_multi,
+    .vfio_eoi = vfio_vgpu_eoi,
+};
+
+static int vfio_vgpu_populate_device(VFIOvGPUDevice *vdev)
+{
+    VFIODevice *vbasedev = &vdev->vbasedev;
+    struct vfio_region_info reg_info = { .argsz = sizeof(reg_info) };
+    int i, ret = -1;
+
+    for (i = VFIO_PCI_BAR0_REGION_INDEX; i < VFIO_PCI_ROM_REGION_INDEX; i++) {
+        reg_info.index = i;
+
+        ret = ioctl(vbasedev->fd, VFIO_DEVICE_GET_REGION_INFO, &reg_info);
+        if (ret) {
+            error_report("vfio: Error getting region %d info: %m", i);
+            return ret;
+        }
+
+        trace_vfio_populate_device_region(vbasedev->name, i,
+                                          (unsigned long)reg_info.size,
+                                          (unsigned long)reg_info.offset,
+                                          (unsigned long)reg_info.flags);
+
+        vdev->bars[i].region.vbasedev = vbasedev;
+        vdev->bars[i].region.flags = reg_info.flags;
+        vdev->bars[i].region.size = reg_info.size;
+        vdev->bars[i].region.fd_offset = reg_info.offset;
+        vdev->bars[i].region.nr = i;
+        QLIST_INIT(&vdev->bars[i].quirks);
+    }
+
+    reg_info.index = VFIO_PCI_CONFIG_REGION_INDEX;
+
+    ret = ioctl(vdev->vbasedev.fd, VFIO_DEVICE_GET_REGION_INFO, &reg_info);
+    if (ret) {
+        error_report("vfio: Error getting config info: %m");
+        return ret;
+    }
+
+    vdev->config_size = reg_info.size;
+    if (vdev->config_size == PCI_CONFIG_SPACE_SIZE) {
+        vdev->pdev.cap_present &= ~QEMU_PCI_CAP_EXPRESS;
+    }
+    vdev->config_offset = reg_info.offset;
+
+    return 0;
+}
+
+static void vfio_vgpu_create_virtual_bar(VFIOvGPUDevice *vdev, int nr)
+{
+    VFIOBAR *bar = &vdev->bars[nr];
+    uint64_t size = bar->region.size;
+    char name[64];
+    uint32_t pci_bar;
+    uint8_t type;
+    int ret;
+
+    /* Skip both unimplemented BARs and the upper half of 64bit BARS. */
+    if (!size) 
+        return;
+
+    /* Determine what type of BAR this is for registration */
+    ret = pread(vdev->vbasedev.fd, &pci_bar, sizeof(pci_bar),
+                vdev->config_offset + PCI_BASE_ADDRESS_0 + (4 * nr));
+    if (ret != sizeof(pci_bar)) {
+        error_report("vfio: Failed to read BAR %d (%m)", nr);
+        return;
+    }
+
+    pci_bar = le32_to_cpu(pci_bar);
+    bar->ioport = (pci_bar & PCI_BASE_ADDRESS_SPACE_IO);
+    bar->mem64 = bar->ioport ? 0 : (pci_bar & PCI_BASE_ADDRESS_MEM_TYPE_64);
+    type = pci_bar & (bar->ioport ? ~PCI_BASE_ADDRESS_IO_MASK :
+                                    ~PCI_BASE_ADDRESS_MEM_MASK);
+
+    /* A "slow" read/write mapping underlies all BARs */
+    memory_region_init_io(&bar->region.mem, OBJECT(vdev), &vfio_region_ops,
+                          bar, name, size);
+    pci_register_bar(&vdev->pdev, nr, type, &bar->region.mem);
+
+    // Create an invalid BAR1 mapping
+    if (bar->region.flags & VFIO_REGION_INFO_FLAG_MMAP) {
+        strncat(name, " mmap", sizeof(name) - strlen(name) - 1);
+        vfio_mmap_region(OBJECT(vdev), &bar->region, &bar->region.mem,
+                         &bar->region.mmap_mem, &bar->region.mmap,
+                         size, 0, name);
+    }
+}
+
+static void vfio_vgpu_create_virtual_bars(VFIOvGPUDevice *vdev)
+{
+
+    int i = 0;
+
+    for (i = 0; i < PCI_ROM_SLOT; i++) {
+        vfio_vgpu_create_virtual_bar(vdev, i);
+    }
+}
+
+static int vfio_vgpu_initfn(PCIDevice *pdev)
+{
+    VFIOvGPUDevice *vdev = DO_UPCAST(VFIOvGPUDevice, pdev, pdev);
+    VFIOGroup *group;
+    ssize_t len;
+    int groupid;
+    struct stat st;
+    char path[PATH_MAX], iommu_group_path[PATH_MAX], *group_name;
+    int ret;
+    UuidInfo *uuid_info;
+
+    uuid_info = qmp_query_uuid(NULL);
+    if (strcmp(uuid_info->UUID, UUID_NONE) == 0) {
+        return -EINVAL;
+    } else {
+        vdev->vm_uuid = uuid_info->UUID;
+    }
+
+
+    snprintf(path, sizeof(path), 
+             "/sys/devices/virtual/vgpu/%s-0/", vdev->vm_uuid);
+
+    if (stat(path, &st) < 0) {
+        error_report("vfio-vgpu: error: no such vgpu device: %s", path);
+        return -errno;
+    } 
+
+    vdev->vbasedev.ops = &vfio_vgpu_ops;
+
+    vdev->vbasedev.type = VFIO_DEVICE_TYPE_PCI;
+    vdev->vbasedev.name = g_strdup_printf("%s-0", vdev->vm_uuid);
+
+    strncat(path, "iommu_group", sizeof(path) - strlen(path) - 1);
+
+    len = readlink(path, iommu_group_path, sizeof(path));
+    if (len <= 0 || len >= sizeof(path)) {
+        error_report("vfio-vgpu: error no iommu_group for device");
+        return len < 0 ? -errno : -ENAMETOOLONG;
+    }
+
+    iommu_group_path[len] = 0;
+    group_name = basename(iommu_group_path);
+
+    if (sscanf(group_name, "%d", &groupid) != 1) {
+        error_report("vfio-vgpu: error reading %s: %m", path);
+        return -errno;
+    }
+
+    // TODO: This will only work if we *only* have VFIO_VGPU_IOMMU enabled
+
+    group = vfio_get_group(groupid, pci_device_iommu_address_space(pdev));
+    if (!group) {
+        error_report("vfio: failed to get group %d", groupid);
+        return -ENOENT;
+    }
+
+    snprintf(path, sizeof(path), "%s-0", vdev->vm_uuid);
+
+    ret = vfio_get_device(group, path, &vdev->vbasedev);
+    if (ret) {
+        error_report("vfio-vgpu; failed to get device %s", vdev->vgpu_type);
+        vfio_put_group(group);
+        return ret;
+    }
+
+    ret = vfio_vgpu_populate_device(vdev);
+    if (ret) {
+        return ret;
+    }
+
+    /* Get a copy of config space */
+    ret = pread(vdev->vbasedev.fd, vdev->pdev.config,
+                MIN(pci_config_size(&vdev->pdev), vdev->config_size),
+                vdev->config_offset);
+    if (ret < (int)MIN(pci_config_size(&vdev->pdev), vdev->config_size)) {
+        ret = ret < 0 ? -errno : -EFAULT;
+        error_report("vfio: Failed to read device config space");
+        return ret;
+    }
+
+    vfio_vgpu_create_virtual_bars(vdev);
+
+    ret = vfio_vgpu_msi_init(vdev);
+    if (ret < 0) {
+        error_report("%s: Error setting MSI %d", __FUNCTION__, ret);
+        return ret;
+    }
+
+    if (vfio_vgpu_read_config(&vdev->pdev, PCI_INTERRUPT_PIN, 1)) {
+        pci_device_set_intx_routing_notifier(&vdev->pdev, 
vfio_vgpu_intx_update);
+        ret = vfio_vgpu_intx_enable(vdev);
+        if (ret) {
+            return ret;
+        }
+    }
+
+    return 0;
+}
+
+
+static void vfio_vgpu_exitfn(PCIDevice *pdev)
+{
+
+
+}
+
+static uint32_t vfio_vgpu_read_config(PCIDevice *pdev, uint32_t addr, int len)
+{
+    VFIOvGPUDevice *vdev = DO_UPCAST(VFIOvGPUDevice, pdev, pdev);
+    ssize_t ret;
+    uint32_t val = 0;
+
+    ret = pread(vdev->vbasedev.fd, &val, len, vdev->config_offset + addr);
+
+    if (ret != len) {
+        error_report("%s: failed at offset:0x%0x %m", __func__, addr);
+        return 0xFFFFFFFF;
+    }
+
+    // memcpy(&vdev->emulated_config_bits + addr, &val, len);
+    return val;
+}
+
+static void vfio_vgpu_write_config(PCIDevice *pdev, uint32_t addr,
+                                  uint32_t val, int len)
+{
+    VFIOvGPUDevice *vdev = DO_UPCAST(VFIOvGPUDevice, pdev, pdev);
+    ssize_t ret;
+
+    ret = pwrite(vdev->vbasedev.fd, &val, len, vdev->config_offset + addr);
+
+    if (ret != len) {
+        error_report("%s: failed at offset:0x%0x, val:0x%0x %m",
+                     __func__, addr, val);
+        return;
+    }
+
+    if (pdev->cap_present & QEMU_PCI_CAP_MSI &&
+        ranges_overlap(addr, len, pdev->msi_cap, vdev->msi_cap_size)) {
+        int is_enabled, was_enabled = msi_enabled(pdev);
+
+        pci_default_write_config(pdev, addr, val, len);
+
+        is_enabled = msi_enabled(pdev);
+
+        if (!was_enabled) {
+            if (is_enabled) {
+                vfio_vgpu_msi_enable(vdev);
+            }
+        } else {
+            if (!is_enabled) {
+                vfio_vgpu_msi_disable(vdev);
+            } else {
+                vfio_vgpu_update_msi(vdev);
+            }
+        }
+    }
+    else {
+        /* Write everything to QEMU to keep emulated bits correct */
+        pci_default_write_config(pdev, addr, val, len);
+    }
+
+    pci_default_write_config(pdev, addr, val, len);
+
+    return;
+}
+
+static const VMStateDescription vfio_vgpu_vmstate = {
+    .name = TYPE_VFIO_VGPU,
+    .unmigratable = 1,
+};
+
+//
+// We don't actually need the vfio_vgpu_properties
+// as we can just simply rely on VM UUID to find
+// the IOMMU group for this VM
+//
+
+
+static Property vfio_vgpu_properties[] = {
+
+    DEFINE_PROP_STRING("vgpu", VFIOvGPUDevice, vgpu_type),
+    DEFINE_PROP_END_OF_LIST()
+};
+
+#if 0
+
+static void vfio_vgpu_instance_init(Object *obj)
+{
+
+}
+
+static void vfio_vgpu_instance_finalize(Object *obj)
+{
+
+
+}
+
+#endif
+
+static void vfio_vgpu_class_init(ObjectClass *klass, void *data)
+{
+    DeviceClass *dc = DEVICE_CLASS(klass);
+    PCIDeviceClass *pdc = PCI_DEVICE_CLASS(klass);
+    // vgpudc->parent_realize = dc->realize;
+    // dc->realize = calxeda_xgmac_realize;
+    dc->desc = "VFIO-based vGPU";
+    dc->vmsd = &vfio_vgpu_vmstate;
+    dc->reset = vfio_vgpu_reset;
+    // dc->cannot_instantiate_with_device_add_yet = true; 
+    dc->props = vfio_vgpu_properties;
+    set_bit(DEVICE_CATEGORY_DISPLAY, dc->categories);
+    pdc->init = vfio_vgpu_initfn;
+    pdc->exit = vfio_vgpu_exitfn;
+    pdc->config_read = vfio_vgpu_read_config;
+    pdc->config_write = vfio_vgpu_write_config;
+    pdc->is_express = 0; /* For now, we are not */
+
+    pdc->vendor_id = PCI_DEVICE_ID_NVIDIA;
+    // pdc->device_id = 0x11B0;
+    pdc->class_id = PCI_CLASS_DISPLAY_VGA;
+}
+
+static const TypeInfo vfio_vgpu_dev_info = {
+    .name = TYPE_VFIO_VGPU,
+    .parent = TYPE_PCI_DEVICE,
+    .instance_size = sizeof(VFIOvGPUDevice),
+    .class_init = vfio_vgpu_class_init,
+};
+
+static void register_vgpu_dev_type(void)
+{
+    type_register_static(&vfio_vgpu_dev_info);
+}
+
+type_init(register_vgpu_dev_type)
diff --git a/include/hw/pci/pci.h b/include/hw/pci/pci.h
index 379b6e1..9af5e17 100644
--- a/include/hw/pci/pci.h
+++ b/include/hw/pci/pci.h
@@ -64,6 +64,9 @@
 #define PCI_DEVICE_ID_VMWARE_IDE         0x1729
 #define PCI_DEVICE_ID_VMWARE_VMXNET3     0x07B0
 
+/* NVIDIA (0x10de) */
+#define PCI_DEVICE_ID_NVIDIA             0x10de
+
 /* Intel (0x8086) */
 #define PCI_DEVICE_ID_INTEL_82551IT      0x1209
 #define PCI_DEVICE_ID_INTEL_82557        0x1229
-- 
1.8.3.1



> 
> Jike will provide next level API definitions based on KVMGT requirement. 
> We can further refine it to match requirements of multi-vendors.
> 
> Thanks
> Kevin



reply via email to

[Prev in Thread] Current Thread [Next in Thread]