qemu-devel
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [Qemu-devel] [PATCH 2/3] VFIO driver for mediated PCI device


From: Alex Williamson
Subject: Re: [Qemu-devel] [PATCH 2/3] VFIO driver for mediated PCI device
Date: Fri, 24 Jun 2016 13:45:56 -0600

On Sat, 25 Jun 2016 00:04:27 +0530
Kirti Wankhede <address@hidden> wrote:

> Thanks Alex.
> 
> 
> On 6/22/2016 4:18 AM, Alex Williamson wrote:
> > On Mon, 20 Jun 2016 22:01:47 +0530
> > Kirti Wankhede <address@hidden> wrote:
> >   
> >> +
> >> +static int get_mdev_region_info(struct mdev_device *mdev,
> >> +                          struct pci_region_info *vfio_region_info,
> >> +                          int index)
> >> +{
> >> +  int ret = -EINVAL;
> >> +  struct parent_device *parent = mdev->parent;
> >> +
> >> +  if (parent && dev_is_pci(parent->dev) && parent->ops->get_region_info) {
> >> +          mutex_lock(&mdev->ops_lock);
> >> +          ret = parent->ops->get_region_info(mdev, index,
> >> +                                              vfio_region_info);
> >> +          mutex_unlock(&mdev->ops_lock);  
> > 
> > Why do we have two ops_lock, one on the parent_device and one on the
> > mdev_device?!  Is this one actually locking anything or also just
> > providing serialization?  Why do some things get serialized at the
> > parent level and some things at the device level?  Very confused by
> > ops_lock.
> >  
> 
> There are two sets of callback:
> * parent device callbacks: supported_config, create, destroy, start, stop
> * mdev device callbacks: read, write, set_irqs, get_region_info,
> validate_map_request
> 
> parent->ops_lock is to serialize per parent device callbacks.
> mdev->ops_lock is to serialize per mdev device callbacks.
> 
> I'll add above comment in mdev.h.

As mentioned in the other reply, I don't think serialization policy
should be imposed in the driver core.  If a given mediated driver needs
to serialize, they can do it internally.

> >> +
> >> +static int mdev_get_irq_count(struct vfio_mdev *vmdev, int irq_type)
> >> +{
> >> +  /* Don't support MSIX for now */
> >> +  if (irq_type == VFIO_PCI_MSIX_IRQ_INDEX)
> >> +          return -1;
> >> +
> >> +  return 1;  
> > 
> > Too much hard coding here, the mediated driver should define this.
> >   
> 
> I'm testing INTX and MSI, I don't have a way to test MSIX for now. So we
> thought we can add supported for MSIX later. Till then hard code it to 1.

To me it screams that there needs to be an interface to the mediated
device here.  How do you even know that the mediated device intends to
support MSI?  What if it wants to emulated a VF and not support INTx?
This is basically just a big "TODO" flag that needs to be addressed
before a non-RFC.

> >> +
> >> +          if (parent && parent->ops->set_irqs) {
> >> +                  mutex_lock(&mdev->ops_lock);
> >> +                  ret = parent->ops->set_irqs(mdev, hdr.flags, hdr.index,
> >> +                                              hdr.start, hdr.count, data);
> >> +                  mutex_unlock(&mdev->ops_lock);  
> > 
> > Device level serialization on set_irqs... interesting.
> >   
> 
> Hope answer above helps to clarify this.
> 
> 
> >> +          }
> >> +
> >> +          kfree(ptr);
> >> +          return ret;
> >> +  }
> >> +  }
> >> +  return -ENOTTY;
> >> +}
> >> +
> >> +ssize_t mdev_dev_config_rw(struct vfio_mdev *vmdev, char __user *buf,
> >> +                     size_t count, loff_t *ppos, bool iswrite)
> >> +{
> >> +  struct mdev_device *mdev = vmdev->mdev;
> >> +  struct parent_device *parent = mdev->parent;
> >> +  int size = vmdev->vfio_region_info[VFIO_PCI_CONFIG_REGION_INDEX].size;
> >> +  int ret = 0;
> >> +  uint64_t pos = *ppos & VFIO_PCI_OFFSET_MASK;
> >> +
> >> +  if (pos < 0 || pos >= size ||
> >> +      pos + count > size) {
> >> +          pr_err("%s pos 0x%llx out of range\n", __func__, pos);
> >> +          ret = -EFAULT;
> >> +          goto config_rw_exit;
> >> +  }
> >> +
> >> +  if (iswrite) {
> >> +          char *usr_data, *ptr;
> >> +
> >> +          ptr = usr_data = memdup_user(buf, count);
> >> +          if (IS_ERR(usr_data)) {
> >> +                  ret = PTR_ERR(usr_data);
> >> +                  goto config_rw_exit;
> >> +          }
> >> +
> >> +          ret = parent->ops->write(mdev, usr_data, count,
> >> +                                    EMUL_CONFIG_SPACE, pos);  
> > 
> > No serialization on this ops, thank goodness, but why?
> >  
> 
> Its there at caller of mdev_dev_rw().
> 
> 
> > This read/write interface still seems strange to me...
> >   
> 
> Replied on this in 1st Patch.
> 
> >> +
> >> +          memcpy((void *)(vmdev->vconfig + pos), (void *)usr_data, count);
> >> +          kfree(ptr);
> >> +  } else {
> >> +          char *ret_data, *ptr;
> >> +
> >> +          ptr = ret_data = kzalloc(count, GFP_KERNEL);
> >> +
> >> +          if (IS_ERR(ret_data)) {
> >> +                  ret = PTR_ERR(ret_data);
> >> +                  goto config_rw_exit;
> >> +          }
> >> +
> >> +          ret = parent->ops->read(mdev, ret_data, count,
> >> +                                  EMUL_CONFIG_SPACE, pos);
> >> +
> >> +          if (ret > 0) {
> >> +                  if (copy_to_user(buf, ret_data, ret))
> >> +                          ret = -EFAULT;
> >> +                  else
> >> +                          memcpy((void *)(vmdev->vconfig + pos),
> >> +                                  (void *)ret_data, count);
> >> +          }
> >> +          kfree(ptr);  
> > 
> > So vconfig caches all of config space for the mdev, but we only ever
> > use it to read the BAR address via mdev_read_base()... why?  I hope the
> > mdev driver doesn't freak out if the user reads the mmio region before
> > writing a base address (remember the vfio API aspect of the interface
> > doesn't necessarily follow the VM PCI programming API)
> >   
> 
> How could user read mmio region from guest before writing base address?
> Isn't that would be a bug?

The user is never to be trusted.  The possibility that the user is
either clueless or malicious needs to be accounted for to protect the
host kernel.

> From our driver if pos is not within the base address range, then we
> return error for read/write.
> 
> 
> >> +  }
> >> +config_rw_exit:
> >> +
> >> +  if (ret > 0)
> >> +          *ppos += ret;
> >> +
> >> +  return ret;
> >> +}
> >> +
> >> +ssize_t mdev_dev_bar_rw(struct vfio_mdev *vmdev, char __user *buf,
> >> +                  size_t count, loff_t *ppos, bool iswrite)
> >> +{
> >> +  struct mdev_device *mdev = vmdev->mdev;
> >> +  struct parent_device *parent = mdev->parent;
> >> +  loff_t offset = *ppos & VFIO_PCI_OFFSET_MASK;
> >> +  loff_t pos;
> >> +  int bar_index = VFIO_PCI_OFFSET_TO_INDEX(*ppos);
> >> +  int ret = 0;
> >> +
> >> +  if (!vmdev->vfio_region_info[bar_index].start)
> >> +          mdev_read_base(vmdev);
> >> +
> >> +  if (offset >= vmdev->vfio_region_info[bar_index].size) {
> >> +          ret = -EINVAL;
> >> +          goto bar_rw_exit;
> >> +  }
> >> +
> >> +  count = min(count,
> >> +              (size_t)(vmdev->vfio_region_info[bar_index].size - offset));
> >> +
> >> +  pos = vmdev->vfio_region_info[bar_index].start + offset;  
> > 
> > In the case of a mpci dev, @start is the vconfig BAR value, so it's
> > user (guest) writable, and the mediated driver is supposed to
> > understand that?  I suppose is saw the config write too, if there was
> > one, but the mediated driver gives us region info based on region index.
> > We have the region index here.  Why wouldn't we do reads and writes
> > based on region index and offset and eliminate vconfig?  Seems like
> > that would consolidate a lot of this, we don't care what we're reading
> > and writing, just pass it through.  Mediated pci drivers would simply
> > need to match indexes to those already defined for vfio-pci.
> >   
> 
> Ok, looking at it. so this will remove vconfig completely.

Thanks,
Alex



reply via email to

[Prev in Thread] Current Thread [Next in Thread]