qemu-devel

Re: [Qemu-devel] [PATCH v2 1/5] msix_init: assert programming error


From: Alex Williamson
Subject: Re: [Qemu-devel] [PATCH v2 1/5] msix_init: assert programming error
Date: Tue, 13 Sep 2016 08:49:11 -0600

On Tue, 13 Sep 2016 08:16:20 +0200
Markus Armbruster <address@hidden> wrote:

> Cc: Alex for device assignment expertise.
> 
> Cao jin <address@hidden> writes:
> 
> > On 09/12/2016 09:29 PM, Markus Armbruster wrote:  
> >> Cao jin <address@hidden> writes:
> >>  
> >>> The input parameters are used for creating the msix capable device, so
> >>> they must obey the PCI spec; otherwise, it is a programming error.  
> >>
> >> True when the parameters come from a device model attempting to
> >> define a PCI device violating the spec.  But what if the parameters come
> >> from an actual PCI device violating the spec, via device assignment?  
> >
> > Before the patch, on invalid param, the vfio behaviour is:
> >   error_report("vfio: msix_init failed");
> >   then, device create fail.
> >
> > After the patch, its behaviour is:
> >   asserted.
> >
> > Do you mean we should still report some useful info to user on invalid
> > params?  
> 
> In the normal case, asking msix_init() to create MSI-X that are out of
> spec is a programming error: the code that does it is broken and needs
> fixing.
> 
> Device assignment might be the exception: there, the parameters for
> msix_init() come from the assigned device, not the program.  If they
> violate the spec, the device is broken.  This wouldn't be a programming
> error.  Alex, can this happen?
> 
> If yes, we may want to handle it by failing device assignment.


Generally, I think the entire premise of these sorts of patches is
flawed.  We take a working error path that allows a driver to robustly
abort on unexpected data and turn it into a time bomb.  Often the
excuse for this is that "error handling is hard".  Tough.  Now a
hot-add of a device that triggers this changes from a simple failure to
a denial of service event.  Furthermore, we base that time bomb on our
interpretation of the spec, which we can only validate against in-tree
devices.

We have actually had assigned devices that fail the sanity test here;
there's a quirk in vfio_msix_early_setup() for a Chelsio device with
this bug.  Do we really want users experiencing aborts when a simple
device initialization failure is sufficient?
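
To make the contrast concrete, here is a minimal standalone sketch of the
two behaviours.  It is not the QEMU code: every name prefixed with
sketch_ is hypothetical, and the checks are only my reading of the MSI-X
layout rules (at most 2048 vectors, 16 bytes per table entry, one pending
bit per vector rounded up to 64-bit words, table and PBA not overlapping
when they share a BAR).

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <stdio.h>

/* Assumed limits for this sketch, taken from the MSI-X layout rules. */
#define SKETCH_MSIX_MAX_ENTRIES 2048u
#define SKETCH_MSIX_ENTRY_SIZE  16u   /* bytes per table entry */

/* Hypothetical description of where a device says its MSI-X data lives. */
struct sketch_msix_layout {
    unsigned nentries;
    uint8_t  table_bar, pba_bar;
    uint32_t table_offset, pba_offset;
};

/* Return 0 if the layout looks sane, -EINVAL otherwise.  Never aborts. */
int sketch_msix_check(const struct sketch_msix_layout *l)
{
    if (l->nentries < 1 || l->nentries > SKETCH_MSIX_MAX_ENTRIES) {
        return -EINVAL;
    }

    /* 64-bit arithmetic so large offsets cannot wrap around. */
    uint64_t table_end = (uint64_t)l->table_offset +
                         (uint64_t)l->nentries * SKETCH_MSIX_ENTRY_SIZE;
    uint64_t pba_end = (uint64_t)l->pba_offset +
                       (uint64_t)((l->nentries + 63) / 64) * 8;

    /* Table and PBA in the same BAR must not overlap. */
    if (l->table_bar == l->pba_bar &&
        l->table_offset < pba_end && l->pba_offset < table_end) {
        return -EINVAL;
    }
    return 0;
}

/* Variant A: treat bad input as a programming error.  With an assigned
 * device the values come from hardware, so this takes the process down. */
void sketch_init_assert(const struct sketch_msix_layout *l)
{
    assert(sketch_msix_check(l) == 0);
}

/* Variant B: report the problem and fail only this device. */
int sketch_init_recoverable(const struct sketch_msix_layout *l)
{
    int ret = sketch_msix_check(l);

    if (ret < 0) {
        fprintf(stderr, "sketch: invalid MSI-X layout, device init failed\n");
    }
    return ret;
}
```

With variant B, a hot-add of a device that reports an overlapping table
and PBA gets back -EINVAL and everything else keeps running; with
variant A the same input kills the whole process.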

Generally, abort code paths like this cause me to do my own sanity
testing, which is really poor practice since we should have that sanity
testing in the common code.  Thanks,

Alex


