qemu-devel
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [Qemu-devel] [PATCH RFC v11 3/4] vfio-pci: pass the aer error to gue


From: Cao jin
Subject: Re: [Qemu-devel] [PATCH RFC v11 3/4] vfio-pci: pass the aer error to guest
Date: Fri, 20 Jan 2017 14:50:26 +0800
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:38.0) Gecko/20100101 Thunderbird/38.1.0


On 01/19/2017 06:31 AM, Alex Williamson wrote:
> On Sat, 31 Dec 2016 17:13:07 +0800
> Cao jin <address@hidden> wrote:
> 
>> From: Chen Fan <address@hidden>
>>

>> diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
>> index 76a8ac3..9861f72 100644
>> --- a/hw/vfio/pci.c
>> +++ b/hw/vfio/pci.c
>> @@ -2470,21 +2470,55 @@ static void vfio_put_device(VFIOPCIDevice *vdev)
>>  static void vfio_err_notifier_handler(void *opaque)
>>  {
>>      VFIOPCIDevice *vdev = opaque;
>> +    PCIDevice *dev = &vdev->pdev;
>> +    PCIEAERMsg msg = {
>> +        .severity = 0,
>> +        .source_id = (pci_bus_num(dev->bus) << 8) | dev->devfn,
>> +    };
>> +    int len;
>> +    uint64_t uncor_status;
>> +
>> +    /* Read uncorrectable error status from driver */
>> +    len = read(vdev->err_notifier.rfd, &uncor_status, sizeof(uncor_status));
> 
> Whoa this seems bogus.  In the kernel eventfd_signal() adds the value
> to the internal counter.  We can't guarantee that the user is going to
> immediately respond to every signal, multiple signals might occur, each
> incrementing the counter.  I don't think we can use the eventfd like
> this.
> 

Ah, got your concern, make sense. Michael had the same comments as you,
I didn't got him at that time...

I explained the reason in v10 cover letter(5 of changelog): I find that
error status register reading in error handler sometime returns ALL F's.
 I may want to take a look at v10, incorporate your comments, and test
to see if the issue still exists.  Currently if we only handle non-fatal
error, we can still use eventfd like before.

-- 
Sincerely,
Cao jin
>> +    if (len != sizeof(uncor_status)) {
>> +        error_report("vfio-pci: uncor error status reading returns"
>> +                     " invalid number of bytes: %d", len);
>> +        return; //Or goto stop?
> 
> It's bogus use of the eventfd anyway afaict, but
> event_notifier_test_and_clear() certainly handles at least EINTR.
> 
>> +    }
>> +
>> +    if (!(vdev->features & VFIO_FEATURE_ENABLE_AER)) {
>> +        goto stop;
>> +    }
> 
> I'm not sure this should be user selected anymore.
> 
>> +
>> +    /* Populate the aer msg and send it to root port */
>> +    if (dev->exp.aer_cap) {
>> +        uint8_t *aer_cap = dev->config + dev->exp.aer_cap;
>> +        bool isfatal = uncor_status &
>> +                       pci_get_long(aer_cap + PCI_ERR_UNCOR_SEVER);
>> +
>> +    if (isfatal) {
>> +        goto stop;
>> +    }
> 
> QEMU uses spaces not tabs.
> 
>> +
>> +        msg.severity = isfatal ? PCI_ERR_ROOT_CMD_FATAL_EN :
>> +                                 PCI_ERR_ROOT_CMD_NONFATAL_EN;
> 
> We don't get here if isfatal.
> 
>>  
>> -    if (!event_notifier_test_and_clear(&vdev->err_notifier)) {
>> +        error_report("vfio-pci device %d sending AER to root port. uncor"
>> +                     " status = 0x%"PRIx64, dev->devfn, uncor_status);
>> +        pcie_aer_msg(dev, &msg);
>>          return;
>>      }
>>  
>> +stop:
>>      /*
>> -     * TBD. Retrieve the error details and decide what action
>> -     * needs to be taken. One of the actions could be to pass
>> -     * the error to the guest and have the guest driver recover
>> -     * from the error. This requires that PCIe capabilities be
>> -     * exposed to the guest. For now, we just terminate the
>> -     * guest to contain the error.
>> +     * Terminate the guest in case of
>> +     * 1. AER capability is not exposed to guest.
>> +     * 2. AER capability is exposed, but error is fatal, only non-fatal
>> +     * error is handled now.
> 
> You're also currently requiring the user to enable aer per device.
> 
>>       */
>>  
>> -    error_report("%s(%s) Unrecoverable error detected. Please collect any 
>> data possible and then kill the guest", __func__, vdev->vbasedev.name);
>> +    error_report("%s(%s) fatal error detected. Please collect any data"
>> +            " possible and then kill the guest", __func__, 
>> vdev->vbasedev.name);
>>  
>>      vm_stop(RUN_STATE_INTERNAL_ERROR);
>>  }
> 
> 
> 
> .
> 







reply via email to

[Prev in Thread] Current Thread [Next in Thread]