[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
Re: [Qemu-devel] [PATCH RFC v11 3/4] vfio-pci: pass the aer error to gue
From: |
Cao jin |
Subject: |
Re: [Qemu-devel] [PATCH RFC v11 3/4] vfio-pci: pass the aer error to guest |
Date: |
Fri, 20 Jan 2017 14:50:26 +0800 |
User-agent: |
Mozilla/5.0 (X11; Linux x86_64; rv:38.0) Gecko/20100101 Thunderbird/38.1.0 |
On 01/19/2017 06:31 AM, Alex Williamson wrote:
> On Sat, 31 Dec 2016 17:13:07 +0800
> Cao jin <address@hidden> wrote:
>
>> From: Chen Fan <address@hidden>
>>
>> diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
>> index 76a8ac3..9861f72 100644
>> --- a/hw/vfio/pci.c
>> +++ b/hw/vfio/pci.c
>> @@ -2470,21 +2470,55 @@ static void vfio_put_device(VFIOPCIDevice *vdev)
>> static void vfio_err_notifier_handler(void *opaque)
>> {
>> VFIOPCIDevice *vdev = opaque;
>> + PCIDevice *dev = &vdev->pdev;
>> + PCIEAERMsg msg = {
>> + .severity = 0,
>> + .source_id = (pci_bus_num(dev->bus) << 8) | dev->devfn,
>> + };
>> + int len;
>> + uint64_t uncor_status;
>> +
>> + /* Read uncorrectable error status from driver */
>> + len = read(vdev->err_notifier.rfd, &uncor_status, sizeof(uncor_status));
>
> Whoa this seems bogus. In the kernel eventfd_signal() adds the value
> to the internal counter. We can't guarantee that the user is going to
> immediately respond to every signal, multiple signals might occur, each
> incrementing the counter. I don't think we can use the eventfd like
> this.
>
Ah, got your concern, make sense. Michael had the same comments as you,
I didn't got him at that time...
I explained the reason in v10 cover letter(5 of changelog): I find that
error status register reading in error handler sometime returns ALL F's.
I may want to take a look at v10, incorporate your comments, and test
to see if the issue still exists. Currently if we only handle non-fatal
error, we can still use eventfd like before.
--
Sincerely,
Cao jin
>> + if (len != sizeof(uncor_status)) {
>> + error_report("vfio-pci: uncor error status reading returns"
>> + " invalid number of bytes: %d", len);
>> + return; //Or goto stop?
>
> It's bogus use of the eventfd anyway afaict, but
> event_notifier_test_and_clear() certainly handles at least EINTR.
>
>> + }
>> +
>> + if (!(vdev->features & VFIO_FEATURE_ENABLE_AER)) {
>> + goto stop;
>> + }
>
> I'm not sure this should be user selected anymore.
>
>> +
>> + /* Populate the aer msg and send it to root port */
>> + if (dev->exp.aer_cap) {
>> + uint8_t *aer_cap = dev->config + dev->exp.aer_cap;
>> + bool isfatal = uncor_status &
>> + pci_get_long(aer_cap + PCI_ERR_UNCOR_SEVER);
>> +
>> + if (isfatal) {
>> + goto stop;
>> + }
>
> QEMU uses spaces not tabs.
>
>> +
>> + msg.severity = isfatal ? PCI_ERR_ROOT_CMD_FATAL_EN :
>> + PCI_ERR_ROOT_CMD_NONFATAL_EN;
>
> We don't get here if isfatal.
>
>>
>> - if (!event_notifier_test_and_clear(&vdev->err_notifier)) {
>> + error_report("vfio-pci device %d sending AER to root port. uncor"
>> + " status = 0x%"PRIx64, dev->devfn, uncor_status);
>> + pcie_aer_msg(dev, &msg);
>> return;
>> }
>>
>> +stop:
>> /*
>> - * TBD. Retrieve the error details and decide what action
>> - * needs to be taken. One of the actions could be to pass
>> - * the error to the guest and have the guest driver recover
>> - * from the error. This requires that PCIe capabilities be
>> - * exposed to the guest. For now, we just terminate the
>> - * guest to contain the error.
>> + * Terminate the guest in case of
>> + * 1. AER capability is not exposed to guest.
>> + * 2. AER capability is exposed, but error is fatal, only non-fatal
>> + * error is handled now.
>
> You're also currently requiring the user to enable aer per device.
>
>> */
>>
>> - error_report("%s(%s) Unrecoverable error detected. Please collect any
>> data possible and then kill the guest", __func__, vdev->vbasedev.name);
>> + error_report("%s(%s) fatal error detected. Please collect any data"
>> + " possible and then kill the guest", __func__,
>> vdev->vbasedev.name);
>>
>> vm_stop(RUN_STATE_INTERNAL_ERROR);
>> }
>
>
>
> .
>