qemu-devel
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [Qemu-devel] [PATCH 11/12] vfio: register aer resume notification ha


From: Zhou Jie
Subject: Re: [Qemu-devel] [PATCH 11/12] vfio: register aer resume notification handler for aer resume
Date: Thu, 19 May 2016 10:41:34 +0800
User-agent: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:45.0) Gecko/20100101 Thunderbird/45.1.0



On 2016/5/19 10:18, Alex Williamson wrote:
On Thu, 19 May 2016 09:49:00 +0800
Zhou Jie <address@hidden> wrote:

On 2016/5/19 2:26, Alex Williamson wrote:
On Wed, 18 May 2016 11:31:09 +0800
Zhou Jie <address@hidden> wrote:

From: Chen Fan <address@hidden>

For supporting aer recovery, host and guest would run the same aer
recovery code, that would do the secondary bus reset if the error
is fatal, the aer recovery process:
  1. error_detected
  2. reset_link (if fatal)
  3. slot_reset/mmio_enabled
  4. resume

It indicates that host will do secondary bus reset to reset
the physical devices under bus in step 2, that would cause
devices in D3 status in a short time. But in qemu, we register
an error detected handler, that would be invoked as host broadcasts
the error-detected event in step 1, in order to avoid guest do
reset_link when host do reset_link simultaneously. it may cause
fatal error. we introduce a resmue notifier to assure host reset
completely.
In qemu, the aer recovery process:
  1. Detect support for resume notification
     If host vfio driver does not support for resume notification,
     directly fail to boot up VM as with aer enabled.
  2. Immediately notify the VM on error detected.
  3. Stall any access to the device until resume is signaled.

The code below doesn't actually do this, mmaps are disabled, but that
just means any device access will use the slow path through QEMU.  That
path is still fully enabled and so is config space access.
I will modify the code and find other way to
stall any access to the device.

  4. Delay the guest directed bus reset.

It's delayed, but not the way I expected.  The guest goes on executing
and then we do the reset at some point later?  More comments below...

  5. Wait for resume notification.
     If we don't get the resume notification from the host after
     some timeout, we would abort the guest directed bus reset
     altogether and unplug of the device to prevent it from further
     interacting with the VM.
  6. After get the resume notification reset bus.

Signed-off-by: Chen Fan <address@hidden>
Signed-off-by: Zhou Jie <address@hidden>
---
 hw/vfio/pci.c              | 182 ++++++++++++++++++++++++++++++++++++++++++++-
 hw/vfio/pci.h              |   5 ++
 linux-headers/linux/vfio.h |   1 +
 3 files changed, 186 insertions(+), 2 deletions(-)

diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
index 6877a3d..39a9a9f 100644
--- a/hw/vfio/pci.c
+++ b/hw/vfio/pci.c
@@ -35,6 +35,7 @@
 #include "trace.h"

 #define MSIX_CAP_LENGTH 12
+#define VFIO_RESET_TIMEOUT 1000

It deserves at least a comment as to why this value was chosen.
OK. I will add a comment here.

 static void vfio_disable_interrupts(VFIOPCIDevice *vdev);
 static void vfio_mmap_set_enabled(VFIOPCIDevice *vdev, bool enabled);
@@ -1937,6 +1938,14 @@ static void vfio_check_hot_bus_reset(VFIOPCIDevice 
*vdev, Error **errp)
     VFIOGroup *group;
     int ret, i, devfn, range_limit;

+    if (!vdev->vfio_resume_cap) {
+        error_setg(errp, "vfio: Cannot enable AER for device %s,"
+                   " host vfio driver does not support for"
+                   " resume notification",
+                   vdev->vbasedev.name);
+        return;
+    }
+
     ret = vfio_get_hot_reset_info(vdev, &info);
     if (ret) {
         error_setg(errp, "vfio: Cannot enable AER for device %s,"
@@ -2594,6 +2603,23 @@ static int vfio_populate_device(VFIOPCIDevice *vdev)
                      vbasedev->name);
     }

+    irq_info.index = VFIO_PCI_RESUME_IRQ_INDEX;
+
+    ret = ioctl(vdev->vbasedev.fd, VFIO_DEVICE_GET_IRQ_INFO, &irq_info);
+    if (ret) {
+        /* This can fail for an old kernel or legacy PCI dev */
+        trace_vfio_populate_device_get_irq_info_failure();
+        ret = 0;
+    } else if (irq_info.count == 1) {
+        vdev->vfio_resume_cap = true;

"vfio" in the name is redundant, ERR_IRQ uses pci_aer, so a logical
name might be pci_aer_has_resume.
OK.

+    } else {
+        error_report("vfio: %s "
+                     "Could not enable error recovery for the device,"
+                     " because host vfio driver does not support for"
+                     " resume notification",
+                     vbasedev->name);
+    }

This error_report makes sense for ERR_IRQ because halt-on-AER is setup
transparently, but I don't think it makes sense here.  If the user has
specified to enable AER then it should either work or they should get
an error message.  If they have not specified to enable AER, why does
the user care if there's an inconsistency here?
OK. I will delete the error report at here.

+
 error:
     return ret;
 }
@@ -2661,6 +2687,17 @@ static void vfio_err_notifier_handler(void *opaque)
         msg.severity = isfatal ? PCI_ERR_ROOT_CMD_FATAL_EN :
                                  PCI_ERR_ROOT_CMD_NONFATAL_EN;

+        if (isfatal) {
+            PCIDevice *dev_0 = pci_get_function_0(dev);
+            vdev->vfio_resume_wait = true;
+            vdev->vfio_resume_arrived = false;

Possible names:

pci_aer_error_signaled
pci_aer_resume_signaled
OK.

+            vfio_mmap_set_enabled(vdev, false);
+            if (dev_0 != dev) {
+                VFIOPCIDevice *vdev_0 = DO_UPCAST(VFIOPCIDevice, pdev, dev_0);
+                vdev_0->vfio_resume_wait = true;
+                vdev_0->vfio_resume_arrived = false;
+            }

Why is function 0 special here?  Don't we expect that it will also get
an ERR_IRQ?
I tested in this condition.
The device is multifunction.
And function 0 is OK, function 1 occured an error.
function 0 got an ERR_IRQ, but returned at following code.
if (!(uncor_status & ~0UL)) {
     return;
}
If don't set the function 0 at here, it will not wait in vfio_pci_reset.

And I also tested the nofatal error.
The aer will send the nofatal error notification to qemu,
but the guest will not invoke vfio_pci_reset.
So, I need know whether the error is fatal,
and then set the pci_aer_error_signaled.

Do we need to worry about some functions getting a resume while others
don't?  Should the reset stall until all functions that received a
fatal error receive a resume?  Maybe all tracking should be done with
an 8-byte (256-bit to support ARI) atomic bitmap on function 0 and
reset is blocked until the mask of all functions receiving a fatal
error is equal to the mask of all functions receiving a resume.
It is a good idea. I will try it.

+        }
         pcie_aer_msg(dev, &msg);
         return;
     }
@@ -2756,6 +2793,96 @@ static void vfio_unregister_err_notifier(VFIOPCIDevice 
*vdev)
     event_notifier_cleanup(&vdev->err_notifier);
 }

+static void vfio_resume_notifier_handler(void *opaque)

Please use "aer" in the name, otherwise resume might refer to
suspend-resume.
OK.

+{
+    VFIOPCIDevice *vdev = opaque;
+
+    if (!event_notifier_test_and_clear(&vdev->resume_notifier)) {
+        return;
+    }
+
+    vdev->vfio_resume_arrived = true;
+    if (vdev->reset_timer != NULL) {

pci_aer_reset_blocked_timer
OK

+        timer_mod(vdev->reset_timer, qemu_clock_get_ms(QEMU_CLOCK_VIRTUAL));
+    }
+}
+
+static void vfio_register_aer_resume_notifier(VFIOPCIDevice *vdev)
+{
+    int ret;
+    int argsz;
+    struct vfio_irq_set *irq_set;
+    int32_t *pfd;
+
+    if (!(vdev->features & VFIO_FEATURE_ENABLE_AER) ||
+        !vdev->vfio_resume_cap) {
+        return;
+    }
+
+    if (event_notifier_init(&vdev->resume_notifier, 0)) {
+        error_report("vfio: Unable to init event notifier for"
+                     " resume notification");
+        vdev->vfio_resume_cap = false;
+        return;
+    }
+
+    argsz = sizeof(*irq_set) + sizeof(*pfd);
+
+    irq_set = g_malloc0(argsz);
+    irq_set->argsz = argsz;
+    irq_set->flags = VFIO_IRQ_SET_DATA_EVENTFD |
+                     VFIO_IRQ_SET_ACTION_TRIGGER;
+    irq_set->index = VFIO_PCI_RESUME_IRQ_INDEX;
+    irq_set->start = 0;
+    irq_set->count = 1;
+    pfd = (int32_t *)&irq_set->data;
+
+    *pfd = event_notifier_get_fd(&vdev->resume_notifier);
+    qemu_set_fd_handler(*pfd, vfio_resume_notifier_handler, NULL, vdev);
+
+    ret = ioctl(vdev->vbasedev.fd, VFIO_DEVICE_SET_IRQS, irq_set);
+    if (ret) {
+        error_report("vfio: Failed to set up resume notification");
+        qemu_set_fd_handler(*pfd, NULL, NULL, vdev);
+        event_notifier_cleanup(&vdev->resume_notifier);
+        vdev->vfio_resume_cap = false;
+    }
+    g_free(irq_set);
+}
+
+static void vfio_unregister_aer_resume_notifier(VFIOPCIDevice *vdev)
+{
+    int argsz;
+    struct vfio_irq_set *irq_set;
+    int32_t *pfd;
+    int ret;
+
+    if (!vdev->vfio_resume_cap) {
+        return;
+    }
+
+    argsz = sizeof(*irq_set) + sizeof(*pfd);
+
+    irq_set = g_malloc0(argsz);
+    irq_set->argsz = argsz;
+    irq_set->flags = VFIO_IRQ_SET_DATA_EVENTFD |
+                     VFIO_IRQ_SET_ACTION_TRIGGER;
+    irq_set->index = VFIO_PCI_RESUME_IRQ_INDEX;
+    irq_set->start = 0;
+    irq_set->count = 1;
+    pfd = (int32_t *)&irq_set->data;
+    *pfd = -1;
+
+    ret = ioctl(vdev->vbasedev.fd, VFIO_DEVICE_SET_IRQS, irq_set);
+    if (ret) {
+        error_report("vfio: Failed to de-assign error fd: %m");
+    }
+    g_free(irq_set);
+    qemu_set_fd_handler(event_notifier_get_fd(&vdev->resume_notifier),
+                        NULL, NULL, vdev);
+    event_notifier_cleanup(&vdev->resume_notifier);
+}
+
 static void vfio_req_notifier_handler(void *opaque)
 {
     VFIOPCIDevice *vdev = opaque;
@@ -3061,6 +3188,7 @@ static int vfio_initfn(PCIDevice *pdev)
     }

     vfio_register_err_notifier(vdev);
+    vfio_register_aer_resume_notifier(vdev);
     vfio_register_req_notifier(vdev);
     vfio_setup_resetfn_quirk(vdev);

@@ -3091,6 +3219,7 @@ static void vfio_exitfn(PCIDevice *pdev)
     VFIOPCIDevice *vdev = DO_UPCAST(VFIOPCIDevice, pdev, pdev);

     vfio_unregister_req_notifier(vdev);
+    vfio_unregister_aer_resume_notifier(vdev);
     vfio_unregister_err_notifier(vdev);
     pci_device_set_intx_routing_notifier(&vdev->pdev, NULL);
     vfio_disable_interrupts(vdev);
@@ -3101,6 +3230,34 @@ static void vfio_exitfn(PCIDevice *pdev)
     vfio_bars_exit(vdev);
 }

+static void vfio_pci_delayed_reset(void *opaque)
+{
+    VFIOPCIDevice *vdev = opaque;
+    PCIDevice *pdev = &vdev->pdev;
+
+    timer_free(vdev->reset_timer);
+    vdev->reset_timer = NULL;

Racy, the timer gets freed before set to NULL so the test above could
see it non-NULL as it's being freed, assuming QEMU supports that sort
of concurrency.
I will use sem to avoid race condition.
More comments below...

Seems like you could just reverse the order, cache vdev->reset_timer,
set it to NULL, then call timer_free() on the cached value.  But as I
question below, it'd be more simple to not have a timer.

I meant I will use sem instead of timer.

+
+    if (!vdev->vfio_resume_wait) {
+        return;
+    }
+    vdev->vfio_resume_wait = false;
+
+    if (vdev->vfio_resume_arrived) {
+        vfio_mmap_set_enabled(vdev, true);

How do you know if mmap was enabled when you started?  This could
interfere with other cases where mmaps get disabled.
Yes, I will modify the code.

+        if (pci_get_function_0(pdev) == pdev) {
+            vfio_pci_hot_reset(vdev, vdev->single_depend_dev);
+        }
+    } else {
+        uint32_t uncor_status;
+        uncor_status = vfio_pci_read_config(pdev,
+                           pdev->exp.aer_cap + PCI_ERR_UNCOR_STATUS, 4);
+        if (uncor_status & ~0UL) {
+            qdev_unplug(&vdev->pdev.qdev, NULL);
+        }
+    }
+}
+
 static void vfio_pci_reset(DeviceState *dev)
 {
     PCIDevice *pdev = DO_UPCAST(PCIDevice, qdev, dev);
@@ -3113,13 +3270,34 @@ static void vfio_pci_reset(DeviceState *dev)

         if ((pci_get_word(br->config + PCI_BRIDGE_CONTROL) &
              PCI_BRIDGE_CTL_BUS_RESET)) {
-            if (pci_get_function_0(pdev) == pdev) {
-                vfio_pci_hot_reset(vdev, vdev->single_depend_dev);
+            if (!vdev->vfio_resume_wait) {
+                if (pci_get_function_0(pdev) == pdev) {
+                    vfio_pci_hot_reset(vdev, vdev->single_depend_dev);
+                }
+            } else {
+                if (vdev->vfio_resume_arrived) {
+                    vdev->vfio_resume_wait = false;
+                    vfio_mmap_set_enabled(vdev, true);

mmap is getting restored in too many places, it should be disabled on
ERR_IRQ and re-enabled on ERR_RESUME, no more.
OK.

+                    if (pci_get_function_0(pdev) == pdev) {
+                        vfio_pci_hot_reset(vdev, vdev->single_depend_dev);
+                    }
+                } else {
+                    vdev->reset_timer = timer_new_ms(QEMU_CLOCK_VIRTUAL,
+                                            vfio_pci_delayed_reset, vdev);
+                    timer_mod(vdev->reset_timer,
+                              qemu_clock_get_ms(QEMU_CLOCK_VIRTUAL)
+                              + VFIO_RESET_TIMEOUT);
+                }

Wait, so we just set a timer and return pretending the reset occurred
when actually it might occur at some point in the future?  How is that
supposed to work?  I thought the plan was to block here.
How about invoke sem_post at vfio_resume_notifier_handler,
and wait here use sem_timedwait?

Do we even need a timer?  What if we simply spin?

for (i = 0; i < 1000; i++) {
    if (vdev->pci_aer_resume_signaled) {
        break;
    }
    g_usleep(1000);
}

if (i == 1000) {
    /* We failed */
} else {
    /* Proceed with reset */
}

Does QEMU have enough concurrency to do this?  Thanks,

I meant I will use sem instead of timer.
But think about the new idea you proposed.
I will use g_usleep and check bitmap to
wait mask of all functions receiving a fatal
error is equal to the mask of all functions receiving a resume.

Sincerely,
Zhou Jie


Alex

             }
             return;
         }
     }

+    if (vdev->vfio_resume_wait) {
+        vdev->vfio_resume_wait = false;
+        vfio_mmap_set_enabled(vdev, true);
+    }

Again, these are getting changed in too many places, the state machine
is too complicated.  Thanks,
In my test this code will never be invoked.
But I add this code to clear vfio_resume_wait if the guest don't
  reset the bus.

Sincerely,
Zhou Jie



Alex

+
     vfio_pci_pre_reset(vdev);

     if (vdev->resetfn && !vdev->resetfn(vdev)) {
diff --git a/hw/vfio/pci.h b/hw/vfio/pci.h
index 9fb0206..49e28d8 100644
--- a/hw/vfio/pci.h
+++ b/hw/vfio/pci.h
@@ -119,6 +119,7 @@ typedef struct VFIOPCIDevice {
     VFIOVGA *vga; /* 0xa0000, 0x3b0, 0x3c0 */
     PCIHostDeviceAddress host;
     EventNotifier err_notifier;
+    EventNotifier resume_notifier;
     EventNotifier req_notifier;
     int (*resetfn)(struct VFIOPCIDevice *);
     uint32_t vendor_id;
@@ -144,6 +145,10 @@ typedef struct VFIOPCIDevice {
     bool no_kvm_msi;
     bool no_kvm_msix;
     bool single_depend_dev;
+    bool vfio_resume_cap;
+    bool vfio_resume_wait;
+    bool vfio_resume_arrived;
+    QEMUTimer *reset_timer;
 } VFIOPCIDevice;

 uint32_t vfio_pci_read_config(PCIDevice *pdev, uint32_t addr, int len);
diff --git a/linux-headers/linux/vfio.h b/linux-headers/linux/vfio.h
index 759b850..01dfd5d 100644
--- a/linux-headers/linux/vfio.h
+++ b/linux-headers/linux/vfio.h
@@ -433,6 +433,7 @@ enum {
        VFIO_PCI_MSIX_IRQ_INDEX,
        VFIO_PCI_ERR_IRQ_INDEX,
        VFIO_PCI_REQ_IRQ_INDEX,
+        VFIO_PCI_RESUME_IRQ_INDEX,
        VFIO_PCI_NUM_IRQS
 };




.






.


--
------------------------------------------------
周潔
Dept 1
No. 6 Wenzhu Road,
Nanjing, 210012, China
TEL:+86+25-86630566-8557
FUJITSU INTERNAL:7998-8557
address@hidden
------------------------------------------------





reply via email to

[Prev in Thread] Current Thread [Next in Thread]