From: Daniel Henrique Barboza
Subject: [Question] SR-IOV VF 'surprise removal' and vfio_reset behavior in pSeries
Date: Mon, 4 Jan 2021 10:35:45 -0300
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:78.0) Gecko/20100101 Thunderbird/78.6.0
Hi,
This question came up while I was investigating a Libvirt bug [1], where a
user is removing VFs from the host while Libvirt domains were using them,
causing Libvirt to remain in an inconsistent state. I'm trying to alleviate
the effects of this in Libvirt (see [2] if curious), but QEMU is throwing some
messages in the terminal and, although they appear to be benign, I'm not sure
whether they are a symptom of something that should be fixed.
On a POWER9 server with a Mellanox MT28800 SR-IOV network card I have the
following IOMMU settings, where the physical card is in Group 0 and the VFs
are allocated from Group 12 onwards:
IOMMU Group 0 0000:01:00.0 Infiniband controller [0207]: Mellanox Technologies MT28800 Family [ConnectX-5 Ex] [15b3:1019]
(...)
IOMMU Group 12 0000:01:00.2 Infiniband controller [0207]: Mellanox Technologies MT27800 Family [ConnectX-5 Virtual Function] [15b3:1018]
IOMMU Group 13 0000:01:00.3 Infiniband controller [0207]: Mellanox Technologies MT27800 Family [ConnectX-5 Virtual Function] [15b3:1018]
(...)
Creating a guest with the Group 12 VF and then removing the VFs from the host
via
echo 0 > /sys/bus/pci/devices/0000\:01\:00.0/sriov_numvfs
makes the guest remove the VF card, but QEMU throws a warning/error message in
its log:
"qemu-system-ppc64: vfio: Cannot reset device 0000:01:00.2, depends on group 0 which is not owned."
I found this message confusing because the VF was occupying IOMMU group 12, but
the message is claiming that the reset wasn't possible because Group 0 wasn't
owned by the process.
Digging a bit, the hot unplug is fired by the poweroff state of the card
triggering pSeries internals, which then reach spapr_pci_unplug() in
hw/ppc/spapr_pci.c. The body of the function reads:
-------
    /* some version guests do not wait for completion of a device
     * cleanup (generally done asynchronously by the kernel) before
     * signaling to QEMU that the device is safe, but instead sleep
     * for some 'safe' period of time. unfortunately on a busy host
     * this sleep isn't guaranteed to be long enough, resulting in
     * bad things like IRQ lines being left asserted during final
     * device removal. to deal with this we call reset just prior
     * to finalizing the device, which will put the device back into
     * an 'idle' state, as the device cleanup code expects.
     */
    pci_device_reset(PCI_DEVICE(plugged_dev));
-------
My first question is right at this point: do we need a PCI reset for a VF
removal? I am not sure about the need to handle asserted IRQ lines for a
device that the kernel is going to remove.
Going further, to the origin of the warning message, we get to
vfio_pci_hot_reset() in hw/vfio/pci.c. The VFIO_DEVICE_GET_PCI_HOT_RESET_INFO
ioctl() is returning all VFs of the card, including the physical function, in
the vfio_pci_hot_reset_info struct. Then, further down, where it verifies
whether all IOMMU groups required for the reset belong to the process, it
fails to reset the VF because QEMU does not have Group 0 ownership:
-------
    ret = ioctl(vdev->vbasedev.fd, VFIO_DEVICE_GET_PCI_HOT_RESET_INFO, info);
    if (ret) {
        ret = -errno;
        error_report("vfio: hot reset info failed: %m");
        goto out_single;
    }
(...)
        QLIST_FOREACH(group, &vfio_group_list, next) {
            if (group->groupid == devices[i].group_id) {
                break;
            }
        }
        if (!group) {
            if (!vdev->has_pm_reset) {
                error_report("vfio: Cannot reset device %s, "
                             "depends on group %d which is not owned.",
                             vdev->vbasedev.name, devices[i].group_id);
            }
            ret = -EPERM;
            goto out;
        }
-------
This message is not clear to me because I know the VF was in Group 12, but
apparently the code demands ownership of all IOMMU groups related to the card
to allow the reset.
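To make the check concrete, here is a minimal standalone model of the
ownership test as I understand it. It is not QEMU code: struct dep_device,
group_is_owned() and first_unowned_group() are hypothetical names made up for
this illustration, and the device list stands in for what the hot reset info
ioctl() reported on my machine.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for one dependent device reported by the
 * hot reset info ioctl(): a PCI address and its IOMMU group. */
struct dep_device {
    const char *name;
    int group_id;
};

/* True if group_id is one of the groups the process owns. */
static bool group_is_owned(int group_id, const int *owned, size_t n_owned)
{
    for (size_t i = 0; i < n_owned; i++) {
        if (owned[i] == group_id) {
            return true;
        }
    }
    return false;
}

/* Walk every dependent device and return the first IOMMU group that the
 * process does not own, or -1 if all of them are owned. A non-negative
 * return value corresponds to the "-EPERM, cannot reset" path. */
int first_unowned_group(const struct dep_device *devs, size_t n_devs,
                        const int *owned, size_t n_owned)
{
    for (size_t i = 0; i < n_devs; i++) {
        if (!group_is_owned(devs[i].group_id, owned, n_owned)) {
            return devs[i].group_id;
        }
    }
    return -1;
}
```

With the devices 0000:01:00.0 (group 0), 0000:01:00.2 (group 12) and
0000:01:00.3 (group 13) in the list and only group 12 owned, this returns 0,
which matches the group named in the error message.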
The second question: is this intended? If not, then someone is misbehaving
(perhaps the card driver, mlx5_core) and reporting wrong info to that VFIO
ioctl(). If this reset behavior is intended, then I might add code to
spapr_pci_unplug() to skip resetting the VF in this particular case to avoid
the error message (assuming that we really can live without a reset in this
case).
Thanks,
DHB
[1] https://gitlab.com/libvirt/libvirt/-/issues/72
[2] https://www.redhat.com/archives/libvir-list/2021-January/msg00028.html