Hi Alex,
On the kernel side, I think that if we don't trust the user's behavior,
we should disable access to the vfio-pci interface as soon as the
vfio-pci driver gets error_detected. We should disable all access to
the vfio fd regardless of whether the vfio-pci device is assigned to a
VM. We can also return an EAGAIN error if the user tries to access it
during the reset period, until the host reset has finished.
On the QEMU side, when we get an error_detected, we pass the AER error
through to the guest directly and ignore all access to vfio-pci during
this time. When QEMU needs to do a hot reset, we can retry the get info
ioctl until it reports that the vfio-pci reset has finished, then do
the hot_reset ioctl if needed. The kernel should ensure the ioctl
becomes accessible again after the host reset has completed.
That sounds pretty thorough. The sticking point there is always
disabling the device mmaps w/o a revoke interface. Do we invalidate the
pfn range and set up a fault handler that blocks on access? I don't
think we have a whole lot of options, either block or SIGBUS, but having
such a mechanism might allow us to easily put a device in a "dead" state
where the user can't touch it, which could be useful for other purposes
too. QEMU would also need to time out after some number of reset
attempts and assume the device is not coming back. Plus we'd need a
device flag to indicate this behavior. Thanks,
Alex
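
For illustration only, here is a rough sketch of the retry-with-timeout
flow discussed above, written as a small userspace model rather than
real kernel or QEMU code. Everything in it is hypothetical: struct
fake_dev, fake_get_info() and wait_for_reset() are made-up names, the
-EAGAIN return only models the proposed kernel behavior during a host
reset, and -ETIMEDOUT and -ENODEV are arbitrary choices for "gave up"
and "dead".

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical device states, modelled on the discussion above. */
enum dev_state { DEV_OK, DEV_IN_RESET, DEV_DEAD };

/* Hypothetical stand-in for a vfio device; not a real kernel or QEMU type. */
struct fake_dev {
    enum dev_state state;
    int polls_until_ready;  /* polls left before the host reset "finishes" */
};

/*
 * Models the proposed kernel side: return -EAGAIN while the host reset
 * is still pending, 0 once it has finished, and -ENODEV for a device
 * that has been marked dead.
 */
static int fake_get_info(struct fake_dev *d)
{
    if (d->state == DEV_IN_RESET) {
        if (d->polls_until_ready > 0) {
            d->polls_until_ready--;
            return -EAGAIN;
        }
        d->state = DEV_OK;  /* host reset has completed */
    }
    return d->state == DEV_OK ? 0 : -ENODEV;
}

/*
 * Models the QEMU side: poll until the device stops returning -EAGAIN,
 * but give up after max_attempts and treat the device as not coming back.
 */
static int wait_for_reset(struct fake_dev *d, int max_attempts)
{
    assert(max_attempts > 0);

    for (int i = 0; i < max_attempts; i++) {
        int ret = fake_get_info(d);

        if (ret != -EAGAIN)
            return ret;
        /* A real caller would sleep or yield here between attempts. */
    }

    d->state = DEV_DEAD;    /* assume the device is gone for good */
    return -ETIMEDOUT;
}
```

In this model a caller would only go on to issue the hot reset when
wait_for_reset() returns 0. Any other return means the device should be
left alone, which is roughly where a "dead" device flag would come in.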