[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
Re: [PATCH V1 18/32] osdep: import MADV_DOEXEC
From: |
Alex Williamson |
Subject: |
Re: [PATCH V1 18/32] osdep: import MADV_DOEXEC |
Date: |
Mon, 17 Aug 2020 15:44:03 -0600 |
On Mon, 17 Aug 2020 17:20:57 -0400
Steven Sistare <steven.sistare@oracle.com> wrote:
> On 8/17/2020 4:48 PM, Alex Williamson wrote:
> > On Mon, 17 Aug 2020 14:30:51 -0400
> > Steven Sistare <steven.sistare@oracle.com> wrote:
> >
> >> On 7/30/2020 11:14 AM, Steve Sistare wrote:
> >>> Anonymous memory segments used by the guest are preserved across a re-exec
> >>> of qemu, mapped at the same VA, via a proposed madvise(MADV_DOEXEC) option
> >>> in the Linux kernel. For the madvise patches, see:
> >>>
> >>> https://lore.kernel.org/lkml/1595869887-23307-1-git-send-email-anthony.yznaga@oracle.com/
> >>>
> >>> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
> >>> ---
> >>> include/qemu/osdep.h | 7 +++++++
> >>> 1 file changed, 7 insertions(+)
> >>
> >> Hi Alex,
> >> The MADV_DOEXEC functionality, which is a pre-requisite for the entire
> >> qemu
> >> live update series, is getting a chilly reception on lkml. We could
> >> instead
> >> create guest memory using memfd_create and preserve the fd across exec.
> >> However,
> >> the subsequent mmap(fd) will return a different VA than was used
> >> previously,
> >> which is a problem for memory that was registered with vfio, as the
> >> original VA
> >> is remembered in the kernel struct vfio_dma and used in various kernel
> >> functions,
> >> such as vfio_iommu_replay.
> >>
> >> To fix, we could provide a VFIO_IOMMU_REMAP_DMA ioctl taking iova, size,
> >> and
> >> new_vaddr. The implementation finds an exact match for (iova, size) and
> >> replaces
> >> vaddr with new_vaddr. Flags cannot be changed.
> >>
> >> memfd_create plus VFIO_IOMMU_REMAP_DMA would replace MADV_DOEXEC.
> >> vfio on any form of shared memory (shm, dax, etc) could also be preserved
> >> across
> >> exec with shmat/mmap plus VFIO_IOMMU_REMAP_DMA.
> >>
> >> What do you think
> >
> > Your new REMAP ioctl would have parameters identical to the MAP_DMA
> > ioctl, so I think we should just use one of the flag bits on the
> > existing MAP_DMA ioctl for this variant.
>
> Sounds good.
>
> > Reading through the discussion on the kernel side there seems to be
> > some confusion around why vfio needs the vaddr beyond the user call to
> > MAP_DMA though. Originally this was used to test for virtually
> > contiguous mappings for merging and splitting purposes. This is
> > defunct in the v2 interface, however the vaddr is now used largely for
> > mdev devices. If an mdev device is not backed by an IOMMU device and
> > does not share a container with an IOMMU device, then a user MAP_DMA
> > ioctl essentially just registers the translation within the vfio
> > container. The mdev vendor driver can then later either request pages
> > to be pinned for device DMA or can perform copy_to/from_user() to
> > simulate DMA via the CPU.
> >
> > Therefore I don't see that there's a simple re-architecture of the vfio
> > IOMMU backend that could drop vaddr use.
>
> Yes. I did not explain on lkml as you do here (thanks), but I reached the
> same conclusion.
>
> > I'm a bit concerned this new
> > remap proposal also raises the question of how do we prevent userspace
> > remapping vaddrs racing with asynchronous kernel use of the previous
> > vaddrs.
>
> Agreed. After a quick glance at the code, holding iommu->lock during
> remap might be sufficient, but it needs more study.
Unless you're suggesting an extended hold of the lock across the entire
re-exec of QEMU, that's only going to prevent a race between a remap
and a vendor driver pin or access, the time between the previous vaddr
becoming invalid and the remap is unprotected.
> > Are we expecting guest drivers/agents to quiesce the device,
> > or maybe relying on clearing bus-master, for PCI devices, to halt DMA?
>
> No. We want to support any guest, and the guest is not aware that qemu
> live update is occurring.
>
> > The vfio migration interface we've developed does have a mechanism to
> > stop a device, would we need to use this here? If we do have a
> > mechanism to quiesce the device, is the only reason we're not unmapping
> > everything and remapping it into the new address space the latency in
> > performing that operation? Thanks,
>
> Same answer - we don't require that the guest has vfio migration support.
QEMU toggling the runstate of the device via the vfio migration
interface could be done transparently to the guest, but if your
intention is to support any device (where none currently support the
migration interface) perhaps it's a moot point. It seems like this
scheme only works with IOMMU backed devices where the device can
continue to operate against pinned pages, anything that might need to
dynamically pin pages against the process vaddr as it's running async
to the QEMU re-exec needs to be blocked or stopped. Thanks,
Alex