Re: [PATCH V1 18/32] osdep: import MADV_DOEXEC
From: Steven Sistare
Subject: Re: [PATCH V1 18/32] osdep: import MADV_DOEXEC
Date: Mon, 17 Aug 2020 17:20:57 -0400
User-agent: Mozilla/5.0 (Windows NT 10.0; WOW64; rv:68.0) Gecko/20100101 Thunderbird/68.11.0
On 8/17/2020 4:48 PM, Alex Williamson wrote:
> On Mon, 17 Aug 2020 14:30:51 -0400
> Steven Sistare <steven.sistare@oracle.com> wrote:
>
>> On 7/30/2020 11:14 AM, Steve Sistare wrote:
>>> Anonymous memory segments used by the guest are preserved across a re-exec
>>> of qemu, mapped at the same VA, via a proposed madvise(MADV_DOEXEC) option
>>> in the Linux kernel. For the madvise patches, see:
>>>
>>> https://lore.kernel.org/lkml/1595869887-23307-1-git-send-email-anthony.yznaga@oracle.com/
>>>
>>> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
>>> ---
>>> include/qemu/osdep.h | 7 +++++++
>>> 1 file changed, 7 insertions(+)
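[For context, the intended usage pattern is roughly the following. This is a
sketch only: MADV_DOEXEC exists solely in the proposed kernel series linked
above, the constant's value here is a placeholder, and the helper name is
invented.]

    #include <sys/mman.h>

    #ifndef MADV_DOEXEC
    #define MADV_DOEXEC 22   /* placeholder: proposed on lkml, not upstream */
    #endif

    /* Ask the kernel to preserve this anonymous range, at the same VA,
     * across a subsequent exec of the updated qemu binary. */
    static int preserve_guest_ram(void *guest_ram, size_t guest_ram_size)
    {
        return madvise(guest_ram, guest_ram_size, MADV_DOEXEC);
    }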
>>
>> Hi Alex,
>> The MADV_DOEXEC functionality, which is a prerequisite for the entire
>> qemu live update series, is getting a chilly reception on lkml. We could
>> instead create guest memory using memfd_create and preserve the fd across
>> exec. However, the subsequent mmap(fd) will return a different VA than
>> was used previously, which is a problem for memory that was registered
>> with vfio, as the original VA is remembered in the kernel struct vfio_dma
>> and used in various kernel functions, such as vfio_iommu_replay.
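[A minimal sketch of that alternative, using only standard Linux APIs; the
helper names and the size constant are illustrative, not QEMU code. The fd
survives exec because MFD_CLOEXEC is not set; the contents survive, but the
kernel is free to pick a new VA on the second mmap, which is the vfio
problem described above.]

    #define _GNU_SOURCE
    #include <sys/mman.h>
    #include <unistd.h>

    #define GUEST_RAM_SIZE (1UL << 30)   /* example size: 1 GB */

    /* Before exec: back guest RAM with a memfd (fd number passed to the
     * new qemu via argv or environment). */
    static int create_guest_ram(void)
    {
        int fd = memfd_create("guest-ram", 0);   /* no MFD_CLOEXEC */
        if (fd < 0)
            return -1;
        if (ftruncate(fd, GUEST_RAM_SIZE) < 0) {
            close(fd);
            return -1;
        }
        return fd;
    }

    /* After exec: contents are preserved, but the returned VA may differ
     * from the one registered with vfio in the old process. */
    static void *map_guest_ram(int fd)
    {
        return mmap(NULL, GUEST_RAM_SIZE, PROT_READ | PROT_WRITE,
                    MAP_SHARED, fd, 0);
    }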
>>
>> To fix, we could provide a VFIO_IOMMU_REMAP_DMA ioctl taking iova, size,
>> and new_vaddr. The implementation finds an exact match for (iova, size)
>> and replaces vaddr with new_vaddr. Flags cannot be changed.
>>
>> memfd_create plus VFIO_IOMMU_REMAP_DMA would replace MADV_DOEXEC.
>> vfio on any form of shared memory (shm, dax, etc.) could also be
>> preserved across exec with shmat/mmap plus VFIO_IOMMU_REMAP_DMA.
>>
>> What do you think?
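[For concreteness, the proposed interface might look like the following
sketch. The struct layout, field names, and request number are all
hypothetical, modeled on the existing VFIO_IOMMU_MAP_DMA; nothing here
exists in <linux/vfio.h> at this point.]

    #include <linux/types.h>
    #include <linux/vfio.h>

    /* Hypothetical: mirrors struct vfio_iommu_type1_dma_map. */
    struct vfio_iommu_type1_dma_remap {
        __u32 argsz;
        __u32 flags;       /* must be 0: flags cannot be changed */
        __u64 new_vaddr;   /* new process virtual address */
        __u64 iova;        /* must exactly match an existing mapping */
        __u64 size;        /* must exactly match an existing mapping */
    };

    /* Hypothetical request number; the slot is a guess. */
    #define VFIO_IOMMU_REMAP_DMA  _IO(VFIO_TYPE, VFIO_BASE + 17)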
>
> Your new REMAP ioctl would have parameters identical to the MAP_DMA
> ioctl, so I think we should just use one of the flag bits on the
> existing MAP_DMA ioctl for this variant.
Sounds good.
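[Under that suggestion, userspace would reuse the real struct
vfio_iommu_type1_dma_map and select the remap semantics with a new flag
bit, instead of the standalone ioctl sketched earlier. The flag name below
is invented for illustration; only the struct and VFIO_IOMMU_MAP_DMA are
real.]

    #include <sys/ioctl.h>
    #include <linux/vfio.h>

    #define VFIO_DMA_MAP_FLAG_REMAP_VADDR (1 << 2)   /* hypothetical */

    /* Replace the vaddr of an existing (iova, size) mapping in place. */
    static int vfio_remap_vaddr(int container_fd, __u64 iova, __u64 size,
                                __u64 new_vaddr)
    {
        struct vfio_iommu_type1_dma_map map = {
            .argsz = sizeof(map),
            .flags = VFIO_DMA_MAP_FLAG_REMAP_VADDR,
            .vaddr = new_vaddr,
            .iova  = iova,   /* must exactly match the original MAP_DMA */
            .size  = size,
        };
        return ioctl(container_fd, VFIO_IOMMU_MAP_DMA, &map);
    }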
> Reading through the discussion on the kernel side there seems to be
> some confusion around why vfio needs the vaddr beyond the user call to
> MAP_DMA though. Originally this was used to test for virtually
> contiguous mappings for merging and splitting purposes. This is
> defunct in the v2 interface, however the vaddr is now used largely for
> mdev devices. If an mdev device is not backed by an IOMMU device and
> does not share a container with an IOMMU device, then a user MAP_DMA
> ioctl essentially just registers the translation within the vfio
> container. The mdev vendor driver can then later either request pages
> to be pinned for device DMA or can perform copy_to/from_user() to
> simulate DMA via the CPU.
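[A conceptual kernel-side sketch of that last case, assuming code placed in
drivers/vfio/vfio_iommu_type1.c where vfio_find_dma() and struct vfio_dma
are visible; the helper name mdev_sim_dma_read is invented.]

    /* Simulate a device DMA read via the CPU: resolve the guest iova to
     * the user vaddr recorded at MAP_DMA time, then copy_from_user(). */
    static int mdev_sim_dma_read(struct vfio_iommu *iommu, dma_addr_t iova,
                                 void *buf, size_t len)
    {
        struct vfio_dma *dma = vfio_find_dma(iommu, iova, len);

        if (!dma)
            return -EINVAL;
        /* dma->vaddr is stale after a re-exec -- hence the need for a
         * remap interface before any further simulated DMA. */
        if (copy_from_user(buf,
                           (void __user *)(dma->vaddr + (iova - dma->iova)),
                           len))
            return -EFAULT;
        return 0;
    }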
>
> Therefore I don't see that there's a simple re-architecture of the vfio
> IOMMU backend that could drop vaddr use.
Yes. I did not explain it on lkml as you do here (thanks), but I reached the
same conclusion.
> I'm a bit concerned this new
> remap proposal also raises the question of how do we prevent userspace
> remapping vaddrs racing with asynchronous kernel use of the previous
> vaddrs.
Agreed. From a quick glance at the code, holding iommu->lock during the
remap might be sufficient, but it needs more study.
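[A minimal sketch of that idea, assuming a handler placed next to
vfio_find_dma() in vfio_iommu_type1.c; whether the lock alone closes every
race with in-flight pinning is exactly what needs the further study.]

    static int vfio_dma_remap(struct vfio_iommu *iommu, dma_addr_t iova,
                              size_t size, unsigned long new_vaddr)
    {
        struct vfio_dma *dma;
        int ret = -ENOENT;

        mutex_lock(&iommu->lock);
        dma = vfio_find_dma(iommu, iova, size);
        /* Only an exact (iova, size) match may be remapped. */
        if (dma && dma->iova == iova && dma->size == size) {
            dma->vaddr = new_vaddr;   /* flags deliberately unchanged */
            ret = 0;
        }
        mutex_unlock(&iommu->lock);
        return ret;
    }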
> Are we expecting guest drivers/agents to quiesce the device,
> or maybe relying on clearing bus-master, for PCI devices, to halt DMA?
No. We want to support any guest, and the guest is not aware that qemu
live update is occurring.
> The vfio migration interface we've developed does have a mechanism to
> stop a device, would we need to use this here? If we do have a
> mechanism to quiesce the device, is the only reason we're not unmapping
> everything and remapping it into the new address space the latency in
> performing that operation? Thanks,
Same answer - we don't require that the guest has vfio migration support.
- Steve