Re: [Qemu-devel] KVM "fake DAX" flushing interface - discussion
From: Haozhong Zhang
Subject: Re: [Qemu-devel] KVM "fake DAX" flushing interface - discussion
Date: Fri, 21 Jul 2017 17:51:31 +0800
User-agent: NeoMutt/20170428 (1.8.2)
On 07/21/17 02:56 -0400, Pankaj Gupta wrote:
>
> Hello,
>
> We shared a proposal for 'KVM fake DAX flushing interface'.
>
> https://lists.gnu.org/archive/html/qemu-devel/2017-05/msg02478.html
>
In the above link:
"Overall goal of project
is to increase the number of virtual machines that can be
run on a physical machine, in order to *increase the density*
of customer virtual machines"
Is the fake persistent memory used as normal RAM in the guest? If not,
how is it expected to be used in the guest?
> We did an initial POC in which we used a 'virtio-blk' device to perform
> a device flush on pmem fsync on an ext4 filesystem. There are a few hacks
> to make things work. We need suggestions on the points below before we
> start the actual implementation.
>
> A] Problems to solve:
> ------------------
>
> 1] We are considering two approaches for 'fake DAX flushing interface'.
>
> 1.1] fake dax with NVDIMM flush hints & KVM async page fault
>
> - Existing interface.
>
> - The approach to use flush hint address is already nacked upstream.
>
> - Flush hint is not a queued interface for flushing. Applications might
> avoid using it.
>
> - Flush hint address traps from guest to host and does an entire fsync
> on the backing file, which is itself costly.
>
> - Can be used to flush specific pages on host backing disk. We can
> send data(pages information) equal to cache-line size(limitation)
> and tell host to sync corresponding pages instead of entire disk sync.
>
> - This will be an asynchronous operation and vCPU control is returned
> quickly.
>
>
> 1.2] Using additional paravirt device in addition to pmem device
> (fake dax with device flush)
>
> - New interface
>
> - Guest maintains information of DAX dirty pages as exceptional entries
> in radix tree.
>
> - If we want to flush specific pages from guest to host, we need to send
> list of the dirty pages corresponding to file on which we are doing
> fsync.
>
> - This will require implementation of new interface, a new paravirt
> device for sending flush requests.
>
> - Host side will perform fsync/fdatasync on list of dirty pages or
> entire block device backed file.
>
> 2] Questions:
> -----------
>
> 2.1] Not sure why WPQ flush is not a queued interface? We can force
> applications to call this? device DAX neither calls fsync/msync?
>
> 2.2] Depending upon interface we decide, we need optimal solution to sync
> range of pages?
>
> - Send range of pages from guest to host to sync asynchronously instead
> of syncing entire block device?
e.g. a new virtio device to deliver sync requests to host?
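To make that question concrete, here is a minimal sketch of what such a sync request could carry: a small header followed by (offset, length) byte ranges built from the guest's dirty page indexes. The wire layout and every name in it (`FLUSH_REQ_MAGIC`, `coalesce_pages`, `encode_flush_request`, `decode_flush_request`) are hypothetical, chosen only for illustration; none of this comes from the virtio specification or from the proposal.

```python
import struct

PAGE_SIZE = 4096  # assumed guest page size for this sketch

# Hypothetical wire layout, not part of any virtio specification:
#   header: u32 magic, u32 number of ranges
#   range:  u64 byte offset into the backing file, u64 byte length
FLUSH_REQ_MAGIC = 0x464C5348  # arbitrary value chosen for this sketch
_HEADER = struct.Struct("<II")
_RANGE = struct.Struct("<QQ")


def coalesce_pages(page_indexes):
    """Merge dirty page indexes into sorted (offset, length) byte ranges."""
    ranges = []
    for index in sorted(set(page_indexes)):
        offset = index * PAGE_SIZE
        if ranges and ranges[-1][0] + ranges[-1][1] == offset:
            # Page is adjacent to the previous range: extend that range.
            ranges[-1] = (ranges[-1][0], ranges[-1][1] + PAGE_SIZE)
        else:
            ranges.append((offset, PAGE_SIZE))
    return ranges


def encode_flush_request(page_indexes):
    """Guest side: pack the coalesced ranges into one request buffer."""
    ranges = coalesce_pages(page_indexes)
    body = b"".join(_RANGE.pack(off, length) for off, length in ranges)
    return _HEADER.pack(FLUSH_REQ_MAGIC, len(ranges)) + body


def decode_flush_request(buf):
    """Host side: recover the list of (offset, length) ranges to sync."""
    magic, count = _HEADER.unpack_from(buf, 0)
    if magic != FLUSH_REQ_MAGIC:
        raise ValueError("bad magic in flush request")
    return [
        _RANGE.unpack_from(buf, _HEADER.size + i * _RANGE.size)
        for i in range(count)
    ]
```

Coalescing adjacent pages keeps the request small when the guest dirties long runs of a file, which matters if the transport limits how much data one request can carry.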
>
> - Other option is to sync entire disk backing file to make sure all the
> writes are persistent. In our case, backing file is a regular file on
> non NVDIMM device so host page cache has list of dirty pages which
> can be used either with fsync or similar interface.
As the number of dirty pages can vary, the latency of each host fsync
is likely to vary over a large range.
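A rough way to observe this on a host is to dirty different amounts of data in a file and time a single fsync each time. The sketch below does that with a temporary file; the helper name `fsync_latency` is hypothetical, and the numbers depend entirely on the file system and storage behind the temporary directory (on tmpfs, for example, fsync has almost nothing to do).

```python
import os
import tempfile
import time


def fsync_latency(dirty_bytes, chunk=1 << 20):
    """Write dirty_bytes to a fresh temp file, then time one os.fsync()."""
    with tempfile.NamedTemporaryFile() as f:
        remaining = dirty_bytes
        while remaining > 0:
            n = min(chunk, remaining)
            f.write(b"\0" * n)
            remaining -= n
        f.flush()  # move Python's userspace buffer into the page cache
        start = time.perf_counter()
        os.fsync(f.fileno())  # ask the kernel to write the dirty pages out
        return time.perf_counter() - start


if __name__ == "__main__":
    for mib in (1, 16, 64):
        seconds = fsync_latency(mib << 20)
        print(f"{mib:3d} MiB dirty: {seconds * 1e3:.2f} ms")
```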
>
> 2.3] If we do host fsync on entire disk we will be flushing all the dirty
> data to backend file. Just thinking what would be better approach, flushing
> pages on corresponding guest file fsync or entire block device?
>
> 2.4] If we decide to choose one of the above approaches, we need to consider
> all DAX supporting filesystems (ext4/xfs). Would hooking code into the
> corresponding fsync code of each fs seem reasonable? Just thinking for flush
> hint address use-case? Or how would flush hint addresses be invoked with
> fsync or a similar API?
>
> 2.5] Also with filesystem journalling and other mount options like barriers,
> ordered etc, how do we decide to use page flush hint or regular fsync on
> file?
>
> 2.6] If at the guest side we have the PFNs of all the dirty pages in the
> radix tree and we send these to the host, would the host side be able to
> find the corresponding pages and flush them all?
That may require the host file system to provide an API to flush
specified blocks/extents and their metadata. I'm not familiar with
this part and don't know whether such an API exists.
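For the data half of that question, one user-space possibility is to map the backing file and call msync(2) on just the affected ranges. The sketch below does this through Python's `mmap.flush(offset, size)`, which wraps msync on POSIX systems. The helper name `flush_ranges` is hypothetical, and I am not claiming this covers the metadata side: whether the file system also persists the relevant metadata for a ranged sync is exactly the open point. Linux additionally has sync_file_range(2), but its man page notes that it does not write out file metadata.

```python
import mmap
import os
import tempfile


def flush_ranges(fd, ranges):
    """msync() each page-aligned (offset, length) range of the file behind fd.

    Returns the number of ranges flushed. Data only is the intent here;
    metadata handling is left to the host file system and is not verified.
    """
    size = os.fstat(fd).st_size
    flushed = 0
    with mmap.mmap(fd, size, access=mmap.ACCESS_WRITE) as m:
        for offset, length in ranges:
            if offset % mmap.PAGESIZE:
                raise ValueError("offset must be page aligned")
            if offset >= size or length <= 0:
                raise ValueError("range lies outside the file")
            # Clamp the range so it never runs past the end of the mapping.
            m.flush(offset, min(length, size - offset))
            flushed += 1
    return flushed


if __name__ == "__main__":
    with tempfile.TemporaryFile() as f:
        f.write(b"x" * (4 * mmap.PAGESIZE))
        f.flush()
        # Sync only the first and third pages of the file.
        wanted = [(0, mmap.PAGESIZE), (2 * mmap.PAGESIZE, mmap.PAGESIZE)]
        print(flush_ranges(f.fileno(), wanted))
```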
Haozhong