From: Haozhong Zhang
Subject: Re: [Qemu-devel] [RFC PATCH 0/4] nvdimm: enable flush hint address structure
Date: Tue, 11 Apr 2017 16:41:33 +0800
User-agent: Mutt/1.6.2-neo (2016-08-21)
On 04/06/17 20:02 +0800, Xiao Guangrong wrote:
>
>
> On 04/06/2017 05:43 PM, Stefan Hajnoczi wrote:
> > On Fri, Mar 31, 2017 at 04:41:43PM +0800, Haozhong Zhang wrote:
> > > This patch series constructs the flush hint address structures for
> > > nvdimm devices in QEMU.
> > >
> > > It's of course not for 2.9. I'm sending it out early in order to get
> > > comments on one point I'm uncertain about (see the detailed explanation
> > > below). Thanks in advance for any comments!
> > >
> > > Background
> > > ----------
> >
> > Extra background:
> >
> > Flush Hint Addresses are necessary because:
> >
> > 1. Some hardware configurations may require them. In other words, a
> > cache flush instruction is not enough to persist data.
> >
> > 2. The host file system may need fsync(2) calls (e.g. to persist
> > metadata changes).
> >
> > Without Flush Hint Addresses only some NVDIMM configurations actually
> > guarantee data persistence.
> >
> > > A flush hint address structure is a substructure of the NFIT and specifies
> > > one or more addresses, namely Flush Hint Addresses. Software can write
> > > to any one of these flush hint addresses to cause any preceding writes
> > > to the NVDIMM region to be flushed out of the intervening platform
> > > buffers to the targeted NVDIMM. More details can be found in ACPI Spec
> > > 6.1, Section 5.2.25.8 "Flush Hint Address Structure".
> >
> > Do you have performance data? I'm concerned that the Flush Hint Address
> > hardware interface is not virtualization-friendly.
> >
> > In Linux drivers/nvdimm/region_devs.c:nvdimm_flush() does:
> >
> >     wmb();
> >     for (i = 0; i < nd_region->ndr_mappings; i++)
> >         if (ndrd_get_flush_wpq(ndrd, i, 0))
> >             writeq(1, ndrd_get_flush_wpq(ndrd, i, idx));
> >     wmb();
> >
> > That looks pretty lightweight - it's an MMIO write between write
> > barriers.
> >
> > This patch implements the MMIO write like this:
> >
> > void nvdimm_flush(NVDIMMDevice *nvdimm)
> > {
> >     if (nvdimm->backend_fd != -1) {
> >         /*
> >          * If the backend store is a physical NVDIMM device, fsync()
> >          * will trigger the flush via the flush hint on the host device.
> >          */
> >         fsync(nvdimm->backend_fd);
> >     }
> > }
> >
> > The MMIO store instruction is turned into a synchronous fsync(2) system
> > call plus a vmexit/vmenter and a QEMU userspace context switch:
> >
> > 1. The vcpu blocks during the fsync(2) system call. The MMIO write
> > instruction has an unexpected and huge latency.
> >
> > 2. The vcpu thread holds the QEMU global mutex so all other threads
> > (including the monitor) are blocked during fsync(2). Other vcpu
> > threads may block if they vmexit.
> >
> > It is hard to implement this efficiently in QEMU. This is why I said
> > the hardware interface is not virtualization-friendly. It's cheap on
> > real hardware but expensive under virtualization.
> >
> > We should think about the optimal way of implementing Flush Hint
> > Addresses in QEMU. But if there is no reasonable way to implement them
> > then I think it's better *not* to implement them, just like the Block
> > Window feature which is also not virtualization-friendly. Users who
> > want a block device can use virtio-blk. I don't think NVDIMM Block
> > Window can achieve better performance than virtio-blk under
> > virtualization (although I'm happy to be proven wrong).
> >
> > Some ideas for a faster implementation:
> >
> > 1. Use memory_region_clear_global_locking() to avoid taking the QEMU
> > global mutex. Little synchronization is necessary as long as the
> > NVDIMM device isn't hot unplugged (not yet supported anyway). (A
> > sketch of this idea follows the list below.)
> >
> > 2. Can the host kernel provide a way to mmap Address Flush Hints from
> > the physical NVDIMM in cases where the configuration does not require
> > host kernel interception? That way QEMU can map the physical
> > NVDIMM's Address Flush Hints directly into the guest. The hypervisor
> > is bypassed and performance would be good.
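
[For reference, a minimal, untested sketch of how idea 1 might look: the
flush-hint MMIO region gets its own ops and the global-locking requirement
is cleared, so the write callback can run without the BQL. Only
memory_region_init_io() and memory_region_clear_global_locking() are the
real QEMU memory API; the other names are illustrative and not part of this
series, and the backend_fd field is taken from the quoted patch above.]

#include "qemu/osdep.h"
#include "exec/memory.h"
#include "hw/mem/nvdimm.h"

/* Illustrative only: a flush-hint MMIO write handler meant to run
 * without holding the QEMU global mutex. */
static void nvdimm_flush_hint_write(void *opaque, hwaddr addr,
                                    uint64_t data, unsigned size)
{
    NVDIMMDevice *nvdimm = opaque;

    /* Still a synchronous flush in this sketch. */
    if (nvdimm->backend_fd != -1) {
        fsync(nvdimm->backend_fd);
    }
}

static const MemoryRegionOps nvdimm_flush_hint_ops = {
    .write = nvdimm_flush_hint_write,
    .endianness = DEVICE_LITTLE_ENDIAN,
};

static void nvdimm_init_flush_hint_region(NVDIMMDevice *nvdimm,
                                          MemoryRegion *mr, Object *owner)
{
    memory_region_init_io(mr, owner, &nvdimm_flush_hint_ops, nvdimm,
                          "nvdimm-flush-hint", 4096);
    /* Dispatch accesses to this region without taking the BQL. */
    memory_region_clear_global_locking(mr);
}
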
> >
> > I'm not sure there is anything we can do to make the case where the host
> > kernel wants an fsync(2) fast :(.
>
> Good point.
>
> We can assume that flushing the CPU cache to make data persistent is
> always available on Intel hardware, so the flush hint table is not
> needed if the vNVDIMM is based on a real Intel NVDIMM device.
>
We can let users of QEMU (e.g. libvirt) detect whether the backend
device supports ADR, and pass the 'flush-hint' option to QEMU only if
ADR is not supported.
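
For illustration only, such an invocation might look roughly like the
following; the 'flush-hint' property name comes from this RFC and its exact
spelling and placement may differ, and the backend path and sizes are made
up:

    qemu-system-x86_64 ... \
        -machine pc,nvdimm=on \
        -object memory-backend-file,id=mem1,share=on,mem-path=/dev/pmem0,size=4G \
        -device nvdimm,id=nv1,memdev=mem1,flush-hint=on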
> If the vNVDIMM device is based on a regular file, I think fsync is
> the bottleneck rather than this MMIO virtualization. :(
>
Yes, fsync() on a regular file is the bottleneck. We may either
1/ perform the host-side flush asynchronously, so that it does not
   block the vcpu for too long (a rough sketch follows below),
or
2/ not provide a strong durability guarantee for a non-NVDIMM backend
   and not emulate the flush hint for the guest at all. (I know 1/
   does not provide a strong durability guarantee either.)
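
As a rough, untested sketch of 1/, the fsync() could be pushed into QEMU's
existing thread pool so that the vcpu thread returns from the MMIO write
immediately; the function names below (nvdimm_flush_async() etc.) are
illustrative and not part of this series:

#include "qemu/osdep.h"
#include "block/aio.h"
#include "block/thread-pool.h"

/* Worker: runs in a thread-pool thread, not in the vcpu thread. */
static int nvdimm_flush_worker(void *opaque)
{
    int fd = (intptr_t)opaque;

    return fsync(fd);
}

/* Completion callback in the main loop.  The guest has long since
 * retired the MMIO write, so this is only best-effort durability. */
static void nvdimm_flush_done(void *opaque, int ret)
{
}

static void nvdimm_flush_async(int backend_fd)
{
    ThreadPool *pool = aio_get_thread_pool(qemu_get_aio_context());

    thread_pool_submit_aio(pool, nvdimm_flush_worker,
                           (void *)(intptr_t)backend_fd,
                           nvdimm_flush_done, NULL);
}
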
Haozhong