qemu-devel
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [Qemu-devel] semantics of FIEMAP without FIEMAP_FLAG_SYNC (was Re: [


From: Niels de Vos
Subject: Re: [Qemu-devel] semantics of FIEMAP without FIEMAP_FLAG_SYNC (was Re: [PATCH v5 13/14] nbd: Implement NBD_CMD_WRITE_ZEROES on server)
Date: Wed, 20 Jul 2016 15:35:17 +0200
User-agent: Mutt/1.6.1 (2016-04-27)

On Wed, Jul 20, 2016 at 10:30:25PM +1000, Dave Chinner wrote:
> On Wed, Jul 20, 2016 at 05:19:37AM -0400, Paolo Bonzini wrote:
> > Adding ext4 and XFS guys (Lukas and Dave respectively).  As a quick recap, 
> > the
> > issue here is the semantics of FIEMAP and SEEK_HOLE/SEEK_DATA, which we use 
> > in
> > "qemu-img map".  This command prints metadata about a virtual disk 
> > image---which
> > in the case of a raw image amounts to detecting holes and unwritten extents.
> > 
> > First, it seems like SEEK_HOLE and SEEK_DATA really should be 
> > "SEEK_NONZERO" and
> > "SEEK_ZERO", on both ext4 and XFS.    You can see that unwritten extents are
> > reported by "qemu-img map" as holes:
> 
> Correctly so. seek hole/data knows nothing about the underlying
> filesystem storage implementation. A "hole" is defined as a region
> of zeros, and the filesystem is free call anything it kows for
> certain contains zeros as a hole.
> 
> FYI, SEEK_HOLE/DATA were named after the the pre-existing Solaris
> seek API that uses these definitions for hole and data....
> 
> >     $ dd if=/dev/urandom of=test.img bs=1M count=100
> >     $ fallocate -z -o 10M -l 10M test.img
> >     $ du -h test.img
> >     $ qemu-img map --output=json test.img
> >     [{ "start": 0, "length": 10485760, "depth": 0, "zero": false, "data": 
> > true, "offset": 0},
> >     { "start": 10485760, "length": 10485760, "depth": 0, "zero": true, 
> > "data": false, "offset": 10485760},
> >     { "start": 20971520, "length": 83886080, "depth": 0, "zero": false, 
> > "data": true, "offset": 20971520}]
> 
> > 
> > On the second line, zero=true data=false identifies a hole.  The right 
> > output
> > would either have zero=true data=true (unwritten extent) or just
> > 
> >     [{ "start": 0, "length": 104857600, "depth": 0, "zero": false, "data": 
> > true, "offset": 0},
> >
> > since the zero flag is advisory (it doesn't try to detect zeroes beyond 
> > what the
> > filesystem says).
> 
> Not from SEEK_HOLE/SEEK_DATA. All it conveys is "this is the start
> of a range of zeros" and "this is the start of a range of data". And
> for filesystems that don't specifically implement these seek
> operations, zeros are considered data. i.e. SEEK_HOLE will take you
> to the end of file, SEEK_DATA returns the current position....
> 
> i.e. unwritten extents contain no data, so they are semantically
> identical to holes for the purposes of seeking and hence SEEK_DATA
> can skip over them.
> 
> > The reason why we disabled FIEMAP was a combination of a corruption and 
> > performance
> > issue.  The data corruption bug was at 
> > https://bugs.launchpad.net/qemu/+bug/1368815
> > and it was reported on Ubuntu Trusty (kernel 3.13 based on the release 
> > notes at
> > https://wiki.ubuntu.com/TrustyTahr/ReleaseNotes).  We corrected that by 
> > using
> > FIEMAP_FLAG_SYNC, based on a similar patch to coreutils.  This turned out 
> > to be too
> > slow, so we dropped FIEMAP altogether.
> 
> Yes, because FIEMAP output is only useful for diagnostic purposes as
> it can be stale even before the syscall returns to userspace. i.e.
> it can't be used in userspace for optimising file copies....

Oh... And I was surprised to learn that "cp" does use FIEMAP and not
SEEK_HOLE/SEEK_DATA. You give a strong suggestion to fix that.
  http://git.savannah.gnu.org/cgit/coreutils.git/tree/src/extent-scan.c#n87

> Finding regions of valid data in a file is what SEEK_HOLE/SEEK_DATA
> is for, not FIEMAP. FIEMAP only reports the physical layout of the
> file, now where the currently valid data in the file lies.
> 
> > However, today I found kernel commit 91dd8c114499 ("ext4: prevent race 
> > while walking
> > extent tree for fiemap", 2012-11-28) whose commit message says:
> > 
> >     Moreover the extent currently in delayed allocation might be allocated
> >     after we search the extent tree and before we search extent status tree
> >     delayed buffers resulting in those delayed buffers being completely
> >     missed, even though completely written and allocated.
> > 
> > This seems pretty much like our data corruption bug; it would mean that
> > using FIEMAP_FLAG_SYNC was only working around a bug and delayed allocations
> > _should_ be reported as usual by FIEMAP.
> 
> What do you think you can do with the delayed allocation regions in
> the file?
> 
> Indeed, with specualtive delayed allocation, XFS can report delalloc
> regions that have no actual resemblence to the layout of dirty data
> in the file. SEEK_DATA will iterate the delalloc regions containing
> data precisely, however.
> 
> > Except that the commit went in kernel 3.8 and as said above Trusty had 3.13.
> > So either there are other bugs, or my understanding of the commit is not 
> > correct.
> > So the questions for Lukas and Dave are:
> > 
> > 1) is it expected that SEEK_HOLE skips unwritten extents?
> 
> There are multiple answers to this, all of which are correct depending
> on current context and state:
> 
> 1. No - some filesystems will report clean unwritten extents as holes.
> 
> 2. Yes - some filesystems will report clean unwritten extents as
> data.
> 
> 3.  Maybe - if there is written data in memory over the unwritten
> extent on disk (i.e. hasn't been flushed to disk, it will be
> considered a data region with non-zero data. (FIEMAP will still
> report is as unwritten)
> 
> > If not, would
> > it be acceptable to introduce Linux-specific SEEK_ZERO/SEEK_NONZERO, which
> > would be similar to what SEEK_HOLE/SEEK_DATA do now?
> 
> To solve what problem? You haven't explained what problem you are
> trying to solve yet.

I think this is about optimizing file copy. There is a difference
between pre-allocated (but unwritten) and zero'd out extents. In the
Gluster network protocol, we could use FALLOCATE for the first, and
ZEROFILL for the second. Depending on what the underlaying filesystem
and storage support, performance would differ.

But, maybe this distinction is not important enough to optimize for?
That is not something I can judge for the QEMU use-case.

> > 2) for FIEMAP do we really need FIEMAP_FLAG_SYNC?  And if not, for what
> > filesystems and kernel releases is it really not needed?
> 
> I can't answer this question, either, because I don't know what
> you want the fiemap information for.

Great details, thanks for sharing!

HTH,
Niels

Attachment: signature.asc
Description: PGP signature


reply via email to

[Prev in Thread] Current Thread [Next in Thread]