Re: [Qemu-devel] [PATCH] ide.c make write cacheing controllable by guest


From: Jamie Lokier
Subject: Re: [Qemu-devel] [PATCH] ide.c make write cacheing controllable by guest
Date: Tue, 26 Feb 2008 17:25:35 +0000
User-agent: Mutt/1.5.13 (2006-08-11)

Ian Jackson wrote:
> Jamie Lokier writes ("Re: [Qemu-devel] [PATCH] ide.c make write cacheing 
> controllable by guest"):
> > I'm imagining that fdatasync() will flush the necessary metadata,
> > including file size, when a file is extended.  As would O_DSYNC.
> 
> Do you have a reference to support this supposition ?

Not a _standard_, of course, as you found with SUSv3.  More a folk
understanding, which admittedly might be lacking in some
implementations (like Linux perhaps...).

Take a look at your references.

> HP-UX 11i's fdatasync manpage:
> 
>       fdatasync() causes all modified data and file attributes of fildes
                                                 ^^^^^^^^^^^^^^^^^^^^^^^^^
>       required to retrieve the data to be written to disk.
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

That means size, bitmap updates, block pointers, extents etc. needed
to retrieve the data.

> The glibc info manual:
> 
>      Sometimes it is not even necessary to write all data associated with
>   a file descriptor.  E.g., in database files which do not change in size
>   it is enough to write all the file content data to the device.

A bit more from Glibc:

   Meta-information, like the modification time etc., are not that
   important and leaving such information uncommitted does not prevent a
   successful recovering of the file in case of a problem.

   When a call to the `fdatasync' function returns, it is ensured
   that all of the file data is written to the device.  For all
   pending I/O operations, the parts guaranteeing data integrity
                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
   finished.
   ^^^^^^^^

Draw your own conclusion.

> The Solaris manpage says that fdatasync does the same as O_DSYNC,

That's right, it's the common meaning of O_DSYNC.

> and it calls the service "synchronized I/O data integrity
> completion" which is defined by the `Programming Interfaces Guide'
> to include this:
>
>  * For writes, the operation has been completed, or diagnosed if
>    unsuccessful. The write operation succeeds when the data specified
>    in the write request is successfully transferred. Furthermore, all
>    file system information required to retrieve the data must be
     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>    successfully transferred.
     ^^^^^^^^^^^^^^^^^^^^^^^^

That's quite clear.

> But then the next bullet point is this:
> 
>  * File attributes that are not necessary for data retrieval are not
>    transferred prior to returning to the calling process.
> 
> which says `are not transferred' when it ought to say `are not
> necessarily transferred' so it may be unwise to rely on the precise
> wording.

That's fine and consistent with the previous text.  It means size
increase, bitmaps, pointers, extents etc. are written (those are the
attributes necessary for data retrieval).

Attributes like modification time, access time, change time,
permissions etc. are not (necessarily) transferred.  You're right it
should say "not necessarily", but that's implicit: they can be
transferred at any time anyway, by normal background writeback.

> I looked at various other manpages but they all say useless things
> like `metadata such as modification time' which leaves open the
> question of whether the file size is included.

I agree it's a bit ambiguous.  My understanding is that _increases_ in
size are included, by convention as much as anything, since the larger
size is necessary to retrieve the data later.

This is supported by the fact that O_DSYNC has a tendency to become
very slow on some systems when extending a file, compared with writing
in place.

> If the size is supposed to be included then the OS is required to keep
> a flag to say whether the file has been extended so that it knows that
> the next fdatasync ought really to be an fsync and write the inode
> too.  (In a traditional filesystem structure.)

That's right.

> Or perhaps fsck needs
> to extend the file as necessary to include the data blocks past the
> nominal end of file.

Well, in general, if your system is such that fsck following a crash
is part of normal filesystem operations, then fsck could be allowed to
do a lot more than extend the size attribute.

That doesn't matter to the application, though.  What matters is that
it writes data (including extending the file), calls fdatasync() (or
uses O_DSYNC), and when fdatasync() returns, it knows that after a
crash and recovery it will be able to retrieve that data, with the
appropriate confidence level.
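
A minimal sketch of that application-side pattern, assuming a POSIX
system that provides fdatasync().  The helper name append_durable() is
hypothetical, made up for illustration.  It appends data (so the file
grows) and only reports success once fdatasync() has returned without
error.  Whether the size change is really on disk at that point is
exactly the implementation question discussed above.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: append len bytes to path and return 0 only
 * after fdatasync() has succeeded.  Returns -1 on any error. */
int append_durable(const char *path, const void *buf, size_t len)
{
    int fd = open(path, O_WRONLY | O_CREAT | O_APPEND, 0644);
    if (fd < 0)
        return -1;

    const char *p = buf;
    while (len > 0) {
        ssize_t n = write(fd, p, len);
        if (n < 0) {
            if (errno == EINTR)
                continue;       /* interrupted before writing: retry */
            close(fd);
            return -1;
        }
        p += n;                 /* handle short writes */
        len -= (size_t)n;
    }

    /* Ask for the data, plus whatever is needed to retrieve it,
     * to reach the device before we report success. */
    if (fdatasync(fd) < 0) {
        close(fd);
        return -1;
    }
    return close(fd);
}
```

The caller treats a 0 return as "this data should survive a crash",
with the caveat above about what a given system actually syncs.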

> This seems like rather a minefield.

The implementation details seem like a minefield, but the intent and
documentation and tradition of fdatasync() seem quite clear to me.

However, I suppose you might want to be careful and check, when
deploying your new database which depends on fdatasync(), if the
target systems really do sync size changes :-)

It's easy enough to check, as it greatly slows down extending writes.
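
A rough sketch of such a check, assuming POSIX clock_gettime() and
fdatasync() are available.  The function name time_synced_writes() and
the 4096-byte block size are arbitrary choices for illustration.  Run
it once on a file that does not exist yet (every write extends the
file) and once more on the same file (every write is in place), then
compare the two times.  It is only a heuristic: a large gap suggests
the size is being synced, but caching and filesystem details can blur
the result.

```c
#define _POSIX_C_SOURCE 200809L
#include <fcntl.h>
#include <sys/types.h>
#include <time.h>
#include <unistd.h>

#define BLOCK 4096

/* Hypothetical helper: write count blocks to path starting at offset 0,
 * calling fdatasync() after each one.  The file is created if missing
 * and is never truncated, so a first run extends it and a second run
 * overwrites in place.  Returns elapsed seconds, or -1.0 on error. */
double time_synced_writes(const char *path, int count)
{
    static const char block[BLOCK];     /* zero-filled payload */
    struct timespec t0, t1;

    int fd = open(path, O_WRONLY | O_CREAT, 0644);
    if (fd < 0)
        return -1.0;

    if (clock_gettime(CLOCK_MONOTONIC, &t0) < 0) {
        close(fd);
        return -1.0;
    }
    for (int i = 0; i < count; i++) {
        off_t off = (off_t)i * BLOCK;
        /* A short write is treated as a failure in this sketch. */
        if (pwrite(fd, block, BLOCK, off) != BLOCK || fdatasync(fd) < 0) {
            close(fd);
            return -1.0;
        }
    }
    if (clock_gettime(CLOCK_MONOTONIC, &t1) < 0) {
        close(fd);
        return -1.0;
    }
    close(fd);

    return (double)(t1.tv_sec - t0.tv_sec)
         + (double)(t1.tv_nsec - t0.tv_nsec) / 1e9;
}
```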

But I suppose, for an app writer, as you know it's going to involve a
slower than normal write anyway, it's also easy enough to extend by a
big chunk then call fsync() once, if you prefer to not have to trust
fdatasync() on this.
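
A sketch of that approach, assuming POSIX posix_fallocate(), pwrite()
and fdatasync().  Both helper names are hypothetical.  Using
posix_fallocate() to do the extending is my assumption here; it also
allocates the blocks, whereas a plain ftruncate() would only change the
size.  The size change is committed once with fsync(), and later writes
stay inside the committed size, so they only need fdatasync().

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: grow the file to at least new_size bytes and
 * commit the new size with a single fsync().  Returns 0 or -1. */
int extend_and_fsync(int fd, off_t new_size)
{
    /* posix_fallocate() returns an error number, not -1/errno. */
    if (posix_fallocate(fd, 0, new_size) != 0)
        return -1;
    return fsync(fd);
}

/* Hypothetical helper: overwrite len bytes at offset off, which the
 * caller keeps inside the already-committed size, then fdatasync(). */
int write_in_place(int fd, const void *buf, size_t len, off_t off)
{
    const char *p = buf;
    while (len > 0) {
        ssize_t n = pwrite(fd, p, len, off);
        if (n < 0) {
            if (errno == EINTR)
                continue;       /* interrupted before writing: retry */
            return -1;
        }
        p += n;                 /* handle short writes */
        off += n;
        len -= (size_t)n;
    }
    return fdatasync(fd);
}
```

The cost of the slow extending sync is paid once per chunk rather than
once per write.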

-- Jamie