Re: [Qemu-devel] Ensuring data is written to disk


From: Jamie Lokier
Subject: Re: [Qemu-devel] Ensuring data is written to disk
Date: Tue, 1 Aug 2006 22:50:46 +0100
User-agent: Mutt/1.4.1i

Jens Axboe wrote:
> > > > If you just want to evict all data from the drive's cache, and don't
> > > > actually have other data to write, there is a CACHEFLUSH command you
> > > > can send to the drive which will be more dependable than writing as
> > > > much data as the cache size.
> > > 
> > > Exactly, and this is what the OS fsync() should do once the drive has
> > > acknowledged that the data has been written (to cache). At least
> > > reiserfs w/barriers on Linux does this.
> > 
> > 1. Are you sure this happens, w/ reiserfs on Linux, even if the disk
> >    is an SATA or SCSI type that supports ordered tagged commands?  My
> >    understanding is that barriers force an ordering between write
> >    commands, and that CACHEFLUSH is used only with disks that don't have
> >    more sophisticated write ordering commands.  Is the data still
> >    committed to the disk platter before fsync() returns on those?
> 
> No SATA drive supports ordered tags; that is a SCSI-only property.
> Barrier writes are a separate thing; reiser probably ties the two
> together because it needs to know whether the flush cache command
> works as expected. Drives are funny sometimes...
> 
> For SATA you always need at least one cache flush (you need one if you
> have the FUA/Forced Unit Access write available, you need two if not).

Well, my question wasn't intended to be specific to ATA (sorry if that
wasn't clear); it was a general question about writing to disks on Linux.

And I don't understand your answer.  Are you saying that reiserfs on
Linux (presumably 2.6) commits data (and file metadata) to disk
platters before returning from fsync(), for all types of disk
including PATA, SATA and SCSI?  Or if not, is that a known property of
PATA only, or PATA and SATA only?  (And in all cases, presumably only
"ordinary" controllers can be depended on, not RAID controllers or
USB/Firewire bridges which ignore cache flushes for no good reason).

> > 2. Do you know if ext3 (in ordered mode) w/barriers on Linux does it too,
> >    for in-place writes which don't modify the inode and therefore don't
> >    have a journal entry?
> 
> I don't think that it does, however it may have changed. A quick grep
> would seem to indicate that it has not changed.

Ew.  What do databases do to be reliable then?  Or aren't they, on Linux?
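(For concreteness, this is the pattern whose durability is in question:
a minimal sketch, with a hypothetical file path, of the usual
append-a-record-then-fsync() commit step.)

    /* Sketch: the commit step databases depend on.  If fsync() only
     * reaches the drive's write cache, this is not durable across a
     * power cut. */
    #include <fcntl.h>
    #include <unistd.h>

    int commit_record(const char *path, const void *buf, size_t len)
    {
        int fd = open(path, O_WRONLY | O_APPEND | O_CREAT, 0600);
        if (fd < 0)
            return -1;
        if (write(fd, buf, len) != (ssize_t)len || fsync(fd) != 0) {
            close(fd);
            return -1;
        }
        return close(fd);   /* data may still sit in the drive cache */
    }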

> > On Darwin, fsync() does not issue CACHEFLUSH to the drive.  Instead,
> > it has an fcntl, F_FULLFSYNC, which does that; it is documented in
> > Darwin's fsync() man page as working with all Darwin's filesystems,
> > provided the hardware honours CACHEFLUSH or the equivalent.
> 
> That seems somewhat strange to me, I'd much rather be able to say that
> fsync() itself is safe. An added fcntl hack doesn't really help the
> applications that already rely on the correct behaviour.

According to the Darwin fsync(2) man page, Darwin is the only OS with
a facility to commit the data to disk platters.  (It claims to do this
for IDE, SCSI and FibreChannel.  With journalling filesystems, it asks
the journal to do the commit, but the cache flush still ultimately
reaches the disk.  Sounds like a good implementation to me.)

SQLite (a nice open source database) uses F_FULLFSYNC on Darwin to do
this, and it appears to add a large performance penalty relative to
fsync() alone.  People noticed and wondered why.
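(Roughly, the pattern is: try the fcntl, and fall back to plain
fsync() where it fails or is unavailable.  A sketch of the idea, not
SQLite's actual code:)

    /* Sketch: full-durability sync on Darwin.  F_FULLFSYNC asks the
     * drive itself to flush its write cache to the platter. */
    #include <fcntl.h>
    #include <unistd.h>

    int full_fsync(int fd)
    {
    #ifdef F_FULLFSYNC
        if (fcntl(fd, F_FULLFSYNC, 0) == 0)
            return 0;
        /* Some devices/filesystems reject it; fall back below. */
    #endif
        return fsync(fd);
    }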

Other OSes show performance similar to Darwin's when using fsync() alone.

So it looks like the man page is probably accurate: other OSes,
particularly including Linux, don't commit the data reliably to disk
platters when using fsync().

In which case, I'd imagine that's why Darwin has a separate option:
if Darwin's fsync() were many times slower than every other OS's, most
people would take that as a sign of a badly performing OS rather than
understanding the benefit.

> > from what little documentation I've found, on Linux it appears to be
> > much less predictable.  It seems that some filesystems, with some
> > kernel versions, and some mount options, on some types of disk, with
> > some drive settings, will commit data to a platter before fsync()
> > returns, and others won't.  And an application calling fsync() has no
> > easy way to find out.  Have I got this wrong?
> 
> Nope, I'm afraid that is pretty much true... reiser and (it looks like,
> just grepped) XFS have the best support for this. Unfortunately I don't
> think the user can actually tell whether the OS does the right thing,
> short of running blktrace and verifying that it actually sends a cache
> flush down the queue.

Ew.  So what do databases on Linux do?  Or are database commits
unreliable because of this?

> > ps. (An aside question): do you happen to know of a good patch which
> > implements IDE barriers w/ ext3 on 2.4 kernels?  I found a patch by
> > googling, but it seemed that the ext3 parts might not be finished, so
> > I don't trust it.  I've found turning off the IDE write cache makes
> > writes safe, but with a huge performance cost.
> 
> The hard part (the IDE code) can be grabbed from the SLES8 latest
> kernels, I developed and tested the code there. That also has the ext3
> bits, IIRC.

Thanks muchly!  I will definitely take a look at that.  I'm working on
a uClinux project which must use a 2.4 kernel, and performance with the
write cache off has been a real problem.  And with the write cache on,
I've seen fs corruption after power cycles many times, as expected.

It's a shame the ext3 bits don't do fsync() to the platter though. :-/

To reliably commit data to an ext3 file, should we do ioctl(block_dev,
HDIO_SET_WCACHE, 1) on 2.6 kernels on IDE?  (The side effects look to
me like they may create a barrier then flush the cache, even when it's
already enabled, but only on 2.6 kernels).  Or is there a better way?
(I don't see any way to do it on vanilla 2.4 kernels).
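(Spelled out, the workaround I have in mind looks like the sketch
below; /dev/hda is a placeholder for the real device, and whether the
ioctl really issues a barrier plus cache flush on a given kernel would
need verifying, e.g. with blktrace:)

    /* Sketch: re-enable the (already enabled) IDE write cache after
     * fsync(), hoping the 2.6 handler flushes the drive cache as a
     * side effect. */
    #include <fcntl.h>
    #include <sys/ioctl.h>
    #include <linux/hdreg.h>
    #include <unistd.h>

    int flush_ide_cache(const char *blkdev)   /* e.g. "/dev/hda" */
    {
        int fd = open(blkdev, O_RDONLY);
        if (fd < 0)
            return -1;
        int ret = ioctl(fd, HDIO_SET_WCACHE, 1);
        close(fd);
        return ret;
    }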

Should we switch to reiserfs, and expect fsync() to commit data
reliably only with that filesystem?  I realise these are a lot of
difficult questions, and they apply to more than just Qemu...

Still, the answers are relevant to Qemu and reliably emulating a disk
on Linux.  And relevant to most database users, I should think.

Thanks again,
-- Jamie



