
Re: [Qemu-devel] Ensuring data is written to disk


From: Jens Axboe
Subject: Re: [Qemu-devel] Ensuring data is written to disk
Date: Wed, 2 Aug 2006 08:51:09 +0200

On Tue, Aug 01 2006, Jamie Lokier wrote:
> Jens Axboe wrote:
> > > > > If you just want to evict all data from the drive's cache, and don't
> > > > > actually have other data to write, there is a CACHEFLUSH command you
> > > > > can send to the drive which will be more dependable than writing as
> > > > > much data as the cache size.
> > > > 
> > > > Exactly, and this is what the OS fsync() should do once the drive has
> > > > acknowledged that the data has been written (to cache). At least
> > > > reiserfs w/barriers on Linux does this.
> > > 
> > > 1. Are you sure this happens, w/ reiserfs on Linux, even if the disk
> > >    is an SATA or SCSI type that supports ordered tagged commands?  My
> > >    understanding is that barriers force an ordering between write
> > >    commands, and that CACHEFLUSH is used only with disks that don't have
> > >    more sophisticated write ordering commands.  Is the data still
> > >    committed to the disk platter before fsync() returns on those?
> > 
> > No SATA drive supports ordered tags, that is a SCSI-only property.
> > Barrier writes are a separate thing; reiser probably ties the two
> > together because it needs to know whether the flush cache command works
> > as expected. Drives are funny sometimes...
> > 
> > For SATA you always need at least one cache flush (you need one if you
> > have the FUA/Forced Unit Access write available, you need two if not).
> 
> Well my question wasn't intended to be specific to ATA (sorry if that
> wasn't clear), but a general question about writing to disks on Linux.
> 
> And I don't understand your answer.  Are you saying that reiserfs on
> Linux (presumably 2.6) commits data (and file metadata) to disk
> platters before returning from fsync(), for all types of disk
> including PATA, SATA and SCSI?  Or if not, is that a known property of
> PATA only, or PATA and SATA only?  (And in all cases, presumably only
> "ordinary" controllers can be depended on, not RAID controllers or
> USB/Firewire bridges which ignore cache flushes for no good reason).

blkdev_issue_flush() is brutal, but it works on SATA/PATA/SCSI. So yes,
it should be reliable.

> > > 2. Do you know if ext3 (in ordered mode) w/barriers on Linux does it too,
> > >    for in-place writes which don't modify the inode and therefore don't
> > >    have a journal entry?
> > 
> > I don't think that it does, however it may have changed. A quick grep
> > would seem to indicate that it has not changed.
> 
> Ew.  What do databases do to be reliable then?  Or aren't they, on Linux?

They probably run on better storage than commodity SATA drives with
write-back caching enabled. To my knowledge, Linux is one of the only
OSes that even attempt to fix this.

> > > On Darwin, fsync() does not issue CACHEFLUSH to the drive.  Instead,
> > > it has an fcntl, F_FULLFSYNC, which does that; it is documented in
> > > Darwin's fsync() man page as working with all of Darwin's filesystems,
> > > provided the hardware honours CACHEFLUSH or the equivalent.
> > 
> > That seems somewhat strange to me, I'd much rather be able to say that
> > fsync() itself is safe. An added fcntl hack doesn't really help the
> > applications that already rely on the correct behaviour.
> 
> According to the Darwin fsync(2) man page, it claims Darwin is the
> only OS which has a facility to commit the data to disk platters.
> (And it claims to do this with IDE, SCSI and FibreChannel.  With
> journalling filesystems, it requests the journal to do the commit but
> the cache flush still ultimately reaches the disk.  Sounds like a good
> implementation to me).

The implementation may be nice, but it's the idea that is appalling to
me. But it sounds like the Darwin man page is out of date, or at least
untrue.

> SQLite (a nice open source database) will use F_FULLFSYNC on Darwin to
> do this, and it appears to add a large performance penalty relative to
> using fsync() alone.  People noticed and wondered why.

Disk cache flushes are nasty, they stall everything. But it's still
typically faster than disabling write back caching, so...

> Other OSes show similar performance as Darwin with fsync() only.
> 
> So it looks like the man page is probably accurate: other OSes,
> particularly including Linux, don't commit the data reliably to disk
> platters when using fsync().

How did you reach that conclusion? reiser certainly does it if you have
barriers enabled (which you need anyways to be safe with write back
caching), and with a little investigation we can perhaps conclude that
XFS is safe as well.

> In which case, I'd imagine that's why Darwin has a separate option,
> because if Darwin's fsync() was many times slower than all the other
> OSes, most people would take that as a sign of a badly performing OS,
> rather than understanding the benefits.

That sounds like marketing driven engineering, nice. It requires app
changes, which is pretty silly. I would much rather have a way of just
enabling/disabling full flush on a per-device basis, you could use the
cache type as the default indicator of whether to issue the cache flush
or not. Then let the admin override it, if he wants to run unsafe but
faster.

> > > from what little documentation I've found, on Linux it appears to be
> > > much less predictable.  It seems that some filesystems, with some
> > > kernel versions, and some mount options, on some types of disk, with
> > > some drive settings, will commit data to a platter before fsync()
> > > returns, and others won't.  And an application calling fsync() has no
> > > easy way to find out.  Have I got this wrong?
> > 
> > Nope, I'm afraid that is pretty much true... reiser and (it looks like,
> > just grepped) XFS have the best support for this. Unfortunately I don't
> > think the user can actually tell whether the OS does the right thing,
> > outside of running a blktrace and verifying that it actually sends a
> > flush cache down the queue.
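The blktrace check described above looks roughly like this (device name and trace file are illustrative; it needs root and a kernel with block-layer tracing enabled):

```shell
# Start tracing the device while an fsync()-heavy workload runs.
blktrace -d /dev/sda -o fsync-trace &
TRACE_PID=$!

# ... run the test program that calls fsync() here ...

kill "$TRACE_PID"
wait "$TRACE_PID" 2>/dev/null

# Inspect the parsed trace: a cache flush issued for the fsync() shows
# up as a barrier request (a 'B' in the RWBS column on kernels of this
# era).  If no barrier appears, fsync() stopped at the drive cache.
blkparse -i fsync-trace | less
```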
> 
> Ew.  So what do databases on Linux do?  Or are database commits
> unreliable because of this?

See above.

> > > ps. (An aside question): do you happen to know of a good patch which
> > > implements IDE barriers w/ ext3 on 2.4 kernels?  I found a patch by
> > > googling, but it seemed that the ext3 parts might not be finished, so
> > > I don't trust it.  I've found turning off the IDE write cache makes
> > > writes safe, but with a huge performance cost.
> > 
> > The hard part (the IDE code) can be grabbed from the SLES8 latest
> > kernels, I developed and tested the code there. That also has the ext3
> > bits, IIRC.
> 
> Thanks muchly!  I will definitely take a look at that.  I'm working on
> a uClinux project which must use a 2.4 kernel, and performance with
> write cache off has been a real problem.  And I've seen fs corruption
> after power cycles with write cache on many times, as expected.

No problem.

> It's a shame the ext3 bits don't do fsync() to the platter though. :-/

It really is; apparently none of the ext3 guys care about write-back
caching problems. The only guy wanting to help with the ext3 bits was
Andrew. In the reiserfs guys' favor, they have actively been pursuing
solutions to this problem. And XFS recently caught up and should be just
as good on the barrier side; I have yet to verify the fsync() part.

> To reliably commit data to an ext3 file, should we do ioctl(block_dev,
> HDIO_SET_WCACHE, 1) on 2.6 kernels on IDE?  (The side effects look to

Did you mean (..., 0)? And yes, right now it looks like fsync() on ext3
isn't any better than on other OSes, so disabling write-back caching is
the safest.

> me like they may create a barrier then flush the cache, even when it's
> already enabled, but only on 2.6 kernels).  Or is there a better way?
> (I don't see any way to do it on vanilla 2.4 kernels).

2.4 vanilla doesn't have barrier support, unfortunately.

> Should we change to only reiserfs and expect fsync() to commit data
> reliably only with that fs?  I realise this is a lot of difficult
> questions, that apply to more than just Qemu...

Yes, reiser is the only one that works reliably across power loss with
write-back caching, both for journal commits and for fsync() guarantees.

> Still, the answers are relevant to Qemu and reliably emulating a disk
> on Linux.  And relevant to most database users, I should think.

Indeed, it would be nice if someone (whistles) would write up a note
about the current state of things...

-- 
Jens Axboe




