Re: [Qemu-devel] [PATCH for-2.6 v2 0/3] Bug fixes for gluster


From: Kevin Wolf
Subject: Re: [Qemu-devel] [PATCH for-2.6 v2 0/3] Bug fixes for gluster
Date: Wed, 20 Apr 2016 13:46:09 +0200
User-agent: Mutt/1.5.21 (2010-09-15)

On 20.04.2016 at 12:40, Ric Wheeler wrote:
> On 04/20/2016 05:24 AM, Kevin Wolf wrote:
> >On 20.04.2016 at 03:56, Ric Wheeler wrote:
> >>On 04/19/2016 10:09 AM, Jeff Cody wrote:
> >>>On Tue, Apr 19, 2016 at 08:18:39AM -0400, Ric Wheeler wrote:
> >>I still worry that in many non-gluster situations we will have
> >>permanent data loss here. Specifically, the way the page cache
> >>works, if we fail to write back cached data *at any time*, a future
> >>fsync() will get a failure.
> >And this is actually what saves the semantic correctness. If you threw
> >away data, any following fsync() must fail. This is of course
> >inconvenient because you won't be able to resume a VM that is configured
> >to stop on errors, and it means some data loss, but it's safe because we
> >never tell the guest that the data is on disk when it really isn't.
> >
> >gluster's behaviour (without resync-failed-syncs-after-fsync set) is
> >different, if I understand correctly. It will throw away the data and
> >then happily report success on the next fsync() call. And this is what
> >causes not only data loss, but corruption.
> 
> Yes, that makes sense to me - the kernel will remember that it could
> not write data back from the page cache and the future fsync() will
> see an error.
> 
> >
> >[ Hm, or having read what's below... Did I misunderstand and Linux
> >   returns failure only for a single fsync() and on the next one it
> >   returns success again? That would be bad. ]
> 
> I would need to think through that scenario with the memory
> management people to see if that could happen.

Okay, please do. This is the fundamental assumption we make: If an
fsync() succeeds, *all* successfully completed writes are on disk, no
matter whether another fsync() failed in between. If they can't be
written to the disk (e.g. because the data was thrown away), no
subsequent fsync() can succeed any more.
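
As a rough illustration of that assumption, here is a toy model in Python.
It is not kernel, gluster, or qemu code; the class StickyErrorFile and its
fields are hypothetical names invented for this sketch. Once acknowledged
data has been thrown away, fsync() keeps failing, even after the backend
recovers:

```python
class StickyErrorFile:
    """Toy model: once acknowledged data is lost, fsync() never succeeds again."""

    def __init__(self):
        self.disk = {}          # offset -> data that reached stable storage
        self.dirty = {}         # offset -> data acknowledged but not yet on disk
        self.lost = False       # set once any acknowledged write was thrown away
        self.backend_ok = True  # whether writeback can currently succeed

    def write(self, offset, data):
        # The write "completes successfully" as soon as it is cached.
        self.dirty[offset] = data

    def fsync(self):
        if self.lost:
            return False        # sticky: data is gone, never claim success
        if not self.dirty:
            return True         # nothing outstanding
        if not self.backend_ok:
            self.dirty.clear()  # the data is thrown away ...
            self.lost = True    # ... and we remember that forever
            return False
        self.disk.update(self.dirty)
        self.dirty.clear()
        return True


f = StickyErrorFile()
f.write(0, b"A")
first = f.fsync()               # True: A is on disk

f.backend_ok = False
f.write(0, b"B")
second = f.fsync()              # False: B could not be written and was dropped

f.backend_ok = True             # the admin fixes the problem
third = f.fsync()               # still False: B is not on disk
```

In this model a successful fsync() always implies that every completed
write is on disk, which is the property the guest relies on.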

> >>That failure could be because of a thinly provisioned backing store,
> >>but in the interim, the page cache is free to drop the pages that
> >>had failed. In effect, we end up with data loss in part or in whole
> >>without a way to detect which bits got dropped.
> >>
> >>Note that this is not a gluster issue, this is for any file system
> >>on top of thinly provisioned storage (i.e., we would see this with
> >>xfs on thin storage or ext4 on thin storage).  In effect, if gluster
> >>has written the data back to xfs and that is on top of a thinly
> >>provisioned target, the kernel might drop that data before you can
> >>try an fsync again. Even if you retry the fsync(), the pages are
> >>marked clean so they will not be pushed back to storage on that
> >>second fsync().
> >I'm wondering... Marking the page clean means that it can be evicted
> >from the cache, right? Which happens whenever something more useful can
> >be done with the memory, i.e. possibly at any time. Does this mean that
> >two consecutive reads of the same block can return different data even
> >though no process has written to the file in between?
> 
> This we should tease out with a careful review of the behavior, but
> I think that might be able to happen.
> 
> Specifically,
> 
> Time 0: File has pattern A at offset 0. Any reads at this point see pattern A
> 
> Time 1: Write pattern B to offset 0. Reads now see pattern B.
> 
> Time 2: Run out of space on the backing store (before the data has
> been written back)
> 
> Time 3: Do an fsync() *OR* have the page cache fail to write back that page
> 
> Time 4: Under memory pressure, the page, which was marked clean, is dropped
> 
> Time 5: Read offset 0 again - do we now see pattern A again? Or an IO error?

Seeing pattern A again would certainly be surprising for programs.
Probably worth checking what really happens.
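
To make the timeline concrete, here is a toy simulation of a cache that
marks pages clean when writeback fails. It only models the behaviour
hypothesised in this thread; whether Linux actually behaves this way is
exactly the open question. ToyPageCache and all its members are made-up
names for this sketch:

```python
class ToyPageCache:
    """Toy cache that marks pages clean even when writeback fails."""

    def __init__(self, disk):
        self.disk = disk           # offset -> bytes on the backing store
        self.pages = {}            # offset -> (data, dirty)
        self.backend_full = False  # simulates ENOSPC on the backing store

    def write(self, offset, data):
        self.pages[offset] = (data, True)

    def read(self, offset):
        if offset in self.pages:
            return self.pages[offset][0]
        data = self.disk[offset]   # cache miss: read from the backing store
        self.pages[offset] = (data, False)
        return data

    def fsync(self):
        ok = True
        for offset, (data, dirty) in list(self.pages.items()):
            if not dirty:
                continue
            if self.backend_full:
                ok = False         # writeback failed
            else:
                self.disk[offset] = data
            self.pages[offset] = (data, False)  # marked clean either way
        return ok

    def evict_clean(self):
        # Memory pressure: drop every page that is not dirty.
        self.pages = {o: p for o, p in self.pages.items() if p[1]}


cache = ToyPageCache({0: b"A"})
t0 = cache.read(0)           # Time 0: pattern A
cache.write(0, b"B")
t1 = cache.read(0)           # Time 1: pattern B
cache.backend_full = True    # Time 2: backing store runs out of space
t3 = cache.fsync()           # Time 3: fsync() fails
cache.evict_clean()          # Time 4: the "clean" page is dropped
t5 = cache.read(0)           # Time 5: pattern A is back
later = cache.fsync()        # nothing is dirty, so this reports success
```

In this model the read at time 5 returns pattern A again and the later
fsync() succeeds, which is the combination that would be dangerous.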

> >Also, O_DIRECT bypasses the problem, right? There, the write request
> >itself would already fail, not only the fsync(). We recommend that for
> >production environments anyway.
> 
> O_DIRECT bypasses the page cache, but that data is allowed to be
> held in a volatile write cache (say in a disk's write cache) until
> the target device sees an fsync().
> 
> The safest (and horribly slow) way is to write with
> O_DIRECT|O_SYNC, which bypasses the page cache and effectively sends
> a cache flush after each IO.
> 
> I assume most applications call fsync() after O_DIRECT writes at more
> strategic times, though (or don't know about this behavior).

qemu can be configured to flush after each write request, but for
obvious reasons that's not something you want to use if you don't have
to.
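
As a minimal sketch of what "flush after each write request" means at the
system call level (this is not how qemu implements it; write_and_flush is
a hypothetical helper, and a temporary file stands in for a disk image):

```python
import os
import tempfile


def write_and_flush(fd, data, offset):
    """Write data at offset and flush before reporting success.

    Any failure in pwrite() or fsync() surfaces as an OSError.
    """
    written = 0
    while written < len(data):
        # pwrite() may write fewer bytes than requested, so loop.
        written += os.pwrite(fd, data[written:], offset + written)
    os.fsync(fd)  # only return once the flush has succeeded


with tempfile.TemporaryFile() as f:
    write_and_flush(f.fileno(), b"pattern B", 0)
    readback = os.pread(f.fileno(), 9, 0)
```

Every request pays for a full flush here, which is why this mode is slow.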

Anyway, disks are yet another layer, and I would guess that flush
failures become less and less likely to be temporary and recoverable the
further you go down the stack. Failing for good when the disk is broken
is fine, as far as I am concerned. Doing the same because the network
had a hiccup for a few seconds is not.

> >>Same issue with link loss - if we lose connection to a storage
> >>target, it is likely to take time to detect that, more time to
> >>reconnect. In the interim, any page cache data is very likely to get
> >>dropped under memory pressure.
> >>
> >>In both of these cases, fsync() failure is effectively a signal of a
> >>high chance of data that has been already lost. A retry will not
> >>save the day.
> >>
> >>At LSF/MM today, we discussed an option that would allow the page
> >>cache to hang on to data - for re-tryable errors only for example -
> >>so that this would not happen. The impact of this is also
> >>potentially huge (page cache/physical memory could be exhausted
> >>while waiting for an admin to fix the issue) so it would have to be
> >>a non-default option.
> >Is memory pressure the most common case, though?
> 
> I think it really depends on the type of storage device we have under us.
> 
> >
> >The odd effect that I see is that calling fsync() could actually make
> >data less safe than it was if the call fails. With the kernel marking
> >the pages clean on failure, instead of evicting "really clean" pages, we
> >can now evict "dirty, but failed writeout" pages even without any real
> >memory pressure, just because they can't be distinguished any more. Or
> >maybe they aren't even evicted, but the admin fixes the problem and we
> >could now write them to the disk if only they were still marked dirty
> >and wouldn't be ignored in the writeout.
> 
> fsync() is just the messenger that something bad happened - it is
> always better to know that we lost data since the last fsync() call
> rather than not know, correct?
> 
> Keep in mind that data will have this issue any time memory pressure
> (or other algorithms) cause data to be written back from the page
> cache, even if the application has not used an fsync().
> 
> Even if the admin "fixes" the issues (adds more storage, kicks a
> fibre channel switch, re-inserts a disk), IO might have been dropped
> forever from the page cache.

Yes. We can't recover in 100% of the cases. In some cases, like when
write failure and memory pressure come together, we may have lost. We
should probably just accept that and concentrate on improving the
average case.

My point is just that if there is no memory pressure (>90% of cases?), we
shouldn't make the situation worse than it was. In this case, fsync()
wasn't only the messenger that a write failed, but it is what caused the
write to happen at this specific time in the first place. If we hadn't
called it, and the issue were fixed before memory pressure caused the
page to be written back, we might not have suffered data loss.

In other words, calling fsync() was harmful in this situation. And that
certainly shouldn't be the case.
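
A toy comparison of the two policies, to show the difference after the
admin has fixed the problem. This is a sketch under assumptions, not real
kernel behaviour; the function names and the keep_dirty_on_error switch
are invented for illustration:

```python
def fsync(pages, disk, backend_ok, keep_dirty_on_error):
    """pages maps offset -> [data, dirty]. Returns True on success."""
    ok = True
    for offset, page in pages.items():
        data, dirty = page
        if not dirty:
            continue
        if backend_ok:
            disk[offset] = data
            page[1] = False                # written back, now really clean
        else:
            ok = False
            page[1] = keep_dirty_on_error  # policy decision on failure
    return ok


def run(keep_dirty_on_error):
    disk = {0: b"A"}
    pages = {0: [b"B", True]}  # pattern B is cached and dirty
    # First fsync() while the backend is broken, second one after the fix.
    first = fsync(pages, disk, False, keep_dirty_on_error)
    second = fsync(pages, disk, True, keep_dirty_on_error)
    return first, second, disk[0]


mark_clean = run(keep_dirty_on_error=False)  # (False, True, b"A")
keep_dirty = run(keep_dirty_on_error=True)   # (False, True, b"B")
```

With "mark clean on error" the second fsync() reports success although
the disk still holds pattern A; with pages kept dirty, the retry after the
fix actually writes pattern B.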

> >I'm sure there are solutions that are more intelligent than the extremes
> >of "mark clean on error" and "keep failed pages indefinitely" and that
> >cover a large part of use cases where qemu wants to resume a VM after a
> >failure (for local files perhaps most commonly resuming after ENOSPC).
> >
> >Even just evicting pages immediately on a failure would probably be an
> >improvement because reads would then be consistent. And keeping the data
> >around until we *really* need memory might solve the problem for all
> >practical purposes. If we do eventually need the memory and throw away
> >data, fsync() consistently returning an error after throwing away data
> >is still safe, but we have a much better behaviour in the average case.
> >
> >>I think that we will need some discussions with the kernel memory
> >>management team (and some storage kernel people) to see what seems
> >>reasonable here.
> >It's a good discussion to have, but for the network protocols (like with
> >gluster) we tend to use the native libraries and don't even go through
> >the kernel page cache. So I think we shouldn't stop discussing the
> >semantics of these protocols and APIs while talking about the kernel
> >page cache.
> >
> >Network protocols are also where errors like "network is down" become
> >more relevant, so if anything, we want to have better error recovery
> >than on local files there.
> 
> I agree that with gluster we can try various schemes pretty easily
> when the error appears because of something internal to gluster
> (like a network error to a remote gluster server) but we cannot
> shield applications from data loss when we are just the messenger
> for an error on the storage server's local storage stack.

Yes. In some cases, we will always have to tell the user "sorry,
something went wrong, your data is gone". But I think in most cases we
can do better than that.

> This is an important discussion to work through though - not just
> for qemu, I think it has a lot of value for everyone.

I would think so, yes.

Kevin


