From: Raghavendra Talur
Subject: Re: [Qemu-block] [PATCH for-2.6 2/2] block/gluster: prevent data loss after i/o error
Date: Mon, 9 May 2016 12:08:31 +0530



On Mon, Apr 11, 2016 at 9:56 AM, Raghavendra Gowdappa <address@hidden> wrote:

> +Raghavendra G who implemented this option in write-behind, to this
> upstream patch review discussion

Thanks Pranith. Kritika did inform us about the discussion. We have been working on solutions to the problems raised (and it was a long festive weekend in Bangalore).

Sorry for top-posting. I am trying to address two issues raised in this mail thread:
1. No way to identify whether setting an option succeeded in gfapi.
2. Why is retaining the cache after an fsync failure not the default behavior?

1. No way to identify whether setting an option succeeded in gfapi:

I've added Poornima and Raghavendra Talur who work on gfapi to assist on this.

There is currently no such feature in gfapi. We can think of two possible solutions:

 a. Have the qemu-glusterfs driver require a version of glusterfs-api.rpm that is known to have this write-behind option. In that case, if you call glfs_set_xlator_option before glfs_init with the right key and value, there is no way the volume set gets rejected. Note that this is an installation-time dependency; we don't have libgfapi library versioning. This solution is good enough if the logic in qemu is:

ret = glfs_set_xlator_option (...);
if (ret) {
    /* exit because we don't have the required environment */
}

proceed with work;
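For illustration, a minimal sketch of that flow; the volume name, host, xlator name pattern and option key below are placeholders/assumptions for this thread, not confirmed values:

/* Sketch only: "testvol", "gluster-host", "*-write-behind" and the option key
 * "resync-failed-syncs-after-fsync" are assumptions for illustration. */
#include <stdio.h>
#include <stdlib.h>
#include <glusterfs/api/glfs.h>

int main (void)
{
        glfs_t *fs = glfs_new ("testvol");            /* placeholder volume */

        if (!fs)
                exit (EXIT_FAILURE);

        glfs_set_volfile_server (fs, "tcp", "gluster-host", 24007);

        /* must be called before glfs_init(); with a correct key and value the
         * graph cannot reject it, but an old write-behind would not honour it */
        if (glfs_set_xlator_option (fs, "*-write-behind",
                                    "resync-failed-syncs-after-fsync", "on")) {
                fprintf (stderr, "exit because we don't have the required environment\n");
                glfs_fini (fs);
                exit (EXIT_FAILURE);
        }

        if (glfs_init (fs)) {
                glfs_fini (fs);
                exit (EXIT_FAILURE);
        }

        /* proceed with work */

        glfs_fini (fs);
        return 0;
}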


b. The second solution is to implement a case in the write-behind getxattr FOP to handle such a request and reply with the current setting. Look at dht_get_real_filename for a similar feature. You would have to implement logic similar to the following:

ret = glfs_getxattr (fs, "/", "glusterfs.write-behind-resync-failed-syncs-after-fsync-status", &value, size);

if ((ret == -1 && errno == ENODATA) || value == 0) {
       /* write-behind in the stack does not understand this option
        * OR
        * we have write-behind with the required option set to off */
    <work with the assumption that there are no retries>
} else {
    /* we do have the required option */
}
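To make that fragment self-contained, here is a sketch of such a probe. The virtual xattr key is only the one proposed above (it would exist only once write-behind implements it), and the string encoding of the value ("0"/"1") is an assumption:

/* Sketch of the client-side probe for solution (b). */
#include <errno.h>
#include <string.h>
#include <glusterfs/api/glfs.h>

static int
wb_resync_after_fsync_enabled (glfs_t *fs)
{
        char    value[16] = {0};
        ssize_t ret;

        ret = glfs_getxattr (fs, "/",
                             "glusterfs.write-behind-resync-failed-syncs-after-fsync-status",
                             value, sizeof (value) - 1);

        if ((ret == -1 && errno == ENODATA) || !strcmp (value, "0")) {
                /* write-behind in the stack does not understand this option,
                 * OR write-behind has the required option set to off */
                return 0;   /* work with the assumption that there are no retries */
        }

        return 1;           /* we do have the required option */
}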

One negative aspect of both the solutions above is the loosely coupled nature of the logic. If the behavior ever changes in the lower layer (gfapi or write-behind in this case), the upper layer (qemu) will have to
a. have a dependency of the sort that defines both the minimum and maximum version of the package required, and
b. interpret the return values with more if-else conditions to maintain backward compatibility.

 We are thinking of other ways too, but the above are the current solutions that come to mind.

Thanks,
Raghavendra Talur



2. Why is retaining the cache after an fsync failure not the default behavior?

It was mostly to avoid breaking synchronized applications that rely on fsync failures to retry; details of that discussion can be found below. The other reason was that we could not figure out POSIX's take on the state of earlier writes after an fsync failure (other filesystems such as xfs and ext4 had non-uniform behavior). More specifically, the question was: "is it correct for a write issued before a failed fsync to succeed on the backend storage (i.e., persisting happens after the fsync failure)?". I've also added Vijay Bellur, who was involved in the earlier discussion, to the cc list.

Following is the discussion we had earlier with Kevin, Jeff Cody and others on the same topic. I am quoting it verbatim below.

<quote>

> > > > > I would actually argue that this setting would be right for everyone,
> > > > > not just qemu. Can you think of a case where keeping the data cached
> > > > > after a failed fsync would actively hurt any application? I can only
> > > > > think of cases where data is unnecessarily lost if data is dropped.
> > > > >
> > > >
> > > > I worry about use cases with concurrent writers to the same file and
> > > > how different applications would handle fsync() failures with our new
> > > > behavior.
> > >
> > > Any specific scenario you're worried about?
> > >
> > > > Keeping the known old behavior as the default will ensure that we do
> > > > not break anything once this is out. qemu/virt users with gluster are
> > > > accustomed to setting the virt group and hence no additional knobs
> > > > would need to be altered by them.
> > >
> > > Not changing anything is a safe way to avoid regressions. But it's also
> > > a safe way to leave bugs unfixed. I would say that the current behaviour
> > > is at least borderline buggy and very hard for applications to handle
> > > correctly.
> >
> > I tend to agree with Kevin on this. If we've an error handling strategy
> > that
> > is posix-compliant, I don't think there is need to add one more option.
> > Most
> > of the times options tend to be used in default values, which is equivalent
> > to not providing an option at all. However, before doing that its better we
> > think through whether it can affect any existing deployments adversely
> > (even
> > when they are not posix-compliant).
> >
>
> One pattern that I can think of -
>
> Applications that operate on the same file from different clients through
> some locking or other co-ordination would have this pattern:
>
> lock (file), write (file), fsync (file), unlock (file);
>
> Now if the first fsync() fails, based on application logic the offset used
> for the failed write + fsync could be re-utilized by a co-ordinating
> application on another node to write out legitimate data. When control
> returns back to the application that received a failure, the subsequent
> write + fsync can cause data to be overwritten at the old offset along with
> new data being written at the new offset.

Yeah, I agree. Co-ordinated applications on different mounts will have issues if they are working on the assumption that after fsync no older writes will hit the backend. Given that there seems to be a fair bit of confusion about the status of retrying older writes after an fsync failure, we can expect some regressions. So, to summarize:

1. Keep the behaviour in patchset 11 of [1] as the default. Both fsync and flush act as barriers and will make sure that, after a failed fsync/flush, either
   a. older writes are synced to the backend, or
   b. older writes are failed and never retried.

2. Modify the behaviour of patchset 11 of [1] to keep failed syncs after a failed fsync and retry them till a flush. After a flush, no retries of failed syncs will be attempted. This behaviour will be introduced as an option.

3. Transient and non-transient errors will be treated similarly and failed syncs will be retried alike.

Does everyone agree on the above points? If yes, I'll modify [1] accordingly.

[1] http://review.gluster.org/#/c/12594/

</quote>
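To make the co-ordinated-writer hazard quoted above concrete, here is a minimal sketch of the lock (file), write (file), fsync (file), unlock (file) pattern using plain POSIX calls; the helper name and buffer handling are illustrative only:

#include <fcntl.h>
#include <unistd.h>

/* On failure the caller assumes the region was not persisted and may hand the
 * same offset to a peer on another client; a silently retried write after the
 * failed fsync would then overwrite that peer's legitimate data. */
static int
locked_write (int fd, off_t off, const char *buf, size_t len)
{
        struct flock lk = { .l_type = F_WRLCK, .l_whence = SEEK_SET,
                            .l_start = off, .l_len = (off_t) len };
        int ret = 0;

        if (fcntl (fd, F_SETLKW, &lk) == -1)
                return -1;

        if (pwrite (fd, buf, len, off) != (ssize_t) len || fsync (fd) != 0)
                ret = -1;

        lk.l_type = F_UNLCK;
        fcntl (fd, F_SETLK, &lk);
        return ret;
}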


