Re: [Gluster-devel] Selfheal is not working? Once more


From: Kevan Benson
Subject: Re: [Gluster-devel] Selfheal is not working? Once more
Date: Wed, 30 Jul 2008 14:42:07 -0700
User-agent: Thunderbird 2.0.0.14 (X11/20080421)


Previous quoted posts removed for brevity...

Martin Fick wrote:
It does seem like it would be fairly easy to add another metadata attribute to each file/directory that would hold a checksum for it. That way, AFR itself could be configured to check or compute the checksum any time the file is read or written. Since this would slow AFR down, I would suggest a configuration option to turn it on. If the checksum is wrong, AFR could heal from the other brick's copy, provided that brick's checksum is correct.
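The brick-selection logic this implies can be sketched as a small pure function (Python, purely for illustration; nothing here is actual AFR code):

```python
def choose_heal_source(computed_a, stored_a, computed_b, stored_b):
    """Given computed vs. stored checksums on two bricks, decide which
    copy (if either) can serve as the heal source."""
    a_ok = computed_a == stored_a
    b_ok = computed_b == stored_b
    if a_ok and not b_ok:
        return "a"          # heal brick B from brick A
    if b_ok and not a_ok:
        return "b"          # heal brick A from brick B
    if a_ok and b_ok:
        return None         # both copies healthy, nothing to do
    raise RuntimeError("both copies corrupt; keep them for inspection")
```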

Another alternative would be to create an offline checksummer that adds such an attribute where it does not exist and verifies the checksum where it does. If the check fails, the checksummer would simply delete the file and its attributes (and potentially the directory attributes up the tree) so that AFR will then heal it.

The only modification needed in AFR to support this would be to delete the checksum attribute any time the file or directory is updated, so that the offline checksummer recreates it instead of thinking the file is corrupt. In fact, even this could be eliminated, making the offline checksummer completely "self-powered": any time it calculates a checksum, it could copy the glusterfs version and timestamp attributes to two new "checksummer" attributes. If these become out of date, the checksummer knows to recompute the checksum instead of assuming that the file has been corrupted.

The one risk with this is that if a file gets corrupted on both nodes, it will get deleted on both nodes, leaving no corrupted copy to at least look at. This too could be overcome by saving any deleted files in a separate "trash can" and emptying it once the files in it have been healed: sort of a self-cleaning lost+found directory.


I know this may not be the answer you were looking for, but I hope it helps clarify things a little.

A while back I seem to remember someone talking about eventually creating a fsck.glusterfs utility. Since corruption on an underlying server node would (hopefully) not be a common problem, a dedicated tool that could be run when prudent seems like a good approach. If the underlying data on a node is suspected of corruption, run the normal fsck on that node, then run fsck.glusterfs on the share; a dedicated utility can apply a much more comprehensive set of checks and repairs than would be feasible in normal AFR file processing.

--

-Kevan Benson
-A-1 Networks



