Re: [rdiff-backup-users] Regression errors
From: Maarten Bezemer
Subject: Re: [rdiff-backup-users] Regression errors
Date: Wed, 15 Apr 2009 11:29:59 +0200 (CEST)
Hi,
Maybe a little late, but here goes.
On Tue, 31 Mar 2009, Bob Mead wrote:
The BUG locked Processor error was a long time ago and, according to an
article Andrew directed me to, it was due to a problem with Ubuntu 8.04. At
the time, I did run memtest for some hours on the server that produced the
error, and it never failed or found any errors. I have not seen that
particular error since then, and as a result I am no longer using Ubuntu 8.04.
Depending on the amount of memory in the machine, 'some hours' may or may
not have been enough to find certain errors. I've seen machines throwing
up only 1 error in a 12-hour run of memmxtest (and again only 1 error in
two repeated 24-hour runs), so that error was consistent but not triggered
easily.
So, if you have the opportunity to do more extensive tests (e.g. over the
weekend), please do, just to be sure.
Third, I re-read some of your emails about your situation and what you've
been trying to do. Having missing metadata files also might indicate
hardware problems. Or maybe it's something related to kernel versions and
data corruption on your file systems. Either way, it's pretty bad.
I do not have missing metadata files that I know of. I mistyped
"current-metadata" files for "current-mirror" files in my most recent post.
At Andrew's suggestion, I had adjusted the current-mirror file to indicate an
earlier time, to 'fool' rdiff into believing that it had not already run. When I
did this (by renaming the file with an earlier date), rdiff did run, but
complained about not finding the metadata files and said that it would use
the filesystem instead. The backup has not run properly since then.
I don't know exactly what happens when you fool rdiff-backup like that. If
it uses the 'current-mirror' marker as "the timestamp indicated in the
current-mirror marker is taken as 'now', and all files found in the tree
should match this 'now'", then you could very well break things if a
subsequent (possibly unfinished) rdiff-backup run changed the files. In
that case, mirror_metadata wouldn't match the real file contents. Also,
applying reverse-diffs to a different version of a file than the one they
were built for could screw things up badly.
When I look at the source, it is not clear to me what is the case. Maybe
someone with more extensive experience with the sources can comment on
this?
Problem #1:
Origin/source server: Linux 2.6.7-gentoo-r5 #2 SMP Wed Nov 30 12:40:39 PST
2005 i686 Intel(R) Pentium(R) 4 CPU 3.06GHz GenuineIntel GNU/Linux.
This is a bit ancient. However, I didn't find any reports on known bugs in
this version causing memory or filesystem corruption.
Destination/backup server: Linux 2.6.15-51-amd64-server #1 SMP Tue Feb 12
17:08:38 UTC 2008 x86_64 GNU/Linux
Problem #2:
Origin/source server: Linux 2.6.27-11-server #1 SMP Thu Jan 29 20:19:41 UTC
2009 i686 GNU/Linux.
Destination/backup server: Linux 2.6.27-11-server #1 SMP Thu Jan 29 20:19:41
UTC 2009 i686 GNU/Linux
These are fairly recent kernels. As far as my information goes, there was
a known bug in 2.6.27 prior to 2.6.27.10 related to file locking. I'm not
sure if this was fixed in your 2.6.27-11 build (2.6.27-11 not being the
same as 2.6.27.11). If you're using a current Ubuntu release and have the
latest kernel available for that release, you should be OK.
You wrote earlier that upgrading or doing just anything with the
server running rdiff-backup 1.0.4/1.0.5 is out of the question
because of lack of resources. An alternative might be to first use
rsync to synchronise your data to another server, and then use
rdiff-backup from there. That gives you the opportunity to "play
around" with different rdiff-backup versions without risking a
"total breakdown" of the primary server.
Again, lack of resources prevents me from doing this on a network-wide basis.
I don't have any spare servers to rsync to, and the time it would take to do
that and then rdiff that result somewhere else is beyond the carrying
capacity of our network and/or the available time windows and bandwidth. I am
actually working on a buildout of additional servers for placement at each
remote site, which will act as local backups, and then I will be doing exactly
that (rsync to the new local machine and then rdiff from there to the backup
server); however, that project may take some months to complete.
Well, it seems that (at this time at least) you have a 'somewhat' broken
backup system. Some would say a broken backup system is worse than no
such system at all (since having one makes people believe the data is safe
and all). So, if that's fine with your boss then you're out of luck.
Otherwise this might be a perfect reason to have some additional resources
assigned to your work. It's just a matter of how valuable the data is, and
what the consequences are when it is lost. Would you be fired, or would
the blame be on your boss.. ;-)
I'll get back to this below..
Based on my traceback results, have the regressions actually failed? All I
see is messages about regressing destination now. There doesn't ever seem to
be any message about what happens after that.
There's always the --check-destination-dir switch, which you can run locally
on the backup server to see if the backup tree needs to be cleaned up. The
regression is done automatically at the start of a normal backup run when
rdiff-backup finds an unclean backup tree, but running rdiff-backup
--check-destination-dir does just the cleanup. You might want to try that.
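For example, something like this on the backup server (the repository path
is just a placeholder here):
# rdiff-backup --check-destination-dir /path/to/rdiff-backup-tree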
If you're running a recent version of rdiff-backup, you could also try the
--verify switch to see if the files in the backup repo match the checksums
recorded in the metadata file.
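Again with a placeholder path, that would look something like:
# rdiff-backup --verify /path/to/rdiff-backup-tree
If your version supports it, --verify-at-time can do the same check for
older increments.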
On the other hand, you once mentioned that one of the servers had a clock
that was way off. Only recently I saw something on this mailing list about
calculations that used the clocks of both sides when they should have used
the clock of only one side. Maybe you ran into a similar issue that screwed
up your repo?
If you insist on trying to fix this "the software way", I have a
suggestion for you. The second problem in your email talks
about a 23-hour run of rdiff-backup. Given the size of the backup,
I'd say that this was an initial run and there aren't hundreds
of increments in play here?
From my original post (below): "This backup data (241GB) set took several
tries to get to run properly, however it did complete successfully on 3/23
(after running for 23 hours to complete)". Perhaps this is not as clear as I
thought. Yes, this is the initial run and no there are not any increments.
Your wording here leads me to believe that you think this is an erroneous
question, perhaps one that ought not to be answered, at least here, or by
you. I am not 'insisting' on anything. I asked the list for help on two
particular problems I am having - nothing more. If it turns out that it is
not the case that either problem I am having has anything at all to do with
software, I am more than happy to look elsewhere to solve the problems. I
wish I had the experience to see the 'CRC check failed' and immediately go to
'hardware issue'. Unfortunately, I don't. So I ask questions. I apologize if
my asking has upset you.
I'm not upset, although my wording could have been a bit unfriendly.
There are a number of things you can try here. Given the fact that it is a
large amount of data, we can use it to at least detect some hardware
problems.
For example, try this:
# cd /path/to/dataset/location
# find . -type f -print0 | sort -z | xargs -0r md5sum > /tmp/md5sum_run1
# find . -type f -print0 | sort -z | xargs -0r md5sum > /tmp/md5sum_run2
# md5sum /tmp/md5sum_run*
And check that both /tmp/md5sum_run* files have the same checksum. They
should, as long as no rdiff-backup process is running at the time.
If the checksums don't match, try:
# diff -u /tmp/md5sum_run1 /tmp/md5sum_run2 | less
And look for the differences. Maybe just one line, maybe a lot of lines.
Do these tests both on the source and on the backup machines.
I've seen cases where some combinations of chipsets, processors and memory
chips go weird. For example a mainboard based on a Via KT400a chipset, a
FSB266 processor and DDR400 ram modules. Memtest didn't find any problems,
file checksums were usually right, but about 1 out of 20 times they didn't
match. Your 200+ GB dataset is likely to show these problems in two runs,
but you are of course free to do more tests, creating /tmp/md5sum_run3, etc.
I found that clocking the RAM at 133MHz instead of 200MHz (i.e., matching
RAM speed to FSB speed) made the system stable.
Depending on how fast and how often the contents of your dataset change,
you could also compare the source and backup /tmp/md5sum_run1 files. When
the data changes often, this might be a bit pointless, but see below.
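If you do want to compare them directly over the network, something like
this should work (assuming ssh access, and that both files were created from
the root of the respective trees so the relative paths line up;
'backup-server' is just a placeholder):
# diff -u /tmp/md5sum_run1 <(ssh backup-server cat /tmp/md5sum_run1) | less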
If so, could you try rsync with the --checksum argument to synchronise the
backup to the source, and see if files are being updated that should not
have changed, judging by their modification timestamps? If you see such
files, then you're probably just out of luck and need some hardware
replaced, either in your computers or in the networking equipment.
Since this is the initial run, there are only files that have changed (all of
them) in the repo. I guess I'm not clear on what you're wanting to see here.
If I rsync the repo as is, to the source I'm going to see what? Since there
is only one backup, and it is the initial run, how will rsyncing that run
back to the source files tell me about changed files?
I wasn't entirely clear on this. Normally, rsync bases its decision to
sync file contents only on file modification timestamps and sizes. So,
files that are corrupted but have the same size and timestamps will not
get 'repaired'. When you add the --checksum argument, all files will get
checksummed to see if they still match.
If you have files in your repo that are not supposed to change often, but
are updated when you run rsync with the --checksum argument, this can
point to problems. Either with the way they were transferred initially, or
with the hardware.
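A dry run keeps rsync from modifying anything while still listing the files
whose checksums differ. As a sketch with placeholder paths, run from the
source server (-n for dry-run, -i to itemize the changes):
# rsync -n -i -aH --checksum /path/to/data/ backup-server:/path/to/rsynced-data/
Any file listed there whose size and modification time haven't changed is a
candidate for corruption somewhere along the way.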
If you don't see any unchanged files being updated, then we're left with
the question why rdiff-backup sees a failed CRC checksum. If you didn't
mess with metadata files on the given repository, we're looking at some
data corruption issue.
I haven't messed with any metadata files. The source data is rsynced daily
from the server that it is replacing (new-server runs rsync -aH at 11pm daily
to synchronize with old-server). Then that rsynced data set is rdiff'd to the
backup server (new-server pushes rdiff-backup at 4pm daily). I purposely
have the rdiff sessions start before the rsync sessions, to allow rsync to run
overnight before the next day's rdiff. Perhaps the data is being corrupted by
the rsync process?
Now the situation is getting more clear to me. What I understand is that
you have:
1) source-server:/path/to/data
2) backup-server:/path/to/rsynced-data
3) backup-server:/path/to/rdiff-backup-tree
And you use rsync to sync 1) to 2) and then rdiff-backup to sync 2) to 3).
Meaning that at the backup-server you have two times the dataset, once in
/path/to/rsynced-data and once in /path/to/rdiff-backup-tree, and these
locations are not shared.
In that case, you could schedule a find|sort|xargs md5sum thing at the
source-server and at the backup-server right after the rsync run finishes.
Given the time, I'd expect the data usually doesn't change during the
nights. Then, try to compare these md5sums files and see if they differ:
they shouldn't.
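A rough sketch of how that could be scheduled, with purely hypothetical
times and paths (put the equivalent line, adjusted for the local path, in the
crontab on both servers, some time after the nightly rsync has finished):
30 5 * * * cd /path/to/data && find . -type f -print0 | sort -z | xargs -0r md5sum > /tmp/md5sum_nightly
Then collect /tmp/md5sum_nightly from both machines and diff them as shown
earlier.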
As an aside, even if you don't want to rebuild your servers, there still
are some ways to compile a new version of rdiff-backup. I had to do this
once for some clients that didn't want to upgrade from 1.2.2 to 1.2.5 just
yet. It turned out to be relatively easy to install python2.4 + librsync +
rdiff-backup in my own home directory, and have multiple versions in active
use by not using the standard python site-packages location but setting
some environment variables.
I am having enough troubles getting the versions I have to work successfully.
None of the errors I am seeing have ever been described as "fixed, upgrade
and you will not see these any more". I have seen only one problem that
Andrew described as giving a better message in newer versions.
If we don't get any further with the suggestions above, would you consider
trying a new version of rdiff-backup if I provide you with a recipe to build
it, separate from the normal rdiff-backup package? I'd be willing to help
you with that, just to see what we can find. But first, try the
suggestions above; maybe we can resolve the issue without it.
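In case it helps to see roughly what that recipe would look like, here is a
sketch (version number and prefix are just examples; it assumes librsync and
its development headers are already available so the C extension can build,
and the python2.x part depends on your actual Python version):
# tar xzf rdiff-backup-1.2.8.tar.gz && cd rdiff-backup-1.2.8
# python setup.py install --prefix=$HOME/rdiff-backup-local
# export PYTHONPATH=$HOME/rdiff-backup-local/lib/python2.4/site-packages
# $HOME/rdiff-backup-local/bin/rdiff-backup --version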
Regards,
Maarten