Re: [rdiff-backup-users] Regression errors
From: Bob Mead
Subject: Re: [rdiff-backup-users] Regression errors
Date: Mon, 20 Apr 2009 15:42:30 -0700
User-agent: Thunderbird 2.0.0.18 (X11/20081125)
Hi Maarten:
Thanks for your message - much to think about!
Maarten Bezemer wrote:
Hi,
Maybe a little late, but here goes.
On Tue, 31 Mar 2009, Bob Mead wrote:
The 'BUG locked Processor' error was a long time ago and, according to
an article Andrew directed me to, it was due to a problem with ubuntu
8.04. At the time, I ran memtest for some hours on the server that
produced the error, and it never failed or found any errors. I have not
seen that particular error since then, and as a result I am no longer
using ubuntu 8.04.
Depending on the amount of memory in the machine, 'some hours' may or
may not have been enough to find certain errors. I've seen machines
throwing up only 1 error in a 12-hour run of memmxtest (and again only
1 error in two repeated 24-hour runs), so that error was consistent
but not triggered easily.
So, if you have the opportunity to do more extensive tests (e.g. over
the weekend), please do, just to be sure.
Third, I re-read some of your emails about your situation and what
you've been trying to do. Having missing metadata files also might
indicate hardware problems. Or maybe it's something related to
kernel versions and data corruption on your file systems. Either
way, it's pretty bad.
I do not have missing metadata files that I know of. I mis-typed
"current-metadata" files for "current-/mirror"/ files in my most
recent post. At Andrew's suggestion, I had adjusted the Current
Mirror file to indicate a prior time to 'fool' rdiff into believing
that it had not already run. When I did this (by renaming the file
with an earlier date), rdiff did run, but complained about not
finding the metadata files and said that it would use the filesystem
instead. The backup has not run properly since then.
I don't know exactly what happens when you fool rdiff-backup like
that. If it uses the 'current-mirror' marker as "the timestamp
indicated in the current-mirror marker is taken as 'now', and all
files found in the tree should match this 'now'", then you could very
well break things if a subsequent (possibly unfinished) rdiff-backup
run changed the files. In that case, mirror_metadata wouldn't match
the real file contents. Also, applying reverse-diffs to a different
version of a file than the one they were built for could break things
badly.
When I look at the source, it is not clear to me which is the case.
Maybe someone with more extensive experience with the sources can
comment on this?
I have since moved this data set to a new repo. It has been working fine
in its new home. As my scripts remove all increments older than two
months, I will wait another few weeks and then delete the original repo
and its now [probably] hopelessly broken data set.
Problem #1:
Origin/source server: Linux 2.6.7-gentoo-r5 #2 SMP Wed Nov 30
12:40:39 PST 2005 i686 Intel(R) Pentium(R) 4 CPU 3.06GHz GenuineIntel
GNU/Linux.
This is a bit ancient. However, I didn't find any reports on known
bugs in this version causing memory or filesystem corruption.
Destination/backup server: Linux 2.6.15-51-amd64-server #1 SMP Tue
Feb 12 17:08:38 UTC 2008 x86_64 GNU/Linux
Problem #2:
Origin/source server: Linux 2.6.27-11-server #1 SMP Thu Jan 29
20:19:41 UTC 2009 i686 GNU/Linux.
Destination/backup server: Linux 2.6.27-11-server #1 SMP Thu Jan 29
20:19:41 UTC 2009 i686 GNU/Linux
These are fairly recent kernels. As far as my information goes, there
was a known bug in 2.6.27 prior to 2.6.27.10 related to file locking.
I'm not sure if this was fixed in your 2.6.27-11 build (2.6.27-11 not
being the same as 2.6.27.11). If you're using a current ubuntu release
and have the latest kernel available for that release, you should be OK.
The last two are both fresh ubuntu 8.10 installs [using the default
kernel supplied]. The older kernel [2.6.15-51] is undoubtedly the
default kernel supplied with that distro. So it sounds like there are no
kernel issues known at this time.
You wrote earlier that upgrading or doing just anything with the
server running rdiff-backup 1.0.4/1.0.5 is out of the question
because of lack of resources. An alternative might be to first use
rsync to synchronise your data to another server, and then use
rdiff-backup from there. That gives you the opportunity to "play
around" with different rdiff-backup versions without risking a
"total breakdown" of the primary server.
Again, lack of resources prevents me from doing this on a network
wide basis. I don't have any spare servers to rsync to and the time
it would take to do that and then try to rdiff that result somewhere
else is beyond the carrying capacity of our network and/or available
times/bandwidths. I am actually working on a buildout of additional
servers for placement at each remote site which will act as local
backups and I will be doing exactly that (rsync to that new local
machine and then rdiff from there to the backup server) however that
project may take some months to complete.
Well, it seems that (at this time at least) you have a 'somewhat'
broken backup system. Some would say a broken backup system is worse
than no such system at all (since having one makes people believe the
data is safe and all). So, if that's fine with your boss then you're
out of luck. Otherwise this might be a perfect reason to have some
additional resources assigned to your work. It's just a matter of how
valuable the data is, and what the consequences are when it is lost.
Would you be fired, or would the blame be on your boss.. ;-)
I'll get back to this below..
It's more of 'how do we provide the best solution we can with the
resources we have at hand'. Yes, the data is important; no one will be
fired if it gets lost; and most importantly, there are no additional
resources. So I have to make do with what I've got. I agree that a
broken backup system is less than ideal - hence it's been handed to me
as job #1 to make it work.
I do have the all new site-backup-servers deployed now and I rsync each
of the site-servers to their respective site-backup-servers daily.
Since all of the new site-backup-servers are ubuntu 8.10 installs and
the 'new' backup server is also an 8.10 install, I am hoping to move all
the rdiff backups to use the new servers (all of which run v1.1.16
included as part of the 8.10 repository). It is my hope that running one
version of rdiff will simplify things to some degree.
Based on my traceback results, have the regressions actually failed?
All I see are messages about regressing the destination now. There
never seems to be any message about what happens after that.
There's always the --check-destination-dir switch you can run locally
on the backup server, to see if the backup tree needs to be cleaned
up. The regression is done automatically at start of a normal backup
run when rdiff-backup finds an unclean backup tree, but running
rdiff-backup --check-destination-dir does only just the cleanup. You
might want to try that.
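Run locally on the backup server, that would look something like this
(the repository path here is just an example):
# rdiff-backup -v5 --check-destination-dir /path/to/backup-repo
The -v5 just raises the verbosity so you can see what it decides to do.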
I tried this, with varying results. Some of my most recent problem
backups (2 to be exact) returned a 'crc-check error'. One other returned
with 'OK'. And still another returned with:
Traceback (most recent call last):
  File "/usr/bin/rdiff-backup", line 23, in ?
    rdiff_backup.Main.Main(sys.argv[1:])
  File "/usr/lib/python2.4/site-packages/rdiff_backup/Main.py", line 285, in Main
    take_action(rps)
  File "/usr/lib/python2.4/site-packages/rdiff_backup/Main.py", line 257, in take_action
    elif action == "check-destination-dir": CheckDest(rps[0])
  File "/usr/lib/python2.4/site-packages/rdiff_backup/Main.py", line 854, in CheckDest
    need_check = checkdest_need_check(dest_rp)
  File "/usr/lib/python2.4/site-packages/rdiff_backup/Main.py", line 890, in checkdest_need_check
    assert len(curmir_incs) == 2, "Found too many current_mirror incs!"
AssertionError: Found too many current_mirror incs!
I tried renaming [separately] both the oldest and newest current_mirror
files in the rdiff-backup-data directory, which then threw errors about
not finding appropriate metadata to regress to. Any ideas on how to
remedy this, short of starting over with a new repo? When I googled for
an answer, I found a thread about using 'rsync with --delete' to remove
extra current_mirror files - is there a way to do this with rdiff?
If you're running a recent version of rdiff-backup, you could also try
the --verify switch to see if the files in the backup repo match the
checksum recorded in the metadata file.
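For a repo on the backup server itself, that would be something along
these lines (the path is just an example):
# rdiff-backup --verify /path/to/backup-repo
which checks the current mirror against the checksums stored in the
metadata, assuming the version in use actually records them.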
Does v1.1.16 support the --verify option?
On the other hand, you once mentioned that one of the servers had a
clock that was way off. Only recently I saw something on this mailing
list about calculations that used clocks from both sides when they
should have used the clock on only one side. Maybe you ran into a
similar issue that screwed up your repo?
If you insist on trying to fix this "the software way", I have a
suggestion for you. The second problem in your email talks about a
23-hour run of rdiff-backup. Given the size of the backup, I'd say this
was an initial run and there aren't hundreds of increments in play
here?
From my original post (below): "This backup data (241GB) set took
several tries to get to run properly, however it did complete
successfully on 3/23 (after running for 23 hours to complete)".
Perhaps this is not as clear as I thought. Yes, this is the initial
run and no there are not any increments.
Your wording here leads me to believe that you think this is an
erroneous question, perhaps one that ought not to be answered, at
least here, or by you. I am not 'insisting' on anything. I asked the
list for help on two particular problems I am having - nothing more.
If it turns out that it is not the case that either problem I am
having has anything at all to do with software, I am more than happy
to look elsewhere to solve the problems. I wish I had the experience
to see the 'CRC check failed' and immediately go to 'hardware issue'.
Unfortunately, I don't. So I ask questions. I apologize if my asking
has upset you.
I'm not upset, although my wording could have been a bit unfriendly.
There are a number of things you can try here. Given the fact that it
is a large amount of data, we can use it to at least detect some
hardware problems.
For example, try this:
# cd /path/to/dataset/location
# find . -type f -print0 | sort -z | xargs -0r md5sum > /tmp/md5sum_run1
# find . -type f -print0 | sort -z | xargs -0r md5sum > /tmp/md5sum_run2
# md5sum /tmp/md5sum_run*
And check that both /tmp/md5sum_run* files have the same checksum.
They should have, if there's no rdiff-backup process running.
If the checksums don't match, try:
# diff -u /tmp/md5sum_run1 /tmp/md5sum_run2 | less
And look for the differences. Maybe just one line, maybe a lot of lines.
Do these tests both on the source and on the backup machines.
I will add this to run as a script after the rsync commands in the
nightly synchronization process at the source-backup servers. Depending
on the output of that, I can then try the diff step to see what
changed. Running this on the source would present a greater challenge,
as the data set is comprised of /home, /var, /etc, and /root with some
exclusions.
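Something like this rough sketch is what I have in mind (the paths and
mail address are placeholders, not the real layout):
#!/bin/sh
# run after the nightly rsync: checksum the repo twice, mail any difference
DATASET=/path/to/rsynced-data
TODAY=`date +%Y%m%d`
cd "$DATASET" || exit 1
find . -type f -print0 | sort -z | xargs -0r md5sum > /tmp/md5sum_${TODAY}_run1
find . -type f -print0 | sort -z | xargs -0r md5sum > /tmp/md5sum_${TODAY}_run2
if ! cmp -s /tmp/md5sum_${TODAY}_run1 /tmp/md5sum_${TODAY}_run2; then
    diff -u /tmp/md5sum_${TODAY}_run1 /tmp/md5sum_${TODAY}_run2 | \
        mail -s "checksum mismatch on `hostname`" admin@example.com
fi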
I've seen cases where some combinations of chipsets, processors and
memory chips go weird. For example, a mainboard based on a Via KT400a
chipset, an FSB266 processor and DDR400 RAM modules: memtest didn't
find any problems, and file checksums were usually right, but about 1
out of 20 times they didn't match. Your 200+ GB dataset is likely to
show these problems in two runs, but you are of course free to do more
tests, creating /tmp/md5sum_run3, etc.
I found that clocking the ram at 133MHz instead of 200 (i.e., matching
ram speed to fsb speed) made the system stable.
Depending on how fast and how often the contents of your dataset
change, you could also compare the source and backup /tmp/md5sum_run1
files. When the data changes often, this might be a bit pointless, but
see below.
If so, could you try rsync with the --checksum argument to synchronise
the backup to the source, and see whether files get updated that should
not have changed, judging by their modification time stamps? If you see
such files, then you're probably just out of luck and need some
hardware replaced - either in your computers or in the networking
equipment.
Since this is the initial run, there are only files that have changed
(all of them) in the repo. I guess I'm not clear on what you're
wanting to see here. If I rsync the repo as is, to the source I'm
going to see what? Since there is only one backup, and it is the
initial run, how will rsyncing that run back to the source files tell
me about changed files?
I wasn't entirely clear on this. Normally, rsync bases its decision to
sync file contents only on file modification timestamps and sizes. So,
files that are corrupted but have the same size and timestamps will
not get 'repaired'. When you add the --checksum argument, all files
will get checksummed to see if they still match.
If you have files in your repo that are not supposed to change often,
but are updated when you run rsync with the --checksum argument, this
can point to problems. Either with the way they were transferred
initially, or with the hardware.
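A dry run first only lists what would change, for example (paths and
hostname are placeholders):
# rsync -aHin --checksum /path/to/rsynced-data/ source-server:/path/to/data/
The -i (itemize) and -n (dry-run) options make rsync report the files
it would update without actually touching anything.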
If you don't see any unchanged files being updated, then we're left
with the question why rdiff-backup sees a failed CRC checksum. If
you didn't mess with metadata files on the given repository, we're
looking at some data corruption issue.
I haven't messed with any metadata files. The source data is rsynced
daily from the server that it is replacing (new-server runs rsync -aH
at 11pm daily to synchronize with old-server). Then that rsynced data
set is rdiff'd to the backup server (new server pushes rdiff-backup
at 4pm daily). I purposely have the rdiff sessions start before the
rsync sessions to allow rsync to run overnight before the next day's
rdiff. Perhaps the data is being corrupted by the rsync process?
Now the situation is getting more clear to me. What I understand is
that you have:
1) source-server:/path/to/data
2) backup-server:/path/to/rsynced-data
3) backup-server:/path/to/rdiff-backup-tree
And you use rsync to sync 1) to 2) and then rdiff-backup to sync 2) to
3).
Meaning that on the backup-server you have the dataset twice, once in
/path/to/rsynced-data and once in /path/to/rdiff-backup-tree, and these
locations are not shared.
You are close. In actual fact, I have 1) as you have described, and 2)
[which I'll call the 'source-backup-server' as per your naming
convention], and I do use rsync to sync these two. Then I have 3) as
you describe, although it's a different physical machine [and in a
different location] from 1) or 2) [I have 10 separate sites, each with
both a 'source' and a 'source-backup' server]. I currently use rdiff to
back up from the 'source' servers to the backup server. I am hoping to
migrate to using rdiff to back up [to 3)] the data synced to 2)
[source-backup servers], but I haven't been able to implement that yet.
In that case, you could schedule a find|sort|xargs md5sum thing at the
source-server and at the backup-server right after the rsync run
finishes. Given the time, I'd expect the data usually doesn't change
during the nights. Then, try to compare these md5sums files and see if
they differ: they shouldn't.
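For example, copy one file over and diff them on the backup server
(hostname and paths are just placeholders):
# scp source-server:/tmp/md5sum_run1 /tmp/md5sum_source
# diff -u /tmp/md5sum_source /tmp/md5sum_run1 | less
Any lines that differ point at files that don't match between the two
machines.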
As an aside, even if you don't want to rebuild your servers, there
still are some ways to compile a new version of rdiff-backup. I had
to do this once for some clients that didn't want to upgrade from
1.2.2 to 1.2.5 just yet. It turned out to be relatively easy to
install python2.4 + librsync + rdiff-backup in my own home
directory, and have multiple versions in active use by not using the
standard python site-packages location but setting some environment
variables.
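Roughly, such a home-directory install has the following shape; the
version numbers, the ~/local prefix and the --librsync-dir option are
from memory and purely illustrative, not a tested recipe:
$ cd ~/src/librsync-0.9.7
$ ./configure --prefix=$HOME/local && make && make install
$ cd ~/src/rdiff-backup-1.2.8
$ python2.4 setup.py install --prefix=$HOME/local --librsync-dir=$HOME/local
$ export PYTHONPATH=$HOME/local/lib/python2.4/site-packages
$ export PATH=$HOME/local/bin:$PATH
$ rdiff-backup --version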
I am having enough troubles getting the versions I have to work
successfully. None of the errors I am seeing have ever been described
as "fixed, upgrade and you will not see these any more". I have seen
only one problem that Andrew described as giving a better message in
newer versions.
If we don't get any further with the suggestions above, would you
consider trying a new version of rdiff-backup if I provide you with a
recipe to build it, separated from the normal rdiff-backup package?
I'd be willing to help you with that, just to see what we can find.
But first, try the suggestions above, maybe we can resolve the issue
without it.
Do you think that the older versions of rdiff that I use currently
(v1.0.4 and 1.0.5) are in any way causing the errors I am seeing? No one
has previously indicated that it is the software version(s) I am using
that are the cause of the error(s).
Thanks for your help and suggestions.
~bob