Re: [rdiff-backup-users] Regression errors
From: Maarten Bezemer
Subject: Re: [rdiff-backup-users] Regression errors
Date: Wed, 15 Apr 2009 11:29:59 +0200 (CEST)
Hi,
Maybe a little late, but here goes.
On Tue, 31 Mar 2009, Bob Mead wrote:
The BUG locked Processor error was a long time ago and, according to an
article Andrew directed me to, it was due to a problem with Ubuntu 8.04. At
the time, I did run memtest for some hours on the server that produced the
error, and it never failed or found any errors. I have not seen that
particular error since then, and as a result I am no longer using Ubuntu 8.04.
Depending on the amount of memory in the machine, 'some hours' may or may
not have been enough to find certain errors. I've seen machines throwing
up only 1 error in a 12-hour run of memmxtest (and again only 1 error in
two repeated 24-hour runs), so that error was consistent but not triggered
easily.
So, if you have the opportunity to do more extensive tests (e.g. over the
weekend), please do, just to be sure.
Third, I re-read some of your emails about your situation and what you've
been trying to do. Having missing metadata files also might indicate
hardware problems. Or maybe it's something related to kernel versions and
data corruption on your file systems. Either way, it's pretty bad.
I do not have missing metadata files that I know of. I mistyped
"current-metadata" files for "current-mirror" files in my most recent post.
At Andrew's suggestion, I had adjusted the current-mirror file to indicate an
earlier time, to 'fool' rdiff into believing that it had not already run. When I
did this (by renaming the file with an earlier date), rdiff did run, but
complained about not finding the metadata files and said that it would use
the filesystem instead. The backup has not run properly since then.
I don't know exactly what happens when you fool rdiff-backup like that. If
it uses the 'current-mirror' marker as "the timestamp indicated in the
current-mirror marker is taken as 'now', and all files found in the tree
should match this 'now'", then you could very well break things if a
subsequent (possibly unfinished) rdiff-backup run changed the files. In
that case, mirror_metadata wouldn't match the real file contents. Also,
applying reverse-diffs to a different version of a file than the one they
were built for could screw things up badly.
When I look at the source, it is not clear to me what is the case. Maybe
someone with more extensive experience with the sources can comment on
this?
Problem #1:
Origin/source server: Linux 2.6.7-gentoo-r5 #2 SMP Wed Nov 30 12:40:39 PST
2005 i686 Intel(R) Pentium(R) 4 CPU 3.06GHz GenuineIntel GNU/Linux.
This is a bit ancient. However, I didn't find any reports on known bugs in
this version causing memory or filesystem corruption.
Destination/backup server: Linux 2.6.15-51-amd64-server #1 SMP Tue Feb 12
17:08:38 UTC 2008 x86_64 GNU/Linux
Problem #2:
Origin/source server: Linux 2.6.27-11-server #1 SMP Thu Jan 29 20:19:41 UTC
2009 i686 GNU/Linux.
Destination/backup server: Linux 2.6.27-11-server #1 SMP Thu Jan 29 20:19:41
UTC 2009 i686 GNU/Linux
These are fairly recent kernels. As far as my information goes, there was
a known bug in 2.6.27 prior to 2.6.27.10 related to file locking. I'm not
sure if this was fixed in your 2.6.27-11 build (2.6.27-11 not being the
same as 2.6.27.11). If you're using a current Ubuntu release and have the
latest kernel available for that release, you should be OK.
You wrote earlier that upgrading or doing just anything with the
server running rdiff-backup 1.0.4/1.0.5 is out of the question
because of lack of resources. An alternative might be to first use
rsync to synchronise your data to another server, and then use
rdiff-backup from there. That gives you the opportunity to "play
around" with different rdiff-backup versions without risking a
"total breakdown" of the primary server.
Again, lack of resources prevents me from doing this on a network-wide basis.
I don't have any spare servers to rsync to, and the time it would take to do
that and then rdiff that result somewhere else is beyond the carrying
capacity of our network and/or the available time windows and bandwidth. I am
actually working on a buildout of additional servers for placement at each
remote site, which will act as local backups, and then I will be doing exactly
that (rsync to the new local machine and then rdiff from there to the backup
server); however, that project may take some months to complete.
Well, it seems that (at this time at least) you have a 'somewhat' broken
backup system. Some would say a broken backup system is worse than no
such system at all (since having one makes people believe the data is safe
and all). So, if that's fine with your boss then you're out of luck.
Otherwise this might be a perfect reason to have some additional resources
assigned to your work. It's just a matter of how valuable the data is, and
what the consequences are when it is lost. Would you be fired, or would
the blame be on your boss.. ;-)
I'll get back to this below..
Based on my traceback results, have the regressions actually failed? All I
see is messages about regressing destination now. There doesn't ever seem to
be any message about what happens after that.
There's always the --check-destination-dir switch, which you can run locally
on the backup server to see if the backup tree needs to be cleaned up. The
regression is done automatically at the start of a normal backup run when
rdiff-backup finds an unclean backup tree, but running rdiff-backup
--check-destination-dir does just the cleanup. You might want to try that.
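For example, something like this on the backup server (the repository path
is just a placeholder here):
# rdiff-backup --check-destination-dir /path/to/rdiff-backup-tree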
If you're running a recent version of rdiff-backup, you could also try the
--verify switch to see if the files in the backup repo match the checksums
recorded in the metadata file.
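Again with a placeholder path, that would look something like:
# rdiff-backup --verify /path/to/rdiff-backup-tree
If your version supports it, --verify-at-time can do the same check for
older increments.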
On the other hand, you once mentioned that one of the servers had a clock
that was way off. Only recently I saw something on this mailing list about
calculations that used the clocks of both sides when they should have used
the clock of only one side. Maybe you ran into a similar issue that screwed
up your repo?
If you insist on trying to fix this "the software way", I have a
suggestion for you. The second problem in your email talks
about a 23-hour run of rdiff-backup. Given the size of the backup,
I'd say that this was an initial run and there aren't hundreds
of increments in play here?
From my original post (below): "This backup data (241GB) set took several
tries to get to run properly, however it did complete successfully on 3/23
(after running for 23 hours to complete)". Perhaps this is not as clear as I
thought. Yes, this is the initial run and no there are not any increments.
Your wording here leads me to believe that you think this is an erroneous
question, perhaps one that ought not to be answered, at least here, or by
you. I am not 'insisting' on anything. I asked the list for help on two
particular problems I am having - nothing more. If it turns out that it is
not the case that either problem I am having has anything at all to do with
software, I am more than happy to look elsewhere to solve the problems. I
wish I had the experience to see the 'CRC check failed' and immediately go to
'hardware issue'. Unfortunately, I don't. So I ask questions. I apologize if
my asking has upset you.
I'm not upset, although my wording could have been a bit unfriendly.
There are a number of things you can try here. Given the fact that it is a
large amount of data, we can use it to at least detect some hardware
problems.
For example, try this:
# cd /path/to/dataset/location
# find . -type f -print0 | sort -z | xargs -0r md5sum > /tmp/md5sum_run1
# find . -type f -print0 | sort -z | xargs -0r md5sum > /tmp/md5sum_run2
# md5sum /tmp/md5sum_run*
And check that both /tmp/md5sum_run* files have the same checksum. They
should, as long as no rdiff-backup process is running at the time.
If the checksums don't match, try:
# diff -u /tmp/md5sum_run1 /tmp/md5sum_run2 | less
And look for the differences. Maybe just one line, maybe a lot of lines.
Do these tests both on the source and on the backup machines.
I've seen cases where some combinations of chipsets, processors and memory
chips go weird. For example a mainboard based on a Via KT400a chipset, a
FSB266 processor and DDR400 ram modules. Memtest didn't find any problems,
file checksums were usually right, but about 1 out of 20 times they didn't
match. Your 200+ GB dataset is likely to show these problems in two runs,
but you are of course free to do more tests, creating /tmp/md5sum_run3, etc.
I found that clocking the RAM at 133MHz instead of 200MHz (i.e., matching
RAM speed to FSB speed) made the system stable.
Depending on how fast and how often the contents of your dataset change,
you could also compare the source and backup /tmp/md5sum_run1 files. When
the data changes often, this might be a bit pointless, but see below.
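If you do want to compare them directly over the network, something like
this should work (assuming ssh access, and that both files were created from
the root of the respective trees so the relative paths line up;
'backup-server' is just a placeholder):
# diff -u /tmp/md5sum_run1 <(ssh backup-server cat /tmp/md5sum_run1) | less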
If so, could you try rsync with the --checksum argument to synchronise the
backup to the source, and see if files are being updated that should not
have changed, judging by their modification timestamps? If you see such
files, then you're probably just out of luck and need some hardware
replaced, either in your computers or in the networking equipment.
Since this is the initial run, there are only files that have changed (all of
them) in the repo. I guess I'm not clear on what you're wanting to see here.
If I rsync the repo as is, to the source I'm going to see what? Since there
is only one backup, and it is the initial run, how will rsyncing that run
back to the source files tell me about changed files?
I wasn't entirely clear on this. Normally, rsync bases its decision to
sync file contents only on file modification timestamps and sizes. So,
files that are corrupted but have the same size and timestamps will not
get 'repaired'. When you add the --checksum argument, all files will get
checksummed to see if they still match.
If you have files in your repo that are not supposed to change often, but
are updated when you run rsync with the --checksum argument, this can
point to problems. Either with the way they were transferred initially, or
with the hardware.
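A dry run keeps rsync from modifying anything while still listing the files
whose checksums differ. As a sketch with placeholder paths, run from the
source server (-n for dry-run, -i to itemize the changes):
# rsync -n -i -aH --checksum /path/to/data/ backup-server:/path/to/rsynced-data/
Any file listed there whose size and modification time haven't changed is a
candidate for corruption somewhere along the way.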
If you don't see any unchanged files being updated, then we're left with
the question why rdiff-backup sees a failed CRC checksum. If you didn't
mess with metadata files on the given repository, we're looking at some
data corruption issue.
I haven't messed with any metadata files. The source data is rsynced daily
from the server that it is replacing (new-server runs rsync -aH at 11pm daily
to synchronize with old-server). Then that rsynced data set is rdiff'd to the
backup server (new-server pushes rdiff-backup at 4pm daily). I purposely
have the rdiff sessions start before the rsync sessions, to allow rsync to run
overnight before the next day's rdiff. Perhaps the data is being corrupted by
the rsync process?
Now the situation is getting more clear to me. What I understand is that
you have:
1) source-server:/path/to/data
2) backup-server:/path/to/rsynced-data
3) backup-server:/path/to/rdiff-backup-tree
And you use rsync to sync 1) to 2) and then rdiff-backup to sync 2) to 3).
Meaning that at the backup-server you have two times the dataset, once in
/path/to/rsynced-data and once in /path/to/rdiff-backup-tree, and these
locations are not shared.
In that case, you could schedule a find|sort|xargs md5sum thing at the
source-server and at the backup-server right after the rsync run finishes.
Given the time, I'd expect the data usually doesn't change during the
nights. Then, try to compare these md5sums files and see if they differ:
they shouldn't.
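A rough sketch of how that could be scheduled, with purely hypothetical
times and paths (put the equivalent line, adjusted for the local path, in the
crontab on both servers, some time after the nightly rsync has finished):
30 5 * * * cd /path/to/data && find . -type f -print0 | sort -z | xargs -0r md5sum > /tmp/md5sum_nightly
Then collect /tmp/md5sum_nightly from both machines and diff them as shown
earlier.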
As an aside, even if you don't want to rebuild your servers, there still
are some ways to compile a new version of rdiff-backup. I had to do this
once for some clients that didn't want to upgrade from 1.2.2 to 1.2.5 just
yet. It turned out to be relatively easy to install python2.4 + librsync +
rdiff-backup in my own home directory, and have multiple versions in active
use by not using the standard python site-packages location but setting
some environment variables.
I am having enough troubles getting the versions I have to work successfully.
None of the errors I am seeing have ever been described as "fixed, upgrade
and you will not see these any more". I have seen only one problem that
Andrew described as giving a better message in newer versions.
If we don't get any further with the suggestions above, would you consider
trying a new version of rdiff-backup if I provide you with a recipe to build
it, separate from the normal rdiff-backup package? I'd be willing to help
you with that, just to see what we can find. But first, try the
suggestions above; maybe we can resolve the issue without it.
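In case it helps to see roughly what that recipe would look like, here is a
sketch (version number and prefix are just examples; it assumes librsync and
its development headers are already available so the C extension can build,
and the python2.x part depends on your actual Python version):
# tar xzf rdiff-backup-1.2.8.tar.gz && cd rdiff-backup-1.2.8
# python setup.py install --prefix=$HOME/rdiff-backup-local
# export PYTHONPATH=$HOME/rdiff-backup-local/lib/python2.4/site-packages
# $HOME/rdiff-backup-local/bin/rdiff-backup --version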
Regards,
Maarten