Re: [rdiff-backup-users] Regression errors
From: Bob Mead
Subject: Re: [rdiff-backup-users] Regression errors
Date: Mon, 20 Apr 2009 15:42:30 -0700
User-agent: Thunderbird 2.0.0.18 (X11/20081125)
Hi Maarten:
Thanks for your message - much to think about!
Maarten Bezemer wrote:
Hi,
Maybe a little late, but here goes.
On Tue, 31 Mar 2009, Bob Mead wrote:
The 'BUG locked Processor' error was a long time ago and, according to
an article Andrew directed me to, it was due to a problem with ubuntu
8.04. At the time, I ran memtest for some hours on the server that
produced the error, and it never failed or found any errors. I have not
seen that particular error since then, and as a result I am no longer
using ubuntu 8.04.
Depending on the amount of memory in the machine, 'some hours' may or
may not have been enough to find certain errors. I've seen machines
throwing up only 1 error in a 12-hour run of memmxtest (and again only
1 error in two repeated 24-hour runs), so that error was consistent
but not triggered easily.
So, if you have the opportunity to do more extensive tests (e.g. over
the weekend), please do, just to be sure.
Third, I re-read some of your emails about your situation and what
you've been trying to do. Having missing metadata files also might
indicate hardware problems. Or maybe it's something related to
kernel versions and data corruption on your file systems. Either
way, it's pretty bad.
I do not have missing metadata files that I know of. I mis-typed
"current-metadata" files for "current-/mirror"/ files in my most
recent post. At Andrew's suggestion, I had adjusted the Current
Mirror file to indicate a prior time to 'fool' rdiff into believing
that it had not already run. When I did this (by renaming the file
with an earlier date), rdiff did run, but complained about not
finding the metadata files and said that it would use the filesystem
instead. The backup has not run properly since then.
I don't know exactly what happens when you fool rdiff-backup like
that. If it uses the 'current-mirror' marker as "the timestamp
indicated in the current-mirror marker is taken as 'now', and all
files found in the tree should match this 'now'", then you could very
well break things if a subsequent (possibly unfinished) rdiff-backup
run changed the files. In that case, mirror_metadata wouldn't match
the real file contents. Also, applying reverse-diffs to a different
version of a file than the one they were built for could break things
badly.
When I look at the source, it is not clear to me which is the case.
Maybe someone with more extensive experience with the sources can
comment on this?
I have since moved this data set to a new repo. It has been working fine
in its new home. As my scripts remove all increments older than two
months, I will wait another few weeks and then delete the original repo
and its now [probably] hopelessly broken data set.
Problem #1:
Origin/source server: Linux 2.6.7-gentoo-r5 #2 SMP Wed Nov 30
12:40:39 PST 2005 i686 Intel(R) Pentium(R) 4 CPU 3.06GHz GenuineIntel
GNU/Linux.
This is a bit ancient. However, I didn't find any reports on known
bugs in this version causing memory or filesystem corruption.
Destination/backup server: Linux 2.6.15-51-amd64-server #1 SMP Tue
Feb 12 17:08:38 UTC 2008 x86_64 GNU/Linux
Problem #2:
Origin/source server: Linux 2.6.27-11-server #1 SMP Thu Jan 29
20:19:41 UTC 2009 i686 GNU/Linux.
Destination/backup server: Linux 2.6.27-11-server #1 SMP Thu Jan 29
20:19:41 UTC 2009 i686 GNU/Linux
These are fairly recent kernels. As far as my information goes, there
was a known bug in 2.6.27 prior to 2.6.27.10 related to file locking.
I'm not sure if this was fixed in your 2.6.27-11 build (2.6.27-11 not
being the same as 2.6.27.11). If you're using a current ubuntu release
and have the latest kernel available for that release, you should be OK.
The last two are both fresh ubuntu 8.10 installs [using the default
kernel supplied]. The older kernel [2.6.15-51] is undoubtedly the
default kernel supplied with that distro. So it sounds like there are no
kernel issues known at this time.
You wrote earlier that upgrading or doing just anything with the
server running rdiff-backup 1.0.4/1.0.5 is out of the question
because of lack of resources. An alternative might be to first use
rsync to synchronise your data to another server, and then use
rdiff-backup from there. That gives you the opportunity to "play
around" with different rdiff-backup versions without risking a
"total breakdown" of the primary server.
Again, lack of resources prevents me from doing this on a network
wide basis. I don't have any spare servers to rsync to and the time
it would take to do that and then try to rdiff that result somewhere
else is beyond the carrying capacity of our network and/or available
times/bandwidths. I am actually working on a buildout of additional
servers for placement at each remote site which will act as local
backups and I will be doing exactly that (rsync to that new local
machine and then rdiff from there to the backup server) however that
project may take some months to complete.
Well, it seems that (at this time at least) you have a 'somewhat'
broken backup system. Some would say a broken backup system is worse
than no such system at all (since having one makes people believe the
data is safe and all). So, if that's fine with your boss then you're
out of luck. Otherwise this might be a perfect reason to have some
additional resources assigned to your work. It's just a matter of how
valuable the data is, and what the consequences are when it is lost.
Would you be fired, or would the blame be on your boss.. ;-)
I'll get back to this below..
It's more of 'how do we provide the best solution we can with the
resources we have at hand'. Yes, the data is important; no one will be
fired if it gets lost; and most importantly, there are no additional
resources. So I have to make do with what I've got. I agree that a
broken backup system is less than ideal - hence it's been handed to me
as job #1 to make it work.
I do have the all new site-backup-servers deployed now and I rsync each
of the site-servers to their respective site-backup-servers daily.
Since all of the new site-backup-servers are ubuntu 8.10 installs and
the 'new' backup server is also an 8.10 install, I am hoping to move all
the rdiff backups to use the new servers (all of which run v1.1.16
included as part of the 8.10 repository). It is my hope that running one
version of rdiff will simplify things to some degree.
Based on my traceback results, have the regressions actually failed?
All I see are messages about regressing the destination now. There
never seems to be any message about what happens after that.
There's always the --check-destination-dir switch you can run locally
on the backup server, to see if the backup tree needs to be cleaned
up. The regression is done automatically at start of a normal backup
run when rdiff-backup finds an unclean backup tree, but running
rdiff-backup --check-destination-dir does only just the cleanup. You
might want to try that.
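Run locally on the backup server, that would look something like this
(the repository path here is just an example):
# rdiff-backup -v5 --check-destination-dir /path/to/backup-repo
The -v5 just raises the verbosity so you can see what it decides to do.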
I tried this, with varying results. Some of my most recent problem
backups (2 to be exact) returned a 'crc-check error'. One other returned
with 'OK'. And still another returned with:
Traceback (most recent call last):
  File "/usr/bin/rdiff-backup", line 23, in ?
    rdiff_backup.Main.Main(sys.argv[1:])
  File "/usr/lib/python2.4/site-packages/rdiff_backup/Main.py", line 285, in Main
    take_action(rps)
  File "/usr/lib/python2.4/site-packages/rdiff_backup/Main.py", line 257, in take_action
    elif action == "check-destination-dir": CheckDest(rps[0])
  File "/usr/lib/python2.4/site-packages/rdiff_backup/Main.py", line 854, in CheckDest
    need_check = checkdest_need_check(dest_rp)
  File "/usr/lib/python2.4/site-packages/rdiff_backup/Main.py", line 890, in checkdest_need_check
    assert len(curmir_incs) == 2, "Found too many current_mirror incs!"
AssertionError: Found too many current_mirror incs!
I tried renaming [separately] both the oldest and newest current_mirror
files in the rdiff-backup-data directory, which then threw errors about
not finding appropriate metadata to regress to. Any ideas on how to
remedy this, short of starting over with a new repo? When I googled for
an answer, I found a thread about using 'rsync with --delete' to remove
extra current_mirror files - is there a way to do this with rdiff?
If you're running a recent version of rdiff-backup, you could also try
the --verify switch to see if the files in the backup repo match the
checksum recorded in the metadata file.
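For a repo on the backup server itself, that would be something along
these lines (the path is just an example):
# rdiff-backup --verify /path/to/backup-repo
which checks the current mirror against the checksums stored in the
metadata, assuming the version in use actually records them.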
Does v1.1.16 support the --verify option?
On the other hand, you once mentioned that one of the servers had a
clock that was way off. Only recently I saw something on this mailing
list about calculations that used clocks from both sides when they
should have used the clock on only one side. Maybe you ran into a
similar issue that screwed up your repo?
If you insist on trying to fix this "the software way", I have a
suggestion for you. The second problem in your email talks about a
23-hour run of rdiff-backup. Given the size of the backup, I'd say this
was an initial run and there aren't hundreds of increments in play
here?
From my original post (below): "This backup data (241GB) set took
several tries to get to run properly, however it did complete
successfully on 3/23 (after running for 23 hours to complete)".
Perhaps this is not as clear as I thought. Yes, this is the initial
run and no there are not any increments.
Your wording here leads me to believe that you think this is an
erroneous question, perhaps one that ought not to be answered, at
least here, or by you. I am not 'insisting' on anything. I asked the
list for help on two particular problems I am having - nothing more.
If it turns out that it is not the case that either problem I am
having has anything at all to do with software, I am more than happy
to look elsewhere to solve the problems. I wish I had the experience
to see the 'CRC check failed' and immediately go to 'hardware issue'.
Unfortunately, I don't. So I ask questions. I apologize if my asking
has upset you.
I'm not upset, although my wording could have been a bit unfriendly.
There are a number of things you can try here. Given the fact that it
is a large amount of data, we can use it to at least detect some
hardware problems.
For example, try this:
# cd /path/to/dataset/location
# find . -type f -print0 | sort -z | xargs -0r md5sum > /tmp/md5sum_run1
# find . -type f -print0 | sort -z | xargs -0r md5sum > /tmp/md5sum_run2
# md5sum /tmp/md5sum_run*
And check that both /tmp/md5sum_run* files have the same checksum.
They should have, if there's no rdiff-backup process running.
If the checksums don't match, try:
# diff -u /tmp/md5sum_run1 /tmp/md5sum_run2 | less
And look for the differences. Maybe just one line, maybe a lot of lines.
Do these tests both on the source and on the backup machines.
I will add this to run as a script after the rsync commands in the
nightly synchronization process at the source-backup servers. Depending
on the output of that, I can then try the diff step to see what
changed. Running this on the source would present a greater challenge,
as the data set is comprised of /home, /var, /etc, and /root with some
exclusions.
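Something like this rough sketch is what I have in mind (the paths and
mail address are placeholders, not the real layout):
#!/bin/sh
# run after the nightly rsync: checksum the repo twice, mail any difference
DATASET=/path/to/rsynced-data
TODAY=`date +%Y%m%d`
cd "$DATASET" || exit 1
find . -type f -print0 | sort -z | xargs -0r md5sum > /tmp/md5sum_${TODAY}_run1
find . -type f -print0 | sort -z | xargs -0r md5sum > /tmp/md5sum_${TODAY}_run2
if ! cmp -s /tmp/md5sum_${TODAY}_run1 /tmp/md5sum_${TODAY}_run2; then
    diff -u /tmp/md5sum_${TODAY}_run1 /tmp/md5sum_${TODAY}_run2 | \
        mail -s "checksum mismatch on `hostname`" admin@example.com
fi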
I've seen cases where some combinations of chipsets, processors and
memory chips go weird. For example, a mainboard based on a Via KT400a
chipset, an FSB266 processor and DDR400 RAM modules: memtest didn't
find any problems, and file checksums were usually right, but about 1
out of 20 times they didn't match. Your 200+ GB dataset is likely to
show these problems in two runs, but you are of course free to do more
tests, creating /tmp/md5sum_run3, etc.
I found that clocking the ram at 133MHz instead of 200 (i.e., matching
ram speed to fsb speed) made the system stable.
Depending on how fast and how often the contents of your dataset
change, you could also compare the source and backup /tmp/md5sum_run1
files. When the data changes often, this might be a bit pointless, but
see below.
If so, could you try rsync with the --checksum argument to synchronise
the backup to the source, and see whether files get updated that should
not have changed, judging by their modification time stamps? If you see
such files, then you're probably just out of luck and need some
hardware replaced - either in your computers or in the networking
equipment.
Since this is the initial run, there are only files that have changed
(all of them) in the repo. I guess I'm not clear on what you're
wanting to see here. If I rsync the repo as is, to the source I'm
going to see what? Since there is only one backup, and it is the
initial run, how will rsyncing that run back to the source files tell
me about changed files?
I wasn't entirely clear on this. Normally, rsync bases its decision to
sync file contents only on file modification timestamps and sizes. So,
files that are corrupted but have the same size and timestamps will
not get 'repaired'. When you add the --checksum argument, all files
will get checksummed to see if they still match.
If you have files in your repo that are not supposed to change often,
but are updated when you run rsync with the --checksum argument, this
can point to problems. Either with the way they were transferred
initially, or with the hardware.
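A dry run first only lists what would change, for example (paths and
hostname are placeholders):
# rsync -aHin --checksum /path/to/rsynced-data/ source-server:/path/to/data/
The -i (itemize) and -n (dry-run) options make rsync report the files
it would update without actually touching anything.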
If you don't see any unchanged files being updated, then we're left
with the question why rdiff-backup sees a failed CRC checksum. If
you didn't mess with metadata files on the given repository, we're
looking at some data corruption issue.
I haven't messed with any metadata files. The source data is rsynced
daily from the server that it is replacing (new-server runs rsync -aH
at 11pm daily to synchronize with old-server). Then that rsynced data
set is rdiff'd to the backup server (new server pushes rdiff-backup
at 4pm daily). I purposely have the rdiff sessions start before the
rsync sessions to allow rsync to run overnight before the next day's
rdiff. Perhaps the data is being corrupted by the rsync process?
Now the situation is getting more clear to me. What I understand is
that you have:
1) source-server:/path/to/data
2) backup-server:/path/to/rsynced-data
3) backup-server:/path/to/rdiff-backup-tree
And you use rsync to sync 1) to 2) and then rdiff-backup to sync 2) to
3).
Meaning that on the backup-server you have the dataset twice, once in
/path/to/rsynced-data and once in /path/to/rdiff-backup-tree, and these
locations are not shared.
You are close. In actual fact, I have 1) as you have described, and 2)
[which I'll call the 'source-backup-server' as per your naming
convention], and I do use rsync to sync these two. Then I have 3) as
you describe, although it's a different physical machine [and in a
different location] from 1) or 2) [I have 10 separate sites, each with
both a 'source' and a 'source-backup' server]. I currently use rdiff to
back up from the 'source' servers to the backup server. I am hoping to
migrate to using rdiff to back up [to 3)] the data synced to 2)
[source-backup servers], but I haven't been able to implement that yet.
In that case, you could schedule a find|sort|xargs md5sum thing at the
source-server and at the backup-server right after the rsync run
finishes. Given the time, I'd expect the data usually doesn't change
during the nights. Then, try to compare these md5sums files and see if
they differ: they shouldn't.
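For example, copy one file over and diff them on the backup server
(hostname and paths are just placeholders):
# scp source-server:/tmp/md5sum_run1 /tmp/md5sum_source
# diff -u /tmp/md5sum_source /tmp/md5sum_run1 | less
Any lines that differ point at files that don't match between the two
machines.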
As an aside, even if you don't want to rebuild your servers, there
still are some ways to compile a new version of rdiff-backup. I had
to do this once for some clients that didn't want to upgrade from
1.2.2 to 1.2.5 just yet. It turned out to be relatively easy to
install python2.4 + librsync + rdiff-backup in my own home
directory, and have multiple versions in active use by not using the
standard python site-packages location but setting some environment
variables.
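Roughly, such a home-directory install has the following shape; the
version numbers, the ~/local prefix and the --librsync-dir option are
from memory and purely illustrative, not a tested recipe:
$ cd ~/src/librsync-0.9.7
$ ./configure --prefix=$HOME/local && make && make install
$ cd ~/src/rdiff-backup-1.2.8
$ python2.4 setup.py install --prefix=$HOME/local --librsync-dir=$HOME/local
$ export PYTHONPATH=$HOME/local/lib/python2.4/site-packages
$ export PATH=$HOME/local/bin:$PATH
$ rdiff-backup --version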
I am having enough troubles getting the versions I have to work
successfully. None of the errors I am seeing have ever been described
as "fixed, upgrade and you will not see these any more". I have seen
only one problem that Andrew described as giving a better message in
newer versions.
If we don't get any further with the suggestions above, would you
consider trying a new version of rdiff-backup if I provide you with a
recipe to build it, separated from the normal rdiff-backup package?
I'd be willing to help you with that, just to see what we can find.
But first, try the suggestions above, maybe we can resolve the issue
without it.
Do you think that the older versions of rdiff that I use currently
(v1.0.4 and 1.0.5) are in any way causing the errors I am seeing? No one
has previously indicated that it is the software version(s) I am using
that are the cause of the error(s).
Thanks for your help and suggestions.
~bob