
Re: [rdiff-backup-users] Regression errors


From: Maarten Bezemer
Subject: Re: [rdiff-backup-users] Regression errors
Date: Fri, 27 Mar 2009 20:42:39 +0100 (CET)

Hi Bob,

First, let me say that your situation is not quite like mine. I run rdiff-backup from the backup server, so I'm doing "pull style" backups instead of your "push style". Also, I run rdiff-backup as a normal user on the backup side and as root on the source side (since I obviously cannot read each user's files as non-root). Since rdiff-backup records file metadata separately, I don't need root on the backup server.
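For reference, a pull-style run from the backup server looks roughly like this (host name and paths are just placeholders, not your actual setup):

  rdiff-backup root@sourcehost::/home /srv/backups/sourcehost/home

The process on the backup side runs as a normal user; only the rdiff-backup that gets started over ssh on sourcehost runs as root.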

Second, I looked back in my mail archives and found this:
When I call rdiff using this cmd, it locks up the destination server
(console showed 'BUG locked Processor 1 for 11s' messages).
I unfortunately cannot find any document describing this error message. However, it's probably a kernel message. If so, it might indicate that some of your hardware is dying. Failed CRC checksums mostly come from broken hardware (RAM, CPU, hard drives, hard drive cables, or power supply, to name just a few). There is nothing you can do about that with software.

Third, I re-read some of your emails about your situation and what you've been trying to do. Missing metadata files might also indicate hardware problems. Or maybe it's something related to kernel versions and data corruption on your file systems. Either way, it's pretty bad. Before going any further, please make sure you're using reliable hardware on all servers. Check for leaking capacitors on your mainboards and inside the power supplies. Next, get a memory testing program (memtest86+ or memmxtest) and let it run overnight. (I say overnight because it needs to run for at least a few hours and the server needs to be taken down, which usually isn't possible during daytime hours.)
Hard disk diagnostic tools might also come in handy.
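If you have smartmontools installed, something along these lines gives a quick health overview and kicks off a long self-test (run as root; the device name is just an example, adjust for your disks):

  smartctl -a /dev/sda
  smartctl -t long /dev/sda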
If everything turns out to be OK, we can start suspecting software bugs.
Oh, by the way, could you give us the kernel versions of the machines you're using? (Copy/paste the output of "uname -a".) Some kernel versions are known to cause data corruption on certain file system types.


You wrote earlier that upgrading or doing just anything with the server running rdiff-backup 1.0.4/1.0.5 is out of the question because of lack of resources. An alternative might be to first use rsync to synchronise your data to another server, and then use rdiff-backup from there. That gives you the opportunity to "play around" with different rdiff-backup versions without risking a "total breakdown" of the primary server.
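A rough sketch of what I mean, with host names and paths made up for the example:

  # on the primary server: mirror the data to a staging machine over ssh
  rsync -aH --delete /home/ backupuser@staging:/srv/mirror/home/

  # on the staging machine: run whichever rdiff-backup version you want to test
  rdiff-backup /srv/mirror/home root@backupserver::/home/backups/staging-test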

The things you wrote made me a bit nervous. Like this: "The work around seemed to be the renaming of the current meta-data file to a time prior to the next run of rdiff." Doing such things is very likely to screw up any repository... in particular, regressions to previous states WILL break when the metadata files are messed up. I've been using rdiff-backup for years now, and not a single time did a regress fail on me. And yes, I've had rdiff-backup regress my repo quite often, since ADSL links haven't always been as stable as they are today. Also, I never had to do special things to metadata timestamps or anything like that. On the other hand, you once mentioned that one of the servers had a clock that was way off. Only recently I saw something on this mailing list about calculations that used the clocks of both sides when they should have used the clock of only one side. Maybe you ran into a similar issue that screwed up your repo?
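It might be worth simply comparing the clocks on both ends while you're at it, e.g. (host name is a placeholder):

  date; ssh root@backupserver date

If those differ by more than a few seconds, I'd fix that (ntp) before trusting any timestamp-based logic.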


If you insist on trying to fix this "the software way", I have a suggestion for you. The second problem in your email talks about a 23-hour run of rdiff-backup. Given the size of the backup, I'd say that was the initial run and there aren't hundreds of increments in play here? If so, could you try rsync with the --checksum argument to compare the backup against the source and see whether files get updated that, judging by their modification time stamps, should not have changed. If you see such files, then you're probably just out of luck and need some hardware replaced, either in your computers or in the networking equipment. If you don't see any unchanged files being updated, then we're left with the question of why rdiff-backup sees a failed CRC checksum. If you didn't mess with the metadata files of this repository, we're looking at some data corruption issue.
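Concretely, I'm thinking of a checksum-based dry run along these lines (paths are placeholders; -n means nothing actually gets transferred):

  rsync -n -a -c --itemize-changes /home/ root@backupserver:/home/backups/dor/home/

Any file listed there with a checksum difference but an unchanged modification time is suspicious.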

As an aside, even if you don't want to rebuild your servers, there are still ways to compile a new version of rdiff-backup. I had to do this once for some clients that didn't want to upgrade from 1.2.2 to 1.2.5 just yet. It turned out to be relatively easy to install python2.4 + librsync + rdiff-backup in my own home directory, and to keep multiple versions in active use by not using the standard python site-packages location but setting a few environment variables instead.
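In case it's useful, the recipe was roughly this (prefix, versions and paths are from memory, so treat it as a sketch):

  # build python and librsync into a private prefix
  ./configure --prefix=$HOME/local && make && make install

  # in the unpacked rdiff-backup tarball: install it against that python
  $HOME/local/bin/python setup.py install --prefix=$HOME/local

  # make the shell and python pick up the private copies first
  export PATH=$HOME/local/bin:$PATH
  export PYTHONPATH=$HOME/local/lib/python2.4/site-packages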


I hope I gave you enough pointers to work with for now. Please report back to the list if you have any news.

Regards,
 Maarten


On Thu, 26 Mar 2009, Bob Mead wrote:

Hello all:
I have a series of rdiff-backups that run every day to back up 10 remote sites and a total of 14 different servers. It seems that each day, at least one of the backups fails. I have been working at getting these to run flawlessly for 6 months, but it seems beyond my grasp. In the last week, I thought I was hot on the trail of a 'perfect run', but now I'm not so sure. For the past few days I have been having trouble with the same two servers' backups. These are push type backups (as are all my backup jobs), with the remote servers running backup scripts that rdiff to, in this case, two different destination/backup servers.

In the first case: older Gentoo Linux system (running v1.0.5, dest. has v1.0.4), the following commands:

rdiff-backup --force --print-statistics --include /etc --include /home --include /var --include /root --exclude / / root@<servername>::/home/backups/dor
rdiff-backup --force --remove-older-than 2M root@<servername>::/home/backups/dor

(I added the --force option to test whether that would clear up the regression problem; it didn't)

returned this as output:
Previous backup seems to have failed, regressing destination now.
Traceback (most recent call last):
File "/usr/bin/rdiff-backup", line 23, in <module>
  rdiff_backup.Main.Main(sys.argv[1:])
File "/usr/local/lib/python2.5/site-packages/rdiff_backup/Main.py", line 285, in Main
  take_action(rps)
File "/usr/local/lib/python2.5/site-packages/rdiff_backup/Main.py", line 255, in take_action
  elif action == "backup": Backup(rps[0], rps[1])
File "/usr/local/lib/python2.5/site-packages/rdiff_backup/Main.py", line 299, in Backup
  backup_final_init(rpout)
File "/usr/local/lib/python2.5/site-packages/rdiff_backup/Main.py", line 396, in backup_final_init
  checkdest_if_necessary(rpout)
File "/usr/local/lib/python2.5/site-packages/rdiff_backup/Main.py", line 911, in checkdest_if_necessary
  dest_rp.conn.regress.Regress(dest_rp)
File "/usr/local/lib/python2.5/site-packages/rdiff_backup/connection.py", line 445, in __call__
  return apply(self.connection.reval, (self.name,) + args)
File "/usr/local/lib/python2.5/site-packages/rdiff_backup/connection.py", line 367, in reval
  if isinstance(result, Exception): raise result
IOError: [Errno None] None: None
Traceback (most recent call last):
File "/usr/bin/rdiff-backup", line 23, in ?
  rdiff_backup.Main.Main(sys.argv[1:])
File "/usr/lib/python2.4/site-packages/rdiff_backup/Main.py", line 285, in Main
  take_action(rps)
File "/usr/lib/python2.4/site-packages/rdiff_backup/Main.py", line 253, in take_action
  connection.PipeConnection(sys.stdin, sys.stdout).Server()
File "/usr/lib/python2.4/site-packages/rdiff_backup/connection.py", line 352, in Server
  self.get_response(-1)
File "/usr/lib/python2.4/site-packages/rdiff_backup/connection.py", line 314, in get_response
  try: req_num, object = self._get()
File "/usr/lib/python2.4/site-packages/rdiff_backup/connection.py", line 230, in _get
  raise ConnectionReadError("Truncated header string (problem "
rdiff_backup.connection.ConnectionReadError: Truncated header string (problem probably originated remotely)

At some point recently (3/20), this backup worked. Then it started to fail, throwing 'regressing destination' errors on every run since. This is the same backup I posted about recently, where I had to 'pull the wool' over rdiff's eyes because of a server date malfunction. The workaround seemed to be renaming the current meta-data file to a time prior to the next run of rdiff. That seemed to work in that it didn't complain about too many current mirror files, but it did make rdiff unable to 'see' the metadata file and therefore fall back to the filesystem. Perhaps these problems are related? If so, any ideas on how to get it working again would be greatly appreciated. There should be two months of increments stored in the repository, so I don't want to lose those by starting over.

The second failed backup is a brand new install of ubuntu 8.10 running rdiff v1.1.16, pushing backups to another fresh 8.10 install also running rdiff v1.1.16, using the following commands:

rdiff-backup --force --print-statistics --exclude-special-files --include /etc --include /home --include /var/www --exclude /var --include /root --exclude / / root@<servername2>::/home/backups/images2
rdiff-backup --force --remove-older-than 2M root@<servername2>::/home/backups/images2

(again, I added the --force options to see if it would not regress...)

returned this output:
Previous backup seems to have failed, regressing destination now.
Exception 'CRC check failed' raised of class '<type 'exceptions.IOError'>':
File "/var/lib/python-support/python2.5/rdiff_backup/Main.py", line 302, in error_check_Main
  try: Main(arglist)
File "/var/lib/python-support/python2.5/rdiff_backup/Main.py", line 322, in Main
  take_action(rps)
File "/var/lib/python-support/python2.5/rdiff_backup/Main.py", line 278, in take_action
  elif action == "backup": Backup(rps[0], rps[1])
File "/var/lib/python-support/python2.5/rdiff_backup/Main.py", line 341, in Backup
  backup.Mirror_and_increment(rpin, rpout, incdir)
File "/var/lib/python-support/python2.5/rdiff_backup/backup.py", line 51, in Mirror_and_increment
  DestS.patch_and_increment(dest_rpath, source_diffiter, inc_rpath)
File "/var/lib/python-support/python2.5/rdiff_backup/connection.py", line 447, in __call__
  return apply(self.connection.reval, (self.name,) + args)
File "/var/lib/python-support/python2.5/rdiff_backup/connection.py", line 369, in reval
  if isinstance(result, Exception): raise result

Traceback (most recent call last):
File "/usr/bin/rdiff-backup", line 23, in <module>
  rdiff_backup.Main.error_check_Main(sys.argv[1:])
File "/var/lib/python-support/python2.5/rdiff_backup/Main.py", line 302, in error_check_Main
  try: Main(arglist)
File "/var/lib/python-support/python2.5/rdiff_backup/Main.py", line 322, in Main
  take_action(rps)
File "/var/lib/python-support/python2.5/rdiff_backup/Main.py", line 278, in take_action
  elif action == "backup": Backup(rps[0], rps[1])
File "/var/lib/python-support/python2.5/rdiff_backup/Main.py", line 341, in Backup
  backup.Mirror_and_increment(rpin, rpout, incdir)
File "/var/lib/python-support/python2.5/rdiff_backup/backup.py", line 51, in Mirror_and_increment
  DestS.patch_and_increment(dest_rpath, source_diffiter, inc_rpath)
File "/var/lib/python-support/python2.5/rdiff_backup/connection.py", line 447, in __call__
  return apply(self.connection.reval, (self.name,) + args)
File "/var/lib/python-support/python2.5/rdiff_backup/connection.py", line 369, in reval
  if isinstance(result, Exception): raise result
IOError: CRC check failed
Fatal Error: Lost connection to the remote system

Seems like the last line is a big issue. Is there any further description to be had for the lost connection error? I've tried running rdiff with both the -v5 and -v7 levels, but neither seemed to give me any more info on the lost connection error, and the error recurs on each successive run. This backup data set (241GB) took several tries to get running properly, but it did complete successfully on 3/23 (after running for 23 hours). Since then it has thrown the 'previous backup seems to have failed, regressing destination' error each time. I have the network almost to myself this week, so there's not a lot of extra traffic impeding packet flow and no obvious reason for a lost connection error (i.e. the link has not seemed to go down, at least not that cacti or nagios noticed).

Thanks in advance for any help on either of these.
  ~bob






