
Re: [rdiff-backup-users] Regression errors


From: Maarten Bezemer
Subject: Re: [rdiff-backup-users] Regression errors
Date: Fri, 27 Mar 2009 20:42:39 +0100 (CET)

Hi Bob,

First, let me say that your situation is not quite like mine. I run rdiff-backup from the backup server, so I'm doing "pull style" backups instead of your "push style". Also, I run rdiff-backup as a normal user on the backup side and as root on the source side (since I obviously cannot read each user's files as non-root). Since rdiff-backup records file metadata separately, I don't need root on the backup server.
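For reference, a pull-style run from the backup server looks roughly like this (host name and paths are just placeholders, not your actual setup):

  rdiff-backup root@sourcehost::/home /srv/backups/sourcehost/home

The process on the backup side runs as a normal user; only the rdiff-backup that gets started over ssh on sourcehost runs as root.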

Second, I looked back in my mail archives and found this:
When I call rdiff using this cmd, it locks up the destination server
(console showed 'BUG locked Processor 1 for 11s' messages).
I unfortunately cannot find any document describing this error message. However, it's probably a kernel message. If so, it might indicate that some of your hardware is dying. Failed CRC checksums mostly come from broken hardware (RAM, CPU, hard drives, hard drive cables, or power supply, to name just a few). There is nothing you can do about that with software.

Third, I re-read some of your emails about your situation and what you've been trying to do. Missing metadata files might also indicate hardware problems. Or maybe it's something related to kernel versions and data corruption on your file systems. Either way, it's pretty bad. Before going any further, please make sure you're using reliable hardware on all servers. Check for leaking capacitors on your mainboards and inside the power supplies. Next, get a memory testing program (memtest86+ or memmxtest) and let it run overnight. (I say overnight because it needs to run for at least a few hours and the server needs to be taken down, which usually isn't possible during daytime hours.)
Hard disk diagnostic tools might also come in handy.
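If you have smartmontools installed, something along these lines gives a quick health overview and kicks off a long self-test (run as root; the device name is just an example, adjust for your disks):

  smartctl -a /dev/sda
  smartctl -t long /dev/sda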
If everything turns out to be OK, we can start suspecting software bugs.
Oh, by the way, could you give us the kernel versions of the machines you're using? (Copy/paste the output of "uname -a".) Some kernel versions are known to cause data corruption on certain file system types.


You wrote earlier that upgrading or doing just anything with the server running rdiff-backup 1.0.4/1.0.5 is out of the question because of lack of resources. An alternative might be to first use rsync to synchronise your data to another server, and then use rdiff-backup from there. That gives you the opportunity to "play around" with different rdiff-backup versions without risking a "total breakdown" of the primary server.
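A rough sketch of what I mean, with host names and paths made up for the example:

  # on the primary server: mirror the data to a staging machine over ssh
  rsync -aH --delete /home/ backupuser@staging:/srv/mirror/home/

  # on the staging machine: run whichever rdiff-backup version you want to test
  rdiff-backup /srv/mirror/home root@backupserver::/home/backups/staging-test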

The things you wrote made me a bit nervous. Like this: "The work around seemed to be the renaming of the current meta-data file to a time prior to the next run of rdiff." Doing such things is very likely to screw up any repository... in particular, regressions to previous states WILL break when the metadata files are messed up. I've been using rdiff-backup for years now, and not a single time did a regress fail on me. And yes, I've had rdiff-backup regress my repo quite often, since ADSL links haven't always been as stable as they are today. Also, I never had to do special things to metadata timestamps or anything like that. On the other hand, you once mentioned that one of the servers had a clock that was way off. Only recently I saw something on this mailing list about calculations that used the clocks of both sides when they should have used the clock of only one side. Maybe you ran into a similar issue that screwed up your repo?
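It might be worth simply comparing the clocks on both ends while you're at it, e.g. (host name is a placeholder):

  date; ssh root@backupserver date

If those differ by more than a few seconds, I'd fix that (ntp) before trusting any timestamp-based logic.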


If you insist on trying to fix this "the software way", I have a suggestion for you. The second problem in your email talks about a 23-hour run of rdiff-backup. Given the size of the backup, I'd say that was the initial run and there aren't hundreds of increments in play here? If so, could you try rsync with the --checksum argument to compare the backup against the source and see whether files get updated that, judging by their modification time stamps, should not have changed. If you see such files, then you're probably just out of luck and need some hardware replaced, either in your computers or in the networking equipment. If you don't see any unchanged files being updated, then we're left with the question of why rdiff-backup sees a failed CRC checksum. If you didn't mess with the metadata files of this repository, we're looking at some data corruption issue.
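Concretely, I'm thinking of a checksum-based dry run along these lines (paths are placeholders; -n means nothing actually gets transferred):

  rsync -n -a -c --itemize-changes /home/ root@backupserver:/home/backups/dor/home/

Any file listed there with a checksum difference but an unchanged modification time is suspicious.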

As an aside, even if you don't want to rebuild your servers, there are still ways to compile a new version of rdiff-backup. I had to do this once for some clients that didn't want to upgrade from 1.2.2 to 1.2.5 just yet. It turned out to be relatively easy to install python2.4 + librsync + rdiff-backup in my own home directory, and to keep multiple versions in active use by not using the standard python site-packages location but setting a few environment variables instead.
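In case it's useful, the recipe was roughly this (prefix, versions and paths are from memory, so treat it as a sketch):

  # build python and librsync into a private prefix
  ./configure --prefix=$HOME/local && make && make install

  # in the unpacked rdiff-backup tarball: install it against that python
  $HOME/local/bin/python setup.py install --prefix=$HOME/local

  # make the shell and python pick up the private copies first
  export PATH=$HOME/local/bin:$PATH
  export PYTHONPATH=$HOME/local/lib/python2.4/site-packages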


I hope I gave you enough pointers to work with for now. Please report back to the list if you have any news.

Regards,
 Maarten


On Thu, 26 Mar 2009, Bob Mead wrote:

Hello all:
I have a series of rdiff-backups that run every day to back up 10 remote sites and a total of 14 different servers. It seems that each day, at least one of the backups fails. I have been working at getting these to run flawlessly for 6 months, but it seems beyond my grasp. In the last week, I thought I was hot on the trail of a 'perfect run', but now I'm not so sure. For the past few days I have been having trouble with the same two servers' backups. These are push type backups (as are all my backup jobs), with the remote servers running backup scripts that rdiff to, in this case, two different destination/backup servers.

In the first case: older Gentoo Linux system (running v1.0.5, dest. has v1.0.4), the following commands:

rdiff-backup --force --print-statistics --include /etc --include /home --include /var --include /root --exclude / / root@<servername>::/home/backups/dor
rdiff-backup --force --remove-older-than 2M root@<servername>::/home/backups/dor

(I added the --force option to test whether that would clear up the regression problem; it didn't)

returned this as output:
Previous backup seems to have failed, regressing destination now.
Traceback (most recent call last):
File "/usr/bin/rdiff-backup", line 23, in <module>
  rdiff_backup.Main.Main(sys.argv[1:])
File "/usr/local/lib/python2.5/site-packages/rdiff_backup/Main.py", line 285, in Main
  take_action(rps)
File "/usr/local/lib/python2.5/site-packages/rdiff_backup/Main.py", line 255, in take_action
  elif action == "backup": Backup(rps[0], rps[1])
File "/usr/local/lib/python2.5/site-packages/rdiff_backup/Main.py", line 299, in Backup
  backup_final_init(rpout)
File "/usr/local/lib/python2.5/site-packages/rdiff_backup/Main.py", line 396, in backup_final_init
  checkdest_if_necessary(rpout)
File "/usr/local/lib/python2.5/site-packages/rdiff_backup/Main.py", line 911, in checkdest_if_necessary
  dest_rp.conn.regress.Regress(dest_rp)
File "/usr/local/lib/python2.5/site-packages/rdiff_backup/connection.py", line 445, in __call__
  return apply(self.connection.reval, (self.name,) + args)
File "/usr/local/lib/python2.5/site-packages/rdiff_backup/connection.py", line 367, in reval
  if isinstance(result, Exception): raise result
IOError: [Errno None] None: None
Traceback (most recent call last):
File "/usr/bin/rdiff-backup", line 23, in ?
  rdiff_backup.Main.Main(sys.argv[1:])
File "/usr/lib/python2.4/site-packages/rdiff_backup/Main.py", line 285, in Main
  take_action(rps)
File "/usr/lib/python2.4/site-packages/rdiff_backup/Main.py", line 253, in take_action
  connection.PipeConnection(sys.stdin, sys.stdout).Server()
File "/usr/lib/python2.4/site-packages/rdiff_backup/connection.py", line 352, in Server
  self.get_response(-1)
File "/usr/lib/python2.4/site-packages/rdiff_backup/connection.py", line 314, in get_response
  try: req_num, object = self._get()
File "/usr/lib/python2.4/site-packages/rdiff_backup/connection.py", line 230, in _get
  raise ConnectionReadError("Truncated header string (problem "
rdiff_backup.connection.ConnectionReadError: Truncated header string (problem probably originated remotely)

At some point recently (3/20), this backup worked. Then it started to fail, throwing 'regressing destination' errors on every run since. This is the same backup I posted about recently, where I had to 'pull the wool' over rdiff's eyes because of a server date malfunction. The workaround seemed to be renaming the current meta-data file to a time prior to the next run of rdiff. That seemed to work in that it didn't complain about too many current mirror files, but it did make rdiff unable to 'see' the metadata file and therefore fall back to the filesystem. Perhaps these problems are related? If so, any ideas on how to get it working again would be greatly appreciated. There should be two months of increments stored in the repository, so I don't want to lose those by starting over.

The second failed backup is a brand new install of ubuntu 8.10 running rdiff v1.1.16, pushing backups to another fresh 8.10 install also running rdiff v1.1.16, using the following commands:

rdiff-backup --force --print-statistics --exclude-special-files --include /etc --include /home --include /var/www --exclude /var --include /root --exclude / / root@<servername2>::/home/backups/images2
rdiff-backup --force --remove-older-than 2M root@<servername2>::/home/backups/images2

(again, I added the --force options to see if it would not regress...)

returned this output:
Previous backup seems to have failed, regressing destination now.
Exception 'CRC check failed' raised of class '<type 'exceptions.IOError'>':
File "/var/lib/python-support/python2.5/rdiff_backup/Main.py", line 302, in error_check_Main
  try: Main(arglist)
File "/var/lib/python-support/python2.5/rdiff_backup/Main.py", line 322, in Main
  take_action(rps)
File "/var/lib/python-support/python2.5/rdiff_backup/Main.py", line 278, in take_action
  elif action == "backup": Backup(rps[0], rps[1])
File "/var/lib/python-support/python2.5/rdiff_backup/Main.py", line 341, in Backup
  backup.Mirror_and_increment(rpin, rpout, incdir)
File "/var/lib/python-support/python2.5/rdiff_backup/backup.py", line 51, in Mirror_and_increment
  DestS.patch_and_increment(dest_rpath, source_diffiter, inc_rpath)
File "/var/lib/python-support/python2.5/rdiff_backup/connection.py", line 447, in __call__
  return apply(self.connection.reval, (self.name,) + args)
File "/var/lib/python-support/python2.5/rdiff_backup/connection.py", line 369, in reval
  if isinstance(result, Exception): raise result

Traceback (most recent call last):
File "/usr/bin/rdiff-backup", line 23, in <module>
  rdiff_backup.Main.error_check_Main(sys.argv[1:])
File "/var/lib/python-support/python2.5/rdiff_backup/Main.py", line 302, in error_check_Main
  try: Main(arglist)
File "/var/lib/python-support/python2.5/rdiff_backup/Main.py", line 322, in Main
  take_action(rps)
File "/var/lib/python-support/python2.5/rdiff_backup/Main.py", line 278, in take_action
  elif action == "backup": Backup(rps[0], rps[1])
File "/var/lib/python-support/python2.5/rdiff_backup/Main.py", line 341, in Backup
  backup.Mirror_and_increment(rpin, rpout, incdir)
File "/var/lib/python-support/python2.5/rdiff_backup/backup.py", line 51, in Mirror_and_increment
  DestS.patch_and_increment(dest_rpath, source_diffiter, inc_rpath)
File "/var/lib/python-support/python2.5/rdiff_backup/connection.py", line 447, in __call__
  return apply(self.connection.reval, (self.name,) + args)
File "/var/lib/python-support/python2.5/rdiff_backup/connection.py", line 369, in reval
  if isinstance(result, Exception): raise result
IOError: CRC check failed
Fatal Error: Lost connection to the remote system

Seems like the last line is a big issue. Is there any further description to be had for the lost connection error? I've tried running rdiff with both the -v5 and -v7 levels, but neither seemed to give me any more info on the lost connection error, and the error recurs on each successive run. This backup data set (241GB) took several tries to get running properly, but it did complete successfully on 3/23 (after running for 23 hours). Since then it has thrown the 'previous backup seems to have failed, regressing destination' error each time. I have the network almost to myself this week, so there's not a lot of extra traffic impeding packet flow and no obvious reason for a lost connection error (i.e. the link has not seemed to go down, at least not that cacti or nagios noticed).

Thanks in advance for any help on either of these.
  ~bob






