Re: [rdiff-backup-users] atomic increment files?

rdiff-backup-users

From:	Marcel (Felix) Giannelia
Subject:	Re: [rdiff-backup-users] atomic increment files?
Date:	Wed, 11 Mar 2009 03:23:41 -0700
User-agent:	Thunderbird 2.0.0.16 (X11/20080726)

An interesting thing about the output tarballs from my script: if I rdiff two of them, one of them plus the patch file is significantly smaller than two of them (presumably because diffs on different days are nonetheless similar).* This is probably very dependent on what kind of data is being backed up, but it may lead to a way to make increment storage even more efficient (but also more fragile, since a restore would take two levels of merging). It's also very possible that this is a clear indication that I've done something very wrong in my script that's causing duplicate data in what are supposed to be separate increments. Further testing is required ;)
*Example from my test set: a collected increment from 2008-10-04 is 49MB, and the one from 2008-10-05 is also 49MB (total 98MB). An rdiff delta file to turn 2008-10-04 into 2008-10-05 is only 18MB, so 2008-10-04 plus the delta file is 67MB. Another delta to turn 2008-10-05 into 2008-10-06 is also only 18MB, so the three of them together are 85MB instead of 147MB. Again, this is probably highly dependent on the kind of data that's in these increments, but I'm surprised it works as well as it does given that I'm tarring some already-gzipped files together.
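
The arithmetic in the footnote can be checked with a short sketch. The sizes are the ones quoted above; the function names are made up for illustration, and the assumption that every later delta is also 18MB only holds for the two deltas actually measured.

```python
# Sizes in MB, taken from the example above.
INCREMENT_MB = 49   # one collected increment tarball
DELTA_MB = 18       # rdiff delta from one day's tarball to the next


def full_storage(days):
    """Total size when every day's increment tarball is kept whole."""
    return days * INCREMENT_MB


def chained_storage(days):
    """Total size when only the first tarball is kept whole and each
    later day is stored as a delta against the previous day."""
    if days < 1:
        return 0
    return INCREMENT_MB + (days - 1) * DELTA_MB


for days in (2, 3):
    print(days, full_storage(days), chained_storage(days))
# prints:
# 2 98 67
# 3 147 85
```

The chained layout is also where the fragility comes from: restoring day N means applying every delta between the first tarball and day N.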

I've been doing some more experimenting, and I've found a partial explanation for this. Mirror metadata files are huge! In my case each one is 88MB (uncompressed), though they're only 6MB when gzipped. There are only minor differences between consecutive ones (a diff patch from one to the next is on the order of 61KB uncompressed, 11KB compressed; an rdiff patch is considerably larger since they're plain text), so in my example above that explains some of the saved space. File statistics files probably also don't change much, but they too don't account for much when compressed (also 5MB apiece).

In round 2 of testing, I tried uncompressing all of the files in an increment, and then re-storing that as a tar, then generating the rdiff delta, and then recompressing everything. This yielded a very slight advantage in compression, but a significant one in rdiff'ing -- the rdiff deltas are down from 18MB to only 7.4MB (and as I said, some of that can be explained away by similarities in mirror metadata files).
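
One plausible reason uncompressing first helps is that compression tends to hide similarity: a small edit in the plain data can change much of the compressed stream that follows it. The sketch below is a toy illustration of that effect, not a reproduction of the experiment. It uses made-up text, zlib (the same deflate algorithm gzip uses), and fixed-size block hashes as a simplified stand-in for rdiff's rolling checksums.

```python
import hashlib
import random
import zlib

BLOCK = 512  # fixed-size blocks; a simplification of rdiff's rolling checksums


def block_hashes(data):
    """Hash each fixed-size block of data."""
    return [hashlib.sha256(data[i:i + BLOCK]).digest()
            for i in range(0, len(data), BLOCK)]


def shared_fraction(old, new):
    """Fraction of blocks in `new` that also appear somewhere in `old`."""
    old_blocks = set(block_hashes(old))
    new_blocks = block_hashes(new)
    return sum(h in old_blocks for h in new_blocks) / len(new_blocks)


# Made-up, repetitive text standing in for a file inside an increment.
rng = random.Random(0)
words = ["alpha", "beta", "gamma", "delta", "epsilon", "zeta", "eta", "theta"]
old_text = " ".join(rng.choice(words) for _ in range(40000)).encode()

# A same-length 20-byte edit near the start of the file.
new_text = old_text[:100] + b"X" * 20 + old_text[120:]

plain = shared_fraction(old_text, new_text)
packed = shared_fraction(zlib.compress(old_text), zlib.compress(new_text))
print(f"uncompressed blocks shared: {plain:.3f}")
print(f"compressed blocks shared:   {packed:.3f}")
```

In the uncompressed case only the one edited block differs. The compressed streams share a smaller fraction of their blocks, which is consistent with the smaller deltas seen after uncompressing, though real increments will behave differently from this toy data.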

That leaves, of a 49MB increment: 7.4MB of data that's different + 5MB of nearly-identical file statistics + 6.1MB of nearly-identical mirror metadata + another 30.5MB of data that must be identical between the two increments.
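
As a quick check of that breakdown, using only the figures quoted above (the dictionary labels are illustrative):

```python
increment_mb = 49.0

# Parts of the increment that are already accounted for, in MB.
accounted_mb = {
    "rdiff delta (data that differs)": 7.4,
    "file statistics (nearly identical)": 5.0,
    "mirror metadata (nearly identical)": 6.1,
}

# Whatever is left must be data that is identical in both increments.
unexplained_mb = round(increment_mb - sum(accounted_mb.values()), 1)
share = round(100 * unexplained_mb / increment_mb)

print(unexplained_mb, share)  # prints: 30.5 62
```

So roughly 62% of each increment appears to be data that is repeated unchanged from the previous one.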

This leads me to suspect that rdiff-backup is storing snapshots of things that it shouldn't. Even if rdiff-backup routinely stores snapshots every 10 times a file changes (as was mentioned earlier), I find it unlikely that this would coincidentally happen to enough files on 7 consecutive backup runs (I've run this experiment on 7 adjacent pairs of increments and get similar numbers for all of them) to get the kind of numbers I'm getting.

Another possibility is that these overlaps can be explained as file moves. Currently I think rdiff-backup cannot detect a file move, and stores it as a deletion plus a new file; correct? If so, then perhaps what's happening here is that part of the backup data set includes daily-rotated logfiles. Rdiff can detect the identical blocks, because when I'm using it on tarballs of the entire increment, all of the data is in one file. Supposing the rotating logs keep 10 files, then rdiff-backup is seeing 10 files change so drastically that it's cheaper to store snapshots, but rdiff sees 10 large blocks of identical data that just happen to have moved down by a unit or two in the tarball.
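
The log-rotation hypothesis can be sketched as a toy model. Everything here is an assumption made for illustration: the `app.log.N` naming, the file sizes, and the use of fixed-size block hashes in place of rdiff's rolling checksums. It does not model how rdiff-backup actually decides between a diff and a snapshot; it only shows why comparing each name against itself finds nothing in common, while comparing one concatenated file finds almost everything.

```python
import hashlib
import random

BLOCK = 1024      # block size for the toy signature
LOG_BLOCKS = 8    # each made-up log file is exactly 8 blocks long
KEEP = 10         # number of rotated logs kept


def make_log(seed):
    """Deterministic pseudo-random bytes standing in for one day's log."""
    rng = random.Random(seed)
    return bytes(rng.getrandbits(8) for _ in range(BLOCK * LOG_BLOCKS))


def block_hashes(data):
    """Hash each fixed-size block of data."""
    return [hashlib.sha256(data[i:i + BLOCK]).digest()
            for i in range(0, len(data), BLOCK)]


def shared_blocks(old, new):
    """Count blocks of `new` that appear anywhere in `old`."""
    old_set = set(block_hashes(old))
    return sum(h in old_set for h in block_hashes(new))


# Day 1: app.log.1 (newest) through app.log.10 (oldest), keyed by number.
day1 = {n: make_log(n) for n in range(1, KEEP + 1)}
# Day 2: a new app.log.1 appears, every older log shifts down one name,
# and the oldest falls off the end.
day2 = {1: make_log(0)}
day2.update({n: day1[n - 1] for n in range(2, KEEP + 1)})

total = KEEP * LOG_BLOCKS

# Per-file view: each name is compared only with the same name yesterday.
per_file = sum(shared_blocks(day1[n], day2[n]) for n in day2) / total

# Single-file view: all logs concatenated, as in a tarball (headers omitted).
tar1 = b"".join(day1[n] for n in sorted(day1))
tar2 = b"".join(day2[n] for n in sorted(day2))
whole = shared_blocks(tar1, tar2) / total

print(per_file, whole)  # prints: 0.0 0.9
```

Nine of the ten logs survive the rotation under new names, so 90% of the concatenated data is found again, while the per-name comparison sees ten completely new files.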

So, perhaps my harebrained original suggestion of storing increments as single files has led to a relatively easy way to implement file move detection? (I'll be the first to point out, though, that since it requires tarballs to work from, it's not particularly efficient to create even if it is efficient to store once it's done. There might be a better way of implementing this same idea, though.)


~Felix.
