Re: [rdiff-backup-users] atomic increment files?


From: Marcel (Felix) Giannelia
Subject: Re: [rdiff-backup-users] atomic increment files?
Date: Mon, 09 Mar 2009 18:21:26 -0700
User-agent: Thunderbird 2.0.0.6 (X11/20071015)

Hi Matt,

Matthew Flaschen wrote:

> It seems to me you're very wrong.  A typical restore is only going to
> touch a few files.

If a typical restore touches only a small part of the filesystem and only goes back a few days, you're right. But I wasn't even concerned with restore operations -- I want the increment storage to be more efficient so that I can archive it quickly and easily.

> Now you tried to address this with, "accessing an
> archive file's directory structure is likely faster than doing the same
> in a part of the filesystem containing many thousands of files per
> directory."  But you provided no evidence whatsoever for this very
> non-obvious statement.

It doesn't strike me as non-obvious; listing the contents of an archive format that stores an index (e.g. rar, zip, or 7z, as opposed to tar, which doesn't) has always seemed to go faster than enumerating the same list of files in the filesystem, once many files are involved. That's because reading an archive index is a single, linear disk read, whereas a large subtree traversal involves an enormous number of tiny reads scattered all over the disk. Seek time is where most of the waiting happens -- a hard drive can access data that happens to be under the heads in about 1 ms (or less), but moving the heads takes on the order of 13-15 ms.
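
To put rough numbers on it (illustrative figures only -- the 13-15 ms seek is from above, and ~50 MB/s of sequential throughput is an assumption about a typical drive): touching 1.5 million scattered inodes at even one seek apiece is about 1,500,000 x 0.014 s, or nearly six hours of pure seeking, while reading a 200 MB contiguous index is 200 MB / 50 MB/s, about 4 seconds. Caching and readahead soften the first number considerably in practice, but the ratio is the point.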

If I need to get a directory listing of a very large subtree (for instance, the one in my problem backup) I usually do something like this:

find . -type f -print0 | xargs -0 ls -l >> big_directory_listing

(I use find instead of ls -R because find prints the full path on every line, which is easier to parse in scripts.)

That way, I can quickly refer to big_directory_listing without having to traverse the subtree again. In my case, about 1.5 million files produced an output file of over 200 MB, but it was still *much* faster to work with that text file than to read from the filesystem. For instance, du -s took over an hour even the second time I ran it [the directory tree's metadata was too big to fit in the RAM file cache], but "cat big_directory_listing | gawk '{sum += $5} END {print sum}'" took about 15 seconds.
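
The same listing file can stand in for other traversals, too. For instance (a rough one-liner, assuming plain "ls -l" output where field 5 is the size; filenames with spaces will mangle the name column but not the sort):

sort -rn -k5 big_directory_listing | head

prints the ten biggest files without another trip through the tree.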

> If that were true, why don't people use AVFS as
> their primary filesystem?

My points about access speed only apply to data sets that are read-only. Updating an index like that is a relatively slow operation, and wouldn't work very well for day-to-day filesystem use. Filesystems are designed to make a trade-off between enumeration speed and update speed, and for the most part they do that fairly well (and they're getting better).

But when you're storing something that you *know* is read-only and will never need to be modified again, it makes more sense to store a nice index up front (which is why CDs do that). Aside from unrecommended fiddling, people generally don't modify rdiff-backup's increment files, so I think they would be a good application for a nice indexed archive. Creating the index would add almost no time, increments could be stored without the extra slack space that separate files cause, and typical restore operations wouldn't slow down by much (full restores from a long time ago would probably take longer, but they already take a while).
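
As a rough sketch of what I mean (the paths here are hypothetical, and I'm using zip only because it keeps a central index -- rdiff-backup doesn't do any of this today):

cd /backup/rdiff-backup-data
zip -r -0 increments-archive.zip increments/
unzip -l increments-archive.zip | tail

The "unzip -l" listing comes out of the archive's central index in one linear read, rather than from a walk over thousands of little increment files, and -0 stores the increments without compressing them.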

~Felix.




