Re: [Duplicity-talk] Duplicate files
From: Jacob Godserv
Subject: Re: [Duplicity-talk] Duplicate files
Date: Wed, 10 Mar 2010 17:58:52 -0500
On Wed, Mar 10, 2010 at 16:06, Philip Tait <address@hidden> wrote:
> Has there been any work on duplicate file elimination for duplicity?
An interesting idea just occurred to me. I'm probably as
unhelpful as ever, but it's worth a shot. :)
A checksum could be generated for each block within a file (the units
duplicity can diff against; I'm pretty sure they're larger than a
byte) and placed into an array or temporary folder of some sort,
ccache-style. When another block has the same checksum, duplicity
could simply reference the location of the original block. I haven't
dealt with compression algorithms at all, so I'm speaking at a pretty
high level.
This method could dramatically reduce the stored size of duplicate
files.
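A minimal sketch of the idea in Python (not duplicity's actual code; the block size, the use of SHA-256, and the in-memory dict standing in for the "temporary folder" are all my assumptions):

```python
import hashlib

BLOCK_SIZE = 4096  # hypothetical block size; duplicity's real diff unit differs


def deduplicate(data: bytes, block_size: int = BLOCK_SIZE):
    """Split data into fixed-size blocks, storing each unique block once.

    Returns (store, refs): store maps checksum -> block bytes, and refs
    is the sequence of checksums that reconstructs the original data.
    """
    store = {}
    refs = []
    for i in range(0, len(data), block_size):
        block = data[i:i + block_size]
        digest = hashlib.sha256(block).hexdigest()
        if digest not in store:
            store[digest] = block  # first occurrence: keep the actual bytes
        refs.append(digest)        # repeats become a cheap reference
    return store, refs


def reconstruct(store, refs):
    """Rebuild the original data from the block store and reference list."""
    return b"".join(store[d] for d in refs)
```

With this, two files containing identical blocks would share entries in `store`, so only the reference lists grow with duplication.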
Due to the high I/O and/or memory usage, it'd be interesting to gather
statistics on larger vs. smaller block sizes. The more data each
checksum covers, the smaller the checksum store is, but the lower the
chance of finding common pieces. (For example, is it better to assume
that, if there are common pieces, they're more likely part of a larger
shared region?)
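That tradeoff can be measured directly. In this sketch (my own toy example, not duplicity code), a single changed byte sits in the middle of otherwise identical data; smaller blocks keep the stored bytes low but inflate the checksum store, while larger blocks shrink the store but force far more redundant data to be kept:

```python
import hashlib


def dedup_stats(data: bytes, block_size: int):
    """Return (index_entries, unique_bytes) for a given block size.

    index_entries counts every block reference the checksum store must
    track; unique_bytes is the total size of the distinct blocks kept.
    """
    unique = {}   # checksum -> size of that unique block
    entries = 0
    for i in range(0, len(data), block_size):
        block = data[i:i + block_size]
        unique[hashlib.sha256(block).digest()] = len(block)
        entries += 1
    return entries, sum(unique.values())


# Two identical 64 KiB runs separated by a one-byte change.
data = b"x" * 65536 + b"y" + b"x" * 65536
for size in (512, 4096, 32768):
    entries, stored = dedup_stats(data, size)
    print(f"block={size:6d}  index entries={entries:4d}  bytes stored={stored}")
```

Running it shows the two extremes: at 512-byte blocks only ~1 KiB of unique data is stored but the index needs 257 entries, while at 32 KiB blocks the index shrinks to 5 entries but ~64 KiB must be stored, since the changed byte spoils an entire large block.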
--
Jacob
"For then there will be great distress, unequaled
from the beginning of the world until now — and never
to be equaled again. If those days had not been cut
short, no one would survive, but for the sake of the
elect those days will be shortened."
Are you ready?