Re: [Duplicity-talk] Duplicate files
From: Jacob Godserv
Subject: Re: [Duplicity-talk] Duplicate files
Date: Wed, 10 Mar 2010 17:58:52 -0500
On Wed, Mar 10, 2010 at 16:06, Philip Tait <address@hidden> wrote:
> Has there been any work on duplicate file elimination for duplicity?
An interesting idea just occurred to me. I'm probably as
unhelpful as ever, but it's worth a shot. :)
A checksum could be generated for each block within a file (the units
duplicity can diff against; I'm pretty sure they're larger than a
byte) and placed into an array or temporary folder of some sort,
ccache-style. When another block has the same checksum, duplicity
could simply reference the location of the original block. I haven't
dealt with compression algorithms at all, so I'm speaking at a pretty
high level.
This method could dramatically reduce the stored size of duplicate
files.
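A minimal sketch of the idea in Python (not duplicity's actual code; the block size, the use of SHA-256, and the in-memory dict standing in for the "temporary folder" are all my assumptions):

```python
import hashlib

BLOCK_SIZE = 4096  # hypothetical block size; duplicity's real diff unit differs


def deduplicate(data: bytes, block_size: int = BLOCK_SIZE):
    """Split data into fixed-size blocks, storing each unique block once.

    Returns (store, refs): store maps checksum -> block bytes, and refs
    is the sequence of checksums that reconstructs the original data.
    """
    store = {}
    refs = []
    for i in range(0, len(data), block_size):
        block = data[i:i + block_size]
        digest = hashlib.sha256(block).hexdigest()
        if digest not in store:
            store[digest] = block  # first occurrence: keep the actual bytes
        refs.append(digest)        # repeats become a cheap reference
    return store, refs


def reconstruct(store, refs):
    """Rebuild the original data from the block store and reference list."""
    return b"".join(store[d] for d in refs)
```

With this, two files containing identical blocks would share entries in `store`, so only the reference lists grow with duplication.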
Due to the high I/O and/or memory usage, it'd be interesting to gather
statistics on larger vs. smaller block sizes. The more data each
checksum covers, the smaller the checksum store is, but the lower the
chance of finding common pieces. (For example, is it better to assume
that, if there are common pieces, they're more likely part of a larger
shared region?)
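That tradeoff can be measured directly. In this sketch (my own toy example, not duplicity code), a single changed byte sits in the middle of otherwise identical data; smaller blocks keep the stored bytes low but inflate the checksum store, while larger blocks shrink the store but force far more redundant data to be kept:

```python
import hashlib


def dedup_stats(data: bytes, block_size: int):
    """Return (index_entries, unique_bytes) for a given block size.

    index_entries counts every block reference the checksum store must
    track; unique_bytes is the total size of the distinct blocks kept.
    """
    unique = {}   # checksum -> size of that unique block
    entries = 0
    for i in range(0, len(data), block_size):
        block = data[i:i + block_size]
        unique[hashlib.sha256(block).digest()] = len(block)
        entries += 1
    return entries, sum(unique.values())


# Two identical 64 KiB runs separated by a one-byte change.
data = b"x" * 65536 + b"y" + b"x" * 65536
for size in (512, 4096, 32768):
    entries, stored = dedup_stats(data, size)
    print(f"block={size:6d}  index entries={entries:4d}  bytes stored={stored}")
```

Running it shows the two extremes: at 512-byte blocks only ~1 KiB of unique data is stored but the index needs 257 entries, while at 32 KiB blocks the index shrinks to 5 entries but ~64 KiB must be stored, since the changed byte spoils an entire large block.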
--
Jacob
"For then there will be great distress, unequaled
from the beginning of the world until now — and never
to be equaled again. If those days had not been cut
short, no one would survive, but for the sake of the
elect those days will be shortened."
Are you ready?