From: Pádraig Brady
Subject: Re: My experience with using cp to copy a lot of files (432 millions, 39 TB)
Date: Sun, 14 Sep 2014 11:43:39 +0100
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:17.0) Gecko/20130110 Thunderbird/17.0.2
On 09/12/2014 03:11 PM, Pádraig Brady wrote:
> On 09/12/2014 04:42 AM, address@hidden wrote:
>> This post has made it to Hacker News[1].
>>
>> We have discussed optimization possibilities a bit, and I made the
>> suggestion to replace the usage of a hash table in cp with sorting a
>> list.
>>
>> For example: walk the source tree and write a list of ino/dev/path
>> entries to a temporary file, then sort that file by ino/dev (e.g. using
>> GNU sort, which I seem to remember already uses a memory-efficient
>> algorithm, i.e. one that works well with files much bigger than RAM),
>> then parse the file back, copy the first path of every group with the
>> same ino/dev, and hard-link the rest.
>>
>> The assumption is that sorting a list requires much less RAM to be
>> efficient than the hash table. (I can't find my copy of TAOCP right
>> now, I think it describes solutions.)
>>
>> Christian.
>>
>> [1] https://news.ycombinator.com/item?id=8305283
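
The walk/sort/link scheme proposed above can be sketched in Python (a hypothetical illustration, not cp's implementation; the in-memory sort here stands in for the external, RAM-bounded GNU sort step, and the function name is invented):

```python
import os

def hardlink_groups(root):
    """Group paths under root by (st_dev, st_ino) via sorting."""
    records = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            st = os.lstat(path)
            records.append((st.st_dev, st.st_ino, path))
    # The real proposal would write these records to a temporary file and
    # sort it externally (e.g. with GNU sort), keeping RAM bounded;
    # sorting in memory here for brevity.
    records.sort()
    groups = {}
    for dev, ino, path in records:
        groups.setdefault((dev, ino), []).append(path)
    # The first path of each group would be copied, the rest hard-linked.
    return groups
```
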
>
> The main problem here is that at some stage you need the full set
> somewhere in order to identify the hardlinks within it.
>
> Using an external sorted list would result in larger disk usage
> (than having cp swap out its memory), but would probably benefit
> from better memory locality. I.e. while more expensive CPU-wise,
> sort(1) would use a divide-and-conquer approach to achieve better
> memory locality, whereas cp updates its hash as it copies, which is
> the best approach assuming no swapping, but which results in
> essentially random parts of memory (swap) being updated as it runs.
>
> On the 10G RAM system, that resulted in only around 30/420
> inodes/direntries being copied per second on average over the
> 10-day run.
>
> It's worth mentioning that mechanical disks with their orders of magnitude
> slower random access latencies are going away, so we must consider that
> along with the frequency of this use case when making any changes to
> the complexity of the code.
>
> If we were to address this, the two approaches would be to
> 1. reduce the size of the hash
> 2. employ locality-sensitive hashing
>
> Looking at 1, I see there was an attempt to reduce the size
> of the hash by excluding traversed source directories:
> http://git.sv.gnu.org/gitweb/?p=coreutils.git;a=commitdiff;h=3ece035
>
> Now Rasmus' analysis suggests that this didn't happen for him,
> resulting in a 350% mem increase for his case, which would have been
> more than enough to move the working set from RAM to swap.
>
> The patch above didn't cater for the fact that dirs
> generally have st_nlink > 1, which is fixed up in the following patch:
>
> Note the caveat that this will now not detect "hardlinked dirs"
> as reported (intermittently?) on netapp file systems; instead of
> exiting with an error, cp will just create new directories in
> that case. Perhaps that's what we should be doing anyway?
> http://git.sv.gnu.org/gitweb/?p=coreutils.git;a=blob;f=src/copy.c;h=a9561c6
>
> diff --git a/src/copy.c b/src/copy.c
> index a9561c6..a2e297f 100644
> --- a/src/copy.c
> +++ b/src/copy.c
> @@ -2152,7 +2152,7 @@ copy_internal (char const *src_name, char const *dst_name,
> }
> else if (x->preserve_links
> && !x->hard_link
> - && (1 < src_sb.st_nlink
> + && ((1 < src_sb.st_nlink && ! S_ISDIR (src_mode))
> || (command_line_arg
> && x->dereference == DEREF_COMMAND_LINE_ARGUMENTS)
> || x->dereference == DEREF_ALWAYS))
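
The effect of the one-line fix can be seen by rendering the condition in Python (a hypothetical translation for illustration; `stat.S_ISDIR` is real, while the flag names and the DEREF constants are stand-ins mirroring the C code):

```python
import stat

DEREF_COMMAND_LINE_ARGUMENTS = 1  # stand-ins for the C enum values
DEREF_ALWAYS = 2

def remember_for_hardlinks(st_mode, st_nlink, preserve_links, hard_link,
                           command_line_arg, dereference):
    """Mirror of the patched condition: directories are no longer
    remembered merely because their link count exceeds 1."""
    return (preserve_links
            and not hard_link
            and ((st_nlink > 1 and not stat.S_ISDIR(st_mode))
                 or (command_line_arg
                     and dereference == DEREF_COMMAND_LINE_ARGUMENTS)
                 or dereference == DEREF_ALWAYS))
```

With this change a directory with st_nlink == 3 no longer enters the hash, while a regular file with st_nlink == 2 still does.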
>
> thanks,
> Pádraig.
>
>
BTW there is a lot of redundancy in all the stored paths,
which could be reduced by using a prefix compression method.
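
One such method is front coding: after sorting, each path is stored as the length of the prefix it shares with the previous path plus the differing suffix. A minimal sketch (hypothetical, not an existing cp feature; function names are invented):

```python
import os.path

def front_compress(sorted_paths):
    """Store each path as (shared-prefix length, remaining suffix)."""
    out, prev = [], ""
    for p in sorted_paths:
        n = len(os.path.commonprefix([prev, p]))
        out.append((n, p[n:]))
        prev = p
    return out

def front_decompress(items):
    """Rebuild the original sorted path list."""
    paths, prev = [], ""
    for n, suffix in items:
        prev = prev[:n] + suffix
        paths.append(prev)
    return paths
```

Since sorted paths share long directory prefixes, most entries collapse to a small integer plus a short suffix.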
Pádraig.