From: Pádraig Brady
Subject: Re: My experience with using cp to copy a lot of files (432 millions, 39 TB)
Date: Thu, 21 Aug 2014 10:31:13 +0100
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:17.0) Gecko/20130110 Thunderbird/17.0.2

On 08/21/2014 08:10 AM, Bernhard Voelker wrote:
> On 08/11/2014 03:55 PM, Rasmus Borup Hansen wrote:
>> Trusting that resizing the hash table would eventually finish, the cp
>> command was allowed to continue, and after a while it started copying
>> again. It stopped again and resized the hash table a couple of times,
>>  each taking more and more time. Finally, after 10 days of copying and
>>  hash table resizing, the new file system used as many blocks and inodes
>>  as the old one according to df, but to my surprise the cp command didn't
>>  exit. Looking at the source again, I found that cp disassembles its hash
>>  table data structures nicely after copying (the forget_all call). Since
>>  the virtual size of the cp process was now more than 17 GB and the
>>  server only had 10 GB of RAM, it did a lot of swapping.
> 
> Thinking about this case again, I find this very surprising:
> 
> a) that cp(1) uses 17 GB of memory when copying 39 TB of data.
> That means roughly 2300 bytes per file:
> 
>   $ bc <<<'39 * 1024 / 17'
>   2349
> 
> ... although the hashed structure only has these members:
> 
>   struct Src_to_dest
>   {
>     ino_t st_ino;
>     dev_t st_dev;
>     char *name;
>   };
> 
> I think either the file names were rather long (on average!),
> or there is something wrong in the code.
> 
> b) that cp(1) is increasing the hash table that often.
> This is because it uses the default Hash_tuning (hash.c):
> 
>   /* [...] The growth threshold defaults to 0.8, and the growth factor
>      defaults to 1.414, meaning that the table will have doubled its size
>      every second time 80% of the buckets get used.  */
>   #define DEFAULT_GROWTH_THRESHOLD 0.8f
>   #define DEFAULT_GROWTH_FACTOR 1.414f
> 
> It has been like this since the introduction of hashing, and
> I wonder if cp(1) couldn't use better values for this.
> 
> Have a nice day,
> Berny
> 

The number of files rather than the amount of data is pertinent here.
So 17G/432M is about 40 bytes per entry, which is about right.
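
For concreteness, a rough sketch of that arithmetic (assuming a 64-bit
glibc system where ino_t, dev_t and pointers are 8 bytes each; the split
of the remaining bytes between the stored name and the hash-bucket
bookkeeping is only an illustration, not measured from cp):

  /* Illustrative only: 17 GiB and 432 million entries are the figures
     from the original report.  */
  #include <stdio.h>
  #include <sys/types.h>

  struct Src_to_dest
  {
    ino_t st_ino;
    dev_t st_dev;
    char *name;
  };

  int
  main (void)
  {
    double mem_bytes = 17.0 * 1024 * 1024 * 1024;  /* ~17 GiB virtual size */
    double n_entries = 432e6;                      /* ~432 million files   */

    printf ("sizeof (struct Src_to_dest): %zu\n",
            sizeof (struct Src_to_dest));          /* 24 on x86_64 */
    printf ("observed bytes per entry:    %.0f\n",
            mem_bytes / n_entries);                /* roughly 42 */
    /* The bytes left over per entry would go to the copied file name
       and the hash table's own bucket bookkeeping.  */
    return 0;
  }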
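
On the tuning question in (b): gnulib's hash module already lets the
caller pass its own Hash_tuning and a size hint to hash_initialize, so
fewer (or cheaper) resizes are mostly a matter of picking different
numbers.  A minimal sketch against gnulib's hash.h, with purely
illustrative tuning values and placeholder hasher/comparator functions
(this is not a change coreutils has made):

  #include <stdbool.h>
  #include <stdint.h>
  #include <stdio.h>
  #include <stdlib.h>
  #include <sys/types.h>
  #include "hash.h"      /* gnulib's hash module */

  struct Src_to_dest { ino_t st_ino; dev_t st_dev; char *name; };

  /* Placeholder hasher/comparator keyed on device and inode.  */
  static size_t
  ent_hash (void const *x, size_t table_size)
  {
    struct Src_to_dest const *p = x;
    return (uintmax_t) p->st_ino % table_size;
  }

  static bool
  ent_compare (void const *x, void const *y)
  {
    struct Src_to_dest const *a = x, *b = y;
    return a->st_ino == b->st_ino && a->st_dev == b->st_dev;
  }

  int
  main (void)
  {
    /* Fill buckets to 85% before growing, and grow 2x per resize
       instead of 1.414x; never shrink.  Values are illustrative.  */
    static Hash_tuning const tuning = { 0.0f, 1.0f, 0.85f, 2.0f, false };

    /* A generous initial size hint also avoids many early resizes.  */
    Hash_table *ht = hash_initialize (1000000, &tuning,
                                      ent_hash, ent_compare, free);
    if (!ht)
      return EXIT_FAILURE;

    printf ("initial buckets: %zu\n", hash_get_n_buckets (ht));
    hash_free (ht);
    return EXIT_SUCCESS;
  }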

cheers,
Pádraig.


