From: Pádraig Brady
Subject: Re: My experience with using cp to copy a lot of files (432 millions, 39 TB)
Date: Thu, 21 Aug 2014 10:31:13 +0100
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:17.0) Gecko/20130110 Thunderbird/17.0.2

On 08/21/2014 08:10 AM, Bernhard Voelker wrote:
> On 08/11/2014 03:55 PM, Rasmus Borup Hansen wrote:
>> Trusting that resizing the hash table would eventually finish, the cp
>> command was allowed to continue, and after a while it started copying
>> again. It stopped again and resized the hash table a couple of times,
>>  each taking more and more time. Finally, after 10 days of copying and
>>  hash table resizing, the new file system used as many blocks and inodes
>>  as the old one according to df, but to my surprise the cp command didn't
>>  exit. Looking at the source again, I found that cp disassembles its hash
>>  table data structures nicely after copying (the forget_all call). Since
>>  the virtual size of the cp process was now more than 17 GB and the
>>  server only had 10 GB of RAM, it did a lot of swapping.
> 
> Thinking about this case again, I find this very surprising:
> 
> a) that cp(1) uses 17 GB of memory when copying 39 TB of data.
> That means roughly 2300 bytes per file:
> 
>   $ bc <<<'39 * 1024 / 17'
>   2349
> 
> ... although the hashed structure only has these members:
> 
>   struct Src_to_dest
>   {
>     ino_t st_ino;
>     dev_t st_dev;
>     char *name;
>   };
> 
> I think either the file names were rather long (on average!),
> or there is something wrong in the code.
> 
> b) that cp(1) is increasing the hash table that often.
> This is because it uses the default Hash_tuning (hash.c):
> 
>   /* [...] The growth threshold defaults to 0.8, and the growth factor
>      defaults to 1.414, meaning that the table will have doubled its size
>      every second time 80% of the buckets get used.  */
>   #define DEFAULT_GROWTH_THRESHOLD 0.8f
>   #define DEFAULT_GROWTH_FACTOR 1.414f
> 
> It has been like this since the introduction of hashing, and
> I wonder if cp(1) couldn't use better values for this.
> 
> Have a nice day,
> Berny
> 

The number of files rather than the amount of data is pertinent here.
So 17G/432M is about 40 bytes per entry, which is about right.
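
For concreteness, a rough sketch of that arithmetic (assuming a 64-bit
glibc system where ino_t, dev_t and pointers are 8 bytes each; the split
of the remaining bytes between the stored name and the hash-bucket
bookkeeping is only an illustration, not measured from cp):

  /* Illustrative only: 17 GiB and 432 million entries are the figures
     from the original report.  */
  #include <stdio.h>
  #include <sys/types.h>

  struct Src_to_dest
  {
    ino_t st_ino;
    dev_t st_dev;
    char *name;
  };

  int
  main (void)
  {
    double mem_bytes = 17.0 * 1024 * 1024 * 1024;  /* ~17 GiB virtual size */
    double n_entries = 432e6;                      /* ~432 million files   */

    printf ("sizeof (struct Src_to_dest): %zu\n",
            sizeof (struct Src_to_dest));          /* 24 on x86_64 */
    printf ("observed bytes per entry:    %.0f\n",
            mem_bytes / n_entries);                /* roughly 42 */
    /* The bytes left over per entry would go to the copied file name
       and the hash table's own bucket bookkeeping.  */
    return 0;
  }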
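
On the tuning question in (b): gnulib's hash module already lets the
caller pass its own Hash_tuning and a size hint to hash_initialize, so
fewer (or cheaper) resizes are mostly a matter of picking different
numbers.  A minimal sketch against gnulib's hash.h, with purely
illustrative tuning values and placeholder hasher/comparator functions
(this is not a change coreutils has made):

  #include <stdbool.h>
  #include <stdint.h>
  #include <stdio.h>
  #include <stdlib.h>
  #include <sys/types.h>
  #include "hash.h"      /* gnulib's hash module */

  struct Src_to_dest { ino_t st_ino; dev_t st_dev; char *name; };

  /* Placeholder hasher/comparator keyed on device and inode.  */
  static size_t
  ent_hash (void const *x, size_t table_size)
  {
    struct Src_to_dest const *p = x;
    return (uintmax_t) p->st_ino % table_size;
  }

  static bool
  ent_compare (void const *x, void const *y)
  {
    struct Src_to_dest const *a = x, *b = y;
    return a->st_ino == b->st_ino && a->st_dev == b->st_dev;
  }

  int
  main (void)
  {
    /* Fill buckets to 85% before growing, and grow 2x per resize
       instead of 1.414x; never shrink.  Values are illustrative.  */
    static Hash_tuning const tuning = { 0.0f, 1.0f, 0.85f, 2.0f, false };

    /* A generous initial size hint also avoids many early resizes.  */
    Hash_table *ht = hash_initialize (1000000, &tuning,
                                      ent_hash, ent_compare, free);
    if (!ht)
      return EXIT_FAILURE;

    printf ("initial buckets: %zu\n", hash_get_n_buckets (ht));
    hash_free (ht);
    return EXIT_SUCCESS;
  }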

cheers,
Pádraig.


