coreutils
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: Taking advantage of L1 and L2 cache in sort


From: Chen Guo
Subject: Re: Taking advantage of L1 and L2 cache in sort
Date: Mon, 18 Jun 2012 12:24:27 -0700

Fascinating, Padraig. I wonder, did this email just get to my inbox really,
really late, or what prompted you to dig up a 2 year old thread?

Anyhow, it seems you had to specifically unset MALLOC_PERTURB_,
so is it by default set? I forget if we ever made the change to default to
a small buffer, but either way if desired I can make the change to read
MALLOC_PERTURB_ from env[] and set the buffer size accordingly
based on that.

On Thu, Jun 7, 2012 at 5:14 PM, Pádraig Brady <address@hidden> wrote:
On 03/03/2010 01:06 AM, Pádraig Brady wrote:
> On 02/03/10 18:20, Chen Guo wrote:
>> This is exactly what that guy Shaun Jackman was talking about earlier.
>> I'm actually really surprised this is faster, if I can dig up his e-mail I'll
>> forward him this, I remember him saying something about experimenting
>> with exactly this.
>
> I missed that thread but yes he pretty much had the
> same idea as I stumbled on when trying to perturb
> the posix_fadvise() testing by changing the buffer size.
> http://lists.gnu.org/archive/html/bug-coreutils/2010-02/msg00151.html
> Spooky :)
>
> Shaun, you can use `taskset` to set process affinity BTW.
>
>> Can you profile the difference in the number of I/O system calls?
>
> $ TMPDIR=/ram LANG=C /usr/bin/time -v strace -c ./src/sort sort.t/sort.1.test > /dev/null
> % time     seconds  usecs/call     calls    errors syscall
> ------ ----------- ----------- --------- --------- ----------------
>  70.70    0.283077       21775        13           read
>  28.97    0.115983       19331         6           munmap
>   0.32    0.001268           0     21609           write
>   0.01    0.000054           8         7           open
>   0.00    0.000000           0         9           close
>   0.00    0.000000           0         1           execve
>   0.00    0.000000           0         1         1 access
>   0.00    0.000000           0         3           brk
>   0.00    0.000000           0         1         1 ioctl
>   0.00    0.000000           0         1           uname
>   0.00    0.000000           0         5           mprotect
>   0.00    0.000000           0        25           rt_sigaction
>   0.00    0.000000           0         1           rt_sigprocmask
>   0.00    0.000000           0         4           getrlimit
>   0.00    0.000000           0        16           mmap2
>   0.00    0.000000           0         9           fstat64
>   0.00    0.000000           0         2         1 futex
>   0.00    0.000000           0         1           set_thread_area
>   0.00    0.000000           0         1           set_tid_address
>   0.00    0.000000           0         1           fadvise64_64
>   0.00    0.000000           0         1           set_robust_list
> ------ ----------- ----------- --------- --------- ----------------
> 100.00    0.400382                 21717         3 total
>     Command being timed: "strace -c ./src/sort sort.t/sort.1.test"
>     User time (seconds): 26.91
>     System time (seconds): 2.01
>     Percent of CPU this job got: 90%
>     Elapsed (wall clock) time (h:mm:ss or m:ss): 0:32.02
>     Average shared text size (kbytes): 0
>     Average unshared data size (kbytes): 0
>     Average stack size (kbytes): 0
>     Average total size (kbytes): 0
>     Maximum resident set size (kbytes): 0
>     Average resident set size (kbytes): 0
>     Major (requiring I/O) page faults: 3
>     Minor (reclaiming a frame) page faults: 181060
>     Voluntary context switches: 87362
>     Involuntary context switches: 2526
>     Swaps: 0
>     File system inputs: 173504
>     File system outputs: 0
>     Socket messages sent: 0
>     Socket messages received: 0
>     Signals delivered: 0
>     Page size (bytes): 4096
>     Exit status: 0
>
> $ TMPDIR=/ram LANG=C /usr/bin/time -v strace -c ./src/sort -S1M sort.t/sort.1.test > /dev/null
> % time     seconds  usecs/call     calls    errors syscall
> ------ ----------- ----------- --------- --------- ----------------
>  38.95    0.035011           1     60991           read
>  33.47    0.030081          90       334           unlink
>  24.17    0.021721           0     81864           write
>   2.07    0.001862           2      1006           munmap
>   0.75    0.000670           1       673           open
>   0.23    0.000209           0      1016           mmap2
>   0.19    0.000167           0       675           fstat64
>   0.09    0.000085           0       675           close
>   0.07    0.000062           0       334           fcntl64
>   0.02    0.000018           0      1337           rt_sigprocmask
>   0.00    0.000000           0         1           execve
>   0.00    0.000000           0         1         1 access
>   0.00    0.000000           0         3           brk
>   0.00    0.000000           0         1         1 ioctl
>   0.00    0.000000           0         1           gettimeofday
>   0.00    0.000000           0         1           uname
>   0.00    0.000000           0         5           mprotect
>   0.00    0.000000           0       334           _llseek
>   0.00    0.000000           0        25           rt_sigaction
>   0.00    0.000000           0         1           getrlimit
>   0.00    0.000000           0         2         1 futex
>   0.00    0.000000           0         1           set_thread_area
>   0.00    0.000000           0         1           set_tid_address
>   0.00    0.000000           0       335           fadvise64_64
>   0.00    0.000000           0         1           set_robust_list
> ------ ----------- ----------- --------- --------- ----------------
> 100.00    0.089886                149618         3 total
>     Command being timed: "strace -c ./src/sort -S1M sort.t/sort.1.test"
>     User time (seconds): 21.76
>     System time (seconds): 4.51
>     Percent of CPU this job got: 98%
>     Elapsed (wall clock) time (h:mm:ss or m:ss): 0:26.79
>     Average shared text size (kbytes): 0
>     Average unshared data size (kbytes): 0
>     Average stack size (kbytes): 0
>     Average total size (kbytes): 0
>     Maximum resident set size (kbytes): 0
>     Average resident set size (kbytes): 0
>     Major (requiring I/O) page faults: 3
>     Minor (reclaiming a frame) page faults: 23038
>     Voluntary context switches: 598317
>     Involuntary context switches: 2316
>     Swaps: 0
>     File system inputs: 173504
>     File system outputs: 0
>     Socket messages sent: 0
>     Socket messages received: 0
>     Signals delivered: 0
>     Page size (bytes): 4096
>     Exit status: 0

Hmm I think the results above are invalid
due to MALLOC_PERTURB_ biasing against large buffers.
In the test below the extra external file process dominates
any gains from CPU cache usage:

$ # Test setup
$ cache_size() { lscpu | grep cache | tail -n1 | tr -s ' ' | cut -d' ' -f3; }
$ export TMPDIR=/dev/shm # RAM
$ export OMP_NUM_THREADS=1 # limit to a single CPU to simplify
$ unset MALLOC_PERTURB_ # negatively impacts performance tests
$ shuf -i 1-10000000 > sort.test # 78MB

$ LANG=C /usr/bin/time -v strace -c sort -S$(cache_size) sort.test > /dev/null
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 66.33    0.034878         185       189           unlink
 17.89    0.009408           0     45766           read
 8.73    0.004590           0     57711           write
 4.53    0.002381           4       571           munmap
 1.28    0.000671           2       382           open
 0.46    0.000242           1       384           fstat
 0.35    0.000183           0       580           mmap
 0.19    0.000100           0       384           close
 0.14    0.000074           0       757           rt_sigprocmask
 0.11    0.000058           0       189           fcntl
 0.00    0.000000           0       189           lseek
 0.00    0.000000           0         5           mprotect
 0.00    0.000000           0         4           brk
 0.00    0.000000           0        25           rt_sigaction
 0.00    0.000000           0         1         1 ioctl
 0.00    0.000000           0         1         1 access
 0.00    0.000000           0         1           execve
 0.00    0.000000           0         1           getrlimit
 0.00    0.000000           0         1           arch_prctl
 0.00    0.000000           0         1           set_tid_address
 0.00    0.000000           0       190           fadvise64
 0.00    0.000000           0         1           set_robust_list
------ ----------- ----------- --------- --------- ----------------
100.00    0.052585                107333         2 total
       Command being timed: "strace -c sort --parallel=1 -T/dev/shm -S3072K sort.test"
       User time (seconds): 25.88
       System time (seconds): 5.81
       Percent of CPU this job got: 94%
       Elapsed (wall clock) time (h:mm:ss or m:ss): 0:33.59
       Average shared text size (kbytes): 0
       Average unshared data size (kbytes): 0
       Average stack size (kbytes): 0
       Average total size (kbytes): 0
       Maximum resident set size (kbytes): 4968
       Average resident set size (kbytes): 0
       Major (requiring I/O) page faults: 0
       Minor (reclaiming a frame) page faults: 13900
       Voluntary context switches: 429178
       Involuntary context switches: 4882
       Swaps: 0
       File system inputs: 0
       File system outputs: 0
       Socket messages sent: 0
       Socket messages received: 0
       Signals delivered: 0
       Page size (bytes): 4096
       Exit status: 0


$ LANG=C /usr/bin/time -v strace -c sort -S200M sort.test > /dev/null
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 80.68    0.348335         565       617           read
 11.58    0.049994        3846        13           munmap
 5.33    0.022995        7665         3           unlink
 2.38    0.010281           0     38521           write
 0.01    0.000052          52         1         1 access
 0.01    0.000042           8         5           mprotect
 0.01    0.000032           1        25           rt_sigaction
 0.00    0.000000           0        10           open
 0.00    0.000000           0        12           close
 0.00    0.000000           0        12           fstat
 0.00    0.000000           0         3           lseek
 0.00    0.000000           0        22           mmap
 0.00    0.000000           0         4           brk
 0.00    0.000000           0        13           rt_sigprocmask
 0.00    0.000000           0         1         1 ioctl
 0.00    0.000000           0         1           execve
 0.00    0.000000           0         3           fcntl
 0.00    0.000000           0         1           getrlimit
 0.00    0.000000           0         1           arch_prctl
 0.00    0.000000           0         1           set_tid_address
 0.00    0.000000           0         4           fadvise64
 0.00    0.000000           0         1           set_robust_list
------ ----------- ----------- --------- --------- ----------------
100.00    0.431731                 39274         2 total
       Command being timed: "strace -c sort -S200M sort.test"
       User time (seconds): 19.82
       System time (seconds): 4.33
       Percent of CPU this job got: 94%
       Elapsed (wall clock) time (h:mm:ss or m:ss): 0:25.57
       Average shared text size (kbytes): 0
       Average unshared data size (kbytes): 0
       Average stack size (kbytes): 0
       Average total size (kbytes): 0
       Maximum resident set size (kbytes): 205532
       Average resident set size (kbytes): 0
       Major (requiring I/O) page faults: 355
       Minor (reclaiming a frame) page faults: 98057
       Voluntary context switches: 159217
       Involuntary context switches: 4681
       Swaps: 0
       File system inputs: 140344
       File system outputs: 154168
       Socket messages sent: 0
       Socket messages received: 0
       Signals delivered: 0
       Page size (bytes): 4096
       Exit status: 0

I notice also that performance gets worse for the above input
for the small buffer size when I enable multithreading on my 4 core i3.
       User time (seconds): 35.40
       System time (seconds): 9.20
       Percent of CPU this job got: 120%
       Elapsed (wall clock) time (h:mm:ss or m:ss): 0:36.98

cheers,
Pádraig.


reply via email to

[Prev in Thread] Current Thread [Next in Thread]