coreutils
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: Taking advantage of L1 and L2 cache in sort


From: Pádraig Brady
Subject: Re: Taking advantage of L1 and L2 cache in sort
Date: Fri, 08 Jun 2012 01:14:24 +0100
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:6.0) Gecko/20110816 Thunderbird/6.0

On 03/03/2010 01:06 AM, Pádraig Brady wrote:
> On 02/03/10 18:20, Chen Guo wrote:
>> This is exactly what that guy Shaun Jackman was talking about earlier.
>> I'm actually really surprised this is faster, if I can dig up his e-mail I'll
>> forward him this, I remember him saying something about experimenting
>> with exactly this.
> 
> I missed that thread but yes he pretty much had the
> same idea as I stumbled on when trying to perturb
> the posix_fadvise() testing by changing the buffer size.
> http://lists.gnu.org/archive/html/bug-coreutils/2010-02/msg00151.html
> Spooky :)
> 
> Shaun, you can use `taskset` to set process affinity BTW.
> 
>> Can you profile the difference in the number of I/O system calls?
> 
> $ TMPDIR=/ram LANG=C /usr/bin/time -v strace -c ./src/sort sort.t/sort.1.test 
> > /dev/null
> % time     seconds  usecs/call     calls    errors syscall
> ------ ----------- ----------- --------- --------- ----------------
>  70.70    0.283077       21775        13           read
>  28.97    0.115983       19331         6           munmap
>   0.32    0.001268           0     21609           write
>   0.01    0.000054           8         7           open
>   0.00    0.000000           0         9           close
>   0.00    0.000000           0         1           execve
>   0.00    0.000000           0         1         1 access
>   0.00    0.000000           0         3           brk
>   0.00    0.000000           0         1         1 ioctl
>   0.00    0.000000           0         1           uname
>   0.00    0.000000           0         5           mprotect
>   0.00    0.000000           0        25           rt_sigaction
>   0.00    0.000000           0         1           rt_sigprocmask
>   0.00    0.000000           0         4           getrlimit
>   0.00    0.000000           0        16           mmap2
>   0.00    0.000000           0         9           fstat64
>   0.00    0.000000           0         2         1 futex
>   0.00    0.000000           0         1           set_thread_area
>   0.00    0.000000           0         1           set_tid_address
>   0.00    0.000000           0         1           fadvise64_64
>   0.00    0.000000           0         1           set_robust_list
> ------ ----------- ----------- --------- --------- ----------------
> 100.00    0.400382                 21717         3 total
>     Command being timed: "strace -c ./src/sort sort.t/sort.1.test"
>     User time (seconds): 26.91
>     System time (seconds): 2.01
>     Percent of CPU this job got: 90%
>     Elapsed (wall clock) time (h:mm:ss or m:ss): 0:32.02
>     Average shared text size (kbytes): 0
>     Average unshared data size (kbytes): 0
>     Average stack size (kbytes): 0
>     Average total size (kbytes): 0
>     Maximum resident set size (kbytes): 0
>     Average resident set size (kbytes): 0
>     Major (requiring I/O) page faults: 3
>     Minor (reclaiming a frame) page faults: 181060
>     Voluntary context switches: 87362
>     Involuntary context switches: 2526
>     Swaps: 0
>     File system inputs: 173504
>     File system outputs: 0
>     Socket messages sent: 0
>     Socket messages received: 0
>     Signals delivered: 0
>     Page size (bytes): 4096
>     Exit status: 0
> 
> $ TMPDIR=/ram LANG=C /usr/bin/time -v strace -c ./src/sort -S1M 
> sort.t/sort.1.test > /dev/null
> % time     seconds  usecs/call     calls    errors syscall
> ------ ----------- ----------- --------- --------- ----------------
>  38.95    0.035011           1     60991           read
>  33.47    0.030081          90       334           unlink
>  24.17    0.021721           0     81864           write
>   2.07    0.001862           2      1006           munmap
>   0.75    0.000670           1       673           open
>   0.23    0.000209           0      1016           mmap2
>   0.19    0.000167           0       675           fstat64
>   0.09    0.000085           0       675           close
>   0.07    0.000062           0       334           fcntl64
>   0.02    0.000018           0      1337           rt_sigprocmask
>   0.00    0.000000           0         1           execve
>   0.00    0.000000           0         1         1 access
>   0.00    0.000000           0         3           brk
>   0.00    0.000000           0         1         1 ioctl
>   0.00    0.000000           0         1           gettimeofday
>   0.00    0.000000           0         1           uname
>   0.00    0.000000           0         5           mprotect
>   0.00    0.000000           0       334           _llseek
>   0.00    0.000000           0        25           rt_sigaction
>   0.00    0.000000           0         1           getrlimit
>   0.00    0.000000           0         2         1 futex
>   0.00    0.000000           0         1           set_thread_area
>   0.00    0.000000           0         1           set_tid_address
>   0.00    0.000000           0       335           fadvise64_64
>   0.00    0.000000           0         1           set_robust_list
> ------ ----------- ----------- --------- --------- ----------------
> 100.00    0.089886                149618         3 total
>     Command being timed: "strace -c ./src/sort -S1M sort.t/sort.1.test"
>     User time (seconds): 21.76
>     System time (seconds): 4.51
>     Percent of CPU this job got: 98%
>     Elapsed (wall clock) time (h:mm:ss or m:ss): 0:26.79
>     Average shared text size (kbytes): 0
>     Average unshared data size (kbytes): 0
>     Average stack size (kbytes): 0
>     Average total size (kbytes): 0
>     Maximum resident set size (kbytes): 0
>     Average resident set size (kbytes): 0
>     Major (requiring I/O) page faults: 3
>     Minor (reclaiming a frame) page faults: 23038
>     Voluntary context switches: 598317
>     Involuntary context switches: 2316
>     Swaps: 0
>     File system inputs: 173504
>     File system outputs: 0
>     Socket messages sent: 0
>     Socket messages received: 0
>     Signals delivered: 0
>     Page size (bytes): 4096
>     Exit status: 0

Hmm I think the results above are invalid
due to MALLOC_PERTURB_ biasing against large buffers.
In the test below the extra external file process dominates
any gains from CPU cache usage:

$ # Test setup
$ cache_size() { lscpu | grep cache | tail -n1 | tr -s ' ' | cut -d' ' -f3; }
$ export TMPDIR=/dev/shm # RAM
$ export OMP_NUM_THREADS=1 # limit to a single CPU to simplify
$ unset MALLOC_PERTURB_ # negatively impacts performance tests
$ shuf -i 1-10000000 > sort.test # 78MB

$ LANG=C /usr/bin/time -v strace -c sort -S$(cache_size) sort.test > /dev/null
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 66.33    0.034878         185       189           unlink
 17.89    0.009408           0     45766           read
  8.73    0.004590           0     57711           write
  4.53    0.002381           4       571           munmap
  1.28    0.000671           2       382           open
  0.46    0.000242           1       384           fstat
  0.35    0.000183           0       580           mmap
  0.19    0.000100           0       384           close
  0.14    0.000074           0       757           rt_sigprocmask
  0.11    0.000058           0       189           fcntl
  0.00    0.000000           0       189           lseek
  0.00    0.000000           0         5           mprotect
  0.00    0.000000           0         4           brk
  0.00    0.000000           0        25           rt_sigaction
  0.00    0.000000           0         1         1 ioctl
  0.00    0.000000           0         1         1 access
  0.00    0.000000           0         1           execve
  0.00    0.000000           0         1           getrlimit
  0.00    0.000000           0         1           arch_prctl
  0.00    0.000000           0         1           set_tid_address
  0.00    0.000000           0       190           fadvise64
  0.00    0.000000           0         1           set_robust_list
------ ----------- ----------- --------- --------- ----------------
100.00    0.052585                107333         2 total
        Command being timed: "strace -c sort --parallel=1 -T/dev/shm -S3072K 
sort.test"
        User time (seconds): 25.88
        System time (seconds): 5.81
        Percent of CPU this job got: 94%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 0:33.59
        Average shared text size (kbytes): 0
        Average unshared data size (kbytes): 0
        Average stack size (kbytes): 0
        Average total size (kbytes): 0
        Maximum resident set size (kbytes): 4968
        Average resident set size (kbytes): 0
        Major (requiring I/O) page faults: 0
        Minor (reclaiming a frame) page faults: 13900
        Voluntary context switches: 429178
        Involuntary context switches: 4882
        Swaps: 0
        File system inputs: 0
        File system outputs: 0
        Socket messages sent: 0
        Socket messages received: 0
        Signals delivered: 0
        Page size (bytes): 4096
        Exit status: 0


$ LANG=C /usr/bin/time -v strace -c sort -S200M sort.test > /dev/null
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 80.68    0.348335         565       617           read
 11.58    0.049994        3846        13           munmap
  5.33    0.022995        7665         3           unlink
  2.38    0.010281           0     38521           write
  0.01    0.000052          52         1         1 access
  0.01    0.000042           8         5           mprotect
  0.01    0.000032           1        25           rt_sigaction
  0.00    0.000000           0        10           open
  0.00    0.000000           0        12           close
  0.00    0.000000           0        12           fstat
  0.00    0.000000           0         3           lseek
  0.00    0.000000           0        22           mmap
  0.00    0.000000           0         4           brk
  0.00    0.000000           0        13           rt_sigprocmask
  0.00    0.000000           0         1         1 ioctl
  0.00    0.000000           0         1           execve
  0.00    0.000000           0         3           fcntl
  0.00    0.000000           0         1           getrlimit
  0.00    0.000000           0         1           arch_prctl
  0.00    0.000000           0         1           set_tid_address
  0.00    0.000000           0         4           fadvise64
  0.00    0.000000           0         1           set_robust_list
------ ----------- ----------- --------- --------- ----------------
100.00    0.431731                 39274         2 total
        Command being timed: "strace -c sort -S200M sort.test"
        User time (seconds): 19.82
        System time (seconds): 4.33
        Percent of CPU this job got: 94%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 0:25.57
        Average shared text size (kbytes): 0
        Average unshared data size (kbytes): 0
        Average stack size (kbytes): 0
        Average total size (kbytes): 0
        Maximum resident set size (kbytes): 205532
        Average resident set size (kbytes): 0
        Major (requiring I/O) page faults: 355
        Minor (reclaiming a frame) page faults: 98057
        Voluntary context switches: 159217
        Involuntary context switches: 4681
        Swaps: 0
        File system inputs: 140344
        File system outputs: 154168
        Socket messages sent: 0
        Socket messages received: 0
        Signals delivered: 0
        Page size (bytes): 4096
        Exit status: 0

I notice also that performance gets worse for the above input
for the small buffer size when I enable multithreading on my 4 core i3.
        User time (seconds): 35.40
        System time (seconds): 9.20
        Percent of CPU this job got: 120%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 0:36.98

cheers,
Pádraig.



reply via email to

[Prev in Thread] Current Thread [Next in Thread]