Re: af_alg: Add ability to use Linux kernel crypto API on data in memory


From: Matteo Croce
Subject: Re: af_alg: Add ability to use Linux kernel crypto API on data in memory
Date: Mon, 7 May 2018 14:10:19 +0200

On Mon, May 7, 2018 at 11:55 AM, Matteo Croce <address@hidden> wrote:
> On Mon, May 7, 2018 at 4:07 AM, Paul Eggert <address@hidden> wrote:
>> Bruno Haible wrote:
>>>
>>> Oops, I goofed with "git diff". Here's the correct patch to test.
>>
>>
>> I tried those bench-md5 benchmarks on two platforms, with somewhat more
>> disappointing results.
>>
>> I observed a real-time slowdown ranging from 11% (large buffers) to 22x
>> (small buffers) on Intel Xeon E3-1225 V2 (circa 2012 CPU), Ubuntu 16.04,
>> Linux 4.4.0, glibc 2.23. See attached file ubuntu1604.txt.
>>
>> I observed a real-time slowdown ranging from 8% (large buffers) to 43x
>> (small buffers) on AMD Phenom II X4 910e (circa 2010 CPU), Fedora 28, Linux
>> 4.16.5, glibc 2.27. See attached file fedora28.txt.
>>
>> These numbers compare somewhat unfavorably with your report, where the
>> real-time slowdown ranged from 1.5% (large buffers) to 25x (small buffers),
>> as reported in <https://lists.gnu.org/r/bug-gnulib/2018-05/msg00035.html>.
>
> Hi all,
>
> I tried all of the above, and I can confirm the disappointing results
> with md5 or small buffers.
> This is what happens on my machine, a Lenovo laptop with an Intel(R)
> Core(TM) i7-6820HQ CPU @ 2.70GHz running Fedora 27.
>
> With large buffers, all the algos except md5 are faster:
>
> $ without/gltests/bench-md5 1000000000 1
> real   1.520719
> user   1.520
> sys    0.000
> $ with/gltests/bench-md5 1000000000 1
> real   1.684162
> user   0.000
> sys    1.684
>
> $ without/gltests/bench-sha1 1000000000 1
> real   1.696258
> user   1.696
> sys    0.000
> $ with/gltests/bench-sha1 1000000000 1
> real   1.072500
> user   0.000
> sys    1.072
>
> $ without/gltests/bench-sha256 1000000000 1
> real   4.467676
> user   4.468
> sys    0.000
> $ with/gltests/bench-sha256 1000000000 1
> real   2.527936
> user   0.009
> sys    2.519
>
> $ without/gltests/bench-sha512 1000000000 1
> real   2.684985
> user   2.685
> sys    0.000
> $ with/gltests/bench-sha256 1000000000 1
> real   2.546133
> user   0.004
> sys    2.542
>
>
> For sha1, af_alg becomes faster with buffers of 100k or more:
>
> $ without/gltests/bench-sha1 100 1000000
> real   0.292869
> user   0.293
> sys    0.000
> $ with/gltests/bench-sha1 100 1000000
> real   9.153545
> user   0.698
> sys    8.421
>
> $ without/gltests/bench-sha1 1000 100000
> real   0.190652
> user   0.191
> sys    0.000
> $ with/gltests/bench-sha1 1000 100000
> real   1.033346
> user   0.071
> sys    0.963
>
> $ without/gltests/bench-sha1 10000 10000
> real   0.183897
> user   0.184
> sys    0.000
> $ with/gltests/bench-sha1 10000 10000
> real   0.214090
> user   0.003
> sys    0.212
>
> $ without/gltests/bench-sha1 100000 1000
> real   0.181184
> user   0.181
> sys    0.000
> $ with/gltests/bench-sha1 100000 1000
> real   0.131482
> user   0.002
> sys    0.130
>
> $ without/gltests/bench-sha1 1000000 100
> real   0.178751
> user   0.179
> sys    0.000
> $ with/gltests/bench-sha1 1000000 100
> real   0.122498
> user   0.000
>
>
> sha256, instead, becomes faster with af_alg with buffers of 10k or more:
>
> $ without/gltests/bench-sha256 100 1000000
> real   0.617181
> user   0.617
> sys    0.000
> $ with/gltests/bench-sha256 100 1000000
> real   9.655386
> user   0.703
> sys    8.950
>
> $ without/gltests/bench-sha256 1000 100000
> real   0.470694
> user   0.471
> sys    0.000
> $ with/gltests/bench-sha256 1000 100000
> real   1.203199
> user   0.091
> sys    1.112
>
> $ without/gltests/bench-sha256 10000 10000
> real   0.459542
> user   0.460
> sys    0.000
> $ with/gltests/bench-sha256 10000 10000
> real   0.360933
> user   0.003
> sys    0.358
>
> $ without/gltests/bench-sha256 100000 1000
> real   0.454326
> user   0.454
> sys    0.000
> $ with/gltests/bench-sha256 100000 1000
> real   0.279998
> user   0.000
> sys    0.280
>
> $ without/gltests/bench-sha256 1000000 100
> real   0.451635
> user   0.452
> sys    0.000
> $ with/gltests/bench-sha256 1000000 100
> real   0.266343
> user   0.001
> sys    0.265
>
> $ without/gltests/bench-sha256 10000000 10
> real   0.443723
> user   0.444
> sys    0.000
> $ with/gltests/bench-sha256 10000000 10
> real   0.260270
> user   0.000
> sys    0.260
>
> Keep in mind that my kernel has the infamous page table isolation
> patch to mitigate the Intel CPU bug. It adds a big overhead to
> syscalls, but it will hopefully be unnecessary on future CPUs:
>
> $ dmesg |grep isolation
> [    0.000000] Kernel/User page tables isolation: enabled
>
> --
> Matteo Croce
> per aspera ad upstream

I did some tests, and it seems that much of the overhead comes from
creating and binding the kernel socket:

$ strace -r -e trace=%network,%desc with/gltests/bench-sha1 100 1
     0.000785 socket(AF_ALG, SOCK_SEQPACKET, 0) = 3
     0.000101 bind(3, {sa_family=AF_ALG, sa_data="hash\0... sha1\0... "}, 88) = 0
     0.000086 accept(3, NULL, NULL)     = 4
     0.000065 sendto(4, "\0\2\3\5\7\n\f\17\22\26\33 &-5>IUbp\201\223\246\274\323\355\t'Gj\217\267"..., 100, MSG_MORE, NULL, 0) = 100
     0.000117 read(4, "v\3770\230\10\374\322\25\26\340\253Y\266\257D\266\30&G\354", 20) = 20

I changed the code to allocate the socket only once and then reuse it,
to see whether it makes a difference.
Obviously this works only if you always use the same algo and a
single thread; it's just an experiment.
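
For reference, the per-hash syscall sequence in the strace output above
(socket, bind, accept, sendto with MSG_MORE, read) can be reproduced
outside gnulib. What follows is only a sketch, written in Python rather
than C for brevity. The function name afalg_hash is made up for
illustration, and it returns None where AF_ALG is not available:

```python
import hashlib
import socket


def afalg_hash(alg, data):
    """Hash `data` with kernel algorithm `alg` through a fresh AF_ALG socket.

    Illustrative helper (not gnulib code). Returns the raw digest, or None
    if this platform or kernel does not offer AF_ALG or the algorithm.
    """
    try:
        # socket(AF_ALG, SOCK_SEQPACKET, 0)
        with socket.socket(socket.AF_ALG, socket.SOCK_SEQPACKET, 0) as tfm:
            # bind(fd, {type "hash", name alg})
            tfm.bind(("hash", alg))
            # accept() returns the descriptor that actually carries the data
            op, _ = tfm.accept()
            with op:
                # sendto(..., MSG_MORE): feed data, keep the hash state open
                op.sendall(data, socket.MSG_MORE)
                # read(): finalizes the hash and returns the digest
                return op.recv(64)
    except (AttributeError, OSError):
        return None


if __name__ == "__main__":
    digest = afalg_hash("sha1", b"abc")
    if digest is None:
        print("AF_ALG not usable here")
    else:
        print(digest.hex(), digest == hashlib.sha1(b"abc").digest())
```

Every call pays for all five syscalls, which is the cost the experiment
tries to avoid for the first three.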

Current code:

$ without/gltests/bench-sha1 100 1000000
real   0.292869
user   0.293
sys    0.000
$ with/gltests/bench-sha1 100 1000000
real   9.153545
user   0.698
sys    8.421

One-time alg setup:

$ with/gltests/bench-sha1 100 1000000
real   1.365084
user   0.178
sys    1.187
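
To put these three runs on a per-hash basis, the wall-clock times can be
divided by the iteration count. This is just arithmetic on the numbers
above; it assumes the second benchmark argument is the number of hashes
computed, and the helper names are made up:

```python
# Wall-clock ("real") seconds copied from the bench-sha1 100 1000000 runs.
REAL_SECONDS = {
    "without af_alg": 0.292869,
    "af_alg, current code": 9.153545,
    "af_alg, one-time setup": 1.365084,
}

# Assumption: the second benchmark argument is the number of hashes computed.
ITERATIONS = 1_000_000


def per_hash_us(total_seconds, iterations=ITERATIONS):
    """Average cost of one hash, in microseconds."""
    return total_seconds / iterations * 1e6


def saved_share(per_call_setup_seconds, one_time_setup_seconds):
    """Fraction of the run time that went away with one-time setup."""
    return 1.0 - one_time_setup_seconds / per_call_setup_seconds


for label, seconds in REAL_SECONDS.items():
    print(f"{label:24s} {per_hash_us(seconds):6.2f} us per hash")

share = saved_share(
    REAL_SECONDS["af_alg, current code"],
    REAL_SECONDS["af_alg, one-time setup"],
)
print(f"time saved by setting up the socket once: {share:.0%}")
```

On these numbers roughly 85% of the af_alg time goes away, yet the reused
socket is still about 4.7 times slower than the userspace code for
100-byte buffers.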

One idea is to keep a cache of FDs, one per algo, and initialize each
of them only once.

-- 
Matteo Croce
per aspera ad upstream

Attachment: onetime_afalg.diff
Description: Text Data

