Re: conv2 performance
From: Jaroslav Hajek
Subject: Re: conv2 performance
Date: Tue, 2 Mar 2010 11:26:31 +0100
On Tue, Mar 2, 2010 at 10:56 AM, Jaroslav Hajek <address@hidden> wrote:
> On Mon, Mar 1, 2010 at 9:46 AM, Jaroslav Hajek <address@hidden> wrote:
>> On Sun, Feb 28, 2010 at 4:53 PM, John W. Eaton <address@hidden> wrote:
>>> I received a complaint that Octave's conv2 function is about 5 times
>>> slower than Matlab's. By making a few simple changes, I was able to
>>> improve the performance some, but not by a factor of 5. My changes
>>> are here:
>>>
>>> http://hg.savannah.gnu.org/hgweb/octave/rev/dc8637fd7a76
>>>
>>> On my system, I see the following improvement after these changes:
>>>
>>> > n = 1024; m = 64; x = rand(n,n); k = rand(m,m); tic; y = conv2(x,k); toc
>>>
>>> old: Elapsed time is 26.2576 seconds.
>>> new: Elapsed time is 13.6104 seconds.
>>>
>>> and
>>>
>>> > n = 1024; m = 512; x = rand(n,1); k = rand(m,1); m = rand (m,n);
>>> > tic; y = conv2(x,k,m); toc
>>>
>>> old: Elapsed time is 8.53273 seconds.
>>> new: Elapsed time is 3.27103 seconds.
>>>
>>> Maybe there is still room for improvement here. I would happily use a
>>> free software library with a GPL-compatible license to implement this
>>> function, but I don't know whether one is available.
>>>
>>> jwe
>>>
>>>
>>
>>
>> I have some ideas for speeding up the stock conv2. Using an
>> existing library would of course be better, but I don't know of a
>> suitable one. Unless anyone comes up with an existing solution,
>> I'll try to convert the ideas to code eventually.
>>
>>
>
> I think John Swensen has a point: FFT-based convolution is
> asymptotically better. 4th-order loops can outperform it only for
> small kernels (what counts as "small" depends on the implementation).
> Hence I think any loop-based code should primarily target small
> kernels.
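> 
> For reference, here is a minimal sketch of the FFT approach at the
> interpreter level (the helper name fftconv2_full is hypothetical, not
> an Octave function): zero-pad both operands to the size of the full
> result, multiply the spectra, and transform back.
> 
> function y = fftconv2_full (x, k)
>   [m, n] = size (x);
>   [p, q] = size (k);
>   ## fft2 (a, mm, nn) zero-pads a to mm-by-nn before transforming.
>   y = ifft2 (fft2 (x, m+p-1, n+q-1) .* fft2 (k, m+p-1, n+q-1));
>   if (isreal (x) && isreal (k))
>     y = real (y);  # discard roundoff imaginary parts for real inputs
>   endif
> endfunction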
>
> I committed an initial implementation of convolutions in liboctave:
> http://hg.savannah.gnu.org/hgweb/octave/rev/978f5c94b11f
>
> summary:
> This instantiates specialized code for all kernel sizes up to 8x8.
> Larger kernels are split into 8x8 blocks, as sketched below.
> The 4th-order loop is arranged so that the innermost 2 loops have
> fixed trip counts and hence can be completely unrolled.
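> 
> The splitting is valid because convolution is linear in the kernel:
> the full convolution with each kernel block is accumulated into the
> result at that block's offset. A rough interpreter-level sketch of the
> idea (blocked_conv2 is a hypothetical illustration, not the committed
> C++ code):
> 
> function y = blocked_conv2 (x, k, b)
>   [p, q] = size (k);
>   y = zeros (size (x) + size (k) - 1);  # size of the full result
>   for i = 1:b:p
>     for j = 1:b:q
>       ## full convolution with one b-by-b (or smaller) kernel block
>       yb = conv2 (x, k(i:min (i+b-1, p), j:min (j+b-1, q)));
>       [mb, nb] = size (yb);
>       ## accumulate at the block's offset within the full result
>       y(i-1 + (1:mb), j-1 + (1:nb)) += yb;
>     endfor
>   endfor
> endfunction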
>
> I used the attached benchmark to measure the time for convolution of a
> 1000x1000 matrix with all kernel sizes up to 10x10.
> Times are averaged over 5 runs. Results on my computer (Core 2 Duo @
> 2.83 GHz, g++ -O3 -march=native) are attached.
>
> octave:1> load all_data
>
> with a recent tip:
>
> octave:2> tocs_oct0
> tocs_oct0 =
>
>    0.012844   0.013554   0.016711   0.022400   0.023357   0.026729   0.030247   0.033534   0.036832   0.040352
>    0.021440   0.029549   0.039860   0.042910   0.046509   0.043920   0.050186   0.055437   0.060768   0.064602
>    0.014037   0.022069   0.032884   0.041644   0.046854   0.052915   0.061239   0.067863   0.073653   0.079161
>    0.015658   0.030140   0.034210   0.045213   0.052253   0.060553   0.069365   0.078562   0.086291   0.093605
>    0.019656   0.028900   0.045464   0.047990   0.058807   0.066490   0.079297   0.088274   0.097667   0.108786
>    0.022072   0.033146   0.047572   0.060943   0.070684   0.081349   0.092902   0.100911   0.115715   0.127048
>    0.021898   0.030693   0.043972   0.058956   0.077249   0.084425   0.100613   0.111503   0.126935   0.141250
>    0.021071   0.037299   0.050950   0.070723   0.081438   0.101516   0.114788   0.125197   0.141521   0.158815
>    0.024811   0.036528   0.052532   0.073733   0.095367   0.101299   0.117457   0.132726   0.147585   0.165222
>    0.024415   0.045528   0.062038   0.077777   0.094293   0.110504   0.127776   0.146341   0.163502   0.178777
>
> with the new patch:
>
> octave:3> tocs_oct1
> tocs_oct1 =
>
>    0.0074930   0.0093061   0.0101123   0.0059468   0.0067618   0.0073412   0.0095744   0.0111704   0.0155106   0.0155500
>    0.0056431   0.0061924   0.0084282   0.0121774   0.0137026   0.0166219   0.0200053   0.0235966   0.0279384   0.0284500
>    0.0060744   0.0084982   0.0126268   0.0165213   0.0221927   0.0273326   0.0332757   0.0390677   0.0434884   0.0455091
>    0.0063169   0.0110016   0.0172338   0.0247400   0.0311880   0.0387136   0.0465136   0.0544024   0.0592641   0.0642496
>    0.0066828   0.0141782   0.0221559   0.0309312   0.0406902   0.0501870   0.0602919   0.0492872   0.0548416   0.0618096
>    0.0084610   0.0162350   0.0308314   0.0393550   0.0504898   0.0641784   0.0500531   0.0585372   0.0652503   0.0736086
>    0.0105060   0.0235956   0.0373542   0.0465218   0.0594338   0.0500659   0.0609870   0.0713593   0.0790979   0.0899705
>    0.0148870   0.0233453   0.0382854   0.0579996   0.0480443   0.0571550   0.0695515   0.0805356   0.0899501   0.1026436
>    0.0159232   0.0315198   0.0468793   0.0580004   0.0528670   0.0641594   0.0777459   0.0901399   0.1043304   0.1173685
>    0.0191256   0.0280614   0.0450657   0.0671604   0.0608538   0.0769042   0.0929764   0.1040684   0.1214832   0.1341881
>
> using Matlab 2007a:
>
> octave:4> tocs_matlab
> tocs_matlab =
>
>    0.0062376   0.0139384   0.0180750   0.0221970   0.0220130   0.0257478   0.0305874   0.0344506   0.0383642   0.0428446
>    0.0109510   0.0266958   0.0351854   0.0458640   0.0620206   0.0698450   0.0828816   0.0961360   0.1055158   0.1111318
>    0.0134432   0.0272864   0.0418616   0.0544420   0.0615062   0.0739450   0.0873426   0.0982644   0.1111086   0.1225830
>    0.0184896   0.0378428   0.0543014   0.0658802   0.0816770   0.0981762   0.1139014   0.1303218   0.1464620   0.1628398
>    0.0219104   0.0428282   0.0660560   0.0862450   0.1027792   0.1222084   0.1425500   0.1629794   0.1833898   0.2025316
>    0.0265670   0.0535164   0.0782622   0.0981846   0.1218348   0.1462216   0.1700228   0.1945714   0.2187592   0.2437806
>    0.0340150   0.0575714   0.0851972   0.1174214   0.1405062   0.1686618   0.1965610   0.2246892   0.2526790   0.2803052
>    0.0348728   0.0733926   0.1020340   0.1292908   0.1611636   0.1934834   0.2251806   0.2569800   0.2883594   0.3213730
>    0.0423292   0.0779326   0.1099018   0.1503224   0.1818022   0.2175340   0.2541414   0.2897038   0.3259232   0.3611042
>    0.0422816   0.0899742   0.1258612   0.1622720   0.2018848   0.2427072   0.2823120   0.3228700   0.3632112   0.4034948
>
> You can see that Matlab is actually slower than Octave in most cases
> (even compared with the previous implementation).
>
> The speed-ups (in %) of the new code compared to the old one (more is
> better, 0 means no speed-up):
>
> octave:5> 100 * (tocs_oct0 ./ tocs_oct1 - 1)
> ans =
>
>     71.410    45.645    65.257   276.671   245.425   264.099   215.913   200.206   137.462   159.501
>    279.935   377.184   372.942   252.378   239.414   164.228   150.863   134.938   117.508   127.072
>    131.090   159.694   160.428   152.063   111.125    93.595    84.036    73.706    69.362    73.946
>    147.882   173.962    98.506    82.752    67.542    56.413    49.128    44.409    45.604    45.689
>    194.121   103.835   105.200    55.150    44.523    32.485    31.522    79.101    78.090    76.002
>    160.868   104.162    54.297    54.854    39.997    26.754    85.606    72.388    77.340    72.599
>    108.435    30.080    17.717    26.727    29.975    68.627    64.974    56.256    60.479    56.995
>     41.537    59.772    33.078    21.937    69.506    77.616    65.040    55.455    57.333    54.725
>     55.819    15.888    12.059    27.125    80.391    57.886    51.079    47.244    41.459    40.772
>     27.655    62.245    37.662    15.809    54.951    43.691    37.429    40.620    34.589    33.229
>
> Apparently (and as expected), the small kernels benefit the most.
> I think 8x8 is unnecessarily high (it generates 64 functions); I'll
> probably reduce that to 7x5 or something like that.
>
> For comparison, the speed-ups relative to Matlab:
>
> ans =
>
>    -16.755    49.776    78.742   273.261   225.548   250.730   219.470   208.410   147.342   175.528
>     94.059   331.103   317.473   276.633   352.618   320.198   314.298   307.414   277.673   290.621
>    121.310   221.083   231.530   229.527   177.146   170.538   162.481   151.523   155.490   169.359
>    192.703   243.976   215.087   166.290   161.886   153.596   144.878   139.552   147.134   153.449
>    227.861   202.071   198.142   178.828   152.589   143.506   136.433   230.673   234.399   227.670
>    213.994   229.637   153.839   149.484   141.306   127.836   239.685   232.389   235.262   231.185
>    223.769   143.992   128.079   152.401   136.408   236.879   222.300   214.870   219.451   211.552
>    134.249   214.378   166.509   122.917   235.448   238.524   223.761   219.089   220.577   213.096
>    165.833   147.250   134.435   159.175   243.886   239.052   226.887   221.394   212.395   207.667
>    121.073   220.633   179.284   141.619   231.754   215.597   203.638   210.248   198.981   200.693
>
> Hmmm. Maybe Matlab has seen some improvements since then?
>
> There are still a few gotchas:
>
> First, only the "full" convolution is computed directly; the other
> cases are extracted from it. I can try to add direct code for the
> "valid" case in the future.
>
> Also, quoting the sources:
>
> // FIXME: Only the axpy-form accumulation is used. This is natural for outer
> // convolution as it requires no boundary treatment.
> // This typically requires one more memory store per operation, but as the
> // memory access pattern is equivalent (switching ro and rw parts), I wouldn't
> // expect a significant difference. cf. for instance sum(), where row-wise sum
> // (axpy-based) is no slower than column-wise (dot-based).
> // It would be nice, however, if the inner convolution was computed
> // directly by dot-based accumulation.
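> 
> To make the distinction concrete, here is the axpy-form accumulation
> written out at the interpreter level (conv2_axpy is a hypothetical
> illustration): each kernel element scales a shifted copy of x that is
> added into the result, and for the outer ("full") convolution every
> shifted copy lands entirely inside the result, so no boundary tests
> are needed.
> 
> function y = conv2_axpy (x, k)
>   [m, n] = size (x);
>   [p, q] = size (k);
>   y = zeros (m+p-1, n+q-1);
>   for i = 1:p
>     for j = 1:q
>       ## one "axpy": y += k(i,j) * (x shifted by (i-1, j-1))
>       y(i:i+m-1, j:j+n-1) += k(i,j) * x;
>     endfor
>   endfor
> endfunction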
>
> Looking at the assembler code generated for oct-convn.cc, I would
> expect the inner loops to be vectorized, but that doesn't appear to
> happen - they're only unrolled. This is a bit of a surprise to me
> because with newer gcc, -ftree-vectorize is on with -O3. I googled a
> little and found that aliasing might be the issue, but adding
> __restrict__ doesn't seem to help. One more thing to figure out.
>
> regards
>
In any case, the convn results are decisive:
old:
octave:1> a = rand (100, 100, 100); b = rand (5, 5, 5); tic; c = convn (a, b); toc
Elapsed time is 10.5321 seconds.
new:
octave:2> a = rand (100, 100, 100); b = rand (5, 5, 5); tic; c = convn (a, b); toc
Elapsed time is 0.206056 seconds.
I removed the old convn implementation, as the new code appears to be
much faster.
enjoy
--
RNDr. Jaroslav Hajek, PhD
computing expert & GNU Octave developer
Aeronautical Research and Test Institute (VZLU)
Prague, Czech Republic
url: www.highegg.matfyz.cz