Re: conv2 performance

octave-maintainers

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: conv2 performance

From:	Jaroslav Hajek
Subject:	Re: conv2 performance
Date:	Tue, 2 Mar 2010 11:26:31 +0100

On Tue, Mar 2, 2010 at 10:56 AM, Jaroslav Hajek <address@hidden> wrote:
> On Mon, Mar 1, 2010 at 9:46 AM, Jaroslav Hajek <address@hidden> wrote:
>> On Sun, Feb 28, 2010 at 4:53 PM, John W. Eaton <address@hidden> wrote:
>>> I recieved a complaint that Octave's conv2 function is about 5 times
>>> slower than Matlab's.  By making a few simple changes, I was able to
>>> improve the perfromance some, but not by a factor of 5.  My changes
>>> are here:
>>>
>>>  http://hg.savannah.gnu.org/hgweb/octave/rev/dc8637fd7a76
>>>
>>> On my system, I see the following improvement after these changes:
>>>
>>>  > n = 1024; m = 64; x = rand(n,n); k = rand(m,m); tic; y = conv2(x,k); toc
>>>
>>>  old: Elapsed time is 26.2576 seconds.
>>>  new: Elapsed time is 13.6104 seconds.
>>>
>>> and
>>>
>>>  > n = 1024; m = 512; x = rand(n,1); k = rand(m,1); m = rand (m,n);
>>>  > tic; y = conv2(x,k,m); toc
>>>
>>>  old: Elapsed time is 8.53273 seconds.
>>>  new: Elapsed time is 3.27103 seconds.
>>>
>>> Maybe there is still room for improvement here.  I would happily use a
>>> free software library with a GPL-compatible license to implement this
>>> function, but I don't know whether one is available.
>>>
>>> jwe
>>>
>>>
>>
>>
>> I have some ideas for speeding up the conv2 in stock. Using an
>> existing library would of course be better, but I don't know of a
>> suitable one. Unless anyone can come up with an existing solution,
>> I'll try to convert the ideas to code eventually.
>>
>>
>
> I think John Swensen has a point in that FFT-based convolution is
> asymptotically better. 4th-order loops can outperform it only for
> small kernels (the meaning of "small" depending on implementation).
> Hence I think any loop-based code should primarily target small
> kernels.
>
> I committed an initial implementation of convolutions in liboctave:
> http://hg.savannah.gnu.org/hgweb/octave/rev/978f5c94b11f
>
> summary:
> This instantiates a specialized code for all kernels up to 8x8 in
> size. Larger kernels are split into 8x8 blocks.
> The 4th-order loop is such that the innermost 2 loops have fixed size,
> hence can be completely unrolled.
>
> I used the attached benchmark to measure the time for convolution of a
> 1000x1000 matrix with all kernel sizes up to 10x10.
> Times are averaged over 5 runs. Results on my computer (Core 2 Duo @
> 2.83 GHz, g++ -O3 -march=native) are attached.
>
> octave:1> load all_data
>
> with a recent tip:
>
> octave:2> tocs_oct0
> tocs_oct0 =
>
>   0.012844   0.013554   0.016711   0.022400   0.023357   0.026729
> 0.030247   0.033534   0.036832   0.040352
>   0.021440   0.029549   0.039860   0.042910   0.046509   0.043920
> 0.050186   0.055437   0.060768   0.064602
>   0.014037   0.022069   0.032884   0.041644   0.046854   0.052915
> 0.061239   0.067863   0.073653   0.079161
>   0.015658   0.030140   0.034210   0.045213   0.052253   0.060553
> 0.069365   0.078562   0.086291   0.093605
>   0.019656   0.028900   0.045464   0.047990   0.058807   0.066490
> 0.079297   0.088274   0.097667   0.108786
>   0.022072   0.033146   0.047572   0.060943   0.070684   0.081349
> 0.092902   0.100911   0.115715   0.127048
>   0.021898   0.030693   0.043972   0.058956   0.077249   0.084425
> 0.100613   0.111503   0.126935   0.141250
>   0.021071   0.037299   0.050950   0.070723   0.081438   0.101516
> 0.114788   0.125197   0.141521   0.158815
>   0.024811   0.036528   0.052532   0.073733   0.095367   0.101299
> 0.117457   0.132726   0.147585   0.165222
>   0.024415   0.045528   0.062038   0.077777   0.094293   0.110504
> 0.127776   0.146341   0.163502   0.178777
>
> with the new patch:
>
> octave:3> tocs_oct1
> tocs_oct1 =
>
>   0.0074930   0.0093061   0.0101123   0.0059468   0.0067618
> 0.0073412   0.0095744   0.0111704   0.0155106   0.0155500
>   0.0056431   0.0061924   0.0084282   0.0121774   0.0137026
> 0.0166219   0.0200053   0.0235966   0.0279384   0.0284500
>   0.0060744   0.0084982   0.0126268   0.0165213   0.0221927
> 0.0273326   0.0332757   0.0390677   0.0434884   0.0455091
>   0.0063169   0.0110016   0.0172338   0.0247400   0.0311880
> 0.0387136   0.0465136   0.0544024   0.0592641   0.0642496
>   0.0066828   0.0141782   0.0221559   0.0309312   0.0406902
> 0.0501870   0.0602919   0.0492872   0.0548416   0.0618096
>   0.0084610   0.0162350   0.0308314   0.0393550   0.0504898
> 0.0641784   0.0500531   0.0585372   0.0652503   0.0736086
>   0.0105060   0.0235956   0.0373542   0.0465218   0.0594338
> 0.0500659   0.0609870   0.0713593   0.0790979   0.0899705
>   0.0148870   0.0233453   0.0382854   0.0579996   0.0480443
> 0.0571550   0.0695515   0.0805356   0.0899501   0.1026436
>   0.0159232   0.0315198   0.0468793   0.0580004   0.0528670
> 0.0641594   0.0777459   0.0901399   0.1043304   0.1173685
>   0.0191256   0.0280614   0.0450657   0.0671604   0.0608538
> 0.0769042   0.0929764   0.1040684   0.1214832   0.1341881
>
> using Matlab 2007a:
>
> octave:4> tocs_matlab
> tocs_matlab =
>
>   0.0062376   0.0139384   0.0180750   0.0221970   0.0220130
> 0.0257478   0.0305874   0.0344506   0.0383642   0.0428446
>   0.0109510   0.0266958   0.0351854   0.0458640   0.0620206
> 0.0698450   0.0828816   0.0961360   0.1055158   0.1111318
>   0.0134432   0.0272864   0.0418616   0.0544420   0.0615062
> 0.0739450   0.0873426   0.0982644   0.1111086   0.1225830
>   0.0184896   0.0378428   0.0543014   0.0658802   0.0816770
> 0.0981762   0.1139014   0.1303218   0.1464620   0.1628398
>   0.0219104   0.0428282   0.0660560   0.0862450   0.1027792
> 0.1222084   0.1425500   0.1629794   0.1833898   0.2025316
>   0.0265670   0.0535164   0.0782622   0.0981846   0.1218348
> 0.1462216   0.1700228   0.1945714   0.2187592   0.2437806
>   0.0340150   0.0575714   0.0851972   0.1174214   0.1405062
> 0.1686618   0.1965610   0.2246892   0.2526790   0.2803052
>   0.0348728   0.0733926   0.1020340   0.1292908   0.1611636
> 0.1934834   0.2251806   0.2569800   0.2883594   0.3213730
>   0.0423292   0.0779326   0.1099018   0.1503224   0.1818022
> 0.2175340   0.2541414   0.2897038   0.3259232   0.3611042
>   0.0422816   0.0899742   0.1258612   0.1622720   0.2018848
> 0.2427072   0.2823120   0.3228700   0.3632112   0.4034948
>
> you can see that this is actually slower in most cases than Octave
> (even with the previous implementation).
>
> The speed-ups (in %) of the new code compared to the old one (more is
> better, 0 is no speed-up)
>
> octave:5> 100 * (tocs_oct0 ./ tocs_oct1 - 1)
> ans =
>
>    71.410    45.645    65.257   276.671   245.425   264.099   215.913
>  200.206   137.462   159.501
>   279.935   377.184   372.942   252.378   239.414   164.228   150.863
>  134.938   117.508   127.072
>   131.090   159.694   160.428   152.063   111.125    93.595    84.036
>   73.706    69.362    73.946
>   147.882   173.962    98.506    82.752    67.542    56.413    49.128
>   44.409    45.604    45.689
>   194.121   103.835   105.200    55.150    44.523    32.485    31.522
>   79.101    78.090    76.002
>   160.868   104.162    54.297    54.854    39.997    26.754    85.606
>   72.388    77.340    72.599
>   108.435    30.080    17.717    26.727    29.975    68.627    64.974
>   56.256    60.479    56.995
>    41.537    59.772    33.078    21.937    69.506    77.616    65.040
>   55.455    57.333    54.725
>    55.819    15.888    12.059    27.125    80.391    57.886    51.079
>   47.244    41.459    40.772
>    27.655    62.245    37.662    15.809    54.951    43.691    37.429
>   40.620    34.589    33.229
>
> apparently (and according to expectations), the small kernels are the
> most affected ones.
> I think 8x8 is unnecessarily high (generates 64 functions), I'll
> probably reduce that to 7x5 or something like that.
>
> For comparison, speed-ups compared to Matlab:
>
> ans =
>
>   -16.755    49.776    78.742   273.261   225.548   250.730   219.470
>  208.410   147.342   175.528
>    94.059   331.103   317.473   276.633   352.618   320.198   314.298
>  307.414   277.673   290.621
>   121.310   221.083   231.530   229.527   177.146   170.538   162.481
>  151.523   155.490   169.359
>   192.703   243.976   215.087   166.290   161.886   153.596   144.878
>  139.552   147.134   153.449
>   227.861   202.071   198.142   178.828   152.589   143.506   136.433
>  230.673   234.399   227.670
>   213.994   229.637   153.839   149.484   141.306   127.836   239.685
>  232.389   235.262   231.185
>   223.769   143.992   128.079   152.401   136.408   236.879   222.300
>  214.870   219.451   211.552
>   134.249   214.378   166.509   122.917   235.448   238.524   223.761
>  219.089   220.577   213.096
>   165.833   147.250   134.435   159.175   243.886   239.052   226.887
>  221.394   212.395   207.667
>   121.073   220.633   179.284   141.619   231.754   215.597   203.638
>  210.248   198.981   200.693
>
> hmmm. Maybe there were some improvements in Matlab since?
>
> There are still a few gotchas:
>
> First, only the "full" convolution is treated directly, the other
> cases are extracted from it. I can try to add direct code for the
> "valid" case in the future.
>
> Also, quoting the sources:
>
> // FIXME: Only the axpy-form accumulation is used. This is natural for outer
> // convolution as it requires no boundary treatment.
> // This typically requires one more memory store per operation, but as the
> // memory access pattern is equivalent (switching ro and rw parts), I wouldn't
> // expect a significant difference. cf. for instance sum(), where row-wise sum
> // (axpy-based) is no slower than column-wise (dot-based).
> // It would be nice, however, if the inner convolution was computed directly 
> by
> // dot-based accumulation.
>
> Looking at the assembler code generated for oct-convn.cc, I would
> expect the inner loops to be vectorized, but it doesn't appear so -
> they're only unrolled. This is a bit of surprise to me because with
> newer gcc -ftree-vectorize is on with -03. I googled a little and
> found that aliasing might be the issue, but adding __restrict__
> doesn't seem to help. One more thing to figure out.
>
> regards
>

In any case, convn results are decisive:

old:
octave:1> a = rand (100, 100, 100); b = rand (5, 5, 5); tic; c = convn
(a, b); toc
Elapsed time is 10.5321 seconds.

new:
octave:2> a = rand (100, 100, 100); b = rand (5, 5, 5); tic; c = convn
(a, b); toc
Elapsed time is 0.206056 seconds.

I removed the old convn implementation, as the new code appears to be
much faster.

enjoy

-- 
RNDr. Jaroslav Hajek, PhD
computing expert & GNU Octave developer
Aeronautical Research and Test Institute (VZLU)
Prague, Czech Republic
url: www.highegg.matfyz.cz

[Prev in Thread]

Current Thread

[Next in Thread]

Re: conv2 performance, Jaroslav Hajek, 2010/03/01
- Re: conv2 performance, Jaroslav Hajek, 2010/03/02
  - Re: conv2 performance, Jaroslav Hajek <=
    - Re: conv2 performance, Jaroslav Hajek, 2010/03/03
    - Re: conv2 performance, Michael D. Godfrey, 2010/03/03
    - Re: conv2 performance, Jaroslav Hajek, 2010/03/03
    - Re: conv2 performance, John W. Eaton, 2010/03/03
    - Re: conv2 performance, Michael D. Godfrey, 2010/03/03
    - Re: conv2 performance, Jaroslav Hajek, 2010/03/04
    - Re: conv2 performance, Michael Goffioul, 2010/03/04
    - Re: conv2 performance, John W. Eaton, 2010/03/04
    - Re: conv2 performance, Jaroslav Hajek, 2010/03/04
    - Re: conv2 performance, John W. Eaton, 2010/03/04

Prev by Date: Re: conv2 performance
Next by Date: proposal for minor new feature
Previous by thread: Re: conv2 performance
Next by thread: Re: conv2 performance
Index(es):
- Date
- Thread