From: Dan Sebald
Subject: [Octave-bug-tracker] [bug #52809] interpreter performance is slow on development branch
Date: Thu, 4 Jan 2018 18:40:28 -0500 (EST)
User-agent: Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:55.0) Gecko/20100101 Firefox/55.0

Follow-up Comment #4, bug #52809 (project octave):

>>> Is it not possible to get a trace on just the a = a + b + 123.0 sequence?
>>
>> I haven't done any real profiling, but from my tests it looks like the
>> evaluation of the expression is the real problem.  Comparing
>>
>>    t0=tic; for i=1:1000; for j=1:1000; end; end; t1=toc(t0); t1
>>
>> in 4.2.1 and the current dev version I see nearly identical results. Both
>> were built with GCC 7.2.0 on a Debian system.
>>
>> jwe
> 
> I filed a bug report (https://savannah.gnu.org/bugs/index.php?52809) to
> keep track of this.  Quoting from the write-up there:

Interesting.  It's not necessarily the expression A = B + C itself that
contributes; any kind of statement that appears in the list handled by
tree_evaluator::visit_statement_list adds overhead.  (I looked at this last
night, so at least I'm now a bit familiar with the code.)

Consider the various limit combinations again, starting with the "no-command"
case (an empty inner loop body); a really simple no-variable command comes
next:


octave:1> for lim_p = 0:6
>   lim1 = 10^lim_p;
>   lim2 = 10^(6-lim_p);
>   a = 1; b = 1; t0=tic; for i=1:lim1; for j=1:lim2; end; end; t1=toc(t0); t1
> end
t1 =  0.18503
t1 =  0.18548
t1 =  0.18704
t1 =  0.19305
t1 =  0.24489
t1 =  0.75854
t1 =  5.8472


What I take away from this result is that when the outer for-loop reaches
large iteration counts, evaluation of the inner for-loop begins to dominate.
That is, the *evaluation* of the for-loop construct itself.  With this simple
example, I'm guessing the inner for-loop--though it iterates many times--has
an empty statement_list to visit, i.e., it checks the list, finds nothing, and
continues on to the next iteration.

Question: Is whatever is inside a for-loop re-tokenized, etc., on every pass?
Or is the tokenization of the inner loop's contents done just once and then
cached?

The loop path eventually goes through tree_evaluator::visit_statement_list,
which, as one might guess, walks a list of all the statements on the lexical
lines.  By testing a few things here and there I conclude that not much in
this part of the evaluator contributes a whole lot to the overall time.  There
is a strange FIXME comment about having to do the following:


    // FIXME: commented out along with else clause below.
    // static octave_value_list empty_list;
[snip]
                //              result_values = empty_list;


in order to keep the reference count down and avoid generating extra copies,
but there is no "result_values" anywhere in the routine, so maybe this was
just copied from somewhere else as a conceptual place-holder.  In other words,
it was never implemented.

Nonetheless, to test this we can try something like


octave:2> for lim_p = 0:6
>   lim1 = 10^lim_p;
>   lim2 = 10^(6-lim_p);
>   a = 1; b = 1; t0=tic; for i=1:lim1; for j=1:lim2; 1; end; end; t1=toc(t0); t1
> end
t1 =  1.0119
t1 =  1.0167
t1 =  1.0118
t1 =  1.0166
t1 =  1.0708
t1 =  1.6114
t1 =  6.9960


There is no ref-count change involved in the above, so that comment shouldn't
be relevant.  Yet there is already a 5x increase in CPU usage (probably in the
evaluation routine).  How about comparing small memory versus large memory?:

SMALL MEMORY

octave:3> for lim_p = 0:6
>   lim1 = 10^lim_p;
>   lim2 = 10^(6-lim_p);
>   a = 1; b = 1; t0=tic; for i=1:lim1; for j=1:lim2; a=b; end; end; t1=toc(t0); t1
> end
t1 =  3.1692
t1 =  3.1592
t1 =  3.1674
t1 =  3.1748
t1 =  3.2181
t1 =  3.7610
t1 =  9.0173


Another factor-of-3 increase (for most of the rows) already, just for "a=b"
versus "1".  Parsing and tokenizing certainly can't be that much worse, so the
fact that variables are involved could be important.

LARGE MEMORY

octave:4> for lim_p = 0:6
>   lim1 = 10^lim_p;
>   lim2 = 10^(6-lim_p);
>   a = 1; b = ones(10000); t0=tic; for i=1:lim1; for j=1:lim2; a=b; end; end; t1=toc(t0); t1
> end
t1 =  3.2118
t1 =  3.2141
t1 =  3.2061
t1 =  3.2156
t1 =  3.2567
t1 =  3.7914
t1 =  9.1665


There's only a tiny percentage increase for assigning large amounts of memory.
That suggests reference counting has little influence on the CPU consumption.

Well, the place where the actual "a + b + 123.0" would be evaluated is

tree_evaluator::visit_statement (tree_statement& stmt)

and in turn

tree_evaluator::evaluate (tree_decl_elt *elt)

One thing to try is commenting out some aspects of these routines and seeing
how much that decreases the looping overhead.  That would hint at where the
most savings could be, though skipping some steps might cause internal errors.

    _______________________________________________________

Reply to this item at:

  <http://savannah.gnu.org/bugs/?52809>

_______________________________________________
  Message sent via/by Savannah
  http://savannah.gnu.org/