
Re: [gnugo-devel] paul_3_13.6


From: Dave Denholm
Subject: Re: [gnugo-devel] paul_3_13.6
Date: 10 Dec 2002 07:16:10 +0000

Paul Pogonyshev <address@hidden> writes:

> Dave wrote:
> > I'm confused...
> >
> > >      /* go to next state */
> > > -    delta = pdfa->states[state].next[dfa_p[dfa_pos + spiral[l][row]]];
> > > +    delta = pdfa->states[state].next[dfa_p[dfa_pos + spiral[row][l]]];
> > >      state += delta;
> > >      row++;
> > >    } while (delta != 0); /* while not on error state */
> >
> > I see the inner loop being over row, not 'l'. So swapping the
> > indices hurts the cache, rather than helping it ?
> >
> > If there are enough registers, compiler should be able to put
> > spiral[row] into a register.
> >
> > x86 can scale memory accesses by small powers of 2, but 8*sizeof(int)
> > is probably too big.
> 
> i did check that it really improved performance. here is an explanation
> (which i think might be true ;).
> 
> when we have spiral[8][MAX_ORDER] array, it looks this way in memory:
> 
> 0123456.....
> 0123456.....
> .............
> 0123456..... (eight times)
> 
> where 0, 1 and so on mean the second index. the dfa matcher uses the
> first few dozen elements of each spiral[l] array most of the
> time (99.9% i think). each such array has DFA_MAX_BOARD
> * DFA_MAX_BOARD * 4 > 1600 elements. so, we have eight _different_
> regions of frequently accessed memory. this must significantly
> decrease caching efficiency.
> 

Ah, I'm with you...

The other perspective, however, is that, IIRC, x86 uses a primary cache line
of 32 bytes. So your order forces every sequential read to come from a different
cache line. With the previous ordering, sequential reads come from the
same cache line. Since we never touch later elements of spiral[],
they would never be brought into the cache at all.

So it's balancing primary cache use against secondary cache (whose layout
I don't know off-hand).
On a machine without much physical RAM, there's also paging to worry about,
and your ordering is clearly a big win there.

But we make a large number of random reads to the board and the dfa, which will
throw lines out of the primary cache at random. So one could argue
that the fewer cache lines used in the inner loop, the better.
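
To put numbers on both points - the single hot region you describe and the
per-read stride above - here's a minimal standalone sketch. The names and
the 1600 size are only illustrative, not the actual dfa.c definitions:

  #include <stdio.h>

  #define SPIRAL_MAX 1600        /* stands in for DFA_MAX_BOARD * DFA_MAX_BOARD * 4 */

  static int spiral_lr[8][SPIRAL_MAX];   /* old order: spiral[l][row] */
  static int spiral_rl[SPIRAL_MAX][8];   /* new order: spiral[row][l] */

  int main(void)
  {
    /* the inner loop walks row for a fixed l; distance between
       consecutive reads in each layout: */
    printf("spiral[l][row]: row -> row+1 is %ld bytes apart\n",
           (long) ((char *) &spiral_lr[0][1] - (char *) &spiral_lr[0][0]));
    printf("spiral[row][l]: row -> row+1 is %ld bytes apart\n",
           (long) ((char *) &spiral_rl[1][0] - (char *) &spiral_rl[0][0]));

    /* distance between the hot prefixes of two different l values: */
    printf("spiral[l][row]: l -> l+1 starts %ld bytes apart\n",
           (long) ((char *) &spiral_lr[1][0] - (char *) &spiral_lr[0][0]));
    printf("spiral[row][l]: l -> l+1 starts %ld bytes apart\n",
           (long) ((char *) &spiral_rl[0][1] - (char *) &spiral_rl[0][0]));
    return 0;
  }

With 4-byte ints that comes out as 4, 32, 6400 and 4 bytes: the old order
keeps sequential reads on one cache line but spreads the hot prefixes over
eight widely separated regions; the new order pays a 32-byte step per read
but keeps all eight l slices of a given row together.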


Incidentally, an interesting experiment I may have mentioned before
is to get the initial alignment of memory structures right.
E.g. making spiral[] 32-byte aligned may help cache behaviour
some more (whichever order is used).
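
With gcc that could look something like the line below - a sketch only,
using the swapped order and an illustrative size, since the aligned
attribute is a gcc extension:

  #define SPIRAL_MAX 1600   /* illustrative, not the real dfa.c constant */

  /* force the table onto a 32-byte boundary; with the [row][l] order
     each row of 8 ints then occupies exactly one 32-byte cache line */
  static int spiral[SPIRAL_MAX][8] __attribute__ ((aligned (32)));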

> when the indices had been swapped, all the frequently used elements
> combined into a _single_ continuous region of memory. besides, i

Hmm - a continuous region of virtual main memory, but caches are
probably more important?

> think gcc is smart enough, so it doesn't actually evaluate
> `row*8*sizeof(int)', but adds 8*sizeof(int) to the pointer it uses.
> 

Hmm - x86 is rather short of registers. I did try playing
with the source to get good register usage. E.g. x86 can scale
by small powers of 2 when reading memory: spiral[l][row]
is one instruction if spiral[l] and row are in registers.
But I don't know if the scaling goes up as far as 32?
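
For what it's worth, the hand-written form of the strength reduction Paul
expects would look roughly like this - a sketch against the quoted hunk,
assuming the surrounding declarations in dfa.c and that row starts at 0;
gcc may well generate the equivalent on its own:

    /* keep a pointer into spiral[][] and bump it by one row (8 ints = 32
       bytes) per iteration instead of recomputing row * 8 * sizeof(int) */
    const int *sp = &spiral[0][l];
    do {
      /* go to next state */
      delta = pdfa->states[state].next[dfa_p[dfa_pos + *sp]];
      state += delta;
      sp += 8;                  /* now points at spiral[row + 1][l] */
    } while (delta != 0);       /* while not on error state */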

dd
-- 
address@hidden          http://www.insignia.com


