avr-gcc-list
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [avr-gcc-list] avr superoptimizer


From: Georg-Johann Lay
Subject: Re: [avr-gcc-list] avr superoptimizer
Date: Tue, 21 Apr 2009 17:36:08 +0200
User-agent: Thunderbird 2.0.0.21 (Windows/20090302)

Sean D'Epagnier schrieb:
Hi,

On 4/20/09, Georg-Johann Lay <address@hidden> wrote:
Hi Sean,

as far as I understand, your tool runs on asm code, i.e. you map short
sequences of asm instructions (which must not contain code labels) to
other instruction sequences.


Right, and it's far from complete.  In theory it could handle code
with labels and branches but I haven't implemented that yet, and it
would be slower for processing.. but I have ideas for how to do it
with reasonable efficiency.

The lookup table is generated by brute force? Observing the results of
asm sequences and trying to map to an other sequence that shows better
performance depending on the cost function (time, space, ...).


Yes, right now it's brute force.  Fortunately the lookup table can be
stored to disk with a checksum of each instruction sequence which is
computed in such a way that it is guaranteed to be the same checksum
for a longer sequence if they are identical

Then you use patterns that actually match some pieces of gcc output to
automatically generate peephole2 resp. peephole patterns for avr-gcc?


Not quite yet.  My peepholes are assembly->assembly.  The peepholes
for gcc are rtl.

gcc knows two kinds of peepholes: RTL->RTL (peephole2) and RTL->asm (peephole). How would you integrate your work into gcc? You would have to map asm to RTL in order express your rules in terms of RTL. Moreover, this mapping is far from being unique: it's pointless to generate peepholes whose input is never generated by gcc. How could this missing link be established?

I must admit that I am not a big fan of insn peepholes generated this
way: peepholes can fix some mess from register allocation, but what is
possible on peep level is very limited because register allocation is
finished. So you cannot get some scratch register, and if a register is
no more used after the peephole, you cannot free it and use for
something else. Moreover, if there is a peephole that matches a sequence
   A, B
it won't match
   A, X, B
where X is independent of A and B.


Right, this is a major concern to me.   I know of a number of
peepholes defined in avr.md which do not always get applied in cases
that it seems like they should.
ome gcc peepholes need a free register available, class "d" in most cases. Others need to know that a specific register dies in an insn, which is not always the case.

Maybe it's possible to invent your work to generate patterns for insn
combine rather that for insn peephole? That pass runs before register
allocation and allows to transform from RTL to RTL.

Yes, I'm still don't think it's the perfect solution though.  I have
to look into this.

Skimmin over the code abr-gcc generates for libgcc, e.g., we see much
romm for improving code both in size and in speed.

Finally I have some comments/question on code snippets in your avrfreaks
post:

*I*

from:
    eor   r6, r6
    eor   r7, r7
    eor   r8, r8
    eor   r9, r9
to:
    ldi   r6, 0 ;; typo
    eor   r7, r7
    movw   r8, r6

Besides the typo, avr-gcc already knows how to do this, see
avr.c::output_movsisf:

AVR_HAVE_MOVW ? (AS1 (clr,%A0) CR_TAB
                AS1 (clr,%B0) CR_TAB
                AS2 (movw,%C0,%A0))
             : (AS1 (clr,%A0) CR_TAB
                AS1 (clr,%B0) CR_TAB
                AS1 (clr,%C0) CR_TAB
                AS1 (clr,%D0));

Maybe you did -fsplit-wide-types? In many situations
-fno-split-wide-types yields better code, but not always. Without
splitting wide types RTL is sometimes a bit unhandy because it has to
deal with larger entities, but with splitting wide types it's harder to
keep track of the bigger picture.

I did not know about this optimization.  I was just using -Os.  I will try it.

Also, that peephole only works for 32bit numbers correct?  What if
there happen to be 2 16 bit ones?  Or even 4 8bit numbers that happen
to be able to benefit from this. Also what if you want to load 0x3bd3
into the upper and lower half using ldi, ldi, movw?  Currently gcc
just does 4 ldis

This is no peephole. It is the routine that prints 32-bit mode regs that load zero. Splitting wide types knocks out this function, of course, because that tries to break down SImode to HImode/QImode. I cannot say what happens for 2*HI and if gcc can be driven to reuse regs whose value is known by adapting ome cost functions. In fact, I supplied some patches that will reuse reg contents on SI operations. This can be done for AND, IOR, XOR, CMP, MOV, etc. However, the problem is not to write the optimizations. The problem is that the patches are rotting somewhere in the web because I didn't get a copyright assignment. So no one will ever integrate them in gcc even if they work.

*II*

from:
    ldi   r16, 101
    ori   r16, 50
    swap   r16
to:
    ldi   r16, 255
    andi   r16, 119

I am a little bit confused. Is the source an output of avr-gcc with
optimization turned on? If so, we should find out why the compiler
generates this code and remove the cause rather than the symptom and say
"ldi r16, 119".


Sorry the second set of examples are ones I manually entered.. gcc did
not generate it.  I wanted to demonstrate the capability of my
superoptimizer has for solving constants (pretty cool isn't it?)  And
Yes that's very interesting for 32 bit RISC machines that need several instructions to load constants. Using arithmetik like >> or << on constants can shrike the code a little bit.

*III*
from:
    mov   r24, r25
    ldi   r25, 0
    mov   r25, r24
    eor   r24, r24
to:
    eor   r24, r24

The original snippet would look something like
   (set (reg:HI 101) (zero_extend:HI (reg:QI 100)))
   (set (reg:HI 102) (ashift:HI (reg:HI 101) (const_int 8)))
Combine will try
   (set (reg:HI 102)
        (ashift:HI (zero_extend:HI (reg:QI 100))
                   (const_int 8)))
Which could be split/mapped to
   (set (subreg:QI (reg:HI 102) 1) (reg:QI 100))
   (set (subreg:QI (reg:HI 102) 0) (const_int 0))
Note that this is more general and contains the code above if reload
decides to allocate 102 to R24 and 100 to R25 (or 102 to REGNO and 100
to REGNO+1).

Unfortunately, skimming generated asm for such sequences and writing
patterns to catch them is very time consuming. But I am unsure if
automatically generated peepholes2 is what we want
-- there will be bulk of patterns in the backend where no one really
knows where they come from. There will be no individual comments why
they are there. It will be harder to maintain the backend.

I would specify them as generated and keep them in a separate file.

Perhaps the new gcc plugin framework will open a door? So that the functionality could be supplied in a plugin, without all the license hick-hack that is implied when gcc sources are to be changed...

-- I think before adding peepholes we should try to fix the very
problems: maybe missing combine patterns, playing around with command
line options, smarter ways to printout assembler, maybe costs, insn
constraints, see if the bad code still persist in gcc 4.5, etc.

Yes I agree, it is better to handcode a few patterns to take care of
90% of the cases than to have a few hundred generated cases.

As I said, IMHO peep2 should be a last resort to fix mess if more
sophisticated and more general approaches fail. I guess a bunch of the
cases you see and treat is just because gcc doesn't handle what it could
have handled if it was described somewhere.

Yes, and it's annoying the way gcc is 32bit centric so it means all
the patterns have to be duplicated for 8, 16, and 32 bits on avr.

Maybe I'm getting carried away, but Ideally gcc would figure out how
to add 32bit numbers if it knows how to add 8 bit ones.. it should be
able to generate multiplication and division routines using it's
knowledge of the assembly language.. and then it would be trivial to
support 24bit integers, or fixed point types of any size for any
target (It was a pain writing all the multiplication and division
routines for 8 different types of fixed point numbers)  If it could
do this type of thing, it would significantly speed up writing new
backends since you would only need to define the instruction set to
the compiler, not rtl to the instruction set.

There are -- or at least there were -- projects that try to generate a cross compiler (resp. machine description to feed a cross compiler) directly from the hardware description/specification. But that is still a field of science and research.

As far as gcc is concerned, it actually *does* expand SImode(32 bits) by means of smaller modes like HImode(16 bits) and/or QImode(8 bits) if you do not suply SImode/remove SImode from the backend. Note that word_mode is HImode resp QImode (depends on -mint8) and Pmode is HImode, too.

However, the code is far from what would please your eyes. Yust have a look how avr-gcc breaks down 64-bit numbers (DImode) to smaller modes. This looks fine for moves and maybe bit operations. But addition is horror because there is nothing like an add-with-carry or compare-with-carry: The carry will be computed explicitely by means of shifts, etc, i.e. extremely expensive. This means that the best way to support SImode in gcc is to actually provide this mode -- and this means you have to supply it /completely/ and rewrite the difficult parts, especially the move instruction which is most painful.

Some backend describe add-carry explicitely on RTL level. However, the RTL optimizers don't know anything about it, and things will remain this way exept some carry-magic is integrated in gcc. That is not trivial and a major change with new standars insns, optimization passes and low-level stuff that's far from being trivial.

Georg-Johann






reply via email to

[Prev in Thread] Current Thread [Next in Thread]