[avr-gcc-list] Re: avr-gcc optimisation


From: David Brown
Subject: [avr-gcc-list] Re: avr-gcc optimisation
Date: Sun, 01 Mar 2009 14:50:21 +0100
User-agent: Thunderbird 2.0.0.19 (Windows/20081209)

Nicholas Vinen wrote:
address@hidden wrote:
OK, I only spent a few minutes looking at old code and I found some obviously sub-optimal results. It distills down to this:

#include <avr/io.h>

int main(void) {
  unsigned long packet = 0;

  while(1) {
    if( !(PINC & _BV(PC2)) ) {
      packet = (packet<<1)|(((unsigned char)PINC>>1)&1);
    }
    PORTB = packet;
  }
}
Did you write the code like this just to test the optimiser?

As far as I understand, it's a stripped-down example to demonstrate the code bloat in a reproducible way (compilable source).

It originally came from a program that did something sensible. This was the part of the code which received data from another microcontroller one bit at a time into a 32-bit buffer (packet), MSB first; once the buffer is full, it does various things depending on what's in the packet. This seemed like the simplest way to do it, and I think it should have compiled to something compact, so I was surprised when I looked at the assembly.

Here's another example of a couple of inefficiencies which caused me trouble in the same application. Take these two bits of code.

a)

#define PINC  (*((unsigned char volatile*) 0x20))
#define PORTB (*((unsigned char volatile*) 0x22))

void foo ()
{
    unsigned char a, b, c;

    while(1)
    {
        a = PINC;
        b = PINC;
        c = ((unsigned short)a*(unsigned short)b)>>8;
        PORTB = c;
    }
}

b)

#define PINC  (*((unsigned short volatile*) 0x20))
#define PORTB (*((unsigned short volatile*) 0x22))

void foo ()
{
    unsigned short a, b, c;

    while(1)
    {
        a = PINC;
        b = PINC;
        c = ((unsigned long)a*(unsigned long)b)>>16;
        PORTB = c;
    }
}


Now, interestingly, in the first case the multiply gets inlined, which saves it from having to cross-multiply the top halves of the two shorts (which will always be zero) as well as removing the function call overhead, etc. (this is using -O3):

.L2:
        .stabn  68,0,11,.LM1-.LFBB1
.LM1:
        in r18,32-0x20   ;  a,   ;  7   *movqi/4        [length = 1]
        .stabn  68,0,12,.LM2-.LFBB1
.LM2:
        in r24,32-0x20   ;  b,   ;  9   *movqi/4        [length = 1]
        .stabn  68,0,13,.LM3-.LFBB1
.LM3:
        mul r24,r18      ;  b, a         ;  10  umulqihi3       [length = 3]
        movw r24,r0      ;  tmp46
        clr r1
        .stabn  68,0,14,.LM4-.LFBB1
.LM4:
        out 34-0x20,r25  ; ,     ;  14  *movqi/3        [length = 1]
        rjmp .L2         ;       ;  35  jump    [length = 1]


It's still doing the final shift very inefficiently - all it has to do is output the top half of the result, but instead it twiddles the registers around to do the shift first, then outputs the lower half. Kind of silly, but overall, not too horrible.

This "pointless twiddling" is, as I understand it, an artefact of the way avr-gcc uses r0 as a temporary register, and r1 as a zero register. r0 is only considered valid within an RTL encoding - it can't hold anything beyond the fixed code sequences that are generated for particular RTL patterns. Thus the result has to be moved into another register before the next pattern (the "out"). Additionally, you are seeing an effect of int-promotion which sometimes results in pairs of registers being used when a single register would do (there are fewer of these now than there used to be in avr-gcc).

The second case is not as good (also using -O3):

.L2:
        .stabn  68,0,11,.LM1-.LFBB1
.LM1:
        in r18,32-0x20   ;  a,   ;  7   *movhi/2        [length = 2]
        in r19,(32)+1-0x20       ;  a,
        .stabn  68,0,12,.LM2-.LFBB1
.LM2:
        in r22,32-0x20   ;  b,   ;  9   *movhi/2        [length = 2]
        in r23,(32)+1-0x20       ;  b,
        .stabn  68,0,13,.LM3-.LFBB1
.LM3:
        ldi r24,lo8(0)   ;  b,   ;  37  *movhi/4        [length = 2]
        ldi r25,hi8(0)   ;  b,
        ldi r20,lo8(0)   ;  a,   ;  39  *movhi/4        [length = 2]
        ldi r21,hi8(0)   ;  a,
        rcall __mulsi3   ;  14  *mulsi3_call    [length = 1]
        movw r22,r24     ;  tmp49, tmp48         ;  42  *lshrsi3_const/3  [length = 3]
        clr r24  ;  tmp49
        clr r25  ;  tmp49
        .stabn  68,0,14,.LM4-.LFBB1
.LM4:
        out (34)+1-0x20,r23      ; , tmp49       ;  19  *movhi/3        [length = 2]
        out 34-0x20,r22  ; , tmp49
        rjmp .L2         ;       ;  43  jump    [length = 1]


Still the same pointless twiddling at the end, but this time it's actually loading zeroes into the top 16 bits of each of the two inputs and calling __mulsi3, which, I'm fairly sure, will do a whole bunch of useless multiplies by zero and additions of zero. Is there a reason it won't inline and optimise __mulsi3? Perhaps the 8x8=>16 case is handled specially but the 16x16=>32 case isn't, or perhaps I need to pass some command-line option to avr-gcc to make it understand that inlining a larger intrinsic is worthwhile. I would have thought -O3 would do this already, especially since it should reduce code size here: __mulsi3 is only called once, and inlining it could remove a bunch of overhead and dead instructions.

I guess what I'm doing here is not very common, but the performance difference between this and the hand-written assembly is going to be pretty drastic. Inline assembly can help a lot in this case (i.e. create a 16x16=>32 function), but it would be nice if that wasn't necessary.
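Alternatively, the 16x16=>32 can be written in plain C as four 8x8=>16 partial products, each of which the compiler can lower to a single hardware MUL just as it does in case (a). A sketch only (the function name is made up, and I haven't measured it against this compiler version):

#include <stdint.h>

/* Sketch: unsigned 16x16->32 multiply from four 8x8->16 partial
   products.  a*b = al*bl + ((al*bh + ah*bl) << 8) + (ah*bh << 16).
   The casts keep each partial product an unsigned 16-bit multiply,
   avoiding signed-int overflow on a 16-bit-int target. */
static inline uint32_t mul16x16_32(uint16_t a, uint16_t b)
{
    uint8_t al = (uint8_t)a, ah = (uint8_t)(a >> 8);
    uint8_t bl = (uint8_t)b, bh = (uint8_t)(b >> 8);

    uint32_t r = (uint32_t)((uint16_t)al * bl);
    r += (uint32_t)((uint16_t)al * bh) << 8;
    r += (uint32_t)((uint16_t)ah * bl) << 8;
    r += (uint32_t)((uint16_t)ah * bh) << 16;
    return r;
}

How much this gains over the __mulsi3 call depends on how well the shifts and additions combine, so again the generated assembly is worth checking.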

Am I asking too much from the compiler?


You are asking a lot from C. The C language has no concept of a 16x16->32 multiply (or an 8x8->16 multiply). You can't express what you want (16x16->32) in C, so you must promote it to 32x32->32. The compiler has then generated the code you asked for. In some cases, such as the 8x8->16, the compiler can spot and eliminate extra multiplies. However, that requires inlining. In the first case, the compiler initially inlines a 16x16->16 multiply, then optimises it. But an inline of a general 32x32->32 multiply would be very large - thus it is a library call.

You are not asking for the impossible, but you *are* asking quite a lot!
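To put the same point in code (a sketch restating the above, not something from the original program):

#include <stdint.h>

uint16_t a, b;

/* The closest C can get to a 16x16->32 multiply: widen both operands
   first.  As far as the language is concerned, this is a full
   32x32->32 multiply, even though the top 16 bits of each operand
   are known to be zero. */
uint32_t p = (uint32_t)a * (uint32_t)b;

Only when the multiply is inlined, as in the 8x8 case, can the optimiser see those known-zero halves and delete the partial products involving them; a black-box call to __mulsi3 gives it no such opportunity.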




