[avr-gcc-list] Re: avr-gcc optimisation


From: David Brown
Subject: [avr-gcc-list] Re: avr-gcc optimisation
Date: Sun, 01 Mar 2009 14:50:21 +0100
User-agent: Thunderbird 2.0.0.19 (Windows/20081209)

Nicholas Vinen wrote:
address@hidden wrote:
OK, I only spent a few minutes looking at old code and I found some obviously sub-optimal results. It distills down to this:

#include <avr/io.h>

int main(void) {
  unsigned long packet = 0;

  while(1) {
    if( !(PINC & _BV(PC2)) ) {
      packet = (packet<<1)|(((unsigned char)PINC>>1)&1);
    }
    PORTB = packet;
  }
}
Did you write the code like this just to test the optimiser?

As far as I understand, it's a stripped-down example to demonstrate the code bloat in a reproducible way (compilable source).

It originally came from a program that did something sensible. This was the part of the code which received data from another microcontroller one bit at a time into a 32-bit buffer (packet), MSB first; once the buffer is full, it does various things depending on what's in the packet. This seemed like the simplest way to do it, and I think it should have compiled to something compact, so I was surprised when I looked at the assembly.

Here's another example of a couple of inefficiencies which caused me trouble in the same application. Take these two bits of code.

a)

#define PINC  (*((unsigned char volatile*) 0x20))
#define PORTB (*((unsigned char volatile*) 0x22))

void foo ()
{
    unsigned char a, b, c;

    while(1)
    {
        a = PINC;
        b = PINC;
        c = ((unsigned short)a*(unsigned short)b)>>8;
        PORTB = c;
    }
}

b)

#define PINC  (*((unsigned short volatile*) 0x20))
#define PORTB (*((unsigned short volatile*) 0x22))

void foo ()
{
    unsigned short a, b, c;

    while(1)
    {
        a = PINC;
        b = PINC;
        c = ((unsigned long)a*(unsigned long)b)>>16;
        PORTB = c;
    }
}


Now, interestingly, in the first case the multiply gets inlined, which saves it from having to cross-multiply the top halves of the two shorts (which will always be zero) as well as removing the function call overhead, etc. (this is using -O3):

.L2:
        .stabn  68,0,11,.LM1-.LFBB1
.LM1:
        in r18,32-0x20   ;  a,   ;  7   *movqi/4        [length = 1]
        .stabn  68,0,12,.LM2-.LFBB1
.LM2:
        in r24,32-0x20   ;  b,   ;  9   *movqi/4        [length = 1]
        .stabn  68,0,13,.LM3-.LFBB1
.LM3:
        mul r24,r18      ;  b, a         ;  10  umulqihi3       [length = 3]
        movw r24,r0      ;  tmp46
        clr r1
        .stabn  68,0,14,.LM4-.LFBB1
.LM4:
        out 34-0x20,r25  ; ,     ;  14  *movqi/3        [length = 1]
        rjmp .L2         ;       ;  35  jump    [length = 1]


It's still doing the final shift very inefficiently - all it has to do is output the top half of the result, but instead it twiddles the registers around to do the shift first, then outputs the lower half. Kind of silly, but overall, not too horrible.

This "pointless twiddling" is, as I understand it, an artefact of the way avr-gcc uses r0 as a temporary register, and r1 as a zero register. r0 is only considered valid within an RTL encoding - it can't hold anything beyond the fixed code sequences that are generated for particular RTL patterns. Thus the result has to be moved into another register before the next pattern (the "out"). Additionally, you are seeing an effect of int-promotion which sometimes results in pairs of registers being used when a single register would do (there are fewer of these now than there used to be in avr-gcc).

The second case is not as good (also using -O3):

.L2:
        .stabn  68,0,11,.LM1-.LFBB1
.LM1:
        in r18,32-0x20   ;  a,   ;  7   *movhi/2        [length = 2]
        in r19,(32)+1-0x20       ;  a,
        .stabn  68,0,12,.LM2-.LFBB1
.LM2:
        in r22,32-0x20   ;  b,   ;  9   *movhi/2        [length = 2]
        in r23,(32)+1-0x20       ;  b,
        .stabn  68,0,13,.LM3-.LFBB1
.LM3:
        ldi r24,lo8(0)   ;  b,   ;  37  *movhi/4        [length = 2]
        ldi r25,hi8(0)   ;  b,
        ldi r20,lo8(0)   ;  a,   ;  39  *movhi/4        [length = 2]
        ldi r21,hi8(0)   ;  a,
        rcall __mulsi3   ;  14  *mulsi3_call    [length = 1]
        movw r22,r24     ;  tmp49, tmp48         ;  42  *lshrsi3_const/3  [length = 3]
        clr r24  ;  tmp49
        clr r25  ;  tmp49
        .stabn  68,0,14,.LM4-.LFBB1
.LM4:
        out (34)+1-0x20,r23      ; , tmp49       ;  19  *movhi/3        [length = 2]
        out 34-0x20,r22  ; , tmp49
        rjmp .L2         ;       ;  43  jump    [length = 1]


Still the same pointless twiddling at the end, but this time it's actually loading zeroes into the top 16 bits of each of the two inputs and calling __mulsi3, which, I'm fairly sure, will do a whole bunch of useless multiplies by zero and additions of zero. Is there a reason it won't inline and optimise __mulsi3? Perhaps the 8x8=>16 case is handled specially but the 16x16=>32 case isn't, or perhaps I need to pass some command-line option to avr-gcc to make it understand that inlining a larger intrinsic is worthwhile. I would have thought -O3 would do this already, especially since it should reduce code size here: __mulsi3 is only called once, and inlining it could remove a bunch of overhead and dead instructions.

I guess what I'm doing here is not very common, but the performance difference between this and the hand-written assembly is going to be pretty drastic. Inline assembly can help a lot in this case (i.e. create a 16x16=>32 function), but it would be nice if that wasn't necessary.
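Alternatively, the 16x16=>32 can be written in plain C as four 8x8=>16 partial products, each of which the compiler can lower to a single hardware MUL just as it does in case (a). A sketch only (the function name is made up, and I haven't measured it against this compiler version):

#include <stdint.h>

/* Sketch: unsigned 16x16->32 multiply from four 8x8->16 partial
   products.  a*b = al*bl + ((al*bh + ah*bl) << 8) + (ah*bh << 16).
   The casts keep each partial product an unsigned 16-bit multiply,
   avoiding signed-int overflow on a 16-bit-int target. */
static inline uint32_t mul16x16_32(uint16_t a, uint16_t b)
{
    uint8_t al = (uint8_t)a, ah = (uint8_t)(a >> 8);
    uint8_t bl = (uint8_t)b, bh = (uint8_t)(b >> 8);

    uint32_t r = (uint32_t)((uint16_t)al * bl);
    r += (uint32_t)((uint16_t)al * bh) << 8;
    r += (uint32_t)((uint16_t)ah * bl) << 8;
    r += (uint32_t)((uint16_t)ah * bh) << 16;
    return r;
}

How much this gains over the __mulsi3 call depends on how well the shifts and additions combine, so again the generated assembly is worth checking.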

Am I asking too much from the compiler?


You are asking a lot from C. The C language has no concept of a 16x16->32 multiply (or an 8x8->16 multiply). You can't express what you want (16x16->32) in C, so you must promote it to 32x32->32. The compiler has then generated the code you asked for. In some cases, such as the 8x8->16, the compiler can spot and eliminate extra multiplies. However, that requires inlining. In the first case, the compiler initially inlines a 16x16->16 multiply, then optimises it. But an inline of a general 32x32->32 multiply would be very large - thus it is a library call.

You are not asking for the impossible, but you *are* asking quite a lot!
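To put the same point in code (a sketch restating the above, not something from the original program):

#include <stdint.h>

uint16_t a, b;

/* The closest C can get to a 16x16->32 multiply: widen both operands
   first.  As far as the language is concerned, this is a full
   32x32->32 multiply, even though the top 16 bits of each operand
   are known to be zero. */
uint32_t p = (uint32_t)a * (uint32_t)b;

Only when the multiply is inlined, as in the 8x8 case, can the optimiser see those known-zero halves and delete the partial products involving them; a black-box call to __mulsi3 gives it no such opportunity.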




