[avr-gcc-list] Missing compiler optimizations?


From: Erik Walthinsen
Subject: [avr-gcc-list] Missing compiler optimizations?
Date: Sat, 21 Jan 2006 22:58:08 -0800
User-agent: Debian Thunderbird 1.0.7 (X11/20051017)

I'm going through and converting select functions in my program to assembly, sometimes for code size, sometimes for speed. Usually I get about a 2x improvement in one category or the other, and not infrequently in both at once.

The scary thing is that when I look at what GCC is producing (both 3.4.4 [avr-patched] and 4.0.2 [un-patched]), some of it is missing really *basic* optimization concepts. It's a little shocking to me that 4.0 is generally slightly worse than 3.4, which isn't doing all that well to begin with.

For example:

#define PARTPINS 28
uint8_t pinconfig[PARTPINS];
void load_eeprom_defaults() {
  uint8_t i;
  uint8_t *ee = 0, *ram;

  ram = (uint8_t *)pinconfig;
  for (i=0;i<PARTPINS;i++) {
    *(ram++) = bootloader_eeprom_read_byte(ee++);
  }
}

With gcc 3.4.4 I get:

void load_eeprom_defaults() {
 678:   ef 92           push    r14
 67a:   ff 92           push    r15
 67c:   1f 93           push    r17
 67e:   cf 93           push    r28
 680:   df 93           push    r29
  uint8_t i;
  uint8_t *ee = 0, *ram;
 682:   c0 e0           ldi     r28, 0x00       ; 0
 684:   d0 e0           ldi     r29, 0x00       ; 0

  /* pinconfig */
  ram = (uint8_t *)pinconfig;
 686:   8f e3           ldi     r24, 0x3F       ; 63
 688:   e8 2e           mov     r14, r24
 68a:   81 e0           ldi     r24, 0x01       ; 1
 68c:   f8 2e           mov     r15, r24
 68e:   1b e1           ldi     r17, 0x1B       ; 27
  for (i=0;i<PARTPINS;i++) {
    *(ram++) = bootloader_eeprom_read_byte(ee++);
 690:   ce 01           movw    r24, r28
 692:   21 96           adiw    r28, 0x01       ; 1
 694:   a6 d9           rcall   .-3252          ; 0xfffff9e2 <__eeprom_end+0xff7ef9e2>
 696:   f7 01           movw    r30, r14
 698:   81 93           st      Z+, r24
 69a:   7f 01           movw    r14, r30
 69c:   11 50           subi    r17, 0x01       ; 1
 69e:   17 ff           sbrs    r17, 7
 6a0:   f7 cf           rjmp    .-18            ; 0x690 <load_eeprom_defaults+0x18>
 6a2:   df 91           pop     r29
 6a4:   cf 91           pop     r28
 6a6:   1f 91           pop     r17
 6a8:   ff 90           pop     r15
 6aa:   ef 90           pop     r14
 6ac:   08 95           ret
  }

GCC 4.0.2 is a little better, shaving two push/pop pairs but gaining an instruction somewhere else.

By hand, I get:

.global load_eeprom_defaults_asm
load_eeprom_defaults_asm:
/* Registers:
        r31:r30         ram address
        r27:r26         eeprom address
        r25:r24         eeprom_read_byte address argument
        r24             eeprom read byte
        r23             i (port)
*/
        ldi r26, lo8(0)         /* load zero EEPROM address */
        ldi r27, hi8(0)

/* load the pinconfig array */
        ldi r30, lo8(pinconfig) /* load Z with pinconfig */
        ldi r31, hi8(pinconfig)
        ldi r23, PARTPINS       /* start loop with PARTPINS */
pinconfigLoop:
        movw r24, r26           /* copy EEPROM addr to argument */
        rcall eeprom_read_byte
        st Z+, r24              /* store the result in pinconfig[] */
        adiw r26, 1             /* increment the EEPROM addr */
        dec r23                 /* decrement pin count */
        brne pinconfigLoop      /* ...and loop */

        ret

That's a reduction from 27 instructions to 12, and *zero* stack space. Run time (non-critical in this case, but...) should be ~15 cycles instead of ~40.

Right off the bat, the compiler is mostly using low registers, as well as Y (r29:r28), all of which have to be saved. Why isn't it using the caller-saved high registers (r18-r27, r30-r31) instead, saving more than a third of the code size and half the cycle count on the push/pops alone?

The use of low registers has a number of other major disadvantages. First, when loading the 'ram' pointer, the compiler has to ldi into high registers (ldi only works on r16-r31) and then mov into the low registers. Another one, not seen in this particular example unless you switch from *(ram++) to pinconfig[i] as the lvalue, is the loss of adiw. Because the compiler keeps the pointer in low registers, it has to use subi+sbci with 0xFF (subtracting -1) just to increment the pointer by one. adiw works on the top four register pairs, three of which are conveniently caller-saved.
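
To make the adiw point concrete, here's an illustrative fragment (mine, not taken from the listing above) of the two ways to add one to a 16-bit pointer; the register choices are arbitrary:

        adiw r26, 1             /* pointer in an upper pair (here X): one word */

        subi r22, 0xFF          /* pointer in r23:r22: add 1 by subtracting -1... */
        sbci r23, 0xFF          /* ...two words, and these immediates only exist for r16-r31 */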

In other cases I've seen all kinds of relatively common idioms get translated very poorly. A classic one is constructing a word from two bytes, which in my case happens as TWI interrupts feed them to me one at a time:

  w = b | (c<<8);
  68:   99 27           eor     r25, r25
  6a:   33 27           eor     r19, r19
  6c:   32 2f           mov     r19, r18
  6e:   22 27           eor     r18, r18
  70:   82 2b           or      r24, r18
  72:   93 2b           or      r25, r19

w is 16 bits; b and c are both 8. This should be doable with just two movs, but instead it takes six instructions. I understand that the variables are being promoted to int, but it's still inefficient.
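
A source-level workaround (just a sketch with a made-up helper name; whether it actually collapses to two movs will depend on the compiler version) is to store the two halves of the word explicitly, e.g. through a union, so the compiler never sees the promoted shift at all:

#include <stdint.h>

/* Sketch only: build the word by writing its bytes directly instead of
   relying on the compiler to collapse the promoted shift/or.  AVR is
   little-endian, so b[0] is the low byte. */
static inline uint16_t bytes_to_word(uint8_t lo, uint8_t hi)
{
    union {
        uint16_t w;
        uint8_t  b[2];
    } u;

    u.b[0] = lo;
    u.b[1] = hi;
    return u.w;
}

/* usage:  w = bytes_to_word(b, c);   instead of   w = b | (c << 8); */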


I understand that the larger issues of register selection are rather systemic and thus probably hard to solve, but a lot of these smaller idioms seem to be crying out for a simple RTL match that can generate efficient code. I know next to nothing about gcc specifically or compilers in general, but I know gcc has a peephole optimizer, and AFAIU that's what's supposed to handle these cases.

What would it take to collect and implement improvements for idioms like this? I run into them fairly regularly, and if someone knows how to turn these observations into something the compiler can use, I'll gladly write them up and document what I find. Heck, if it's easy enough to add them to the compiler myself, I'll do the whole thing.
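
For instance, a scratch file along these lines, compiled with "avr-gcc -Os -S" and diffed between compiler versions, would be one way to keep track of them (the names here are just for illustration, and the bootloader_eeprom_read_byte prototype is assumed from the call above):

/* idioms.c -- small test cases for the patterns discussed above.
   Compile with "avr-gcc -Os -S idioms.c" and compare the .s output
   between versions. */
#include <stdint.h>

#define PARTPINS 28

extern uint8_t bootloader_eeprom_read_byte(uint8_t *addr);  /* prototype assumed */

uint8_t pinconfig[PARTPINS];

/* Fixed-count copy loop through a function call: ideally no saved
   registers and no stack frame. */
void load_eeprom_defaults(void)
{
    uint8_t *ee = 0;
    uint8_t *ram = pinconfig;
    uint8_t i;

    for (i = 0; i < PARTPINS; i++)
        *(ram++) = bootloader_eeprom_read_byte(ee++);
}

/* Byte pair -> word: ideally two movs. */
uint16_t combine(uint8_t b, uint8_t c)
{
    return (uint16_t)b | ((uint16_t)c << 8);
}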

TIA,
   Omega
   aka Erik Walthinsen
   address@hidden



