avr-gcc-list
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [avr-gcc-list] Poor handling of non-power-of-2 multiplication on ATt


From: Georg-Johann Lay
Subject: Re: [avr-gcc-list] Poor handling of non-power-of-2 multiplication on ATtiny
Date: Mon, 06 Apr 2015 18:40:02 +0200
User-agent: Thunderbird 2.0.0.24 (Windows/20100228)

Paul "LeoNerd" Evans schrieb:
I'm writing code for an ATtiny (tiny84 specifically).

I currently have an array and a function defined thus:

  static struct {
    uint8_t state;
    uint8_t v1;
  } leds[3];

  uint8_t led_state(uint8_t i) { return leds[i].state; }

Elements of this array are 2 bytes long, and this function compiles to
some fairly sensible looking code:

000003b8 <led_state>:
 3b8:   e8 2f           mov     r30, r24
 3ba:   f0 e0           ldi     r31, 0x00       ; 0
 3bc:   ee 0f           add     r30, r30
 3be:   ff 1f           adc     r31, r31
 3c0:   e1 57           subi    r30, 0x71       ; 113
 3c2:   ff 4f           sbci    r31, 0xFF       ; 255
 3c4:   80 81           ld      r24, Z
 3c6:   08 95           ret

The instruction at 3bc adds r30 to itself, effectively causing it to
contain 2*i, which is the offset in bytes required to be added to the
base address of the `leds` array to find the required field.

If instead I define my array thus:

  static struct {
    uint8_t state;
    uint16_t v1;
  } leds[3];

I get the most horrible:

000003be <led_state>:
 3be:   90 e0           ldi     r25, 0x00       ; 0
 3c0:   63 e0           ldi     r22, 0x03       ; 3
 3c2:   70 e0           ldi     r23, 0x00       ; 0
 3c4:   2d d3           rcall   .+1626          ; 0xa20 <__mulhi3>
 3c6:   81 57           subi    r24,0x71        ; 113
 3c8:   9f 4f           sbci    r25, 0xFF       ; 255
 3ca:   fc 01           movw    r30, r24
 3cc:   80 81           ld      r24, Z
 3ce:   08 95           ret

Here, gcc has given up trying to perform a non power-of-two
multiplication and just inlined a call to the generic __mulhi3 function
instead.

The code of __mulhi3 is not inlined, it's a library call. GCC has currently no means to determine the accumulated costs of library calls like the one emit in your example. Presumably you compiled with -Os and the sequence above is actually shorter than expanding *3 inline.


This is somewhat disappointing, as if I further expand my array with a
dummy field for padding, I now get:

000003be <led_state>:
 3be:   e8 2f           mov     r30, r24
 3c0:   f0 e0           ldi     r31, 0x00       ; 0
 3c2:   ee 0f           add     r30, r30
 3c4:   ff 1f           adc     r31, r31
 3c6:   ee 0f           add     r30, r30
 3c8:   ff 1f           adc     r31, r31
 3ca:   e1 57           subi    r30, 0x71       ; 113
 3cc:   ff 4f           sbci    r31, 0xFF       ; 255
 3ce:   80 81           ld      r24, Z
 3d0:   08 95           ret

because it has neatly managed to perform that power-of-two
multiplication, this time of 4*i.

I feel this is a situation that could be resolved better; for instance,
a multiplication by 3 in the middle case could have been performed by

  mov r30, r24
  ldi r31, 0x00    ; Z == i
  add r30, r30
  adc r31, r31     ; Z == 2*i
  add r30, r24
  adc r31, r1      ; Z == 2*i + i == 3*i

in just as many instructions as the 4*i case, without my having to

That sequence is not as short as calling __mulhi3. If you prefer faster code as indicated, try optimizing for speed with -O2.

waste extra bytes of precious RAM just to provide that dummy padding.
Furthermore, I care a lot about instruction counts to access because
this LED management code runs inside a timer match interrupt to
implement software PWM, so it has to be fast and tight.


As a workaround for now, I can restructure my code into two separate
arrays of integers, instead of one array of two-int structs, and thus
ensure that all arrays contain only a power-of-two number of bytes, so
all accesses will be efficient. However, it would be nice if gcc could
manage these multiplications a little nicer.

-----

As another side note, consider the massive difference between

  uint8_t mul10(uint8_t x) { return x * 10; }

 3be:   6a e0           ldi     r22, 0x0A       ; 10
 3c0:   29 d3           rcall   .+1618          ; 0xa14 <__mulqi3>
 3c2:   08 95           ret

vs. something like

  mov r0, r24   ; r0 = x
  add r24, r24  ; r24 = 2*x
  add r24, r24  ; r24 = 4*x
  add r24, r0   ; r24 = 5*x
  add r24, r24  ; r24 = 10*x
  ret

Again, try -O2.


Is there currently underway an attempt to provide better optimisation
of these sorts of cases? If not, where/how could I contribute such?

None that I know of. Maybe adding a small penalty to the rtx costs for such shifts will do the trick. The avr backend currently uses the shortest sequence for -Os, maybe that could be relaxed for not-so-tiny devices even in presence of -Os so that saving 1 instruction does not blow up the execution time to the x-fold.

The compiler has just a very coarse understanding of flash size: it is only aware of the architecture like -mmcu=avr4 so that adding libcall penalty could be added if, say, flash >= 16 KiB, i.e. if there is a JMP instruction (AVR_HAVE_JMP_CALL from avr.h).


Johann



reply via email to

[Prev in Thread] Current Thread [Next in Thread]