avr-gcc-list
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

[avr-gcc-list] Missed optimisations


From: David Brown
Subject: [avr-gcc-list] Missed optimisations
Date: Mon, 14 Jan 2008 15:00:59 +0100
User-agent: Thunderbird 2.0.0.9 (Windows/20071031)

I've just tried out the crc code I posted recently on avr-chat:

> static const uint8_t PROGMEM crcTable8n[16] = {
>         0x00, 0x8D, 0x97, 0x1A, 0xA3, 0x2E, 0x34, 0xB9,
>         0xCB, 0x46, 0x5C, 0xD1, 0x68, 0xE5, 0xFF, 0x72
> };
>
> uint8_t CrcByte8n(uint8_t crc, uint8_t data) {
>         // Fold next byte into CRC
> crc = (crc << 4) ^ (pgm_read_byte(&crcTable8n[(crc >> 4) ^ (data >> 4)])); > crc = (crc << 4) ^ (pgm_read_byte(&crcTable8n[(crc >> 4) ^ (data & 0x0f)]));
>
>         return crc;
> }
>

The generated assemble from avr-gcc 4.2.2 (latest winavr release) with -Os is:

  40                    CrcByte8n:
  41                    /* prologue: frame size=0 */
  42                    /* prologue end (size=0) */
  43 0000 982F                  mov r25,r24      ;  tmp48, crc
  44 0002 9627                  eor r25,r22      ;  tmp48, data
  45 0004 9295                  swap r25         ;  tmp48
  46 0006 9F70                  andi r25,lo8(15)         ;  tmp48,
  47 0008 20E0                  ldi r18,lo8(crcTable8n)  ;  tmp51,
  48 000a 30E0                  ldi r19,hi8(crcTable8n)  ;  tmp51,
  49 000c F901                  movw r30,r18     ;  tmp52, tmp51
  50 000e E90F                  add r30,r25      ;  tmp52, tmp48
  51 0010 F11D                  adc r31,__zero_reg__     ;  tmp52
  52                    /* #APP */
  53 0012 E491                  lpm r30, Z       ;  __result
  54                            
  55                    /* #NOAPP */
  56 0014 8295                  swap r24         ;  crc.36
  57 0016 807F                  andi r24,lo8(-16)        ;  crc.36,
  58 0018 8E27                  eor r24,r30      ;  crc.36, __result
  59 001a E82F                  mov r30,r24      ;  tmp55, crc.36
  60 001c E295                  swap r30         ;  tmp55
  61 001e EF70                  andi r30,lo8(15)         ;  tmp55,
  62 0020 F0E0                  ldi r31,lo8(0)   ; ,
  63 0022 70E0                  ldi r23,lo8(0)   ;  data,
  64 0024 6F70                  andi r22,lo8(15)         ;  data,
  65 0026 7070                  andi r23,hi8(15)         ;  data,
  66 0028 E627                  eor r30,r22      ;  tmp56, data
  67 002a F727                  eor r31,r23      ;  tmp56, data
  68 002c E20F                  add r30,r18      ;  tmp56, tmp51
  69 002e F31F                  adc r31,r19      ;  tmp56, tmp51
  70                    /* #APP */
  71 0030 E491                  lpm r30, Z       ;  __result
  72                            
  73                    /* #NOAPP */
  74 0032 8295                  swap r24         ;  crc.36
  75 0034 807F                  andi r24,lo8(-16)        ;  crc.36,
  76 0036 8E27                  eor r24,r30      ;  crc.36, __result
  77 0038 90E0                  ldi r25,lo8(0)   ;  <result>,
  78                    /* epilogue: frame size=0 */
  79 003a 0895                  ret


The code is basically good (the "swap" instruction is used for the shifts, which is very nice - a big improvement over the older 3.4 gcc), but there are a few missed optimisations shown here that are probably quite common in other code.

Why is the address of crcTable8n loaded into r18:r19 first, before being copied into r30:r31 for the address calculation? It seems that this happens when the address is reused - if it is not reused, then r30:r31 are loaded directly. However, the reuse does not benefit from having the address in a register - the "add r30, r18" and "adc r31, r19" on lines 68 and 69 could be replaced with subi and sbci instructions to save space and time, and to free registers r18:r19. On most RISC cpus, storing the address in a register for reuse would be a benefit, which is probably why this code is generated - on the AVR, it is not helpful (at least, not here).

Secondly, the "(data & 0x0f)" clause generates messy 16-bit code. I realise C requires integer promotion in such cases, but it's important to try to remove unnecessary code such as loading the high register with zero, then anding it with zero, then eoring it. gcc version 3.4.6 was sometimes marginally better at such code. It should be noted that the quality of the generated code depends very much on the exact expression - the original "[(crc >> 4) ^ (data & 0x0f)]" generates poor code, while the equivalent "[((crc >> 4) ^ data) & 0x0f]" generates tight code.

Running the same code in c++ mode produces worse results - there are more promotions to 16-bit which are not removed by the optimiser. I realise there are no maintainers for the c++ port, and the c++ specific libraries are not implemented, but is there a particular problem that is causing this difference?


I've looked through the bug lists, and can't see any of these points in existing bug reports, although there may be some connection with http://gcc.gnu.org/bugzilla/show_bug.cgi?id=11259


Best regards,

David




reply via email to

[Prev in Thread] Current Thread [Next in Thread]