avr-gcc-list

Re: Testing alternatives to functions from lib1funcs.S


From: Georg-Johann Lay
Subject: Re: Testing alternatives to functions from lib1funcs.S
Date: Sun, 21 Apr 2024 15:22:31 +0200
User-agent: Mozilla Thunderbird

On 21.04.24 at 10:08, Wolfgang Hospital wrote:
> Dear all,
>
> Is there a test scaffold for the functions from lib1funcs.S,
> correctness, size & speed over the variety of 8-bit AVR cores?

Size is the easiest one: just compare the size of a build with, say,
-nodefaultlibs -nostartfiles against a corresponding build with
-Wl,-u,__divmodqi4.

Benchmarking speed is not so easy.  I am using the avrtest core
simulator because it is fast, simulating a core is enough, and
it has some extra features, e.g. get random values and get values
out of the target, e.g. LOG_FMT_DOUBLE ("double = %f\n", x);

https://github.com/sprintersb/atest

See the end of this mail for an example.

For correctness, most of the functions are tested outside the testsuite
by hand-written programs that test new implementations against
existing ones, like in the code below.  Such tests no longer make sense
once the new version is integrated.  And performance
tests / comparisons are misplaced in the GCC testsuite anyway.

> Is there a more comprehensive statement of calling conventions than
> https://gcc.gnu.org/wiki/avr-gcc#Exceptions_to_the_Calling_Convention,

It is comprehensive, but likely not complete.  For completeness, you'll
have to resort to avr.md and the files it includes.  There is no
table that lists the non-ABI stuff though; you'll have to find the
transparent calls, usually of type "xcall".  Notice however that
such functions may be ABI or non-ABI.  Transparent calls are basically
used for two purposes:
* Non-ABI calls like some mul stuff that gets param in X reg.
* ABI calls that don't clobber all callee-used regs, in order to
  model the smaller footprint.

> in particular explicitly stating which functions are guaranteed to have
> __zero_reg__ 0 on entry / where it suffices to have __zero_reg__ 0 on
> return as opposed to preserving its value?

When a function does /not/ have zero_reg=0 on entry, then the compiler
or libc (or application code) has a bug.  Same when zero_reg!=0 on
exit.

> I've been tinkering around, the "ldi r_cnt, 9" / "rjmp entry point" in
> __udivmodqi4 instead of "ldi r_cnt, 8" / "lsl r_arg1" annoying me for
> years. (Biggest relative strict improvement I found, FWIW.)

I went ahead and applied it, see https://gcc.gnu.org/PR114794

In order to test it, I ran the following code with
avrtest_log -q -no-log ...

<CODE>
#include <stdint.h>
#include "avrtest.h"

volatile uint8_t q8, my_q8;
volatile uint8_t r8, my_r8;

extern void __udivmodqi4 (void);
extern void my_udivmodqi4 (void);

__asm("\n"
"r_rem     = 25    /* remainder */" "\n"
"r_arg1    = 24    /* dividend, quotient */" "\n"
"r_arg2    = 22    /* divisor */" "\n"
"r_cnt     = 23    /* loop count */" "\n"
".pushsection .text" "\n"
".global my_udivmodqi4" "\n"
"my_udivmodqi4:" "\n\t"
"  sub     r_rem,r_rem     ; clear remainder and carry" "\n\t"
"  ldi     r_cnt,8         ; init loop counter" "\n\t"
"  lsl     r_arg1          ; shift dividend" "\n\t"
"__udivmodqi4_loop:" "\n\t"
"  rol     r_rem           ; shift dividend into remainder" "\n\t"
"  cp      r_rem,r_arg2    ; compare remainder & divisor" "\n\t"
"  brcs    __udivmodqi4_ep ; remainder < divisor" "\n\t"
"  sub     r_rem,r_arg2    ; restore remainder" "\n\t"
"__udivmodqi4_ep:" "\n\t"
"  rol     r_arg1          ; shift dividend (with CARRY)" "\n\t"
"  dec     r_cnt           ; decrement loop counter" "\n\t"
"  brne    __udivmodqi4_loop" "\n\t"
"  com     r_arg1          ; complement result" "\n\t"
"                          ; because C flag was complemented in loop" "\n\t"
"  ret" "\n\t"
".popsection");

static inline __attribute__((__always_inline__))
void my_divmod8 (volatile uint8_t *pq, volatile uint8_t *prem,
                 uint8_t dividend, uint8_t divisor)
{
    register uint8_t rem asm("25");
    register uint8_t q asm("24");
    register uint8_t r22 asm("22") = divisor;
    register uint8_t r24 asm("24") = dividend;
    asm ("%~call %x[func]"
         : "=r" (q), "=r" (rem)
         : "r" (r22), "r" (r24), [func] "i" (my_udivmodqi4)
         : "r23");
    *pq = q;
    *prem = rem;
}

static inline __attribute__((__always_inline__))
void divmod8 (volatile uint8_t *pq, volatile uint8_t *prem,
              uint8_t dividend, uint8_t divisor)
{
    register uint8_t rem asm("25");
    register uint8_t q asm("24");
    register uint8_t r22 asm("22") = divisor;
    register uint8_t r24 asm("24") = dividend;
    asm ("%~call %x[func]"
         : "=r" (q), "=r" (rem)
         : "r" (r22), "r" (r24), [func] "i" (__udivmodqi4)
         : "r23");
    *pq = q;
    *prem = rem;
}

void bench_divmod8 (void)
{
    uint8_t a = 0;
    do
    {
        uint8_t b = 1;
        do
        {
            PERF_START_CALL (1);
            divmod8 (&q8, &r8, a, b);
            PERF_STOP (1);

            PERF_START_CALL (2);
            my_divmod8 (&my_q8, &my_r8, a, b);
            PERF_STOP (2);

            if (q8 != my_q8 || r8 != my_r8)
                __builtin_abort();
        } while (++b);
    } while (++a);
}

int main (void)
{
    bench_divmod8();
    PERF_DUMP_ALL;
    return 0;
}
</CODE>

The input space is only 16 bits wide, so full coverage is possible.
With larger input spaces, one could use avrtest_[p]rand() or
similar means to randomize the input.

The output is as follows:

$ avrtest_log -mmcu=avr5 -no-log ben.elf -m 100000000 -q

--- Dump # 1:
 Timer T1 "" (65280 rounds):  00ec--00fc
              Instructions        Ticks
    Total:      3765820         5222400
    Mean:            57              80
    Stand.Dev:      0.9             0.0
    Min:             57              80
    Max:             65              80
    Calls (abs) in [   2,   3] was:   2 now:   2
    Calls (rel) in [   0,   1] was:   0 now:   0
    Stack (abs) in [08fb,08f9] was:08fb now:08fb
    Stack (rel) in [   0,   2] was:   0 now:   0

           Min round Max round    Min tag           /   Max tag
    Calls       -all-same-                          /
    Stack       -all-same-                          /
    Instr.         1     65026    -no-tag-          /   -no-tag-
    Ticks       -all-same-                          /

 Timer T2 "" (65280 rounds):  0108--0116
              Instructions        Ticks
    Total:      3569980         4896000
    Mean:            54              75
    Stand.Dev:      0.9             0.0
    Min:             54              75
    Max:             62              75
    Calls (abs) in [   2,   3] was:   2 now:   2
    Calls (rel) in [   0,   1] was:   0 now:   0
    Stack (abs) in [08fb,08f9] was:08fb now:08fb
    Stack (rel) in [   0,   2] was:   0 now:   0

           Min round Max round    Min tag           /   Max tag
    Calls       -all-same-                          /
    Stack       -all-same-                          /
    Instr.         1     65026    -no-tag-          /   -no-tag-
    Ticks       -all-same-                          /

So the new code takes 5 ticks fewer (75 instead of 80).

"Calls" is the (relative or absolute) call depth.
"Stack" is the (relative or absolute) stack usage.

Johann

> Recommendations for a platform to vent such ideas welcome (I know of
> stackoverflow.com).
>
> regards
>
> W. Hospital
>
> --
> Wolfgang Hospital


