avr-gcc-list

Re: [avr-gcc-list] [FIX] _clz and friends not found (test builtin-bitops


From: Wouter van Gulik
Subject: Re: [avr-gcc-list] [FIX] _clz and friends not found (test builtin-bitops-1)
Date: Tue, 29 Jan 2008 10:22:50 +0100
User-agent: Thunderbird 2.0.0.9 (Windows/20071031)

Dmitry K. wrote:
> On Friday 25 January 2008 22:35, Wouter van Gulik wrote:
> > __clzqi2:
> >     clr     r_count         ; load with 0
> >     com     r_count         ; invert (load with -1) + set carry
> > __clzqi2_loop:
> >     rol     r_arg1L         ; rotate left through carry
> >     inc     r_count         ; inc does not touch the carry flag
> >     brcc    __clzqi2_loop   ; branch while carry is clear
>
> That is splendid!


Thanks. The original idea comes from the person who posted in the GCC bug report for clz.
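
For readers who do not read AVR assembly, here is a rough C model of the loop quoted above. It is only a sketch: the name clz8_model and the explicit carry variable are illustrative and not part of libgcc. It shows why preloading the carry bounds the loop: the set carry is rotated into bit 0 and acts as a sentinel, so an input of 0 stops after nine rotations and returns 8.

```c
#include <stdint.h>

/* C model of the carry-sentinel loop (clz8_model is a hypothetical name). */
static uint8_t clz8_model(uint8_t arg)
{
    uint8_t count = 0xFF;   /* clr + com: count = -1 */
    unsigned carry = 1;     /* com also sets the carry flag */

    do {
        unsigned msb = (arg >> 7) & 1u;       /* bit that rol shifts out */
        arg = (uint8_t)((arg << 1) | carry);  /* rol: old carry enters bit 0 */
        carry = msb;                          /* shifted-out bit becomes carry */
        count++;                              /* inc leaves the carry alone */
    } while (!carry);                         /* brcc: repeat while carry clear */

    return count;
}
```

With this model, 0x80 gives 0, 0x01 gives 7 and 0x00 gives 8.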

> After a superficial view:
> . A short rcall/rjmp is not safe with an unknown (big) library
> for intermodule linking. The same goes for a conditional branch.

Hmm, this is already the case for __mulqihi3 and __umulqihi3: they do an rjmp to __mulhi3. So I thought it was safe. Also note that avr-libc's libm uses this exclusively. Maybe this is fixed by linker relaxation?

On the conditional branch I totally agree. That is not a smart thing to do.
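
To make the range concern concrete: rjmp/rcall encode a 12-bit signed word offset, and conditional branches such as brcc a 7-bit one. A minimal sketch of that check follows; the helper names fits_rjmp and fits_branch are hypothetical, and the offset k is counted in 16-bit instruction words relative to the instruction after the jump.

```c
#include <stdbool.h>
#include <stdint.h>

/* rjmp/rcall: 12-bit signed offset, -2048..2047 words. */
static bool fits_rjmp(int32_t k)
{
    return k >= -2048 && k <= 2047;
}

/* brcc and the other conditional branches: 7-bit signed offset, -64..63 words. */
static bool fits_branch(int32_t k)
{
    return k >= -64 && k <= 63;
}
```

A target in another module can land anywhere in flash, so neither check is guaranteed to hold at link time unless the linker places or relaxes the code accordingly.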

> . Are these functions intended for math functions? If so,
> strong size optimization is not the best solution (IMHO).
> Adding a few words may give a speedup of several times for
> some 32/16-bit functions.


Yes, they are intended for math, I guess. But I think they will never be used. CLZ/CTZ is only used when using floats without linking against avr-libc's math library (libm). That gives poor results anyway, so users will probably switch to libm quickly.

I mainly implemented CLZ/popcount to fix the huge use of RAM by these functions.

CLZ is already a little optimized for speed. I could optimize further: have the 16-bit version check whether the high byte is zero and then decide which byte to process, working on a byte basis. But then I would also need to link in the QI implementation, using even more flash. The QI loop takes 4 cycles per iteration and the HI loop 5, excluding the penalty for branching etc. So for some values the effect on speed could even be negative.
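
In C terms, the byte-based variant described above would look roughly like the sketch below. The names clz8 and clz16_bytewise are hypothetical, and this is plain C rather than the libgcc assembly; it only illustrates the high-byte check and the dependency on an 8-bit routine.

```c
#include <stdint.h>

/* Simple 8-bit count of leading zeros; returns 8 for an input of 0. */
static uint8_t clz8(uint8_t x)
{
    uint8_t n = 0;
    while (n < 8 && !(x & 0x80)) {
        x = (uint8_t)(x << 1);
        n++;
    }
    return n;
}

/* 16-bit version: pick the byte to scan, then reuse the 8-bit routine. */
static uint8_t clz16_bytewise(uint16_t x)
{
    uint8_t hi = (uint8_t)(x >> 8);
    if (hi != 0)
        return clz8(hi);                     /* answer lies in the high byte */
    return (uint8_t)(8 + clz8((uint8_t)x));  /* high byte is all zeros */
}
```

The extra compare and call are the overhead mentioned above: for values whose first set bit is near the top of a byte, the plain 16-bit loop may finish sooner.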

There is no bug report from a user who is missing the other functionality, so I thought it was more a matter of having it so that gcc passes the testsuite than of having a fast, mean and lean implementation.

Wouter



