From: | Wouter van Gulik |
Subject: | Re: [avr-gcc-list] [FIX] _clz and friends not found (test builtin-bitops-1) |
Date: | Tue, 29 Jan 2008 10:22:50 +0100 |
User-agent: | Thunderbird 2.0.0.9 (Windows/20071031) |
Dmitry K. wrote:
> On Friday 25 January 2008 22:35, Wouter van Gulik wrote:
>> __clzqi2:
>>     clr  r_count         ; load with 0
>>     com  r_count         ; invert (load with -1) + set carry
>> __clzqi2_loop:
>>     rol  r_arg1L         ; rotate through carry
>>     inc  r_count         ; carry not touched by inc
>>     brcc __clzqi2_loop   ; branch on no carry
>
> That is splendid!
Thanks, the original idea is from the guy posting in the gcc bug report for clz.
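For readers who do not read AVR assembly, the loop quoted above can be modelled roughly in C. This is only a sketch of the idea, not the libgcc source: the function name `clz8_model` and the explicit `carry` variable are made up for illustration. The carry set by `com` is rotated into bit 0 and works as a sentinel, so an argument of 0 ends the loop with a count of 8.

```c
#include <stdint.h>

/* Illustrative C model of the quoted __clzqi2 loop (hypothetical name).
 * Returns the number of leading zero bits of an 8-bit value, 8 for 0. */
static uint8_t clz8_model(uint8_t arg)
{
    uint8_t count = 0xFF;   /* clr + com: count starts at -1 */
    unsigned carry = 1;     /* com also sets the carry flag */

    do {
        unsigned msb = (arg >> 7) & 1u;        /* bit that rol moves into carry */
        arg = (uint8_t)((arg << 1) | carry);   /* rol: shift left, old carry into bit 0 */
        carry = msb;
        count++;                               /* inc: leaves the carry untouched */
    } while (!carry);                          /* brcc: loop while carry is clear */

    return count;
}
```

The first pass shifts the sentinel in and brings the count to 0, so a value with bit 7 set returns 0 after a single iteration.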
> After a superficial look:
> . A short rcall/rjmp is not safe for intermodule linking against an unknown (big) library. Neither is a conditional branch.
Hmm, this is already the case for __mulqihi3 and __umulqihi3: they do an rjmp to __mulhi3. So I thought it was safe. Also note that avr-libc's libm uses this exclusively. Maybe this is fixed by linker relaxation?
On the conditional branch I totally agree. That is not a smart thing to do.
> . Are these functions intended for math functions? If so, strong size optimization is not the best solution (IMHO). Adding a few words may speed up some 32/16-bit functions several times over.
Yes, they are intended for math, I guess. But I think they will never be used: CLZ/CTZ is only used when using floats without linking against avr-libc's math library (libm). That gives poor results anyway, so users will probably switch to libm quickly.
I mainly implemented CLZ/popcount to fix the huge use of RAM by these functions.
CLZ is already a little optimized for speed. I could optimize further: have the 16-bit version check whether the high byte is zero and then decide which byte to process, working on a byte basis. But then I would also need to link in the QI implementation, using even more flash. The QI loop is 4 cycles per iteration, the HI loop is 5, excluding the penalty for branching etc. So we could have a negative effect on speed for some values.
There is no bug report from a user missing the other functionality, so I thought it was more a matter of having it so that gcc passes the testsuite than of having a fast, mean and lean implementation.
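The byte-wise variant described above might look roughly like this in C. It is a sketch under assumptions, not the proposed patch: `clz8` is a hypothetical stand-in for the QI routine and `clz16_bytewise` is an invented name.

```c
#include <stdint.h>

/* Hypothetical stand-in for the 8-bit (QI) routine: 8 for an input of 0. */
static uint8_t clz8(uint8_t x)
{
    uint8_t n = 0;
    /* Walk a single-bit mask down from bit 7 until a set bit is found. */
    for (uint8_t mask = 0x80; mask != 0 && !(x & mask); mask >>= 1)
        n++;
    return n;
}

/* 16-bit count: test the high byte first, then finish on one byte only. */
static uint8_t clz16_bytewise(uint16_t x)
{
    uint8_t hi = (uint8_t)(x >> 8);
    if (hi != 0)
        return clz8(hi);                       /* answer lies in the high byte */
    return (uint8_t)(8 + clz8((uint8_t)x));    /* high byte is all zeros */
}
```

The extra test and call are the cost mentioned above: when the first set bit sits near the top of the high byte, this overhead can outweigh the cheaper per-bit loop.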
Wouter