qemu-devel
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [Qemu-devel] [PATCH] target-arm: use clz32() instead of a for loop


From: Stuart Brady
Subject: Re: [Qemu-devel] [PATCH] target-arm: use clz32() instead of a for loop
Date: Fri, 23 Oct 2009 01:34:17 +0100
User-agent: Mutt/1.5.13 (2006-08-11)

On Thu, Oct 15, 2009 at 11:14:52PM +0200, Aurelien Jarno wrote:
> @@ -394,10 +395,7 @@ uint32_t HELPER(uxtb16)(uint32_t x)
>  
>  uint32_t HELPER(clz)(uint32_t x)
>  {
> -    int count;
> -    for (count = 32; x; count--)
> -        x >>= 1;
> -    return count;
> +    return clz32(x);
>  }
>  
>  int32_t HELPER(sdiv)(int32_t num, int32_t den)

Just a quick note that the implementation of clz, ctz and popcnt is
still listed in the TCG TODO list.  The last time I looked, I noticed
that quite a few architectures have clz/ctz instructions:

   http://lkml.indiana.edu/hypermail/linux/kernel/0601.3/1683.html

For those that don't, I think a combination the following two hacks at
http://graphics.stanford.edu/~seander/bithacks.html could be used:

   'Round up to the next highest power of 2'
   'Counting bits set, in parallel'

With this, it should be possible to implement clz and ctz without too
many operations for both 32-bit and 64-bit integers, without requiring
floats, lookup tables or branches.  Of course, __builtin_clz() might
well do a better job...

BTW, it may be worth pointing out:

   B[4] = 0x0000ffff;
   B[3] = B[4] ^ (B[4] << 8) => 0x00ff00ff
   B[2] = B[3] ^ (B[3] << 4) => 0x0f0f0f0f
   B[1] = B[2] ^ (B[2] << 2) => 0x33333333
   B[0] = B[1] ^ (B[1] << 1) => 0x55555555

In reality, I wonder if five separate loads would be quicker, though.

Cheers,
-- 
Stuart Brady




reply via email to

[Prev in Thread] Current Thread [Next in Thread]