freetype-devel

Re: [ft-devel] FT_MulFix assembly


From: Graham Asher
Subject: Re: [ft-devel] FT_MulFix assembly
Date: Mon, 06 Sep 2010 08:20:12 +0100
User-agent: Thunderbird 2.0.0.24 (Windows/20100228)

Have you done an ARM version? Forgive my inattentiveness if you've already announced one. It just struck me that this sort of optimisation is even more necessary on mobile devices.

Graham


James Cloos wrote:
The final result for amd64 looks like:

  static __inline__ long
  FT_MulFix_x86_64( long  a,
                   long  b )
  {
    register long  result;

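    __asm__ __volatile__ (
      "movq  %1, %%rax\n"         /* rax = a                           */
      "imulq %2\n"                /* rdx:rax = rax * b (signed)        */
      "addq  %%rdx, %%rax\n"      /* add -1 if the product is negative */
      "addq  $0x8000, %%rax\n"    /* add one half for rounding         */
      "sarq  $16, %%rax\n"        /* scale back to 16.16               */
      : "=&a"(result)             /* early clobber: written before b is read */
      : "g"(a), "rm"(b)           /* one-operand imul takes no immediate */
      : "rdx", "cc" );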
    return result;
  }
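
For comparison, the arithmetic performed by the assembly above can be
written in plain C.  This is only a sketch: mulfix_ref is a hypothetical
name, not a FreeType function, and it assumes that the product fits in
64 bits and that >> on a negative signed value is an arithmetic shift,
as it is with GCC and Clang.

```c
#include <stdint.h>

/* Hypothetical reference helper, not part of FreeType.               */
/* Computes a * b / 0x10000 in 16.16 fixed point, rounding ties away  */
/* from zero.  Assumes a * b does not overflow 64 bits.               */
static int64_t
mulfix_ref( int64_t  a,
            int64_t  b )
{
  int64_t  p = a * b;      /* low 64 bits of the product              */
  int64_t  s = p >> 63;    /* -1 if p is negative, else 0; assumes an */
                           /* arithmetic right shift                  */

  /* Adding s before the rounding constant makes negative ties round  */
  /* away from zero instead of toward positive infinity.              */
  return ( p + s + 0x8000 ) >> 16;
}
```

With these assumptions, a product of exactly half a unit rounds away
from zero in both directions: mulfix_ref( 0x8000, 1 ) yields 1 and
mulfix_ref( -0x8000, 1 ) yields -1.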


The use of long, though, requires review.  The C version uses FT_Long
(not FT_Int32 like the other asm versions), but FT_Long is not a #define
or a typedef at the point where the asm versions are located.

That said, using long there on amd64 prevents unnecessary 32<->64 bit
conversions in the resulting code.

The above code has a latency of 1+5+1+1+1 = 10 clocks on an amdfam10 cpu.

The assembly generated by the C code is 45 lines and 158 octets long,
contains six conditional jumps, three each of explicit compares and
tests, and still benchmarks just as fast.  Out-of-order processing
wins out over hand-coded asm. :-/

It *might* make more of a difference on an in-order processor like the
Atom.  But I do not have one to test.

I can still finish a patch, and have collected the info I need to do one
for mips64, too, where I expect it will be more important.  I also expect
that the i386 version could be tidied a bit.

Is the amd64 version desired, given how little benefit it has?

-JimC



