qemu-devel
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [Qemu-devel] [PATCH RFC 0/7] Translate guest vector operations to ho


From: Alex Bennée
Subject: Re: [Qemu-devel] [PATCH RFC 0/7] Translate guest vector operations to host vector operations
Date: Thu, 16 Oct 2014 11:03:20 +0100

Kirill Batuzov <address@hidden> writes:

>> (4) Consider supporting generic vector operations in the TCG?
>
> I gave it a go and was quite happy with the result. I have implemented the 
> add_i32x4
> opcode which is addition of 128-bit vectors composed of four 32-bit integers
> and used it to translate NEON vadd.i32 to SSE paddd instruction. I used ARM 
> for
> my guest because I'm familiar with this architecture and it is different from
> my host.
>
> I got a 3x speedup on my testcase:
<snip>
> OUT: [size=196]
> 0x60442450:  mov    -0x4(%r14),%ebp
> 0x60442454:  test   %ebp,%ebp
> 0x60442456:  jne    0x60442505
> 0x6044245c:  movdqu 0x658(%r14),%xmm0
> 0x60442465:  movdqu 0x668(%r14),%xmm1
> 0x6044246e:  paddd  %xmm1,%xmm0
> 0x60442472:  paddd  %xmm1,%xmm0
> 0x60442476:  paddd  %xmm1,%xmm0
> 0x6044247a:  paddd  %xmm1,%xmm0
> 0x6044247e:  movdqu %xmm0,0x658(%r14)
> <...>

It certainly looks promising although as I suspect you know add is a
pretty easy target ;-)

>
>> But for target-alpha, there's one vector comparison operation that appears in
>> every guest string operation, and is used heavily enough that it's in the top
>> 10 functions in the profile: cmpbge (compare bytes greater or equal).
>
> cmpbge can be translated as follows:
>
> cmpge_i8x8      tmp0, arg1, arg2
> select_msb_i8x8 res, tmp0
>
> where cmpge is "compare grater or equal" with following semantic:
> res[i] = <111...11> if arg1[i] >= arg2[i]
> res[i] = <000...00> if arg1[i] <  arg2[i]
> There is such operation in NEON. In SSE we can emulate it with PCMPEQB, 
> PCMPGTB
> and POR.
>
> select_msb is "select most significant bit". SSE instruction PMOVMSKB.
>
>> While making helper functions faster is good I've wondered if they is
>> enough genericsm across the various SIMD/vector operations we could add
>> add TCG ops to translate them? The ops could fall back to generic helper
>> functions using the GCC instrinsics if we know there is no decent
>> back-end support for them?
>
> From Valgrind experience there are enough genericism. Valgrind can translate
> SSE, AltiVec and NEON instructions to vector opcodes. Most of the opcodes are
> reused between instruction sets.

Doesn't Valgrind have the advantage of same-arch->same-arch (I've not
looked at it's generated code in detail though).

> But keep in mind - there are a lot of vector opcodes. Much much more than
> scalar ones. You can see full list in Valgrind sources
> (VEX/pub/libvex_ir.h).

I think we could only approach this is in a piecemeal way guided by
performance bottlenecks when we find them.

> We can reduce the amount of opcodes by converting vector element size from 
> part
> of an opcode to a constant argument. But we will lose some flexibility offered
> by the TARGET_HAS_opcode macro when target has support for some sizes but not 
> for
> others. For example SSE has vector minimum for sizes i8x16, i16x8, i32x4 but
> does not have one for size i64x2. 
>
> Some implementation details and concerns.
>
> The most problematic issue was the fact that with vector registers we have one
> entity that can be accessed as both global variable and memory location. I
> solved it by introducing the sync_temp opcode that instructs register 
> allocator to
> save global variable to its memory location if it is on the register. If a
> variable is not on a register or memory is already coherent - no store is 
> issued,
> so performance penalty for it is minimal. Still this approach has a serious
> drawback: we need to generate sync_temp explicitly. But I do not know any 
> better
> way to achieve consistency.

I'm not sure I follow. I thought we only needed the memory access when
the backend can't support the vector width operations so shouldn't have
stuff in the vector registers?

> Note that as of this RFC I have not finished conversion of ARM guest so mixing
> NEON with VFP code can cause a miscompile.
>
> The second problem is that a backend may or may not support vector 
> operations. We
> do not want each frontend to check it on every operation. I created a wrapper 
> that
> generates vector opcode if it is supported or generates emulation code.
>
> For add_i32x4 emulation code is generated inline. I tried to make it a helper
> but got a very significant performance loss (5x slowdown). I'm not sure about
> the cause but I suspect that memory was a bottleneck and extra stores needed
> by calling conventions mattered a lot.

So the generic helper was more API heavy than the existing NEON helpers?
>
> The existing constraints are good enough to express that vector registers and
> general purpose registers are different and can not be used instead of each
> other.
>
> One unsolved problem is global aliasing. With general purpose registers we 
> have
> no aliasing between globals. The only example I know where registers can alias
> is the x86 ah/ax/eax/rax case. They are handled as one global. With vector
> registers we have NEON where an 128-bit Q register consists of two 64-bit
> D registers each consisting of two 32-bit S registers. I think I'll need
> to add alias list to each global listing every other global it can clobber and
> then iterate over it in the optimizer. Fortunately this list will be static 
> and not
> very long.
>
> Why I think all this is worth doing:
>
> (1) Performance. 200% speedup is a lot. My test was specifically crafted and 
> real
>     life applications may not have that much vector operations on average, but
>     there is a specific class of applications where it will matter a lot - 
> media
>     processing applications like ffmpeg.
>
> (2) Some unification of common operations. Right now every target reimplements
>     common vector operations (like vector add/sub/mul/min/compare etc.). We 
> can
>     do it once in the common TCG code.
>
> Still there are some cons I mentioned earlier. The need to support a lot of
> opcodes is the most significant in the long run I think. So before I commit my
> time to conversion of more operations I'd like to hear your opinions if this
> approach is acceptable and worth spending efforts.

Overall I'm pretty keen to explore this further. If we can get the
backend interface right and make it an easier proposition to tcg-up
various vector operations when bottle-necks arise it will be a big win.

A lot will depend on where those bottle-necks are though. If for example
the media codecs all use very ARCH specific special sauce instructions
we might never claw back that much.

I'll have a look through the patches and comment there when I've gotten
my head round the back-end issues.

Thanks for coding this up ;-)

-- 
Alex Bennée



reply via email to

[Prev in Thread] Current Thread [Next in Thread]