Re: [Qemu-devel] [PATCH 01/18] tcg: add support for 128bit vector type


From: Richard Henderson
Subject: Re: [Qemu-devel] [PATCH 01/18] tcg: add support for 128bit vector type
Date: Mon, 23 Jan 2017 10:43:31 -0800
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:45.0) Gecko/20100101 Thunderbird/45.6.0

On 01/23/2017 02:30 AM, Kirill Batuzov wrote:
> Because 4 adds on 4 i32 registers work well only when the size of
> vector elements matches the size of scalar variables we use for
> representation of a vector. add_i16x8 will not be that great if we use
> 4 i32 variables: each will need to be split into two values, processed
> independently and merged back afterwards.

Certainly.  But that's pretty much exactly how they are processed now.  Usually
via a helper function that accepts an i64 input as a pair of i32 arguments.

> Scalar variables lack primitives to work with them as vectors of shorter
> values. This is one of the reasons I added v64 type instead of using i64
> for 64-bit vector operations. And this is the reason I'm so opposed to
> using them to represent vector types if vector registers are not
> supported by host. Handling vector operations with element size that
> does not match representation will be complicated, may require special
> handling for different operations and will produce a lot of if-s in code.

A lot of if's?  I've no idea what you're talking about.

A v64 type makes sense because generally we're going to allocate them to a
different register set than i64.  That said, i64 is perfectly adequate for
implementing add_i8x8:

  t0  = in0 & 0x7f7f7f7f7f7f7f7f
  t1  = in1 & 0x7f7f7f7f7f7f7f7f
  t2  = t0 + t1
  t3  = (in0 ^ in1) & 0x8080808080808080
  out = t2 ^ t3

This is less expensive than addition by pieces if there are at least 4 pieces.
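
For illustration, the masked-add idea can be written as a small C sketch. This is not code from the patch series; the names add_i8x8 and add_i8x8_ref are hypothetical. The sketch clears the top bit of every 8-bit lane in both inputs before the 64-bit add, so no carry can cross a lane boundary, and then restores the top bits with an XOR. The second function is the "addition by pieces" version, kept only as a reference to compare against.

```c
#include <stdint.h>

/* Add eight 8-bit lanes packed in a 64-bit integer, wrapping per lane.
 * Hypothetical helper name; a sketch of the masking trick only. */
static uint64_t add_i8x8(uint64_t in0, uint64_t in1)
{
    const uint64_t lo7 = 0x7f7f7f7f7f7f7f7fULL; /* low 7 bits of each lane */
    const uint64_t hi1 = 0x8080808080808080ULL; /* top bit of each lane */

    /* With both top bits cleared, each lane sum is at most 0xfe,
     * so a carry can reach bit 7 of its own lane but never the next lane. */
    uint64_t sum = (in0 & lo7) + (in1 & lo7);

    /* Bit 7 of each lane is in0[7] ^ in1[7] ^ carry-in; the carry-in is
     * already sitting in bit 7 of sum, so XOR in the two input top bits. */
    return sum ^ ((in0 ^ in1) & hi1);
}

/* Reference: split into bytes, add each, and merge back. */
static uint64_t add_i8x8_ref(uint64_t in0, uint64_t in1)
{
    uint64_t out = 0;
    for (int i = 0; i < 8; i++) {
        uint8_t a = (uint8_t)(in0 >> (8 * i));
        uint8_t b = (uint8_t)(in1 >> (8 * i));
        out |= (uint64_t)(uint8_t)(a + b) << (8 * i);
    }
    return out;
}
```

The packed version costs a fixed handful of ALU operations regardless of lane count, whereas the reference loop pays an extract, add, and deposit per lane.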

> The method I'm proposing can handle any operation regardless of
> representation. This includes handling the situation where the host supports
> vector registers but does not support the required operation (for example
> SSE/AVX does not support multiplication of vectors of 8-bit values).

Not for nothing but it's trivial to expand with punpcklbw, punpckhbw, pmullw,
pand, packuswb.  That said, if an expansion gets too complicated, it's still
better to move it into a helper than expand 16 * (load, op, store).
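
As a rough sketch of that expansion, here is one way the listed instructions could be combined using SSE2 intrinsics in C. The function name mul_i8x16 and the exact ordering are assumptions made for illustration, not something taken from the patch series. The product is truncated to the low 8 bits of each lane, and a scalar fallback is included so the sketch still builds on hosts without SSE2.

```c
#include <stdint.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/* out[i] = low 8 bits of a[i] * b[i], for 16 byte lanes.
 * Hypothetical helper name; a sketch under the assumptions stated above. */
static void mul_i8x16(uint8_t out[16], const uint8_t a[16], const uint8_t b[16])
{
#if defined(__SSE2__)
    __m128i va   = _mm_loadu_si128((const __m128i *)a);
    __m128i vb   = _mm_loadu_si128((const __m128i *)b);
    __m128i zero = _mm_setzero_si128();
    __m128i mask = _mm_set1_epi16(0x00ff);

    /* punpcklbw / punpckhbw: widen bytes to 16-bit lanes (zero-extended) */
    __m128i alo = _mm_unpacklo_epi8(va, zero);
    __m128i ahi = _mm_unpackhi_epi8(va, zero);
    __m128i blo = _mm_unpacklo_epi8(vb, zero);
    __m128i bhi = _mm_unpackhi_epi8(vb, zero);

    /* pmullw: low 16 bits of each 16-bit product */
    __m128i plo = _mm_mullo_epi16(alo, blo);
    __m128i phi = _mm_mullo_epi16(ahi, bhi);

    /* pand: keep only the low byte so that packuswb does not saturate */
    plo = _mm_and_si128(plo, mask);
    phi = _mm_and_si128(phi, mask);

    /* packuswb: narrow the two halves back into 16 bytes */
    _mm_storeu_si128((__m128i *)out, _mm_packus_epi16(plo, phi));
#else
    /* Scalar fallback for hosts without SSE2 */
    for (int i = 0; i < 16; i++) {
        out[i] = (uint8_t)(a[i] * b[i]);
    }
#endif
}
```

Because only the low 8 bits of each product are kept, the same sequence serves for signed and unsigned lanes.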


r~