Re: [RFC PATCH v1 03/43] accel/tcg: Add gvec size changing operations


From: Richard Henderson
Subject: Re: [RFC PATCH v1 03/43] accel/tcg: Add gvec size changing operations
Date: Tue, 3 Dec 2024 12:57:57 -0600
User-agent: Mozilla Thunderbird

On 12/3/24 12:08, Anton Johansson wrote:
> On 22/11/24, Richard Henderson wrote:
>> On 11/20/24 19:49, Anton Johansson wrote:
>>> Adds new functions to the gvec API for truncating, sign-extending, or
>>> zero-extending vector elements.  Currently implemented as helper functions,
>>> these may be mapped onto host vector instructions in the future.
>>>
>>> For the time being, this allows translation of more complicated vector
>>> instructions by helper-to-tcg.
>>>
>>> Signed-off-by: Anton Johansson <anjo@rev.ng>
>>> ---
>>>  accel/tcg/tcg-runtime-gvec.c     | 41 +++++++++++++++++
>>>  accel/tcg/tcg-runtime.h          | 22 +++++++++
>>>  include/tcg/tcg-op-gvec-common.h | 18 ++++++++
>>>  tcg/tcg-op-gvec.c                | 78 ++++++++++++++++++++++++++++++++
>>>  4 files changed, 159 insertions(+)

>>> diff --git a/accel/tcg/tcg-runtime-gvec.c b/accel/tcg/tcg-runtime-gvec.c
>>> index afca89baa1..685c991e6a 100644
>>> --- a/accel/tcg/tcg-runtime-gvec.c
>>> +++ b/accel/tcg/tcg-runtime-gvec.c
>>> @@ -1569,3 +1569,44 @@ void HELPER(gvec_bitsel)(void *d, void *a, void *b, void *c, uint32_t desc)
>>>      }
>>>      clear_high(d, oprsz, desc);
>>>  }
>>> +
>>> +#define DO_SZ_OP1(NAME, DSTTY, SRCTY)                                      \
>>> +void HELPER(NAME)(void *d, void *a, uint32_t desc)                         \
>>> +{                                                                          \
>>> +    intptr_t oprsz = simd_oprsz(desc);                                     \
>>> +    intptr_t elsz = oprsz/sizeof(DSTTY);                                   \
>>> +    intptr_t i;                                                            \
>>> +                                                                           \
>>> +    for (i = 0; i < elsz; ++i) {                                           \
>>> +        SRCTY aa = *((SRCTY *) a + i);                                     \
>>> +        *((DSTTY *) d + i) = aa;                                           \
>>> +    }                                                                      \
>>> +    clear_high(d, oprsz, desc);                                            \

>> This formulation is not valid.
>>
>> (1) Generic forms must *always* operate strictly on columns.  This
>> formulation is either expanding a narrow vector to a wider vector or
>> compressing a wider vector to a narrow vector.
>>
>> (2) This takes no care for byte ordering of the data between columns.  This
>> is where sticking strictly to columns helps, in that we can assume that data
>> is host-endian *within the column*, but we cannot assume anything about the
>> element indexing of ptr + i.
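
(For context: the existing helpers in tcg-runtime-gvec.c are strictly
column-wise in this sense.  For instance, gvec_add64 looks roughly like the
sketch below: element i of d depends only on element i of a and b, so host
byte order within each uint64_t column never escapes the column.)

    void HELPER(gvec_add64)(void *d, void *a, void *b, uint32_t desc)
    {
        intptr_t oprsz = simd_oprsz(desc);
        intptr_t i;

        /* One column per iteration: d[i] depends only on a[i] and b[i]. */
        for (i = 0; i < oprsz; i += sizeof(uint64_t)) {
            *(uint64_t *)(d + i) = *(uint64_t *)(a + i) + *(uint64_t *)(b + i);
        }
        clear_high(d, oprsz, desc);
    }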

> Concerning (1) and (2), is this a limitation imposed on generic vector
> ops to simplify mapping to host vector instructions, where padding/alignment
> of elements might differ?  From my understanding, the helper above should be
> fine, since we can assume contiguous elements?

This is a limitation imposed on generic vector ops, because different guest architectures (target/arch/) represent their vectors in different ways.

For instance, Arm and RISC-V chunk the vector into host-endian uint64_t, with the chunks indexed little-endian. But PPC puts the entire 128-bit vector in host-endian bit ordering, so the ordering of the uint64_t chunks is itself host-endian.

On a big-endian host, ptr+1 may be addressing element i-1 or i-7 instead of i+1.
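
(This is the problem the per-target H index-adjustment macros solve.  Roughly
as in target/arm, and the source of the i-1 / i-7 arithmetic above:)

    /* Each uint64_t chunk holds its elements in host byte order, so flat
     * element indices are xor-adjusted on big-endian hosts. */
    #if HOST_BIG_ENDIAN
    #define H1(x)  ((x) ^ 7)    /* 8-bit elements: byte i lives at i ^ 7 */
    #define H2(x)  ((x) ^ 3)    /* 16-bit elements */
    #define H4(x)  ((x) ^ 1)    /* 32-bit elements */
    #else
    #define H1(x)  (x)
    #define H2(x)  (x)
    #define H4(x)  (x)
    #endif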

> I see.  I don't think we can make this work for Hexagon vector ops.  As an
> example, consider V6_vadduwsat, which performs an unsigned saturating add
> of 32-bit elements.  Currently we emit:
>     void emit_V6_vadduwsat(intptr_t vec2, intptr_t vec7, intptr_t vec6) {
>         VectorMem mem = {0};
>         intptr_t vec5 = temp_new_gvec(&mem, 256);
>         tcg_gen_gvec_zext(MO_64, MO_32, vec5, vec7, 256, 128, 256);
>
>         intptr_t vec1 = temp_new_gvec(&mem, 256);
>         tcg_gen_gvec_zext(MO_64, MO_32, vec1, vec6, 256, 128, 256);
>
>         tcg_gen_gvec_add(MO_64, vec1, vec1, vec5, 256, 256);
>
>         intptr_t vec3 = temp_new_gvec(&mem, 256);
>         tcg_gen_gvec_dup_imm(MO_64, vec3, 256, 256, 4294967295ull);
>
>         tcg_gen_gvec_umin(MO_64, vec1, vec1, vec3, 256, 256);
>
>         tcg_gen_gvec_trunc(MO_32, MO_64, vec2, vec1, 128, 256, 128);
>     }

> so we really do rely on the size-changing property of zext here: the input
> vectors are 128 bytes, and we expand them to 256 bytes.  We could expand all
> vector operations within the instruction to the largest vector size, but we
> would still need to zext and trunc to the destination and source registers
> anyway.

Yes, well, this is the output of LLVM though, yes?

Did you forget to describe TCG's native saturating operations to the compiler? tcg_gen_gvec_usadd performs exactly this operation.
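
(A sketch of what the emitter above could then collapse to, reusing the vec
names and scaffolding from the snippet; not tested against helper-to-tcg:)

    void emit_V6_vadduwsat(intptr_t vec2, intptr_t vec7, intptr_t vec6) {
        /* Unsigned saturating add on 32-bit columns; the 128-byte vectors
         * never change size and no temporaries are needed. */
        tcg_gen_gvec_usadd(MO_32, vec2, vec7, vec6, 128, 128);
    }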

And if you'd like to improve LLVM: usadd(a, b) equals umin(a, ~b) + b, which
is fewer operations without having to change vector sizes.  Similarly for
unsigned saturating subtract: ussub(a, b) equals umax(a, b) - b.
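
(Both identities are easy to sanity-check on scalar lanes; a quick standalone
test, not from the thread:)

    #include <assert.h>
    #include <stdint.h>

    static uint32_t usadd32(uint32_t a, uint32_t b)
    {
        uint32_t s = a + b;
        return s < a ? UINT32_MAX : s;          /* saturate on carry-out */
    }

    static uint32_t ussub32(uint32_t a, uint32_t b)
    {
        return a > b ? a - b : 0;               /* saturate at zero */
    }

    int main(void)
    {
        static const uint32_t v[] = { 0, 1, 0x7fffffff, 0xfffffffe, 0xffffffff };
        for (int i = 0; i < 5; i++) {
            for (int j = 0; j < 5; j++) {
                uint32_t a = v[i], b = v[j];
                uint32_t lo = a < ~b ? a : ~b;  /* umin(a, ~b) */
                uint32_t hi = a > b ? a : b;    /* umax(a, b)  */
                assert(usadd32(a, b) == lo + b);
                assert(ussub32(a, b) == hi - b);
            }
        }
        return 0;
    }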


r~


