Re: [RFC PATCH v1 03/43] accel/tcg: Add gvec size changing operations
From: Anton Johansson
Subject: Re: [RFC PATCH v1 03/43] accel/tcg: Add gvec size changing operations
Date: Tue, 3 Dec 2024 21:15:23 +0100
On 03/12/24, Richard Henderson wrote:
> On 12/3/24 12:08, Anton Johansson wrote:
> > On 22/11/24, Richard Henderson wrote:
> > > On 11/20/24 19:49, Anton Johansson wrote:
> > > > Adds new functions to the gvec API for truncating, sign- or zero
> > > > extending vector elements. Currently implemented as helper functions,
> > > > these may be mapped onto host vector instructions in the future.
> > > >
> > > > For the time being, allows translation of more complicated vector
> > > > instructions by helper-to-tcg.
> > > >
> > > > Signed-off-by: Anton Johansson <anjo@rev.ng>
> > > > ---
> > > > accel/tcg/tcg-runtime-gvec.c | 41 +++++++++++++++++
> > > > accel/tcg/tcg-runtime.h | 22 +++++++++
> > > > include/tcg/tcg-op-gvec-common.h | 18 ++++++++
> > > > tcg/tcg-op-gvec.c | 78 ++++++++++++++++++++++++++++++++
> > > > 4 files changed, 159 insertions(+)
> > > >
> > > > diff --git a/accel/tcg/tcg-runtime-gvec.c b/accel/tcg/tcg-runtime-gvec.c
> > > > index afca89baa1..685c991e6a 100644
> > > > --- a/accel/tcg/tcg-runtime-gvec.c
> > > > +++ b/accel/tcg/tcg-runtime-gvec.c
> > > > @@ -1569,3 +1569,44 @@ void HELPER(gvec_bitsel)(void *d, void *a, void *b, void *c, uint32_t desc)
> > > > }
> > > > clear_high(d, oprsz, desc);
> > > > }
> > > > +
> > > > +#define DO_SZ_OP1(NAME, DSTTY, SRCTY)                            \
> > > > +void HELPER(NAME)(void *d, void *a, uint32_t desc)               \
> > > > +{                                                                \
> > > > +    intptr_t oprsz = simd_oprsz(desc);                           \
> > > > +    intptr_t elsz = oprsz / sizeof(DSTTY);                       \
> > > > +    intptr_t i;                                                  \
> > > > +                                                                 \
> > > > +    for (i = 0; i < elsz; ++i) {                                 \
> > > > +        SRCTY aa = *((SRCTY *) a + i);                           \
> > > > +        *((DSTTY *) d + i) = aa;                                 \
> > > > +    }                                                            \
> > > > +    clear_high(d, oprsz, desc);                                  \
> > >
> > > This formulation is not valid.
> > >
> > > (1) Generic forms must *always* operate strictly on columns. This
> > > formulation is either expanding a narrow vector to a wider vector or
> > > compressing a wider vector to a narrow vector.
> > >
> > > (2) This takes no care for byte ordering of the data between columns. This
> > > is where sticking strictly to columns helps, in that we can assume that data
> > > is host-endian *within the column*, but we cannot assume anything about the
> > > element indexing of ptr + i.
> >
> > Concerning (1) and (2), is this a limitation imposed on generic vector
> > ops to simplify mapping to host vector instructions, where
> > padding/alignment of elements might differ? From my understanding, the
> > helper above should be fine since we can assume contiguous elements?
>
> This is a limitation imposed on generic vector ops, because different
> target/arch/ represent their vectors in different ways.
>
> For instance, Arm and RISC-V chunk the vector in to host-endian uint64_t,
> with the chunks indexed little-endian. But PPC puts the entire 128-bit
> vector in host-endian bit ordering, so the uint64_t chunks are host-endian.
>
> On a big-endian host, ptr+1 may be addressing element i-1 or i-7 instead of
> i+1.
Ah, I see the problem now, thanks for the explanation :)
> > I see, I don't think we can make this work for Hexagon vector ops; as
> > an example, consider V6_vadduwsat, which performs an unsigned saturated
> > add of 32-bit elements. Currently we emit
> > void emit_V6_vadduwsat(intptr_t vec2, intptr_t vec7, intptr_t vec6) {
> >     VectorMem mem = {0};
> >     intptr_t vec5 = temp_new_gvec(&mem, 256);
> >     tcg_gen_gvec_zext(MO_64, MO_32, vec5, vec7, 256, 128, 256);
> >
> >     intptr_t vec1 = temp_new_gvec(&mem, 256);
> >     tcg_gen_gvec_zext(MO_64, MO_32, vec1, vec6, 256, 128, 256);
> >
> >     tcg_gen_gvec_add(MO_64, vec1, vec1, vec5, 256, 256);
> >
> >     intptr_t vec3 = temp_new_gvec(&mem, 256);
> >     tcg_gen_gvec_dup_imm(MO_64, vec3, 256, 256, 4294967295ull);
> >
> >     tcg_gen_gvec_umin(MO_64, vec1, vec1, vec3, 256, 256);
> >
> >     tcg_gen_gvec_trunc(MO_32, MO_64, vec2, vec1, 128, 256, 128);
> > }
> >
> > so we really do rely on the size-changing property of zext here, the
> > input vectors are 128-byte and we expand them to 256-byte. We could
> > expand vector operations within the instruction to the largest vector
> > size, but would need to zext and trunc to destination and source
> > registers anyway.
> Yes, well, this is the output of llvm though, yes?
Yes
> Did you forget to describe TCG's native saturating operations to the
> compiler? tcg_gen_gvec_usadd performs exactly this operation.
>
> And if you'd like to improve llvm, usadd(a, b) equals umin(a, ~b) + b.
> Fewer operations without having to change vector sizes.
> Similarly for unsigned saturating subtract: ussub(a, b) equals umax(a, b) - b.
In this case LLVM wasn't able to optimize it to an llvm.uadd.sat
intrinsic; if it had, we would have emitted tcg_gen_gvec_usadd, I
believe. We can manually optimize the above pattern to an llvm.uadd.sat
to avoid the extra size changes.
This might be fixed in future LLVM versions, but otherwise it seems like a
reasonable change to push upstream.
The point is that we have a lot of Hexagon instructions where size
changes are probably unavoidable. Another example is V6_vshuffh, which
takes a <16 x i16> vector and shuffles the upper <8 x i16> into the upper
16 bits of an <8 x i32> vector:
void emit_V6_vshuffh(intptr_t vec3, intptr_t vec7) {
    VectorMem mem = {0};
    intptr_t vec2 = temp_new_gvec(&mem, 128);
    tcg_gen_gvec_zext(MO_32, MO_16, vec2, vec7, 128, 64, 128);
    intptr_t vec0 = temp_new_gvec(&mem, 128);
    tcg_gen_gvec_zext(MO_32, MO_16, vec0, (vec7 + 64ull), 128, 64, 128);
    intptr_t vec1 = temp_new_gvec(&mem, 128);
    tcg_gen_gvec_shli(MO_32, vec1, vec0, 16, 128, 128);
    tcg_gen_gvec_or(MO_32, vec3, vec1, vec2, 128, 128);
}
To avoid bloating the email with examples, you can find three more here:
https://pad.rev.ng/11IvAKhiRy2cPwC7MX9nXA
Maybe we should rely on the target defining size-changing operations if
they are needed?
//Anton