From: Peter Maydell
Subject: Re: [PATCH v3 26/51] target/arm: Implement SME LD1, ST1
Date: Thu, 23 Jun 2022 12:41:23 +0100

On Mon, 20 Jun 2022 at 19:13, Richard Henderson
<richard.henderson@linaro.org> wrote:
>
> We cannot reuse the SVE functions for LD[1-4] and ST[1-4],
> because those functions accept only a Zreg register number.
> For SME, we want to pass a pointer into ZA storage.
>
> Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
> ---
>  target/arm/helper-sme.h    |  82 +++++
>  target/arm/sme.decode      |   9 +
>  target/arm/sme_helper.c    | 615 +++++++++++++++++++++++++++++++++++++
>  target/arm/translate-sme.c |  69 +++++
>  4 files changed, 775 insertions(+)

[...]

> +/*
> + * FIXME: The ARMVectorReg elements are stored in host-endian 64-bit units.
> + * We do not have a defined ordering of the 64-bit units for host-endian
> + * 128-bit quantities.  For now, just leave the host words in little-endian
> + * order and hope for the best.
> + */

I don't understand this comment. The architecture ought to specify
the order in which the two halves of a 128-bit quantity go in the
ZArray storage. If we can't guarantee that a host int128_t can be
handled in a way that does the right thing, then we just can't
use int128_t.

> +#define DO_LDQ(HNAME, VNAME, BE, HOST, TLB)                                 \
> +static inline void HNAME##_host(void *za, intptr_t off, void *host)         \
> +{                                                                           \
> +    uint64_t val0 = HOST(host), val1 = HOST(host + 8);                      \
> +    uint64_t *ptr = za + off;                                               \
> +    ptr[0] = BE ? val1 : val0, ptr[1] = BE ? val0 : val1;                   \
> +}                                                                           \
> +static inline void VNAME##_v_host(void *za, intptr_t off, void *host)       \
> +{                                                                           \
> +    HNAME##_host(za, off * sizeof(ARMVectorReg), host);                     \
> +}                                                                           \
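
As a concrete illustration of what BE selects here: an instantiation
along the lines of DO_LDQ(sve_ld1qq_be, sme_ld1q_be, 1, ldq_be_p,
cpu_ldq_be_data_ra), with the instantiation names assumed from the
usual SVE naming convention rather than taken from the patch, would
expand to roughly:

    static inline void sve_ld1qq_be_host(void *za, intptr_t off, void *host)
    {
        /* Read the two 64-bit halves of the 128-bit element as big-endian. */
        uint64_t val0 = ldq_be_p(host), val1 = ldq_be_p(host + 8);
        uint64_t *ptr = za + off;
        /* BE == 1: swap the halves, so the 64-bit units always sit in
         * little-endian order in ZA storage, per the FIXME above. */
        ptr[0] = val1, ptr[1] = val0;
    }
    static inline void sme_ld1q_be_v_host(void *za, intptr_t off, void *host)
    {
        /* Vertical slices step by whole ARMVectorReg rows, hence the scaling. */
        sve_ld1qq_be_host(za, off * sizeof(ARMVectorReg), host);
    }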

> +/*
> + * Common helper for all contiguous predicated loads.
> + */
> +
> +static inline QEMU_ALWAYS_INLINE
> +void sme_ld1(CPUARMState *env, void *za, uint64_t *vg,
> +             const target_ulong addr, uint32_t desc, const uintptr_t ra,
> +             const int esz, uint32_t mtedesc, bool vertical,
> +             sve_ldst1_host_fn *host_fn,
> +             sve_ldst1_tlb_fn *tlb_fn,
> +             ClearFn *clr_fn,
> +             CopyFn *cpy_fn)

We now have several rather long functions that are
pretty complicated and pretty similar, handling the various
SVE and SME loads and stores. Is there really no hope for
sharing code?
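
For orientation, a gloss of the parameters, inferred from the names
and their SVE counterparts rather than spelled out in the patch:

    /*
     * env:      CPU state.
     * za:       pointer into ZA storage for the selected tile slice.
     * vg:       the governing predicate register.
     * addr:     guest virtual address of the first element.
     * desc:     simd_desc()-encoded operation descriptor.
     * ra:       host return address, for unwinding on a fault.
     * esz:      log2 of the element size in bytes.
     * mtedesc:  MTE check descriptor (zero when MTE is inactive).
     * vertical: true for a vertical (column) tile slice.
     * host_fn, tlb_fn: per-element access via host memory or the
     *                  cpu_ldst slow path, as in the SVE helpers.
     * clr_fn, cpy_fn:  ZA-storage clear/copy callbacks, presumably
     *                  needed because vertical slices are not
     *                  contiguous in ZA storage.
     */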


> diff --git a/target/arm/translate-sme.c b/target/arm/translate-sme.c
> index d2a7232491..978af74d1d 100644
> --- a/target/arm/translate-sme.c
> +++ b/target/arm/translate-sme.c
> @@ -151,3 +151,72 @@ static bool trans_MOVA(DisasContext *s, arg_MOVA *a)
>
>      return true;
>  }
> +
> +static bool trans_LDST1(DisasContext *s, arg_LDST1 *a)
> +{
> +    typedef void GenLdSt1(TCGv_env, TCGv_ptr, TCGv_ptr, TCGv, TCGv_i32);
> +
> +    /*
> +     * Indexed by [esz][be][v][mte][st], which is (except for load/store)
> +     * also the order in which the elements appear in the function names,
> +     * and so how we must concatenate the pieces.
> +     */
> +
> +#define FN_LS(F)     { gen_helper_sme_ld1##F, gen_helper_sme_st1##F }
> +#define FN_MTE(F)    { FN_LS(F), FN_LS(F##_mte) }
> +#define FN_HV(F)     { FN_MTE(F##_h), FN_MTE(F##_v) }
> +#define FN_END(L, B) { FN_HV(L), FN_HV(B) }
> +
> +    static GenLdSt1 * const fns[5][2][2][2][2] = {
> +        FN_END(b, b),
> +        FN_END(h_le, h_be),
> +        FN_END(s_le, s_be),
> +        FN_END(d_le, d_be),
> +        FN_END(q_le, q_be),
> +    };
> +
> +#undef FN_LS
> +#undef FN_MTE
> +#undef FN_HV
> +#undef FN_END
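
Working one entry through the macros, for esz == 1 (FN_END(h_le, h_be)):

    fns[1][0][0][0][0] == gen_helper_sme_ld1h_le_h
    fns[1][1][1][1][1] == gen_helper_sme_st1h_be_v_mte

so the [esz][be][v][mte][st] indexing does line up with the _le/_be,
_h/_v and _mte pieces of the helper names, with ld/st as the prefix
exception the comment mentions.
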
> +
> +    TCGv_ptr t_za, t_pg;
> +    TCGv_i64 addr;
> +    int desc = 0;
> +    bool be = s->be_data == MO_BE;
> +    bool mte = s->mte_active[0];
> +
> +    if (!dc_isar_feature(aa64_sme, s)) {
> +        return false;
> +    }
> +    if (!sme_smza_enabled_check(s)) {
> +        return true;
> +    }
> +
> +    t_za = get_tile_rowcol(s, a->esz, a->rs, a->za_imm, a->v);
> +    t_pg = pred_full_reg_ptr(s, a->pg);
> +    addr = tcg_temp_new_i64();
> +
> +    tcg_gen_shli_i64(addr, cpu_reg(s, a->rn), a->esz);
> +    tcg_gen_add_i64(addr, addr, cpu_reg_sp(s, a->rm));

Aren't rm and rn the wrong way around here?  The pseudocode
says that rn is the base (can be SP, not scaled) and rm is
the offset (can be XZR, scaled by mbytes).
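
Concretely, a corrected sketch following that pseudocode (the same
two TCG calls, with the register operands swapped) would be:

    /* offset: Xm, scaled by mbytes (1 << esz); may be XZR */
    tcg_gen_shli_i64(addr, cpu_reg(s, a->rm), a->esz);
    /* base: Xn|SP, not scaled */
    tcg_gen_add_i64(addr, addr, cpu_reg_sp(s, a->rn));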

> +
> +    if (mte) {
> +        desc = FIELD_DP32(desc, MTEDESC, MIDX, get_mem_index(s));
> +        desc = FIELD_DP32(desc, MTEDESC, TBI, s->tbid);
> +        desc = FIELD_DP32(desc, MTEDESC, TCMA, s->tcma);
> +        desc = FIELD_DP32(desc, MTEDESC, WRITE, a->st);
> +        desc = FIELD_DP32(desc, MTEDESC, SIZEM1, (1 << a->esz) - 1);
> +        desc <<= SVE_MTEDESC_SHIFT;
> +    } else {
> +        addr = clean_data_tbi(s, addr);
> +    }
> +    desc = simd_desc(s->svl, s->svl, desc);
> +
> +    fns[a->esz][be][a->v][mte][a->st](cpu_env, t_za, t_pg, addr,
> +                                      tcg_constant_i32(desc));
> +
> +    tcg_temp_free_ptr(t_za);
> +    tcg_temp_free_ptr(t_pg);
> +    tcg_temp_free_i64(addr);
> +    return true;
> +}

thanks
-- PMM


