Re: [RFC 1/1] target/riscv: use tcg ops generation to emulate whole reg
From: Richard Henderson
Subject: Re: [RFC 1/1] target/riscv: use tcg ops generation to emulate whole reg rvv loads/stores.
Date: Wed, 18 Dec 2024 13:08:46 -0600
User-agent: Mozilla Thunderbird
On 12/18/24 11:08, Paolo Savini wrote:
This patch aims at emulating the whole register loads and stores through
direct generation of tcg operations rather than through the aid of a helper
function.
Signed-off-by: Paolo Savini <paolo.savini@embecosm.com>
---
target/riscv/insn_trans/trans_rvv.c.inc | 104 +++++++++++++-----------
1 file changed, 56 insertions(+), 48 deletions(-)
diff --git a/target/riscv/insn_trans/trans_rvv.c.inc b/target/riscv/insn_trans/trans_rvv.c.inc
index b9883a5d32..63d8b05bf1 100644
--- a/target/riscv/insn_trans/trans_rvv.c.inc
+++ b/target/riscv/insn_trans/trans_rvv.c.inc
@@ -1100,26 +1100,34 @@ GEN_VEXT_TRANS(vle64ff_v, MO_64, r2nfvm, ldff_op, ld_us_check)
typedef void gen_helper_ldst_whole(TCGv_ptr, TCGv, TCGv_env, TCGv_i32);
static bool ldst_whole_trans(uint32_t vd, uint32_t rs1, uint32_t nf,
- gen_helper_ldst_whole *fn,
- DisasContext *s)
+ uint32_t log2_esz, gen_helper_ldst_whole *fn,
+ DisasContext *s, bool is_load)
{
- TCGv_ptr dest;
- TCGv base;
- TCGv_i32 desc;
-
- uint32_t data = FIELD_DP32(0, VDATA, NF, nf);
- data = FIELD_DP32(data, VDATA, VM, 1);
- dest = tcg_temp_new_ptr();
- desc = tcg_constant_i32(simd_desc(s->cfg_ptr->vlenb,
- s->cfg_ptr->vlenb, data));
-
- base = get_gpr(s, rs1, EXT_NONE);
- tcg_gen_addi_ptr(dest, tcg_env, vreg_ofs(s, vd));
+ TCGv addr;
+ addr = get_address(s, rs1, 0);
+ uint32_t size = s->cfg_ptr->vlenb * nf;
- mark_vs_dirty(s);
Did you lose this? You can't rely on the call from finalize_rvv_inst at the end. Dirty
must already be set on the exception path if any of these memory operations fault.
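Something like this (untested sketch, reusing the names from the patch) before the first memory operation:

```c
/* Set dirty before any guest memory access: if a load or store
 * faults, the exception path must already see VS as dirty. */
mark_vs_dirty(s);

for (int i = 0; i < size; i += 16) {
    /* ... loads/stores as below ... */
}
```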
-
- fn(dest, base, tcg_env, desc);
+ TCGv_i128 t16 = tcg_temp_new_i128();
+ /*
+ * Load/store minimum vlenb bytes per iteration.
+ * When possible do this atomically.
+ * Update vstart with the number of processed elements.
+ */
+ for (int i=0; i < size; i+=16) {
Spaces. scripts/checkpatch.pl should have complained.
+ addr = get_address(s, rs1, i);
+ if (is_load) {
+ tcg_gen_qemu_ld_i128(t16, addr, s->mem_idx,
+ MO_LE | MO_128 | MO_ATOM_WITHIN16);
FWIW, for MO_128, MO_ATOM_WITHIN16 is equivalent to MO_ATOM_IFALIGN, which is
the default.
Where you can improve things is by specifying *less* atomicity. For instance:
log2_esz == MO_8: MO_ATOM_NONE
When operating on bytes, no larger atomicity is required.
log2_esz <= MO_64: MO_ATOM_IFALIGN_PAIR
Require atomicity on the two MO_64 halves, but no more than that.
I don't currently support further granularity, because there doesn't seem to be a need,
and it's already somewhat confusing.
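As an untested sketch of what I mean (MemOp flag names as in include/exec/memop.h):

```c
MemOp atom;
if (log2_esz == MO_8) {
    /* Byte elements: no atomicity beyond single bytes is required. */
    atom = MO_ATOM_NONE;
} else {
    /* Elements up to MO_64: atomic per 8-byte half, but no more. */
    atom = MO_ATOM_IFALIGN_PAIR;
}
tcg_gen_qemu_ld_i128(t16, addr, s->mem_idx, MO_LE | MO_128 | atom);
```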
There's no handling for vstart != 0.
You could easily fall back to the helper for !s->vstart_eq_zero.
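E.g. something like this early in the function (untested sketch; dest, base and desc set up as in the code this patch deletes, and assuming finalize_rvv_inst is the right epilogue for the helper path):

```c
if (!s->vstart_eq_zero) {
    /* Rare case: let the existing out-of-line helper handle it. */
    fn(dest, base, tcg_env, desc);
    finalize_rvv_inst(s);
    return true;
}
```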
+ tcg_gen_st_i128(t16, tcg_env, vreg_ofs(s, vd) + i);
+ tcg_gen_addi_tl(cpu_vstart, cpu_vstart, 16 >> log2_esz);
+ } else {
+ tcg_gen_ld_i128(t16, tcg_env, vreg_ofs(s, vd) + i);
+ tcg_gen_qemu_st_i128(t16, addr, s->mem_idx,
+ MO_LE | MO_128 | MO_ATOM_WITHIN16);
+ tcg_gen_addi_tl(cpu_vstart, cpu_vstart, 16 >> log2_esz);
The final iteration must set cpu_vstart to 0.
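I.e. after the loop, something along these lines:

```c
/* A completed instruction must leave vstart == 0. */
tcg_gen_movi_tl(cpu_vstart, 0);
```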
Otherwise this looks quite plausible.
r~