qemu-riscv
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [RFC PATCH v4 2/5] target/riscv: rvv: Provide a fast path using dire


From: Richard Henderson
Subject: Re: [RFC PATCH v4 2/5] target/riscv: rvv: Provide a fast path using direct access to host ram for unmasked unit-stride load/store
Date: Wed, 19 Jun 2024 21:29:19 -0700
User-agent: Mozilla Thunderbird

On 6/13/24 10:51, Max Chou wrote:
This commit references the sve_ldN_r/sve_stN_r helper functions in ARM
target to optimize the vector unmasked unit-stride load/store
instructions by following items:

* Get the loose bound of activate elements
* Probing pages/resolving host memory address/handling watchpoint at beginning
* Provide new interface to direct access host memory

The original element load/store interface is replaced by the new element
load/store functions with _tlb & _host postfix that means doing the
element load/store through the original softmmu flow and the direct
access host memory flow.

Signed-off-by: Max Chou <max.chou@sifive.com>
---
  target/riscv/insn_trans/trans_rvv.c.inc |   3 +
  target/riscv/vector_helper.c            | 637 +++++++++++++++++++-----
  target/riscv/vector_internals.h         |  48 ++
  3 files changed, 551 insertions(+), 137 deletions(-)

diff --git a/target/riscv/insn_trans/trans_rvv.c.inc 
b/target/riscv/insn_trans/trans_rvv.c.inc
index 3a3896ba06c..14e10568bd7 100644
--- a/target/riscv/insn_trans/trans_rvv.c.inc
+++ b/target/riscv/insn_trans/trans_rvv.c.inc
@@ -770,6 +770,7 @@ static bool ld_us_mask_op(DisasContext *s, arg_vlm_v *a, 
uint8_t eew)
      /* Mask destination register are always tail-agnostic */
      data = FIELD_DP32(data, VDATA, VTA, s->cfg_vta_all_1s);
      data = FIELD_DP32(data, VDATA, VMA, s->vma);
+    data = FIELD_DP32(data, VDATA, VM, 1);
      return ldst_us_trans(a->rd, a->rs1, data, fn, s, false);
  }
@@ -787,6 +788,7 @@ static bool st_us_mask_op(DisasContext *s, arg_vsm_v *a, uint8_t eew)
      /* EMUL = 1, NFIELDS = 1 */
      data = FIELD_DP32(data, VDATA, LMUL, 0);
      data = FIELD_DP32(data, VDATA, NF, 1);
+    data = FIELD_DP32(data, VDATA, VM, 1);
      return ldst_us_trans(a->rd, a->rs1, data, fn, s, true);
  }
@@ -1106,6 +1108,7 @@ static bool ldst_whole_trans(uint32_t vd, uint32_t rs1, uint32_t nf,
      TCGv_i32 desc;
uint32_t data = FIELD_DP32(0, VDATA, NF, nf);
+    data = FIELD_DP32(data, VDATA, VM, 1);
      dest = tcg_temp_new_ptr();
      desc = tcg_constant_i32(simd_desc(s->cfg_ptr->vlenb,
                                        s->cfg_ptr->vlenb, data));

This is ok, and would warrant a separate patch.


+    if (vm == 0) {
+        for (i = vstart; i < evl; ++i) {
+            if (vext_elem_mask(v0, i)) {
+                reg_idx_last = i;
+                if (reg_idx_first < 0) {
+                    reg_idx_first = i;
+                }
+            }
+        }

This isn't great, and isn't used for now, since only unmasked unit-stride is handled so far. I think this first patch should be simpler and *assume* VM is set.

+/*
+ * Resolve the guest virtual addresses to info->page[].
+ * Control the generation of page faults with @fault.  Return false if
+ * there is no work to do, which can only happen with @fault == FAULT_NO.
+ */
+static bool vext_cont_ldst_pages(CPURISCVState *env, RVVContLdSt *info,
+                                 target_ulong addr, bool is_load,
+                                 uint32_t desc, uint32_t esz, uintptr_t ra,
+                                 bool is_us_whole)
+{
+    uint32_t vm = vext_vm(desc);
+    uint32_t nf = vext_nf(desc);
+    bool nofault = (vm == 1 ? false : true);

Why is nofault == "!vm"?

Also, it's silly to use ?: with true/false -- use the proper boolean expression in the first place.

That said... faults with RVV must interact with vstart.

I'm not sure what the best code organization is.

Perhaps a subroutine, passed the first and last elements for a single page.

  Update vstart, resolve the page, allowing the exception.
  If watchpoints, one call to cpu_check_watchpoint for the entire memory range.
  If ram, iterate through the rest of the page using host accesses; otherwise,
  iterate through the rest of the page using tlb accesses, making sure vstart
  is always up-to-date.

The main routine looks for the page_split, invokes the subroutine for the first (and likely only) page. Special case any split-page element. Invoke the subroutine for the second page.


r~



reply via email to

[Prev in Thread] Current Thread [Next in Thread]