[RFC] risc-v vector (RVV) emulation performance issues
From: Daniel Henrique Barboza
Subject: [RFC] risc-v vector (RVV) emulation performance issues
Date: Mon, 24 Jul 2023 10:40:08 -0300
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:102.0) Gecko/20100101 Thunderbird/102.13.0
Hi,
As some of you are already aware, the current RVV emulation could be faster.
We have at least one commit (bc0ec52eb2, "target/riscv/vector_helper.c:
skip set tail when vta is zero") that tried to address part of the problem.
Take a simple program like this:
-------
#include <stdlib.h>

#define SZ 10000000

int main()
{
    int *a = malloc(SZ * sizeof(int));
    int *b = malloc(SZ * sizeof(int));
    int *c = malloc(SZ * sizeof(int));

    /* simple vectorizable loop: element-wise add of two arrays */
    for (int i = 0; i < SZ; i++)
        c[i] = a[i] + b[i];

    return c[SZ - 1];
}
-------
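The build commands aren't shown here; as a rough assumption on my part (not the
exact commands used for the runs below), something like the following would
produce the two binaries, with -march toggling the V extension:
$ riscv64-linux-gnu-gcc -O3 -march=rv64gc foo.c -o foo-novect.out
$ riscv64-linux-gnu-gcc -O3 -march=rv64gcv foo.c -o foo.out
Whether the loop actually gets autovectorized into RVV loads/stores depends on
the compiler version and options.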
Compiling it without RVV support and running it takes about 50 milliseconds:
$ time ~/work/qemu/build/qemu-riscv64 -cpu rv64,debug=false,vext_spec=v1.0,v=true,vlen=128 ./foo-novect.out
real 0m0.043s
user 0m0.025s
sys 0m0.018s
Building the same program with RVV support slows it down by 4-5 times:
$ time ~/work/qemu/build/qemu-riscv64 -cpu rv64,debug=false,vext_spec=v1.0,v=true,vlen=1024 ./foo.out
real 0m0.196s
user 0m0.177s
sys 0m0.018s
Using the lowest 'vlen' value allowed (128) slows things down even further,
taking the runtime to ~0.260s.
'perf record' shows the following profile on the aforementioned binary:
23.27% qemu-riscv64 qemu-riscv64 [.] do_ld4_mmu
21.11% qemu-riscv64 qemu-riscv64 [.] vext_ldst_us
14.05% qemu-riscv64 qemu-riscv64 [.] cpu_ldl_le_data_ra
11.51% qemu-riscv64 qemu-riscv64 [.] cpu_stl_le_data_ra
8.18% qemu-riscv64 qemu-riscv64 [.] cpu_mmu_lookup
8.04% qemu-riscv64 qemu-riscv64 [.] do_st4_mmu
2.04% qemu-riscv64 qemu-riscv64 [.] ste_w
1.15% qemu-riscv64 qemu-riscv64 [.] lde_w
1.02% qemu-riscv64 [unknown] [k] 0xffffffffb3001260
0.90% qemu-riscv64 qemu-riscv64 [.] cpu_get_tb_cpu_state
0.64% qemu-riscv64 qemu-riscv64 [.] tb_lookup
0.64% qemu-riscv64 qemu-riscv64 [.] riscv_cpu_mmu_index
0.39% qemu-riscv64 qemu-riscv64 [.] object_dynamic_cast_assert
The first thing that caught my attention is vext_ldst_us from target/riscv/vector_helper.c:
/* load bytes from guest memory */
for (i = env->vstart; i < evl; i++, env->vstart++) {
    k = 0;
    while (k < nf) {
        target_ulong addr = base + ((i * nf + k) << log2_esz);
        ldst_elem(env, adjust_addr(env, addr), i + k * max_elems, vd, ra);
        k++;
    }
}
env->vstart = 0;
Given that this is a unit-stride load that accesses contiguous elements in
memory, it seems that this loop could be optimized/removed, since it's
loading/storing one element at a time. I didn't find any TCG op to do that,
though. I assume that ARM SVE might have something of the sort. Richard, care
to comment?
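To illustrate the idea, here is a rough standalone sketch (plain C, not QEMU
code; all names are placeholders, and in QEMU a bulk path would additionally
have to cope with page crossings, MMIO, watchpoints and resuming via
env->vstart): when the whole unit-stride access is contiguous and backed by
host RAM, the per-element loop collapses into a single bulk copy.
-------
/* Standalone illustration, not QEMU code: replace a per-element
 * unit-stride copy with one bulk copy over the contiguous span.
 * All names here are placeholders for the sketch. */
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* shape of the current path: one small copy per element */
static void ldst_us_per_element(uint8_t *vd, const uint8_t *host_base,
                                uint32_t evl, uint32_t esz)
{
    for (uint32_t i = 0; i < evl; i++) {
        memcpy(vd + (size_t)i * esz, host_base + (size_t)i * esz, esz);
    }
}

/* proposed fast path: the whole access is one contiguous span */
static void ldst_us_bulk(uint8_t *vd, const uint8_t *host_base,
                         uint32_t evl, uint32_t esz)
{
    memcpy(vd, host_base, (size_t)evl * esz);
}

int main(void)
{
    enum { EVL = 32, ESZ = 4 };   /* e.g. 32 elements of 4 bytes (SEW=32) */
    uint8_t guest_ram[EVL * ESZ];
    uint8_t vreg_slow[EVL * ESZ], vreg_fast[EVL * ESZ];

    for (unsigned i = 0; i < sizeof(guest_ram); i++) {
        guest_ram[i] = (uint8_t)i;
    }

    ldst_us_per_element(vreg_slow, guest_ram, EVL, ESZ);
    ldst_us_bulk(vreg_fast, guest_ram, EVL, ESZ);

    printf("results match: %s\n",
           memcmp(vreg_slow, vreg_fast, sizeof(vreg_slow)) == 0 ? "yes" : "no");
    return 0;
}
-------
The point is just that the element loop only exists to split one contiguous
guest access into many small ones; if the host address for the span can be
resolved once, a single memcpy-sized copy should do the same work with far
fewer helper calls.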
The current support we have is good enough for booting a kernel and running
tests, but things degrade quickly if one attempts to run the x264 SPEC
workload with it. In a SPEC run other insns show up as hot as well, but for
now it would be good to see if we can optimize these loads and stores.
Any ideas on how to tackle this? Thanks,
Daniel