From: Richard Henderson
Subject: Re: [Qemu-devel] [PATCH] tcg: increase MAX_OP_PER_INSTR to 395
Date: Fri, 23 Sep 2016 12:54:48 -0700
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:45.0) Gecko/20100101 Thunderbird/45.3.0
On 09/22/2016 04:53 PM, Joseph Myers wrote:
> MAX_OP_PER_INSTR is currently 266, reported in commit 14dcdac82f398cbac874c8579b9583fab31c67bf to be the worst case for the ARM A64 decoder. Whether or not it was in fact the worst case at that time in 2014, I'm observing the instruction 0x4c006020 (st1 {v0.16b-v2.16b}, [x1]) generate 386 ops from disas_ldst_multiple_struct with current sources,
For the record, I reproduce your results on a 32-bit host with v0-v3. I assume the v2 here is a typo.
While increasing the max per insn is indeed one way to approach this, aarch64 is being remarkably inefficient in this case. With the following, I see a reduction from 387 ops to 261 ops; for a 64-bit host, the reduction is from 258 ops to 195 ops.
I should also note that the implementation of this insn should be even simpler. I see this insn as performing eight 64-bit, little-endian, unaligned loads. We should be able to implement this insn for a 64-bit host in about 25 ops, which implies that the current code is nearly 8 times too large.
The same should be true for other combinations of sizes for ldst. I recognize that it gets more complicated for big-endian guest and element sizes larger than 1, but for element sizes larger than 1 we automatically have <= half of the number of ops seen here.
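To illustrate the point about 64-bit accesses, here is a minimal host-side sketch in plain C. It is not QEMU or TCG code. The VReg type and the st1_16b_* names are hypothetical, and the register layout (two 64-bit halves, byte 0 least significant) is an assumption made for the model. Since st1 is a store, the sketch models stores; the same byte-order argument should apply to the matching loads. It compares 64 single-byte stores against eight little-endian 64-bit stores for four registers with 1-byte elements, at a deliberately unaligned address.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical model of a 128-bit vector register: two 64-bit halves,
 * with d[0] holding bytes 0..7 and byte 0 in the least significant bits. */
typedef struct { uint64_t d[2]; } VReg;

/* Models ONE 64-bit little-endian store to a possibly unaligned address.
 * Written byte by byte here only so the model is host-endian independent. */
static void store_le64(uint8_t *p, uint64_t v)
{
    for (int i = 0; i < 8; i++) {
        p[i] = (uint8_t)(v >> (8 * i));
    }
}

/* Element-by-element version: one byte store per element, 16 per register. */
static void st1_16b_bytewise(uint8_t *mem, const VReg *regs, int nregs)
{
    assert(nregs >= 1 && nregs <= 4);
    for (int r = 0; r < nregs; r++) {
        for (int e = 0; e < 16; e++) {
            mem[16 * r + e] = (uint8_t)(regs[r].d[e / 8] >> (8 * (e % 8)));
        }
    }
}

/* Wide version: two 64-bit little-endian stores per register,
 * so eight stores in total for four registers. */
static void st1_16b_wide(uint8_t *mem, const VReg *regs, int nregs)
{
    assert(nregs >= 1 && nregs <= 4);
    for (int r = 0; r < nregs; r++) {
        for (int h = 0; h < 2; h++) {
            store_le64(mem + 16 * r + 8 * h, regs[r].d[h]);
        }
    }
}

/* Returns 1 if both versions write identical memory for four registers. */
static int st1_16b_versions_match(void)
{
    VReg regs[4];
    uint8_t a[65], b[65];

    /* Arbitrary distinct test pattern; unsigned multiplication wraps. */
    for (int r = 0; r < 4; r++) {
        for (int h = 0; h < 2; h++) {
            regs[r].d[h] = 0x0123456789abcdefULL * (uint64_t)(2 * r + h + 1);
        }
    }
    memset(a, 0, sizeof(a));
    memset(b, 0, sizeof(b));

    /* Offset by one byte to model an unaligned guest address. */
    st1_16b_bytewise(a + 1, regs, 4);
    st1_16b_wide(b + 1, regs, 4);
    return memcmp(a, b, sizeof(a)) == 0;
}
```

Calling st1_16b_versions_match() from a small test driver should return 1 under these assumptions. The sketch only shows that the memory image is the same either way; it says nothing about how many TCG ops a real implementation would emit.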
r~
[Attachment: z (text document)]