Re: [Qemu-devel] [PATCH] tcg: increase MAX_OP_PER_INSTR to 395


From: Richard Henderson
Subject: Re: [Qemu-devel] [PATCH] tcg: increase MAX_OP_PER_INSTR to 395
Date: Fri, 23 Sep 2016 12:54:48 -0700
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:45.0) Gecko/20100101 Thunderbird/45.3.0

On 09/22/2016 04:53 PM, Joseph Myers wrote:
> MAX_OP_PER_INSTR is currently 266, reported in commit
> 14dcdac82f398cbac874c8579b9583fab31c67bf to be the worst case for the
> ARM A64 decoder.
>
> Whether or not it was in fact the worst case at that time in 2014, I'm
> observing the instruction 0x4c006020 (st1 {v0.16b-v2.16b}, [x1])
> generate 386 ops from disas_ldst_multiple_struct with current sources,

For the record, I reproduce your results on a 32-bit host with v0-v3. I assume the v2 here is a typo.

While increasing the max per insn is indeed one way to approach this, aarch64 is being remarkably inefficient in this case. With the following, I see a reduction from 387 ops to 261 ops; for a 64-bit host, the reduction is from 258 ops to 195 ops.

I should also note that the implementation of this insn should be even simpler. I see this insn as performing 8 64-bit, little-endian, unaligned stores (loads, for the corresponding ld1). We should be able to implement this insn for a 64-bit host in about 25 ops, which implies that the current code is nearly 8 times too large.
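To illustrate the equivalence claimed above, here is a minimal sketch in plain C, not TCG code and not the attached patch. Since st1 is a store, the sketch uses stores. It assumes four 128-bit registers (v0-v3) with byte elements; the type Vec128 and all function names are hypothetical, chosen only for this example. It compares an element-wise form (64 one-byte stores) against 8 64-bit little-endian stores at an unaligned address.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define VEC_BYTES 16   /* one 128-bit vector register */
#define NUM_REGS  4    /* v0-v3 (assumption for this sketch) */

/* Hypothetical register model: byte 0 is the lowest-numbered element. */
typedef struct { uint8_t b[VEC_BYTES]; } Vec128;

/* Element-wise form: one byte store per element, 64 stores in total. */
void st1_bytewise(uint8_t *mem, const Vec128 *regs)
{
    for (int r = 0; r < NUM_REGS; r++) {
        for (int e = 0; e < VEC_BYTES; e++) {
            mem[r * VEC_BYTES + e] = regs[r].b[e];
        }
    }
}

/* Read one 64-bit half of a register, byte 0 least significant. */
uint64_t reg_half(const Vec128 *reg, int half)
{
    assert(half == 0 || half == 1);
    uint64_t v = 0;
    for (int i = 0; i < 8; i++) {
        v |= (uint64_t)reg->b[half * 8 + i] << (8 * i);
    }
    return v;
}

/* Store a 64-bit value in little-endian byte order at any alignment. */
void store_le64(uint8_t *p, uint64_t v)
{
    for (int i = 0; i < 8; i++) {
        p[i] = (uint8_t)(v >> (8 * i));
    }
}

/* Wide form: two 64-bit stores per register, 8 stores in total. */
void st1_wide(uint8_t *mem, const Vec128 *regs)
{
    for (int r = 0; r < NUM_REGS; r++) {
        for (int half = 0; half < 2; half++) {
            store_le64(mem + r * VEC_BYTES + half * 8,
                       reg_half(&regs[r], half));
        }
    }
}

/* Returns 1 when both forms leave identical bytes in memory. */
int st1_forms_agree(void)
{
    Vec128 regs[NUM_REGS];
    uint8_t a[NUM_REGS * VEC_BYTES + 1] = {0};
    uint8_t b[NUM_REGS * VEC_BYTES + 1] = {0};

    for (int r = 0; r < NUM_REGS; r++) {
        for (int e = 0; e < VEC_BYTES; e++) {
            regs[r].b[e] = (uint8_t)(r * VEC_BYTES + e + 1);
        }
    }
    /* Offset by one byte so the destination is deliberately unaligned. */
    st1_bytewise(a + 1, regs);
    st1_wide(b + 1, regs);
    return memcmp(a, b, sizeof(a)) == 0;
}
```

With byte elements there is no per-element byte swapping to do, which is why the two forms can agree regardless of guest endianness in this model.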

The same should be true for other combinations of sizes for ldst. I recognize that it gets more complicated for big-endian guest and element sizes larger than 1, but for element sizes larger than 1 we automatically have <= half of the number of ops seen here.
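The "<= half" remark can be checked with simple arithmetic. The sketch below assumes the op count of the element-wise code scales with the number of element accesses, and that the insn transfers four 128-bit registers (64 bytes); the function name is hypothetical.

```c
#include <assert.h>

/* Bytes moved by a 4-register ld1/st1 of 128-bit vectors (assumption). */
enum { TOTAL_BYTES = 4 * 16 };

/* Number of per-element accesses for a given element size in bytes. */
int element_accesses(int elem_bytes)
{
    assert(elem_bytes == 1 || elem_bytes == 2 ||
           elem_bytes == 4 || elem_bytes == 8);
    return TOTAL_BYTES / elem_bytes;
}
```

Under these assumptions, byte elements need 64 accesses, 2-byte elements 32, and 8-byte elements 8, so any element size larger than 1 needs at most half as many.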


r~

Attachment: z
Description: Text document

