Re: [Qemu-devel] [RFC PATCH] tcg: Optimize fence instructions
From: Paolo Bonzini
Subject: Re: [Qemu-devel] [RFC PATCH] tcg: Optimize fence instructions
Date: Tue, 19 Jul 2016 19:16:07 +0200
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:45.0) Gecko/20100101 Thunderbird/45.1.1
On 14/07/2016 22:29, Pranith Kumar wrote:
> +        } else if (curr_mb_type == TCG_BAR_STRL &&
> +                   prev_mb_type == TCG_BAR_LDAQ) {
> +            /* Consecutive load-acquire and store-release barriers
> +             * can be merged into one stronger SC barrier
> +             * ldaq; strl => ld; mb; st
> +             */
> +            args[0] = (args[0] & 0x0F) | TCG_BAR_SC;
> +            tcg_op_remove(s, prev_op);
Is this really an optimization? For example, the processor could reorder
"st1; ldaq1; strl2; ld2" to "ldaq1; ld2; st1; strl2". It cannot do this
if you change ldaq1/strl2 to ld1/mb/st2.
On x86, for example, a memory fence costs ~50 clock cycles, while normal
loads and stores are of course faster.
Of course, this is useful if your target doesn't have ldaq/strl
instructions. In this case, however, you probably want to lower ldaq to
"ld;mb" and strl to "mb;st"; the other optimizations will then remove
the unnecessary barrier.
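To illustrate the lowering suggested above, here is a minimal sketch in C.
It uses a hypothetical toy IR (the toy_* names are invented for this example
and are not QEMU's TCG API): ldaq is lowered to "ld;mb", strl to "mb;st", and
a separate pass then drops a full barrier that directly follows another one.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical toy IR for illustration only; these are not TCG opcodes. */
enum toy_op { TOY_LD, TOY_ST, TOY_LDAQ, TOY_STRL, TOY_MB };

/*
 * Lower acquire/release accesses for a target that lacks them:
 *   ldaq => ld; mb
 *   strl => mb; st
 * 'out' must have room for 2 * n ops. Returns the number of ops written.
 */
static size_t toy_lower(const enum toy_op *in, size_t n, enum toy_op *out)
{
    size_t k = 0;

    for (size_t i = 0; i < n; i++) {
        switch (in[i]) {
        case TOY_LDAQ:
            out[k++] = TOY_LD;
            out[k++] = TOY_MB;
            break;
        case TOY_STRL:
            out[k++] = TOY_MB;
            out[k++] = TOY_ST;
            break;
        default:
            out[k++] = in[i];
            break;
        }
    }
    assert(k <= 2 * n);
    return k;
}

/*
 * Drop a full barrier that immediately follows another full barrier.
 * Works in place and returns the new length.
 */
static size_t toy_merge_barriers(enum toy_op *ops, size_t n)
{
    size_t k = 0;

    for (size_t i = 0; i < n; i++) {
        if (ops[i] == TOY_MB && k > 0 && ops[k - 1] == TOY_MB) {
            continue; /* redundant: the previous op is already a barrier */
        }
        ops[k++] = ops[i];
    }
    return k;
}
```

With this sketch, "ldaq; strl" lowers to "ld; mb; mb; st" and the merge pass
reduces it to "ld; mb; st", so the single barrier only appears on targets
that need the lowering in the first place.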
Paolo