Re: [PATCH v3 02/13] tcg/s390x: Remove TCG_REG_TB


From: Ilya Leoshkevich
Subject: Re: [PATCH v3 02/13] tcg/s390x: Remove TCG_REG_TB
Date: Wed, 07 Dec 2022 23:09:07 +0100
User-agent: Evolution 3.46.1 (3.46.1-1.fc37)

On Tue, 2022-12-06 at 16:22 -0600, Richard Henderson wrote:
> On 12/6/22 13:29, Ilya Leoshkevich wrote:
> > On Thu, Dec 01, 2022 at 10:51:49PM -0800, Richard Henderson wrote:
> > > This reverts 829e1376d940 ("tcg/s390: Introduce TCG_REG_TB"), and
> > > several follow-up patches.  The primary motivation is to reduce the
> > > less-tested code paths, pre-z10.  Secondarily, this allows the
> > > unconditional use of TCG_TARGET_HAS_direct_jump, which might be more
> > > important for performance than any slight increase in code size.
> > > 
> > > Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
> > > ---
> > >   tcg/s390x/tcg-target.h     |   2 +-
> > >   tcg/s390x/tcg-target.c.inc | 176 +++++--------------------------------
> > >   2 files changed, 23 insertions(+), 155 deletions(-)
> > 
> > Reviewed-by: Ilya Leoshkevich <iii@linux.ibm.com>
> > 
> > I have a few questions/ideas for the future below.  
> > > 
...

> > 
> > > -    /* Use the constant pool if USE_REG_TB, but not for small constants.  */
> > > -    if (maybe_out_small_movi(s, type, TCG_TMP0, val)) {
> > > -        if (type == TCG_TYPE_I32) {
> > > -            tcg_out_insn(s, RR, XR, dest, TCG_TMP0);
> > > -        } else {
> > > -            tcg_out_insn(s, RRE, XGR, dest, TCG_TMP0);
> > > -        }
> > > -    } else if (USE_REG_TB) {
> > > -        tcg_out_insn(s, RXY, XG, dest, TCG_REG_TB, TCG_REG_NONE, 0);
> > > -        new_pool_label(s, val, R_390_20, s->code_ptr - 2,
> > > -                       tcg_tbrel_diff(s, NULL));
> > > +    tcg_out_movi(s, type, TCG_TMP0, val);
> > > +    if (type == TCG_TYPE_I32) {
> > > +        tcg_out_insn(s, RR, XR, dest, TCG_TMP0);
> > >       } else {
> > > -        /* Perform the xor by parts.  */
> > > -        tcg_debug_assert(HAVE_FACILITY(EXT_IMM));
> > > -        if (val & 0xffffffff) {
> > > -            tcg_out_insn(s, RIL, XILF, dest, val);
> > > -        }
> > > -        if (val > 0xffffffff) {
> > > -            tcg_out_insn(s, RIL, XIHF, dest, val >> 32);
> > > -        }
> > > +        tcg_out_insn(s, RRE, XGR, dest, TCG_TMP0);
> > >       }
> > >   }
> > 
> > Wouldn't it be worth keeping XILF/XIHF here?
> 
> I don't know.  It's difficult for me to guess whether a dependency
> chain like
> 
>      val -> xor -> xor
> 
> (3 insns with serial dependencies) is better than
> 
>      val   --> xor
>      load  -/
> 
> (3 insns, but only one serial dependency).  But there may also be
> instruction fusion going on at the micro-architectural level, so that
> there's really only one xor.
> 
> If you have suggestions, I'm all ears.

I ran some experiments, and it turned out you were right: XILF+XIHF is
slower, precisely because of the dependency chain.
So there is no need to change anything here; sorry for the noise.
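
For reference, the two emission strategies I compared were roughly the
following (a sketch in the style of tcg-target.c.inc; the helper names
are made up for illustration only):

/* (a) xor by parts: XILF and XIHF both read and write 'dest', so they
 * execute serially after whatever instruction produced 'dest'. */
static void tgen_xori_by_parts(TCGContext *s, TCGReg dest, uint64_t val)
{
    if (val & 0xffffffff) {
        tcg_out_insn(s, RIL, XILF, dest, val);
    }
    if (val > 0xffffffff) {
        tcg_out_insn(s, RIL, XIHF, dest, val >> 32);
    }
}

/* (b) constant + register xor, as in the patch: materializing TCG_TMP0
 * does not depend on 'dest', so only the final XGR sits on the
 * critical path. */
static void tgen_xori_via_tmp(TCGContext *s, TCGReg dest, uint64_t val)
{
    tcg_out_movi(s, TCG_TYPE_I64, TCG_TMP0, val);
    tcg_out_insn(s, RRE, XGR, dest, TCG_TMP0);
}

Strategy (b) is what the new code emits unconditionally.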

> > I don't have any numbers right now, but it looks more
> > compact/efficient than a load + XGR.
> 
> If we assume general-instruction-extension-facility (z10?), LGRL + XGR
> is smaller than XILF + XIHF, ignoring the constant pool entry which
> might be shared, and modulo the µarch questions above.
> 
> 
> > Same for OGR above; I even wonder if both implementations could be unified.
> 
> Sadly not, because of OILL et al.  There are no 16-bit xor immediate insns.
> 
> > > +        /*
> > > +         * branch displacement must be aligned for atomic patching;
> > > +         * see if we need to add extra nop before branch
> > > +         */
> > > +        if (!QEMU_PTR_IS_ALIGNED(s->code_ptr + 1, 4)) {
> > > +            tcg_out16(s, NOP);
> > >           }
> > > +        tcg_out16(s, RIL_BRCL | (S390_CC_ALWAYS << 4));
> > > +        s->tb_jmp_insn_offset[a0] = tcg_current_code_size(s);
> > > +        tcg_out32(s, 0);
> > >           set_jmp_reset_offset(s, a0);
> > 
> > This seems to work in practice, but I don't think patching branch
> > offsets is allowed by PoP (in a multi-threaded environment). For
> > example, I had to do [2] in order to work around this limitation in
> > ftrace.
> 
> Really?  How does the processor distinguish between overwriting
> opcode/condition vs overwriting immediate operand when invalidating
> cached instructions?

The problem is that nothing in the PoP prevents a CPU from fetching
instruction bytes one by one, in arbitrary order, and more than once.
It's not that this is necessarily going to happen; rather, the safety of
such patching has never been verified for all instructions and formally
stated. The observation that I use in the ftrace patch is not formalized
either, but it's specific to a single instruction and should hold in
practice on existing hardware, to the best of my knowledge.
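
To make the concern concrete: BRCL is 6 bytes, 2 bytes of opcode/mask
followed by a 4-byte displacement counted in halfwords relative to the
start of the instruction, and the NOP padding in the hunk above makes
the displacement field 4-byte aligned so it can be rewritten with a
single aligned 32-bit store.  A sketch of that patching (illustrative
only, not the actual QEMU helper; s390x is big-endian, so no byte
swapping is needed):

#include <stdint.h>

static void patch_brcl_disp(uint8_t *brcl, const uint8_t *target)
{
    /* Signed halfword offset from the BRCL instruction itself. */
    int32_t disp = (int32_t)((target - brcl) / 2);

    /* Single aligned 4-byte store into the displacement field. */
    __atomic_store_n((int32_t *)(brcl + 2), disp, __ATOMIC_RELEASE);
}

The store itself is fine; the open question is whether a CPU
concurrently executing this code is architecturally guaranteed to
observe either the old or the new displacement as a whole, and that is
exactly what the PoP does not promise.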

> If overwriting operand truly isn't correct, then I think we have to
> use indirect branch always for goto_tb.
> 
> > A third benefit seems to be that we now have one more register to
> > allocate.
> 
> Yes.  It's call clobbered, so it isn't live so often, but sometimes.
> 
> 
> r~



