
[Qemu-devel] Re: S390 TCG target


From: Richard Henderson
Subject: [Qemu-devel] Re: S390 TCG target
Date: Wed, 12 May 2010 16:44:14 -0700
User-agent: Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.9.1.9) Gecko/20100330 Fedora/3.0.4-1.fc12 Thunderbird/3.0.4

Reference: http://lists.gnu.org/archive/html/qemu-devel/2009-12/msg01954.html

>  static const int tcg_target_call_iarg_regs[] = {
> +    TCG_REG_R2,
> +    TCG_REG_R3,
> +    TCG_REG_R4,
> +    TCG_REG_R5,
>  };

R6 is also an argument register.

> +static void tcg_out_e3(TCGContext* s, int op, int r1, int r2, int disp)
> +{
> +    tcg_out16(s, 0xe300 | (r1 << 4));
> +    tcg_out32(s, op | (r2 << 28) | ((disp & 0xfff) << 16) | ((disp >> 12) << 8));
> +}

Followup:
I think you'd do well to go ahead and pass in the X operand.  I think
there are several places we can use the index part of the address.  Just
add a comment reminding us that a 0 index means "0" not "R0".
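
Something like (untested):

static void tcg_out_e3(TCGContext *s, int op, int r1, int r2,
                       int x, int disp)
{
    /* r1, disp(x, r2); an x (or r2) of 0 means "no register", not R0.
       Mask the high displacement byte so that negative 20-bit
       displacements don't smear sign bits over the other fields.  */
    tcg_out16(s, 0xe300 | (r1 << 4) | x);
    tcg_out32(s, op | (r2 << 28) | ((disp & 0xfff) << 16)
              | (((disp >> 12) & 0xff) << 8));
}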

> +    } else if (!(arg & 0xffffffffffff0000UL)) {
> +        /* llill %rret, arg */
> +        tcg_out32(s, S390_INS_LLILL | (ret << 20) | arg);
> +    } else if (!(arg & 0xffffffff00000000UL) || type == TCG_TYPE_I32) {
> +        /* llill %rret, arg */
> +        tcg_out32(s, S390_INS_LLILL | (ret << 20) | (arg & 0xffff));
> +        /* iilh %rret, arg */
> +        tcg_out32(s, S390_INS_IILH | (ret << 20) | ((arg & 0xffffffff) >> 16));
> +    } else {
> +        /* branch over constant and store its address in R13 */
> +        tcg_out_brasl(s, TCG_REG_R13, 14);
> +        /* 64-bit constant */
> +        tcg_out32(s,arg >> 32);
> +        tcg_out32(s,arg);
> +        /* load constant to ret */
> +        tcg_out_e3(s, E3_LG, ret, TCG_REG_R13, 0);
> +    }

Followup:
It would be a good idea to handle more special cases here,
which can use the loads and inserts into the other 3 16-bit
ranges within the word.  I'm thinking in particular of the
LLILH, LLIHL, LLIHH, IIHL, IIHH instructions.

Indeed, I think that would compare well in general with your
branch-and-load path, given the lack of a true constant pool.
You currently emit 6+8+6 = 20 bytes (could be reduced to 18
with a bras instead of brasl), whereas a full sequence of

  // movi r, 0x1122334455667788
  llill r,0x7788
  iilh  r,0x5566
  iihl  r,0x3344
  iihh  r,0x1122

is only 16 bytes.  E.g.:

static void tcg_out_movi(TCGContext *s, TCGType type,
                         int ret, tcg_target_long val)
{
    static const uint32_t imm_insns[2][4] = {
        { S390_INS_LLILL, S390_INS_LLILH, S390_INS_LLIHL, S390_INS_LLIHH },
        { S390_INS_IILL, S390_INS_IILH, S390_INS_IIHL, S390_INS_IIHH }
    };
    tcg_target_ulong uval = val;
    int i, ins;

    if (type == TCG_TYPE_I32) {
        uval &= 0xffffffff;
    }
    if (uval == 0) {
        /* ret = ret - ret */
        tcg_out_rr(s, RR_SR, ret, ret);
        return;
    }
    if (val == (int16_t)val) {
        tcg_out32(s, S390_INS_LGHI | (ret << 20) | (val & 0xffff));
        return;
    }
    /* Shift an unsigned copy, so that the loop terminates and the
       array index stays in range for negative values.  */
    for (i = 0, ins = 0; uval; i++, uval >>= 16) {
        uint32_t part = uval & 0xffff;
        if (part) {
            tcg_out32(s, imm_insns[ins][i] | (ret << 20) | part);
            ins = 1;
        }
    }
}

There's also LAY which, with 0 as base and index, can be
used to generate a 20-bit signed constant.
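
With the indexed tcg_out_e3 above, and assuming an E3_LAY define
alongside the existing E3_* constants, that's a one-liner:

    /* lay ret, val(0, 0) */
    tcg_out_e3(s, E3_LAY, ret, 0, 0, val);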

There's also LARL which, given its 32-bit offset, could be
used effectively to load address constants from the main qemu
binary.  It's probably worth checking for it here, instead of as
you do later for INDEX_op_goto_tb alone.
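
E.g., inside tcg_out_movi (untested; LARL can only form even
addresses, hence the low-bit check):

    if ((val & 1) == 0) {
        tcg_target_long off = (val - (tcg_target_long)s->code_ptr) >> 1;
        if (off == (int32_t)off) {
            /* larl ret, val -- offset in halfwords */
            tcg_out16(s, 0xc000 | (ret << 4));
            tcg_out32(s, off);
            return;
        }
    }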

All that said, there's nothing actively incorrect in what you
have at the moment.

> +    if (arg2 < -0x80000 || arg2 > 0x7ffff) {
> +        /* load the displacement */
> +        tcg_out_movi(s, TCG_TYPE_PTR, TCG_REG_R13, arg2);
> +        /* add the address */
> +        tcg_out_b9(s, B9_AGR, TCG_REG_R13, arg1);
> +        /* load the data */
> +        tcg_out_e3(s, op, arg, TCG_REG_R13, 0);
> +    } else {
> +        /* load the data */
> +        tcg_out_e3(s, op, arg, arg1, arg2);
> +    }

With the indexed tcg_out_e3 mentioned above, the AGR here is not necessary.

static void tcg_out_ld(TCGContext *s, TCGType type, int dest, int base,
                       tcg_target_long ofs)
{
    int op = (type == TCG_TYPE_I32 ? E3_LLGF : E3_LG);
    int index = 0;

    if (ofs < -0x80000 || ofs >= 0x80000) {
        /* Combine the low 16 bits of the offset with the actual
           load insn; the high 48 bits must come from an
           immediate load.  */
        index = TCG_REG_R13;
        tcg_out_movi(s, TCG_TYPE_PTR, index, ofs & ~0xffffull);
        ofs &= 0xffff;
    }

    tcg_out_e3(s, op, dest, base, index, ofs);
}

Although this should be generalized a bit to be able to handle
emission of 8- and 16-bit values as well.  It can probably also
be merged with tcg_out_st which also wants to be able to use the
smaller esa/370 instructions when the offset is small enough.
Perhaps:

/* Emit a load/store type instruction.  Inputs are:
   DATA:     The register to be loaded or stored.
   BASE+OFS: The effective address.
   OPC_U12:  If the operation has a form that uses a 12-bit unsigned
             immediate (e.g. STC), otherwise 0.
   OPC_E3:   The 'E3' opcode for the operation which takes a 20-bit
             signed immediate (e.g. STCY).  */
static void tcg_out_ldst(TCGContext *s, int opc_u12, int opc_e3,
                         int data, int base, tcg_target_long ofs)

> +static void tcg_out_st(TCGContext *s, TCGType type, int arg,
> +                             int arg1, tcg_target_long arg2)
> +{
> +    dprintf("tcg_out_st arg 0x%x arg1 0x%x arg2 0x%lx\n", arg, arg1, arg2);
> +
> +    if (type == TCG_TYPE_I32) {
> +        if (((long)arg2) < -0x800 || ((long)arg2) > 0x7ff) {
> +            tcg_out_movi(s, TCG_TYPE_PTR, TCG_REG_R13, arg2);
> +            tcg_out_b9(s, B9_AGR, 13, arg1);
> +            tcg_out_store(s, ST_ST, arg, TCG_REG_R13, 0);
> +        } else {
> +            tcg_out_store(s, ST_ST, arg, arg1, arg2);
> +        }
> +    }
> +    else {
> +        if (((long)arg2) < -0x80000 || ((long)arg2) > 0x7ffff) {
> +            tcg_abort();
> +        }
> +        tcg_out_e3(s, E3_STG, arg, arg1, arg2);
> +    }
>  }

Similar comments as for tcg_out_ld above.  Also note the
availability of STY for using a 20-bit offset in the 32-bit
store path, which would also allow combining the large-offset
constant load path between the 32- and 64-bit stores.

> +    switch (c) {
> +    case 'S':                   /* qemu_st constraint */
> +        tcg_regset_reset_reg (ct->u.regs, TCG_REG_R4);
> +        /* fall through */
> +    case 'L':                   /* qemu_ld constraint */
> +        tcg_regset_reset_reg (ct->u.regs, TCG_REG_R3);
> +        break;
> +    }

There's no need to distinguish between S and L.  R4 is not
clobbered until after all inputs have been consumed.

> +#if TARGET_LONG_BITS == 32
> +    tcg_out_b9(s, B9_LLGFR, arg1, addr_reg);
> +    tcg_out_b9(s, B9_LLGFR, arg0, addr_reg);
> +#else
> +    tcg_out_b9(s, B9_LGR, arg1, addr_reg);
> +    tcg_out_b9(s, B9_LGR, arg0, addr_reg);
> +#endif
> +
> +    tcg_out_sh64(s, SH64_SRLG, arg1, addr_reg, SH64_REG_NONE,
> +                 TARGET_PAGE_BITS - CPU_TLB_ENTRY_BITS);
> +
> +    tcg_out_movi(s, TCG_TYPE_PTR, TCG_REG_R13,
> +                 TARGET_PAGE_MASK | ((1 << s_bits) - 1));
> +    tcg_out_b9(s, B9_NGR, arg0, TCG_REG_R13);

Bug: if TARGET_LONG_BITS == 32, the zero-extension is discarded
by the shift due to using the original input register.

That said, the original zero extension isn't really needed.  All
that needs to happen is to modify the mask used with the AND.  E.g.

  shift = TARGET_PAGE_BITS - CPU_TLB_ENTRY_BITS;
  tcg_out_sh64(s, SH64_SRLG, arg1, addr_reg, SH64_REG_NONE, shift);

  tcg_out_movi(s, TCG_TYPE_PTR, TCG_REG_R13,
               (TARGET_LONG_BITS == 32
                ? TARGET_PAGE_MASK & (~0u >> shift)
                : TARGET_PAGE_MASK)
               | ((1 << s_bits) - 1));
  tcg_out_b9(s, B9_NGR, arg0, TCG_REG_R13);

> +    if (is_store) {
> +        tcg_out_movi(s, TCG_TYPE_PTR, TCG_REG_R13,
> +                     offsetof(CPUState, tlb_table[mem_index][0].addr_write));
> +    } else {
> +        tcg_out_movi(s, TCG_TYPE_PTR, TCG_REG_R13,
> +                     offsetof(CPUState, tlb_table[mem_index][0].addr_read));
> +    }
> +    tcg_out_b9(s, B9_AGR, arg1, TCG_REG_R13);
> +
> +    tcg_out_b9(s, B9_AGR, arg1, TCG_AREG0);
> +
> +    tcg_out_e3(s, E3_CG, arg0, arg1, 0);

Followup:
TCG_AREG0 could be the index operand to tcg_out_e3, and it's not uncommon
for the offsetof to fit into 20 bits.  Unlike i386, you're not saving code
space by pre-computing the whole address into a register and using simple
(reg) and 4(reg) addresses later.  If the offsetof doesn't fit into
20 bits, then you do have to do an immediate load and add.
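
Something like (untested; uses the indexed tcg_out_e3 from above):

    tcg_target_long ofs =
        (is_store
         ? offsetof(CPUState, tlb_table[mem_index][0].addr_write)
         : offsetof(CPUState, tlb_table[mem_index][0].addr_read));

    if (ofs < 0x80000) {
        /* the whole offset fits in the 20-bit displacement */
        tcg_out_e3(s, E3_CG, arg0, arg1, TCG_AREG0, ofs);
    } else {
        tcg_out_movi(s, TCG_TYPE_PTR, TCG_REG_R13, ofs);
        tcg_out_b9(s, B9_AGR, TCG_REG_R13, TCG_AREG0);
        tcg_out_e3(s, E3_CG, arg0, arg1, TCG_REG_R13, 0);
    }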

> +    /* bras %r13, label2 */
> +    tcg_out32(s, 0xa7d50000);

Why BRAS and not BRC 15 (i.e. unconditional)?

> +    if (is_store) {
> +        tcg_out_e3(s, E3_LG, arg1, arg1, offsetof(CPUTLBEntry, addend) -
> +                                         offsetof(CPUTLBEntry, addr_write));
> +    } else {
> +        tcg_out_e3(s, E3_LG, arg1, arg1, offsetof(CPUTLBEntry, addend) -
> +                                         offsetof(CPUTLBEntry, addr_read));
> +    }
> +
> +#if TARGET_LONG_BITS == 32
> +    /* zero upper 32 bits */
> +    tcg_out_b9(s, B9_LLGFR, arg0, addr_reg);
> +#else
> +    /* just copy */
> +    tcg_out_b9(s, B9_LGR, arg0, addr_reg);
> +#endif
> +    tcg_out_b9(s, B9_AGR, arg0, arg1);

Followup:
The LG+AGR pair should be replaced with AG.

> +        tcg_out_movi(s, TCG_TYPE_PTR, TCG_REG_R13,
> +                     (tcg_target_ulong)qemu_st_helpers[s_bits]);
> +        tcg_out_rr(s, RR_BASR, TCG_REG_R14, TCG_REG_R13);

Followup:
Extract a common function to emit function calls.  Use BRASL
when possible, and it will be much of the time.
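
Something like (untested; the name is arbitrary):

static void tgen_calli(TCGContext *s, tcg_target_long dest)
{
    tcg_target_long off = (dest - (tcg_target_long)s->code_ptr) >> 1;

    if (off == (int32_t)off) {
        /* brasl %r14, dest -- 32-bit halfword-relative offset */
        tcg_out16(s, 0xc0e5);
        tcg_out32(s, off);
    } else {
        tcg_out_movi(s, TCG_TYPE_PTR, TCG_REG_R13, dest);
        tcg_out_rr(s, RR_BASR, TCG_REG_R14, TCG_REG_R13);
    }
}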

> +            tcg_out_sh64(s, SH64_SLLG, data_reg, arg0, SH64_REG_NONE, 56);
> +            tcg_out_sh64(s, SH64_SRAG, data_reg, data_reg, SH64_REG_NONE, 56);

Followup:
You'd do well to make separate ext{8,16,32}{s,u} functions 
for clarity, and to (hopefully) match most of the other ports.
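
E.g. (untested; names chosen to match the other ports):

static void tgen_ext8s(TCGContext *s, int dest, int src)
{
    tcg_out_sh64(s, SH64_SLLG, dest, src, SH64_REG_NONE, 56);
    tcg_out_sh64(s, SH64_SRAG, dest, dest, SH64_REG_NONE, 56);
}

static void tgen_ext8u(TCGContext *s, int dest, int src)
{
    tcg_out_sh64(s, SH64_SLLG, dest, src, SH64_REG_NONE, 56);
    tcg_out_sh64(s, SH64_SRLG, dest, dest, SH64_REG_NONE, 56);
}

static void tgen_ext32u(TCGContext *s, int dest, int src)
{
    tcg_out_b9(s, B9_LLGFR, dest, src);
}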

> +        /* swapped unsigned halfword load with upper bits zeroed */
> +        tcg_out_e3(s, E3_LRVH, data_reg, arg0, 0);
> +        tcg_out_movi(s, TCG_TYPE_PTR, TCG_REG_R13, 0xffffL);
> +        tcg_out_b9(s, B9_NGR, data_reg, 13);

Followup:
Note that LRVH does not modify bits 0-47.  So long as data_reg != arg0,
we can zero data_reg first with SR instead of masking afterward.
Similarly for LRV, since SR is smaller than LLGFR.
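
Something like (untested; assumes the constraints guarantee
data_reg != arg0):

    /* zero data_reg, then byte-swap into its low 16 bits */
    tcg_out_rr(s, RR_SR, data_reg, data_reg);
    tcg_out_e3(s, E3_LRVH, data_reg, arg0, 0);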

> +static void tcg_prepare_qemu_ldst(TCGContext* s, int data_reg, int addr_reg,
> +                                int mem_index, int opc,
> +                                uint16_t **label2_ptr_p, int is_store)
> +{
> +    /* user mode, no address translation required */
> +    *arg0 = addr_reg;

Bug: I think you need to zero-extend for TARGET_LONG_BITS == 32.

> +static void tcg_out_op(TCGContext *s, int opc,
> +               const TCGArg *args, const int *const_args)

Bug: The opc argument is now an enumeration type.

> +    case INDEX_op_exit_tb:
> +        dprintf("op 0x%x exit_tb 0x%lx 0x%lx 0x%lx\n",
> +                opc, args[0], args[1], args[2]);
> +        /* return value */
> +        tcg_out_movi(s, TCG_TYPE_PTR, TCG_REG_R2, args[0]);
> +        tcg_out_movi(s, TCG_TYPE_PTR, TCG_REG_R13, (unsigned long)tb_ret_addr);
> +        /* br %r13 */
> +        tcg_out16(s, S390_INS_BR | TCG_REG_R13);

Followup:
Use BRCL 15,tb_ret_addr.
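
i.e. (untested):

    tcg_target_long off = ((tcg_target_long)tb_ret_addr
                           - (tcg_target_long)s->code_ptr) >> 1;
    /* brcl 15, tb_ret_addr -- offset in halfwords */
    tcg_out16(s, 0xc0f4);
    tcg_out32(s, off);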

> +            /* load address stored at s->tb_next + args[0] */
> +            tcg_out_ld(s, TCG_TYPE_PTR, TCG_REG_R13, TCG_REG_R13, 0);
> +            /* and go there */
> +            tcg_out_rr(s, RR_BASR, TCG_REG_R13, TCG_REG_R13);

Followup:
Use BC 15,0(%r13).

> +    case INDEX_op_ld8u_i64:
> +        dprintf("op 0x%x ld8u_i32 0x%lx 0x%lx 0x%lx\n",
> +                opc, args[0], args[1], args[2]);
> +        if ((long)args[2] > -0x80000 && (long)args[2] < 0x7ffff) {
> +            tcg_out_e3(s, E3_LLGC, args[0], args[1], args[2]);
> +        } else {
> +            /* XXX displacement too large, have to calculate address manually */
> +            tcg_abort();
> +        }
> +        break;

The XXX can be taken care of by generalizing the example tcg_out_ld
I gave above.  Similarly for the other loads within tcg_out_op.

> +    case INDEX_op_ld8s_i32:
> +        dprintf("op 0x%x ld8s_i32 0x%lx 0x%lx 0x%lx\n",
> +                opc, args[0], args[1], args[2]);
> +        /* XXX */
> +        tcg_abort();

Use LGB.
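
i.e., assuming an E3_LGB define alongside the existing E3_* constants:

    tcg_out_e3(s, E3_LGB, args[0], args[1], args[2]);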

> +    case INDEX_op_ld16s_i32:
> +        dprintf("op 0x%x ld16s_i32 0x%lx 0x%lx 0x%lx\n",
> +                opc, args[0], args[1], args[2]);
> +        /* XXX */
> +        tcg_abort();

Use LGH.
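
i.e., assuming an E3_LGH define alongside the existing E3_* constants:

    tcg_out_e3(s, E3_LGH, args[0], args[1], args[2]);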

> +    case INDEX_op_add_i32:
> +        if (const_args[2]) {
> +            if (args[0] == args[1]) {
> +                tcg_out_a7(s, A7_AHI, args[1], args[2]);
> +            } else {
> +                tcg_out_rr(s, RR_LR, args[0], args[1]);
> +                tcg_out_a7(s, A7_AHI, args[0], args[2]);
> +            }
> +        } else if (args[0] == args[1]) {
> +            tcg_out_rr(s, RR_AR, args[1], args[2]);
> +        } else if (args[0] == args[2]) {
> +            tcg_out_rr(s, RR_AR, args[0], args[1]);
> +        } else {
> +            tcg_out_rr(s, RR_LR, args[0], args[1]);
> +            tcg_out_rr(s, RR_AR, args[0], args[2]);
> +        }
> +        break;

Followup:
Use LA/LAY to handle the full 3-operand add.  This gives us

  a0==a1 & a2 is s16    AHI
  a0!=a1 & a2 is s12    LA
           a2 is s20    LAY
  a0==a1 | a0==a2       AR/AGR
  a0!=a1 & a0!=a2       LA
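
The register-register LA case would look like (untested; LA is RX
format, opcode 0x41, and this assumes R0 is never allocated, since a
zero index or base field means "none"):

    /* la args[0], 0(args[1], args[2]) */
    tcg_out32(s, 0x41000000 | (args[0] << 20) | (args[1] << 16)
              | (args[2] << 12));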

> +    case INDEX_op_sub_i32:
> +        dprintf("op 0x%x sub_i32 0x%lx 0x%lx 0x%lx\n",
> +                opc, args[0], args[1], args[2]);
> +        if (args[0] == args[1]) {
> +            /* sr %ra0/1, %ra2 */
> +            tcg_out_rr(s, RR_SR, args[1], args[2]);
> +        } else if (args[0] == args[2]) {
> +            /* lr %r13, %raa0/2 */
> +            tcg_out_rr(s, RR_LR, TCG_REG_R13, args[2]);
> +            /* lr %ra0/2, %ra1 */
> +            tcg_out_rr(s, RR_LR, args[0], args[1]);
> +            /* sr %ra0/2, %r13 */
> +            tcg_out_rr(s, RR_SR, args[0], TCG_REG_R13);
> +        } else {
> +            /* lr %ra0, %ra1 */
> +            tcg_out_rr(s, RR_LR, args[0], args[1]);
> +            /* sr %ra0, %ra2 */
> +            tcg_out_rr(s, RR_SR, args[0], args[2]);
> +        }
> +        break;

This and most of the other logical operations should be
simplified by using matching constraints.
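
E.g.:

    { INDEX_op_sub_i32, { "r", "0", "r" } },

With the matching constraint, tcg.c guarantees args[0] == args[1] at
code generation time, and the single SR suffices.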

> +    case INDEX_op_neg_i32:
> +        dprintf("op 0x%x neg_i32 0x%lx 0x%lx 0x%lx\n",
> +                opc, args[0], args[1], args[2]);
> +        /* FIXME: optimize args[0] != args[1] case */
> +        tcg_out_rr(s, RR_LR, 13, args[1]);
> +        /* lghi %ra0, 0 */
> +        tcg_out32(s, S390_INS_LGHI | (args[0] << 20));
> +        tcg_out_rr(s, RR_SR, args[0], 13);

Use LCR/LCGR.
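
i.e., assuming an RR_LCR define alongside RR_SR and friends:

    tcg_out_rr(s, RR_LCR, args[0], args[1]);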

> +          /* msr %ra0, %ra2 */
> +          tcg_out32(s, S390_INS_MSR | (args[0] << 4) | args[2]);

Followup:
Use MHI/MGHI when appropriate.
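
e.g., assuming an A7_MHI define alongside A7_AHI:

    if (const_args[2] && args[2] == (int16_t)args[2]) {
        tcg_out_a7(s, A7_MHI, args[0], args[2]);
    }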

> +    case INDEX_op_divu_i32:
> +    case INDEX_op_remu_i32:
> +        dprintf("op 0x%x div/remu_i32 0x%lx 0x%lx 0x%lx\n",
> +                opc, args[0], args[1], args[2]);
> +        tcg_out_movi(s, TCG_TYPE_I32, TCG_REG_R12, 0);
> +        tcg_out_rr(s, RR_LR, TCG_REG_R13, args[1]);
> +        tcg_out_b9(s, B9_DLR, TCG_REG_R12, args[2]);
> +        if (opc == INDEX_op_divu_i32) {
> +          tcg_out_rr(s, RR_LR, args[0], TCG_REG_R13);        /* quotient */
> +        } else {
> +          tcg_out_rr(s, RR_LR, args[0], TCG_REG_R12);        /* remainder */
> +        }
> +        break;

You should be using INDEX_op_div{,u}2_i{32,64}, which produce
both division and remainder, as with x86.  One could also use
register constraints to force the use of, say, R2,R3 for the
outputs, which would then be copied as needed to the
destinations desired by TCG's register allocator.

Signed 32 and 64-bit division should be performed with DSGR.
Unsigned 64-bit division should be performed with DLGR.
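
E.g., with hypothetical constraint letters 'a' and 'b' pinned to R2
and R3 respectively:

    { INDEX_op_divu2_i32, { "b", "a", "0", "1", "r" } },

DLR then sees its even/odd pair in R2:R3, and the register allocator
copies quotient and remainder out as needed.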

> +    case INDEX_op_br:
> +        dprintf("op 0x%x br 0x%lx 0x%lx 0x%lx\n",
> +                opc, args[0], args[1], args[2]);
> +        l = &s->labels[args[0]];
> +        if (l->has_value) {
> +            tcg_out_movi(s, TCG_TYPE_PTR, TCG_REG_R13, l->u.value);
> +        } else {
> +            /* larl %r13, ... */
> +            tcg_out16(s, S390_INS_LARL | (TCG_REG_R13 << 4));
> +            tcg_out_reloc(s, s->code_ptr, R_390_PC32DBL, args[0], -2);
> +            s->code_ptr += 4;
> +        }
> +        tcg_out_rr(s, RR_BASR, TCG_REG_R13, TCG_REG_R13);
> +        break;

Followup:
There are no TB's that can't be handled by direct branches.
Use BRCL.  Similarly for the brcond code.
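
Something like (untested; reuses the R_390_PC32DBL reloc you already
emit for LARL):

static void tgen_branch(TCGContext *s, int cc, int labelno)
{
    TCGLabel *l = &s->labels[labelno];

    /* brcl cc, target */
    tcg_out16(s, 0xc004 | (cc << 4));
    if (l->has_value) {
        /* offset in halfwords, relative to the insn address */
        tcg_out32(s, (l->u.value - (tcg_target_long)s->code_ptr + 2) >> 1);
    } else {
        tcg_out_reloc(s, s->code_ptr, R_390_PC32DBL, labelno, -2);
        s->code_ptr += 4;
    }
}

The unconditional branch is then tgen_branch(s, 15, args[0]).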

> +#if defined(__s390x__)
> +    { INDEX_op_mov_i64, { "r", "r" } },
> +    { INDEX_op_movi_i64, { "r" } },

No need for the ifdef.  We always assume 64-bit.

> +    tcg_regset_set32(tcg_target_call_clobber_regs, 0,
> +                     (1 << TCG_REG_R0) |
> +                     (1 << TCG_REG_R1) |
> +                     (1 << TCG_REG_R2) |
> +                     (1 << TCG_REG_R3) |
> +                     (1 << TCG_REG_R4) |
> +                     (1 << TCG_REG_R5) |
> +                     (1 << TCG_REG_R14)); /* link register */

It's probably better to use R14 as a generic scratch register.
This, combined with using register constraints for division as
mentioned above, means that you can free up both R12 and R13 for
general use.

It's probably worth using a define to represent the scratch
register, so that it can easily be changed.  E.g. TCG_TMP0,
following TCG_AREG0.
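
E.g.:

#define TCG_TMP0   TCG_REG_R14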

> +    tcg_regset_clear(s->reserved_regs);
> +    /* frequently used as a temporary */
> +    tcg_regset_set_reg(s->reserved_regs, TCG_REG_R13);
> +    /* another temporary */
> +    tcg_regset_set_reg(s->reserved_regs, TCG_REG_R12);

Should mark R15 as reserved.

> +    /* XXX many insns can't be used with R0, so we better avoid it for now */
> +    /* TCG_REG_R0 */

This isn't the best way to prevent the register from being used.
Instead it should be marked as reserved.
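
i.e., alongside the entries above:

    /* many insns treat a 0 field as "no register" rather than R0 */
    tcg_regset_set_reg(s->reserved_regs, TCG_REG_R0);
    /* stack pointer */
    tcg_regset_set_reg(s->reserved_regs, TCG_REG_R15);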

> +static void tcg_out_mov(TCGContext *s, int ret, int arg)
>  {
> +    tcg_out_b9(s, B9_LGR, ret, arg);
>  }

This should be moved up so that it can be used throughout.

> +static void tcg_out_addi(TCGContext *s, int reg, tcg_target_long val)
>  {
>      tcg_abort();
>  }

If you're trying to make sure that tcg.c doesn't try to allocate
more stack for function call arguments, this is cute but deserves
a comment.

If not, it should be easily implementable by calling back into
tcg_out_ldst (described above) with LAY as the E3 opcode, perhaps
with a quick check for the applicability of AHI.
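
Something like (untested; assumes A7_AGHI and E3_LAY defines in the
style of the existing constants):

static void tcg_out_addi(TCGContext *s, int reg, tcg_target_long val)
{
    if (val == (int16_t)val) {
        tcg_out_a7(s, A7_AGHI, reg, val);
    } else {
        /* lay reg, val(reg), with tcg_out_ldst handling large val */
        tcg_out_ldst(s, 0, E3_LAY, reg, reg, val);
    }
}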

> +#define TCG_CT_CONST_U12 0x200

Unused, really, since you define no constraint for it.

> +#define E3_LG          0x04

All these should be moved into tcg-target.c.



r~