
Re: [Qemu-devel] [PATCH] Porting TCG to alpha platform


From: Richard Henderson
Subject: Re: [Qemu-devel] [PATCH] Porting TCG to alpha platform
Date: Tue, 19 Jan 2010 12:18:56 -0800
User-agent: Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.9.1.5) Gecko/20091209 Fedora/3.0-4.fc12 Thunderbird/3.0

On 01/19/2010 12:47 AM, identifier scorpio wrote:
I ported TCG to alpha platform, the patch is currently based on stable-0.10
branch, and now it can run linux-0.2.img testing image on my alpha XP1000
workstation. But it still can't run MS-Windows, and I hope someone, especially
those guys that are working on target-alpha, may help me to find the bugs.

Your patch is mangled. If you can't use sendmail directly, then attach the patch.

+/*
+ * $26 ~ $31 are special, reserved,
+ * and $25 is deliberately reserved for jcc operation
+ * and $0 is usually used for return function result, better allocate it later
+ * and $15 is used for cpu_env pointer, allocate it at last
+*/

I don't see any reason why $28 should be reserved, particularly. Although I'd use it as a temporary before anything else. I expect that you could get away with not reserving $26 and $27 as well, and simply know that it's free since only $16-$21 have values at the point of a call and all others have no live values.

+static const int tcg_target_reg_alloc_order[] = {
+    TCG_REG_1, TCG_REG_2, TCG_REG_3, TCG_REG_4, TCG_REG_5, TCG_REG_6,
+    TCG_REG_7, TCG_REG_8, TCG_REG_22,
+    TCG_REG_9, TCG_REG_10, TCG_REG_11, TCG_REG_12, TCG_REG_13, TCG_REG_14,
+    TCG_REG_16, TCG_REG_17, TCG_REG_18, TCG_REG_19, TCG_REG_20, TCG_REG_21
+};

Existing targets put call-saved registers at the front of the allocation order. From looking at tcg.c, I think the only way values actually get saved across calls is to already happen to be allocated to a call-saved register.
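Concretely, something along these lines (a sketch only; plain register numbers stand in for the TCG_REG_n macros, and the grouping assumes the standard alpha convention where $9-$14 are call-saved):

```c
#include <assert.h>

/* Sketch of a call-saved-first allocation order.  $9-$14 are call-saved
 * on alpha, so values allocated there survive helper calls for free;
 * call-clobbered temporaries follow, argument registers come last. */
static const int reg_alloc_order[] = {
    9, 10, 11, 12, 13, 14,          /* call-saved: try these first */
    1, 2, 3, 4, 5, 6, 7, 8, 22,     /* call-clobbered temporaries */
    16, 17, 18, 19, 20, 21,         /* argument registers, last */
};

static int is_call_saved(int r)
{
    return r >= 9 && r <= 14;
}
```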

+#define OP_ADDSUBCMP   0x10
+#define FUNC_ADDL      0x00
+#define FUNC_SUBL      0x09
+#define FUNC_ADDQ      0x20

Things might be a bit cleaner if you pre-assembled these values and passed only one value to the tcg_out_* routines. E.g.

#define INSN_ADDQ         ((0x10 << 26) | (0x20 << 5))

Compare the sparc port, for example.
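That is, something like the following (the INSN_* helper names are mine; field positions per the alpha operate format, opcode in bits 31:26 and function code in bits 11:5):

```c
#include <stdint.h>
#include <assert.h>

/* Pre-assemble opcode and function code into one value so each
 * tcg_out_* routine takes a single insn argument. */
#define INSN_OP(x)    (((uint32_t)(x) & 0x3f) << 26)
#define INSN_FUNC(x)  (((uint32_t)(x) & 0x7f) << 5)

#define INSN_ADDL  (INSN_OP(0x10) | INSN_FUNC(0x00))
#define INSN_SUBL  (INSN_OP(0x10) | INSN_FUNC(0x09))
#define INSN_ADDQ  (INSN_OP(0x10) | INSN_FUNC(0x20))
```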

+static int target_parse_constraint(TCGArgConstraint *ct, const char **pct_str)
+{
+    const char *ct_str = *pct_str;
+
+    switch(ct_str[0])
+    {
+    case 'r':
...
+    case 'L':

Do you really need extra temporaries for L?  You already have 3.

You'd also do well to define I,J,K constraints for constants (mirroring gcc's letters for clarity).
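For reference, a sketch of the constant predicates those letters would accept (from memory of the gcc alpha constraints, so double-check against gcc's alpha.h):

```c
#include <stdint.h>
#include <assert.h>

/* Constant-constraint predicates mirroring gcc's alpha letters:
 * 'I' an unsigned 8-bit literal (the operate-format LIT field),
 * 'J' the constant zero, 'K' a signed 16-bit constant (LDA range). */
static int const_ok_I(int64_t v) { return v >= 0 && v <= 0xff; }
static int const_ok_J(int64_t v) { return v == 0; }
static int const_ok_K(int64_t v) { return v == (int16_t)v; }
```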

+static inline void tcg_out_inst2(TCGContext *s, int Opcode, int Ra, int Disp)
+static inline void tcg_out_inst3_disp(TCGContext *s, int Opcode, int Ra, int Rb, int Disp)
+static inline void tcg_out_inst3_func(TCGContext *s, int Opcode, int Ra, int Rb, int Func, int Disp)
+static inline void tcg_out_inst4(TCGContext *s, int Opcode, int Ra, int Rb, int Func, int Rc)
+static inline void tcg_out_inst4i(TCGContext *s, int Opcode, int Ra, int Lit, int Func, int Rc)

inst2, inst3, inst4 aren't very descriptive.  How about nice names like

  tcg_out_fmt_mem  // memory format w/ displacement
  tcg_out_fmt_opr  // operate format w/ register
  tcg_out_fmt_opi  // operate format w/ immediate
  tcg_out_fmt_jmp  // jump instruction format

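For illustration, the encodings behind those names (my sketch of the alpha instruction formats; 'insn' is a pre-assembled opcode+function value as suggested above):

```c
#include <stdint.h>
#include <assert.h>

/* Alpha instruction formats.  'insn' carries the opcode (bits 31:26)
 * and, for operate format, the function code (bits 11:5). */

/* memory format: opcode, Ra, Rb, 16-bit displacement */
static uint32_t fmt_mem(uint32_t insn, int ra, int rb, int disp)
{
    return insn | (ra << 21) | (rb << 16) | (disp & 0xffff);
}

/* operate format, register operand: opcode, Ra, Rb, func, Rc */
static uint32_t fmt_opr(uint32_t insn, int ra, int rb, int rc)
{
    return insn | (ra << 21) | (rb << 16) | rc;
}

/* operate format, 8-bit literal: bit 12 set selects the immediate form */
static uint32_t fmt_opi(uint32_t insn, int ra, int lit, int rc)
{
    return insn | (ra << 21) | ((lit & 0xff) << 13) | (1 << 12) | rc;
}
```
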
+/*
+ * mov a 64-bit immediate 'arg' to register 'Ra', this function will
+ * generate fixed length (8 insns, 32 bytes) of target insn sequence.
+*/
+static void tcg_out_movi_fixl( \
+    TCGContext *s, TCGType type, int Ra, tcg_target_long arg)

Err.. "8 insns"? You'd only ever need to output 5. Also, why would you ever want to explicitly never elide one of these insns if you could? Say, if only L0 and L3 were non-zero?

+/*
+ * mov 64-bit immediate 'arg' to register 'Ra'. this function will
+ * generate variable length of target insn sequence.
+*/
+static inline void tcg_out_movi( \
+    TCGContext *s, TCGType type, int Ra, tcg_target_long arg)
+{
+    if (type == TCG_TYPE_I32) {
+        if ( arg != (int32_t)arg)
+            tcg_abort();

Well that's most likely a bug. If I32 then I'd consider the high bits garbage. I don't recall any promise that TCG sign-extends 32-bit inputs.

+    if (arg == 0) {
+        tcg_out_inst4(s, OP_LOGIC, Ra, Ra, FUNC_XOR, Ra);

Err, don't xor and create a false dependency on the previous contents. Either do a move from the zero register, or better don't special case this at all and generate the zero from ...

+    else if( arg == (int16_t)arg ) {
+        tcg_out_inst3_disp(s, OP_LDA, Ra, TCG_REG_31, arg );
+    }

... here.  Watch your formatting as well.

+    else if( arg == (int32_t)arg ) {
+        tcg_out_inst3_disp(s, OP_LDAH, Ra, TCG_REG_31, (arg>>16));
+        if( arg & ((tcg_target_ulong)0x8000) ) {
+            tcg_out_inst3_disp(s, OP_LDAH, Ra, Ra, 1);
+        }

You are emitting an unnecessary instruction most of the time here. You should increment the LDAH constant except for the case that it overflows.

However, I'd say that the best way to rearrange all of this is:

void tcg_out_opc_long(TCGContext *s, int opc,
                      int Ra, int Rb, tcg_target_long orig)
{
    int l0, l1a, l1b = 0, l2 = 0, l3 = 0;
    tcg_target_long val = orig;
    int Rs;

    switch (opc) {
    case INSN_STB:
    case INSN_STW:
    case INSN_STL:
    case INSN_STQ:
        assert(Ra != TCG_REG_28);
        Rs = TCG_REG_28;
        break;
    default:
        Rs = (Rb != Ra ? Ra : TCG_REG_28);
        break;
    }

    l0 = (int16_t)val;
    val = (val - l0) >> 16;
    l1a = (int16_t)val;

    if (orig >> 31 == -1 || orig >> 31 == 0) {
        if (l1a < 0 && orig >= 0) {
            l1b = 0x4000;
            l1a = (int16_t)(val - 0x4000);
        }
    } else {
        val = (val - l1a) >> 16;
        l2 = (int16_t)val;
        val = (val - l2) >> 16;
        l3 = (int16_t)val;

        assert(Rs != Rb);

        if (l3) {
            tcg_out_fmt_mem(s, INSN_LDAH, Rs, TCG_REG_ZERO, l3);
        }
        if (l2) {
            tcg_out_fmt_mem(s, INSN_LDA, Rs, Rs, l2);
        }
        tcg_out_fmt_opi(s, INSN_SLL, Rs, Rs, 32);
        if (Rb != TCG_REG_ZERO) {
            tcg_out_fmt_opr(s, INSN_ADDQ, Rs, Rs, Rb);
        }

        Rb = Rs;
    }

    if (l1a) {
        tcg_out_fmt_mem(s, INSN_LDAH, Rs, Rb, l1a);
        Rb = Rs;
    }
    if (l1b) {
        tcg_out_fmt_mem(s, INSN_LDAH, Rs, Rb, l1b);
        Rb = Rs;
    }

    if (l0 || opc != INSN_LDA || Rb != Ra) {
        tcg_out_fmt_mem(s, opc, Ra, Rb, l0);
    }
}

With a function like this you get

void tcg_out_movi(TCGContext *s, TCGType type, int Ra,
                  tcg_target_long c)
{
    if (type == TCG_TYPE_I32) {
        c = (int32_t)c;
    }
    tcg_out_opc_long(s, INSN_LDA, Ra, TCG_REG_ZERO, c);
}

void tcg_out_ld(TCGContext *s, TCGType type, int Ra, int Rb,
                tcg_target_long disp)
{
    tcg_out_opc_long(s, type == TCG_TYPE_I32 ? INSN_LDL : INSN_LDQ,
                     Ra, Rb, disp);
}

void tcg_out_st(TCGContext *s, TCGType type, int Ra, int Rb,
                tcg_target_long disp)
{
    tcg_out_opc_long(s, type == TCG_TYPE_I32 ? INSN_STL : INSN_STQ,
                     Ra, Rb, disp);
}

void tcg_out_addi(TCGContext *s, int Ra, tcg_target_long val)
{
    tcg_out_opc_long(s, INSN_LDA, Ra, Ra, val);
}

+static inline void tgen_arithi( \
+    TCGContext *s, int Opcode, int Func, int Ra, tcg_target_long val)
+{
+    if (val == (uint8_t)val) {
+        tcg_out_inst4i(s, Opcode, Ra, val, Func, Ra);
+    } else {
+        tcg_out_movi(s, TCG_TYPE_I64, TMP_REG1, val);
+        tcg_out_inst4(s, Opcode, Ra, TMP_REG1, Func, Ra);
+    }

With I/J constraints you don't need this special casing.

+static inline void tcg_out_push(TCGContext *s, int reg)
+static inline void tcg_out_pop(TCGContext *s, int reg)

You ought not need this, writing the prologue/epilogue properly. You certainly don't want to adjust the stack pointer that many times.

+static void tcg_out_br(TCGContext *s, int label_index)
+{
+    TCGLabel *l = &s->labels[label_index];
+
+    if (l->has_value) {
+        tcg_target_long val;
+        if ( l->u.value & 0x3)
+            tcg_abort();
+        val = ((tcg_target_long)(l->u.value) - (tcg_target_long)s->code_ptr - 4) >> 2;
+        if ( val >= -0x100000 && val < 0x100000) {
+            // if distance can be put into 21-bit field
+            tcg_out_inst2(s, OP_BR, TMP_REG1, val);
+       } else {
+            tcg_abort();
+       }
+    } else {
+        /* record relocation info */
+        tcg_out_reloc(s, s->code_ptr, R_ALPHA_REFQUAD, label_index, 0);
+        s->code_ptr += 4;

I realize that it doesn't really matter what value you use here, so long as things are consistent with patch_reloc, but it'll be less confusing if you use the proper relocation type: R_ALPHA_BRADDR.
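i.e. patch_reloc would stuff a 21-bit instruction-count displacement into the branch, something like this sketch:

```c
#include <stdint.h>
#include <assert.h>

/* Sketch of resolving an R_ALPHA_BRADDR relocation: the branch-format
 * displacement field (low 21 bits) holds (target - (pc + 4)) >> 2,
 * counted in instructions. */
static uint32_t patch_braddr(uint32_t insn, uint64_t pc, uint64_t target)
{
    int64_t disp = ((int64_t)(target - pc) - 4) >> 2;
    assert(disp >= -0x100000 && disp < 0x100000);
    return (insn & ~0x1fffffu) | ((uint32_t)disp & 0x1fffff);
}
```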

+static void tcg_out_brcond( TCGContext *s, int cond, \
+    TCGArg arg1, TCGArg arg2, int const_arg2, int label_index)
+{
+    int func, opc;
+    TCGLabel *l = &s->labels[label_index];
+
+    if ( cond < TCG_COND_EQ || cond > TCG_COND_GTU || const_arg2)
+        tcg_abort();
+
+    func = tcg_cond_to_jcc[cond];
+    tcg_out_inst4(s, OP_ADDSUBCMP, arg1, arg2, func, TMP_REG1);
+
+    // if cond is an odd number, TMP_REG1 = 0 means true
+    opc = (cond & 1) ? OP_BLBC : OP_BLBS;
+
+    if (l->has_value) {
+        tcg_target_long val;
+        if ( l->u.value & 3)
+            tcg_abort();
+        val = ((tcg_target_long)l->u.value - (tcg_target_long)s->code_ptr - 4) >> 2;
+        if ( val >= -0x100000 && val < 0x100000) {
+            // if distance can be put into 21-bit field
+            tcg_out_inst2(s, opc, TMP_REG1, val);
+       } else {
+            tcg_abort();
+       }
+    } else {
+        tcg_out_inst2(s, opc^4, TMP_REG1, 1);
+       /* record relocation info */
+        tcg_out_reloc(s, s->code_ptr, R_ALPHA_REFQUAD, label_index, 0);
+        s->code_ptr += 4;

Bug: You've applied the relocation to the wrong instruction.
Bug: What's with the "opc^4"?

You'd do well to avoid duplication between tcg_out_brcond and tcg_out_br by passing in the branch opcode to tcg_out_br.

+    /* if VM is of 32-bit arch, clear higher 32-bit of addr */

Use a zapnot insn for this.
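For reference, zapnot's byte-mask semantics as a simulation sketch: zapnot ra, 0x0f, rc keeps the low four bytes, i.e. a 32-bit zero-extension in a single insn.

```c
#include <stdint.h>
#include <assert.h>

/* zapnot: byte i of the result is kept when literal-mask bit i is set,
 * zeroed otherwise. */
static uint64_t zapnot(uint64_t v, unsigned mask)
{
    uint64_t r = 0;
    int i;
    for (i = 0; i < 8; i++) {
        if (mask & (1u << i)) {
            r |= v & (0xffull << (8 * i));
        }
    }
    return r;
}
```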

+    case INDEX_op_call:
+        if (const_args[0]) {
+            tcg_abort();
+       } else {
+            tcg_out_push( s, TCG_REG_26);
+            tcg_out_push( s, TCG_REG_15);
+            tcg_out_mov( s, TCG_REG_27, args[0]);
+            tcg_out_inst3_func(s, OP_CALL, TCG_REG_26, args[0], FUNC_CALL, 0);
+            tcg_out_pop( s, TCG_REG_15);
+            tcg_out_pop( s, TCG_REG_26);

You don't need to push/pop anything here. $26 should be saved by the prologue we emitted, and $15 is call-saved. What you could usefully do is define a register constraint for $27 so that TCG automatically loads the value into that register and saves you a register move here.

+    case INDEX_op_and_i32:
+    case INDEX_op_and_i64:
+        oc = OP_LOGIC;
+        c = FUNC_AND;
+        goto gen_arith;

You'd do well to recognize zapnot constants.  They are very common.
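A mask is zapnot-able exactly when every byte of it is 0x00 or 0xff; a sketch of the recognition test:

```c
#include <stdint.h>
#include <assert.h>

/* Return the zapnot literal for an AND mask (one bit per kept byte),
 * or -1 if the mask is not representable as a zapnot. */
static int zapnot_lit(uint64_t mask)
{
    int lit = 0, i;
    for (i = 0; i < 8; i++) {
        unsigned b = (mask >> (8 * i)) & 0xff;
        if (b == 0xff) {
            lit |= 1 << i;
        } else if (b != 0) {
            return -1;
        }
    }
    return lit;
}
```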

+        tcg_out_inst4i(s, OP_SHIFT, args[1], 32, FUNC_SLL, args[1]);
+        tcg_out_inst4i(s, OP_SHIFT, args[1], 32, FUNC_SRL, args[1]);

zapnot.

+    case INDEX_op_sar_i32:
+        tcg_out_inst4i(s, OP_SHIFT, args[1], 32, FUNC_SLL, args[1]);
+        tcg_out_inst4i(s, OP_SHIFT, args[1], 32, FUNC_SRA, args[1]);

That last shift can be combined with the requested shift via addition. For constant input, this saves an insn; for register input, the addition can be done in parallel with the first shift.
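That is, for sar_i32 the pair SLL 32; SRA (32 + c) gives the same result as sign-extend-then-shift, because the two right shifts fold into one:

```c
#include <stdint.h>
#include <assert.h>

/* sar_i32 with a fixed SLL and a single fused SRA: shifting left 32
 * then arithmetic-right (32 + c) equals sign-extending the low 32 bits
 * and shifting right by c.  Valid for 0 <= c < 32; high input bits are
 * garbage and get discarded by the left shift. */
static int64_t sar_i32(uint64_t x, unsigned c)
{
    return (int64_t)(x << 32) >> (32 + c);
}
```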

+    case INDEX_op_brcond_i32:
+        tcg_out_mov(s, TMP_REG2, args[0]);
+        tcg_out_mov(s, TMP_REG3, args[1]);
+        if ( args[2] >= TCG_COND_LTU && args[2] <= TCG_COND_GTU) {
+            tcg_out_inst4i(s, OP_SHIFT, TMP_REG2, 32, FUNC_SLL, TMP_REG2);
+            tcg_out_inst4i(s, OP_SHIFT, TMP_REG2, 32, FUNC_SRL, TMP_REG2);
+            tcg_out_inst4i(s, OP_SHIFT, TMP_REG3, 32, FUNC_SLL, TMP_REG3);
+            tcg_out_inst4i(s, OP_SHIFT, TMP_REG3, 32, FUNC_SRL, TMP_REG3);
+        } else {
+            tcg_out_inst4i(s, OP_SHIFT, TMP_REG2, 32, FUNC_SLL, TMP_REG2);
+            tcg_out_inst4i(s, OP_SHIFT, TMP_REG2, 32, FUNC_SRA, TMP_REG2);
+            tcg_out_inst4i(s, OP_SHIFT, TMP_REG3, 32, FUNC_SLL, TMP_REG3);
+            tcg_out_inst4i(s, OP_SHIFT, TMP_REG3, 32, FUNC_SRA, TMP_REG3);

For comparing 32-bit inputs, it doesn't actually matter how you extend the inputs, so long as you do it the same for both inputs. Therefore the best solution here is to sign-extend both inputs with "addl r,0,r". Note as well that you don't need temporaries, as the inputs only have 32-bits defined; high bits are garbage in, garbage out.
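The key point can be checked directly: sign-extending both 32-bit inputs preserves the unsigned ordering as well as the signed one, so a single extension serves every condition code:

```c
#include <stdint.h>
#include <assert.h>

/* Sign-extend as "addl r, 0, r" would. */
static uint64_t sext32(uint32_t a)
{
    return (uint64_t)(int64_t)(int32_t)a;
}

/* Unsigned 32-bit compare, done on sign-extended 64-bit values;
 * must agree with the plain 32-bit unsigned compare. */
static int ltu_via_sext(uint32_t a, uint32_t b)
{
    return sext32(a) < sext32(b);
}
```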

+    case INDEX_op_ext8s_i32:
+    case INDEX_op_ext8s_i64:
+        tcg_out_inst4(s, OP_SEXT, TCG_REG_31, args[1], FUNC_SEXTB, args[0]);
+        printf("ext8s met\n");
+        break;
+    case INDEX_op_ext16s_i32:
+    case INDEX_op_ext16s_i64:
+        tcg_out_inst4(s, OP_SEXT, TCG_REG_31, args[1], FUNC_SEXTW, args[0]);
+        printf("ext16s met\n");
+        break;

You'll also want to define INDEX_op_ext32s_i64 as "addl r,0,r".

+    case INDEX_op_div2_i32:
+    case INDEX_op_divu2_i32:

Don't define these, but you will need to define

  div_i32, divu_i32, rem_i32, remu_i32
  div_i64, divu_i64, rem_i64, remu_i64

The easiest way to do this is to emit calls to the compiler support routines directly. They have special call conventions, which should be easy enough to satisfy from within the code generator:

 * The alpha chip doesn't provide hardware division, so we have to do it
 * by hand.  The compiler expects the functions
 *
 *      __divqu: 64-bit unsigned long divide
 *      __remqu: 64-bit unsigned long remainder
 *      __divqs/__remqs: signed 64-bit
 *      __divlu/__remlu: unsigned 32-bit
 *      __divls/__remls: signed 32-bit
 *
 * These are not normal C functions: instead of the normal
 * calling sequence, these expect their arguments in registers
 * $24 and $25, and return the result in $27. Register $28 may
 * be clobbered (assembly temporary), anything else must be saved.
 * The return address is in $23.

There's a note in gcc sources re the 32-bit routines:

;; ??? Force sign-extension here because some versions of OSF/1 and
;; Interix/NT don't do the right thing if the inputs are not properly
;; sign-extended.  But Linux, for instance, does not have this
;; problem.  Is it worth the complication here to eliminate the sign
;; extension?

Note as well the register constraints from gcc:

  "a" = $24
  "b" = $25
  "c" = $27

All this suggests that $24, $25, $27 should *not* be reserved, so that TCG can put inputs into the proper places for both division and calls.

+    tcg_out_push(s, TCG_REG_26);
+    tcg_out_push(s, TCG_REG_27);
+    tcg_out_push(s, TCG_REG_28);
+    tcg_out_push(s, TCG_REG_29);

Of these only $26 needs to be saved.

What would be very interesting to do would be to set up a "constant pool" with all of the helper function addresses (including the division routines) and point $29 at it in the prologue. I suspect that this would significantly reduce the amount of code generated for calling the helpers.

Note that this $29 would have to be reloaded after calls. If you can prove that the constant pool address is within 32-bits of the entire code_gen_buffer, then you can reload $29 as the compiler does -- with an LDAH+LDA pair. Otherwise you could reserve a slot in the stack frame to hold the address of the constant pool, and reload $29 via a normal load from the stack frame.

Alternately, you could experiment with reserving another call-saved register and use that instead of $29. This call-saved register would not need to be reloaded after calls.

Note that if you don't use $29 for a constant pool, it's perfectly valid to use it as either a reserved temporary, or for allocation by TCG.


r~



