|
From: | Richard Henderson |
Subject: | Re: [PATCH v3 02/13] tcg/s390x: Remove TCG_REG_TB |
Date: | Tue, 6 Dec 2022 16:22:39 -0600 |
User-agent: | Mozilla/5.0 (X11; Linux x86_64; rv:102.0) Gecko/20100101 Thunderbird/102.4.2 |
On 12/6/22 13:29, Ilya Leoshkevich wrote:
On Thu, Dec 01, 2022 at 10:51:49PM -0800, Richard Henderson wrote:This reverts 829e1376d940 ("tcg/s390: Introduce TCG_REG_TB"), and several follow-up patches. The primary motivation is to reduce the less-tested code paths, pre-z10. Secondarily, this allows the unconditional use of TCG_TARGET_HAS_direct_jump, which might be more important for performance than any slight increase in code size. Signed-off-by: Richard Henderson <richard.henderson@linaro.org> --- tcg/s390x/tcg-target.h | 2 +- tcg/s390x/tcg-target.c.inc | 176 +++++-------------------------------- 2 files changed, 23 insertions(+), 155 deletions(-)Reviewed-by: Ilya Leoshkevich <iii@linux.ibm.com> I have a few questions/ideas for the future below.diff --git a/tcg/s390x/tcg-target.h b/tcg/s390x/tcg-target.h index 22d70d431b..645f522058 100644 --- a/tcg/s390x/tcg-target.h +++ b/tcg/s390x/tcg-target.h @@ -103,7 +103,7 @@ extern uint64_t s390_facilities[3]; #define TCG_TARGET_HAS_mulsh_i32 0 #define TCG_TARGET_HAS_extrl_i64_i32 0 #define TCG_TARGET_HAS_extrh_i64_i32 0 -#define TCG_TARGET_HAS_direct_jump HAVE_FACILITY(GEN_INST_EXT) +#define TCG_TARGET_HAS_direct_jump 1This change doesn't seem to affect that, but what is the minimum supported s390x qemu host? z900?
Possibly z990, if I'm reading the gcc processor_flags_table[] correctly; long-displacement-facility is definitely a minimum.
We probably should revisit what the minimum for TCG should be, assert those features at startup, and drop the corresponding runtime tests.
I did some benchmarking of various ways to load constants in context of GCC in the past, and it turned out that LLIHF+OILF is more efficient than literal pool [1].
Interesting. If we include extended-immediate-facility (base_GEN9_GA1, z9-109?) in the bare minimum that would definitely simplify a few things.
- /* Use the constant pool if USE_REG_TB, but not for small constants. */ - if (maybe_out_small_movi(s, type, TCG_TMP0, val)) { - if (type == TCG_TYPE_I32) { - tcg_out_insn(s, RR, XR, dest, TCG_TMP0); - } else { - tcg_out_insn(s, RRE, XGR, dest, TCG_TMP0); - } - } else if (USE_REG_TB) { - tcg_out_insn(s, RXY, XG, dest, TCG_REG_TB, TCG_REG_NONE, 0); - new_pool_label(s, val, R_390_20, s->code_ptr - 2, - tcg_tbrel_diff(s, NULL)); + tcg_out_movi(s, type, TCG_TMP0, val); + if (type == TCG_TYPE_I32) { + tcg_out_insn(s, RR, XR, dest, TCG_TMP0); } else { - /* Perform the xor by parts. */ - tcg_debug_assert(HAVE_FACILITY(EXT_IMM)); - if (val & 0xffffffff) { - tcg_out_insn(s, RIL, XILF, dest, val); - } - if (val > 0xffffffff) { - tcg_out_insn(s, RIL, XIHF, dest, val >> 32); - } + tcg_out_insn(s, RRE, XGR, dest, TCG_TMP0); } }Wouldn't it be worth keeping XILF/XIFH here?
I don't know. It's difficult for me to guess whether a dependency chain like val -> xor -> xor (3 insns with serial dependencies) is better than val --> xor load -/(3 insns, but only one serial dependency) is better. But there may also be instruction fusion going on at the micro-architectural level, so that there's really only one xor.
If you have suggestions, I'm all ears.
I don't have any numbers right now, but it looks more compact/efficient than a load + XGR.
If we assume general-instruction-extension-facility (z10?), LGRL + XGR is smaller than XILF + XIFH, ignoring the constant pool entry which might be shared, and modulo the µarch questions above.
Same for OGR above; I even wonder if both implementations could be unified.
Sadly not, because of OILL et al. There are no 16-bit xor immediate insns.
+ /* + * branch displacement must be aligned for atomic patching; + * see if we need to add extra nop before branch + */ + if (!QEMU_PTR_IS_ALIGNED(s->code_ptr + 1, 4)) { + tcg_out16(s, NOP); } + tcg_out16(s, RIL_BRCL | (S390_CC_ALWAYS << 4)); + s->tb_jmp_insn_offset[a0] = tcg_current_code_size(s); + tcg_out32(s, 0); set_jmp_reset_offset(s, a0);This seems to work in practice, but I don't think patching branch offsets is allowed by PoP (in a multi-threaded environment). For example, I had to do [2] in order to work around this limitation in ftrace.
Really? How does the processor distinguish between overwriting opcode/condition vs overwriting immediate operand when invalidating cached instructions?
If overwriting operand truly isn't correct, then I think we have to use indirect branch always for goto_tb.
A third benefit seems to be that we now have one more register to allocate.
Yes. It's call clobbered, so it isn't live so often, but sometimes. r~
[Prev in Thread] | Current Thread | [Next in Thread] |