From: Michael S. Tsirkin
Subject: Re: [Qemu-devel] [PATCH v2] target-i386: present virtual L3 cache info for vcpus
Date: Thu, 1 Sep 2016 16:27:53 +0300
On Thu, Sep 01, 2016 at 02:58:05PM +0800, l00371263 wrote:
> From: "Longpeng(Mike)" <address@hidden>
>
> Some software algorithms depend on the hardware's cache topology. For
> example, in the x86 Linux kernel, when cpu1 wants to wake up a task on
> cpu2, cpu1 triggers a resched IPI and tells cpu2 to do the wakeup itself
> if they don't share a low-level cache; conversely, cpu1 accesses cpu2's
> runqueue directly if they do share the llc.
> The relevant linux-kernel code is below:
>
> static void ttwu_queue(struct task_struct *p, int cpu)
> {
>     struct rq *rq = cpu_rq(cpu);
>     ......
>     if (... && !cpus_share_cache(smp_processor_id(), cpu)) {
>         ......
>         ttwu_queue_remote(p, cpu); /* will trigger RES IPI */
>         return;
>     }
>     ......
>     ttwu_do_activate(rq, p, 0); /* access target's rq directly */
>     ......
> }
>
> On real hardware, the cpus on the same socket share the L3 cache, so one
> cpu won't trigger resched IPIs when waking up a task on another. But QEMU
> doesn't present virtual L3 cache info to the VM, so the Linux guest will
> trigger lots of RES IPIs under some workloads, even if the virtual cpus
> belong to the same virtual socket.
>
> For KVM, this degrades performance, because the IPIs the guest sends
> cause lots of vmexits.
>
> The workload is SAP HANA's testsuite; we run it for one round (about 40
> minutes) and count the RES IPIs triggered in the (SUSE 11 SP3) guest
> during that period:
>
>            No-L3       With-L3 (this patch applied)
> cpu0:     363890       44582
> cpu1:     373405       43109
> cpu2:     340783       43797
> cpu3:     333854       43409
> cpu4:     327170       40038
> cpu5:     325491       39922
> cpu6:     319129       42391
> cpu7:     306480       41035
> cpu8:     161139       32188
> cpu9:     164649       31024
> cpu10:    149823       30398
> cpu11:    149823       32455
> cpu12:    164830       35143
> cpu13:    172269       35805
> cpu14:    179979       33898
> cpu15:    194505       32754
> avg:    268963.6       40129.8
>
> The VM's topology is "1*socket 8*cores 2*threads".
> After presenting virtual L3 cache info to the VM, the number of RES IPIs
> in the guest drops by 85%.
>
> We also tested overall system performance when the vcpus actually run on
> separate physical sockets. With the L3 cache, performance improves by
> 7.2%~33.1% (avg: 15.7%).
Any idea why? I'm guessing that on bare metal, it is
sometimes cheaper to send IPIs with a separate cache, but on KVM,
it is always cheaper to use memory, as this reduces the # of exits.
Is this it?
It's worth listing here so that e.g. if it ever becomes possible to send
IPIs without exits, we know we need to change this code.
> Signed-off-by: Longpeng(Mike) <address@hidden>
> ---
> Hi Eduardo,
>
> Changes since v1:
> - fix the compat problem: set compat_props on PC_COMPAT_2_7.
> - fix a "intentionally introducde bug": make intel's and amd's consistently.
> - fix the CPUID.(EAX=4, ECX=3):EAX[25:14].
> - test the performance when vcpus run on separate sockets: with L3 cache,
>   the performance improves by 7.2%~33.1% (avg: 15.7%).
> ---
>  include/hw/i386/pc.h |  8 ++++++++
>  target-i386/cpu.c    | 49 ++++++++++++++++++++++++++++++++++++++++++++-----
>  target-i386/cpu.h    |  3 +++
>  3 files changed, 55 insertions(+), 5 deletions(-)
>
> diff --git a/include/hw/i386/pc.h b/include/hw/i386/pc.h
> index 74c175c..6072625 100644
> --- a/include/hw/i386/pc.h
> +++ b/include/hw/i386/pc.h
> @@ -367,7 +367,15 @@ int e820_add_entry(uint64_t, uint64_t, uint32_t);
>  int e820_get_num_entries(void);
>  bool e820_get_entry(int, uint32_t, uint64_t *, uint64_t *);
>  
> +#define PC_COMPAT_2_7 \
> +    {\
> +        .driver   = TYPE_X86_CPU,\
> +        .property = "compat-cache",\
> +        .value    = "on",\
> +    },
> +
>  #define PC_COMPAT_2_6 \
> +    PC_COMPAT_2_7 \
>      HW_COMPAT_2_6 \
>      {\
>          .driver   = "fw_cfg_io",\
Could this get a more informative name?
E.g. l3-cache-shared ?
> diff --git a/target-i386/cpu.c b/target-i386/cpu.c
> index 6a1afab..224d967 100644
> --- a/target-i386/cpu.c
> +++ b/target-i386/cpu.c
> @@ -57,6 +57,7 @@
>  #define CPUID_2_L1D_32KB_8WAY_64B 0x2c
>  #define CPUID_2_L1I_32KB_8WAY_64B 0x30
>  #define CPUID_2_L2_2MB_8WAY_64B   0x7d
> +#define CPUID_2_L3_16MB_16WAY_64B 0x4d
>  
>  
>  /* CPUID Leaf 4 constants: */
> @@ -131,11 +132,18 @@
>  #define L2_LINES_PER_TAG 1
>  #define L2_SIZE_KB_AMD 512
>  
> -/* No L3 cache: */
> +/* Level 3 unified cache: */
>  #define L3_SIZE_KB 0 /* disabled */
>  #define L3_ASSOCIATIVITY 0 /* disabled */
>  #define L3_LINES_PER_TAG 0 /* disabled */
>  #define L3_LINE_SIZE 0 /* disabled */
> +#define L3_N_LINE_SIZE 64
> +#define L3_N_ASSOCIATIVITY 16
> +#define L3_N_SETS 16384
> +#define L3_N_PARTITIONS 1
> +#define L3_N_DESCRIPTOR CPUID_2_L3_16MB_16WAY_64B
> +#define L3_N_LINES_PER_TAG 1
> +#define L3_N_SIZE_KB_AMD 16384
>
> /* TLB definitions: */
>
> @@ -2275,6 +2283,7 @@ void cpu_x86_cpuid(CPUX86State *env, uint32_t index, uint32_t count,
>  {
>      X86CPU *cpu = x86_env_get_cpu(env);
>      CPUState *cs = CPU(cpu);
> +    uint32_t pkg_offset;
>  
>      /* test if maximum index reached */
>      if (index & 0x80000000) {
> @@ -2328,7 +2337,11 @@ void cpu_x86_cpuid(CPUX86State *env, uint32_t index, uint32_t count,
>          }
>          *eax = 1; /* Number of CPUID[EAX=2] calls required */
>          *ebx = 0;
> -        *ecx = 0;
> +        if (cpu->enable_compat_cache) {
> +            *ecx = 0;
> +        } else {
> +            *ecx = L3_N_DESCRIPTOR;
> +        }
>          *edx = (L1D_DESCRIPTOR << 16) | \
>                 (L1I_DESCRIPTOR << 8) | \
>                 (L2_DESCRIPTOR);
> @@ -2374,6 +2387,25 @@ void cpu_x86_cpuid(CPUX86State *env, uint32_t index, uint32_t count,
>                  *ecx = L2_SETS - 1;
>                  *edx = CPUID_4_NO_INVD_SHARING;
>                  break;
> +            case 3: /* L3 cache info */
> +                if (cpu->enable_compat_cache) {
> +                    *eax = 0;
> +                    *ebx = 0;
> +                    *ecx = 0;
> +                    *edx = 0;
> +                    break;
> +                }
> +                *eax |= CPUID_4_TYPE_UNIFIED | \
> +                        CPUID_4_LEVEL(3) | \
> +                        CPUID_4_SELF_INIT_LEVEL;
> +                pkg_offset = apicid_pkg_offset(cs->nr_cores, cs->nr_threads);
> +                *eax |= ((1 << pkg_offset) - 1) << 14;
> +                *ebx = (L3_N_LINE_SIZE - 1) | \
> +                       ((L3_N_PARTITIONS - 1) << 12) | \
> +                       ((L3_N_ASSOCIATIVITY - 1) << 22);
> +                *ecx = L3_N_SETS - 1;
> +                *edx = CPUID_4_INCLUSIVE | CPUID_4_COMPLEX_IDX;
> +                break;
>              default: /* end of info */
>                  *eax = 0;
>                  *ebx = 0;
> @@ -2585,9 +2617,15 @@ void cpu_x86_cpuid(CPUX86State *env, uint32_t index, uint32_t count,
>          *ecx = (L2_SIZE_KB_AMD << 16) | \
>                 (AMD_ENC_ASSOC(L2_ASSOCIATIVITY) << 12) | \
>                 (L2_LINES_PER_TAG << 8) | (L2_LINE_SIZE);
> -        *edx = ((L3_SIZE_KB/512) << 18) | \
> -               (AMD_ENC_ASSOC(L3_ASSOCIATIVITY) << 12) | \
> -               (L3_LINES_PER_TAG << 8) | (L3_LINE_SIZE);
> +        if (cpu->enable_compat_cache) {
> +            *edx = ((L3_SIZE_KB / 512) << 18) | \
> +                   (AMD_ENC_ASSOC(L3_ASSOCIATIVITY) << 12) | \
> +                   (L3_LINES_PER_TAG << 8) | (L3_LINE_SIZE);
> +        } else {
> +            *edx = ((L3_N_SIZE_KB_AMD / 512) << 18) | \
> +                   (AMD_ENC_ASSOC(L3_N_ASSOCIATIVITY) << 12) | \
> +                   (L3_N_LINES_PER_TAG << 8) | (L3_N_LINE_SIZE);
> +        }
>          break;
>      case 0x80000007:
>          *eax = 0;
> @@ -3364,6 +3402,7 @@ static Property x86_cpu_properties[] = {
>      DEFINE_PROP_STRING("hv-vendor-id", X86CPU, hyperv_vendor_id),
>      DEFINE_PROP_BOOL("cpuid-0xb", X86CPU, enable_cpuid_0xb, true),
>      DEFINE_PROP_BOOL("lmce", X86CPU, enable_lmce, false),
> +    DEFINE_PROP_BOOL("compat-cache", X86CPU, enable_compat_cache, false),
>      DEFINE_PROP_END_OF_LIST()
>  };
>
> diff --git a/target-i386/cpu.h b/target-i386/cpu.h
> index 65615c0..61ef4e3 100644
> --- a/target-i386/cpu.h
> +++ b/target-i386/cpu.h
> @@ -1202,6 +1202,9 @@ struct X86CPU {
>       */
>      bool enable_lmce;
>  
> +    /* Compatibility bits for old machine types */
> +    bool enable_compat_cache;
> +
>      /* Compatibility bits for old machine types: */
>      bool enable_cpuid_0xb;
>
> --
> 1.8.3.1
>