From: Michael S. Tsirkin
Subject: Re: [Qemu-devel] [PATCH v2] target-i386: present virtual L3 cache info for vcpus
Date: Thu, 1 Sep 2016 16:27:53 +0300
On Thu, Sep 01, 2016 at 02:58:05PM +0800, l00371263 wrote:
> From: "Longpeng(Mike)" <address@hidden>
>
> Some software algorithms depend on the hardware's cache topology. For
> example, in the x86 Linux kernel, when cpu1 wants to wake up a task on
> cpu2, cpu1 triggers a resched IPI and tells cpu2 to do the wakeup itself
> if they don't share a low-level cache; conversely, cpu1 accesses cpu2's
> runqueue directly if they do share the llc.
> The relevant linux-kernel code is below:
>
> static void ttwu_queue(struct task_struct *p, int cpu)
> {
>     struct rq *rq = cpu_rq(cpu);
>     ......
>     if (... && !cpus_share_cache(smp_processor_id(), cpu)) {
>         ......
>         ttwu_queue_remote(p, cpu); /* will trigger RES IPI */
>         return;
>     }
>     ......
>     ttwu_do_activate(rq, p, 0); /* access target's rq directly */
>     ......
> }
>
> On real hardware, the cpus on the same socket share the L3 cache, so one
> cpu won't trigger resched IPIs when waking up a task on another. But QEMU
> doesn't present virtual L3 cache info to the VM, so the Linux guest will
> trigger lots of RES IPIs under some workloads, even if the virtual cpus
> belong to the same virtual socket.
>
> For KVM, this degrades performance, because the IPIs the guest sends
> cause lots of vmexits.
>
> The workload is SAP HANA's testsuite; we run it for one round (about 40
> minutes) and count the RES IPIs triggered in the (SUSE 11 SP3) guest
> during that period:
>
>            No-L3       With-L3 (this patch applied)
> cpu0:     363890       44582
> cpu1:     373405       43109
> cpu2:     340783       43797
> cpu3:     333854       43409
> cpu4:     327170       40038
> cpu5:     325491       39922
> cpu6:     319129       42391
> cpu7:     306480       41035
> cpu8:     161139       32188
> cpu9:     164649       31024
> cpu10:    149823       30398
> cpu11:    149823       32455
> cpu12:    164830       35143
> cpu13:    172269       35805
> cpu14:    179979       33898
> cpu15:    194505       32754
> avg:    268963.6       40129.8
>
> The VM's topology is "1*socket 8*cores 2*threads".
> After presenting virtual L3 cache info to the VM, the number of RES IPIs
> in the guest drops by 85%.
>
> We also tested overall system performance when the vcpus actually run on
> separate physical sockets. With the L3 cache, performance improves by
> 7.2%~33.1% (avg: 15.7%).
Any idea why? I'm guessing that on bare metal, it is
sometimes cheaper to send IPIs with a separate cache, but on KVM,
it is always cheaper to use memory, as this reduces the # of exits.
Is this it?
It's worth listing here so that e.g. if it ever becomes possible to send
IPIs without exits, we know we need to change this code.
> Signed-off-by: Longpeng(Mike) <address@hidden>
> ---
> Hi Eduardo,
>
> Changes since v1:
> - fix the compat problem: set compat_props on PC_COMPAT_2_7.
> - fix a "intentionally introducde bug": make intel's and amd's consistently.
> - fix the CPUID.(EAX=4, ECX=3):EAX[25:14].
> - test the performance when vcpus run on separate sockets: with L3 cache,
>   the performance improves by 7.2%~33.1% (avg: 15.7%).
> ---
>  include/hw/i386/pc.h |  8 ++++++++
>  target-i386/cpu.c    | 49 ++++++++++++++++++++++++++++++++++++++++++++-----
>  target-i386/cpu.h    |  3 +++
>  3 files changed, 55 insertions(+), 5 deletions(-)
>
> diff --git a/include/hw/i386/pc.h b/include/hw/i386/pc.h
> index 74c175c..6072625 100644
> --- a/include/hw/i386/pc.h
> +++ b/include/hw/i386/pc.h
> @@ -367,7 +367,15 @@ int e820_add_entry(uint64_t, uint64_t, uint32_t);
>  int e820_get_num_entries(void);
>  bool e820_get_entry(int, uint32_t, uint64_t *, uint64_t *);
>  
> +#define PC_COMPAT_2_7 \
> +    {\
> +        .driver   = TYPE_X86_CPU,\
> +        .property = "compat-cache",\
> +        .value    = "on",\
> +    },
> +
>  #define PC_COMPAT_2_6 \
> +    PC_COMPAT_2_7 \
>      HW_COMPAT_2_6 \
>      {\
>          .driver   = "fw_cfg_io",\
Could this get a more informative name?
E.g. l3-cache-shared ?
> diff --git a/target-i386/cpu.c b/target-i386/cpu.c
> index 6a1afab..224d967 100644
> --- a/target-i386/cpu.c
> +++ b/target-i386/cpu.c
> @@ -57,6 +57,7 @@
>  #define CPUID_2_L1D_32KB_8WAY_64B 0x2c
>  #define CPUID_2_L1I_32KB_8WAY_64B 0x30
>  #define CPUID_2_L2_2MB_8WAY_64B   0x7d
> +#define CPUID_2_L3_16MB_16WAY_64B 0x4d
>  
>  
>  /* CPUID Leaf 4 constants: */
> @@ -131,11 +132,18 @@
>  #define L2_LINES_PER_TAG 1
>  #define L2_SIZE_KB_AMD 512
>  
> -/* No L3 cache: */
> +/* Level 3 unified cache: */
>  #define L3_SIZE_KB 0 /* disabled */
>  #define L3_ASSOCIATIVITY 0 /* disabled */
>  #define L3_LINES_PER_TAG 0 /* disabled */
>  #define L3_LINE_SIZE 0 /* disabled */
> +#define L3_N_LINE_SIZE 64
> +#define L3_N_ASSOCIATIVITY 16
> +#define L3_N_SETS 16384
> +#define L3_N_PARTITIONS 1
> +#define L3_N_DESCRIPTOR CPUID_2_L3_16MB_16WAY_64B
> +#define L3_N_LINES_PER_TAG 1
> +#define L3_N_SIZE_KB_AMD 16384
>
> /* TLB definitions: */
>
> @@ -2275,6 +2283,7 @@ void cpu_x86_cpuid(CPUX86State *env, uint32_t index, uint32_t count,
>  {
>      X86CPU *cpu = x86_env_get_cpu(env);
>      CPUState *cs = CPU(cpu);
> +    uint32_t pkg_offset;
>  
>      /* test if maximum index reached */
>      if (index & 0x80000000) {
> @@ -2328,7 +2337,11 @@ void cpu_x86_cpuid(CPUX86State *env, uint32_t index, uint32_t count,
>          }
>          *eax = 1; /* Number of CPUID[EAX=2] calls required */
>          *ebx = 0;
> -        *ecx = 0;
> +        if (cpu->enable_compat_cache) {
> +            *ecx = 0;
> +        } else {
> +            *ecx = L3_N_DESCRIPTOR;
> +        }
>          *edx = (L1D_DESCRIPTOR << 16) | \
>                 (L1I_DESCRIPTOR << 8) | \
>                 (L2_DESCRIPTOR);
> @@ -2374,6 +2387,25 @@ void cpu_x86_cpuid(CPUX86State *env, uint32_t index, uint32_t count,
>                  *ecx = L2_SETS - 1;
>                  *edx = CPUID_4_NO_INVD_SHARING;
>                  break;
> +            case 3: /* L3 cache info */
> +                if (cpu->enable_compat_cache) {
> +                    *eax = 0;
> +                    *ebx = 0;
> +                    *ecx = 0;
> +                    *edx = 0;
> +                    break;
> +                }
> +                *eax |= CPUID_4_TYPE_UNIFIED | \
> +                        CPUID_4_LEVEL(3) | \
> +                        CPUID_4_SELF_INIT_LEVEL;
> +                pkg_offset = apicid_pkg_offset(cs->nr_cores, cs->nr_threads);
> +                *eax |= ((1 << pkg_offset) - 1) << 14;
> +                *ebx = (L3_N_LINE_SIZE - 1) | \
> +                       ((L3_N_PARTITIONS - 1) << 12) | \
> +                       ((L3_N_ASSOCIATIVITY - 1) << 22);
> +                *ecx = L3_N_SETS - 1;
> +                *edx = CPUID_4_INCLUSIVE | CPUID_4_COMPLEX_IDX;
> +                break;
>              default: /* end of info */
>                  *eax = 0;
>                  *ebx = 0;
> @@ -2585,9 +2617,15 @@ void cpu_x86_cpuid(CPUX86State *env, uint32_t index, uint32_t count,
>          *ecx = (L2_SIZE_KB_AMD << 16) | \
>                 (AMD_ENC_ASSOC(L2_ASSOCIATIVITY) << 12) | \
>                 (L2_LINES_PER_TAG << 8) | (L2_LINE_SIZE);
> -        *edx = ((L3_SIZE_KB/512) << 18) | \
> -               (AMD_ENC_ASSOC(L3_ASSOCIATIVITY) << 12) | \
> -               (L3_LINES_PER_TAG << 8) | (L3_LINE_SIZE);
> +        if (cpu->enable_compat_cache) {
> +            *edx = ((L3_SIZE_KB / 512) << 18) | \
> +                   (AMD_ENC_ASSOC(L3_ASSOCIATIVITY) << 12) | \
> +                   (L3_LINES_PER_TAG << 8) | (L3_LINE_SIZE);
> +        } else {
> +            *edx = ((L3_N_SIZE_KB_AMD / 512) << 18) | \
> +                   (AMD_ENC_ASSOC(L3_N_ASSOCIATIVITY) << 12) | \
> +                   (L3_N_LINES_PER_TAG << 8) | (L3_N_LINE_SIZE);
> +        }
>          break;
>      case 0x80000007:
>          *eax = 0;
> @@ -3364,6 +3402,7 @@ static Property x86_cpu_properties[] = {
>      DEFINE_PROP_STRING("hv-vendor-id", X86CPU, hyperv_vendor_id),
>      DEFINE_PROP_BOOL("cpuid-0xb", X86CPU, enable_cpuid_0xb, true),
>      DEFINE_PROP_BOOL("lmce", X86CPU, enable_lmce, false),
> +    DEFINE_PROP_BOOL("compat-cache", X86CPU, enable_compat_cache, false),
>      DEFINE_PROP_END_OF_LIST()
>  };
>
> diff --git a/target-i386/cpu.h b/target-i386/cpu.h
> index 65615c0..61ef4e3 100644
> --- a/target-i386/cpu.h
> +++ b/target-i386/cpu.h
> @@ -1202,6 +1202,9 @@ struct X86CPU {
>       */
>      bool enable_lmce;
>  
> +    /* Compatibility bits for old machine types */
> +    bool enable_compat_cache;
> +
>      /* Compatibility bits for old machine types: */
>      bool enable_cpuid_0xb;
>
> --
> 1.8.3.1
>