Re: [Qemu-devel] [PATCH v3] target-i386: present virtual L3 cache info for vcpus


From: Longpeng (Mike)
Subject: Re: [Qemu-devel] [PATCH v3] target-i386: present virtual L3 cache info for vcpus
Date: Mon, 5 Sep 2016 09:16:38 +0800
User-agent: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:11.0) Gecko/20120327 Thunderbird/11.0.1

Hi Michael,

On 2016/9/3 6:52, Michael S. Tsirkin wrote:

> On Fri, Sep 02, 2016 at 10:22:55AM +0800, Longpeng(Mike) wrote:
>> From: "Longpeng(Mike)" <address@hidden>
>>
>> Some software algorithms are based on the hardware's cache info. For example,
>> in the x86 Linux kernel, when cpu1 wants to wake up a task on cpu2, cpu1
>> triggers a resched IPI and tells cpu2 to do the wakeup if they don't share a
>> low-level cache. Conversely, cpu1 accesses cpu2's runqueue directly if they
>> share the LLC. The relevant Linux kernel code is shown below:
>>
>>      static void ttwu_queue(struct task_struct *p, int cpu)
>>      {
>>              struct rq *rq = cpu_rq(cpu);
>>              ......
>>              if (... && !cpus_share_cache(smp_processor_id(), cpu)) {
>>                      ......
>>                      ttwu_queue_remote(p, cpu); /* will trigger RES IPI */
>>                      return;
>>              }
>>              ......
>>              ttwu_do_activate(rq, p, 0); /* access target's rq directly */
>>              ......
>>      }
>>
>> In real hardware, the cpus on the same socket share the L3 cache, so one
>> won't trigger resched IPIs when waking up a task on another. But QEMU doesn't
>> present virtual L3 cache info for the VM, so the Linux guest triggers lots of
>> RES IPIs under some workloads, even if the virtual cpus belong to the same
>> virtual socket.
>>
>> For KVM, this degrades performance, because guest-sent IPIs cause lots of
>> vmexits.
>>
>> The workload is a SAP HANA testsuite; we ran it for one round (about 40
>> minutes) and observed the number of RES IPIs triggered in the (SUSE 11 SP3)
>> guest during that period:
>>
>>         No-L3           With-L3(applied this patch)
>> cpu0:        363890          44582
>> cpu1:        373405          43109
>> cpu2:        340783          43797
>> cpu3:        333854          43409
>> cpu4:        327170          40038
>> cpu5:        325491          39922
>> cpu6:        319129          42391
>> cpu7:        306480          41035
>> cpu8:        161139          32188
>> cpu9:        164649          31024
>> cpu10:       149823          30398
>> cpu11:       149823          32455
>> cpu12:       164830          35143
>> cpu13:       172269          35805
>> cpu14:       179979          33898
>> cpu15:       194505          32754
>> avg:         268963.6        40129.8
>>
>> The VM's topology is "1*socket 8*cores 2*threads".
>> After presenting virtual L3 cache info to the VM, the number of RES IPIs in
>> the guest is reduced by 85%.
>>
>> What's more, for KVM, vcpu-sent IPIs cause vmexits, which are expensive.
>> We also tested the overall system performance with vcpus running on separate
>> physical sockets. With L3 cache, the performance improves by
>> 7.2%~33.1% (avg: 15.7%).
>>
>> Signed-off-by: Longpeng(Mike) <address@hidden>
> 
> For PC bits:
> Acked-by: Michael S. Tsirkin <address@hidden>

Thanks!

> 
> 
>> ---
>> Changes since v2:
>>   - add a more useful commit message.
>>   - rename "compat-cache" to "l3-cache-shared".
>>
>> Changes since v1:
>>   - fix the compat problem: set compat_props on PC_COMPAT_2_7.
>>   - fix an "intentionally introduced bug": make Intel's and AMD's consistent.
>>   - fix the CPUID.(EAX=4, ECX=3):EAX[25:14].
>>   - test the performance if vcpus run on separate sockets: with L3 cache,
>>     the performance improves by 7.2%~33.1% (avg: 15.7%).
>> ---
>>  include/hw/i386/pc.h |  8 ++++++++
>>  target-i386/cpu.c    | 49 ++++++++++++++++++++++++++++++++++++++++++++-----
>>  target-i386/cpu.h    |  5 +++++
>>  3 files changed, 57 insertions(+), 5 deletions(-)
>>
>> diff --git a/include/hw/i386/pc.h b/include/hw/i386/pc.h
>> index 74c175c..c92c54e 100644
>> --- a/include/hw/i386/pc.h
>> +++ b/include/hw/i386/pc.h
>> @@ -367,7 +367,15 @@ int e820_add_entry(uint64_t, uint64_t, uint32_t);
>>  int e820_get_num_entries(void);
>>  bool e820_get_entry(int, uint32_t, uint64_t *, uint64_t *);
>>  
>> +#define PC_COMPAT_2_7 \
>> +    {\
>> +        .driver   = TYPE_X86_CPU,\
>> +        .property = "l3-cache-shared",\
>> +        .value    = "off",\
>> +    },
>> +
>>  #define PC_COMPAT_2_6 \
>> +    PC_COMPAT_2_7 \
>>      HW_COMPAT_2_6 \
>>      {\
>>          .driver   = "fw_cfg_io",\
>> diff --git a/target-i386/cpu.c b/target-i386/cpu.c
>> index 6a1afab..4f93922 100644
>> --- a/target-i386/cpu.c
>> +++ b/target-i386/cpu.c
>> @@ -57,6 +57,7 @@
>>  #define CPUID_2_L1D_32KB_8WAY_64B 0x2c
>>  #define CPUID_2_L1I_32KB_8WAY_64B 0x30
>>  #define CPUID_2_L2_2MB_8WAY_64B   0x7d
>> +#define CPUID_2_L3_16MB_16WAY_64B 0x4d
>>  
>>  
>>  /* CPUID Leaf 4 constants: */
>> @@ -131,11 +132,18 @@
>>  #define L2_LINES_PER_TAG       1
>>  #define L2_SIZE_KB_AMD       512
>>  
>> -/* No L3 cache: */
>> +/* Level 3 unified cache: */
>>  #define L3_SIZE_KB             0 /* disabled */
>>  #define L3_ASSOCIATIVITY       0 /* disabled */
>>  #define L3_LINES_PER_TAG       0 /* disabled */
>>  #define L3_LINE_SIZE           0 /* disabled */
>> +#define L3_N_LINE_SIZE         64
>> +#define L3_N_ASSOCIATIVITY     16
>> +#define L3_N_SETS           16384
>> +#define L3_N_PARTITIONS         1
>> +#define L3_N_DESCRIPTOR CPUID_2_L3_16MB_16WAY_64B
>> +#define L3_N_LINES_PER_TAG      1
>> +#define L3_N_SIZE_KB_AMD    16384
>>  
>>  /* TLB definitions: */
>>  
>> @@ -2275,6 +2283,7 @@ void cpu_x86_cpuid(CPUX86State *env, uint32_t index, uint32_t count,
>>  {
>>      X86CPU *cpu = x86_env_get_cpu(env);
>>      CPUState *cs = CPU(cpu);
>> +    uint32_t pkg_offset;
>>  
>>      /* test if maximum index reached */
>>      if (index & 0x80000000) {
>> @@ -2328,7 +2337,11 @@ void cpu_x86_cpuid(CPUX86State *env, uint32_t index, uint32_t count,
>>          }
>>          *eax = 1; /* Number of CPUID[EAX=2] calls required */
>>          *ebx = 0;
>> -        *ecx = 0;
>> +        if (!cpu->enable_l3_cache_shared) {
>> +            *ecx = 0;
>> +        } else {
>> +            *ecx = L3_N_DESCRIPTOR;
>> +        }
>>          *edx = (L1D_DESCRIPTOR << 16) | \
>>                 (L1I_DESCRIPTOR <<  8) | \
>>                 (L2_DESCRIPTOR);
>> @@ -2374,6 +2387,25 @@ void cpu_x86_cpuid(CPUX86State *env, uint32_t index, uint32_t count,
>>                  *ecx = L2_SETS - 1;
>>                  *edx = CPUID_4_NO_INVD_SHARING;
>>                  break;
>> +            case 3: /* L3 cache info */
>> +                if (!cpu->enable_l3_cache_shared) {
>> +                    *eax = 0;
>> +                    *ebx = 0;
>> +                    *ecx = 0;
>> +                    *edx = 0;
>> +                    break;
>> +                }
>> +                *eax |= CPUID_4_TYPE_UNIFIED | \
>> +                        CPUID_4_LEVEL(3) | \
>> +                        CPUID_4_SELF_INIT_LEVEL;
>> +                pkg_offset = apicid_pkg_offset(cs->nr_cores, cs->nr_threads);
>> +                *eax |= ((1 << pkg_offset) - 1) << 14;
>> +                *ebx = (L3_N_LINE_SIZE - 1) | \
>> +                       ((L3_N_PARTITIONS - 1) << 12) | \
>> +                       ((L3_N_ASSOCIATIVITY - 1) << 22);
>> +                *ecx = L3_N_SETS - 1;
>> +                *edx = CPUID_4_INCLUSIVE | CPUID_4_COMPLEX_IDX;
>> +                break;
>>              default: /* end of info */
>>                  *eax = 0;
>>                  *ebx = 0;
>> @@ -2585,9 +2617,15 @@ void cpu_x86_cpuid(CPUX86State *env, uint32_t index, uint32_t count,
>>          *ecx = (L2_SIZE_KB_AMD << 16) | \
>>                 (AMD_ENC_ASSOC(L2_ASSOCIATIVITY) << 12) | \
>>                 (L2_LINES_PER_TAG << 8) | (L2_LINE_SIZE);
>> -        *edx = ((L3_SIZE_KB/512) << 18) | \
>> -               (AMD_ENC_ASSOC(L3_ASSOCIATIVITY) << 12) | \
>> -               (L3_LINES_PER_TAG << 8) | (L3_LINE_SIZE);
>> +        if (!cpu->enable_l3_cache_shared) {
>> +            *edx = ((L3_SIZE_KB / 512) << 18) | \
>> +                   (AMD_ENC_ASSOC(L3_ASSOCIATIVITY) << 12) | \
>> +                   (L3_LINES_PER_TAG << 8) | (L3_LINE_SIZE);
>> +        } else {
>> +            *edx = ((L3_N_SIZE_KB_AMD / 512) << 18) | \
>> +                   (AMD_ENC_ASSOC(L3_N_ASSOCIATIVITY) << 12) | \
>> +                   (L3_N_LINES_PER_TAG << 8) | (L3_N_LINE_SIZE);
>> +        }
>>          break;
>>      case 0x80000007:
>>          *eax = 0;
>> @@ -3364,6 +3402,7 @@ static Property x86_cpu_properties[] = {
>>      DEFINE_PROP_STRING("hv-vendor-id", X86CPU, hyperv_vendor_id),
>>      DEFINE_PROP_BOOL("cpuid-0xb", X86CPU, enable_cpuid_0xb, true),
>>      DEFINE_PROP_BOOL("lmce", X86CPU, enable_lmce, false),
>> +    DEFINE_PROP_BOOL("l3-cache-shared", X86CPU, enable_l3_cache_shared, true),
>>      DEFINE_PROP_END_OF_LIST()
>>  };
>>  
>> diff --git a/target-i386/cpu.h b/target-i386/cpu.h
>> index 65615c0..355bf47 100644
>> --- a/target-i386/cpu.h
>> +++ b/target-i386/cpu.h
>> @@ -1202,6 +1202,11 @@ struct X86CPU {
>>       */
>>      bool enable_lmce;
>>  
>> +    /* Compatibility bits for old machine types.
>> +     * If true present virtual l3 cache for VM.
> 
> "pretend that all CPUs share an l3 cache"?
> 

The vcpus in the same virtual socket share a virtual L3 cache.
I will make the comment clearer later.

2.7 has been released, so I will rework this patch for 2.8 later.

> 
>> +     */
>> +    bool enable_l3_cache_shared;
>> +
>>      /* Compatibility bits for old machine types: */
>>      bool enable_cpuid_0xb;
>>  
>> -- 
>> 1.8.3.1
>>
> 
> .
> 


-- 
Regards,
Longpeng(Mike)



