From: Longpeng (Mike)
Subject: Re: [Qemu-devel] [PATCH v3] target-i386: present virtual L3 cache info for vcpus
Date: Mon, 5 Sep 2016 09:16:38 +0800
User-agent: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:11.0) Gecko/20120327 Thunderbird/11.0.1

Hi Michael,
On 2016/9/3 6:52, Michael S. Tsirkin wrote:
> On Fri, Sep 02, 2016 at 10:22:55AM +0800, Longpeng(Mike) wrote:
>> From: "Longpeng(Mike)" <address@hidden>
>>
>> Some software algorithms depend on the hardware's cache info. For example,
>> in the x86 Linux kernel, when cpu1 wants to wake up a task on cpu2, cpu1
>> triggers a resched IPI and tells cpu2 to do the wakeup if the two do not
>> share a last-level cache (LLC). Conversely, cpu1 accesses cpu2's runqueue
>> directly if they share an LLC.
>> The relevant Linux kernel code is as follows:
>>
>> static void ttwu_queue(struct task_struct *p, int cpu)
>> {
>> struct rq *rq = cpu_rq(cpu);
>> ......
>> if (... && !cpus_share_cache(smp_processor_id(), cpu)) {
>> ......
>> ttwu_queue_remote(p, cpu); /* will trigger RES IPI */
>> return;
>> }
>> ......
>> ttwu_do_activate(rq, p, 0); /* access target's rq directly */
>> ......
>> }
>>
>> On real hardware, the cpus on the same socket share an L3 cache, so one cpu
>> does not trigger a resched IPI when waking up a task on another. But QEMU
>> does not present virtual L3 cache info to the VM, so a Linux guest triggers
>> many RES IPIs under some workloads, even when the virtual cpus belong to the
>> same virtual socket.
>>
>> For KVM, this degrades performance, because IPIs sent by the guest cause
>> many vmexits.
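
The wakeup decision described above can be modelled in a few lines of C. This is only a toy sketch, not kernel code: `struct topo`, `share_cache()` and `choose_wake_path()` are hypothetical names, and the per-cpu `llc_id` array stands in for whatever last-level-cache id the guest derives from CPUID.

```c
#include <assert.h>
#include <stdbool.h>

#define NR_CPUS 4

/* Which path a wakeup takes in this toy model. */
enum wake_path { WAKE_DIRECT, WAKE_REMOTE_IPI };

/* Hypothetical topology: one last-level-cache id per cpu. */
struct topo {
    int llc_id[NR_CPUS];
};

/* Two cpus share a cache when they report the same LLC id. */
static bool share_cache(const struct topo *t, int a, int b)
{
    assert(a >= 0 && a < NR_CPUS && b >= 0 && b < NR_CPUS);
    return t->llc_id[a] == t->llc_id[b];
}

/* Mirrors the branch in ttwu_queue(): IPI only if no shared LLC. */
static enum wake_path choose_wake_path(const struct topo *t,
                                       int waker, int target)
{
    return share_cache(t, waker, target) ? WAKE_DIRECT : WAKE_REMOTE_IPI;
}
```

If every cpu in a socket reports the same `llc_id`, every wakeup inside that socket takes the direct path. If the ids differ per core (an assumption about what a guest sees when no L3 is described), cross-core wakeups fall back to the IPI path.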
>>
>> The workload is a SAP HANA test suite. We ran one round (about 40 minutes)
>> and counted the RES IPIs triggered in the (SUSE 11 SP3) guest during that
>> period:
>>
>>          No-L3       With-L3 (this patch applied)
>> cpu0:    363890      44582
>> cpu1:    373405      43109
>> cpu2:    340783      43797
>> cpu3:    333854      43409
>> cpu4:    327170      40038
>> cpu5:    325491      39922
>> cpu6:    319129      42391
>> cpu7:    306480      41035
>> cpu8:    161139      32188
>> cpu9:    164649      31024
>> cpu10:   149823      30398
>> cpu11:   149823      32455
>> cpu12:   164830      35143
>> cpu13:   172269      35805
>> cpu14:   179979      33898
>> cpu15:   194505      32754
>> avg:     268963.6    40129.8
>>
>> The VM's topology is "1*socket 8*cores 2*threads".
>> After presenting virtual L3 cache info to the VM, the number of RES IPIs in
>> the guest drops by 85%.
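
The 85% figure follows from the two reported averages. A minimal sketch of the arithmetic, where `reduction_percent()` is a hypothetical helper name:

```c
#include <assert.h>

/* Relative reduction from `before` to `after`, in percent. */
static double reduction_percent(double before, double after)
{
    assert(before > 0.0);
    return (before - after) / before * 100.0;
}
```

Called with the averages from the table, `reduction_percent(268963.6, 40129.8)` gives roughly 85%.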
>>
>> What's more, for KVM, IPIs sent by vcpus cause vmexits, which are expensive.
>> We also tested overall system performance when the vcpus actually run on
>> separate physical sockets. With the L3 cache, performance improves by
>> 7.2%~33.1% (avg: 15.7%).
>>
>> Signed-off-by: Longpeng(Mike) <address@hidden>
>
> For PC bits:
> Acked-by: Michael S. Tsirkin <address@hidden>
Thanks!
>
>
>> ---
>> Changes since v2:
>> - add a more useful commit message.
>> - rename "compat-cache" to "l3-cache-shared".
>>
>> Changes since v1:
>> - fix the compat problem: set compat_props on PC_COMPAT_2_7.
>> - fix an "intentionally introduced bug": make the Intel and AMD values
>> consistent.
>> - fix the CPUID.(EAX=4, ECX=3):EAX[25:14].
>> - test the performance when vcpus run on separate sockets: with the L3
>> cache, performance improves by 7.2%~33.1% (avg: 15.7%).
>> ---
>> include/hw/i386/pc.h | 8 ++++++++
>> target-i386/cpu.c | 49 ++++++++++++++++++++++++++++++++++++++++++++-----
>> target-i386/cpu.h | 5 +++++
>> 3 files changed, 57 insertions(+), 5 deletions(-)
>>
>> diff --git a/include/hw/i386/pc.h b/include/hw/i386/pc.h
>> index 74c175c..c92c54e 100644
>> --- a/include/hw/i386/pc.h
>> +++ b/include/hw/i386/pc.h
>> @@ -367,7 +367,15 @@ int e820_add_entry(uint64_t, uint64_t, uint32_t);
>> int e820_get_num_entries(void);
>> bool e820_get_entry(int, uint32_t, uint64_t *, uint64_t *);
>>
>> +#define PC_COMPAT_2_7 \
>> + {\
>> + .driver = TYPE_X86_CPU,\
>> + .property = "l3-cache-shared",\
>> + .value = "off",\
>> + },
>> +
>> #define PC_COMPAT_2_6 \
>> + PC_COMPAT_2_7 \
>> HW_COMPAT_2_6 \
>> {\
>> .driver = "fw_cfg_io",\
>> diff --git a/target-i386/cpu.c b/target-i386/cpu.c
>> index 6a1afab..4f93922 100644
>> --- a/target-i386/cpu.c
>> +++ b/target-i386/cpu.c
>> @@ -57,6 +57,7 @@
>> #define CPUID_2_L1D_32KB_8WAY_64B 0x2c
>> #define CPUID_2_L1I_32KB_8WAY_64B 0x30
>> #define CPUID_2_L2_2MB_8WAY_64B 0x7d
>> +#define CPUID_2_L3_16MB_16WAY_64B 0x4d
>>
>>
>> /* CPUID Leaf 4 constants: */
>> @@ -131,11 +132,18 @@
>> #define L2_LINES_PER_TAG 1
>> #define L2_SIZE_KB_AMD 512
>>
>> -/* No L3 cache: */
>> +/* Level 3 unified cache: */
>> #define L3_SIZE_KB 0 /* disabled */
>> #define L3_ASSOCIATIVITY 0 /* disabled */
>> #define L3_LINES_PER_TAG 0 /* disabled */
>> #define L3_LINE_SIZE 0 /* disabled */
>> +#define L3_N_LINE_SIZE 64
>> +#define L3_N_ASSOCIATIVITY 16
>> +#define L3_N_SETS 16384
>> +#define L3_N_PARTITIONS 1
>> +#define L3_N_DESCRIPTOR CPUID_2_L3_16MB_16WAY_64B
>> +#define L3_N_LINES_PER_TAG 1
>> +#define L3_N_SIZE_KB_AMD 16384
>>
>> /* TLB definitions: */
>>
>> @@ -2275,6 +2283,7 @@ void cpu_x86_cpuid(CPUX86State *env, uint32_t index, uint32_t count,
>> {
>> X86CPU *cpu = x86_env_get_cpu(env);
>> CPUState *cs = CPU(cpu);
>> + uint32_t pkg_offset;
>>
>> /* test if maximum index reached */
>> if (index & 0x80000000) {
>> @@ -2328,7 +2337,11 @@ void cpu_x86_cpuid(CPUX86State *env, uint32_t index, uint32_t count,
>> }
>> *eax = 1; /* Number of CPUID[EAX=2] calls required */
>> *ebx = 0;
>> - *ecx = 0;
>> + if (!cpu->enable_l3_cache_shared) {
>> + *ecx = 0;
>> + } else {
>> + *ecx = L3_N_DESCRIPTOR;
>> + }
>> *edx = (L1D_DESCRIPTOR << 16) | \
>> (L1I_DESCRIPTOR << 8) | \
>> (L2_DESCRIPTOR);
>> @@ -2374,6 +2387,25 @@ void cpu_x86_cpuid(CPUX86State *env, uint32_t index, uint32_t count,
>> *ecx = L2_SETS - 1;
>> *edx = CPUID_4_NO_INVD_SHARING;
>> break;
>> + case 3: /* L3 cache info */
>> + if (!cpu->enable_l3_cache_shared) {
>> + *eax = 0;
>> + *ebx = 0;
>> + *ecx = 0;
>> + *edx = 0;
>> + break;
>> + }
>> + *eax |= CPUID_4_TYPE_UNIFIED | \
>> + CPUID_4_LEVEL(3) | \
>> + CPUID_4_SELF_INIT_LEVEL;
>> + pkg_offset = apicid_pkg_offset(cs->nr_cores, cs->nr_threads);
>> + *eax |= ((1 << pkg_offset) - 1) << 14;
>> + *ebx = (L3_N_LINE_SIZE - 1) | \
>> + ((L3_N_PARTITIONS - 1) << 12) | \
>> + ((L3_N_ASSOCIATIVITY - 1) << 22);
>> + *ecx = L3_N_SETS - 1;
>> + *edx = CPUID_4_INCLUSIVE | CPUID_4_COMPLEX_IDX;
>> + break;
>> default: /* end of info */
>> *eax = 0;
>> *ebx = 0;
>> @@ -2585,9 +2617,15 @@ void cpu_x86_cpuid(CPUX86State *env, uint32_t index, uint32_t count,
>> *ecx = (L2_SIZE_KB_AMD << 16) | \
>> (AMD_ENC_ASSOC(L2_ASSOCIATIVITY) << 12) | \
>> (L2_LINES_PER_TAG << 8) | (L2_LINE_SIZE);
>> - *edx = ((L3_SIZE_KB/512) << 18) | \
>> - (AMD_ENC_ASSOC(L3_ASSOCIATIVITY) << 12) | \
>> - (L3_LINES_PER_TAG << 8) | (L3_LINE_SIZE);
>> + if (!cpu->enable_l3_cache_shared) {
>> + *edx = ((L3_SIZE_KB / 512) << 18) | \
>> + (AMD_ENC_ASSOC(L3_ASSOCIATIVITY) << 12) | \
>> + (L3_LINES_PER_TAG << 8) | (L3_LINE_SIZE);
>> + } else {
>> + *edx = ((L3_N_SIZE_KB_AMD / 512) << 18) | \
>> + (AMD_ENC_ASSOC(L3_N_ASSOCIATIVITY) << 12) | \
>> + (L3_N_LINES_PER_TAG << 8) | (L3_N_LINE_SIZE);
>> + }
>> break;
>> case 0x80000007:
>> *eax = 0;
>> @@ -3364,6 +3402,7 @@ static Property x86_cpu_properties[] = {
>> DEFINE_PROP_STRING("hv-vendor-id", X86CPU, hyperv_vendor_id),
>> DEFINE_PROP_BOOL("cpuid-0xb", X86CPU, enable_cpuid_0xb, true),
>> DEFINE_PROP_BOOL("lmce", X86CPU, enable_lmce, false),
>> + DEFINE_PROP_BOOL("l3-cache-shared", X86CPU, enable_l3_cache_shared, true),
>> DEFINE_PROP_END_OF_LIST()
>> };
>>
>> diff --git a/target-i386/cpu.h b/target-i386/cpu.h
>> index 65615c0..355bf47 100644
>> --- a/target-i386/cpu.h
>> +++ b/target-i386/cpu.h
>> @@ -1202,6 +1202,11 @@ struct X86CPU {
>> */
>> bool enable_lmce;
>>
>> + /* Compatibility bits for old machine types.
>> + * If true present virtual l3 cache for VM.
>
> "pretend that all CPUs share an l3 cache"?
>
The vcpus in the same virtual socket share a virtual L3 cache.
I will make that clearer later.
QEMU 2.7 has already been released, so I will rework this patch for 2.8.
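
To make the "same virtual socket" point concrete, here is a minimal sketch of how the CPUID.(EAX=4, ECX=3) values in the patch work out. The `L3_N_*` constants are copied from the patch. The helpers `bits_for_count()`, `pkg_offset()`, `l3_sharing_field()` and `l3_size_bytes()` are hypothetical stand-ins written for this illustration; in particular `pkg_offset()` only assumes that the package offset is the SMT bit width plus the core bit width, which is how I understand `apicid_pkg_offset()` to behave.

```c
#include <assert.h>
#include <stdint.h>

/* Values copied from the L3_N_* macros in the patch. */
#define L3_N_LINE_SIZE      64
#define L3_N_ASSOCIATIVITY  16
#define L3_N_SETS           16384
#define L3_N_PARTITIONS     1

/* Hypothetical helper: APIC ID bits needed to number `count` items. */
static unsigned bits_for_count(unsigned count)
{
    unsigned bits = 0;

    assert(count >= 1);
    for (count -= 1; count != 0; count >>= 1) {
        bits++;
    }
    return bits;
}

/* Assumed model: package offset = SMT bit width + core bit width. */
static unsigned pkg_offset(unsigned nr_cores, unsigned nr_threads)
{
    return bits_for_count(nr_threads) + bits_for_count(nr_cores);
}

/* EAX[25:14]: max addressable ids of logical cpus sharing the cache, minus 1. */
static uint32_t l3_sharing_field(unsigned nr_cores, unsigned nr_threads)
{
    return ((1u << pkg_offset(nr_cores, nr_threads)) - 1) << 14;
}

/* Cache size = ways * partitions * line size * sets. */
static uint32_t l3_size_bytes(void)
{
    return (uint32_t)L3_N_ASSOCIATIVITY * L3_N_PARTITIONS *
           L3_N_LINE_SIZE * L3_N_SETS;
}
```

For the tested topology (8 cores, 2 threads) the package offset is 4, so EAX[25:14] holds 15: all 16 logical cpus of the socket share the cache. The size works out to 16 MiB, which matches both the 16MB/16-way/64B descriptor and the `L3_N_SIZE_KB_AMD` value of 16384.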
>
>> + */
>> + bool enable_l3_cache_shared;
>> +
>> /* Compatibility bits for old machine types: */
>> bool enable_cpuid_0xb;
>>
>> --
>> 1.8.3.1
>>
>
> .
>
--
Regards,
Longpeng(Mike)