
Re: [PATCH v5 3/3] Add support for RAPL MSRs in KVM/Qemu


From: Anthony Harivel
Subject: Re: [PATCH v5 3/3] Add support for RAPL MSRs in KVM/Qemu
Date: Thu, 18 Apr 2024 12:52:14 +0200
User-agent: aerc/0.15.2-111-g39195000e213

Hi Zhao,

Zhao Liu, Apr 17, 2024 at 12:07:
> Hi Anthony,
>
> May I ask what your usage scenario is? Is it to measure the Guest's energy
> consumption and to charge per watt consumed? ;-)

See previous email from Daniel.

> On Thu, Apr 11, 2024 at 02:14:34PM +0200, Anthony Harivel wrote:
> > Date: Thu, 11 Apr 2024 14:14:34 +0200
> > From: Anthony Harivel <aharivel@redhat.com>
> > Subject: [PATCH v5 3/3] Add support for RAPL MSRs in KVM/Qemu
> > 
> > Starting with the "Sandy Bridge" generation, Intel CPUs provide a RAPL
> > interface (Running Average Power Limit) for advertising the accumulated
> > energy consumption of various power domains (e.g. CPU packages, DRAM,
> > etc.).
> >
> > The consumption is reported via MSRs (model specific registers) like
> > MSR_PKG_ENERGY_STATUS for the CPU package power domain. These MSRs are
> > 64 bits registers that represent the accumulated energy consumption in
> > micro Joules. They are updated by microcode every ~1ms.
>
> What is your current target platform?
>
> On future Xeon platforms (EMR and beyond) RAPL will support TPMI (an MMIO
> interface) and the TPMI based RAPL will be preferred in the future as
> well:
> * TPMI doc: https://github.com/intel/tpmi_power_management
> * TPMI based RAPL driver: drivers/powercap/intel_rapl_tpmi.c
>
> So do you have the plan to support RAPL-TPMI?

Yes, I guess this will be inevitable in the future. But right now the 
lack of HW with TPMI makes it hard to integrate on day 1.

> > For now, KVM always returns 0 when the guest requests the value of
> > these MSRs. Use the KVM MSR filtering mechanism to allow QEMU to handle
> > these MSRs dynamically in userspace.
> > 
> > To limit the amount of system calls for every MSR call, create a new
> > thread in QEMU that updates the "virtual" MSR values asynchronously.
> > 
> > Each vCPU has its own vMSR to reflect the independence of vCPUs. The
> > thread updates the vMSR values with the ratio of energy consumed of
> > the whole physical CPU package the vCPU thread runs on and the
> > thread's utime and stime values.
> > 
> > All other non-vCPU threads are also taken into account. Their energy
> > consumption is evenly distributed among all vCPUs threads running on
> > the same physical CPU package.
>
> The package energy consumption includes core part and uncore part, where
> uncore part consumption may not be able to be scaled based on vCPU
> runtime ratio.
>
> When the uncore part consumption is small, the error in this part is
> small, but if it is large, then the error generated by scaling by vCPU
> runtime will be large.
>

So far we can only work with what Intel is giving us, i.e. the package 
power plane and the DRAM power plane on servers, which are the main 
target of this feature. Maybe in the future Intel will extend the core 
power plane and the uncore power plane to server class CPUs?

> May I ask what your usage scenario is? Is there significant uncore
> consumption (e.g. GPU)?
>

Same answer as above: the uncore/graphics power plane is only available 
on client class CPUs. 

> Also, a more general question is whether the error in this
> calculation is measurable? Like comparing the RAPL status of the same
> workload on Guest and bare metal to check the error.
>
> IIUC, this calculation is highly affected by native/sibling Guests,
> especially in cloud scenarios where there are multiple Guests, so the
> accuracy of this algorithm needs to be checked.
>

Indeed, depending on where your vCPUs are running within the package (on 
the native or sibling CPU), you might observe different power 
consumption levels. However, I don't consider this to be a problem, as 
the ratio calculation takes into account the vCPU's location.

We also need to approach the measurement differently. Due to the 
complexity of the factors influencing power consumption, we must compare 
what is comparable. If you require precise power consumption data, 
use a power meter on the PSU of the server: it will provide the 
ultimate judgment. However, if you need an estimation to optimize 
software workloads in a guest, then this feature can be useful. All my 
tests have consistently shown reproducible output in terms of power 
consumption, which has convinced me that we can effectively work with 
it.

> > To overcome the problem that reading the RAPL MSRs requires privileged
> > access, a socket communication between QEMU and the qemu-vmsr-helper is
> > mandatory. You can specify the socket path in the parameter.
> > 
> > This feature is activated with -accel kvm,rapl=true,path=/path/sock.sock
>
> Based on the above comment, I suggest renaming this option to "rapl-msr"
> to distinguish it from rapl-tpmi.

Fair enough, I can rename it that way.
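
For reference, the activation example from the commit message would then 
read something like this (exact option spelling still to be settled, and 
the socket path is just a placeholder):

    -accel kvm,rapl-msr=true,path=/path/sock.sock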

>
> In addition, RAPL is basically a CPU feature, so I think it would be more
> appropriate to make it an x86 CPU property.
>
> Your RAPL support actually provides a framework for assisting KVM
> emulation in userspace, so this informs other feature support (maybe model
> specific, or architectural) in the future. Enabling/disabling CPU features
> via -cpu looks more natural.

This is totally dependent on KVM because it uses KVM MSR filtering 
to hand control to userspace when a specific MSR is accessed.

I can try to find a way to use -cpu for this feature and check whether 
KVM is activated or not. 

>
> [snip]
>
> > +High level implementation
> > +-------------------------
> > +
> > +In order to update the value of the virtual MSR, a QEMU thread is created.
> > +The thread is basically just an infinite loop that does:
> > +
> > +1. Snapshot of the time metrics of all QEMU threads (time spent scheduled
> > +   in Userspace and System)
> > +
> > +2. Snapshot of the actual MSR_PKG_ENERGY_STATUS counter of all packages
> > +   that the QEMU threads are running on.
> > +
> > +3. Sleep for 1 second - During this pause the vcpu and other non-vcpu
> > +   threads will do what they have to do and so the energy counter will
> > +   increase.
> > +
> > +4. Repeat 2. and 3. and calculate the delta of every metric representing
> > +   the time spent scheduled for each QEMU thread *and* the energy spent
> > +   by the packages during the pause.
> > +
> > +5. Filter the vcpu threads and the non-vcpu threads.
> > +
> > +6. Retrieve the topology of the Virtual Machine. This helps identify which
> > +   vCPU is running on which virtual package.
> > +
> > +7. The total energy spent by the non-vcpu threads is divided by the number
> > +   of vcpu threads so that each vcpu thread will get an equal part of the
> > +   energy spent by the QEMU workers.
> > +
> > +8. Calculate the ratio of energy spent per vcpu thread.
> > +
> > +9. Calculate the energy for each virtual package.
> > +
> > +10. The virtual MSRs are updated for each virtual package. Each vCPU that
> > +    belongs to the same package will return the same value when accessing
> > +    the MSR.
> > +
> > +11. Loop back to 1.
> > +
> > +Ratio calculation
> > +-----------------
> > +
> > +In Linux, a process has an execution time associated with it. The
> > +scheduler divides time into clock ticks. The number of clock ticks per
> > +second can be found with the sysconf system call. A typical value of
> > +clock ticks per second is 100. So a core can run a process at a maximum
> > +of 100 ticks per second. If a package has 4 cores, a maximum of 400 ticks
> > +can be scheduled on all the cores of the package for a period of 1 second.
> > +
> > +/proc/[pid]/stat [#b]_ is a procfs file that gives the executed time of a
> > +process with [pid] as the process ID. It gives the amount of ticks the
> > +process has been scheduled in userspace (utime) and kernel space (stime).
> > +
>
> I understand ticks would ignore frequency changes, e.g., HWP's auto-pilot
> or turbo boost. All these CPU frequency changes would impact core energy
> consumption.
>
> I think the better way would be to use APERF counter, but unfortunately it
> lacks virtualization support (for Intel).
>
> Due to such considerations, it may be more worthwhile to evaluate the
> accuracy of this tick-based algorithm.
>

I've evaluated such things with another tool called Kepler [1]. This 
tool calculates the power ratio with metrics from RAPL and uses either 
eBPF or the tick-based system for the time metrics. The eBPF part [2] is 
triggered on each 'finish_task_switch' of a thread and calculates the 
delta of CPU cycles, cache misses, CPU time, etc. Very complex. My tests 
showed that the difference between the eBPF and the tick-based ratio is 
really not that significant. Maybe in some special cases eBPF would show 
much better accuracy, but I'm not aware of any.


[1]: https://github.com/sustainable-computing-io/kepler
[2]: 
https://github.com/sustainable-computing-io/kepler/blob/main/bpfassets/libbpf/src/kepler.bpf.c
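
By the way, the tick-based ratio described in the documentation above 
essentially boils down to something like the following. This is my own 
illustrative version of the idea behind the vmsr_get_ratio() call further 
down, not the code from the patch:

#include <math.h>
#include <stdint.h>

/*
 * Sketch only (illustrative names, not the patch code): attribute a share
 * of the package energy delta (in micro Joules) to one thread, based on how
 * many scheduler ticks it consumed out of the package maximum for the
 * sampling period (e.g. 4 cores * 100 ticks/s * 1 s = 400 ticks).
 */
static uint64_t thread_energy_share(uint64_t pkg_energy_delta_uj,
                                    unsigned long thread_delta_ticks,
                                    unsigned long pkg_max_ticks)
{
    double share = (double)thread_delta_ticks / (double)pkg_max_ticks;

    return (uint64_t)llround((double)pkg_energy_delta_uj * share);
}

So a thread that got 100 of the 400 available ticks on a 4-core package 
is attributed a quarter of the package's energy delta.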

> [snip]
>
> > diff --git a/target/i386/kvm/kvm.c b/target/i386/kvm/kvm.c
> > index e68cbe929302..3de69caa229e 100644
> > --- a/target/i386/kvm/kvm.c
> > +++ b/target/i386/kvm/kvm.c
>
> [snip]
>
> > +static void *kvm_msr_energy_thread(void *data)
> > +{
> > +    KVMState *s = data;
> > +    struct KVMMsrEnergy *vmsr = &s->msr_energy;
> > +
> > +    g_autofree package_energy_stat *pkg_stat = NULL;
> > +    g_autofree thread_stat *thd_stat = NULL;
> > +    g_autofree pid_t *thread_ids = NULL;
> > +    g_autofree CPUState *cpu = NULL;
> > +    g_autofree unsigned int *vpkgs_energy_stat = NULL;
> > +    unsigned int num_threads = 0;
> > +    unsigned int tmp_num_threads = 0;
>
> [snip]
>
> > +        /* Sleep a short period while the other threads are working */
> > +        usleep(MSR_ENERGY_THREAD_SLEEP_US);
>
> Is it possible to passively read the energy status? i.e. access the Host
> MSR and calculate the energy consumption for the Guest when the Guest
> triggers the relevant exit.
>

Yes, it could be possible. But what I wanted to avoid with my approach is 
the overhead it could add when accessing a RAPL MSR in the Guest.
With the thread, the value is always available and returned very quickly. 
I'm not sure about the overhead if we have to access the MSR, 
then do the calculation and so on.

> I think this might make the error larger, but not sure the error would
> be so large as to be unacceptable.
>

I'm a bit concerned about the potential overflow in the calculation if 
the time between two reads is too long. 
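
(To illustrate the concern: the energy counter in MSR_PKG_ENERGY_STATUS 
wraps around, and the longer the interval between two reads, the more 
likely a wrap, or even several, goes unnoticed. A single wrap is easy to 
handle, as in the rough sketch below, which assumes the usual 32-bit wide 
counter, but multiple wraps between two reads are simply lost.)

#include <stdint.h>

/* Sketch only: wrap-safe delta between two successive reads of a 32-bit
 * energy counter. Unsigned arithmetic wraps modulo 2^32, so one wrap is
 * handled transparently; more than one wrap cannot be detected. */
static uint32_t energy_counter_delta(uint32_t prev, uint32_t now)
{
    return now - prev;
}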

> [snip]
>
> > +        /* Retrieve the virtual package number of each vCPU */
> > +        for (int i = 0; i < vmsr->x86_cpu_list->len; i++) {
> > +            for (int j = 0; j < num_threads; j++) {
> > +                if ((thd_stat[j].acpi_id == vmsr->x86_cpu_list->cpus[i].arch_id)
> > +                    && (thd_stat[j].is_vcpu == true)) {
> > +                    x86_topo_ids_from_apicid(thd_stat[j].acpi_id,
> > +                        &vmsr->topo_info, &topo_ids);
> > +                    thd_stat[j].vpkg_id = topo_ids.pkg_id;
> > +                }
> > +            }
> > +        }
>
> We lack package instances as well as MSR topology, so we have to deal
> with this pain...
>
> Similarly, my attempt at other package/die level MSRs [1] is to only
> allow Guest to set a package/die.
>
> But I think that in the future with QOM-topo [2] (i.e. creating package/
> die instances), MSR topology can have an easier implementation, though it
> looks like there's a long way to go!
>
> [1]: 
> https://lore.kernel.org/qemu-devel/20240203093054.412135-4-zhao1.liu@linux.intel.com/
> [2]: 
> https://lore.kernel.org/qemu-devel/20231130144203.2307629-1-zhao1.liu@linux.intel.com/
>

Oh, great! Good to know! 
Topology is indeed a nightmare so good luck with that!! 


> > +        /* Calculate the total energy of all non-vCPU thread */
> > +        for (int i = 0; i < num_threads; i++) {
> > +            double temp;
> > +            if ((thd_stat[i].is_vcpu != true) &&
> > +                (thd_stat[i].delta_ticks > 0)) {
> > +                temp = vmsr_get_ratio(pkg_stat[thd_stat[i].pkg_id].e_delta,
> > +                    thd_stat[i].delta_ticks,
> > +                    vmsr->host_topo.maxticks[thd_stat[i].pkg_id]);
> > +                pkg_stat[thd_stat[i].pkg_id].e_ratio
> > +                    += (uint64_t)lround(temp);
> > +            }
> > +        }
> > +
>
> [snip]
>
> > +static int kvm_msr_energy_thread_init(KVMState *s, MachineState *ms)
> > +{
> > +    struct KVMMsrEnergy *r = &s->msr_energy;
> > +    int ret = 0;
> > +
>
> [snip]
>
> > +    /* Retrieve the number of virtual sockets */
> > +    r->vsockets = ms->smp.sockets;
>
> RAPL's package domain is special:
>  * When there's no die, it's package-scope.
>  * When there's die level, it's die-scope.
>
> In SDM vol.3, chapter 15.10 PLATFORM SPECIFIC POWER MANAGEMENT SUPPORT, it
> says: the Package domain is the processor die.
>
> So if a Guest has a die level, these MSRs should be shared within a die.
> Thus I guess the energy consumption status should also be distributed
> based on the die number.
>

Right, I've totally missed that. Thanks for letting me know.
I'm going to have a deeper look at this.


> > +    /* Allocate register memory (MSR_PKG_STATUS) for each vcpu */
> > +    r->msr_value = g_new0(uint64_t, r->vcpus);
> > +
> > +    /* Retrieve the CPUArchIDlist */
> > +    r->x86_cpu_list = x86_possible_cpu_arch_ids(ms);
> > +
>
> [snip]
>
> > +int is_rapl_enabled(void)
> > +{
> > +    const char *path = "/sys/class/powercap/intel-rapl/enabled";
>
> This field does not ensure the existence of RAPL MSRs since TPMI would
> also enable this field. (See powercap_register_control_type() in
> drivers/powercap/intel_rapl_{msr,tpmi}.c)
>

But this is exactly what it is intended to do. Whether it is MSR or TPMI, 
checking this tells me whether RAPL is activated or not. If it is 
activated...

> We can read RAPL MSRs directly. If we get 0 or failure, then there's no
> RAPL MSRs on Host.
>

... then it is safe to access the RAPL MSRs. I would not like to access 
these on an XYZ CPU without knowing that I can!

> > +    FILE *file = fopen(path, "r");
> > +    int value = 0;
> > +
> > +    if (file != NULL) {
> > +        if (fscanf(file, "%d", &value) != 1) {
> > +            error_report("INTEL RAPL not enabled");
> > +        }
> > +        fclose(file);
> > +    } else {
> > +        error_report("Error opening %s", path);
> > +    }
> > +
> > +    return value;
> > +}
>
> [snip]
>
> > +/* Get the physical package id from a given cpu id */
> > +int vmsr_get_physical_package_id(int cpu_id)
> > +{
> > +    g_autofree char *file_contents = NULL;
> > +    g_autofree char *file_path = NULL;
> > +    int package_id = -1;
> > +    gsize length;
> > +
> > +    file_path = g_strdup_printf("/sys/devices/system/cpu/cpu%d\
> > +        /topology/physical_package_id", cpu_id);
>
> Similarly, based on the topological properties of RAPL, it is more
> accurate to also consider the die_id.
>
> Currently, only CLX-AP has multiple dies. So, we either have to put a
> limit on the Host's CPU model or consider the topology of the die here
> as well. I think the latter is a more general approach.
>

Alright, I take note of that and I will check the die approach.
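
For the record, I imagine it would end up looking very much like the 
helper quoted here, just reading topology/die_id instead of 
physical_package_id. A rough sketch (vmsr_get_die_id() is a name I'm 
making up, and it assumes the die_id sysfs attribute exists on the host):

#include <glib.h>
#include <stdlib.h>

/* Sketch only: return the die id of a given host cpu, mirroring
 * vmsr_get_physical_package_id(). Returns -1 on error. */
static int vmsr_get_die_id(int cpu_id)
{
    g_autofree char *file_contents = NULL;
    g_autofree char *file_path = NULL;
    int die_id = -1;
    gsize length;

    file_path = g_strdup_printf("/sys/devices/system/cpu/cpu%d"
                                "/topology/die_id", cpu_id);

    if (g_file_get_contents(file_path, &file_contents, &length, NULL)) {
        die_id = atoi(file_contents);
    }

    return die_id;
}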

> > +
> > +    if (!g_file_get_contents(file_path, &file_contents, &length, NULL)) {
> > +        goto out;
> > +    }
> > +
> > +    package_id = atoi(file_contents);
> > +
> > +out:
> > +    return package_id;
> > +}
> > +
>
> Regards,
> Zhao

Thanks a lot for all your feedback Zhao !

Regards,
Anthony