[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
Re: [Qemu-devel] [PATCH v6 08/15] qdist: add module to represent frequen
From: |
Sergey Fedorov |
Subject: |
Re: [Qemu-devel] [PATCH v6 08/15] qdist: add module to represent frequency distributions of data |
Date: |
Tue, 7 Jun 2016 17:06:16 +0300 |
User-agent: |
Mozilla/5.0 (X11; Linux x86_64; rv:38.0) Gecko/20100101 Thunderbird/38.8.0 |
On 07/06/16 02:40, Emilio G. Cota wrote:
> On Fri, Jun 03, 2016 at 20:46:07 +0300, Sergey Fedorov wrote:
>> On 03/06/16 20:29, Sergey Fedorov wrote:
>>> On 03/06/16 20:22, Emilio G. Cota wrote:
>>>> On Sat, May 28, 2016 at 21:15:06 +0300, Sergey Fedorov wrote:
>>>>> On 25/05/16 04:13, Emilio G. Cota wrote:
>>>>> (snip)
>>>>>> +double qdist_avg(const struct qdist *dist)
>>>>>> +{
>>>>>> + unsigned long count;
>>>>>> + size_t i;
>>>>>> + double ret = 0;
>>>>>> +
>>>>>> + count = qdist_sample_count(dist);
>>>>>> + if (!count) {
>>>>>> + return NAN;
>>>>>> + }
>>>>>> + for (i = 0; i < dist->n; i++) {
>>>>>> + struct qdist_entry *e = &dist->entries[i];
>>>>>> +
>>>>>> + ret += e->x * e->count / count;
>>>>> Please use Welford’s method or something like that, see
>>>>> http://stackoverflow.com/a/1346890.
>>>> Yes, the way the mean is computed right now, we might suffer
>>>> from underflow if count is huge. But I'd rather take that, than the
>>>> perf penalty of an iterative method (such as the one used
>>>> in Welford's). Note that we might have huge amounts of
>>>> items, e.g. one item per head bucket in qht's occupancy qdist
>>>> (and 0.5M head buckets is easy to achieve).
>>>>
>>>> If we were to use an iterative method, we'd need to do something
>>>> like:
>>>>
>>>> double qdist_avg(const struct qdist *dist)
>>>> {
>>>> size_t i, j;
>>>> double ret = 0;
>>>>
>>>> if (!qdist_sample_count(dist)) {
>>>> return NAN;
>>>> }
>>>> /* compute moving average to prevent under/overflow */
>>>> for (i = 0; i < dist->n; i++) {
>>>> struct qdist_entry *e = &dist->entries[i];
>>>>
>>>> for (j = 0; j < e->count; j++) {
>>>>
>>>> ret += (e->x - ret) / (i + j + 1);
>>>> }
>>>> }
>>>> return ret;
>>>> }
>>>>
>>>> Note that skipping the inner loop would be incorrect.
>>> Ah, it's a shame. I'm wondering if there is some other algorithm that
>>> could work for us?
>> Maybe something like
>> https://en.wikipedia.org/wiki/Kahan_summation_algorithm could help?
> That algorithm is overkill for what we're doing. Pairwise summation
> should suffice:
>
> diff --git a/util/qdist.c b/util/qdist.c
> index 3343640..909bd2b 100644
> --- a/util/qdist.c
> +++ b/util/qdist.c
> @@ -367,20 +367,34 @@ unsigned long qdist_sample_count(const struct qdist
> *dist)
> return count;
> }
>
> +static double qdist_pairwise_avg(const struct qdist *dist, size_t index,
> + size_t n, unsigned long count)
> +{
> + if (n <= 2) {
We would like to amortize the overhead of the recursion by making the
cut-off sufficiently large.
> + size_t i;
> + double ret = 0;
> +
> + for (i = 0; i < n; i++) {
> + struct qdist_entry *e = &dist->entries[index + i];
> +
> + ret += e->x * e->count / count;
> + }
> + return ret;
> + } else {
> + size_t n2 = n / 2;
> +
> + return qdist_pairwise_avg(dist, index, n2, count) +
> + qdist_pairwise_avg(dist, index + n2, n - n2, count);
> + }
> +}
> +
> double qdist_avg(const struct qdist *dist)
> {
> unsigned long count;
> - size_t i;
> - double ret = 0;
>
> count = qdist_sample_count(dist);
> if (!count) {
> return NAN;
> }
> - for (i = 0; i < dist->n; i++) {
> - struct qdist_entry *e = &dist->entries[i];
> -
> - ret += e->x * e->count / count;
> - }
> - return ret;
> + return qdist_pairwise_avg(dist, 0, dist->n, count);
> }
Otherwise looks good.
Kind regards,
Sergey