qemu-devel

Re: [Qemu-devel] [PATCHv4 0/9] buffer_is_zero / migration optimizations


From: Peter Lieven
Subject: Re: [Qemu-devel] [PATCHv4 0/9] buffer_is_zero / migration optimizations
Date: Mon, 25 Mar 2013 22:37:44 +0100

On 25.03.2013 at 15:34, Paolo Bonzini <address@hidden> wrote:

> On 25/03/2013 14:32, Peter Lieven wrote:
>> 
>> On 25.03.2013 at 14:23, Peter Lieven <address@hidden> wrote:
>> 
>>> 
>>> On 25.03.2013 at 14:02, Paolo Bonzini <address@hidden> wrote:
>>> 
>>>>> Maybe I should have explained the output in more detail. The percentages
>>>>> are cumulative. 35.8% in the second-to-last column means that
>>>>> 35.8% of pages have a return value that is less than TARGET_PAGE_SIZE.
>>>>> This was meant to illustrate how many 64-bit chunks you have
>>>>> to look at to catch a certain percentage of non-zero pages.
>>>> 
>>>> Ok, I wrongly understood that many pages had 4088 zero bytes but
>>>> the last 8 were not zero.  Now it's clearer, and more logical too. :)
>>>> 
>>>>> For example, the third value means that looking at the first
>>>>> three 64-bit chunks catches 34.0% of all pages.
>>>>> It turns out that the non-zeroness of a page can usually be detected
>>>>> by looking at the first 256 or so bits; only a low
>>>>> percentage of pages turn out to be non-zero at a later position. So
>>>>> after having checked the first chunks one by one,
>>>>> there is no big penalty for looking at the remaining chunks with the
>>>>> vectorized loop.
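
[Editor's sketch] The cumulative figures described above could be gathered roughly as follows. This is only an illustration under assumptions: a 4096-byte page, and the hypothetical names PAGE_SIZE_SKETCH, first_nonzero_chunk and caught_within, none of which appear in the patch series.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define PAGE_SIZE_SKETCH 4096   /* assumed page size, not taken from the patch */
#define CHUNKS (PAGE_SIZE_SKETCH / sizeof(uint64_t))

/* Index of the first non-zero 64-bit chunk, or CHUNKS if the page is all zero. */
size_t first_nonzero_chunk(const unsigned char *page)
{
    size_t i;

    for (i = 0; i < CHUNKS; i++) {
        uint64_t w;
        memcpy(&w, page + i * sizeof(w), sizeof(w));
        if (w) {
            return i;
        }
    }
    return CHUNKS;
}

/*
 * Fraction of all pages whose first non-zero chunk lies within the
 * first k chunks, i.e. the cumulative value for column k.
 */
double caught_within(const unsigned char *pages, size_t npages, size_t k)
{
    size_t n, caught = 0;

    assert(k <= CHUNKS);
    for (n = 0; n < npages; n++) {
        if (first_nonzero_chunk(pages + n * PAGE_SIZE_SKETCH) < k) {
            caught++;
        }
    }
    return npages ? (double)caught / npages : 0.0;
}
```

Evaluating caught_within for k = 1, 2, 3, ... over a memory dump would give one cumulative column per k, which is how the percentages in the thread are meant to be read.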
>>>> 
>>>> I think it makes most sense to unroll the first four non-vectorized
>>>> iterations, i.e. not use SSE and use three or four ifs.  Either:
>>>> 
>>>> if (foo[0]) return 0;
>>>> if (foo[1]) return 8;
>>>> if (foo[2]) return 16;
>>>> if (foo[3]) return 24;
>>>> 
>>>> or
>>>> 
>>>> if (foo[0]) return 0;
>>>> if (foo[1] | foo[2] | foo[3]) return 8;
>>>> 
>>>> and then proceed on the remaining 4096-4*sizeof(long) bytes with
>>>> the vectorized loop.  foo+4 is aligned for SIMD operations on both
>>>> 32- and 64-bit machines, which makes this a nice choice.
>>> 
>>> I can't start at foo+4, since the remaining X-4*sizeof(long) bytes
>>> are not divisible by 8*sizeof(VECTYPE).
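
[Editor's sketch] The objection can be checked with a little arithmetic. Assuming X = 4096, an 8-byte long and a 16-byte VECTYPE (typical for SSE2 on x86-64, but an assumption here), skipping four longs leaves 4064 bytes, and 4064 mod 128 = 96. The helper name leftover_bytes is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Bytes that would be left unprocessed if a loop handling
 * unroll * vec_size bytes per iteration started `skipped` bytes
 * into a buffer of `len` bytes.
 */
size_t leftover_bytes(size_t len, size_t skipped, size_t vec_size, size_t unroll)
{
    assert(skipped <= len);
    return (len - skipped) % (unroll * vec_size);
}
```

For example, leftover_bytes(4096, 4 * 8, 16, 8) is 96, whereas skipping nothing, or skipping a whole 128-byte block, leaves no remainder.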
> 
> 
> Hmm, right.  What about just processing the first few longs twice, i.e.
> the above followed by "for (i = 0; i < len / sizeof(VECTYPE); i
> += BUFFER_FIND_NONZERO_OFFSET_UNROLL_FACTOR)"?
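
[Editor's sketch] A portable illustration of this suggestion: test the first four words one by one, then run the block loop from offset 0 again so the length stays a multiple of the block size. It assumes 8-byte words and uses a plain 128-byte OR-reduction as a stand-in for the eight unrolled 16-byte vector compares; find_nonzero_offset_twice and WORDS_PER_BLOCK are hypothetical names, not QEMU code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define WORDS_PER_BLOCK 16  /* 16 * 8 = 128 bytes, stand-in for 8 vectors of 16 bytes */

/*
 * Returns the byte offset of the word (early exit) or block (main loop)
 * that contains the first non-zero byte, or len if the buffer is all zero.
 */
size_t find_nonzero_offset_twice(const void *buf, size_t len)
{
    const unsigned char *p = buf;
    uint64_t w[WORDS_PER_BLOCK];
    size_t i, off;

    assert(len >= sizeof(w) && len % sizeof(w) == 0);

    /* Scalar pre-check: most non-zero pages are caught here. */
    memcpy(w, p, 4 * sizeof(uint64_t));
    for (i = 0; i < 4; i++) {
        if (w[i]) {
            return i * sizeof(uint64_t);
        }
    }

    /* Block loop restarts at offset 0, so the first words are read twice. */
    for (off = 0; off < len; off += sizeof(w)) {
        uint64_t acc = 0;

        memcpy(w, p + off, sizeof(w));
        for (i = 0; i < WORDS_PER_BLOCK; i++) {
            acc |= w[i];
        }
        if (acc) {
            return off;
        }
    }
    return len;
}
```

The cost of this variant is re-reading 32 bytes that are already known to be zero; whether that matters is exactly what profiling would have to show.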

I will profile it tomorrow.

What is wrong with processing the first 8 vectors as shown below?

>>  for (i = 0; i < BUFFER_FIND_NONZERO_OFFSET_UNROLL_FACTOR; i++) {
>>      if (!ALL_EQ(p[i], zero)) {
>>          return i * sizeof(VECTYPE);
>>      }
>>  }


This way it would not be necessary to process them twice.
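
[Editor's sketch] For comparison, a portable illustration of this alternative: the first UNROLL_FACTOR vectors are tested individually, and the unrolled loop then continues at index UNROLL_FACTOR, so nothing is read twice and the remaining length is still a multiple of the block size. The 16-byte vec_t struct stands in for VECTYPE and vec_is_zero for ALL_EQ(p[i], zero); both are assumptions for illustration, as is the function name.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define UNROLL_FACTOR 8

typedef struct { uint64_t lo, hi; } vec_t;  /* hypothetical stand-in for VECTYPE */

/* Non-zero if the idx-th vector of the buffer is entirely zero. */
static int vec_is_zero(const unsigned char *p, size_t idx)
{
    vec_t v;

    memcpy(&v, p + idx * sizeof(vec_t), sizeof(v));
    return (v.lo | v.hi) == 0;
}

/*
 * Returns the byte offset of the vector (head) or block (main loop)
 * that contains the first non-zero byte, or len if the buffer is all zero.
 */
size_t find_nonzero_offset_headfirst(const void *buf, size_t len)
{
    const unsigned char *p = buf;
    size_t nvec = len / sizeof(vec_t);
    size_t i, j;

    assert(len >= UNROLL_FACTOR * sizeof(vec_t));
    assert(len % (UNROLL_FACTOR * sizeof(vec_t)) == 0);

    /* Head: one vector at a time, with an early exit per vector. */
    for (i = 0; i < UNROLL_FACTOR; i++) {
        if (!vec_is_zero(p, i)) {
            return i * sizeof(vec_t);
        }
    }

    /* Tail: UNROLL_FACTOR vectors per iteration, starting after the head. */
    for (i = UNROLL_FACTOR; i < nvec; i += UNROLL_FACTOR) {
        int all_zero = 1;

        for (j = 0; j < UNROLL_FACTOR; j++) {
            all_zero &= vec_is_zero(p, i + j);
        }
        if (!all_zero) {
            return i * sizeof(vec_t);
        }
    }
    return len;
}
```

The early-exit granularity here is one vector (16 bytes in this sketch) rather than one long, in exchange for never touching the head of the buffer a second time.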

Peter



