Re: [Qemu-arm] [RFC PATCH v2 1/2] utils: Add helper to read arm MIDR_EL1


From: Vijay Kilari
Subject: Re: [Qemu-arm] [RFC PATCH v2 1/2] utils: Add helper to read arm MIDR_EL1 register
Date: Fri, 19 Aug 2016 14:35:31 +0530

On Thu, Aug 18, 2016 at 8:26 PM, Peter Maydell <address@hidden> wrote:
> On 18 August 2016 at 15:46, Richard Henderson <address@hidden> wrote:
>> On 08/18/2016 07:14 AM, Peter Maydell wrote:
>>> While we're on the subject, can somebody explain to me why we
>>> use ifuncs at all? I couldn't work out why it would be better than
>>> just using a straightforward function pointer -- when I tried single
>>> stepping through things the ifunc approach still seemed to indirect
>>> through some table or other so it wasn't actually resolving to
>>> a direct function call anyway.
>
>> No reason, I suppose.
>>
>> It's particularly helpful for libraries, where we don't really want the
>> overhead of the initialization when it's not used.
>
> Ah, I see.
>
>> But (1) we don't have many of these and (2) we really don't care *that* much
>> about startup time.
>>
>> So a simple function pointer initialized by a constructor has the same
>> effect.
>

cutils does not have any initialization function or constructor that could
set up a function pointer for the zero_check function.
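
For reference, the constructor-initialized function pointer Richard describes
could be sketched roughly as below. This is only an illustration: every name
in it (zero_check_fn, zero_check_plain, zero_check_prefetch,
init_zero_check, buffer_is_zero_sketch) is hypothetical and not taken from
cutils, need_prefetch() is a placeholder that always returns false, and
__attribute__((constructor)) is a GCC/Clang extension.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical names throughout; none of these come from QEMU's cutils. */
typedef bool (*zero_check_fn)(const unsigned char *buf, size_t len);

/* Plain byte-wise check: true if the whole buffer is zero. */
static bool zero_check_plain(const unsigned char *buf, size_t len)
{
    for (size_t i = 0; i < len; i++) {
        if (buf[i] != 0) {
            return false;
        }
    }
    return true;
}

/* Stand-in for a prefetching variant: one prefetch, then the plain check. */
static bool zero_check_prefetch(const unsigned char *buf, size_t len)
{
    __builtin_prefetch(buf);
    return zero_check_plain(buf, len);
}

/* Placeholder: a real version would inspect the CPU (e.g. via MIDR_EL1). */
static bool need_prefetch(void)
{
    return false;
}

/* Safe default, so calls made before the constructor runs still work. */
static zero_check_fn zero_check = zero_check_plain;

/* Runs once at load time, before main(), and picks the implementation. */
static void __attribute__((constructor)) init_zero_check(void)
{
    zero_check = need_prefetch() ? zero_check_prefetch : zero_check_plain;
}

/* Callers go through the pointer; no per-call platform check is needed. */
bool buffer_is_zero_sketch(const unsigned char *buf, size_t len)
{
    return zero_check(buf, len);
}
```

With this shape the platform check runs once at load time, and no ifunc
support is needed from libc.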

Also, creating a separate function that repeats most of the code just for
prefetch does not look good. So I suggest putting the prefetch check outside
the for loop and writing the loop twice, once with prefetch and once without.

I profiled and found that a single check inside the loop adds a 100ms delay
for an 8GB RAM migration. So moving the check outside the loop is enough.

Ex:

    if (need_prefetch()) {

        prefetch_vector(p, 0);

        for (i = BUFFER_FIND_NONZERO_OFFSET_UNROLL_FACTOR;
             i < len / sizeof(VECTYPE);
             i += BUFFER_FIND_NONZERO_OFFSET_UNROLL_FACTOR) {

            prefetch_vector_loop(p, i);

            VECTYPE tmp0 = VEC_OR(p[i + 0], p[i + 1]);
            VECTYPE tmp1 = VEC_OR(p[i + 2], p[i + 3]);
            VECTYPE tmp2 = VEC_OR(p[i + 4], p[i + 5]);
            VECTYPE tmp3 = VEC_OR(p[i + 6], p[i + 7]);
            VECTYPE tmp01 = VEC_OR(tmp0, tmp1);
            VECTYPE tmp23 = VEC_OR(tmp2, tmp3);
            if (!ALL_EQ(VEC_OR(tmp01, tmp23), zero)) {
                break;
            }
        }

    } else {

        for (i = BUFFER_FIND_NONZERO_OFFSET_UNROLL_FACTOR;
             i < len / sizeof(VECTYPE);
             i += BUFFER_FIND_NONZERO_OFFSET_UNROLL_FACTOR) {

            VECTYPE tmp0 = VEC_OR(p[i + 0], p[i + 1]);
            VECTYPE tmp1 = VEC_OR(p[i + 2], p[i + 3]);
            VECTYPE tmp2 = VEC_OR(p[i + 4], p[i + 5]);
            VECTYPE tmp3 = VEC_OR(p[i + 6], p[i + 7]);
            VECTYPE tmp01 = VEC_OR(tmp0, tmp1);
            VECTYPE tmp23 = VEC_OR(tmp2, tmp3);
            if (!ALL_EQ(VEC_OR(tmp01, tmp23), zero)) {
                break;
            }
        }
    }

Also, if you want to make the prefetch common to all arm64 platforms, note
that the Thunder cache line is 128 bytes, so the prefetch is issued at a
128-byte index. On a platform with a 64-byte cache line, this prefetch will
fill only one 64-byte line instead of the 128 bytes the loop consumes per
iteration.
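
To make the cache-line arithmetic concrete, here is a rough sketch of a
prefetch that covers the full 128 bytes of one loop iteration whatever the
line size is. The names BYTES_PER_ITERATION, prefetches_per_iteration and
prefetch_iteration are hypothetical, the 128-byte figure assumes 8 vectors of
16 bytes per iteration as in the loop above, and __builtin_prefetch is the
GCC/Clang builtin. I have not measured this.

```c
#include <stddef.h>

/* Assumption: each unrolled iteration reads 8 vectors of 16 bytes. */
#define BYTES_PER_ITERATION 128

/*
 * Number of cache lines one iteration spans for a given line size.
 * cache_line_size must be non-zero.
 */
static inline size_t prefetches_per_iteration(size_t cache_line_size)
{
    return (BYTES_PER_ITERATION + cache_line_size - 1) / cache_line_size;
}

/*
 * Issue one prefetch per cache line so that all 128 bytes starting at
 * 'next' are requested: one prefetch with 128-byte lines, two with 64-byte
 * lines. 'next' must point at least BYTES_PER_ITERATION bytes before the
 * end of the buffer. cache_line_size must be non-zero.
 */
static inline void prefetch_iteration(const char *next, size_t cache_line_size)
{
    for (size_t off = 0; off < BYTES_PER_ITERATION; off += cache_line_size) {
        __builtin_prefetch(next + off);
    }
}
```

Whether the extra prefetch on 64-byte-line platforms is a win would need
profiling on such a platform.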

> That seems like it would be a worthwhile change since
> (a) I think it's easier to understand than ifunc magic
> (b) it means we don't unnecessarily restrict ourselves to a libc
> with ifunc support (musl libc doesn't do ifuncs, for instance)
>
> thanks
> -- PMM


