qemu-arm

From: Vijay Kilari
Subject: Re: [Qemu-arm] [Qemu-devel] [RFC PATCH v1 2/2] utils: Add prefetch for Thunderx platform
Date: Fri, 12 Aug 2016 17:02:00 +0530

On Sat, Aug 6, 2016 at 3:47 PM, Richard Henderson <address@hidden> wrote:
> On 08/02/2016 03:50 PM, address@hidden wrote:
>>
>> +#define VEC_PREFETCH(base, index) \
>> +        asm volatile ("prfm pldl1strm, [%x[a]]\n" : : [a]"r"(&base[(index)]))
>
>
> Is this not __builtin_prefetch(base + index) ?
>
> I.e. you can define this generically for all targets.

__builtin_prefetch() is available only in gcc 5.3 for arm64.
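If that gcc limitation holds, the macro can still be defined generically: use __builtin_prefetch() wherever the compiler is expected to provide it, and fall back to the patch's inline asm only for older gcc on arm64. The sketch below takes the 5.3 cutoff from the claim above rather than from verified compiler history, uses a read prefetch (rw = 0) with locality 0 as the closest generic counterpart to the streaming hint in "prfm pldl1strm", and introduces sum_prefetched() and AHEAD as hypothetical names for illustration only.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical portable VEC_PREFETCH.  The "gcc < 5.3 on arm64" cutoff is
 * an assumption taken from the thread, not a verified compiler fact.
 */
#if defined(__aarch64__) && defined(__GNUC__) && !defined(__clang__) && \
    (__GNUC__ < 5 || (__GNUC__ == 5 && __GNUC_MINOR__ < 3))
/* Older gcc on arm64: keep the patch's streaming prefetch in inline asm. */
#define VEC_PREFETCH(base, index) \
    __asm__ __volatile__("prfm pldl1strm, [%x[a]]\n" : : [a] "r"(&(base)[(index)]))
#elif defined(__GNUC__) || defined(__clang__)
/* Read prefetch (rw = 0) with no temporal locality (locality = 0). */
#define VEC_PREFETCH(base, index) __builtin_prefetch(&(base)[(index)], 0, 0)
#else
/* Unknown compiler: a prefetch is only a hint, so doing nothing is valid. */
#define VEC_PREFETCH(base, index) ((void)0)
#endif

/* Hypothetical prefetch distance, in elements. */
#define AHEAD 16

/* Hypothetical helper: sums n elements while prefetching AHEAD elements in
 * advance.  The bounds check keeps &p[i + AHEAD] inside the array, since C
 * only allows pointers up to one element past the end. */
long sum_prefetched(const long *p, size_t n)
{
    long sum = 0;

    assert(p != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (i + AHEAD < n) {
            VEC_PREFETCH(p, i + AHEAD);
        }
        sum += p[i];
    }
    return sum;
}
```

Prefetching never changes the result, so the function returns the same sum on every branch of the #if; only the generated instructions differ.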

>
>> +#if defined (__aarch64__)
>> +    do_prefetch = is_thunder_pass2_cpu();
>> +    if (do_prefetch) {
>> +        VEC_PREFETCH(p, 8);
>> +        VEC_PREFETCH(p, 16);
>> +        VEC_PREFETCH(p, 24);
>> +    }
>> +#endif
>> +
>>      for (i = BUFFER_FIND_NONZERO_OFFSET_UNROLL_FACTOR;
>>           i < len / sizeof(VECTYPE);
>>           i += BUFFER_FIND_NONZERO_OFFSET_UNROLL_FACTOR) {
>> +
>> +#if defined (__aarch64__)
>> +        if (do_prefetch) {
>> +            VEC_PREFETCH(p, i+32);
>> +            VEC_PREFETCH(p, i+40);
>> +        }
>> +#endif
>> +
>
>
> Surely we shouldn't be adding a conditional around a prefetch inside of the
> inner loop.  Does it really hurt to perform this prefetch for all aarch64
> cpus?

Prefetch is only required on the ThunderX pass2 platform. I will remove this
condition check.
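One way to keep the per-CPU decision out of the inner loop is to pick the loop variant once, for example at startup, and call it through a function pointer. This is a sketch under assumptions: a plain uint64_t scan stands in for QEMU's VECTYPE loop, find_nonzero_init(), find_nonzero(), the two variants and PF_DISTANCE are hypothetical names, and the want_prefetch flag stands in for a check such as is_thunder_pass2_cpu() from the patch. It uses the GCC/clang __builtin_prefetch(), subject to the compiler-version caveat above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define PF_DISTANCE 16 /* hypothetical prefetch distance, in elements */

/* Variant without prefetch: index of the first nonzero element, or n. */
static size_t find_nonzero_plain(const uint64_t *p, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        if (p[i]) {
            return i;
        }
    }
    return n;
}

/* Variant with prefetch: same result, and no CPU test inside the loop. */
static size_t find_nonzero_prefetch(const uint64_t *p, size_t n)
{
    size_t i = 0;

    /* Main loop: p[i + PF_DISTANCE] is always inside the buffer here. */
    for (; i + PF_DISTANCE < n; i++) {
        __builtin_prefetch(&p[i + PF_DISTANCE], 0, 0);
        if (p[i]) {
            return i;
        }
    }
    /* Tail: no prefetch, so nothing past the end is touched. */
    for (; i < n; i++) {
        if (p[i]) {
            return i;
        }
    }
    return n;
}

/* Selected once; the default needs no CPU detection. */
static size_t (*find_nonzero_impl)(const uint64_t *, size_t) = find_nonzero_plain;

/* Hypothetical init hook: call once with the result of a CPU check. */
void find_nonzero_init(bool want_prefetch)
{
    find_nonzero_impl = want_prefetch ? find_nonzero_prefetch : find_nonzero_plain;
}

size_t find_nonzero(const uint64_t *p, size_t n)
{
    assert(p != NULL || n == 0);
    return find_nonzero_impl(p, n);
}
```

The indirect call costs one branch per buffer rather than one per loop iteration, which is the point of hoisting the check.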

>
> I'll note that you're also prefetching too much, off the end of the block,
> and that you're probably not prefetching far enough.  You'd need to break
> off the last iteration(s) of the loop.
>
> I'll note that you're also prefetching too close.  The loop operates on
> 8*vecsize units.  In the case of aarch64, 128 byte units.

The 128-byte unit is specific to ThunderX. I will move this into a
ThunderX-specific function.

> Both i+32 and i+40 are within the current loop.

I am dropping i+40.

> I believe you want to prefetch at
>
>   i + BUFFER_FIND_NONZERO_OFFSET_UNROLL_FACTOR * N
>
> where N is the number of iterations in advance to be fetched.  Probably N is
> 1 or 2, unless the memory controller is really slow.
>
>
> r~
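Richard's suggestion can be sketched as follows: each iteration tests one unrolled chunk of UNROLL elements and prefetches the chunk N iterations ahead, at i + UNROLL * N, while the last N chunks are split off into a loop without prefetching so no address beyond the buffer is used. This is an illustration under assumptions: uint64_t elements stand in for VECTYPE, UNROLL and PF_ITERS are hypothetical stand-ins for BUFFER_FIND_NONZERO_OFFSET_UNROLL_FACTOR and N, find_nonzero_chunk() is a hypothetical name, n must be a multiple of UNROLL, and the result is an element index rather than a byte offset.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define UNROLL   8 /* stand-in for BUFFER_FIND_NONZERO_OFFSET_UNROLL_FACTOR */
#define PF_ITERS 2 /* N: iterations ahead to prefetch (1 or 2 per the review) */

/* Returns nonzero if any of the UNROLL elements starting at p is nonzero. */
static int chunk_nonzero(const uint64_t *p)
{
    uint64_t acc = 0;

    for (int k = 0; k < UNROLL; k++) {
        acc |= p[k];
    }
    return acc != 0;
}

/* Element index of the first UNROLL-sized chunk containing a nonzero
 * element, or n if there is none.  n must be a multiple of UNROLL. */
size_t find_nonzero_chunk(const uint64_t *p, size_t n)
{
    const size_t ahead = (size_t)UNROLL * PF_ITERS;
    size_t i = 0;

    assert(n % UNROLL == 0);

    /* Main loop: the chunk at i + ahead lies entirely inside the buffer. */
    for (; i + ahead < n; i += UNROLL) {
        __builtin_prefetch(&p[i + ahead], 0, 0);
        if (chunk_nonzero(&p[i])) {
            return i;
        }
    }
    /* Last (at most) PF_ITERS chunks: broken off, with no prefetch. */
    for (; i < n; i += UNROLL) {
        if (chunk_nonzero(&p[i])) {
            return i;
        }
    }
    return n;
}
```

With PF_ITERS = 2 the prefetch runs two chunks ahead of the one being tested; whether 1 or 2 works better depends on memory latency, as the review notes.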


