From: Richard Henderson
Subject: Re: [Qemu-devel] [RFC PATCH v1 2/2] utils: Add prefetch for Thunderx platform
Date: Sat, 6 Aug 2016 15:47:28 +0530
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:45.0) Gecko/20100101 Thunderbird/45.2.0

On 08/02/2016 03:50 PM, address@hidden wrote:
+#define VEC_PREFETCH(base, index) \
+        asm volatile ("prfm pldl1strm, [%x[a]]\n" : : [a]"r"(&base[(index)]))

Is this not __builtin_prefetch(base + index) ?

I.e. you can define this generically for all targets.
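A minimal sketch of what that generic definition could look like, assuming a compiler that provides `__builtin_prefetch` (GCC and Clang do). The helper `or_reduce` is hypothetical and exists only to exercise the macro; it is not part of the patch.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Target-independent stand-in for the patch's inline-asm macro.
 * __builtin_prefetch also accepts optional "rw" and "locality"
 * arguments; which prfm hint a given locality value maps to on
 * aarch64 is compiler-dependent and not assumed here.
 */
#define VEC_PREFETCH(base, index) __builtin_prefetch((base) + (index))

/* Hypothetical helper: OR together n words, prefetching the start first. */
static uint64_t or_reduce(const uint64_t *p, size_t n)
{
    uint64_t acc = 0;

    if (n > 0) {
        VEC_PREFETCH(p, 0);   /* a hint only; it never changes the result */
    }
    for (size_t i = 0; i < n; i++) {
        acc |= p[i];
    }
    return acc;
}
```

Because the prefetch is only a hint, the function returns the same value with or without it; on targets with no prefetch instruction the builtin is expected to compile to nothing.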

+#if defined (__aarch64__)
+    do_prefetch = is_thunder_pass2_cpu();
+    if (do_prefetch) {
+        VEC_PREFETCH(p, 8);
+        VEC_PREFETCH(p, 16);
+        VEC_PREFETCH(p, 24);
+    }
+#endif
+
     for (i = BUFFER_FIND_NONZERO_OFFSET_UNROLL_FACTOR;
          i < len / sizeof(VECTYPE);
          i += BUFFER_FIND_NONZERO_OFFSET_UNROLL_FACTOR) {
+
+#if defined (__aarch64__)
+        if (do_prefetch) {
+            VEC_PREFETCH(p, i+32);
+            VEC_PREFETCH(p, i+40);
+        }
+#endif
+

Surely we shouldn't be adding a conditional around a prefetch inside of the inner loop. Does it really hurt to perform this prefetch for all aarch64 cpus?

I'll note that you're also prefetching too much, off the end of the block, and that you're probably not prefetching far enough. You'd need to break off the last iteration(s) of the loop.

The prefetch distance is also too short. The loop operates on units of 8*vecsize; in the case of aarch64, that is 128 bytes. Both i+32 and i+40 are within the current loop. I believe you want to prefetch at

  i + BUFFER_FIND_NONZERO_OFFSET_UNROLL_FACTOR * N

where N is the number of iterations in advance to be fetched. Probably N is 1 or 2, unless the memory controller is really slow.
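The two points above (prefetch N iterations ahead, and break off the last iterations so nothing is prefetched past the end) can be sketched as follows. This is an illustration under stated assumptions, not the QEMU code: `UNROLL` stands in for BUFFER_FIND_NONZERO_OFFSET_UNROLL_FACTOR, `PREFETCH_ITERS` is N, plain `uint64_t` words replace VECTYPE (so one group here is 64 bytes rather than 128), and `find_nonzero_group` is a hypothetical name.

```c
#include <stddef.h>
#include <stdint.h>

#define UNROLL          8   /* stand-in for the unroll factor */
#define PREFETCH_ITERS  2   /* N: iterations to prefetch ahead (assumed) */

/*
 * Return the start index of the first UNROLL-sized group containing a
 * nonzero word, or nwords if the buffer is all zero.
 * Assumes nwords is a multiple of UNROLL.
 */
static size_t find_nonzero_group(const uint64_t *p, size_t nwords)
{
    const size_t dist = (size_t)UNROLL * PREFETCH_ITERS;
    size_t i = 0;

    /* Main loop: i + dist is always inside the buffer, so the
     * prefetch needs no conditional. */
    for (; i + dist < nwords; i += UNROLL) {
        uint64_t acc = 0;

        __builtin_prefetch(p + i + dist);
        for (size_t j = 0; j < UNROLL; j++) {
            acc |= p[i + j];
        }
        if (acc) {
            return i;
        }
    }

    /* Tail: the last PREFETCH_ITERS iterations, with no prefetch,
     * so nothing beyond the end of the block is touched. */
    for (; i < nwords; i += UNROLL) {
        uint64_t acc = 0;

        for (size_t j = 0; j < UNROLL; j++) {
            acc |= p[i + j];
        }
        if (acc) {
            return i;
        }
    }
    return nwords;
}
```

Splitting the loop keeps the hot path free of branches around the prefetch, at the cost of duplicating the loop body for the final N iterations.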


r~


