[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
Re: [PATCH] util: optimise flush_idcache_range when the ppc host has coh
From: |
Richard Henderson |
Subject: |
Re: [PATCH] util: optimise flush_idcache_range when the ppc host has coherent icache |
Date: |
Thu, 19 May 2022 11:31:00 -0700 |
User-agent: |
Mozilla/5.0 (X11; Linux x86_64; rv:91.0) Gecko/20100101 Thunderbird/91.8.0 |
On 5/19/22 07:11, Nicholas Piggin wrote:
dcache writeback and icache invalidate is not required when icache is
coherent, a shorter fixed-length sequence can be used which just has to
flush and re-fetch instructions that were in-flight.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
---
I haven't been able to measure a significant performance difference
with this, qemu isn't flushing large ranges frequently so the old sequence
is not that slow.
Yeah, we should be flushing smallish regions (< 1-4k), as we generate TranslationBlocks.
And hopefully the translation cache is large enough that we spend more time executing
blocks than re-compiling them. ;-)
+++ b/include/qemu/cacheflush.h
@@ -28,6 +28,10 @@ static inline void flush_idcache_range(uintptr_t rx,
uintptr_t rw, size_t len)
#else
+#if defined(__powerpc__)
+extern bool have_coherent_icache;
+#endif
Ug. I'm undecided where to put this. I'm tempted to say...
--- a/util/cacheflush.c
+++ b/util/cacheflush.c
@@ -108,7 +108,16 @@ void flush_idcache_range(uintptr_t rx, uintptr_t rw,
size_t len)
... here in cacheflush.c, with a comment that the variable is defined and initialized in
cacheinfo.c.
I'm even more tempted to merge the two files to put all of the machine-specific cache data
in the same place, then this variable can be static. There's even an existing TODO
comment in cacheflush.c for aarch64.
b = rw & ~(dsize - 1);
+
+ if (have_coherent_icache) {
+ asm volatile ("sync" : : : "memory");
+ asm volatile ("icbi 0,%0" : : "r"(b) : "memory");
+ asm volatile ("isync" : : : "memory");
+ return;
+ }
Where can I find definitive rules on this?
Note that rx may not equal rw, and that we've got two virtual mappings for the same
memory, one for "data" that is read-write and one for "execute" that is read-execute.
(This split is enabled only for --enable-debug-tcg builds on linux, to make sure we don't
regress apple m1, which requires the split all of the time.)
In particular, you're flushing one icache line with the dcache address, and that you're
not flushing any of the other lines. Is the coherent icache thing really that we may
simply skip the dcache flush step, but must still flush all of the icache lines?
Without docs, "icache snoop" to me would imply that we only need the two barriers and no
flushes at all, just to make sure all memory writes complete before any new instructions
are executed. This would be like the two AArch64 bits, IDC and DIC, which indicate that
the two caches are coherent to Point of Unification, which leaves us with just the
Instruction Sequence Barrier at the end of the function.
+bool have_coherent_icache = false;
scripts/checkpatch.pl should complain this is initialized to 0.
static void arch_cache_info(int *isize, int *dsize)
{
+# ifdef PPC_FEATURE_ICACHE_SNOOP
+ unsigned long hwcap = qemu_getauxval(AT_HWCAP);
+# endif
+
if (*isize == 0) {
*isize = qemu_getauxval(AT_ICACHEBSIZE);
}
if (*dsize == 0) {
*dsize = qemu_getauxval(AT_DCACHEBSIZE);
}
+
+# ifdef PPC_FEATURE_ICACHE_SNOOP
+ have_coherent_icache = (hwcap & PPC_FEATURE_ICACHE_SNOOP) != 0;
+# endif
Better with only one ifdef, moving this second hunk up.
It would be nice if there were some kernel documentation for this...
r~