[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
Re: [PATCH] util: optimise flush_idcache_range when the ppc host has coh
From: |
Nicholas Piggin |
Subject: |
Re: [PATCH] util: optimise flush_idcache_range when the ppc host has coherent icache |
Date: |
Fri, 20 May 2022 10:04:04 +1000 |
Excerpts from Richard Henderson's message of May 20, 2022 4:31 am:
> On 5/19/22 07:11, Nicholas Piggin wrote:
>> dcache writeback and icache invalidate is not required when icache is
>> coherent, a shorter fixed-length sequence can be used which just has to
>> flush and re-fetch instructions that were in-flight.
>>
>> Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
>> ---
>>
>> I haven't been able to measure a significant performance difference
>> with this, qemu isn't flushing large ranges frequently so the old sequence
>> is not that slow.
>
> Yeah, we should be flushing smallish regions (< 1-4k), as we generate
> TranslationBlocks.
> And hopefully the translation cache is large enough that we spend more time
> executing
> blocks than re-compiling them. ;-)
>
>
>> +++ b/include/qemu/cacheflush.h
>> @@ -28,6 +28,10 @@ static inline void flush_idcache_range(uintptr_t rx,
>> uintptr_t rw, size_t len)
>>
>> #else
>>
>> +#if defined(__powerpc__)
>> +extern bool have_coherent_icache;
>> +#endif
>
> Ug. I'm undecided where to put this. I'm tempted to say...
>
>> --- a/util/cacheflush.c
>> +++ b/util/cacheflush.c
>> @@ -108,7 +108,16 @@ void flush_idcache_range(uintptr_t rx, uintptr_t rw,
>> size_t len)
>
> ... here in cacheflush.c, with a comment that the variable is defined and
> initialized in
> cacheinfo.c.
>
> I'm even more tempted to merge the two files to put all of the
> machine-specific cache data
> in the same place, then this variable can be static. There's even an
> existing TODO
> comment in cacheflush.c for aarch64.
That might be nice. Do you want me to look at doing that first?
>> b = rw & ~(dsize - 1);
>> +
>> + if (have_coherent_icache) {
>> + asm volatile ("sync" : : : "memory");
>> + asm volatile ("icbi 0,%0" : : "r"(b) : "memory");
>> + asm volatile ("isync" : : : "memory");
>> + return;
>> + }
>
> Where can I find definitive rules on this?
In processor manuals (I don't know if there are any notes about this in
the ISA, I would be tempted to say there should be since many processors
implement it).
POWER9 UM, 4.6.2.2 Instruction Cache Block Invalidate (icbi)
https://ibm.ent.box.com/s/tmklq90ze7aj8f4n32er1mu3sy9u8k3k
> Note that rx may not equal rw, and that we've got two virtual mappings for
> the same
> memory, one for "data" that is read-write and one for "execute" that is
> read-execute.
> (This split is enabled only for --enable-debug-tcg builds on linux, to make
> sure we don't
> regress apple m1, which requires the split all of the time.)
>
> In particular, you're flushing one icache line with the dcache address, and
> that you're
> not flushing any of the other lines. Is the coherent icache thing really
> that we may
> simply skip the dcache flush step, but must still flush all of the icache
> lines?
Yeah it's just a funny sequence the processor implements. It treats icbi
almost as a no-op except that it sets a flag such that the next isync
will flush and refetch the pipeline. It doesn't do any cache flushing.
> Without docs, "icache snoop" to me would imply that we only need the two
> barriers and no
> flushes at all, just to make sure all memory writes complete before any new
> instructions
> are executed. This would be like the two AArch64 bits, IDC and DIC, which
> indicate that
> the two caches are coherent to Point of Unification, which leaves us with
> just the
> Instruction Sequence Barrier at the end of the function.
>
>
>> +bool have_coherent_icache = false;
>
> scripts/checkpatch.pl should complain this is initialized to 0.
>
>
>> static void arch_cache_info(int *isize, int *dsize)
>> {
>> +# ifdef PPC_FEATURE_ICACHE_SNOOP
>> + unsigned long hwcap = qemu_getauxval(AT_HWCAP);
>> +# endif
>> +
>> if (*isize == 0) {
>> *isize = qemu_getauxval(AT_ICACHEBSIZE);
>> }
>> if (*dsize == 0) {
>> *dsize = qemu_getauxval(AT_DCACHEBSIZE);
>> }
>> +
>> +# ifdef PPC_FEATURE_ICACHE_SNOOP
>> + have_coherent_icache = (hwcap & PPC_FEATURE_ICACHE_SNOOP) != 0;
>> +# endif
>
> Better with only one ifdef, moving this second hunk up.
Will clean those bits up, thanks.
> It would be nice if there were some kernel documentation for this...
arm64 has kernel docs for hwcaps... powerpc probably should as well.
Good point, I might do a patch for that.
Thanks,
Nick