Hi,
I've been looking into other CRC32 implementations for architectures that don't have accelerated
carry-less polynomial multiplication (PCLMULQDQ on x86 SSE/AVX, VMULL on ARM NEON). There are
improvements we can make, but they trade memory for speed, both in lookup tables and in working
set. The "braiding" approach from zlib seems to be the best publicly available algorithm, and I've
also been developing my own algorithm, "chorba", which outperforms both slicing and braiding
depending on the circumstances (it ends up needing to be tuned per CPU).
Memory requirements:
Sarwate (Simon's method from RFC 1952): 1kB lookup table (sketched after this list)
Slicing by 8: 8kB lookup table
Braiding: 8kB lookup table
Chorba 1952: 2kB working set
Chorba 57082: 64kB working set
Chorba 703380: 1MB working set
Chorba 19824364: 32MB working set
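For concreteness, here is a minimal sketch of the Sarwate approach; the function and table
names are mine rather than from any particular implementation, and the 256 x 4-byte table is
where the 1kB figure comes from:

    #include <stddef.h>
    #include <stdint.h>

    /* Classic Sarwate CRC-32: one 256-entry table of 32-bit words
     * (the 1kB above), built from the reflected polynomial 0xEDB88320. */
    static uint32_t crc_table[256];

    static void make_crc_table(void)
    {
        for (uint32_t n = 0; n < 256; n++) {
            uint32_t c = n;
            for (int k = 0; k < 8; k++)
                c = (c & 1) ? 0xEDB88320u ^ (c >> 1) : c >> 1;
            crc_table[n] = c;
        }
    }

    static uint32_t crc32_sarwate(uint32_t crc, const unsigned char *buf,
                                  size_t len)
    {
        crc ^= 0xFFFFFFFFu;
        while (len--)
            crc = crc_table[(crc ^ *buf++) & 0xFF] ^ (crc >> 8);
        return crc ^ 0xFFFFFFFFu;
    }

Slicing by 8 and braiding extend the same idea with eight such tables (hence 8kB), processing
eight input bytes per step.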
In terms of performance, braiding seems to be 1.2-3x faster than slicing by 8 across different
systems, whereas chorba's results depend on the CPU: x86_64 improves as the working set grows
(it breaks even at 64kB and pulls ahead at 1MB and 32MB), ARM seems to be faster across the
board, and Raspberry Pi 4 is faster at 2kB but gets worse beyond that.
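Something along these lines is enough to reproduce this kind of comparison (a sketch: the
buffer size and pass count are arbitrary choices, and zlib's crc32() stands in for whichever
variant is under test):

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <time.h>
    #include <zlib.h>   /* link with -lz */

    #define BUF_SIZE (1 << 20)   /* 1MB of input per pass; arbitrary */
    #define PASSES   256

    int main(void)
    {
        unsigned char *buf = malloc(BUF_SIZE);
        if (!buf) return 1;
        memset(buf, 0xA5, BUF_SIZE);

        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        uLong crc = crc32(0L, Z_NULL, 0);
        for (int i = 0; i < PASSES; i++)
            crc = crc32(crc, buf, BUF_SIZE);
        clock_gettime(CLOCK_MONOTONIC, &t1);

        double secs = (t1.tv_sec - t0.tv_sec)
                    + (t1.tv_nsec - t0.tv_nsec) / 1e9;
        printf("crc=%08lx  %.1f MB/s\n", crc,
               (double)BUF_SIZE * PASSES / (1024.0 * 1024.0) / secs);
        free(buf);
        return 0;
    }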
What sort of parameters would be important, and what expectations would users/maintainers
have? For example, coreutils cksum reads into a rolling 65kB buffer and operates on that,
whereas gzip seems to take much larger chunks; are there any systems that would appreciate a
low-memory mode? Is it a downside that different architectures would need to choose different
parameters to benefit from the speedup?
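To make the buffering question concrete, the access pattern I have in mind looks roughly like
this (a sketch using zlib's crc32() as a stand-in; the 64kB buffer mirrors the cksum-style
rolling buffer mentioned above):

    #include <stdio.h>
    #include <zlib.h>   /* link with -lz */

    /* cksum-style streaming: the CRC routine never sees more than one
     * ~64kB chunk at a time, so an algorithm with a large working set
     * has to amortize its setup across many small calls. */
    int main(int argc, char **argv)
    {
        if (argc < 2) return 1;
        FILE *f = fopen(argv[1], "rb");
        if (!f) return 1;

        static unsigned char buf[64 * 1024];
        uLong crc = crc32(0L, Z_NULL, 0);
        size_t n;
        while ((n = fread(buf, 1, sizeof buf, f)) > 0)
            crc = crc32(crc, buf, (uInt)n);

        printf("%08lx\n", crc);
        fclose(f);
        return 0;
    }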