Hi Bruno,
Presumably you've read Pádraig's comment in the other thread that I mistakenly created. Three interesting points come out of it:
- coreutils is already GNU, so no copyright review is required. However, the code appears to live inside the cksum utility, so it could not be included directly if we were to reference coreutils as a submodule
- coreutils uses the slice-by-8 approach by default (and the original paper was written before 64-bit CPUs were widespread), so maybe we could ignore the bitness and just rely on whether the feature is switched on or off?
- the comment about using pclmul instructions (PCLMULQDQ has its own CPUID flag and has been available since Intel Westmere, around 2010, so it is quite old) is also worth looking into, as this approach doesn't need 8 KiB of tables in memory and should be in the same ballpark for performance
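For reference, here is a minimal sketch of what I mean by slice-by-8, written from the general description of the technique rather than taken from coreutils. It assumes the reflected CRC-32 polynomial 0xEDB88320 used by gzip and zlib; the names crc_table, crc32_init_tables and crc32_slice8 are made up for illustration. It also shows where the 8 KiB comes from: 8 tables of 256 32-bit entries.

```c
#include <stddef.h>
#include <stdint.h>

/* 8 tables x 256 entries x 4 bytes = 8 KiB (hypothetical name). */
static uint32_t crc_table[8][256];

/* Fill table 0 with the classic byte-at-a-time values, then derive
   tables 1..7, each advancing the previous table by one more byte. */
static void crc32_init_tables(void)
{
  for (uint32_t i = 0; i < 256; i++) {
    uint32_t c = i;
    for (int k = 0; k < 8; k++)
      c = (c & 1) ? (c >> 1) ^ 0xEDB88320u : c >> 1;
    crc_table[0][i] = c;
  }
  for (uint32_t i = 0; i < 256; i++)
    for (int t = 1; t < 8; t++) {
      uint32_t prev = crc_table[t - 1][i];
      crc_table[t][i] = (prev >> 8) ^ crc_table[0][prev & 0xFF];
    }
}

/* Update `crc` (0 for a fresh checksum) over `len` bytes of `buf`. */
static uint32_t crc32_slice8(uint32_t crc, const unsigned char *buf,
                             size_t len)
{
  crc = ~crc;
  while (len >= 8) {
    /* Assemble the two 32-bit words byte by byte so the result does
       not depend on host endianness or alignment. */
    uint32_t lo = crc ^ ((uint32_t) buf[0] | (uint32_t) buf[1] << 8
                         | (uint32_t) buf[2] << 16 | (uint32_t) buf[3] << 24);
    uint32_t hi = (uint32_t) buf[4] | (uint32_t) buf[5] << 8
                  | (uint32_t) buf[6] << 16 | (uint32_t) buf[7] << 24;
    /* Eight independent lookups per 8 input bytes. */
    crc = crc_table[7][lo & 0xFF] ^ crc_table[6][(lo >> 8) & 0xFF]
          ^ crc_table[5][(lo >> 16) & 0xFF] ^ crc_table[4][lo >> 24]
          ^ crc_table[3][hi & 0xFF] ^ crc_table[2][(hi >> 8) & 0xFF]
          ^ crc_table[1][(hi >> 16) & 0xFF] ^ crc_table[0][hi >> 24];
    buf += 8;
    len -= 8;
  }
  /* Remaining 0..7 bytes: plain byte-at-a-time loop using table 0. */
  while (len--)
    crc = (crc >> 8) ^ crc_table[0][(crc ^ *buf++) & 0xFF];
  return ~crc;
}
```

Nothing in the main loop depends on a 64-bit register, which is why I suspect the bitness check may not matter much in practice.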
As for your question on speed: comparing zstd (which uses zlib as its backend for gzip files) against gzip, I saw an improvement of maybe 30-40% when decompressing a 100MB file, although part of this is due to multithreading. gprof shows the CRC calculation taking maybe 40-50% of the CPU cycles. A 3x improvement in CRC speed (as per the original Intel slicing-by-8 paper) would remove two thirds of that share, which translates to a ~30% reduction in the time required to decompress a large file.
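The ~30% figure is just Amdahl's law applied to the gprof numbers above. As a quick sanity check (the function name is made up for illustration, and the inputs are my rough measurements, not a benchmark):

```c
/* Fraction of total run time saved when a component that accounts for
   `fraction` of the time becomes `speedup` times faster. */
static double time_saved(double fraction, double speedup)
{
  return fraction * (1.0 - 1.0 / speedup);
}
```

With a 3x speedup, time_saved(0.40, 3.0) is about 0.27 and time_saved(0.50, 3.0) is about 0.33, so roughly 27-33% depending on which end of the gprof estimate is closer to the truth.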
So for next steps, I can add the #defines for HOST_CPU_C_ABI_32BIT and an option to enable/disable the whole thing (is a whitelist or a blacklist the better approach for a new feature like this?), and then we can make sure everything is in a position to be merged.
As a later step, the coreutils pclmul implementation is also quite interesting. It seems simple enough to gate on -mavx and -mpclmul and to manage those flags in the makefiles too.
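To illustrate the compile-time side of that gating, here is a rough sketch. It relies on GCC and Clang predefining __PCLMUL__ and __AVX__ when -mpclmul and -mavx are passed; CRC_USE_PCLMUL and crc_backend are hypothetical names, not anything that exists in coreutils or gnulib today.

```c
/* Set only when the translation unit is built with -mpclmul -mavx.
   CRC_USE_PCLMUL is a hypothetical macro name. */
#if defined __PCLMUL__ && defined __AVX__
# define CRC_USE_PCLMUL 1
#else
# define CRC_USE_PCLMUL 0
#endif

/* Report which implementation this build would select. */
static const char *crc_backend(void)
{
  return CRC_USE_PCLMUL ? "pclmul" : "slice-by-8";
}
```

The makefile would then only need to add the two flags for the one source file that contains the pclmul code, leaving everything else built with the default flags.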
Cheers
Sam