bug-gnulib
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: Adding slice-by-4 and slice-by-8 to CRC32


From: Sam Russell
Subject: Re: Adding slice-by-4 and slice-by-8 to CRC32
Date: Mon, 14 Oct 2024 17:06:06 +0200

Hi Bruno,

Presumably you've read Pádraig's comment in the other thread that I mistakenly created, there are two interesting things from this:

- coreutils is already GNU so no copyright review required, although the code appears to be inside the cksum utility so it's not in a position to be included directly if we were to reference coreutils as a submodule
- coreutils uses the slice-by-8 approach by default (and the original paper was written before 64-bit CPUs were a thing) so maybe we could ignore the bitness and just rely on whether it's flagged on/off?
- the comment about using pclmul instructions (since SSE4.1 so quite old) is also worth looking into as this doesn't need 8kb of tables in memory and should be in the same ballpark of performance

As for your question on speed, I noticed between zstd (which uses zlib as a backend) and gzip there seems to be an improvement of maybe 30-40% for decompressing a 100MB file (part of this is due to multithreading though), and gprof shows the CRC calculation being maybe 40-50% of the CPU cycles so a 3x improvement (as per the original 8-slice intel paper) in CRC speed would translate to ~30% reduction in time required for decompressing a large file.

So for next steps, I can add the #defines for HOST_CPU_C_ABI_32BIT and an option to enable/disable the whole thing (is whitelist or blacklist a better approach for a new feature like this?), and then we make sure everything is in a position to be merged.

For future steps the coreutils pclmul implementation is also quite interesting, and that seems simple enough to just gate on -mavx and -mpclmul and manage them in the makefiles too.

Cheers
Sam

On Mon, 14 Oct 2024 at 15:53, Bruno Haible <bruno@clisp.org> wrote:
Hi Sam,

Thanks for the contribution offer!

> I've noticed that GZIP trails behind zlib in performance and part of this
> is down to the fact that zlib is using a more efficient CRC32
> implementation.

How much of a speedup do you obtain in gzip overall (not in CRC32 alone)
for large files, through these slice-by-4 and slice-by-8 techniques?

> the tables are generated by
> extending the code from RFC 1952 to generate the lookups for partial
> bitfields, this can be provided on request but it's not my finest work).

Yes, if we include your code, we would also include the generator. Doesn't
matter in which programming language it is written: C is fine, Python is fine,
Lisp is fine.

> - free software/patents: the Intel paper is from 2008; zlib had an
> independently-discovered version of slice-by-4 from 2002 so it seems
> unlikely there should be any patents encumbering this.

Good.

> the software is my
> own design and I am happy for it to be transferred to GNU and licensed
> accordingly

Yes, we would need a copyright assignment to the FSF for this code, as it
contains 40 lines of hand-written code (-> legally significant).

> Requests for help:
> - how do I get someone to review my code and then get it added to the
> codebase? is this done via Savannah, or the mailing list, do we email
> around PATCH files etc?

We do it through the mailing list. Patches in 'diff -u' or 'git diff' format
are fine, as are patch files produced by 'git format-patch'. As attachments,
please, not inline. Please provide a ChangeLog entry; we reuse that ChangeLog
entry for the commit message.

> - I assume we'll want to gate this functionality based on CPU ability
> (32-bit arithmetic is presumably fine but 64-bit makes no sense unless
> running on a system with a native 64-bit bus). What is the convention here
> for enforcing this at compile-time?

You could add the file m4/host-cpu-c-abi.m4 to the module, invoke
gl_HOST_CPU_C_ABI_32BIT at configure time, and then dispatch on
HOST_CPU_C_ABI_32BIT.

Probably, for embedded systems which don't want to spend 4 KB of cache,
it would be useful to have a #define for the entire optimization.

Bruno




reply via email to

[Prev in Thread] Current Thread [Next in Thread]