Re: feature request for coreutils: b2sum


From: Pádraig Brady
Subject: Re: feature request for coreutils: b2sum
Date: Tue, 1 Nov 2016 14:35:02 +0000
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:45.0) Gecko/20100101 Thunderbird/45.2.0

CC'ing blake2 folks probably offlist.
Response below...

On 01/11/16 02:18, Assaf Gordon wrote:
> Hello Pádraig and all,
> 
> I have some concerns about supporting different algorithms in b2sum.
> [long email ahead...]
> 
> First, usability of multiple algorithms.
> ----
>  From a usability POV, it is rarely (never?) the case that making a program 
> multithreaded or parallel, or optimizing for a different CPU type, changes the 
> output of the program in a user-noticeable way.

Yes, this is surprising, as I mentioned in
http://lists.gnu.org/archive/html/coreutils/2016-10/msg00024.html
It might have been nice to have blake2bp as the default (and only?) algorithm,
fast on most machines but degrading gracefully
on single cores to the same digest value.
My timing results in the above thread showed blake2bp and blake2b
were close in speed on a single core.
The library could provide access to all algorithms for
more fine-grained control of performance, for apps that
don't need to export the digest value.

But given there are existing implementations in `openssl blake2b`
and an existing `b2sum` etc., I thought it most pragmatic to match the
current algorithms.

Your testing below shows that the blake2bp digest is
dependent on the OMP_DYNAMIC env variable. That's just wrong:
whether it's a bug in the implementation or in its assumptions,
it's a bug.
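The underlying principle is that a digest must be a pure function of the input bytes, never of scheduling or buffering. A small sketch with Python's hashlib blake2b (the serial variant) shows the expected behavior: feeding the same data in arbitrary chunk sizes, as a threaded reader might deliver it, must yield the identical digest.

```python
import hashlib

data = b"some input" * 1000

# One-shot hash of the whole buffer.
one_shot = hashlib.blake2b(data).hexdigest()

# Same bytes fed in irregular chunks, as a parallel reader might deliver them.
h = hashlib.blake2b()
for start in range(0, len(data), 997):  # arbitrary odd chunk size
    h.update(data[start:start + 997])

# The digest depends only on the byte stream, not on how it was chunked.
assert h.hexdigest() == one_shot
```

Any variant whose digest changes with OMP_DYNAMIC fails this property by construction.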

> That is, "gzip" vs "pigz", "bzip2" vs "pbzip2", "make" vs "make -j", "sort" 
> vs "sort --parallel" - all produce the same output from the user POV (I'll 
> note that gzip vs pigz do produce binary different output, but the content 
> from the user POV is equivalent).
> Similarly, optimizing for 32 bits vs 64 bits should never alter the result from 
> a user POV.
> 
> I'm not saying these optimizations are not worthwhile, but I do worry that in 
> practice they constitute a completely different checksum.
> Saying in the manual:
>     "blake2bp - This is like ‘blake2b’ but takes advantage of multiple CPU 
> cores."
> Gives the impression that it is equivalent, and it is not.
> The warning that is mentioned in the preceding paragraph will not be noticed 
> by most users, and it's not in the "--help" or man page (and even if it was, 
> users would still ignore it).
> It's one thing if there's a "gotcha" about program behavior which existed for 
> decades, preceding standardization, and we have to live with it for 
> backward compatibility.
> But this is a new feature, and we should avoid making it confusing.
>   
> 
> In a previous message you wrote:
>> I might also drop the blake2s and blake2sp implementations
>> as 32 bit can get equivalent but somewhat slower functionality with
>> `b2sum -l 256 -a blake2b`
> 
> But it is not equivalent - it's a different result:
> 
>      $ echo a | ./src/b2sum -l256 -ablake2b
>      be29a54b934581ab434fde713c16db07c3e0124a371daca7c33588be7526630e  -
>      $ echo a | ./src/b2sum -ablake2s
>      1e61fdb001508ebd3c559545024e7a58a67aeb124308a24bbd13f7ed09d9f2c7  -
> 
> If by "equivalent" you mean just "happens to be the same length of digest but 
> different value",
> then I fear many non-tech-savvy users would not be aware of this distinction.
> 
> Coreutils has a long history of dealing with the full spectrum of users: from 
> computer experts to complete novices.
> I'm sure you have more experience than me with users' confusion, but to me 
> this sounds like inviting trouble.
> 
> This also means that if someone with a slow 32-bit computer generates 
> blake2 checksums using "-a blake2s" (the optimal option for him),
> then he forces any future user of his checksums to use the single-core 32-bit 
> algorithm "blake2s", regardless of any newer hardware they have.
> 
> Also,
> Up until now, the name of the utility (e.g. "sha1sum") was enough to identify 
> the program.
> So if someone named their checksum files "data.sha1sum" - it was a good hint 
> to use "sha1sum -c" - and the results should be valid.
> But calling the file "data.b2sum" is not sufficient anymore - there are at 
> least 4 variants of "b2sum".
> And more variants if we consider "--length"...

Yes, the --tag format is now required to identify the algorithm.
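Assaf's example above can be reproduced with Python's hashlib, which implements the same BLAKE2 parameter block: the digest length is part of the parameters, so a 256-bit blake2b is a different hash than blake2s even though the outputs are the same length. (The message `b"a\n"` below is what `echo a` pipes in.)

```python
import hashlib

msg = b"a\n"  # what `echo a` pipes into b2sum

b2b_256 = hashlib.blake2b(msg, digest_size=32).hexdigest()  # blake2b -l 256
b2s = hashlib.blake2s(msg).hexdigest()                      # blake2s (256-bit)

assert len(b2b_256) == len(b2s) == 64  # same length...
assert b2b_256 != b2s                  # ...different digest values
```

This is why an untagged checksum line cannot be disambiguated by length alone.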

> Second, the "--length" parameter.
> ----
> This is something that will be relevant for future SHA3 and perhaps others, 
> so I think we might want to take a second look and think ahead.
> Up until now,
> with the existing utilities md5sum,sha1sum,sha224,sha384,sha512 - the length 
> of the digest in the checksum file was a usability hint regarding which 
> algorithm it is. The length was fixed: a checksum file with 40 characters 
> was sha1, 32 was md5, 128 was sha512.
> 
> Whether we like it or not, it was a good hint.
> 
> With sha3 and blake2 the digest also defaults to 512 bits, so "sha512" 
> loses that useful hint - but that's unavoidable.
> A bigger problem is that with variable-length digests in the same 
> utility,
> it becomes much harder to know what the correct parameters are.
> I think that automatic length detection should be turned on automatically, 
> even without "--tag".

Yes I think that would be useful for non --tag format.
Though with b2sum supporting -a, it would not be enough.
It's probably worth doing though.
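A length-based detection step could look something like the sketch below. The names (`DIGEST_LEN_TO_ALGO`, `guess_algorithms`) are hypothetical, not anything in coreutils; the comments note where -a and --length make the mapping ambiguous, which is exactly why length alone stops being enough.

```python
# Hypothetical mapping from hex-digest length to candidate algorithms,
# mirroring the fixed widths of the classic utilities.
DIGEST_LEN_TO_ALGO = {
    32: ["md5"],
    40: ["sha1"],
    56: ["sha224"],
    64: ["sha256"],   # but also blake2s, or blake2b -l 256 ...
    96: ["sha384"],
    128: ["sha512"],  # but also blake2b, sha3-512 ...
}

def guess_algorithms(hex_digest: str) -> list[str]:
    """Return candidate algorithms for an untagged checksum line."""
    return DIGEST_LEN_TO_ALGO.get(len(hex_digest), [])
```

For a 40-character digest this yields a unique answer; for 64 or 128 characters it can only narrow the field, not decide.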

> Since I also believe that machines should work harder than people, it would 
> be nice if we have an "--autodetect" kind of parameter
> that will automatically test multiple algorithms based on the given digest 
> length - it just takes more CPU time,
> but can save some annoyances for users.

That would reduce the veracity of --check I think,
because it would become theoretically possible to replace a file
with one whose checksum matches under a different algorithm.

> Third, multi-threaded scalability.
> ---
> the multi-threaded version of 'blake2bp' uses fixed number of threads using 
> openMP:
> 
>       #pragma omp parallel shared(S), num_threads(PARALLELISM_DEGREE)
> 
> With PARALLELISM_DEGREE being 4 for 'blake2bp' and 8 for 'blake2sp'.
> 
> But the implementation is not really scalable (in the sense that adding more 
> threads should make things faster while producing an identical result).
> On one hand, reducing PARALLELISM_DEGREE (e.g. to 2) results in a different 
> value,
> and increasing it (e.g. to 8) results in a segfault (at least on my machine 
> with 4 cores).
> 
> On the other hand, having fixed 4 cores means 'blake2bp' ALWAYS creates 3 
> threads (calls "clone(2)" three times),
> regardless of how many cores the machine actually has. This seems like bad 
> form IMHO.
> 
> It also means, if I understand the code correctly, that we can never increase 
> the number of threads - it is fixed at four forever,
> otherwise it will give different checksums. So regardless of future hardware 
> improvements, we're stuck at 4 threads because it was a reasonable value in 
> 2016.

Yes I didn't like the thread hardcoding at all, and was going to analyze before 
release.
Zooko/Samuel, is the digest value dependent on number of threads?
Did parallelism efficiency fall off after 4?
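For what it's worth, the dependence on lane count is structural: in BLAKE2's tree mode the fanout is encoded in the parameter block of every node, so a 2-lane and a 4-lane tree can never produce the same digest. The sketch below builds a blake2bp-style tree with Python's hashlib tree-hashing parameters; it is my reading of the BLAKE2 tree construction, and the details (block distribution, leaf flags) may not match the reference blake2bp byte-for-byte, but it shows why the parallelism degree is baked into the value.

```python
import hashlib

def blake2bp_sketch(data: bytes, lanes: int = 4) -> str:
    """Tree-hash `data` over `lanes` blake2b leaves, then hash the
    leaf digests with a root node.  Illustrative only."""
    params = dict(digest_size=64, fanout=lanes, depth=2, inner_size=64)
    # Leaves: node_depth 0, offsets 0..lanes-1; the final lane is marked last.
    leaves = [
        hashlib.blake2b(node_offset=i, node_depth=0,
                        last_node=(i == lanes - 1), **params)
        for i in range(lanes)
    ]
    # Distribute the input round-robin in 128-byte blocks across the lanes.
    for i in range(0, len(data), 128):
        leaves[(i // 128) % lanes].update(data[i:i + 128])
    # Root: node_depth 1, hashes the concatenated leaf digests.
    root = hashlib.blake2b(node_offset=0, node_depth=1, last_node=True, **params)
    for leaf in leaves:
        root.update(leaf.digest())
    return root.hexdigest()

# Deterministic for a fixed lane count, but the lane count changes the digest.
assert blake2bp_sketch(b"x" * 1000) == blake2bp_sketch(b"x" * 1000)
assert blake2bp_sketch(b"x" * 1000, lanes=2) != blake2bp_sketch(b"x" * 1000, lanes=4)
```

So "more threads, same digest" is only achievable by keeping the tree shape fixed and letting however many OS threads service the fixed lanes.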

> Last but not least, OMP_DYNAMIC
> ---
> 
> Using a fixed number in "#pragma omp parallel num_threads()" in OpenMP causes 
> the code to ignore OMP_NUM_THREADS,
> but it still adheres to OMP_DYNAMIC. If set to TRUE, the system doesn't have 
> to provide 4 threads,
> and the result of the digest is different:
> 
> On a machine with 2 cores:
> 
>       $ nproc
>       2
> 
>       $ seq 10000 | OMP_DYNAMIC=FALSE ./src/b2sum -ablake2bp
>       824b215...
> 
>       $ seq 10000 | OMP_DYNAMIC=TRUE ./src/b2sum -ablake2bp
>       6f032b1...
> 
> Similar results happen on machines with 4 or more cores if the machine is 
> overloaded.
> 
> I also suspect, but am not sure, that OpenMP allows the implementation 
> to terminate the program
> if the number of requested threads isn't available and OMP_DYNAMIC=FALSE.

Yes on a 4 core machine you can see that with:
  $ seq 10000 | OMP_DYNAMIC=TRUE taskset -a 03 ./src/b2sum -ablake2bp
  6f032b1...

That's a bug.

> To conclude,
> I recommend implementing one, and only one, blake2b hash variant.
> Chose the official reference one (if there is one),
> and stick to it regardless of threads/bit-size.
> In the future, I'm sure there will be both hardware and software improvements 
> which could speed up the implementation while keeping the exact same 
> algorithm intact for backward compatibility.

I'm inclined to agree.

I was on the fence about including the 's' variants,
but given the perf advantage of the 'p' variants,
it made sense to support all --algorithm variants if supporting any.
Though if the 'p' variants can't be made to deal sensibly
with OMP_DYNAMIC and OMP_NUM_THREADS, then we should
just support 'blake2b'.

Note that it would be awkward to introduce -a in the future,
as users would have to be wary that newer checksum files
would be incompatible with older b2sum(1).

thanks for the detailed analysis!
Pádraig


