lzip-bug

plzip: manual gives very false numbers, real defaults are huge!


From: Steffen Nurpmeso
Subject: plzip: manual gives very false numbers, real defaults are huge!
Date: Sat, 04 May 2024 20:38:15 +0200
User-agent: s-nail v14.9.24-621-g0d1e55f367

Hello again.

Sorry for disturbing the harmony, but I have a problem with plzip.
I switched to it from zstd (even though for the backups the
reduction is not that tremendous, and where I would really need it,
e.g. compressing text files for tests, I cannot use it because it
is not available everywhere, not even optionally), but while
compressing a 70 MiB file I realized it was not running
multithreaded.  So I looked, and indeed, if I do

  plzip -9 -n4 -c < 76-MiB-file > au.lz

(-n4 I had never used before; the manual, however, says two is the
default, which is not true, mind you), then for two top(1) cycles
plzip goes to 132 and 157 percent (of 400, say), but then falls
back to 100 percent, i.e. a single processor, and stays there
until the end (quite some time).  For a 400 MiB file it runs at
400 percent from dusk to dawn, so to say.
But the thing is, if I do

  plzip -9 -n4 -B16000000 -c < 76-MiB-file > au.lz

i.e. use the values the manual describes, then I get 263 and then
400 percent until the end, in my poor man's top(1).
The responsible code seems to be

  const bool fast = encoder_options.dictionary_size == 65535 &&
                      encoder_options.match_len_limit == 16;
  if( data_size <= 0 )
    {
    if( fast ) data_size = 1 << 20;
    else data_size = 2 * std::max( 65536, encoder_options.dictionary_size );
    }

so the dictionary_size must be giant, right?
There is also

      case 'B': data_size = getnum( arg, pn, 2 * LZ_min_dictionary_size(),
                                    2 * LZ_max_dictionary_size() ); break;

That range is somehow different, I thought (what do I know), so
I added some printf()s, and now I see

  #?0|kent:plzip-1.11$ ./plzip -9 -n4 -c <XXX > au.lz
  PRE IF: FAST=0, dictsize=33554432 data_size=0
  POST IF: FAST=0, dictsize=33554432 data_size=67108864
  ^C

Oha, a 67108864 byte (64 MiB) data size!  That is four times the
16 MiB the manual gives, and likewise for the --dictionary-size
manual entry.  Which explains the bad CPU saturation for all but
the fewest files.  (The difference between -B16MiB and -B32MiB is
69417528 versus 69332306 output bytes .. not much, btw.)

Please let me state that a default of 64 MiB for the data size is
really too much, especially given the minimal difference in the
resulting file sizes.  In *my* opinion, the defaults should
satisfy the occasional "I wanna compress something" lady (Diaz is
south of Texas, is it?), instead of those "I want to compress my
10 GiB scientific database file" specialists.

Thank you!


P.S.: if malloc() hookability is really not possible, how about
adding _buffer variants that take a buffer and a length, and carve
any necessary structures plus the internal buffers out of that
buffer?  That is to say, no memory allocation whatsoever, simply
using a user-provided buffer for everything?  With a
LZIP_THIS_MUCH_MINIMUM constant defining a minimum size (say maybe
64+1 KiB), and, if sizeof(void*) alignment is not sufficient, a
LZIP_REQUIRED_ALIGNMENT?  I love the simplicity of the interface
otherwise, and such (CRC-32, aieeee), but these malloc()s are
pretty tough stuff.  (And seem so useless, too!)

--steffen
|
|Der Kragenbaer,                The moon bear,
|der holt sich munter           he cheerfully and one by one
|einen nach dem anderen runter  wa.ks himself off
|(By Robert Gernhardt)


