lzip-bug
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: plzip: manual gives very false numbers, real defaults are huge!


From: Steffen Nurpmeso
Subject: Re: plzip: manual gives very false numbers, real defaults are huge!
Date: Tue, 07 May 2024 02:51:17 +0200
User-agent: s-nail v14.9.24-621-g0d1e55f367

Hello!

Antonio Diaz Diaz wrote in
 <6638EA2D.3000801@gnu.org>:
 |Steffen Nurpmeso wrote:
 |> Thanks for the quick response on a Saturday.
 |
 |You are welcome. :-)

Thank you, you are too.  (My only problem is that due to unknown
reasons Google always beats my IP with "suspicious activity" after
i post here, and my last post to my own mailing-list did not come
in via @gmail.com thus.  Sigh.)

 |>   Note that the number of usable threads is limited by file size;
 |>   on files larger than  a  few  GB plzip  can  use  hundreds  of
 |>   processors, but on files of only a few MB plzip is no faster than lzip.
 |>
 |> Ok "you get scaling effects", but 70 MiB is not "a few MiB".
 |
 |The above means "on files of only a few MB plzip can't be faster than \
 |lzip, 
 |no matter what options you use". Of course, at high compression levels the 
 |"few MB" become "several tens of MB".

I think i now have understood your approach.
But i claim it is not what people would expect.
For example, if i hack just a little bit i get on my i5 laptop

  #?0|kent:plzip-1.11$ time ./plzip -9 -n4 x1
  instat=0x7fff32eb6800 inreg=1 sat=0 cf=680412 x=170103 tmp=67108864
  USING 67108864

  real    0m37.743s
  user    0m37.737s
  sys     0m0.273s

  #?0|kent:plzip-1.11$ plzip -d x1.lz

But compare it to that, not to think about a Raspberry or so:

  #?0|kent:plzip-1.11$ time ./plzip -9 -n0 x1
  instat=0x7ffe538049d0 inreg=1 sat=1 cf=680412 x=170103 tmp=67108864
  USING 170103

  real    0m3.157s
  user    0m12.415s
  sys     0m0.087s

(Reversed display for show effect.)  That is for

  --- plzip-1.11.old/main.cc      2024-01-21 12:25:18.000000000 +0100
  +++ plzip-1.11/main.cc  2024-05-07 00:51:11.502821694 +0200
  @@ -767,6 +767,7 @@ int main( const int argc, const char * c
     int out_slots = 64;
     Mode program_mode = m_compress;
     Cl_options cl_opts;          // command-line options
  +  bool saturate = false;
     bool force = false;
     bool keep_input_files = false;
     bool recompress = false;
  @@ -848,7 +849,9 @@ int main( const int argc, const char * c
         case 'm': encoder_options.match_len_limit =
                     getnum( arg, pn, LZ_min_match_len_limit(),
                                      LZ_max_match_len_limit() ); break;
  -      case 'n': num_workers = getnum( arg, pn, 1, max_workers ); break;
  +      case 'n': num_workers = getnum( arg, pn, 0, max_workers );
  +                saturate = ( num_workers == 0 );
  +                break;
         case 'o': if( sarg == "-" ) to_stdout = true;
                   else { default_output_filename = sarg; } break;
         case 'q': verbosity = -1; break;
  @@ -980,9 +983,20 @@ int main( const int argc, const char * c
         infd_isreg ? ( in_stats.st_size + 99 ) / 100 : 0;
       int tmp;
       if( program_mode == m_compress )
  -      tmp = compress( cfile_size, data_size, encoder_options.dictionary_size,
  +      {
  +      tmp = data_size;
  +fprintf(stderr, "instat=%p inreg=%d sat=%d cf=%llu x=%llu tmp=%d\n",
  +               in_statsp, infd_isreg,
  +saturate, cfile_size,cfile_size / num_workers, (unsigned)tmp);
  +
  +      if( saturate && cfile_size != 0 && cfile_size / num_workers < 
(unsigned)tmp)
  +        tmp = cfile_size / num_workers;
  +
  +fprintf(stderr, "USING %d\n",tmp);
  +      tmp = compress( cfile_size, tmp, encoder_options.dictionary_size,
                         encoder_options.match_len_limit, num_workers,
                         infd, outfd, pp, debug_level );
  +      }
       else
         tmp = decompress( cfile_size, num_workers, infd, outfd, cl_opts, pp,
                           debug_level, in_slots, out_slots, infd_isreg,

I realized for the first time that standard input is treated in
a different way via the "one_to_one" mapping of yours.  Ie while
doing

  time ./plzip -9 -n0 -c < /tmp/t.tar.xz  > x1.lz

it occurred to me that the "struct stat" is not used at all for
stdin, which is a pity imho, especially since S_ISREG() is tested.

 |> 67 megabytes per processor!  (How about doing a stat and somehow taking
 |> into account st_size?  Or fstat, after the file was opened?
 |
 |This would break reproducibility (obtaining identical compressed output \
 |from 
 |identical input) because the size of uncompressed data read from standard 
 |input (not from a file) can't be known in advance.

That is true.  If not a regular file, then the above saturation
will unfortunately not work out.  Yet, i thought, limiting a
data size that the user did not explicitly set in the user
required saturation mode could at least minimize the damage a bit:

  #?0|kent:plzip-1.11$ time ./plzip -9 -n0 X2
  DATA_SIZE=67108864 dict=33554432 min=4096 max=536870912
  GO DATA_SIZE=170103 dict=33554432

  real    0m3.421s
  user    0m13.414s
  sys     0m0.077s
  #?0|kent:plzip-1.11$ time ./plzip -9 -n0 < X3 > X3.lz
  DATA_SIZE=67108864 dict=33554432 min=4096 max=536870912
  GO DATA_SIZE=33554432 dict=33554432

  real    0m17.291s
  user    0m34.139s
  sys     0m0.483s

  --- plzip-1.11.old/main.cc      2024-01-21 12:25:18.000000000 +0100
  +++ plzip-1.11/main.cc  2024-05-07 02:27:16.309378187 +0200
  @@ -767,6 +767,7 @@ int main( const int argc, const char * c
     int out_slots = 64;
     Mode program_mode = m_compress;
     Cl_options cl_opts;          // command-line options
  +  bool saturate = false;
     bool force = false;
     bool keep_input_files = false;
     bool recompress = false;
  @@ -848,7 +849,9 @@ int main( const int argc, const char * c
         case 'm': encoder_options.match_len_limit =
                     getnum( arg, pn, LZ_min_match_len_limit(),
                                      LZ_max_match_len_limit() ); break;
  -      case 'n': num_workers = getnum( arg, pn, 1, max_workers ); break;
  +      case 'n': num_workers = getnum( arg, pn, 0, max_workers );
  +                saturate = ( num_workers == 0 );
  +                break;
         case 'o': if( sarg == "-" ) to_stdout = true;
                   else { default_output_filename = sarg; } break;
         case 'q': verbosity = -1; break;
  @@ -980,9 +983,22 @@ int main( const int argc, const char * c
         infd_isreg ? ( in_stats.st_size + 99 ) / 100 : 0;
       int tmp;
       if( program_mode == m_compress )
  -      tmp = compress( cfile_size, data_size, encoder_options.dictionary_size,
  +      {
  +      tmp = data_size;
  +      if( saturate )
  +        {
  +        if( cfile_size != 0 )
  +          {
  +          if( cfile_size / num_workers < (unsigned)tmp )
  +            tmp = cfile_size / num_workers;
  +          }
  +         else if( num_workers > 1 && tmp > encoder_options.dictionary_size )
  +                 tmp = encoder_options.dictionary_size;
  +        }
  +      tmp = compress( cfile_size, tmp, encoder_options.dictionary_size,
                         encoder_options.match_len_limit, num_workers,
                         infd, outfd, pp, debug_level );
  +      }
       else
         tmp = decompress( cfile_size, num_workers, infd, outfd, cl_opts, pp,
                           debug_level, in_slots, out_slots, infd_isreg,

 |> A single sentence that the "defaults" are (of course?!!?!)
 |> dependent on the compression level would have shed some enlightening.
 |
 |I'll try to document it better in the manual and in the man page.

Given how condensed and tight the latter is, this will be hard.
But thanks for your efforts.  (I hope i myself now got it, extract
the ball aka look into doc of my copy before i ask something.)

 |> (Having read the referenced section in the .info file in the
 |> source tarball i would open an issue as "wishlist" asking for an
 |> option that would scale-to-"a-reasonable-number-of"-cpus.
 |
 |As I said above, such an option would not work with data read from \
 |standard 
 |input, and would break reproducibility.

Well, as the user has to enable it, and it would as long as the
environment stays the same.  And i hope the people of
reproducible-builds.org now always check their environment before
penaltizing aka flagging other people's work.

 |Best regards,
 |Antonio.

You know, i have a real lot of use cases like

      </dev/null $ZEXE < FILE > FILE.$ZEXT || exit 5

and other things were file descriptors get played around with, and
i find myself using

  ZEXE='plzip -9 -B16MiB -n'"$NPROC"' -c' ZEXT=lz

for this to not end up taking dozens of minutes.
The above would at least half the necessary time.
Sure.  The above is old and maybe totally useless when using
things like -k and -f.  Hm.

 --End of <6638EA2D.3000801@gnu.org>

Thank you, and greetings from Germany,

--steffen
|
|Der Kragenbaer,                The moon bear,
|der holt sich munter           he cheerfully and one by one
|einen nach dem anderen runter  wa.ks himself off
|(By Robert Gernhardt)



reply via email to

[Prev in Thread] Current Thread [Next in Thread]