Re: plzip: manual gives very false numbers, real defaults are huge!
From: Steffen Nurpmeso
Subject: Re: plzip: manual gives very false numbers, real defaults are huge!
Date: Tue, 07 May 2024 22:43:11 +0200
User-agent: s-nail v14.9.24-621-g0d1e55f367
Hello!!
Antonio Diaz Diaz wrote in
<663A502B.4080802@gnu.org>:
|Steffen Nurpmeso wrote:
|>> The above means "on files of only a few MB plzip can't be faster
|>> than lzip, no matter what options you use". Of course, at high
|>> compression levels the "few MB" become "several tens of MB".
|>
|> I think i now have understood your approach.
|> But i claim it is not what people would expect.
|
|People tend to expect contradictory things. Like using all the processors
|on any machine while at the same time producing the same compressed output
|on all machines.
Well, hm, not the latter, if i speak for myself. I did not spend
a thought on that, in fact. Reproducible, yes: if i say "EXE
FILE" twice in a row, i would expect the compression result to be
the same, that yes.
|> For example, if i hack just a little bit i get on my i5 laptop
|>
|> #?0|kent:plzip-1.11$ time ./plzip -9 -n4 x1
|> instat=0x7fff32eb6800 inreg=1 sat=0 cf=680412 x=170103 tmp=67108864
|> USING 67108864
|>
|> real 0m37.743s
|> user 0m37.737s
|> sys 0m0.273s
|> [...]
|> #?0|kent:plzip-1.11$ time ./plzip -9 -n0 x1
|> instat=0x7ffe538049d0 inreg=1 sat=1 cf=680412 x=170103 tmp=67108864
|> USING 170103
|>
|> real 0m3.157s
|> user 0m12.415s
|> sys 0m0.087s
|
|Note that the above does not run 12 times faster because you have 12
|processors, but because you are using a dictionary size almost 200 times
|smaller (which I guess will give a compression ratio between levels 0
|and 1 instead of the level 9 requested).
No?
#?0|kent:plzip-1.11$ cp /x/balls/gcc-13.2.0.tar.xz X1
#?0|kent:plzip-1.11$ cp X1 X2
#?0|kent:plzip-1.11$ time ./plzip -9 -n4 X1; time ./plzip -9 -n@4 X2; ll X1.lz X2.lz
cfile_size, tmp, encoder_options.dictionary_size: 878586 67108864 33554432
real 0m37.618s
user 0m46.182s
sys 0m0.414s
cfile_size, tmp, encoder_options.dictionary_size: 878586 219646 33554432
real 0m4.420s
user 0m17.458s
sys 0m0.106s
-rw-r----- 1 steffen steffen 89049959 May 7 22:14 X1.lz
-rw-r----- 1 steffen steffen 89079463 May 7 22:14 X2.lz
Hmmm, maybe yes
13966934 Jan 3 18:58 /x/doc/coding/austin-group/202x_d4.txt
#?0|kent:plzip-1.11$ cp /x/doc/coding/austin-group/202x_d4.txt X1
#?0|kent:plzip-1.11$ cp X1 X2
#?0|kent:plzip-1.11$ time ./plzip -9 -n4 X1; time ./plzip -9 -n@4 X2; ll X1.lz X2.lz
cfile_size, tmp, encoder_options.dictionary_size: 139670 67108864 33554432
real 0m12.098s
user 0m12.002s
sys 0m0.080s
cfile_size, tmp, encoder_options.dictionary_size: 139670 34917 33554432
real 0m1.310s
user 0m5.176s
sys 0m0.013s
-rw-r----- 1 steffen steffen 1880399 May 7 22:18 X1.lz
-rw-r----- 1 steffen steffen 2927787 May 7 22:18 X2.lz
Hmm. OK, that is indeed not desired; then the concept is
a braindead dead end, actually. It would at least have to be
  if( cfile_size / num_workers < (unsigned)tmp )
    {
    // too little input per worker: shrink the data size ...
    tmp = cfile_size / num_workers;
    // ... but never below the dictionary size
    tmp = std::max(tmp, encoder_options.dictionary_size);
    }
to be sane (let alone acceptable), but that would still give an
improvement similar to what i had last week, and there were many
files which would benefit from it.
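The clamp proposed above can also be written as a small standalone function. This is only a sketch under assumptions: `clamp_data_size` and its parameter names are hypothetical and not plzip's actual code, and the units simply follow the debug output quoted in this message.

```cpp
#include <algorithm>

// Hypothetical helper (not plzip's actual code): shrink the per-worker
// data size when the input is too small to feed every worker, but never
// let it drop below the dictionary size of the requested level.
long long clamp_data_size(long long cfile_size, int num_workers,
                          long long data_size, long long dictionary_size)
{
    if (num_workers < 1) num_workers = 1;        // guard against division by zero
    const long long per_worker = cfile_size / num_workers;
    if (per_worker < data_size)                  // not enough input for all workers
        data_size = std::max(per_worker, dictionary_size);
    return data_size;
}
```

With cfile_size 878586, four workers, a data size of 67108864 and a dictionary size of 33554432, this returns 33554432. Without the `std::max` line the result would be 878586 / 4 = 219646, the value printed by the unclamped run.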
#?0|kent:plzip-1.11$ time ./plzip -9 -n4 X1; time ./plzip -9 -n@4 X2; ll X1.lz X2.lz; rm X1.lz X2.lz
cfile_size, tmp, encoder_options.dictionary_size: 878586 67108864 33554432
real 0m38.817s
user 0m48.124s
sys 0m0.407s
cfile_size, tmp, encoder_options.dictionary_size: 878586 33554432 33554432
real 0m16.535s
user 0m41.348s
sys 0m0.812s
-rw-r----- 1 steffen steffen 89049959 May 7 22:26 X1.lz
-rw-r----- 1 steffen steffen 89050445 May 7 22:26 X2.lz
To me this is still an improvement i would long for in the port
i maintain!?
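The trade-off in the runs above can be made explicit by computing wall-clock speedup and output-size overhead from the quoted numbers. A minimal sketch; `Run`, `speedup`, `size_overhead_pct` and `print_comparison` are names made up for this illustration.

```cpp
#include <cstdio>

// Wall-clock seconds ("real") and output size in bytes of one run,
// to be filled in from the transcripts in this message.
struct Run { double real_s; long long out_bytes; };

// How many times faster run `b` finished than run `a`.
double speedup(const Run& a, const Run& b) { return a.real_s / b.real_s; }

// Extra output bytes of run `b` relative to run `a`, in percent.
double size_overhead_pct(const Run& a, const Run& b)
{
    return 100.0 * static_cast<double>(b.out_bytes - a.out_bytes)
                 / static_cast<double>(a.out_bytes);
}

// Print one comparison line for a pair of runs.
void print_comparison(const char* label, const Run& a, const Run& b)
{
    std::printf("%s: %.2fx faster, %+.4f%% size\n",
                label, speedup(a, b), size_overhead_pct(a, b));
}
```

For the 202x_d4.txt pair (12.098 s and 1880399 bytes versus 1.310 s and 2927787 bytes) this gives roughly a 9.2x speedup at about 56% more output. For the gcc tarball with the clamp applied (38.817 s and 89049959 bytes versus 16.535 s and 89050445 bytes) it gives roughly 2.3x at 486 bytes more.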
|> I realized for the first time that standard input is treated in
|> a different way via the "one_to_one" mapping of yours. Ie while
|> doing
|>
|> time ./plzip -9 -n0 -c < /tmp/t.tar.xz > x1.lz
|>
|> it occurred to me that the "struct stat" is not used at all for
|> stdin, which is a pity imho, especially since S_ISREG() is tested.
|
|S_ISREG is not tested for stdin. But if you want to reproduce the metadata
|of the input file in an output file with a different name, you can use
|
| time ./plzip -9 /tmp/t.tar.xz -o x1.lz
|
|> That is true. If not a regular file, then the above saturation
|> will unfortunately not work out. Yet, i thought, limiting a
|> data size that the user did not explicitly set in the user
|> required saturation mode could at least minimize the damage a bit:
|
|Or can cause the opposite damage by splitting a huge file into twice as
|many members as now.
|
|> And i hope the people of reproducible-builds.org now always check their
|> environment before penalizing aka flagging other people's work.
|
|Reproducible builds are a set of software development practices that
|create an independently-verifiable path from source to binary code. They
|have nothing to do with reproducible compression. (Obtaining identical
|compressed output from identical uncompressed input read from anywhere).
I am all listening, and yes.
|> i find myself using
|>
|> ZEXE='plzip -9 -B16MiB -n'"$NPROC"' -c' ZEXT=lz
|>
|> for this to not end up taking dozens of minutes.
|
|I already gave you a solution; use -m or -s:
|
| ZEXE='plzip -m273 -n'"$NPROC"' -c' ZEXT=lz
|or
| ZEXE='plzip -9 -s8MiB -n'"$NPROC"' -c' ZEXT=lz
I think dynamically scaling according to the processors, taking
into account the dictionary size, as you said above, is the sane
approach for "saturating" with plzip. In the above job there are
quite a lot of files of varying size (the spam DB being very
large), and one recipe is not good for them all.
I must say i never spent so many thoughts on their compression,
thinking about it, the desire of the above would be "as good as
possible in the smallest possible time", as could be expected
from a Kartoffel, no. I think the -n@0 mode will do this, even
very good! Because, you know, what other option do *i* have, for
example in the above shell script? I could go the GNU make
variant and simply "(EXE FILE) </dev/null >/dev/null &" all jobs
concurrently, and hope that the operating system paves its way
through this (luckily running NVME SSD), eventually finishing all
the jobs, or create $NPROC such jobs in a loop (ditto). Then my
own possibilities are exhausted. plzip -n@0 however (and *only* as
above, thank you!) can saturate pretty good. (Of course, there
are other actions on the computer concurrently for one, and then
filesystems may spread data quite a bit, so I/O data collection
must not necessarily be so much worse with ()&, *i think*.)
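The "create $NPROC such jobs in a loop" alternative mentioned above amounts to a bounded job pool. The following is a rough sketch of that idea, not something taken from plzip or from the script discussed here; `run_bounded` is a hypothetical name, and each job is just a callable (for example one that starts a single-threaded compressor on one file).

```cpp
#include <atomic>
#include <cstddef>
#include <functional>
#include <thread>
#include <vector>

// Hypothetical helper: run every job, with at most `nproc` running at once.
void run_bounded(const std::vector<std::function<void()>>& jobs, unsigned nproc)
{
    if (nproc == 0) nproc = 1;                   // always use at least one worker
    std::atomic<std::size_t> next{0};            // index of the next unclaimed job
    auto worker = [&]() {
        for (;;) {
            const std::size_t i = next.fetch_add(1);
            if (i >= jobs.size()) return;        // nothing left to claim
            jobs[i]();                           // run the claimed job
        }
    };
    std::vector<std::thread> pool;
    for (unsigned t = 0; t < nproc; ++t) pool.emplace_back(worker);
    for (auto& th : pool) th.join();             // wait until all jobs are done
}
```

Unlike starting everything with "&" at once, this keeps the number of concurrent jobs fixed, but a single very large file still occupies only one worker, which is the case where splitting the input inside plzip helps.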
|> The above would at least halve the necessary time.
|> Sure. The above is old and maybe totally useless when using
|> things like -k and -f. Hm.
|
|I guess you could simplify it to something like this
|
| ZEXE='plzip -m273 -n'"$NPROC"
| $ZEXE -kf FILE || exit 5
Ciao!
--steffen
|
|Der Kragenbaer,                The moon bear,
|der holt sich munter           he cheerfully and one by one
|einen nach dem anderen runter  wa.ks himself off
|(By Robert Gernhardt)