--- Begin Message ---
Subject: [PATCH] enhancement: modify md5sum to allow piping
Date: Thu, 20 Dec 2012 16:09:38 -0600
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:10.0.11) Gecko/20121128 Thunderbird/10.0.11
There are many times, usually when doing system backups, maintenance,
recovery, etc., when I would like to pipe large files through md5sum to
produce or verify a hash so that I do not have to read the file multiple
times. This is especially the case when backing up a system from a
live CD across the network:
dd if=/dev/sda3 | pbzip2 -c2 | netcat 192.168.1.123 45678
or
tar c /mnt/sda3 | pbzip2 -c2 | netcat 192.168.1.123 45678
Attached is a preliminary patch set that allows for this, as in the
following example:
dd if=/dev/sda3 | pbzip2 -c2 | md5sum -po /tmp/sda3.dat.bzip2.md5 | netcat 192.168.1.123 45678
-p is short for --pipe and -o <filename> is short for --outfile
<filename>. Then, on the receiving end, the hash can be computed as
the file is read, eliminating any worry about network corruption:
netcat -l -p 45678 | md5sum -po sda3.dat.bzip2.rx.md5 > sda3.dat.bzip2
The only caveat is that you have to compare the sum files manually,
which you can do with a simple call to diff, a small cost compared to
re-reading a 200 GiB file!
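A minimal sketch of that comparison, assuming each sum file holds a single `<hash>  <name>` line; the `tx.md5`/`rx.md5` names and the `payload` data below are hypothetical stand-ins for the files produced on each end:

```shell
# Demonstration with hypothetical sum files standing in for the ones
# produced on the sending and receiving ends above.
printf '%s' 'payload' | md5sum > tx.md5   # sender-side sum
printf '%s' 'payload' | md5sum > rx.md5   # receiver-side sum
# The recorded file names can differ on the two ends, so compare
# only the hash fields rather than diffing the whole files:
if [ "$(cut -d' ' -f1 tx.md5)" = "$(cut -d' ' -f1 rx.md5)" ]; then
    echo "checksums match"
else
    echo "checksum MISMATCH" >&2
fi
```

Comparing only the first field avoids a spurious mismatch when md5sum records `-` on one side and a real file name on the other.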
You can even get the sum prior to compression, although if you want to
avoid a duplicate read on the server end, you then have to decompress
the stream as you receive it and either store the file uncompressed or
re-compress it:
dd if=/dev/sda3 | md5sum -po /tmp/sda3.dat.md5 | pbzip2 -c2 | netcat 192.168.1.123 45678
with
netcat -l -p 45678 | pbzip2 -cd | md5sum -po sda3.dat.rx.md5 > sda3.dat
The attached patch set is at a very early stage and has several known problems:
* GNU coding style compliance (this coding style is new to me)
* it changes an API in gnulib, which may break other applications
* all changes are lumped together and need to be split into logical commits
* it contains a few hacks that need to be cleaned up
Also, this patch set addresses a problem with gnulib's hash
functions, where there was a lot of copy & paste code. I've implemented
a mechanism to clean this up without a performance hit (as long as we're
using gcc 4.6.1+). This change should probably go into a separate
patch set & bug report.
Finally, even after only a cursory pass through this code, I see
a number of other areas where I believe there's room for improvement:
* The copy & paste code problem (mentioned above)
* Centralize the location where BLOCKSIZE is defined and only verify
it's a multiple of 64 in gnulib/lib/{md,sha}*.c
* Perhaps allow BLOCKSIZE to be defined at configure time? Honestly,
  I'm not familiar enough with the issues to be certain it would alter
  performance on any system, but I'm thinking of embedded targets,
  where reading 32k chunks may end up thrashing the cache but 8k or 4k
  would not. However, I don't think I would favor making this a
  run-time parameter, as that would waste space (and lose
  optimizations) for something that's probably specific to the
  hardware and build target.
* Centralize compiler sniffing into a single gnulib header, (like
"compiler.h" or some such) and define the GCC_VERSION macro as
described in
http://gcc.gnu.org/onlinedocs/cpp/Common-Predefined-Macros.html.
* Make better use of __builtin_expect via portable likely/unlikely
  macros to ensure error-handling code gets moved out of the main
  bodies of functions (which can save a cache miss here and there).
  Of course, doing this cleanly would require the above item.
* Introduce a tuning parameter in the configure script to choose
  between smaller and larger, but more optimized, code. I bring this
  up mainly because in my rework of the copy & pasted code, I see a
  large opportunity to produce a much smaller executable (if needed)
  at the cost of slightly slower code, a trade-off that would usually
  be undesirable on a machine with plenty of RAM, storage and CPU cache.
Obviously, these should be made into separate bug reports as well, and
I can send separate emails for them if you like.
Daniel
0001-md5sum-pipe.patch
Description: Text Data
0001-piping-support.patch
Description: Text Data
--- End Message ---
--- Begin Message ---
Subject: Re: bug#13243: [PATCH] enhancement: modify md5sum to allow piping
Date: Thu, 20 Dec 2012 15:49:50 -0700
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:17.0) Gecko/17.0 Thunderbird/17.0
tag 13243 notabug
thanks
On 12/20/2012 03:09 PM, Daniel Santos wrote:
> There are many times, usually when doing system backups, maintenance,
> recovery, etc., that I would like to pipe large files through md5sum to
> produce or verify a hash so that I do not have to read the file multiple
> times. This is especially the case when backing up a system from a
> livecd across the network
>
> dd if=/dev/sda3 | pbzip2 -c2 | netcat 192.168.1.123 45678
> or
> tar c /mnt/sda3 | pbzip2 -c2 | netcat 192.168.1.123 45678
>
> Attached is a preliminary patch set that will allow for this as in the
> following example
>
> dd if=/dev/sda3 | pbzip2 -c2 | md5sum -po /tmp/sda3.dat.bzip2.md5 |
> netcat 192.168.1.123 45678
Thanks for the report, and even for the attempted patch. However, I'm
reluctant to even read through the patch, as I think that you can
already do what you want with existing tools. In particular, 'info
coreutils tee' mentions:
> The `tee' command is useful when you happen to be transferring a
> large amount of data and also want to summarize that data without
> reading it a second time. For example, when you are downloading a DVD
> image, you often want to verify its signature or checksum right away.
> The inefficient way to do it is simply:
>
> wget http://example.com/some.iso && sha1sum some.iso
>
> One problem with the above is that it makes you wait for the
> download to complete before starting the time-consuming SHA1
> computation. Perhaps even more importantly, the above requires reading
> the DVD image a second time (the first was from the network).
>
> The efficient way to do it is to interleave the download and SHA1
> computation. Then, you'll get the checksum for free, because the
> entire process parallelizes so well:
>
> # slightly contrived, to demonstrate process substitution
> wget -O - http://example.com/dvd.iso \
> | tee >(sha1sum > dvd.sha1) > dvd.iso
In your case, you can do:
dd if=/dev/sda3 | pbzip2 -c2 | tee >(md5sum > /tmp/sda3.dat.bzip2.md5) |
netcat 192.168.1.123 45678
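The same tee-plus-process-substitution technique also covers the receiving end that the patch targeted. A hedged sketch, assuming bash (process substitution is a bash feature, not POSIX sh) and using `printf` as a stand-in for the `netcat -l` stream; all file names are hypothetical:

```shell
# In practice the stream would come from `netcat -l -p 45678`; printf
# simulates it here. tee writes the data to disk while a bash process
# substitution computes the checksum in parallel, so the stream is
# read only once.
printf '%s' 'compressed payload' \
  | tee >(md5sum > sda3.rx.md5) > sda3.dat.bzip2
sleep 1   # give the md5sum substitution a moment to flush its output
cat sda3.rx.md5
```

The `sleep` is a crude way to wait for the substitution to finish before reading its output; in a script you might instead check that the sum file is non-empty before using it.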
Besides, isn't it nicer to use something that already works than to
worry about a preliminary patch still needing lots of work to come up to
coding standards, not to mention copyright assignment paperwork?
As such, I'm going to close the bug report so that we don't spin our
wheels re-implementing something that already works. But you should
still feel welcome to contribute, and even add further comments to this
thread as appropriate (we can always reopen this bug if there is
convincing reason that I missed something in my decision to close it).
--
Eric Blake eblake redhat com +1-919-301-3266
Libvirt virtualization library http://libvirt.org
signature.asc
Description: OpenPGP digital signature
--- End Message ---