emacs-bug-tracker

From: GNU bug Tracking System
Subject: [debbugs-tracker] bug#13243: closed ([PATCH] enhancement: modify md5sum to allow piping)
Date: Thu, 20 Dec 2012 22:51:03 +0000

Your message dated Thu, 20 Dec 2012 15:49:50 -0700
with message-id <address@hidden>
and subject line Re: bug#13243: [PATCH] enhancement: modify md5sum to allow piping
has caused the debbugs.gnu.org bug report #13243,
regarding [PATCH] enhancement: modify md5sum to allow piping
to be marked as done.

(If you believe you have received this mail in error, please contact
address@hidden)


-- 
13243: http://debbugs.gnu.org/cgi/bugreport.cgi?bug=13243
GNU Bug Tracking System
Contact address@hidden with problems
--- Begin Message ---
Subject: [PATCH] enhancement: modify md5sum to allow piping
Date: Thu, 20 Dec 2012 16:09:38 -0600
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:10.0.11) Gecko/20121128 Thunderbird/10.0.11

There are many times, usually when doing system backups, maintenance, recovery, etc., that I would like to pipe large files through md5sum to produce or verify a hash so that I do not have to read the file multiple times. This is especially the case when backing up a system from a livecd across the network:

dd if=/dev/sda3 | pbzip2 -c2 | netcat 192.168.1.123 45678
or
tar c /mnt/sda3 | pbzip2 -c2 | netcat 192.168.1.123 45678

Attached is a preliminary patch set that will allow for this as in the following example

dd if=/dev/sda3 | pbzip2 -c2 | md5sum -po /tmp/sda3.dat.bzip2.md5 | netcat 192.168.1.123 45678

-p is short for --pipe and -o <filename> is short for --outfile <filename>. Then, on the receiving end, the hash can be determined as the file is read, eliminating any worry about network corruption:

netcat -l -p 45678 | md5sum -po sda3.dat.bzip2.rx.md5 > sda3.dat.bzip2

The only caveat is that you have to compare the sum files manually, which you can do by calling diff. That is a small cost compared to re-reading a 200GiB file!
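
The comparison step can be tried with stock tools. This is a minimal sketch, assuming GNU md5sum and diff are installed. The two sum files are hypothetical stand-ins generated locally from the same sample data, not at the two ends of a real transfer. Stock md5sum records the name "-" when it hashes standard input, so sum files made from a stream differ only if the hashes differ.

```shell
#!/bin/sh
# Hypothetical stand-ins for the sender-side and receiver-side sum files.
workdir=$(mktemp -d)
printf 'example payload\n' | md5sum > "$workdir/tx.md5"
printf 'example payload\n' | md5sum > "$workdir/rx.md5"

# diff exits 0 when the two files are identical and non-zero otherwise.
if diff "$workdir/tx.md5" "$workdir/rx.md5" > /dev/null; then
  verdict=match
else
  verdict=MISMATCH
fi
echo "$verdict"
rm -rf "$workdir"
```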

You can even get the sum prior to compression, although if you wanted to avoid a duplicate read on the server end, you would have to decompress as you read it and either store the file uncompressed or re-compress it.

dd if=/dev/sda3 | md5sum -po /tmp/sda3.dat.md5 | pbzip2 -c2 | netcat 192.168.1.123 45678
with
netcat -l -p 45678 | pbzip2 -cd | md5sum -po sda3.dat.rx.md5 > sda3.dat

The attached patchset is in a very early stage and has many problems:

 * GNU coding style compliance (this coding style is new to me)
 * API in gnulib is changed, may break other apps
 * all changes are lumped together and need to be broken apart into
   logical changes
 * it has a few hacks that need to be cleaned up

Also, this patch set addresses a problem with gnulib's hash functions, which contain a lot of copy & paste code. I've implemented a mechanism to clean this up w/o a performance hit (as long as we're using gcc 4.6.1+). This change should probably go into a separate patchset & bug report.

Finally, after the cursory amount that I've worked with this code, I see a number of other areas where I believe there's room for improvement.

 * The copy & paste code problem (mentioned above)
 * Centralize the location where BLOCKSIZE is defined and only verify
   it's a multiple of 64 in gnulib/lib/{md,sha}*.c
 * Perhaps allow BLOCKSIZE to be defined at configure time? Honestly,
   I'm not familiar enough with the issues to be certain it would
   alter performance on any system, but I'm thinking about embedded
   targets, where reading 32k chunks may end up thrashing the cache
   but 8k or 4k would not. However, I don't think I would be in
   favor of this being a run-time parameter, as it would seem to be a
   lot of waste (and lost optimizations) for something that's probably
   pretty specific to the hardware and build target.
 * Centralize compiler sniffing into a single gnulib header, (like
   "compiler.h" or some such) and define the GCC_VERSION macro as
   described in
   http://gcc.gnu.org/onlinedocs/cpp/Common-Predefined-Macros.html.
 * Make better use of __builtin_expect via portable likely/unlikely
   macros to make sure error handling code gets moved out of the main
   bodies of functions (which can save a cache miss here and there).
   Of course, this would require the above item to do cleanly.
 * Introduce some tuning parameter in the configure script to choose
   between smaller code and larger but more optimized code.  I bring
   this up mainly because in my re-work of the copy & pasted code, I
   see a large opportunity to create a much smaller executable (if
   needed), but one that would run slightly slower, which would usually
   be undesirable on a machine with plenty of RAM, storage and CPU cache.

Obviously, these should be made into separate bug reports as well and I can send separate emails for them if you like.

Daniel

Attachment: 0001-md5sum-pipe.patch
Description: Text Data

Attachment: 0001-piping-support.patch
Description: Text Data


--- End Message ---
--- Begin Message ---
Subject: Re: bug#13243: [PATCH] enhancement: modify md5sum to allow piping
Date: Thu, 20 Dec 2012 15:49:50 -0700
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:17.0) Gecko/17.0 Thunderbird/17.0

tag 13243 notabug
thanks

On 12/20/2012 03:09 PM, Daniel Santos wrote:
> There are many times, usually when doing system backups, maintenance,
> recovery, etc., that I would like to pipe large files through md5sum to
> produce or verify a hash so that I do not have to read the file multiple
> times.  This is especially the case when backing up a system from a
> livecd across the network
> 
> dd if=/dev/sda3 | pbzip2 -c2 | netcat 192.168.1.123 45678
> or
> tar c /mnt/sda3 | pbzip2 -c2 | netcat 192.168.1.123 45678
> 
> Attached is a preliminary patch set that will allow for this as in the
> following example
> 
> dd if=/dev/sda3 | pbzip2 -c2 | md5sum -po /tmp/sda3.dat.bzip2.md5 |
> netcat 192.168.1.123 45678

Thanks for the report, and even for the attempted patch.  However, I'm
reluctant to even read through the patch, as I think that you can
already do what you want with existing tools.  In particular, 'info
coreutils tee' mentions:

>    The `tee' command is useful when you happen to be transferring a
> large amount of data and also want to summarize that data without
> reading it a second time.  For example, when you are downloading a DVD
> image, you often want to verify its signature or checksum right away.
> The inefficient way to do it is simply:
> 
>      wget http://example.com/some.iso && sha1sum some.iso
> 
>    One problem with the above is that it makes you wait for the
> download to complete before starting the time-consuming SHA1
> computation.  Perhaps even more importantly, the above requires reading
> the DVD image a second time (the first was from the network).
> 
>    The efficient way to do it is to interleave the download and SHA1
> computation.  Then, you'll get the checksum for free, because the
> entire process parallelizes so well:
> 
>      # slightly contrived, to demonstrate process substitution
>      wget -O - http://example.com/dvd.iso \
>        | tee >(sha1sum > dvd.sha1) > dvd.iso

In your case, you can do:

dd if=/dev/sda3 | pbzip2 -c2 | tee >(md5sum > /tmp/sda3.dat.bzip2.md5) |
 netcat 192.168.1.123 45678
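
The single-pass idea can also be tried locally, without a network and without the >(...) syntax, which needs a shell such as bash, zsh, or ksh. This is a minimal sketch, assuming a POSIX shell with mkfifo, tee, and GNU md5sum. The temporary files are hypothetical stand-ins: a small sample file takes the place of the dd source, an ordinary output file takes the place of netcat, and a named pipe takes the place of process substitution.

```shell
#!/bin/sh
set -e
workdir=$(mktemp -d)
printf 'example payload\n' > "$workdir/input"   # stand-in for dd if=/dev/sda3

mkfifo "$workdir/fifo"
# Background reader: hashes whatever is written to the named pipe.
md5sum < "$workdir/fifo" > "$workdir/stream.md5" &

# tee copies its input to the named pipe and to stdout in a single pass;
# stdout is redirected to a file here where the real pipeline would use netcat.
tee "$workdir/fifo" < "$workdir/input" > "$workdir/output"
wait   # let the background md5sum finish before reading its result

# Compare the streamed checksum with one computed by re-reading the output.
stream_sum=$(cut -d' ' -f1 "$workdir/stream.md5")
reread_sum=$(md5sum < "$workdir/output" | cut -d' ' -f1)
echo "stream: $stream_sum"
echo "reread: $reread_sum"
rm -rf "$workdir"
```

The explicit wait matters: the hash is written by a separate process, so the sum file is only complete once that process has exited.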

Besides, isn't it nicer to use something that already works than to
worry about a preliminary patch still needing lots of work to come up to
coding standards, not to mention copyright assignment paperwork?

As such, I'm going to close the bug report so that we don't spin our
wheels re-implementing something that already works.  But you should
still feel welcome to contribute, and even add further comments to this
thread as appropriate (we can always reopen this bug if there is
convincing reason that I missed something in my decision to close it).

-- 
Eric Blake   eblake redhat com    +1-919-301-3266
Libvirt virtualization library http://libvirt.org

Attachment: signature.asc
Description: OpenPGP digital signature


--- End Message ---
