Re: [Bug-tar] use optimal file system block size
From: Christian Krause
Subject: Re: [Bug-tar] use optimal file system block size
Date: Thu, 19 Jul 2018 11:05:47 +0200
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:52.0) Gecko/20100101 Thunderbird/52.9.1
Dear all,
First, I would like to thank you all for your prompt replies.
To clarify: I do not mean to change the **record size**, which would result in
an incompatible tar file. I am only interested in the buffer sizes that are
used to read from and write to block devices.
As far as I understand it (please correct me if I got it wrong), when creating
a tarball with `tar cf data.tar data`, the following happens:
1. tar reads the input files from the data directory using the **input buffer
size**
2. tar creates records using the **record size**, which depends e.g. on
command line arguments like `-b`
3. tar writes the records to the output (block device file, STDOUT, character
device / tape drive) using the **output buffer size**
There are three different **sizes** at work here: **input buffer**, **record**,
and **output buffer**. The input and output buffer sizes are the same as the
record size, which can be verified using Ralph's command line with the `-b`
option:
```
$ strace -T -ttt -ff -o tar-1.30-factor-4k.strace tar cbf 4096 data4k.tar data
$ strace-analyzer io tar-1.30-factor-4k.strace.72464 | grep data | column -t
read   84M  in  1.520 s    (~ 55M / s)  with  43  ops  (~ 2M / op,  ~ 2M request size)  data/blob
write  86M  in  61.316 ms  (~ 1G / s)   with  43  ops  (~ 2M / op,  ~ 2M request size)  data4k.tar
```
Because this changes the **record size**, it creates a different, incompatible
tar file:
```
$ stat -c %s data.tar data4k.tar
88084480
90177536
$ md5sum data.tar data4k.tar
4477dca65dee41609d43147cd15eea68 data.tar
6f4ce17db2bf7beca3665e857cbc2d69 data4k.tar
```
Please verify my understanding: the fact that the input and output buffer sizes
equal the record size is an implementation detail. They could be decoupled from
the record size to improve I/O performance without changing the resulting tar
file. Such decoupling would, however, entail a huge refactoring, as Jörg
suggests.
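For what it's worth, this decoupling can be approximated from the shell today
by piping tar's output through dd with a large output block size (the file name
`data-dd.tar` is made up for this sketch):
```
$ tar cf - data | dd of=data-dd.tar obs=2M
$ md5sum data.tar data-dd.tar
```
Since dd only re-buffers the byte stream, `data-dd.tar` should be byte-identical
to `data.tar`, while the writes to the output file are issued in 2 MiB chunks.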
> What network filesystem are you using? Typically, such small IOPS
> should be hidden from the filesystem with readahead and writeback
> cache, though of course there is still more overhead from having
> lots of system calls.
We are using IBM Spectrum Scale (previously known as GPFS). From the Spectrum
Scale documentation I can see that it is using read-ahead and write-back
techniques (I don't know much about the internals, though). The performance
gain by reducing the number of syscalls and the resulting reduced overhead in
both OS kernel and Spectrum Scale software components should still be
measurable.
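For example, the syscall counts themselves could be compared with strace's
summary mode (a sketch; the output file names are made up and the numbers will
depend on the data set):
```
$ strace -c -f -o tar-syscalls.txt tar cf data.tar data
$ strace -c -f -o bsdtar-syscalls.txt bsdtar -cf data-bsdtar.tar data
```
The `-c` summary lists per-syscall call counts and the time spent in them, so
the reduction in read/write calls would show up directly.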
> bsdtar has a similar optimization.
I can verify this for the input buffer size:
```
$ bsdtar --version
bsdtar 3.2.2 - libarchive 3.2.2 zlib/1.2.8 liblzma/5.0.4 bz2lib/1.0.6
$ strace -T -ttt -ff -o bsdtar-3.2.2-create.strace bsdtar -cf data-bsdtar.tar data
$ strace-analyzer io bsdtar-3.2.2-create.strace.14101 | grep data | column -t
read   84M  in  388.927 ms  (~ 216M / s)  with  42    ops  (~ 2M / op,   ~ 2M request size)   data/blob
write  84M  in  4.854 s     (~ 17M / s)   with  8602  ops  (~ 10K / op,  ~ 10K request size)  data-bsdtar.tar
```
This is not the latest version, though; the write buffer size may have changed
in later releases.
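If someone has a newer libarchive at hand, this would be easy to check with the
same approach as above, e.g.:
```
$ strace -e write -s 10 bsdtar -cf data-bsdtar.tar data
```
If the write buffer size was changed, the write(2) request sizes should come
out larger than the ~10K seen above.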
Best Regards
On 07/19/2018 06:20 AM, Tim Kientzle wrote:
> bsdtar has a similar optimization.
>
> It decouples reads and writes, allowing it to use a more optimal size for each
> side.
>
> When it opens an archive for writing, it checks the target device type. If
> it’s a character device (such as a tape drive), it writes the requested blocks
> exactly. When the target device is a block device, however, it instead buffers
> and writes much larger blocks, padding the file at the end as necessary to
> ensure the final size is a multiple of the requested block size. This produces
> the exact same end result as if it had written blocks as requested but much
> more efficiently.
>
> Tim
> On Jul 18, 2018, at 9:58 AM, Andreas Dilger <address@hidden> wrote:
>
>> On Jul 18, 2018, at 9:03 AM, Ralph Corderoy <address@hidden> wrote:
>>
>>> Hi Christian,
>>>
>>>> $ stat -c %o data/blob
>>>> 2097152
>>>> ...
>>>> **tar** does not explicitly use the block size of the file system
>>>> where the files are located, but, for a reason I don't know (feel free to
>>>> educate me), 10 KiB:
>>>
>>> Historic, that being 20 blocks where a block is 512 B. See `Blocking
>>> Factor'. https://www.gnu.org/software/tar/manual/tar.html#SEC160
>>> It can be changed.
>>>
>>> $ strace -e write -s 10 tar cbf 4096 foo.tar foo
>>> write(3, "foo\0\0\0\0\0\0\0"..., 2097152) = 2097152
>>> +++ exited with 0 +++
>>> $
>>>
>>>> I would like to propose to use the native file system block size in
>>>> favor of the currently used 10 KiB.
>>> I can't see the default changing. POSIX's pax(1) states for ustar
>>> format that the default for character devices is 10 KiB, and allows for
>>> multiples of 512 up to and including 32,256. So you're suggesting the
>>> default is to produce an incompatible tar file.
>> The IO size from the storage does not need to match the record size
>> of the tar file. It may be that writing to an actual tape character
>> device needs to use 10KB writes, but for a regular file on a block
>> device (which is 99% of tar usage) it can still write 10KB records,
>> but just write a few hundred of them at a time.
>>
>> What network filesystem are you using? Typically, such small IOPS
>> should be hidden from the filesystem with readahead and writeback
>> cache, though of course there is still more overhead from having
>> lots of system calls.
>>
>> Cheers, Andreas
--
Christian Krause
Scientific Computing Administration and Support
----------------------------------------
Email: address@hidden
Office: BioCity Leipzig 5e, Room 3.201.3
Phone: +49 341 97 33144
----------------------------------------
German Centre for Integrative Biodiversity Research (iDiv) Halle-Jena-Leipzig
Deutscher Platz 5e
04103 Leipzig
Germany