From: apache2
Subject: Re: [DISCUSSION] FILE* slowness
Date: Sun, 17 Jul 2022 11:06:15 +0200
User-agent: Mutt/1.9.3 (2018-01-21)
OK, so here is some more analysis.
I started out with the "stupid" implementation that mapped
`ios_dev_file_pread` and `ios_dev_file_pwrite` directly to
`pread` and `pwrite`.
This was, not totally unexpectedly, slower than our current
FILE*-based code with the ftello patch applied.
Presumably because `lseek(SEEK_CUR)` (what `ftello` boils down to, when it is not answered from the stream's cached offset) is faster than a `pread64` per read.
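For reference, the "stupid" version was essentially a direct pass-through. The names and signature below are simplified stand-ins, not the actual libpoke prototypes (the real functions take an `ios_dev` handle and return `IOD_*` codes):

```c
#include <unistd.h>
#include <sys/types.h>

/* Hypothetical sketch of the direct mapping: every libpoke read
   becomes one pread(2) syscall, with no userspace buffering at all.
   With Poke issuing mostly 1-byte reads, that is one syscall per
   byte, which is why this lost to buffered FILE* I/O.  */
ssize_t
naive_file_pread (int fd, void *buf, size_t count, off_t offset)
{
  return pread (fd, buf, count, offset);
}
```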
Then I added a buffer. That made things look a little bit better.
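The shape of the buffered version is roughly the following. This is an illustrative sketch, not the attached patch itself; `cached_file`/`cached_pread` are made-up names, and the real code has to deal with the iod interface and error codes:

```c
#include <string.h>
#include <unistd.h>
#include <sys/types.h>

#define CACHE_SIZE 4096  /* the 4k variant; 8k was the other one tried */

/* Hypothetical read cache on top of pread(2): reads that hit the
   cached window cost a memcpy instead of a syscall.  */
struct cached_file
{
  int fd;
  unsigned char cache[CACHE_SIZE];
  off_t cache_offset;   /* file offset of cache[0] */
  ssize_t cache_valid;  /* number of valid bytes in cache, 0 = empty */
};

ssize_t
cached_pread (struct cached_file *f, void *buf, size_t count, off_t offset)
{
  if (count > CACHE_SIZE)
    return pread (f->fd, buf, count, offset);  /* too big to cache */

  /* Refill the cache on a miss.  */
  if (f->cache_valid <= 0
      || offset < f->cache_offset
      || offset + (off_t) count > f->cache_offset + f->cache_valid)
    {
      ssize_t n = pread (f->fd, f->cache, CACHE_SIZE, offset);
      if (n < 0)
        return n;
      f->cache_offset = offset;
      f->cache_valid = n;
      if ((size_t) n < count)
        count = n;  /* short read at EOF */
    }

  memcpy (buf, f->cache + (offset - f->cache_offset), count);
  return count;
}
```

With the mostly contiguous access pattern described below, nearly every read after the first lands in the cached window.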
Here are the results of `time` on my benchmark, as executed on my
shitty old laptop with a bunch of other stuff running in the background,
i.e. YMMV.
I use a work-in-progress `zfs.pk` (available in the Pokology repo) that reads a
lot of uint32s to compute the Fletcher4 checksum used in the `ZFS send` file
format.
I thought this checksum pickle would be a good benchmark, both because most
of the heavy lifting consists of contiguous reads, but also because it would
be more likely to catch bugs in the implementation
(by way of complaining about incorrect checksums).
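For context, the per-word work the pickle performs is the classic Fletcher-4 used by ZFS: four 64-bit accumulators, each updated once per 32-bit word. Shown here in C purely for illustration (this is what zfs.pk computes in Poke, hence the mass of uint32 reads):

```c
#include <stdint.h>
#include <stddef.h>

/* Fletcher-4 as used by ZFS: a running sum, a sum of sums, and so on,
   one 32-bit word at a time.  Every word of the stream is read once,
   which is what makes this benchmark read-heavy and contiguous.  */
void
fletcher4 (const uint32_t *words, size_t nwords, uint64_t sum[4])
{
  uint64_t a = 0, b = 0, c = 0, d = 0;
  for (size_t i = 0; i < nwords; i++)
    {
      a += words[i];
      b += a;
      c += b;
      d += c;
    }
  sum[0] = a; sum[1] = b; sum[2] = c; sum[3] = d;
}
```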
Here is the output of bash's `time` builtin, three successive runs each:
FILE* (ftello patch):
real 1m11.913s
user 1m1.259s
sys 0m0.525s
--
real 1m13.129s
user 1m1.132s
sys 0m0.458s
--
real 1m9.563s
user 1m0.466s
sys 0m0.514s
pread (4096 byte buffer):
real 1m5.867s
user 0m56.473s
sys 0m0.477s
--
real 0m57.814s
user 0m52.453s
sys 0m0.472s
--
real 0m59.327s
user 0m53.165s
sys 0m0.468s
pread (8192 byte buffer):
real 1m6.722s
user 0m56.645s
sys 0m0.345s
--
real 1m3.726s
user 0m55.380s
sys 0m0.391s
--
real 1m0.855s
user 0m54.288s
sys 0m0.435s
Attached you'll find three perf reports (first 50 lines); these are based on
current master:
commit 273c468495f71b1f08e7a7169928ef40e8a3c184 (origin/master, origin/HEAD,
gnu/master, master)
Author: Jose E. Marchesi <jose.marchesi@oracle.com>
Date: Thu Jul 14 19:00:39 2022 +0200
remove etc/pk-mi-json-schema.json
- perf.FILE.txt is the current master version
- perf.pread_4k.txt is for the attached pread patch with 4k buffers
- perf.pread_8k.txt is for the attached pread patch with 8k buffers
In perf.FILE.txt we see the following relevant entries:
 3.74%  10851  poke  libc-2.33.so      [.] __ftello64
 3.55%  10301  poke  libc-2.33.so      [.] _IO_fread
 3.43%   9941  poke  libc-2.33.so      [.] _IO_ferror
 0.61%   1777  poke  libc-2.33.so      [.] _IO_file_seekoff@@GLIBC_2.2.5
 0.69%   2006  poke  libpoke.so.0.0.0  [.] ios_dev_file_pread
 0.60%   1750  poke  libc-2.33.so      [.] _IO_file_xsgetn
 0.49%   1410  poke  libpoke.so.0.0.0  [.] ios_dev_file_get_flags
 0.40%   1160  poke  libc-2.33.so      [.] _IO_seekoff_unlocked
This sums up to approximately 13.51% of the runtime. I did not filter the
callgraph for calls from ios_dev_file_pread, but I believe it's representative
enough.
In perf.pread_4k.txt we instead see:
 1.84%   4591  poke  libpoke.so.0.0.0  [.] ios_dev_file_pread
 0.48%   1199  poke  libpoke.so.0.0.0  [.] ios_dev_file_get_flags
And in perf.pread_8k.txt:
 1.85%   4738  poke  libpoke.so.0.0.0  [.] ios_dev_file_pread
 0.53%   1351  poke  libpoke.so.0.0.0  [.] ios_dev_file_get_flags
So we are now spending approximately 2% of the time dealing with file reading
instead of 13.5%.
An interesting side note is that we make a lot of calls to
`ios_dev_file_get_flags`; if we do not expect the flags to change (I don't
think the API allows this?), we could cache that value so libpoke doesn't
have to call the function for each and every read. Something to look into
in the future.
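A sketch of what that caching could look like. The structures here (`dev_if`, `ios`, `demo_get_flags`) are simplified stand-ins for libpoke's real internals, which I have not checked in detail, so take this as a shape rather than a patch:

```c
#include <stdint.h>

/* Hypothetical vtable entry, mirroring the shape of the device
   interface: flags are obtained through an indirect call.  */
struct dev_if
{
  uint64_t (*get_flags) (void *dev);
  /* ... other device operations ... */
};

struct ios
{
  void *dev;
  struct dev_if *dev_if;
  uint64_t cached_flags;  /* filled in once, at open time */
};

/* If the API guarantees the flags cannot change after open, one call
   here replaces one indirect call per read.  */
static void
ios_cache_flags (struct ios *io)
{
  io->cached_flags = io->dev_if->get_flags (io->dev);
}

/* Example device whose flags never change.  */
static uint64_t
demo_get_flags (void *dev)
{
  (void) dev;
  return 0x3;  /* e.g. readable|writable */
}
```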
We also see that the relative cost of (the time spent in) most of the other
expensive functions has gone up, comparing perf.FILE.txt to perf.pread_4k.txt:
-  13.14%  38114  poke  libc-2.33.so      [.] __memmove_avx_unaligned_erms_rtm
+  15.30%  38181  poke  libc-2.33.so      [.] __memmove_avx_unaligned_erms_rtm
-  10.86%  31485  poke  libpoke.so.0.0.0  [.] pvm_execute_or_initialize
+  12.37%  30873  poke  libpoke.so.0.0.0  [.] pvm_execute_or_initialize
-   7.94%  23035  poke  libc-2.33.so      [.] __memset_avx2_erms_rtm
+   9.29%  23187  poke  libc-2.33.so      [.] __memset_avx2_erms_rtm
-   6.44%  18683  poke  libgc.so.1.4.3    [.] GC_malloc_kind
+   7.83%  19541  poke  libgc.so.1.4.3    [.] GC_malloc_kind
-   7.01%  20329  poke  libgc.so.1.4.3    [.] GC_mark_from
+   6.07%  15151  poke  libgc.so.1.4.3    [.] GC_mark_from
Seen in the context of execution taking about 20% less wall-clock time,
this indicates that we are onto a beneficial optimization.
Comparing the two buffer sizes, we learn that my benchmark was probably not
good enough: the statistics are inconclusive and the differences do not look
statistically significant to me.
It does look like 8k might be slightly faster, but in my example pread_4k won,
and I wouldn't be willing to bet much on either being superior without better
data, including more varied benchmarks. I wonder if we have something like that
to catch performance regressions?
One way to obtain better data would be to increase the sampling rate of `perf`,
or to switch to tracing instead.
I looked at the access patterns by tracing the pread patch with strace and the
following awk script:
strace --string-limit=0 -f -e pread64 ./run poke \
  -c 'load "../pokology/pickles/zfs.pk";' -c 'open("my.zfs-send")' \
  -c "var x=(ZFS_send @ 0 : 0#b);" -c "x.checksum" \
  |& tr -d ',)' \
  | awk 'BEGIN { last = 0; succ = 0; }
         /pread64/ && /pid/ {
           if ($6 == last) { succ++; } else { succ = 0; }
           last = $6 + $8;
           print substr($3,9) "\t" $5 "\t" $6 "\t" $8 "\t" succ;
         }
         END { print last }'
This counts the number of successive reads and can be used to get an idea of
how we're doing.
If our general access pattern were scattered all over the map we'd see "0"
in the successive-reads column for most lines; but the trace shows long runs
of successive reads, which matches the mostly contiguous access pattern the
checksum implies.
Moving forward, I propose the attached patch to replace the
default FILE*-based iod subsystem. The way Poke currently
uses this subsystem is, in most cases (at least in my benchmark),
to read 1 byte at a given offset.
Looking at zfs.pk, it *should* read in increments of 4 bytes,
but I don't know if endianness (zfs.pk is set to big-endian;
my processor is little-endian) somehow messes that up.
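To illustrate why endianness might surface as byte-at-a-time reads: if PVM assembles a big-endian uint32 from four single-byte accesses, the value is the same as one 4-byte native load plus a byte swap, but the I/O pattern the device sees is four 1-byte reads. This is speculation on my part about PVM's behavior; the C below just demonstrates the equivalence:

```c
#include <stdint.h>

/* Assemble a big-endian uint32 from four single-byte accesses.  On a
   little-endian machine this is equivalent to a 4-byte load followed
   by a byte swap; only the access pattern toward the device differs.  */
static uint32_t
be32_from_bytes (const uint8_t *p)
{
  return ((uint32_t) p[0] << 24) | ((uint32_t) p[1] << 16)
         | ((uint32_t) p[2] << 8) | (uint32_t) p[3];
}
```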
I tried to figure out if I could make it read things in
larger batches, but as far as I can tell the reads come from
inside PVM, and I don't know enough about that part of Poke to
make intelligent suggestions other than "maybe we could see if we
could read more data at a time".
Lastly, the attached patch does not do write caching; instead of
updating the buffer when writing, it invalidates the read
cache. I have not benchmarked writing, but we should come up with
a benchmark for that so we can work on the pwrite side of things.
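The write path is then trivially correct at the cost of a cache refill after each overlapping write. A self-contained sketch of that invalidation logic, with `rdcache`/`uncached_pwrite` as illustrative names rather than the patch's actual identifiers:

```c
#include <unistd.h>
#include <sys/types.h>

/* Minimal stand-in for the read-cache bookkeeping; the cached bytes
   themselves are omitted since only the window matters here.  */
struct rdcache
{
  int fd;
  off_t cache_offset;
  ssize_t cache_valid;  /* 0 = cache empty */
};

/* Write straight through with pwrite(2); if the written range
   overlaps the cached window, drop the cache so the next read
   refills it from the file instead of serving stale bytes.  */
ssize_t
uncached_pwrite (struct rdcache *f, const void *buf, size_t count,
                 off_t offset)
{
  ssize_t n = pwrite (f->fd, buf, count, offset);
  if (n > 0
      && offset < f->cache_offset + f->cache_valid
      && offset + n > f->cache_offset)
    f->cache_valid = 0;  /* overlap: invalidate */
  return n;
}
```

An obvious refinement, if write benchmarks justify it, would be to patch the overlapping bytes into the cache instead of dropping it.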
There are also a number of outstanding problems / discussion points
marked with TODO in the attached patch; it would be nice if we could
discuss them too before merging. :-)
~ap
perf.FILE.txt
Description: Text document
perf.pread_4k.txt
Description: Text document
perf.pread_8k.txt
Description: Text document
0001-pread64-based-buffered-I-O.patch
Description: Text Data