
Re: [Qemu-devel] QEMU event loop optimizations


From: Sergio Lopez
Subject: Re: [Qemu-devel] QEMU event loop optimizations
Date: Fri, 05 Apr 2019 18:29:49 +0200
User-agent: mu4e 1.0; emacs 26.1

Stefan Hajnoczi writes:

> Hi Sergio,
> Here are the forgotten event loop optimizations I mentioned:
>
>   https://github.com/stefanha/qemu/commits/event-loop-optimizations
>
> The goal was to eliminate or reorder syscalls so that useful work (like
> executing BHs) occurs as soon as possible after an event is detected.
>
> I remember that these optimizations only shave off a handful of
> microseconds, so they aren't a huge win.  They do become attractive on
> fast SSDs with <10us read/write latency.
>
> These optimizations are aggressive and there is a possibility of
> introducing regressions.
>
> If you have time to pick up this work, try benchmarking each commit
> individually so performance changes are attributed individually.
> There's no need to send them together in a single patch series, the
> changes are quite independent.

It took me a while to find a way to get meaningful numbers to evaluate
those optimizations. The problem is that here (Xeon E5-2640 v3 and EPYC
7351P) the cost of event_notifier_set() is just ~0.4us when the code
path is hot, and it's hard to differentiate it from the noise.
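
For reference, on Linux event_notifier_set() essentially boils down to
a single write() of the value 1 to an eventfd (the real implementation
lives in util/event_notifier-posix.c). The standalone sketch below is
only an illustration of where those ~0.4us go, not the QEMU code:

    #include <errno.h>
    #include <stdint.h>
    #include <unistd.h>
    #include <sys/eventfd.h>

    /* Sketch: "setting" an event notifier is one write() syscall that
     * wakes up whoever is blocked polling the eventfd. */
    static void notifier_set(int efd)
    {
        uint64_t value = 1;
        ssize_t ret;

        do {
            ret = write(efd, &value, sizeof(value));
        } while (ret < 0 && errno == EINTR);
    }

    int main(void)
    {
        int efd = eventfd(0, EFD_CLOEXEC);
        if (efd >= 0) {
            notifier_set(efd);
        }
        return 0;
    }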

To do so, I've used a patched kernel with a naive io_poll implementation
for virtio_blk [1], a QEMU also patched with poll-inflight [2] (just to
be sure we're actually polling), and ran the test on semi-isolated cores
(nohz_full + rcu_nocbs + systemd isolation) with idle siblings. The
storage is simulated by null_blk with "completion_nsec=0 no_sched=1
irqmode=0".

# fio --time_based --runtime=30 --rw=randread --name=randread \
      --filename=/dev/vdb --direct=1 --ioengine=pvsync2 --iodepth=1 --hipri=1

| avg_lat (us) | master | qbsn* |
|--------------+--------+-------|
| run1         | 11.32  | 10.96 |
| run2         | 11.37  | 10.79 |
| run3         | 11.42  | 10.67 |
| run4         | 11.32  | 11.06 |
| run5         | 11.42  | 11.19 |
| run6         | 11.42  | 10.91 |

 * patched with "aio: add optimized qemu_bh_schedule_nested() API"

Even though there's still some variance in the numbers, the ~0.4us
improvement can be clearly observed.
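
For context, my reading of the qemu_bh_schedule_nested() patch is that
when a BH is scheduled from code that is already running inside the
event loop, the event_notifier_set() wakeup can be skipped, because the
loop will re-scan scheduled BHs before blocking again. A self-contained
sketch of that idea (the names and structure are illustrative, not the
actual QEMU API):

    #include <stdatomic.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <unistd.h>
    #include <sys/eventfd.h>

    typedef struct BH {
        void (*cb)(void *opaque);
        void *opaque;
        atomic_int scheduled;
    } BH;

    /* Schedule a BH. When called from outside the event loop we must
     * write to the loop's eventfd so it wakes up; when called from a
     * callback already running inside the loop ("nested"), the loop
     * will re-scan scheduled BHs before blocking again, so the eventfd
     * write (and its ~0.4us syscall cost) can be skipped. */
    static void bh_schedule(BH *bh, int in_event_loop, int notifier_fd)
    {
        atomic_store(&bh->scheduled, 1);
        if (!in_event_loop) {
            uint64_t one = 1;
            (void)write(notifier_fd, &one, sizeof(one));
        }
    }

    static void cb(void *opaque)
    {
        (void)opaque;
        puts("bh ran");
    }

    int main(void)
    {
        int efd = eventfd(0, 0);
        BH bh = { .cb = cb, .opaque = NULL };

        bh_schedule(&bh, 1 /* pretend we're inside the loop */, efd);
        if (atomic_load(&bh.scheduled)) {
            bh.cb(bh.opaque);
        }
        return 0;
    }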

I haven't tested the other 3 patches, as their optimizations only take
effect when the event loop is not running in polling mode. Without
polling we get an additional overhead of at least 10us, plus a lot of
noise, due to both direct costs (ppoll()...) and indirect ones
(re-scheduling and TLB/cache pollution), so I don't think we can
reliably benchmark them. Their impact probably won't be significant
either, given the costs I've just mentioned.

Sergio.

[1] https://github.com/slp/linux/commit/d369b37db3e298933e8bb88c6eeacff07f39bc13
[2] https://lists.nongnu.org/archive/html/qemu-devel/2019-04/msg00447.html


