From: Kevin Wolf
Subject: Re: [Qemu-devel] [PATCH v1 00/17] dataplane: optimization and multi virtqueue support
Date: Thu, 7 Aug 2014 15:51:30 +0200
User-agent: Mutt/1.5.21 (2010-09-15)

On 07.08.2014 at 12:27, Ming Lei wrote:
> On Wed, Aug 6, 2014 at 11:40 PM, Kevin Wolf <address@hidden> wrote:
> > On 06.08.2014 at 13:28, Ming Lei wrote:
> >> On Wed, Aug 6, 2014 at 6:09 PM, Kevin Wolf <address@hidden> wrote:
> >> > On 06.08.2014 at 11:37, Ming Lei wrote:
> >> >> On Wed, Aug 6, 2014 at 4:48 PM, Kevin Wolf <address@hidden> wrote:
> >> >> > However, I just wasn't sure whether a change on this level would be
> >> >> > relevant in a realistic environment. This is the reason why I wanted
> >> >> > to get a benchmark involving the block layer and some I/O.
> >> >> >
> >> >> >> From the profiling data in the link below:
> >> >> >>
> >> >> >>     http://pastebin.com/YwH2uwbq
> >> >> >>
> >> >> >> With coroutines, the running time for the same load is increased by
> >> >> >> ~50% (1.325s vs. 0.903s), dcache load events are increased by ~35%
> >> >> >> (693M vs. 512M), and insns per cycle are decreased by ~17%
> >> >> >> (1.35 vs. 1.63), compared with bypassing coroutines (-b parameter).
> >> >> >>
> >> >> >> The bypass code in the benchmark is very similar to the approach
> >> >> >> used in the bypass patch, since linux-aio with O_DIRECT seldom
> >> >> >> blocks in the kernel I/O path.
> >> >> >>
> >> >> >> Maybe the benchmark is a bit extreme, but given that modern storage
> >> >> >> devices may reach millions of IOPS, it is very easy for coroutines
> >> >> >> to slow down the I/O.
> >> >> >
> >> >> > I think in order to optimise coroutines, such benchmarks are fair game.
> >> >> > It's just not guaranteed that the effects are exactly the same on real
> >> >> > workloads, so we should take the results with a grain of salt.
> >> >> >
> >> >> > Anyhow, the coroutine version of your benchmark is buggy, it leaks all
> >> >> > coroutines instead of exiting them, so it can't make any use of the
> >> >> > coroutine pool. On my laptop, I get this (where fixed coroutine is a
> >> >> > version that simply removes the yield at the end):
> >> >> >
> >> >> >                 | bypass        | fixed coro    | buggy coro
> >> >> > ----------------+---------------+---------------+--------------
> >> >> > time            | 1.09s         | 1.10s         | 1.62s
> >> >> > L1-dcache-loads | 921,836,360   | 932,781,747   | 1,298,067,438
> >> >> > insns per cycle | 2.39          | 2.39          | 1.90
> >> >> >
> >> >> > This begs the question of whether you see a similar effect on a real
> >> >> > qemu and whether the coroutine pool is still not big enough? With
> >> >> > correct use of coroutines, the difference seems to be barely measurable
> >> >> > even without any I/O involved.
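
(For reference, a minimal sketch of what the two variants of the benchmark's
coroutine entry look like -- do_one_request() is a made-up stand-in for the
per-request work, and the API shown is the QEMU 2.1-era one, where
qemu_coroutine_enter() still takes an opaque argument:

    #include "block/coroutine.h"    /* "qemu/coroutine.h" in newer trees */

    /* Hypothetical stand-in for the per-request work of the benchmark. */
    static void do_one_request(void *opaque)
    {
    }

    /* Buggy variant: the final yield leaves the coroutine alive forever,
     * so it is leaked and never goes back into the coroutine pool. */
    static void coroutine_fn bench_co_buggy(void *opaque)
    {
        do_one_request(opaque);
        qemu_coroutine_yield();     /* nobody ever re-enters us: leak */
    }

    /* Fixed variant: simply returning terminates the coroutine, which lets
     * it be recycled from the pool by the next qemu_coroutine_create(). */
    static void coroutine_fn bench_co_fixed(void *opaque)
    {
        do_one_request(opaque);
    }

    static void run_bench(long nr_requests)
    {
        long i;

        for (i = 0; i < nr_requests; i++) {
            Coroutine *co = qemu_coroutine_create(bench_co_fixed);
            qemu_coroutine_enter(co, NULL);
        }
    }

The pool only helps when coroutines actually terminate, which is exactly what
the buggy variant prevents.)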
> >> >>
> >> >> When I comment out qemu_coroutine_yield(), the results of bypass and
> >> >> fixed coro look very similar to your test, and I am just wondering if
> >> >> the stack is always switched in qemu_coroutine_enter() even without
> >> >> calling qemu_coroutine_yield().
> >> >
> >> > Yes, definitely. qemu_coroutine_enter() always involves calling
> >> > qemu_coroutine_switch(), which is the stack switch.
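
(In rough paraphrase -- this is not the literal qemu-coroutine.c source, just
the shape of the control flow, using the Coroutine fields from
coroutine_int.h -- every enter pays for a stack switch, and a second switch
happens when the coroutine yields or terminates:

    /* Rough paraphrase of the enter path, QEMU 2.1-era API. */
    void qemu_coroutine_enter(Coroutine *co, void *opaque)
    {
        Coroutine *self = qemu_coroutine_self();

        co->caller = self;
        co->entry_arg = opaque;

        /* Unconditional stack switch into the coroutine. */
        switch (qemu_coroutine_switch(self, co, COROUTINE_ENTER)) {
        case COROUTINE_YIELD:
            break;                  /* will be re-entered later */
        case COROUTINE_TERMINATE:
            coroutine_delete(co);   /* back into the pool if there is room */
            break;
        default:
            abort();
        }
    }

So even a coroutine that never yields still costs two switches per request,
one on enter and one on terminate; it just doesn't get leaked.)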
> >> >
> >> >> Without the yield, the benchmark can't emulate the coroutine usage in
> >> >> the bdrv_aio_readv/writev() path any more, and the bypass in the patchset
> >> >> skips two qemu_coroutine_enter() calls and one qemu_coroutine_yield()
> >> >> for each bdrv_aio_readv/writev().
> >> >
> >> > It's not completely comparable anyway because you're not going through a
> >> > main loop and callbacks from there for your benchmark.
> >> >
> >> > But fair enough: Keep the yield, but enter the coroutine twice then. You
> >> > get slightly worse results then, but that's more like doubling the very
> >> > small difference between "bypass" and "fixed coro" (1.11s / 946,434,327
> >> > / 2.37), not like the horrible performance of the buggy version.
> >>
> >> Yes, I compared that too; it looks like there is no big difference.
> >>
> >> >
> >> > Actually, that's within the error of measurement for time and
> >> > insns/cycle, so running it for a bit longer:
> >> >
> >> >                 | bypass    | coro      | + yield   | buggy coro
> >> > ----------------+-----------+-----------+-----------+--------------
> >> > time            | 21.45s    | 21.68s    | 21.83s    | 97.05s
> >> > L1-dcache-loads | 18,049 M  | 18,387 M  | 18,618 M  | 26,062 M
> >> > insns per cycle | 2.42      | 2.40      | 2.41      | 1.75
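
(To make the comparison concrete, the "keep the yield, but enter the coroutine
twice" variant looks roughly like this -- bench_io() is a made-up stand-in for
the request work, and in real code the second enter would come from the AIO
completion callback in the main loop rather than immediately:

    #include "block/coroutine.h"    /* "qemu/coroutine.h" in newer trees */

    /* Hypothetical stand-in for the actual request work. */
    static void bench_io(void *opaque)
    {
    }

    /* Mimics the bdrv_aio_readv/writev() pattern: the request coroutine
     * yields once as if waiting for I/O and is entered a second time on
     * "completion", then terminates normally. */
    static void coroutine_fn bench_co_yield(void *opaque)
    {
        bench_io(opaque);
        qemu_coroutine_yield();     /* "wait for completion" */
        /* re-entered below; returning ends the coroutine */
    }

    static void run_one(void *req)
    {
        Coroutine *co = qemu_coroutine_create(bench_co_yield);

        qemu_coroutine_enter(co, req);  /* first enter: submit */
        qemu_coroutine_enter(co, NULL); /* second enter: fake completion */
    }

That is two enters and one yield per request, which is what the bypass in the
patchset saves for each bdrv_aio_readv/writev().)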
> >> >
> >> >> >> > I played a bit with the following, I hope it's not too naive. I
> >> >> >> > couldn't see a difference with your patches, but at least one reason
> >> >> >> > for this is probably that my laptop SSD isn't fast enough to make the
> >> >> >> > CPU the bottleneck. Haven't tried ramdisk yet, that would probably be
> >> >> >> > the next thing. (I actually wrote the patch up just for some profiling
> >> >> >> > on my own, not for comparing throughput, but it should be usable for
> >> >> >> > that as well.)
> >> >> >>
> >> >> >> This might not be good for the test since it is basically a sequential
> >> >> >> read test, which can be optimized a lot by the kernel. And I always use
> >> >> >> a randread benchmark.
> >> >> >
> >> >> > Yes, I briefly pondered whether I should implement random offsets
> >> >> > instead. But then I realised that a quicker kernel operation would only
> >> >> > help the benchmark, because we want it to test the CPU consumption in
> >> >> > userspace. So the faster the kernel gets, the better for us, because it
> >> >> > should make the impact of coroutines bigger.
> >> >>
> >> >> OK, I will compare coroutine vs. bypass-co with the benchmark.
> >>
> >> I use the /dev/nullb0 block device for the test, which is available in
> >> Linux kernel 3.13+. The difference follows, and it looks not very
> >> big (< 10%):
> >
> > Sounds useful. I'm running on an older kernel, so I used a loop-mounted
> > file on tmpfs instead for my tests.
> 
> Actually loop is a slow device, and recently I used kernel AIO and blk-mq
> to speed it up a lot.

Yes, I have no doubts that it's slower than a proper ramdisk, but it
should still be way faster than my normal disk.

> > Anyway, at some point today I figured I should take a different approach
> > and not try to minimise the problems that coroutines introduce, but
> > rather make the most use of them when we have them. After all, the
> > raw-posix driver is still very callback-oriented and does things that
> > aren't really necessary with coroutines (such as AIOCB allocation).
> >
> > The qemu-img bench time I ended up with looked quite nice. Maybe you
> > want to take a look if you can reproduce these results, both with
> > qemu-img bench and your real benchmark.
> >
> >
> > $ for i in $(seq 1 5); do time ./qemu-img bench -t none -n -c 2000000 /dev/loop0; done
> > Sending 2000000 requests, 4096 bytes each, 64 in parallel
> >
> >         bypass (base) | bypass (patch) | coro (base) | coro (patch)
> > ----------------------+----------------+-------------+---------------
> > run 1   0m5.966s      | 0m5.687s       |  0m6.224s   | 0m5.362s
> > run 2   0m5.826s      | 0m5.831s       |  0m5.994s   | 0m5.541s
> > run 3   0m6.145s      | 0m5.495s       |  0m6.253s   | 0m5.408s
> > run 4   0m5.683s      | 0m5.527s       |  0m6.045s   | 0m5.293s
> > run 5   0m5.904s      | 0m5.607s       |  0m6.238s   | 0m5.207s
> 
> I suggest running the test a bit longer.

Okay, ran it again with -c 10000000 this time. I also used the updated
branch for the patched version. This means that the __thread patch is
not enabled; this is probably why the improvement for the bypass has
disappeared and the coroutine-based version only approaches, but doesn't
beat it this time.

        bypass (base) | bypass (patch) | coro (base) | coro (patch)
----------------------+----------------+-------------+---------------
run 1   28.255s       |  28.615s       | 30.364s     | 28.318s
run 2   28.190s       |  28.926s       | 30.096s     | 28.437s
run 3   28.079s       |  29.603s       | 30.084s     | 28.567s
run 4   28.888s       |  28.581s       | 31.343s     | 28.605s
run 5   28.196s       |  28.924s       | 30.033s     | 27.935s

> > You can find my working tree at:
> >
> >     git://repo.or.cz/qemu/kevin.git perf-bypass
> 
> I just tried your working tree, and it looks like qemu-img works well
> with your linux-aio coro patches, but unfortunately there is little
> improvement observed on my server; basically the result is the same as
> without bypass. On my laptop, the improvement can be observed, but it
> is still at least 5% less than with bypass.
> 
> Let's see the result on my server:
> 
> ming@:~/git/qemu$ sudo ./qemu-img bench -f raw -t off -n -c 6400000 /dev/nullb5
> Sending 6400000 requests, 4096 bytes each, 64 in parallel
>     read time: 38351ms, 166.000000K IOPS
> ming@:~/git/qemu$
> ming@:~/git/qemu$ sudo ./qemu-img bench -f raw -t off -n -c 6400000 -b /dev/nullb5
> Sending 6400000 requests, 4096 bytes each, 64 in parallel
>     read time: 35241ms, 181.000000K IOPS

Hm, interesting. Apparently our environments are different enough to
come to opposite conclusions.

I also tried running some fio benchmarks based on the configuration you
had in the cover letter (just a bit downsized to fit it in the ramdisk)
and came to completely different results: For me, git master is a lot
better than qemu 2.0. The optimisation branch showed small, but
measurable additional improvements, with coroutines consistently being a
bit ahead of the bypass mode.

> > Please note that I added an even worse and even wronger hack to keep the
> > bypass working so I can compare it (raw-posix now exposes both bdrv_aio*
> > and bdrv_co_*, and enabling the bypass also switches between them). Also,
> > once the AIO code that I kept for the bypass mode is gone, we can make the
> > coroutine path even nicer.
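
(Schematically, that hack amounts to the driver advertising both interfaces at
once. This is illustrative only, not the literal diff from the branch; the
raw_co_readv/raw_co_writev names are just what such coroutine-based functions
might be called:

    #include "block/block_int.h"

    /* Callback-based path, kept so the bypass mode can still be compared. */
    static BlockDriverAIOCB *raw_aio_readv(BlockDriverState *bs,
            int64_t sector_num, QEMUIOVector *qiov, int nb_sectors,
            BlockDriverCompletionFunc *cb, void *opaque);
    static BlockDriverAIOCB *raw_aio_writev(BlockDriverState *bs,
            int64_t sector_num, QEMUIOVector *qiov, int nb_sectors,
            BlockDriverCompletionFunc *cb, void *opaque);

    /* Coroutine-based path used when the bypass is off. */
    static int coroutine_fn raw_co_readv(BlockDriverState *bs,
            int64_t sector_num, int nb_sectors, QEMUIOVector *qiov);
    static int coroutine_fn raw_co_writev(BlockDriverState *bs,
            int64_t sector_num, int nb_sectors, QEMUIOVector *qiov);

    static BlockDriver bdrv_file = {
        .format_name     = "file",
        .bdrv_aio_readv  = raw_aio_readv,
        .bdrv_aio_writev = raw_aio_writev,
        .bdrv_co_readv   = raw_co_readv,
        .bdrv_co_writev  = raw_co_writev,
        /* remaining callbacks omitted in this sketch */
    };

Enabling the bypass then just changes which of the two sets the block layer
dispatches to.)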
> 
> This approach looks nice since it saves the intermediate callback.
> 
> Basically the current bypass approach is to bypass the coroutine in the
> block layer, but linux-aio takes a new coroutine, so these are two
> different paths. And linux-aio's coroutine can still be bypassed easily
> too, :-)

The patched linux-aio doesn't create a new coroutine, it simply stays
in the one coroutine that we have and in which we already are. Bypassing
it by making the yield conditional would still be possible, of course
(for testing anyway; I don't think anything like that can be merged
easily).
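
Roughly, the idea looks like this (a simplified sketch with made-up names, not
the actual code in the branch; error handling is trimmed, and the completion
side runs from the event loop of the same thread, so it can only fire after
the yield):

    #include <libaio.h>
    #include <errno.h>
    #include <stdbool.h>
    #include "block/coroutine.h"    /* "qemu/coroutine.h" in newer trees */

    typedef struct LaioRequestSketch {
        Coroutine *co;
        int ret;
        bool done;
    } LaioRequestSketch;

    /* Submit from the coroutine we are already in and wait right here. */
    static int coroutine_fn laio_co_submit_sketch(io_context_t ctx,
                                                  struct iocb *iocb)
    {
        LaioRequestSketch req = {
            .co = qemu_coroutine_self(),
        };

        iocb->data = &req;
        if (io_submit(ctx, 1, &iocb) < 0) {
            return -EIO;
        }

        if (!req.done) {
            qemu_coroutine_yield();     /* no new coroutine, no callback chain */
        }
        return req.ret;
    }

    /* Called when the event loop sees the io_getevents() completion. */
    static void laio_complete_sketch(LaioRequestSketch *req, int ret)
    {
        req->ret = ret;
        req->done = true;
        /* QEMU 2.1-era enter(): resume the waiting request coroutine */
        qemu_coroutine_enter(req->co, NULL);
    }

Making the yield conditional would give a bypass knob for this path as well,
but as said, only for testing.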

Kevin


