From: Ming Lei
Subject: Re: [Qemu-devel] [PATCH v1 00/17] dataplane: optimization and multi virtqueue support
Date: Fri, 8 Aug 2014 19:26:38 +0800

On Fri, Aug 8, 2014 at 6:32 PM, Ming Lei <address@hidden> wrote:
> On Thu, Aug 7, 2014 at 9:51 PM, Kevin Wolf <address@hidden> wrote:
>> Am 07.08.2014 um 12:27 hat Ming Lei geschrieben:
>>> On Wed, Aug 6, 2014 at 11:40 PM, Kevin Wolf <address@hidden> wrote:
>>> > Am 06.08.2014 um 13:28 hat Ming Lei geschrieben:
>>> >> On Wed, Aug 6, 2014 at 6:09 PM, Kevin Wolf <address@hidden> wrote:
>>> >> > Am 06.08.2014 um 11:37 hat Ming Lei geschrieben:
>>> >> >> On Wed, Aug 6, 2014 at 4:48 PM, Kevin Wolf <address@hidden> wrote:
>>> >> >> > However, I just wasn't sure whether a change on this level would be
>>> >> >> > relevant in a realistic environment. This is the reason why I wanted to
>>> >> >> > get a benchmark involving the block layer and some I/O.
>>> >> >> >
>>> >> >> >> From the profiling data in below link:
>>> >> >> >>
>>> >> >> >>     http://pastebin.com/YwH2uwbq
>>> >> >> >>
>>> >> >> >> With coroutines, the running time for the same load is increased by
>>> >> >> >> ~50% (1.325s vs. 0.903s), dcache load events are increased by ~35%
>>> >> >> >> (693M vs. 512M), and insns per cycle are decreased by ~50%
>>> >> >> >> (1.35 vs. 1.63), compared with bypassing coroutines (-b parameter).
>>> >> >> >>
>>> >> >> >> The bypass code in the benchmark is very similar to the approach
>>> >> >> >> used in the bypass patch, since linux-aio with O_DIRECT seldom
>>> >> >> >> blocks in the kernel I/O path.
>>> >> >> >>
>>> >> >> >> Maybe the benchmark is a bit extreme, but modern storage devices may
>>> >> >> >> reach millions of IOPS, and it is very easy for coroutines to slow
>>> >> >> >> down the I/O.
>>> >> >> >
>>> >> >> > I think in order to optimise coroutines, such benchmarks are fair game.
>>> >> >> > It's just not guaranteed that the effects are exactly the same on real
>>> >> >> > workloads, so we should take the results with a grain of salt.
>>> >> >> >
>>> >> >> > Anyhow, the coroutine version of your benchmark is buggy, it leaks all
>>> >> >> > coroutines instead of exiting them, so it can't make any use of the
>>> >> >> > coroutine pool. On my laptop, I get this (where fixed coroutine is a
>>> >> >> > version that simply removes the yield at the end):
>>> >> >> >
>>> >> >> >                 | bypass        | fixed coro    | buggy coro
>>> >> >> > ----------------+---------------+---------------+--------------
>>> >> >> > time            | 1.09s         | 1.10s         | 1.62s
>>> >> >> > L1-dcache-loads | 921,836,360   | 932,781,747   | 1,298,067,438
>>> >> >> > insns per cycle | 2.39          | 2.39          | 1.90
>>> >> >> >
>>> >> >> > Begs the question whether you see a similar effect on a real qemu and
>>> >> >> > the coroutine pool is still not big enough? With correct use of
>>> >> >> > coroutines, the difference seems to be barely measurable even without
>>> >> >> > any I/O involved.
>>> >> >>
>>> >> >> When I comment out qemu_coroutine_yield(), the results of bypass and
>>> >> >> fixed coro look very similar to your test, and I am just wondering
>>> >> >> whether the stack is always switched in qemu_coroutine_enter() even
>>> >> >> without calling qemu_coroutine_yield().
>>> >> >
>>> >> > Yes, definitely. qemu_coroutine_enter() always involves calling
>>> >> > qemu_coroutine_switch(), which is the stack switch.
>>> >> >
>>> >> >> Without the yield, the benchmark can't emulate coroutine usage in the
>>> >> >> bdrv_aio_readv/writev() path any more, and the bypass in the patchset
>>> >> >> skips two qemu_coroutine_enter() calls and one qemu_coroutine_yield()
>>> >> >> for each bdrv_aio_readv/writev().
>>> >> >
>>> >> > It's not completely comparable anyway because you're not going through a
>>> >> > main loop and callbacks from there for your benchmark.
>>> >> >
>>> >> > But fair enough: Keep the yield, but enter the coroutine twice then. You
>>> >> > get slightly worse results then, but that's more like doubling the very
>>> >> > small difference between "bypass" and "fixed coro" (1.11s / 946,434,327
>>> >> > / 2.37), not like the horrible performance of the buggy version.
>>> >>
>>> >> Yes, I compared that too, and there seems to be no big difference.
>>> >>
>>> >> >
>>> >> > Actually, that's within the error of measurement for time and
>>> >> > insns/cycle, so running it for a bit longer:
>>> >> >
>>> >> >                 | bypass    | coro      | + yield   | buggy coro
>>> >> > ----------------+-----------+-----------+-----------+--------------
>>> >> > time            | 21.45s    | 21.68s    | 21.83s    | 97.05s
>>> >> > L1-dcache-loads | 18,049 M  | 18,387 M  | 18,618 M  | 26,062 M
>>> >> > insns per cycle | 2.42      | 2.40      | 2.41      | 1.75
>>> >> >
>>> >> >> >> > I played a bit with the following, I hope it's not too naive. I couldn't
>>> >> >> >> > see a difference with your patches, but at least one reason for this is
>>> >> >> >> > probably that my laptop SSD isn't fast enough to make the CPU the
>>> >> >> >> > bottleneck. Haven't tried ramdisk yet, that would probably be the next
>>> >> >> >> > thing. (I actually wrote the patch up just for some profiling on my own,
>>> >> >> >> > not for comparing throughput, but it should be usable for that as well.)
>>> >> >> >>
>>> >> >> >> This might not be good for the test since it is basically a sequential
>>> >> >> >> read test, which can be optimized a lot by the kernel. I always use a
>>> >> >> >> randread benchmark.
>>> >> >> >
>>> >> >> > Yes, I shortly pondered whether I should implement random offsets
>>> >> >> > instead. But then I realised that a quicker kernel operation would only
>>> >> >> > help the benchmark because we want it to test the CPU consumption in
>>> >> >> > userspace. So the faster the kernel gets, the better for us, because it
>>> >> >> > should make the impact of coroutines bigger.
>>> >> >>
>>> >> >> OK, I will compare coroutine vs. bypass-co with the benchmark.
>>> >>
>>> >> I use the /dev/nullb0 block device for the test, which is available in
>>> >> Linux kernel 3.13+; the difference follows, and it doesn't look very
>>> >> big (< 10%):
>>> >
>>> > Sounds useful. I'm running on an older kernel, so I used a loop-mounted
>>> > file on tmpfs instead for my tests.
>>>
>>> Actually loop is a slow device; recently I used kernel AIO and blk-mq
>>> to speed it up a lot.
>>
>> Yes, I have no doubts that it's slower than a proper ramdisk, but it
>> should still be way faster than my normal disk.
>>
>>> > Anyway, at some point today I figured I should take a different approach
>>> > and not try to minimise the problems that coroutines introduce, but
>>> > rather make the most use of them when we have them. After all, the
>>> > raw-posix driver is still very callback-oriented and does things that
>>> > aren't really necessary with coroutines (such as AIOCB allocation).
>>> >
>>> > The qemu-img bench time I ended up with looked quite nice. Maybe you
>>> > want to take a look if you can reproduce these results, both with
>>> > qemu-img bench and your real benchmark.
>>> >
>>> >
>>> > $ for i in $(seq 1 5); do time ./qemu-img bench -t none -n -c 2000000 /dev/loop0; done
>>> > Sending 2000000 requests, 4096 bytes each, 64 in parallel
>>> >
>>> >         bypass (base) | bypass (patch) | coro (base) | coro (patch)
>>> > ----------------------+----------------+-------------+---------------
>>> > run 1   0m5.966s      | 0m5.687s       |  0m6.224s   | 0m5.362s
>>> > run 2   0m5.826s      | 0m5.831s       |  0m5.994s   | 0m5.541s
>>> > run 3   0m6.145s      | 0m5.495s       |  0m6.253s   | 0m5.408s
>>> > run 4   0m5.683s      | 0m5.527s       |  0m6.045s   | 0m5.293s
>>> > run 5   0m5.904s      | 0m5.607s       |  0m6.238s   | 0m5.207s
>>>
>>> I suggest running the test a bit longer.
>>
>> Okay, ran it again with -c 10000000 this time. I also used the updated
>> branch for the patched version. This means that the __thread patch is
>> not enabled; this is probably why the improvement for the bypass has
>> disappeared and the coroutine based version only approaches, but doesn't
>> beat it this time.
>>
>>         bypass (base) | bypass (patch) | coro (base) | coro (patch)
>> ----------------------+----------------+-------------+---------------
>> run 1   28.255s       |  28.615s       | 30.364s     | 28.318s
>> run 2   28.190s       |  28.926s       | 30.096s     | 28.437s
>> run 3   28.079s       |  29.603s       | 30.084s     | 28.567s
>> run 4   28.888s       |  28.581s       | 31.343s     | 28.605s
>> run 5   28.196s       |  28.924s       | 30.033s     | 27.935s
>
> Your result is quite good (>300K IOPS), much better than my result with
> /dev/nullb0 (less than 200K). I also tried loop over a file in tmpfs, which
> looks a bit quicker than /dev/nullb0 (still ~200K IOPS on my server), so
> I guess your machine is very fast.
>
> It is a bit similar to my observations:
>
> - on my laptop (CPU: 2.6GHz), your coro patch improved things a lot, and is
> only less than 5% slower than bypass
> - on my server (CPU: 1.6GHz, same L1/L2 cache as the laptop, bigger L3 cache),
> your coro patch improved little, and is less than 10% slower than bypass
>
> so it looks like coroutines behave better on fast CPUs than on slow ones?

I think this is true:

- using coroutines inevitably introduces some extra CPU load (coroutine_swap,
and the dcache misses caused by switching stacks); the standalone sketch below
illustrates the kind of switch involved

- the introduced load may not be a (big) deal for a fast CPU, but it makes a
difference for a slower CPU

- even for a fast CPU, 'perf stat' may show some difference, more or less, in
instructions per cycle, dcache loads and misses, branch misses, dTLB
misses...
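
To make the first point concrete, here is a standalone sketch (not QEMU code,
and not from either tree): a minimal ucontext-based ping-pong that does one
stack switch per "enter" and one per "yield", which is roughly the extra work a
coroutine round trip adds per request. The ucontext API is only used for
illustration; QEMU's coroutine backends use their own switching code, and the
iteration count and stack size here are arbitrary.

/* Standalone illustration only: one swapcontext() per "enter" and one per
 * "yield", similar in spirit to the cost of an enter/yield pair. */
#include <stdio.h>
#include <ucontext.h>

#define ITERATIONS (1000000L)
#define CO_STACK_SIZE (64 * 1024)

static ucontext_t main_ctx, co_ctx;
static char co_stack[CO_STACK_SIZE];

/* "Coroutine" body: do nothing but switch back to the caller. */
static void co_entry(void)
{
    for (long i = 0; i < ITERATIONS; i++) {
        swapcontext(&co_ctx, &main_ctx);    /* roughly a yield */
    }
}

int main(void)
{
    getcontext(&co_ctx);
    co_ctx.uc_stack.ss_sp = co_stack;
    co_ctx.uc_stack.ss_size = sizeof(co_stack);
    co_ctx.uc_link = &main_ctx;
    makecontext(&co_ctx, co_entry, 0);

    for (long i = 0; i < ITERATIONS; i++) {
        swapcontext(&main_ctx, &co_ctx);    /* roughly an enter */
    }
    printf("done %ld enter/yield pairs\n", ITERATIONS);
    return 0;
}

Running something like this (or qemu-img bench itself) under perf should show
the counters mentioned above, e.g. something along the lines of:

perf stat -e cycles,instructions,L1-dcache-loads,L1-dcache-load-misses,branch-misses,dTLB-load-misses ./a.out

(exact event names vary by CPU and kernel).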

BTW, Kevin, if we want to see the coroutine effect in the block I/O path, it
may be better to use the same path (bypass the linux-aio coroutine too in the
tree below, or just qemu master with my patchset) to compare results, since
then the only difference is whether a coroutine is used or not:

   git://kernel.ubuntu.com/ming/qemu.git  v2.1.0-mq.1-kevin-perf

In your perf-bypass branch, bypass and non-bypass run different paths, so it
is not suitable for doing the comparison and drawing conclusions.

Thanks,


