From: Ming Lei
Subject: Re: [Qemu-devel] [PATCH v1 00/17] dataplane: optimization and multi virtqueue support
Date: Wed, 6 Aug 2014 19:28:58 +0800

On Wed, Aug 6, 2014 at 6:09 PM, Kevin Wolf <address@hidden> wrote:
> On 06.08.2014 at 11:37, Ming Lei wrote:
>> On Wed, Aug 6, 2014 at 4:48 PM, Kevin Wolf <address@hidden> wrote:
>> > On 06.08.2014 at 07:33, Ming Lei wrote:
>> >> Hi Kevin,
>> >>
>> >> On Tue, Aug 5, 2014 at 10:47 PM, Kevin Wolf <address@hidden> wrote:
>> >> > On 05.08.2014 at 15:48, Stefan Hajnoczi wrote:
>> >> >> I have been wondering how to prove that the root cause is the ucontext
>> >> >> coroutine mechanism (stack switching).  Here is an idea:
>> >> >>
>> >> >> Hack your "bypass" code path to run the request inside a coroutine.
>> >> >> That way you can compare "bypass without coroutine" against
>> >> >> "bypass with coroutine".
>> >> >>
>> >> >> Right now I think there are doubts because the bypass code path is
>> >> >> indeed a different (and not 100% correct) code path.  So this approach
>> >> >> might prove that the coroutines are adding the overhead and not
>> >> >> something that you bypassed.
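
(Side note, to make that experiment concrete: wrapping a request in a
coroutine only needs something like the sketch below. The helper name
submit_bypass_request() and the BenchRequest type are made up for
illustration, not taken from the bypass patch; the coroutine calls are
the API as I understand it in this tree.)

    #include "block/coroutine.h"          /* coroutine API header in this tree */

    typedef struct BenchRequest BenchRequest;   /* hypothetical request context */

    static void coroutine_fn bypass_co_entry(void *opaque)
    {
        BenchRequest *req = opaque;

        /* submit the request exactly as the bypass path does today */
        submit_bypass_request(req);             /* hypothetical helper */
    }

    /* per request: */
    Coroutine *co = qemu_coroutine_create(bypass_co_entry);
    qemu_coroutine_enter(co, req);              /* adds only the stack switch */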
>> >> >
>> >> > My doubts aren't only that the overhead might not come from the
>> >> > coroutines, but also whether any coroutine-related overhead is really
>> >> > unavoidable. If we can optimise coroutines, I'd strongly prefer to do
>> >> > just that instead of introducing additional code paths.
>> >>
>> >> OK, thank you for taking a look at the problem; I hope we can
>> >> figure out the root cause. :-)
>> >>
>> >> >
>> >> > Another thought I had was this: If the performance difference is indeed
>> >> > only coroutines, then that is completely inside the block layer and we
>> >> > don't actually need a VM to test it. We could instead have something
>> >> > like a simple qemu-img based benchmark and should be observing the same.
>> >>
>> >> It is even simpler to run a coroutine-only benchmark, so I just
>> >> wrote a rough one, and it looks like coroutines do decrease performance
>> >> a lot; please see the attached patch, and thanks for your template,
>> >> which helped me add the 'co_bench' command to qemu-img.
>> >
>> > Yes, we can look at coroutines microbenchmarks in isolation. I actually
>> > did do that yesterday with the yield test from tests/test-coroutine.c.
>> > And in fact profiling immediately showed something to optimise:
>> > pthread_getspecific() was quite high, replacing it by __thread on
>> > systems where it works is more efficient and helped the numbers a bit.
>> > Also, a lot of time seems to be spent in pthread_mutex_lock/unlock (even
>> > in qemu-img bench), maybe there's even something that can be done here.
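
(For readers following along, this is the kind of change meant here: a
sketch of the technique only, not the actual patch, with illustrative
variable and function names.)

    /* before: per-thread "current coroutine" via pthread keys,
     * one libpthread call on every access */
    static pthread_key_t current_key;

    static Coroutine *get_current(void)
    {
        return pthread_getspecific(current_key);
    }

    /* after: compiler-supported TLS on systems where __thread works,
     * a plain memory access */
    static __thread Coroutine *current;

    static Coroutine *get_current(void)
    {
        return current;
    }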
>>
>> The lock/unlock in dataplane is often from memory_region_find(), and Paolo
>> should have done lots of work on that.
>>
>> >
>> > However, I just wasn't sure whether a change on this level would be
>> > relevant in a realistic environment. This is the reason why I wanted to
>> > get a benchmark involving the block layer and some I/O.
>> >
>> >> From the profiling data in below link:
>> >>
>> >>     http://pastebin.com/YwH2uwbq
>> >>
>> >> With coroutines, the running time for the same load increases by
>> >> ~50% (1.325s vs. 0.903s), dcache load events increase by ~35%
>> >> (693M vs. 512M), and insns per cycle drops from 1.63 to 1.35,
>> >> compared with bypassing coroutines (-b parameter).
>> >>
>> >> The bypass code in the benchmark is very similar to the approach
>> >> used in the bypass patch, since linux-aio with O_DIRECT seldom
>> >> blocks in the kernel I/O path.
>> >>
>> >> Maybe the benchmark is a bit extreme, but given that modern storage
>> >> devices may reach millions of IOPS, it is very easy for coroutines
>> >> to slow down the I/O.
>> >
>> > I think in order to optimise coroutines, such benchmarks are fair game.
>> > It's just not guaranteed that the effects are exactly the same on real
>> > workloads, so we should take the results with a grain of salt.
>> >
>> > Anyhow, the coroutine version of your benchmark is buggy, it leaks all
>> > coroutines instead of exiting them, so it can't make any use of the
>> > coroutine pool. On my laptop, I get this (where fixed coroutine is a
>> > version that simply removes the yield at the end):
>> >
>> >                 | bypass        | fixed coro    | buggy coro
>> > ----------------+---------------+---------------+--------------
>> > time            | 1.09s         | 1.10s         | 1.62s
>> > L1-dcache-loads | 921,836,360   | 932,781,747   | 1,298,067,438
>> > insns per cycle | 2.39          | 2.39          | 1.90
>> >
> Begs the question whether you see a similar effect on a real qemu and
> whether the coroutine pool is still not big enough? With correct use of
>> > coroutines, the difference seems to be barely measurable even without
>> > any I/O involved.
>>
>> When I comment out qemu_coroutine_yield(), the results for
>> bypass and fixed coro look very similar to your test, and I am just
>> wondering whether the stack is always switched in qemu_coroutine_enter(),
>> even without calling qemu_coroutine_yield().
>
> Yes, definitely. qemu_coroutine_enter() always involves calling
> qemu_coroutine_switch(), which is the stack switch.
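
(For reference, the enter path looks roughly like this; paraphrased,
not the literal qemu-coroutine.c source:)

    void qemu_coroutine_enter(Coroutine *co, void *opaque)
    {
        Coroutine *self = qemu_coroutine_self();

        co->caller = self;
        co->entry_arg = opaque;
        /* the stack switch happens here, whether or not the coroutine
         * later yields */
        qemu_coroutine_switch(self, co, COROUTINE_ENTER);
    }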
>
>> Without the yield, the benchmark can't emulate the coroutine usage in
>> the bdrv_aio_readv/writev() path any more; the bypass in the patchset
>> skips two qemu_coroutine_enter() calls and one qemu_coroutine_yield()
>> for each bdrv_aio_readv/writev().
>
> It's not completely comparable anyway because you're not going through a
> main loop and callbacks from there for your benchmark.
>
> But fair enough: Keep the yield, but enter the coroutine twice then. You
> get slightly worse results then, but that's more like doubling the very
> small difference between "bypass" and "fixed coro" (1.11s / 946,434,327
> / 2.37), not like the horrible performance of the buggy version.

Yes, I compared that too; it looks like there is no big difference.
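
(i.e. the pattern is essentially the following, a simplified sketch of
the benchmark loop body rather than the exact attached patch:)

    static void coroutine_fn co_bench_entry(void *opaque)
    {
        /* first half of a simulated request */
        qemu_coroutine_yield();
        /* second half runs after re-entering; the coroutine then
         * terminates and can be recycled by the pool */
    }

    /* per iteration: */
    Coroutine *co = qemu_coroutine_create(co_bench_entry);
    qemu_coroutine_enter(co, NULL);   /* runs up to the yield */
    qemu_coroutine_enter(co, NULL);   /* resumes it so it can finish */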

>
> Actually, that's within the error of measurement for time and
> insns/cycle, so running it for a bit longer:
>
>                 | bypass    | coro      | + yield   | buggy coro
> ----------------+-----------+-----------+-----------+--------------
> time            | 21.45s    | 21.68s    | 21.83s    | 97.05s
> L1-dcache-loads | 18,049 M  | 18,387 M  | 18,618 M  | 26,062 M
> insns per cycle | 2.42      | 2.40      | 2.41      | 1.75
>
>> >> > I played a bit with the following, I hope it's not too naive. I couldn't
>> >> > see a difference with your patches, but at least one reason for this is
>> >> > probably that my laptop SSD isn't fast enough to make the CPU the
>> >> > bottleneck. Haven't tried ramdisk yet, that would probably be the next
>> >> > thing. (I actually wrote the patch up just for some profiling on my own,
>> >> > not for comparing throughput, but it should be usable for that as well.)
>> >>
>> >> This might not be good for the test since it is basically a sequential
>> >> read test, which can be optimized a lot by the kernel. And I always use
>> >> a randread benchmark.
>> >
>> > Yes, I shortly pondered whether I should implement random offsets
>> > instead. But then I realised that a quicker kernel operation would only
>> > help the benchmark because we want it to test the CPU consumption in
>> > userspace. So the faster the kernel gets, the better for us, because it
>> > should make the impact of coroutines bigger.
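
(If we later want random offsets in qemu-img bench, the change is small.
A sketch, with hypothetical variable names and assuming the device size
is a multiple of the block size:)

    #include <stdint.h>
    #include <stdlib.h>

    /* pick a random block-aligned offset inside the device */
    static int64_t rand_offset(int64_t dev_size, int block_size)
    {
        return (int64_t)(rand() % (dev_size / block_size)) * block_size;
    }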
>>
>> OK, I will compare coroutine vs. bypass-co with the benchmark.

I use the /dev/nullb0 block device for the test, which is available in
Linux kernel 3.13+. The difference, shown below, does not look very big (< 10%).
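(The device comes from the null_blk driver, so on a recent kernel it
should show up after something like 'modprobe null_blk'.)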

I added two parameters to your img-bench patch:

      -c CNT   # count, passed to 'data.n'
      -b       # enable the coroutine bypass introduced in this patchset
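
Roughly, the additions to the option parsing (presumably in the
img_bench() handler) look like the fragment below; this is only a
sketch assuming a standard getopt loop, and the field names are
hypothetical rather than copied from the patch:

        case 'c':
            data.n = atoi(optarg);     /* number of requests to submit */
            break;
        case 'b':
            data.bypass = true;        /* route requests through the bypass path */
            break;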

Another difference is that dataplane uses its own thread, while this
benchmark runs in the main loop.

ming@:~/git/qemu$ sudo ~/bin/perf stat -e L1-dcache-loads,L1-dcache-load-misses,cpu-cycles,instructions,branch-instructions,branch-misses,branch-loads,branch-load-misses,dTLB-loads,dTLB-load-misses ./qemu-img bench -f raw -t off -n -c 10000000 -b /dev/nullb0
read time: 58024ms

 Performance counter stats for './qemu-img bench -f raw -t off -n -c 10000000 -b /dev/nullb0':

     34,874,462,357      L1-dcache-loads                                              [40.00%]
        714,018,039      L1-dcache-load-misses     #    2.05% of all L1-dcache hits   [40.00%]
    133,897,794,677      cpu-cycles                                                   [40.05%]
    116,714,230,004      instructions              #    0.87  insns per cycle         [50.02%]
     22,689,223,546      branch-instructions                                          [50.01%]
        391,673,952      branch-misses             #    1.73% of all branches         [50.00%]
     22,726,856,215      branch-loads                                                 [50.01%]
     18,570,766,783      branch-load-misses                                           [49.98%]
     34,944,839,907      dTLB-loads                                                   [39.99%]
         24,405,944      dTLB-load-misses          #    0.07% of all dTLB cache hits  [39.99%]

      58.040785989 seconds time elapsed


ming@:~/git/qemu$ sudo ~/bin/perf stat -e L1-dcache-loads,L1-dcache-load-misses,cpu-cycles,instructions,branch-instructions,branch-misses,branch-loads,branch-load-misses,dTLB-loads,dTLB-load-misses ./qemu-img bench -f raw -t off -n -c 10000000 /dev/nullb0
read time: 63369ms

 Performance counter stats for './qemu-img bench -f raw -t off -n -c 10000000 /dev/nullb0':

     35,751,490,462      L1-dcache-loads                                              [39.97%]
      1,111,352,581      L1-dcache-load-misses     #    3.11% of all L1-dcache hits   [40.01%]
    143,731,446,722      cpu-cycles                                                   [40.01%]
    118,754,926,871      instructions              #    0.83  insns per cycle         [50.04%]
     22,870,542,314      branch-instructions                                          [50.07%]
        524,893,216      branch-misses             #    2.30% of all branches         [50.05%]
     22,903,688,861      branch-loads                                                 [50.00%]
     20,179,726,291      branch-load-misses                                           [49.99%]
     35,829,927,679      dTLB-loads                                                   [39.96%]
         42,964,365      dTLB-load-misses          #    0.12% of all dTLB cache hits  [39.97%]

      63.392832844 seconds time elapsed


Thanks,


