qemu-devel

Re: [Qemu-devel] [PATCH v1 00/17] dataplane: optimization and multi virt


From: Kevin Wolf
Subject: Re: [Qemu-devel] [PATCH v1 00/17] dataplane: optimization and multi virtqueue support
Date: Mon, 11 Aug 2014 16:03:56 +0200
User-agent: Mutt/1.5.21 (2010-09-15)

On 10.08.2014 at 05:46, Ming Lei wrote:
> Hi Kevin, Paolo, Stefan and all,
> 
> 
> On Wed, 6 Aug 2014 10:48:55 +0200
> Kevin Wolf <address@hidden> wrote:
> 
> > On 06.08.2014 at 07:33, Ming Lei wrote:
> 
> > 
> > Anyhow, the coroutine version of your benchmark is buggy, it leaks all
> > coroutines instead of exiting them, so it can't make any use of the
> > coroutine pool. On my laptop, I get this (where fixed coroutine is a
> > version that simply removes the yield at the end):
> > 
> >                 | bypass        | fixed coro    | buggy coro
> > ----------------+---------------+---------------+--------------
> > time            | 1.09s         | 1.10s         | 1.62s
> > L1-dcache-loads | 921,836,360   | 932,781,747   | 1,298,067,438
> > insns per cycle | 2.39          | 2.39          | 1.90
> > 
> > Begs the question whether you see a similar effect on a real qemu and
> > the coroutine pool is still not big enough? With correct use of
> > coroutines, the difference seems to be barely measurable even without
> > any I/O involved.
> 
> I have now fixed the coroutine leak bug. The previous crypt benchmark was
> a fairly heavy load and resulted in a very low operation rate (~40K/sec),
> so I wrote a new, simpler one that can generate hundreds of thousands of
> operations per second. That number should match some fast storage devices,
> and it does show that the effect of coroutines is not small.
> 
> In the extreme case, where only a getppid() syscall is run in each
> iteration, only 3M operations/sec are reached with coroutines, whereas
> without coroutines the number reaches 16M/sec. That is more than a
> fourfold difference!

I see that you're measuring a lot of things, but the one thing that is
unclear to me is what question those benchmarks are supposed to answer.

Basically I see two different, useful types of benchmark:

1. Look at coroutines in isolation and try to get a directly coroutine-
   related function (like create/destroy or yield/reenter) faster. This
   is what tests/test-coroutine does.

   This is quite good at telling you what costs the coroutine functions
   have and where you need to optimise - without taking the practical
   benefits into account, so it's not suitable for comparison.

2. Look at the whole thing in its realistic environment. This should
   probably involve at least some asynchronous I/O, but ideally use the
   whole block layer. qemu-img bench tries to do this. For being even
   closer to the real environment you'd have to use the virtio-blk code
   as well, which you currently only get with a full VM (perhaps qtest
   could do something interesting here in theory).

   This is good for telling how big the costs are in relation to the
   total workload (and code saved elsewhere) in practice. This is the
   set of tests that can meaningfully be compared to a callback-based
   solution.

Running arbitrary workloads like getppid() or open/read/close isn't as
useful as these. It doesn't isolate the coroutines as well as tests that
run literally nothing else than coroutine functions, and it is too
removed from the actual use case to get the relation between additional
costs, saving and total workload figured out for the real case.
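
To make the first category concrete, here is a minimal sketch of a benchmark
that runs nothing but coroutine operations. It is an illustration only: Python
generators stand in for QEMU coroutines, and the names `worker`,
`bench_create_destroy` and `bench_yield_reenter` are hypothetical. It is not
the code in tests/test-coroutine and its numbers say nothing about QEMU.

```python
import time


def worker():
    # Coroutine body that only yields once, so the measurement covers
    # create/enter/yield/exit and no other work.
    yield


def bench_create_destroy(n):
    """Average seconds per create + enter + run-to-completion cycle."""
    start = time.perf_counter()
    for _ in range(n):
        co = worker()      # create
        next(co)           # enter, runs up to the yield
        next(co, None)     # reenter, runs to the end (coroutine exits)
    return (time.perf_counter() - start) / n


def bench_yield_reenter(n):
    """Average seconds per yield + reenter pair on one long-lived coroutine."""
    def looper():
        while True:
            yield

    co = looper()
    next(co)               # enter once, outside the timed region
    start = time.perf_counter()
    for _ in range(n):
        next(co)           # reenter, loop body yields again
    elapsed = time.perf_counter() - start
    co.close()
    return elapsed / n


if __name__ == "__main__":
    n = 200_000
    print("create/destroy: %.1f ns" % (bench_create_destroy(n) * 1e9))
    print("yield/reenter:  %.1f ns" % (bench_yield_reenter(n) * 1e9))
```

A benchmark of this shape shows where the coroutine primitives themselves
spend time, but because it contains no I/O and no callback alternative, it
cannot be used to compare the two approaches.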

> From another file read benchmark, which is the default one:
> 
>       just doing open(file), read(fd, buf on stack, 512), sum and close()
>       in each iteration
> 
> Without coroutines, operations per second increase by ~20% compared with
> using coroutines. When reading 1024 bytes each time, the number still
> increases by ~10%. The rate is between 200K and 400K operations per
> second, which should match the IOPS in the dataplane test. The tests were
> run on my Lenovo T410 notebook (CPU: 2.6GHz, dual core, four threads).
> 
> When reading 8192 or more bytes each time, no clear difference between
> using coroutines and not using them can be observed.

All it tells you is that the variation of the workload can make the
coroutine cost disappear in the noise. It doesn't tell you much about
the real use case.

And you're comparing apples and oranges anyway: The real question in
qemu is whether you use coroutines or pass around heap-allocated state
between callbacks. Your benchmark doesn't have a single callback because
it hasn't got any asynchronous operations and doesn't need to allocate
and pass any state.

It does, however, have an unnecessary yield() for the coroutine case
because you felt that the real case is more complex and does yield
(which is true, but it's more complex for both coroutines and
callbacks).
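
The difference between the two styles can be sketched as follows. This is a
toy model under stated assumptions, not QEMU code: `Loop`, `aio_read`,
`SumState` and the other names are hypothetical, and a Python generator stands
in for a coroutine. The same multi-step asynchronous operation (read several
chunks, sum them) is written once with callbacks and explicitly allocated
state, and once as a coroutine whose state lives in local variables.

```python
from collections import deque


class Loop:
    """Toy event loop: requests complete on a later iteration, like AIO."""

    def __init__(self):
        self.pending = deque()

    def aio_read(self, data, cb, opaque):
        # Hypothetical async read: queue the completion instead of calling
        # the callback directly.
        self.pending.append((cb, opaque, data))

    def run(self):
        while self.pending:
            cb, opaque, data = self.pending.popleft()
            cb(opaque, data)


# Callback style: everything that must survive between completions is
# stored in a separately allocated state object and passed around.
class SumState:
    def __init__(self, loop, chunks, done):
        self.loop = loop
        self.chunks = chunks
        self.index = 0
        self.total = 0
        self.done = done


def sum_cb(state, data):
    state.total += sum(data)
    state.index += 1
    if state.index < len(state.chunks):
        state.loop.aio_read(state.chunks[state.index], sum_cb, state)
    else:
        state.done(state.total)


def sum_with_callbacks(loop, chunks, done):
    state = SumState(loop, chunks, done)
    loop.aio_read(chunks[0], sum_cb, state)


# Coroutine style: the same logic, with the state kept in locals.
def sum_co(chunks):
    total = 0
    for chunk in chunks:
        data = yield chunk          # yield until the read has completed
        total += sum(data)
    return total


def sum_with_coroutine(loop, chunks, done):
    co = sum_co(chunks)

    def reenter(_opaque, data):
        try:
            next_chunk = co.send(data)   # reenter the coroutine
        except StopIteration as stop:
            done(stop.value)             # coroutine has exited
        else:
            loop.aio_read(next_chunk, reenter, None)

    loop.aio_read(next(co), reenter, None)


if __name__ == "__main__":
    loop = Loop()
    results = []
    chunks = [[1, 2, 3], [4, 5], [6]]
    sum_with_callbacks(loop, chunks, results.append)
    sum_with_coroutine(loop, chunks, results.append)
    loop.run()
    print(results)
```

Both variants need at least one chunk. A benchmark that is meant to compare
the two approaches would have to exercise both halves of such a pair, so that
the cost of allocating and passing state is weighed against the cost of
yielding and reentering.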

> Of course, the test result depends on how fast the machine is, but even
> on a fast machine, I guess a similar result can still be observed by
> decreasing the number of bytes read each time.

Yes, results looked similar on my laptop. (They just don't tell me
much.)


Let's have a look at some fio results from my laptop:

aggrb KB/s  | master    | coroutine | bypass
------------+-----------+-----------+------------
run 1       | 419934    | 449518    | 445823
run 2       | 444358    | 456365    | 448332
run 3       | 444076    | 455209    | 441552


And here from my lab test box:

aggrb KB/s  | master    | coroutine | bypass
------------+-----------+-----------+------------
run 1       | 25330     | 56378     | 53541
run 2       | 26041     | 55709     | 54136
run 3       | 25811     | 56829     | 49080

The improvement of the bypass patches is barely measurable on my laptop
(if it even exists), whereas it seems to be a pretty big thing for my
lab test box. In any case, the optimised coroutine code seems to beat
the bypass on both machines. (That is for random reads anyway. For
sequential, I get a much larger variation, and on my lab test box bypass
is ahead, whereas on my laptop both are roughly on the same level.)
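
For readers who want to check the comparison, the per-column averages and the
change relative to master can be computed from the two random-read tables
above. The following is a small sketch; the variable and function names are
hypothetical, the figures are copied from the tables.

```python
from statistics import mean

# aggrb in KB/s, three runs each, taken from the tables above.
LAPTOP_KBPS = {
    "master":    [419934, 444358, 444076],
    "coroutine": [449518, 456365, 455209],
    "bypass":    [445823, 448332, 441552],
}
LAB_KBPS = {
    "master":    [25330, 26041, 25811],
    "coroutine": [56378, 55709, 56829],
    "bypass":    [53541, 54136, 49080],
}


def summarize(results, baseline="master"):
    """Return {name: (mean KB/s, percent change vs. baseline mean)}."""
    means = {name: mean(runs) for name, runs in results.items()}
    base = means[baseline]
    return {
        name: (m, 100.0 * (m - base) / base)
        for name, m in means.items()
    }


if __name__ == "__main__":
    for label, table in (("laptop", LAPTOP_KBPS), ("lab box", LAB_KBPS)):
        print(label)
        for name, (avg, pct) in summarize(table).items():
            print("  %-10s %10.0f KB/s  %+6.1f%%" % (name, avg, pct))
```

With only three runs per configuration, the averages are indicative at best,
and the run-to-run spread on the laptop is of the same order as the
differences between the columns.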


Another thing I tried is creating the coroutine already in virtio-blk to
avoid the overhead of the bdrv_aio_* emulation. I don't quite understand
the result of my benchmarks there, maybe you have an idea: For random
reads, I see a significant improvement, for sequential however a clear
degradation.

aggrb MB/s  | bypass    | coroutine | virtio-blk-created coroutine
------------+-----------+-----------+------------------------------
seq. read   | 738       | 738       | 694
random read | 442       | 459       | 475

I would appreciate any ideas about what's going on with sequential reads
here and how it can be fixed. Anyway, on my machines, coroutines don't
look like a lost cause at all.

Kevin


