Re: [Qemu-devel] [PATCH 0/7] coroutine: optimizations
From: Ming Lei
Subject: Re: [Qemu-devel] [PATCH 0/7] coroutine: optimizations
Date: Mon, 1 Dec 2014 15:46:03 +0800
On Mon, 01 Dec 2014 08:05:17 +0100
Peter Lieven <address@hidden> wrote:
> On 01.12.2014 06:55, Ming Lei wrote:
> > On Fri, Nov 28, 2014 at 10:12 PM, Paolo Bonzini <address@hidden> wrote:
> >> As discussed in the other thread, this brings speedups from
> >> dropping the coroutine mutex (which serializes multiple iothreads,
> >> too) and using ELF thread-local storage.
> >>
> >> The speedup in perf/cost is about 30% (190->145). Windows port tested
> >> with tests/test-coroutine.exe under Wine.
> > The data is very nice, and in my laptop, 'perf cost' can be decreased
> > from 244ns to 174ns.
> >
> > BTW, the cost by using coroutine to run function isn't only from these
> > helpers(*_yield, *_enter, *_create, and perf-cost just measures
> > this part of cost), but also some implicit/invisible part. I have some
> > test cases which can show the problem. If someone is interested,
> > I can post them in list.
>
> Of course, maybe the problem can be solved or impaired.
OK, please try the patch below:
From 917d5cc0a273f9825b10abd52152c54e08c81ef8 Mon Sep 17 00:00:00 2001
From: Ming Lei <address@hidden>
Date: Mon, 1 Dec 2014 11:11:23 +0800
Subject: [PATCH] test-coroutine: introduce perf-cost-with-load
The perf/cost test case only covers the explicit cost of
using a coroutine.
This patch adds an open/close file test case. It shows that
there is also some implicit, or invisible, cost on top of the
cost measured by /perf/cost.
In my environment, these are the test results after applying this
patch and running perf/cost and perf/cost-with-load:
{*LOG(start):{/perf/cost}:LOG*}
/perf/cost: {*LOG(message):{Run operation 40000000 iterations 7.539413
s, 5305K operations/s, 188ns per coroutine}:LOG*}
OK
{*LOG(stop):(0;0;7.539497):LOG*}
{*LOG(start):{/perf/cost-with-load}:LOG*}
/perf/cost-with-load: {*LOG(message):{Run operation 1000000 iterations
2.648014 s, 377K operations/s, 2648ns per operation without using
coroutine}:LOG*}
{*LOG(message):{Run operation 1000000 iterations 2.919133 s, 342K
operations/s, 2919ns per operation, 271ns(cost introduced by coroutine)
per operation with using coroutine}:LOG*}
OK
{*LOG(stop):(0;0;5.567333):LOG*}
From the data above, /perf/cost reports that running one coroutine
costs 188ns, but in /perf/cost-with-load the cost actually introduced
is 271ns (2919ns - 2648ns). The extra 83ns is implicit and invisible
to /perf/cost.
Similar results can be found in the following test cases too:
- read from /dev/nullb0 opened with O_DIRECT
(a sort of aio read simulation; needs a 3.13+ kernel for
/dev/nullbX support via 'modprobe null_blk'; this case
can show +150ns of extra cost)
- statvfs() syscall: there is ~30ns of extra cost for running
one statvfs() in a coroutine
---
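For anyone checking the numbers, here is a small sketch (not part of the patch) that redoes the arithmetic behind the log in the commit message, using the same truncating casts as perf_cost_with_load(). The helper names ns_per_op, kops_per_sec and implicit_cost_ns are made up for illustration; the durations and iteration counts are copied from the log.

```c
#include <assert.h>
#include <stdio.h>

/* Nanoseconds per iteration, truncated the same way as the cast in the patch. */
static unsigned long ns_per_op(double duration_s, unsigned long iterations)
{
    assert(iterations > 0);
    return (unsigned long)(1000000000.0 * duration_s / iterations);
}

/* Thousands of iterations per second, truncated. */
static unsigned long kops_per_sec(double duration_s, unsigned long iterations)
{
    assert(duration_s > 0.0);
    return (unsigned long)(iterations / (duration_s * 1000));
}

/*
 * Derive the implicit cost from the figures in the quoted log:
 * (with coroutine - without coroutine) - cost reported by /perf/cost.
 */
static unsigned long implicit_cost_ns(void)
{
    unsigned long explicit_ns = ns_per_op(7.539413, 40000000UL); /* /perf/cost */
    unsigned long plain_ns = ns_per_op(2.648014, 1000000UL);     /* no coroutine */
    unsigned long co_ns = ns_per_op(2.919133, 1000000UL);        /* with coroutine */
    unsigned long total_ns = co_ns - plain_ns;

    printf("explicit %luns, total %luns, implicit %luns\n",
           explicit_ns, total_ns, total_ns - explicit_ns);
    return total_ns - explicit_ns;
}
```

Calling implicit_cost_ns() from a main() gives the 188ns, 271ns and 83ns figures discussed in the commit message. Each input is truncated to whole nanoseconds before the subtraction, so the last digit depends on rounding.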
tests/test-coroutine.c | 67 ++++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 67 insertions(+)
diff --git a/tests/test-coroutine.c b/tests/test-coroutine.c
index 27d1b6f..7323a91 100644
--- a/tests/test-coroutine.c
+++ b/tests/test-coroutine.c
@@ -311,6 +311,72 @@ static void perf_baseline(void)
         maxcycles, duration);
 }
 
+static void perf_cost_load_worker(void *opaque)
+{
+    int fd;
+
+    fd = open("/proc/self/exe", O_RDONLY);
+    assert(fd >= 0);
+    close(fd);
+}
+
+static __attribute__((noinline)) void perf_cost_load_func(void *opaque)
+{
+    perf_cost_load_worker(opaque);
+    qemu_coroutine_yield();
+}
+
+static double perf_cost_load(unsigned long maxcycles, bool use_co)
+{
+    unsigned long i = 0;
+    double duration;
+
+    g_test_timer_start();
+    if (use_co) {
+        Coroutine *co;
+        while (i++ < maxcycles) {
+            co = qemu_coroutine_create(perf_cost_load_func);
+            qemu_coroutine_enter(co, &i);
+            qemu_coroutine_enter(co, NULL);
+        }
+    } else {
+        while (i++ < maxcycles) {
+            perf_cost_load_worker(&i);
+        }
+    }
+    duration = g_test_timer_elapsed();
+
+    return duration;
+}
+
+static void perf_cost_with_load(void)
+{
+    const unsigned long maxcycles = 1000000;
+    double duration;
+    unsigned long ops;
+    unsigned long cost_co, cost;
+
+    duration = perf_cost_load(maxcycles, false);
+    ops = (long)(maxcycles / (duration * 1000));
+    cost = (unsigned long)(1000000000.0 * duration / maxcycles);
+    g_test_message("Run operation %lu iterations %f s, %luK operations/s, "
+                   "%luns per operation without using coroutine",
+                   maxcycles,
+                   duration, ops,
+                   cost);
+
+    duration = perf_cost_load(maxcycles, true);
+    ops = (long)(maxcycles / (duration * 1000));
+    cost_co = (unsigned long)(1000000000.0 * duration / maxcycles);
+    g_test_message("Run operation %lu iterations %f s, %luK operations/s, "
+                   "%luns per operation, "
+                   "%luns(cost introduced by coroutine) per operation "
+                   "with using coroutine",
+                   maxcycles,
+                   duration, ops,
+                   cost_co, cost_co - cost);
+}
+
 static __attribute__((noinline)) void perf_cost_func(void *opaque)
 {
     qemu_coroutine_yield();
@@ -355,6 +421,7 @@ int main(int argc, char **argv)
         g_test_add_func("/perf/yield", perf_yield);
         g_test_add_func("/perf/function-call", perf_baseline);
         g_test_add_func("/perf/cost", perf_cost);
+        g_test_add_func("/perf/cost-with-load", perf_cost_with_load);
     }
     return g_test_run();
 }
--
1.7.9.5
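The statvfs() case mentioned in the commit message is not included in the patch. As a rough idea of what the no-coroutine baseline for it might look like, here is a standalone sketch that does not use the QEMU test harness. The function names and the "/" path are assumptions for illustration only, and clock_gettime() stands in for g_test_timer_start()/g_test_timer_elapsed(). The coroutine side would still need qemu_coroutine_create()/qemu_coroutine_enter() as in perf_cost_load().

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdio.h>
#include <sys/statvfs.h>
#include <time.h>

/* Same shape as perf_cost_load_worker(), but the load is one statvfs() call. */
static void statvfs_load_worker(void)
{
    struct statvfs buf;
    int ret = statvfs("/", &buf);   /* "/" is an assumed path */

    assert(ret == 0);
    (void)ret;                      /* silence unused warning under NDEBUG */
}

/* Average wall-clock nanoseconds per statvfs() call, without coroutines. */
static double statvfs_baseline_ns(unsigned long maxcycles)
{
    struct timespec start, end;
    unsigned long i;

    assert(maxcycles > 0);
    clock_gettime(CLOCK_MONOTONIC, &start);
    for (i = 0; i < maxcycles; i++) {
        statvfs_load_worker();
    }
    clock_gettime(CLOCK_MONOTONIC, &end);

    return ((end.tv_sec - start.tv_sec) * 1e9 +
            (end.tv_nsec - start.tv_nsec)) / maxcycles;
}
```

For example, printf("%.0fns per statvfs()\n", statvfs_baseline_ns(1000000)) from a main() prints the baseline. The extra cost is then the difference between this and the same loop run inside a coroutine.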
Thanks,
--
Ming Lei