


From: Ming Lei
Subject: Re: [Qemu-devel] [PATCH v1 00/17] dataplane: optimization and multi virtqueue support
Date: Wed, 6 Aug 2014 13:33:36 +0800

Hi Kevin,

On Tue, Aug 5, 2014 at 10:47 PM, Kevin Wolf <address@hidden> wrote:
> On 05.08.2014 at 15:48, Stefan Hajnoczi wrote:
>> On Tue, Aug 05, 2014 at 06:00:22PM +0800, Ming Lei wrote:
>> > On Tue, Aug 5, 2014 at 5:48 PM, Kevin Wolf <address@hidden> wrote:
>> > > On 05.08.2014 at 05:33, Ming Lei wrote:
>> > >> Hi,
>> > >>
>> > >> These patches bring up below 4 changes:
>> > >>         - introduce object allocation pool and apply it to
>> > >>         virtio-blk dataplane for improving its performance
>> > >>
>> > >>         - introduce selective coroutine bypass mechanism
>> > >>         for improving performance of virtio-blk dataplane with
>> > >>         raw format image
>> > >
>> > > Before applying any bypassing patches, I think we should understand in
>> > > detail where we are losing performance with coroutines enabled.
>> >
>> > From the below profiling data, CPU becomes slow to run instructions
>> > with coroutine, and CPU dcache miss is increased so it is very
>> > likely caused by switching stack frequently.
>> >
>> > http://marc.info/?l=qemu-devel&m=140679721126306&w=2
>> >
>> > http://pastebin.com/ae0vnQ6V
>>
>> I have been wondering how to prove that the root cause is the ucontext
>> coroutine mechanism (stack switching).  Here is an idea:
>>
>> Hack your "bypass" code path to run the request inside a coroutine.
>> That way you can compare "bypass without coroutine" against "bypass with
>> coroutine".
>>
>> Right now I think there are doubts because the bypass code path is
>> indeed a different (and not 100% correct) code path.  So this approach
>> might prove that the coroutines are adding the overhead and not
>> something that you bypassed.
>
> My doubts aren't only that the overhead might not come from the
> coroutines, but also whether any coroutine-related overhead is really
> unavoidable. If we can optimise coroutines, I'd strongly prefer to do
> just that instead of introducing additional code paths.

OK, thank you for taking a look at the problem; I hope we can
figure out the root cause. :-)

>
> Another thought I had was this: If the performance difference is indeed
> only coroutines, then that is completely inside the block layer and we
> don't actually need a VM to test it. We could instead have something
> like a simple qemu-img based benchmark and should be observing the same.

It is even simpler to run a coroutine-only benchmark, so I wrote
a rough one, and it looks like coroutines do decrease performance
a lot; please see the attached patch. Thanks for your template,
which helped me add the 'co_bench' command to qemu-img.

From the profiling data at the link below:

    http://pastebin.com/YwH2uwbq

With coroutines, the running time for the same workload increases by
~50% (1.325s vs. 0.903s), dcache load events increase by ~35%
(693M vs. 512M), and instructions per cycle decrease by ~17%
(1.35 vs. 1.63), compared with bypassing coroutines (the -b parameter).

The bypass code in the benchmark is very similar to the approach
used in the bypass patch, since linux-aio with O_DIRECT seldom
blocks in the kernel I/O path.

Maybe the benchmark is a bit extreme, but given that modern storage
devices may reach millions of IOPS, it is very easy for coroutines
to slow down the I/O path.

> I played a bit with the following, I hope it's not too naive. I couldn't
> see a difference with your patches, but at least one reason for this is
> probably that my laptop SSD isn't fast enough to make the CPU the
> bottleneck. Haven't tried ramdisk yet, that would probably be the next
> thing. (I actually wrote the patch up just for some profiling on my own,
> not for comparing throughput, but it should be usable for that as well.)

This might not be a good test, since it is basically a sequential
read test, which the kernel can optimize a lot. I always use a
randread benchmark.


Thanks,

Attachment: co_bench.patch
Description: Text Data

