From: Ming Lei
Subject: Re: [Qemu-devel] [PATCH v1 00/17] dataplane: optimization and multi virtqueue support
Date: Wed, 13 Aug 2014 21:49:23 +0800

On Wed, Aug 13, 2014 at 9:16 PM, Paolo Bonzini <address@hidden> wrote:
> On 13/08/2014 11:54, Kevin Wolf wrote:
>> On 12.08.2014 at 21:08, Paolo Bonzini wrote:
>>> On 12/08/2014 10:12, Ming Lei wrote:
>>>>>> The below patch is basically the minimal change to bypass coroutines.
>>>>>> Of course the block.c part is not acceptable as is (the change to
>>>>>> refresh_total_sectors is broken, the others are just ugly), but it is
>>>>>> a start.  Please run it with your fio workloads, or write an aio-based
>>>>>> version of a qemu-img/qemu-io *I/O* benchmark.
>>>> Could you explain why the new change is introduced?
>>>
>>> It provides a fast path for bdrv_aio_readv/writev whenever there is
>>> nothing to do after the driver routine returns.  In this case there is
>>> no need to wrap the AIOCB returned by the driver routine.
>>>
>>> It doesn't go all the way, and in particular it doesn't reverse
>>> completely the roles of bdrv_co_readv/writev vs. bdrv_aio_readv/writev.
>>
>> That's actually why I think it's an option. Remember that, like you say
>> below, we're optimising for an extreme case here, and I certainly don't
>> want to hurt the common case for it. I can't imagine a way of reversing
>> the roles without multiplying the cost for the coroutine path.
>
> I'm not that worried about it.  Perhaps it's enough to add an
> !qemu_in_coroutine() to the AIO fast path, and let the driver provide
> optimized coroutine paths like in your patches that allocate AIOCBs on
> the stack.
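
(For illustration only: a minimal sketch of the kind of fast path being
discussed, with the !qemu_in_coroutine() guard Paolo suggests.  The helper
name and the exact checks are hypothetical, not taken from either patch
series; inside block.c the needed headers are already included.)

    /* Hypothetical fast path: call the driver's own AIO hook directly and
     * hand back its AIOCB unwrapped, but only when we are not already in
     * coroutine context and no post-processing is needed afterwards. */
    static BlockDriverAIOCB *bdrv_aio_rw_fast(BlockDriverState *bs,
                                              int64_t sector_num,
                                              QEMUIOVector *qiov,
                                              int nb_sectors,
                                              BlockDriverCompletionFunc *cb,
                                              void *opaque, bool is_write)
    {
        BlockDriver *drv = bs->drv;

        if (qemu_in_coroutine() || !drv ||
            !drv->bdrv_aio_readv || !drv->bdrv_aio_writev) {
            return NULL;    /* caller falls back to the coroutine path */
        }

        if (is_write) {
            return drv->bdrv_aio_writev(bs, sector_num, qiov, nb_sectors,
                                        cb, opaque);
        }
        return drv->bdrv_aio_readv(bs, sector_num, qiov, nb_sectors,
                                   cb, opaque);
    }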

IMO, it will not be an extreme case as SSDs and other high-performance
storage become more common: coroutines start to affect performance once
IOPS exceeds 100K, as computed previously.
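
(For illustration, a rough back-of-the-envelope using the ~240 ns
enter/yield cost Ming reports at the end of this mail: 100,000 IOPS x
240 ns = 24 ms of CPU time per second, i.e. about 2.4% of one core spent
purely on coroutine switching at 100K IOPS, and proportionally more as
IOPS grow or queues multiply.)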

>
>> Or do you have a clever solution how you'd go about it without having an
>> impact on the common case?
>
> I don't really have any ace up my sleeve, but there are some things that
> bother me in the block layer's AIO API and in block.c in general.
>
> One is that block.c can do all the pre-processing it wants before
> issuing AIO, but nothing before calling the callback.  This means that
> my patches break bdrv_drain_all (they cannot call tracked_request_end).
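
(To make the constraint concrete, a simplified paraphrase -- not the actual
block.c code -- of how the coroutine path can bracket the driver call,
which the plain AIO path cannot do without wrapping the driver's AIOCB:)

    /* Coroutine path (simplified): block.c runs code after completion. */
    tracked_request_begin(&req, bs, offset, bytes, is_write);
    ret = drv->bdrv_co_readv(bs, sector_num, nb_sectors, qiov);
    tracked_request_end(&req);          /* bdrv_drain_all depends on this */

    /* AIO path: the driver's AIOCB invokes cb(opaque, ret) directly, so
     * there is no hook for tracked_request_end() unless block.c wraps the
     * AIOCB in one of its own. */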
>
> Another is all the similar structs that we have (RwCo,
> BdrvTrackedRequest, BlockRequest, etc.).
>
> Perhaps it would help if we had a single "real" block request object,
> which is an extension of the BlockDriverAIOCB and includes enough data
> to subsume all these request structs.  That should help commonizing
> stuff between the coroutine and AIO paths, for the common case where a
> single yield is enough.  I think the single-yield case is the one that
> is really worth optimizing for.  If done properly, I think this can
> simplify a lot of block.c code, but it is really difficult to get it
> right, and unless the design is sound the code is going to come up
> really ugly. :(
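
(A hypothetical sketch of what such a unified request object could look
like, folding together fields that currently live in RwCo,
BdrvTrackedRequest and BlockRequest; the struct name and layout below are
invented for illustration.)

    /* Hypothetical: one request object, usable from both the AIO and the
     * coroutine path, extending the public AIOCB. */
    typedef struct BdrvRequest {
        BlockDriverAIOCB common;        /* bs, cb, opaque -- as today */

        /* roughly what RwCo carries for the coroutine path */
        int64_t offset;
        unsigned int bytes;
        QEMUIOVector *qiov;
        bool is_write;
        int ret;

        /* roughly what BdrvTrackedRequest adds for bdrv_drain_all and
         * request serialisation */
        QLIST_ENTRY(BdrvRequest) list;
        bool serialising;

        /* single-yield bookkeeping: NULL when completing callback-style */
        Coroutine *co;
    } BdrvRequest;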
>
> Another thing to evaluate is the performance gap (if there is any)
> between aio=threads and aio=native.  The only advantage of aio=native,
> AFAIU, is the batch submission of requests (plug/unplug).  But
> aio=threads often ends up having better performance because the kernel
> folks have optimized VFS a lot.  So, in the aio=threads case, we might

From my tests, aio=native is much better than aio=threads, at least for
reads.

For aio=threads, it may be possible to use batch submission too, merging
adjacent requests into readv/writev calls to reduce the number of syscalls.
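
(A rough sketch of that idea, outside QEMU's actual thread-pool code: if
the worker thread is handed several requests that are contiguous in the
file, it can merge their iovecs and issue one preadv(2) instead of one
syscall per request.  The function name and error handling are
illustrative only.)

    #define _GNU_SOURCE
    #include <sys/uio.h>

    /* Illustrative: one syscall for a batch of contiguous reads; a real
     * implementation would also have to retry short reads and EINTR. */
    static ssize_t read_batch(int fd, struct iovec *iov, int iovcnt,
                              off_t start_offset)
    {
        return preadv(fd, iov, iovcnt, start_offset);
    }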

> as well move the format code out of the iothread and into the worker
> thread, and get rid of the coroutine cost simply by making everything
> synchronous.  Looking within QEMU, this worked out very well for migration.
>
> (We could do batch submission of requests to the thread pool if there
> were a variant of sem_post that can add multiple signals to the same
> semaphore, similar to ReleaseSemaphore on Windows).
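
(As far as I know there is no such primitive in QEMU today; below is a
sketch of what a counted post could look like on top of pthreads,
mirroring ReleaseSemaphore's count argument.  The type and function names
are made up.)

    #include <pthread.h>

    typedef struct {
        pthread_mutex_t lock;
        pthread_cond_t cond;
        unsigned int count;
    } CountedSem;

    /* "post n": one lock round-trip and one broadcast instead of n
     * separate sem_post() wake-ups. */
    static void counted_sem_post_n(CountedSem *s, unsigned int n)
    {
        pthread_mutex_lock(&s->lock);
        s->count += n;
        pthread_cond_broadcast(&s->cond);
        pthread_mutex_unlock(&s->lock);
    }

    static void counted_sem_wait(CountedSem *s)
    {
        pthread_mutex_lock(&s->lock);
        while (s->count == 0) {
            pthread_cond_wait(&s->cond, &s->lock);
        }
        s->count--;
        pthread_mutex_unlock(&s->lock);
    }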
>
>>> The problem is that your patches do touch too much code and subtly
>>> break too much stuff.  The one I wrote does have a little breakage
>>> because I don't understand bs->growable 100% and I didn't really put
>>> much effort into it (my deadline being basically "be done as soon as the
>>> shower is free"), and it is ugly as hell, _but_ it should be compatible
>>> with the way the block layer works.
>>
>> Yes, your patch is definitely much more palatable than Ming's. The part
>> that I still don't like about it is that it would be stating "in the
>> common case, we're only doing the second best thing". I'm not yet
>> convinced that coroutines perform necessarily worse than state-passing
>> callbacks.
>
> Coroutines lump all the allocation costs together at the time you
> allocate the stack, but have (much) more expensive context switching.

Yes, I agree.

In my tests, one pair of malloc(128)/free(128) takes only 57 ns, but two
coroutine enters and one yield take 240 ns, not to mention the dcache
reloads and misses caused by switching stacks, and the allocation fallback.
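
(For reference, the kind of micro-benchmark that produces numbers like
these -- a sketch against the 2014 coroutine API, where
qemu_coroutine_enter() still took an opaque argument; this is not Ming's
actual test program.  Wrap each loop with clock_gettime() and divide by
the iteration count to get per-operation costs.)

    #include <stdlib.h>
    #include "qemu-common.h"
    #include "block/coroutine.h"

    static void coroutine_fn yield_once(void *opaque)
    {
        qemu_coroutine_yield();
    }

    static void run_once(void)
    {
        Coroutine *co;
        void *p;

        /* the malloc/free pair to compare against (~57 ns in Ming's test) */
        p = malloc(128);
        *(volatile char *)p = 0;          /* keep the pair from being elided */
        free(p);

        /* two enters and one yield (~240 ns in Ming's test); the coroutine
         * usually comes from the pool, so stack allocation is amortised */
        co = qemu_coroutine_create(yield_once);
        qemu_coroutine_enter(co, NULL);   /* runs until the yield */
        qemu_coroutine_enter(co, NULL);   /* resumes, then terminates */
    }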


Ming


