From: Ming Lei
Subject: Re: [Qemu-devel] [PATCH v1 00/17] dataplane: optimization and multi virtqueue support
Date: Wed, 13 Aug 2014 21:49:23 +0800
On Wed, Aug 13, 2014 at 9:16 PM, Paolo Bonzini <address@hidden> wrote:
> On 13/08/2014 11:54, Kevin Wolf wrote:
>> On 12.08.2014 at 21:08, Paolo Bonzini wrote:
>>> On 12/08/2014 10:12, Ming Lei wrote:
>>>>>> The below patch is basically the minimal change to bypass
>>>>>> coroutines. Of course the block.c part is not acceptable as is
>>>>>> (the change to refresh_total_sectors is broken, the others are
>>>>>> just ugly), but it is a start. Please run it with your fio
>>>>>> workloads, or write an aio-based version of a qemu-img/qemu-io
>>>>>> *I/O* benchmark.
>>>> Could you explain why the new change was introduced?
>>>
>>> It provides a fast path for bdrv_aio_readv/writev whenever there is
>>> nothing to do after the driver routine returns. In this case there is
>>> no need to wrap the AIOCB returned by the driver routine.
>>>
>>> It doesn't go all the way, and in particular it doesn't reverse
>>> completely the roles of bdrv_co_readv/writev vs. bdrv_aio_readv/writev.
>>
>> That's actually why I think it's an option. Remember that, like you say
>> below, we're optimising for an extreme case here, and I certainly don't
>> want to hurt the common case for it. I can't imagine a way of reversing
>> the roles without multiplying the cost for the coroutine path.
>
> I'm not that worried about it. Perhaps it's enough to add a
> !qemu_in_coroutine() check to the AIO fast path, and let the driver
> provide optimized coroutine paths like in your patches that allocate
> AIOCBs on the stack.
IMO it will not be an extreme case: as SSDs and other high-performance
storage become more popular, coroutines start to affect performance once
IOPS goes above 100K, as computed earlier in this thread.
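
A rough sketch of the kind of fast path being discussed (my sketch, not
the actual patch; the types, the BlockDriver callback and
qemu_in_coroutine() are real QEMU interfaces, but the condition is
simplified and the slow-path function name is hypothetical):

    /* Sketch only: bypass the coroutine machinery when we are not
     * already inside a coroutine and the driver implements AIO
     * directly.  Real code must also handle request tracking,
     * alignment, throttling, etc. */
    BlockDriverAIOCB *bdrv_aio_readv(BlockDriverState *bs,
                                     int64_t sector_num,
                                     QEMUIOVector *qiov, int nb_sectors,
                                     BlockDriverCompletionFunc *cb,
                                     void *opaque)
    {
        BlockDriver *drv = bs->drv;

        if (!qemu_in_coroutine() && drv->bdrv_aio_readv) {
            /* Fast path: hand the request straight to the driver and
             * return its AIOCB unwrapped -- no coroutine allocation
             * or context switch. */
            return drv->bdrv_aio_readv(bs, sector_num, qiov, nb_sectors,
                                       cb, opaque);
        }

        /* Slow path: the existing coroutine-based implementation
         * (hypothetical name, details elided). */
        return bdrv_aio_readv_coroutine(bs, sector_num, qiov, nb_sectors,
                                        cb, opaque);
    }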
>
>> Or do you have a clever solution for how you'd go about it without
>> having an impact on the common case?
>
> I don't really have any ace up my sleeve, but there are some things that
> bother me in the block layer's AIO API and in block.c in general.
>
> One is that block.c can do all the pre-processing it wants before
> issuing AIO, but nothing before calling the callback. This means that
> my patches break bdrv_drain_all (they cannot call tracked_request_end).
>
> Another is all the similar structs that we have (RwCo,
> BdrvTrackedRequest, BlockRequest, etc.).
>
> Perhaps it would help if we had a single "real" block request object,
> which is an extension of the BlockDriverAIOCB and includes enough data
> to subsume all these request structs. That should help commonize
> stuff between the coroutine and AIO paths, for the common case where a
> single yield is enough. I think the single-yield case is the one that
> is really worth optimizing for. If done properly, I think this can
> simplify a lot of block.c code, but it is really difficult to get it
> right, and unless the design is sound the code is going to come out
> really ugly. :(
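
Something along these lines, purely as a sketch of that idea (the field
set is illustrative; QLIST_ENTRY, CoQueue and Coroutine are real QEMU
types):

    /* Illustrative only: one request object extending the common
     * AIOCB with enough state for both the AIO and coroutine paths,
     * so RwCo, BdrvTrackedRequest and BlockRequest could fold into
     * it. */
    typedef struct BdrvRequest {
        BlockDriverAIOCB common;   /* cb/opaque and the owning bs */

        /* What the request does (today spread across RwCo and
         * BlockRequest) */
        int64_t sector_num;
        int nb_sectors;
        QEMUIOVector *qiov;
        bool is_write;
        int ret;

        /* Tracking state (today in BdrvTrackedRequest), so completion
         * can still run tracked_request_end() on the AIO fast path */
        QLIST_ENTRY(BdrvRequest) list;
        CoQueue wait_queue;

        /* Set when the issuer is a coroutine that needs a wakeup
         * after the single yield */
        Coroutine *co;
    } BdrvRequest;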
>
> Another thing to evaluate is the performance gap (if there is any)
> between aio=threads and aio=native. The only advantage of aio=native,
> AFAIU, is the batch submission of requests (plug/unplug). But
> aio=threads often ends up having better performance because the kernel
> folks have optimized VFS a lot. So, in the aio=threads case, we might
From my tests, aio=native is much better than aio=threads, at least for
reads.
For aio=threads, it may be possible to use batch submission too, with
readv/writev to reduce syscalls (see the toy preadv() sketch below).
> as well move the format code out of the iothread and into the worker
> thread, and get rid of the coroutine cost simply by making everything
> synchronous. Looking within QEMU, this worked out very well for migration.
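
On the readv/writev point above, a toy illustration (not QEMU code) of
how a worker thread could merge two adjacent reads into a single
preadv() call, paying one syscall instead of two:

    #define _GNU_SOURCE
    #include <sys/uio.h>
    #include <unistd.h>

    ssize_t read_two_adjacent(int fd, off_t offset,
                              void *buf1, size_t len1,
                              void *buf2, size_t len2)
    {
        struct iovec iov[2] = {
            { .iov_base = buf1, .iov_len = len1 },
            { .iov_base = buf2, .iov_len = len2 },
        };
        /* One preadv() fills both buffers from the contiguous range
         * starting at 'offset'. */
        return preadv(fd, iov, 2, offset);
    }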
>
> (We could do batch submission of requests to the thread pool if there
> were a variant of sem_post that can add multiple signals to the same
> semaphore, similar to ReleaseSemaphore on Windows).
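
A toy sketch of such a "post n" counting semaphore built from POSIX
primitives (illustrative only; initialize the fields with
PTHREAD_MUTEX_INITIALIZER, PTHREAD_COND_INITIALIZER and 0):

    #include <pthread.h>

    typedef struct {
        pthread_mutex_t lock;
        pthread_cond_t cond;
        unsigned count;
    } NSem;

    /* Add n tokens with a single broadcast, instead of n sem_post()
     * calls and n separate wakeups -- roughly what ReleaseSemaphore
     * offers on Windows. */
    void nsem_post_n(NSem *s, unsigned n)
    {
        pthread_mutex_lock(&s->lock);
        s->count += n;
        pthread_mutex_unlock(&s->lock);
        pthread_cond_broadcast(&s->cond);
    }

    void nsem_wait(NSem *s)
    {
        pthread_mutex_lock(&s->lock);
        while (s->count == 0) {
            pthread_cond_wait(&s->cond, &s->lock);
        }
        s->count--;
        pthread_mutex_unlock(&s->lock);
    }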
>
>>> The problem is that your patches do touch too much code and subtly
>>> break too much stuff. The one I wrote does have a little breakage
>>> because I don't understand bs->growable 100% and I didn't really put
>>> much effort into it (my deadline being basically "be done as soon as the
>>> shower is free"), and it is ugly as hell, _but_ it should be compatible
>>> with the way the block layer works.
>>
>> Yes, your patch is definitely much more palatable than Ming's. The part
>> that I still don't like about it is that it would be stating "in the
>> common case, we're only doing the second best thing". I'm not yet
>> convinced that coroutines perform necessarily worse than state-passing
>> callbacks.
>
> Coroutines lump all the allocation costs together at the time you
> allocate the stack, but have (much) more expensive context switching.
Yes, I agree.
In my tests, one malloc(128)/free(128) pair takes only 57ns, but two
coroutine enters plus one yield take 240ns, not to mention the dcache
reloads and misses caused by switching stacks, and the allocation
fallback.
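
For reference, a minimal harness of the kind that produces such numbers
(my sketch, not the benchmark actually used):

    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    int main(void)
    {
        enum { N = 10000000 };
        struct timespec t0, t1;

        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (long i = 0; i < N; i++) {
            void *p = malloc(128);
            /* Touch the buffer so the allocator work is not
             * optimized away. */
            *(volatile char *)p = 0;
            free(p);
        }
        clock_gettime(CLOCK_MONOTONIC, &t1);

        double ns = (t1.tv_sec - t0.tv_sec) * 1e9
                  + (t1.tv_nsec - t0.tv_nsec);
        printf("malloc(128)/free pair: %.1f ns\n", ns / N);
        return 0;
    }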
Ming