From: Paolo Bonzini
Subject: Re: [Qemu-devel] Block I/O outside the QEMU global mutex was "Re: [RFC PATCH 00/17] Support for multiple "AIO contexts""
Date: Tue, 09 Oct 2012 13:08:37 +0200
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:15.0) Gecko/20120911 Thunderbird/15.0.1

On 09/10/2012 12:52, Avi Kivity wrote:
> On 10/09/2012 12:36 PM, Paolo Bonzini wrote:
>> On 09/10/2012 11:26, Avi Kivity wrote:
>>> On 10/09/2012 11:08 AM, Stefan Hajnoczi wrote:
>>>> Here are the steps that have been mentioned:
>>>>
>>>> 1. aio fastpath - for raw-posix and other aio block drivers, can we
>>>>    reduce I/O request latency by skipping block layer coroutines?
>>>
>>> Is coroutine overhead noticable?
>>
>> I'm thinking more about throughput than latency.  If the iothread
>> becomes CPU-bound, then everything is noticeable.
> 
> That's not strictly a coroutine issue.  Switching to ordinary threads
> may make the problem worse, since there will clearly be contention.

The point is that you don't need either coroutines or userspace threads if
you use native AIO.  longjmp/setjmp probably has smaller overhead than the
many syscalls involved in poll + eventfd reads + io_submit + io_getevents,
but it's not cheap either.  Also, if you process AIO in batches you risk
overflowing the pool of free coroutines, which gets expensive very quickly
(allocating/freeing the stack, doing the expensive getcontext/swapcontext
instead of the cheaper longjmp/setjmp, etc.).

It seems better to sidestep the issue completely; it's a small amount of
work.
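
To make that syscall chain concrete, here is a minimal standalone sketch of
one native AIO read completed through an eventfd (not QEMU's raw-posix code;
the image name, the 4k single-shot request and error handling being omitted
are all placeholder assumptions; link with -laio):

#define _GNU_SOURCE
#include <libaio.h>
#include <sys/eventfd.h>
#include <poll.h>
#include <fcntl.h>
#include <unistd.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    io_context_t ctx = 0;
    struct iocb cb, *cbs[1] = { &cb };
    struct io_event ev;
    struct pollfd pfd;
    uint64_t efd_count;
    void *buf;

    /* "disk.img" is just a placeholder test image. */
    int fd = open("disk.img", O_RDONLY | O_DIRECT);
    int efd = eventfd(0, EFD_CLOEXEC);

    posix_memalign(&buf, 512, 4096);
    io_setup(128, &ctx);

    io_prep_pread(&cb, fd, buf, 4096, 0);
    io_set_eventfd(&cb, efd);                  /* completion signals the eventfd */

    io_submit(ctx, 1, cbs);                    /* syscall #1: submit the request  */

    pfd.fd = efd;
    pfd.events = POLLIN;
    poll(&pfd, 1, -1);                         /* syscall #2: event loop wakes up */

    read(efd, &efd_count, sizeof(efd_count));  /* syscall #3: drain the eventfd   */

    io_getevents(ctx, 1, 1, &ev, NULL);        /* syscall #4: reap the completion */
    printf("read returned %ld\n", (long)ev.res);

    io_destroy(ctx);
    close(efd);
    close(fd);
    free(buf);
    return 0;
}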

> What is the I/O processing time we have?  If it's say 10 microseconds,
> then we'll have 100,000 context switches per second assuming a device
> lock and a saturated iothread (split into multiple threads).

Hopefully with a saturated dedicated iothread you would not have any
context switches, and a single CPU would just be dedicated to virtio
processing.

> The coroutine work may have laid the groundwork for fine-grained
> locking.  I'm doubtful we should use qcow when we want >100K IOPS though.

Yep.  Going away from coroutines is a solution in search of a problem; it
will introduce several new variables (kernel scheduling, more expensive
lock contention, starving the thread pool with locked threads, ...), all
for a case where performance hardly matters.

>>>> I'm also curious about virtqueue_pop()/virtqueue_push() outside the QEMU
>>>> mutex, although that might be blocked by the current work around MMIO/PIO
>>>> dispatch outside the global mutex.
>>>
>>> It is, yes.
>>
>> It should only require unlocked memory map/unmap, not MMIO dispatch.
>> The MMIO/PIO bits are taken care of by ioeventfd.
> 
> The ring, or indirect descriptors, or the data, can all be on mmio.
> IIRC the virtio spec forbids that, but the APIs have to be general.  We
> don't have cpu_physical_memory_map_nommio() (or
> address_space_map_nommio(), as soon as the coding style committee
> ratifies struct literals).

cpu_physical_memory_map could still take the QEMU lock in the slow
bounce-buffer case.  BTW the block layer has been using struct literals
for a long time and we're just as happy as you are about them. :)
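
For illustration, a rough sketch of that idea (not QEMU's actual
cpu_physical_memory_map; lookup_ram_ptr(), alloc_bounce_buffer(),
mmio_read_region() and the lock name are hypothetical placeholders): the
common RAM case returns a host pointer without the global lock, and only
the MMIO bounce-buffer fallback takes it.

/* Hypothetical sketch only -- the helpers below are placeholders, not
 * QEMU APIs.  The point is where the global lock would be taken. */
void *map_guest_memory(uint64_t gpa, uint64_t *plen, bool is_write)
{
    /* Fast path: direct RAM mapping, assumed to use a lock-free
     * (e.g. RCU-style) lookup, so no global mutex is needed. */
    void *host = lookup_ram_ptr(gpa, plen);
    if (host) {
        return host;
    }

    /* Slow path: the region is MMIO, so fall back to a bounce buffer.
     * Taking the global lock here is fine because it is rare. */
    qemu_mutex_lock(&global_lock);
    void *bounce = alloc_bounce_buffer(*plen);
    if (!is_write) {
        mmio_read_region(gpa, bounce, *plen);  /* dispatch under the lock */
    }
    qemu_mutex_unlock(&global_lock);
    return bounce;
}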

Paolo
