qemu-devel

Re: [Qemu-devel] Linux kernel polling for QEMU


From: Christian Borntraeger
Subject: Re: [Qemu-devel] Linux kernel polling for QEMU
Date: Tue, 29 Nov 2016 12:58:18 +0100
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:45.0) Gecko/20100101 Thunderbird/45.4.0

On 11/29/2016 12:00 PM, Stefan Hajnoczi wrote:
> On Tue, Nov 29, 2016 at 09:19:22AM +0100, Christian Borntraeger wrote:
>> On 11/24/2016 04:12 PM, Stefan Hajnoczi wrote:
>>> I looked through the socket SO_BUSY_POLL and blk_mq poll support in
>>> recent Linux kernels with an eye towards integrating the ongoing QEMU
>>> polling work.  The main missing feature is eventfd polling support which
>>> I describe below.
>>>
>>> Background
>>> ----------
>>> We're experimenting with polling in QEMU so I wondered if there are
>>> advantages to having the kernel do polling instead of userspace.
>>>
>>> One such advantage has been pointed out by Christian Borntraeger and
>>> Paolo Bonzini: a userspace thread spins blindly without knowing when it
>>> is hogging a CPU that other tasks need.  The kernel knows when other
>>> tasks need to run and can skip polling in that case.
>>>
>>> Power management might also benefit if the kernel was aware of polling
>>> activity on the system.  That way polling can be controlled by the
>>> system administrator in a single place.  Perhaps smarter power saving
>>> choices can also be made by the kernel.
>>>
>>> Another advantage is that the kernel can poll hardware rings (e.g. NIC
>>> rx rings) whereas QEMU can only poll its own virtual memory (including
>>> guest RAM).  That means the kernel can bypass interrupts for devices
>>> that are using kernel drivers.
>>>
>>> State of polling in Linux
>>> -------------------------
>>> SO_BUSY_POLL causes recvmsg(2), select(2), and poll(2) family system
>>> calls to spin awaiting new receive packets.  From what I can tell epoll
>>> is not supported, so that system call will sleep without polling.
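>>>
>>> For reference, per-socket busy polling is enabled with setsockopt(2);
>>> the value is the busy-poll time in microseconds (a minimal sketch,
>>> error handling omitted):
>>>
>>>   int busy_poll_usec = 50;  /* spin up to 50us waiting for new packets */
>>>   setsockopt(sockfd, SOL_SOCKET, SO_BUSY_POLL,
>>>              &busy_poll_usec, sizeof(busy_poll_usec));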
>>>
>>> blk_mq poll is mainly supported by NVMe.  It is only available with
>>> synchronous direct I/O.  select(2), poll(2), epoll, and Linux AIO are
>>> therefore not integrated.  It would be nice to extend the code so a
>>> process waiting on Linux AIO using io_getevents(2), select(2), poll(2),
>>> or epoll will poll.
>>>
>>> QEMU and KVM-specific polling
>>> -----------------------------
>>> There are a few QEMU/KVM-specific items that require polling support:
>>>
>>> QEMU's event loop aio_notify() mechanism wakes up the event loop from a
>>> blocking poll(2) or epoll call.  It is used when another thread adds or
>>> changes an event loop resource (such as scheduling a BH).  There is a
>>> userspace memory location (ctx->notified) that is written by
>>> aio_notify() as well as an eventfd that can be signalled.
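>>>
>>> Roughly, the notification side looks like this (a simplified sketch,
>>> not the exact QEMU code):
>>>
>>>   void aio_notify(AioContext *ctx)
>>>   {
>>>       ctx->notified = true;               /* visible to a polling reader */
>>>       smp_wmb();                          /* order the flag before the kick */
>>>       event_notifier_set(&ctx->notifier); /* eventfd write, wakes poll(2) */
>>>   }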
>>>
>>> kvm.ko's ioeventfd is signalled upon guest MMIO/PIO accesses.  Virtio
>>> devices use ioeventfd as a doorbell after new requests have been placed
>>> in a virtqueue, which is a descriptor ring in userspace memory.
>>>
>>> Eventfd polling support could look like this:
>>>
>>>   struct eventfd_poll_info poll_info = {
>>>       .addr = ...memory location...,
>>>       .size = sizeof(uint32_t),
>>>       .op   = EVENTFD_POLL_OP_NOT_EQUAL, /* check *addr != val */
>>>       .val  = ...last value...,
>>>   };
>>>   ioctl(eventfd, EVENTFD_SET_POLL, &poll_info);
>>>
>>> In the kernel, eventfd stashes this information and eventfd_poll()
>>> evaluates the operation (e.g. not equal, bitwise and, etc) to detect
>>> progress.
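>>>
>>> The progress check inside eventfd_poll() could then be as simple as
>>> this (hypothetical sketch of the proposed interface, not existing
>>> kernel code):
>>>
>>>   static bool eventfd_poll_progress(struct eventfd_ctx *ctx)
>>>   {
>>>       u32 cur;
>>>
>>>       /* read the userspace memory location registered via the ioctl */
>>>       if (get_user(cur, (u32 __user *)ctx->poll_info.addr))
>>>           return false;
>>>
>>>       switch (ctx->poll_info.op) {
>>>       case EVENTFD_POLL_OP_NOT_EQUAL:
>>>           return cur != ctx->poll_info.val;
>>>       case EVENTFD_POLL_OP_AND:
>>>           return (cur & ctx->poll_info.val) != 0;
>>>       default:
>>>           return false;
>>>       }
>>>   }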
>>>
>>> Note that this eventfd polling mechanism doesn't actually poll the
>>> eventfd counter value.  It's useful for situations where the eventfd is
>>> a doorbell/notification that some object in userspace memory has been
>>> updated.  So it polls that userspace memory location directly.
>>>
>>> This new eventfd feature also provides a poor man's Linux AIO polling
>>> support: set the Linux AIO shared ring index as the eventfd polling
>>> memory location.  This is not as good as true Linux AIO polling support,
>>> where the kernel polls the NVMe, virtio_blk, etc. ring, since we'd still
>>> rely on an interrupt to complete I/O requests.
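>>>
>>> With the proposed ioctl that might look like the following (sketch only;
>>> it assumes the mmap'd completion ring that io_setup(2) provides, where
>>> the kernel bumps the tail index on each completion):
>>>
>>>   struct eventfd_poll_info poll_info = {
>>>       .addr = &ring->tail,               /* Linux AIO shared ring index */
>>>       .size = sizeof(ring->tail),
>>>       .op   = EVENTFD_POLL_OP_NOT_EQUAL, /* progress when tail has moved */
>>>       .val  = last_seen_tail,
>>>   };
>>>   ioctl(aio_eventfd, EVENTFD_SET_POLL, &poll_info);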
>>>
>>> Thoughts?
>>
>> Would be an interesting exercise, but we should really try to avoid making
>> the iothreads more costly. When I look at some of our measurements, I/O-wise
>> we are slightly behind z/VM, which can be tuned to be in a similar area, but
>> we use more host CPUs on s390 for the same throughput.
>>
>> So I have two concerns, and both are related to overhead.
>> a: I am able to get higher bandwidth and lower host CPU utilization
>> when running fio for multiple disks if I pin the iothreads to a subset of
>> the host CPUs (there is a sweet spot). Is the polling maybe just influencing
>> the scheduler to do the same thing by making the iothread not do
>> sleep/wakeup all the time?
> 
> Interesting theory; look at sched_switch tracing data to find out
> whether that is true.

Looking at vmstat, a poll value of 50000 seems to reduce the number of
context switches. Depending on the workload there is almost no change, or
sometimes a large one (one test went from 250000/sec to 150000/sec).
According to sched_switch the iothread still moves between the CPUs as before,
so my theory does not seem to hold.

On the other hand this is a development s390 system that I share with 84
other LPARs, so I have trouble getting stable results as soon as there is a
high data rate. I would need to find a time slot on one of the dedicated
systems, but maybe it's just easier to reproduce this on x86.

> Do you get any benefit from combining the sweet
> spot pinning with polling?

Maybe, but it seems that you have to give a few more CPUs to the
iothreads. What I can tell is that combining both hurts in the case with
more than one disk and all iothreads pinned to just one host CPU, as soon
as the polling value is too big.

> 
>> b: what about contention with other guests on the host? What
>> worries me a bit is that most performance measurements and
>> tunings are done for workloads without that. We (including myself) do our
>> microbenchmarks (or fio runs) with just one guest and are happy if we see
>> an improvement. But does that reflect real usage? For example, have you ever
>> measured the aio polling with 10 guests or so?
>> My gut feeling (and obviously I have not done proper measurements myself) is
>> that we want to stop polling as soon as there is contention.
>>
>> As you outlined, we already have something in place in the kernel to stop
>> polling.
>>
>> Interestingly enough, for SO_BUSY_POLL the network code seems to consider
>>     !need_resched() && !signal_pending(current)
>> for stopping the poll, which allows the poll to consume your whole time
>> slice. KVM instead uses single_task_running() for its halt polling
>> (halt_poll_ns). This means that KVM yields much more aggressively, which is
>> probably the right thing to do for opportunistic spinning.
> 
> Another thing I noticed about the busy_poll implementation is that it
> will spin if *any* file descriptor supports polling.
> 
> In QEMU we decided to implement the opposite: spin only if *all* event
> sources support polling.  The reason is that we don't want polling to
> introduce any extra latency on the event sources that do not support
> polling.
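> 
> In pseudo-code the decision is roughly this (simplified, not the
> literal QEMU source):
> 
>   /* Only enter the polling loop if every handler can be polled. */
>   static bool can_use_polling(AioContext *ctx)
>   {
>       AioHandler *node;
> 
>       QLIST_FOREACH(node, &ctx->aio_handlers, node) {
>           if (!node->io_poll) {
>               return false;   /* fall back to blocking poll(2)/epoll */
>           }
>       }
>       return true;
>   }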
> 
>> Another thing to consider: in the kernel we already have other opportunistic
>> spinners, and we are in the process of making things less aggressive because
>> it caused real issues. For example, search for the vcpu_is_preempted patch
>> set.
>> That series, by the way, showed another issue: when running nested you want
>> to consider not only your own load but also the load of the hypervisor.
> 
> These are good points and it's why I think polling in the kernel can
> make smarter decisions than polling in userspace.  There are multiple
> components in the system that can do polling; it would be best to have a
> single place so that the polling activities do not interfere with each other.
> 
> Stefan
> 



