From: Stefan Hajnoczi
Subject: Re: [Qemu-devel] [PATCH] linux-aio: keep processing events if MAX_EVENTS reached
Date: Fri, 15 Jul 2016 15:18:11 +0100
User-agent: Mutt/1.6.1 (2016-04-27)

On Tue, Jul 12, 2016 at 04:12:42PM +0200, Roman Penyaev wrote:
> On Tue, Jun 28, 2016 at 11:42 AM, Stefan Hajnoczi <address@hidden> wrote:
> > On Mon, Jun 27, 2016 at 08:36:19PM +0200, Roman Penyaev wrote:
> >> On Jun 27, 2016 6:37 PM, "Stefan Hajnoczi" <address@hidden> wrote:
> >> >
> >> > Commit ccb9dc10129954d0bcd7814298ed445e684d5a2a ("linux-aio: Cancel BH
> >> > if not needed") exposed a side-effect of scheduling the BH for nested
> >> > event loops.  When io_getevents(2) fetches MAX_EVENTS events then it's
> >> > necessary to call it again.  Failure to do so can lead to hung I/O
> >> > because the eventfd has already been cleared and the completion BH will
> >> > not run again.
> >> >
> >> > This patch fixes the hang by calling io_getevents(2) again if the events
> >> > array was totally full.
> >> >
> >> > Reported-by: Roman Penyaev <address@hidden>
> >> > Signed-off-by: Stefan Hajnoczi <address@hidden>
> >> > ---
> >> >  block/linux-aio.c | 5 +++++
> >> >  1 file changed, 5 insertions(+)
> >> >
> >> > diff --git a/block/linux-aio.c b/block/linux-aio.c
> >> > index e468960..af15f85 100644
> >> > --- a/block/linux-aio.c
> >> > +++ b/block/linux-aio.c
> >> > @@ -117,6 +117,7 @@ static void qemu_laio_completion_bh(void *opaque)
> >> >      LinuxAioState *s = opaque;
> >> >
> >> >      /* Fetch more completion events when empty */
> >> > +more_events:
> >> >      if (s->event_idx == s->event_max) {
> >> >          do {
> >> >              struct timespec ts = { 0 };
> >> > @@ -146,6 +147,10 @@ static void qemu_laio_completion_bh(void *opaque)
> >> >          qemu_laio_process_completion(laiocb);
> >> >      }
> >> >
> >> > +    if (s->event_idx == MAX_EVENTS) {
> >> > +        goto more_events; /* there might still be events waiting for us */
> >> > +    }
> >> > +
> >>
> >> I am on vacation and can't check anything, but it seems you could
> >> also reproduce the issue, and it seems you are right: what I've also
> >> noticed is that I/O hangs only if the total number of requests in
> >> the guest is greater than 128 (the MAX_EVENTS defined in linux-aio.c).
> >>
> >> But I do not understand why you have to repeat io_getevents when
> >> the return value == MAX_EVENTS.  According to my understanding you
> >> can't submit more than 128, so the first io_getevents call returns
> >> exactly what you've submitted.
> >
> > Hi Roman,
> > There is nothing like discussions on qemu-devel from the beach.  True
> > vacation!
> >
> > The issue is:
> >
> > 1. s->events[] is only MAX_EVENTS large so each io_getevents() call can
> >    only fetch that maximum number of events.
> >
> > 2. qemu_laio_completion_cb() clears the eventfd so the event loop will
> >    not call us again (unless additional I/O requests complete).
> >
> > Therefore, returning from qemu_laio_completion_bh() without having
> > processed all events causes a hang.
> 
> Hi Stefan,
> 
> The issue is clear now.  The thing is that I assumed we never submit
> more than MAX_EVENTS.  Now I see that, according to ioq_submit(), this
> is not true, so we can have more than MAX_EVENTS requests in flight.
> 
> Frankly, that seems like a bug to me, because we promise the kernel
> [when we call io_setup(MAX_EVENTS)] not to submit more than the
> specified value.  The kernel allocates a ring buffer aligned to the
> page size and compensates to have enough free requests for each CPU,
> so the final number of io events *can be* greater than what we have
> requested.  So eventually the kernel can "swallow" more events than
> MAX_EVENTS.

You are right.  QEMU shouldn't exceed MAX_EVENTS in-flight requests,
even though the kernel overallocates resources so we don't see an error
(yet).

Your patch makes linux-aio.c more correct overall, so let's take that.
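
To make the in-flight accounting concrete, here is a minimal sketch of
the idea (the names below -- laio_counters, in_flight, in_queue,
submit_budget() -- are hypothetical, not the actual linux-aio.c fields):
the submission path never hands io_submit(2) more iocbs than would keep
the in-flight count at the MAX_EVENTS we passed to io_setup(2); the rest
stay queued until completions free slots.

/* Sketch only: hypothetical names, not the actual linux-aio.c code. */
#define MAX_EVENTS 128

struct laio_counters {
    unsigned in_flight;   /* submitted to the kernel, not yet reaped */
    unsigned in_queue;    /* queued in QEMU, not yet submitted */
};

/* How many new iocbs may be passed to io_submit(2) right now without
 * exceeding the MAX_EVENTS promised in io_setup(2).  Anything beyond
 * this budget stays on the pending queue until completions free up
 * slots and the queue is flushed again. */
static unsigned submit_budget(const struct laio_counters *c)
{
    return c->in_flight >= MAX_EVENTS ? 0 : MAX_EVENTS - c->in_flight;
}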

> I did the following patch (at the bottom of this email), which restricts
> submission to no more than MAX_EVENTS, and the interesting thing is that
> it works faster than your current fix for this hang:
> 
> my setup: 1 iothread, VCPU=8, MQ=8
> 
> your current patch
> "linux-aio: keep processing events if MAX_EVENTS reached":
> 
>    READ: io=48199MB, aggrb=1606.5MB/s, minb=1606.5MB/s, maxb=1606.5MB/s, mint=30003msec, maxt=30003msec
>   WRITE: io=48056MB, aggrb=1601.8MB/s, minb=1601.8MB/s, maxb=1601.8MB/s, mint=30003msec, maxt=30003msec
> 
> my changes:
> 
>    READ: io=53294MB, aggrb=1776.3MB/s, minb=1776.3MB/s, maxb=1776.3MB/s, mint=30003msec, maxt=30003msec
>   WRITE: io=53177MB, aggrb=1772.4MB/s, minb=1772.4MB/s, maxb=1772.4MB/s, mint=30003msec, maxt=30003msec
> 
> But the most important thing here is that reverting
> "linux-aio: Cancel BH if not needed" brings these numbers:
> 
>    READ: io=56362MB, aggrb=1878.4MB/s, minb=1878.4MB/s, maxb=1878.4MB/s, mint=30007msec, maxt=30007msec
>   WRITE: io=56255MB, aggrb=1874.8MB/s, minb=1874.8MB/s, maxb=1874.8MB/s, mint=30007msec, maxt=30007msec
> 
> So, it seems to me that "linux-aio: Cancel BH if not needed" introduces
> a regression.

Unfortunately the patch does two things, so it would be best to identify
which one causes the performance regression.

Is it #1?

@@ -149,6 +149,8 @@ static void qemu_laio_completion_bh(void *opaque)
     if (!s->io_q.plugged && !QSIMPLEQ_EMPTY(&s->io_q.pending)) {
         ioq_submit(s);
     }
+
+    qemu_bh_cancel(s->completion_bh);
 }
 
 static void qemu_laio_completion_cb(EventNotifier *e)

That seems a little unlikely since qemu_bh_cancel() is very
light-weight.

Or #2?

@@ -156,7 +158,7 @@ static void qemu_laio_completion_cb(EventNotifier *e)
     LinuxAioState *s = container_of(e, LinuxAioState, e);
 
     if (event_notifier_test_and_clear(&s->e)) {
-        qemu_bh_schedule(s->completion_bh);
+        qemu_laio_completion_bh(s);
     }
 }

If it's #2 then there may be an fd handler/BH ordering issue that we
should investigate.  In other words, the problem isn't that calling
qemu_laio_completion_bh() directly is "slower" but that calling it here
reduces the number of completion events that io_getevents() can
harvest.  This is a form of batching but it could also happen if other
fd handlers/BHs are submitting I/O.
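
One hypothetical way to check the batching theory is throwaway
instrumentation that logs how many completions each io_getevents(2)
call harvests, with and without "linux-aio: Cancel BH if not needed"
applied.  A sketch (assuming the s->ctx and s->events fields from
linux-aio.c and <libaio.h>; the "nr" variable, the min_nr of 0, and the
fprintf are illustrative additions, not code from the tree):

/* Debug-only sketch: measure the batch size of one harvest. */
struct timespec ts = { 0 };
int nr = io_getevents(s->ctx, 0, MAX_EVENTS, s->events, &ts);
if (nr > 0) {
    fprintf(stderr, "linux-aio: io_getevents() returned %d events\n", nr);
}

Consistently smaller batches with the patch applied would support the
batching explanation for the regression.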
