From: Roman Penyaev
Subject: Re: [Qemu-devel] [PATCH] linux-aio: keep processing events if MAX_EVENTS reached
Date: Tue, 12 Jul 2016 16:12:42 +0200

On Tue, Jun 28, 2016 at 11:42 AM, Stefan Hajnoczi <address@hidden> wrote:
> On Mon, Jun 27, 2016 at 08:36:19PM +0200, Roman Penyaev wrote:
>> On Jun 27, 2016 6:37 PM, "Stefan Hajnoczi" <address@hidden> wrote:
>> >
>> > Commit ccb9dc10129954d0bcd7814298ed445e684d5a2a ("linux-aio: Cancel BH
>> > if not needed") exposed a side-effect of scheduling the BH for nested
>> > event loops.  When io_getevents(2) fetches MAX_EVENTS events then it's
>> > necessary to call it again.  Failure to do so can lead to hung I/O
>> > because the eventfd has already been cleared and the completion BH will
>> > not run again.
>> >
>> > This patch fixes the hang by calling io_getevents(2) again if the events
>> > array was totally full.
>> >
>> > Reported-by: Roman Penyaev <address@hidden>
>> > Signed-off-by: Stefan Hajnoczi <address@hidden>
>> > ---
>> >  block/linux-aio.c | 5 +++++
>> >  1 file changed, 5 insertions(+)
>> >
>> > diff --git a/block/linux-aio.c b/block/linux-aio.c
>> > index e468960..af15f85 100644
>> > --- a/block/linux-aio.c
>> > +++ b/block/linux-aio.c
>> > @@ -117,6 +117,7 @@ static void qemu_laio_completion_bh(void *opaque)
>> >      LinuxAioState *s = opaque;
>> >
>> >      /* Fetch more completion events when empty */
>> > +more_events:
>> >      if (s->event_idx == s->event_max) {
>> >          do {
>> >              struct timespec ts = { 0 };
>> > @@ -146,6 +147,10 @@ static void qemu_laio_completion_bh(void *opaque)
>> >          qemu_laio_process_completion(laiocb);
>> >      }
>> >
>> > +    if (s->event_idx == MAX_EVENTS) {
>> > +        goto more_events; /* there might still be events waiting for us */
>> > +    }
>> > +
>>
>> I am on vacation and can't check anything, but it seems you could
>> also reproduce the issue and that you are right: what I've also
>> noticed is that I/O hangs only if the total number of requests in
>> the guest is greater than 128 (the value defined in linux-aio.c).
>>
>> But I do not understand why you have to repeat io_getevents() when
>> the return value == MAX_EVENTS.  According to my understanding we
>> can't submit more than 128, so the first io_getevents() call
>> returns exactly what was submitted.
>
> Hi Roman,
> There is nothing like discussions on qemu-devel from the beach.  True
> vacation!
>
> The issue is:
>
> 1. s->events[] is only MAX_EVENTS large so each io_getevents() call can
>    only fetch that maximum number of events.
>
> 2. qemu_laio_completion_cb() clears the eventfd so the event loop will
>    not call us again (unless additional I/O requests complete).
>
> Therefore, returning from qemu_laio_completion_bh() without having
> processed all events causes a hang.

Hi Stefan,

The issue is clear now.  I had assumed that we never submit more than
MAX_EVENTS requests.  Now I see from ioq_submit() that this is not
true, so we can have more than MAX_EVENTS requests in flight.
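
Just to spell out your point 2 to myself, here is a tiny standalone
sketch of the eventfd behaviour (plain eventfd(2), nothing
QEMU-specific): once the counter has been drained, nothing wakes the
handler again until new completions arrive, so any events it left
unprocessed are stuck:

#include <sys/eventfd.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    int efd = eventfd(0, EFD_NONBLOCK);
    uint64_t val = 1;
    ssize_t n;

    /* two completions are signalled ... */
    n = write(efd, &val, sizeof(val));
    n = write(efd, &val, sizeof(val));

    /* ... but the handler drains the counter with a single read */
    n = read(efd, &val, sizeof(val));
    printf("first read:  %zd bytes, counter was %llu\n",
           n, (unsigned long long)val);

    /* no further notification: the handler is never invoked again
     * unless something writes to the eventfd, which is the hang */
    n = read(efd, &val, sizeof(val));
    printf("second read: %zd (would block, EAGAIN)\n", n);

    close(efd);
    return 0;
}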

Frankly, that looks like a bug to me, because when we call
io_setup(MAX_EVENTS) we promise the kernel not to submit more than
that number of requests.  The kernel allocates the completion ring
aligned to the page size and adds some slack so that each CPU has
enough free requests, so the final number of io events *can be*
larger than what we requested.  That is why the kernel can "swallow"
more events than MAX_EVENTS.
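
To illustrate the promise I mean, here is a rough standalone sketch
using the raw libaio wrappers (not QEMU code; compile with -laio;
MAX_EVENTS mirrors the 128 from block/linux-aio.c):

#include <libaio.h>
#include <stdio.h>

#define MAX_EVENTS 128   /* same limit block/linux-aio.c uses */

int main(void)
{
    io_context_t ctx = 0;
    int ret = io_setup(MAX_EVENTS, &ctx);

    if (ret < 0) {
        fprintf(stderr, "io_setup: %d\n", ret);
        return 1;
    }

    /* From here on the caller has promised to keep at most MAX_EVENTS
     * requests in flight.  The kernel rounds the completion ring up
     * (page alignment, per-CPU slack), so it often absorbs more, but
     * relying on that slack is exactly the problem described above. */

    io_destroy(ctx);
    return 0;
}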

I wrote the following patch (at the bottom of this email), which
restricts submission so that no more than MAX_EVENTS requests are in
flight, and the interesting thing is that it works faster than your
current fix for this hang:

my setup: 1 iothread, VCPU=8, MQ=8

your current patch
"linux-aio: keep processing events if MAX_EVENTS reached":

   READ: io=48199MB, aggrb=1606.5MB/s, minb=1606.5MB/s, maxb=1606.5MB/s, mint=30003msec, maxt=30003msec
  WRITE: io=48056MB, aggrb=1601.8MB/s, minb=1601.8MB/s, maxb=1601.8MB/s, mint=30003msec, maxt=30003msec

my changes:

   READ: io=53294MB, aggrb=1776.3MB/s, minb=1776.3MB/s, maxb=1776.3MB/s, mint=30003msec, maxt=30003msec
  WRITE: io=53177MB, aggrb=1772.4MB/s, minb=1772.4MB/s, maxb=1772.4MB/s, mint=30003msec, maxt=30003msec

But the most important thing here is that reverting
"linux-aio: Cancel BH if not needed" brings these numbers:

   READ: io=56362MB, aggrb=1878.4MB/s, minb=1878.4MB/s, maxb=1878.4MB/s, mint=30007msec, maxt=30007msec
  WRITE: io=56255MB, aggrb=1874.8MB/s, minb=1874.8MB/s, maxb=1874.8MB/s, mint=30007msec, maxt=30007msec

So, it seems to me that "linux-aio: Cancel BH if not needed"
introduces a regression.

--
Roman

---
 block/linux-aio.c | 21 ++++++++++++++-------
 1 file changed, 14 insertions(+), 7 deletions(-)

diff --git a/block/linux-aio.c b/block/linux-aio.c
index e468960..dd768cc 100644
--- a/block/linux-aio.c
+++ b/block/linux-aio.c
@@ -44,7 +44,8 @@ struct qemu_laiocb {

 typedef struct {
     int plugged;
-    unsigned int n;
+    unsigned int inqueue;
+    unsigned int inflight;
     bool blocked;
     QSIMPLEQ_HEAD(, qemu_laiocb) pending;
 } LaioQueue;
@@ -129,6 +130,7 @@ static void qemu_laio_completion_bh(void *opaque)
             s->event_max = 0;
             return; /* no more events */
         }
+        s->io_q.inflight -= s->event_max;
     }

     /* Reschedule so nested event loops see currently pending completions */
@@ -190,7 +192,8 @@ static void ioq_init(LaioQueue *io_q)
 {
     QSIMPLEQ_INIT(&io_q->pending);
     io_q->plugged = 0;
-    io_q->n = 0;
+    io_q->inqueue = 0;
+    io_q->inflight = 0;
     io_q->blocked = false;
 }

@@ -203,9 +206,12 @@ static void ioq_submit(LinuxAioState *s)

     do {
         len = 0;
+        if (s->io_q.inflight >= MAX_EVENTS)
+            break;
         QSIMPLEQ_FOREACH(aiocb, &s->io_q.pending, next) {
             iocbs[len++] = &aiocb->iocb;
-            if (len == MAX_QUEUED_IO) {
+            if (len == MAX_QUEUED_IO ||
+                (s->io_q.inflight + len) >= MAX_EVENTS) {
                 break;
             }
         }
@@ -218,11 +224,12 @@ static void ioq_submit(LinuxAioState *s)
             abort();
         }

-        s->io_q.n -= ret;
+        s->io_q.inflight += ret;
+        s->io_q.inqueue  -= ret;
         aiocb = container_of(iocbs[ret - 1], struct qemu_laiocb, iocb);
         QSIMPLEQ_SPLIT_AFTER(&s->io_q.pending, aiocb, next, &completed);
     } while (ret == len && !QSIMPLEQ_EMPTY(&s->io_q.pending));
-    s->io_q.blocked = (s->io_q.n > 0);
+    s->io_q.blocked = (s->io_q.inqueue > 0);
 }

 void laio_io_plug(BlockDriverState *bs, LinuxAioState *s)
@@ -263,9 +270,9 @@ static int laio_do_submit(int fd, struct qemu_laiocb *laiocb, off_t offset,
     io_set_eventfd(&laiocb->iocb, event_notifier_get_fd(&s->e));

     QSIMPLEQ_INSERT_TAIL(&s->io_q.pending, laiocb, next);
-    s->io_q.n++;
+    s->io_q.inqueue++;
     if (!s->io_q.blocked &&
-        (!s->io_q.plugged || s->io_q.n >= MAX_QUEUED_IO)) {
+        (!s->io_q.plugged || s->io_q.inqueue >= MAX_QUEUED_IO)) {
         ioq_submit(s);
     }

-- 
2.8.2


