Re: [Qemu-devel] Guest unresponsive after Virtqueue size exceeded error
From: Fernando Casas Schössow
Subject: Re: [Qemu-devel] Guest unresponsive after Virtqueue size exceeded error
Date: Mon, 19 Jun 2017 22:10:15 +0000
Hi Ladi,
Today two guests failed again at different times of day.
One of them was the one I switched from virtio_blk to virtio_scsi, so this
change didn't solve the problem.
Now in this guest I also disabled virtio_balloon, continuing with the
elimination process.
Also this time I found a different error message in the guest console.
In the guest already switched to virtio_scsi:
virtio_scsi virtio2: request:id 44 is not a head!
Followed by the usual "task blocked for more than 120 seconds." error.
On the guest still running on virtio_blk the error was similar:
virtio_blk virtio2: req.0:id 42 is not a head!
blk_update_request: I/O error, dev vda, sector 645657736
Buffer I/O error on dev dm-1, logical block 7413821, lost async page write
Followed by the usual "task blocked for more than 120 seconds." error.
Do you think that the blk_update_request and buffer I/O errors may be a
consequence of the preceding "is not a head!" error, or should I be worried
about a storage-level issue here?
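For readers following the thread, here is a minimal sketch in C of the kind of
guest-side bookkeeping that, as far as I understand it, sits behind an "is not
a head!" message: the driver remembers a token for every descriptor chain it
has handed to the device, and complains when the device reports completion of
an id with no outstanding token. This is not the kernel's code. The names
toy_guest_vq, toy_add_buf and toy_get_buf, the fixed ring size and the "broken"
flag are assumptions made for illustration only.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_RING_SIZE 8  /* assumed ring size, for illustration only */

/* Hypothetical, simplified guest-side state: one token slot per descriptor. */
struct toy_guest_vq {
    void *token[TOY_RING_SIZE]; /* non-NULL while a request headed by this id is outstanding */
    bool broken;                /* set once an inconsistency has been seen */
};

/* Hand a request to the device; "head" is the first descriptor of its chain. */
static bool toy_add_buf(struct toy_guest_vq *vq, unsigned int head, void *token)
{
    assert(vq != NULL);
    if (vq->broken || head >= TOY_RING_SIZE || token == NULL ||
        vq->token[head] != NULL) {
        return false;
    }
    vq->token[head] = token;
    return true;
}

/* Process a completion reported by the device for descriptor "id". */
static void *toy_get_buf(struct toy_guest_vq *vq, unsigned int id)
{
    assert(vq != NULL);
    if (vq->broken) {
        return NULL;
    }
    if (id >= TOY_RING_SIZE || vq->token[id] == NULL) {
        /* The "id N is not a head!" situation: nothing is outstanding for id. */
        vq->broken = true;
        return NULL;
    }
    void *token = vq->token[id];
    vq->token[id] = NULL; /* the request is no longer outstanding */
    return token;
}
```

In this model, a second completion for the same id (or a completion for an id
that was never submitted) marks the queue as broken, and every later request on
it fails. Whether the I/O errors quoted above arise the same way in the real
driver is an assumption this sketch does not settle.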
Now I will wait to see if disabling virtio_balloon helps or not and report back.
Thanks.
Fer
On Fri, Jun 16, 2017 at 12:25, Ladi Prosek <address@hidden> wrote:
On Fri, Jun 16, 2017 at 12:11 PM, Fernando Casas Schössow
<address@hidden> wrote:
Hi Ladi, thanks a lot for looking into this and replying. I will do my best to
rebuild and deploy Alpine's qemu packages with this patch included, but I'm not
sure it's feasible yet. In any case, would it be possible to have this patch
included in the next qemu release?
Yes, I have already added this to my todo list.
The current error message is helpful but knowing which device was involved will
be much more helpful. Regarding the environment, I'm not doing migrations, and
a managed save is only done when the host needs to be rebooted or shut down.
The QEMU process has been running the VM since the host started, and this
failure is occurring randomly without any previous managed save. As part of
troubleshooting, on one of the guests I switched from virtio_blk to virtio_scsi
for the guest disks, but I will need more time to see if that helped. If I have
this problem again I will follow your advice and remove virtio_balloon.
Thanks, please keep us posted.
Another question: is there any way to monitor the virtqueue size either from
the guest itself or from the host? Any file in sysfs or proc? This may help to
understand in which conditions this is happening and to react faster to
mitigate the problem.
The problem is not in the virtqueue size but in one piece of internal state
("inuse") which is meant to track the number of buffers "checked out" by QEMU.
It's being compared to virtqueue size merely as a sanity check. I'm afraid that
there's no way to expose this variable without rebuilding QEMU. The best you
could do is attach gdb to the QEMU process and use some clever data access
breakpoints to catch suspicious writes to the variable. Although it's likely
that it just creeps up slowly and you won't see anything interesting. It's
probably beyond reasonable at this point anyway. I would continue with the
elimination process (virtio_scsi instead of virtio_blk, no balloon, etc.) and
then maybe once we know which device it is, we can add some instrumentation to
the code.
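To illustrate the "inuse" explanation above, here is a minimal, self-contained
sketch in C of a counter that creeps up slowly until a sanity check against the
ring size fails. This is not QEMU code. The names toy_vq, toy_pop, toy_push and
toy_rounds_until_error are hypothetical, and the leak (every Nth buffer is
never returned) is an assumed failure mode chosen purely for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical stand-in for a virtqueue; only the two values discussed
 * above are modelled. */
struct toy_vq {
    unsigned int num;   /* ring size */
    unsigned int inuse; /* buffers currently "checked out" by the device */
};

/* Check out one buffer. Fails when the sanity check trips. */
static bool toy_pop(struct toy_vq *vq)
{
    if (vq->inuse >= vq->num) {
        fprintf(stderr, "Virtqueue size exceeded (inuse=%u, num=%u)\n",
                vq->inuse, vq->num);
        return false;
    }
    vq->inuse++;
    return true;
}

/* Return one buffer to the guest. */
static void toy_push(struct toy_vq *vq)
{
    assert(vq->inuse > 0);
    vq->inuse--;
}

/* Run pop/push cycles, skipping the push on every leak_every-th round
 * (leak_every <= 0 means never leak). Returns the round in which the
 * sanity check fails, or -1 if it never fails within max_rounds. */
static int toy_rounds_until_error(unsigned int num, int leak_every,
                                  int max_rounds)
{
    struct toy_vq vq = { .num = num, .inuse = 0 };

    for (int round = 1; round <= max_rounds; round++) {
        if (!toy_pop(&vq)) {
            return round;
        }
        if (leak_every <= 0 || round % leak_every != 0) {
            toy_push(&vq);
        }
    }
    return -1;
}
```

With a ring of 4 entries and one leaked buffer every 10 rounds,
toy_rounds_until_error(4, 10, 1000) returns 41: the counter reaches 4 after
round 40 and the next pop fails, long after the first leak. That delay is why
a watchpoint on the counter may show nothing interesting at the moment of the
error.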
Thanks again for your help with this!
Fer
On Fri, Jun 16, 2017 at 8:58, Ladi Prosek <address@hidden> wrote:
Hi,
Would you be able to enhance the error message and rebuild QEMU?
--- a/hw/virtio/virtio.c
+++ b/hw/virtio/virtio.c
@@ -856,7 +856,7 @@ void *virtqueue_pop(VirtQueue *vq, size_t sz)
     max = vq->vring.num;
     if (vq->inuse >= vq->vring.num) {
-        virtio_error(vdev, "Virtqueue size exceeded");
+        virtio_error(vdev, "Virtqueue %u device %s size exceeded", vq->queue_index, vdev->name);
         goto done;
     }
This would at least confirm the theory that it's caused by virtio-blk-pci. If
rebuilding is not feasible I would start by removing other virtio devices --
particularly balloon, which has had quite a few virtio-related bugs fixed
recently. Does your environment involve VM migrations or saving/resuming, or
does the crashing QEMU process always run the VM from its boot?
Thanks!