Re: [Qemu-devel] Guest unresponsive after Virtqueue size exceeded error


From: Ladi Prosek
Subject: Re: [Qemu-devel] Guest unresponsive after Virtqueue size exceeded error
Date: Tue, 20 Jun 2017 09:52:57 +0200

On Tue, Jun 20, 2017 at 8:30 AM, Fernando Casas Schössow
<address@hidden> wrote:
> Hi Ladi,
>
> In this case both guests are CentOS 7.3 running the same kernel
> 3.10.0-514.21.1.
> Also the guest that fails most frequently is running Docker with 4 or 5
> containers.
>
> Another thing I would like to mention is that the host is running on
> Alpine's default grsec patched kernel. I have the option to install also a
> vanilla kernel. Would it make sense to switch to the vanilla kernel on the
> host and see if that helps?

The host kernel is less likely to be responsible for this, in my
opinion. I'd hold off on that for now.

> And last but not least KSM is enabled on the host. Should I disable it?

Could be worth a try.
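
On most Linux hosts KSM is controlled through sysfs, so something like
this should stop it and un-merge already shared pages (run as root; I'm
not sure how Alpine manages KSM exactly):

  echo 2 > /sys/kernel/mm/ksm/run
  cat /sys/kernel/mm/ksm/pages_shared   # should drop to 0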

> Following your advice I will run memtest on the host and report back. Just
> as a side comment, the host is running on ECC memory.

I see.

Would it be possible for you, once a guest is in the broken state, to
make it available for debugging? By attaching gdb to the QEMU process
for example and letting me poke around it remotely? Thanks!
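
For reference, that would look something like this (the process name
and pid are placeholders):

  # on the host, find the QEMU process of the affected guest
  pgrep -af qemu-system

  # attach; note the guest is paused while gdb sits at its prompt
  gdb -p <pid>
  (gdb) info threads
  (gdb) detach    # detaching resumes the guest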

> Thanks for all your help.
>
> Fer.
>
> On Tue, Jun 20, 2017 at 7:59, Ladi Prosek <address@hidden> wrote:
>
> Hi Fernando,
>
> On Tue, Jun 20, 2017 at 12:10 AM, Fernando Casas Schössow
> <address@hidden> wrote:
>
> Hi Ladi,
>
> Today two guests failed again at different times of day. One of them
> was the one I switched from virtio_blk to virtio_scsi, so this change
> didn't solve the problem. Now in this guest I also disabled
> virtio_balloon, continuing with the elimination process.
>
> Also this time I found a different error message in the guest console.
> In the guest already switched to virtio_scsi:
>
>   virtio_scsi virtio2: request:id 44 is not a head!
>
> followed by the usual "task blocked for more than 120 seconds." error.
> On the guest still running on virtio_blk the error was similar:
>
>   virtio_blk virtio2: req.0:id 42 is not a head!
>   blk_update_request: I/O error, dev vda, sector 645657736
>   Buffer I/O error on dev dm-1, logical block 7413821, lost async page write
>
> again followed by the usual "task blocked for more than 120 seconds."
> error.
>
> Honestly, this is starting to look more and more like memory
> corruption. Two different virtio devices and two different guest
> operating systems: that would have to be a bug in the common virtio
> code, and we would have seen it somewhere else already. Would it be
> possible to run a thorough memtest on the host, just in case?
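>
> (Booting the host into memtest86+ overnight is the thorough option. As
> a rough in-place check there is also the userspace memtester tool,
> assuming it's packaged for Alpine:
>
>   memtester 1G 5    # test 1 GB, 5 passes
>
> though it can't reach memory already in use by the kernel and guests.)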
>
> Do you think that the blk_update_request and the buffer I/O errors may
> be a consequence of the previous "is not a head!" error, or should I be
> worried about a storage-level issue here?
>
> Now I will wait to see if disabling virtio_balloon helps or not, and
> report back.
>
> Thanks.
>
> Fer
>
> On Fri, Jun 16, 2017 at 12:25, Ladi Prosek <address@hidden> wrote:
>
> On Fri, Jun 16, 2017 at 12:11 PM, Fernando Casas Schössow
> <address@hidden> wrote:
>
> Hi Ladi,
>
> Thanks a lot for looking into this and replying. I will do my best to
> rebuild and deploy Alpine's qemu packages with this patch included, but
> I'm not sure it's feasible yet. In any case, would it be possible to
> have this patch included in the next qemu release?
>
> Yes, I have already added this to my todo list.
>
> The current error message is helpful, but knowing which device was
> involved will be much more helpful. Regarding the environment: I'm not
> doing migrations, and a managed save is done only when the host needs
> to be rebooted or shut down. The QEMU process has been running the VM
> since the host started, and this failure occurs randomly without any
> previous managed save. As part of troubleshooting, on one of the guests
> I switched from virtio_blk to virtio_scsi for the guest disks, but I
> will need more time to see if that helped. If I have this problem again
> I will follow your advice and remove virtio_balloon.
>
> Thanks, please keep us posted.
>
> Another question: is there any way to monitor the virtqueue size,
> either from the guest itself or from the host? Any file in sysfs or
> proc? This may help to understand under which conditions this happens
> and to react faster to mitigate the problem.
>
> The problem is not in the virtqueue size but in one piece of internal
> state ("inuse") which is meant to track the number of buffers "checked
> out" by QEMU. It's being compared to the virtqueue size merely as a
> sanity check. I'm afraid there's no way to expose this variable without
> rebuilding QEMU.
>
> The best you could do is attach gdb to the QEMU process and use some
> clever data access breakpoints to catch suspicious writes to the
> variable. Although it's likely that it just creeps up slowly and you
> won't see anything interesting.
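>
> To sketch what I mean (purely illustrative; this assumes QEMU was
> built with debug info, and the initial breakpoint is only there to get
> "vq" into scope):
>
>   (gdb) break virtqueue_pop      # stop once inside virtqueue_pop
>   (gdb) continue
>   ... breakpoint hits ...
>   (gdb) watch -l vq->inuse       # hardware watchpoint on that location
>   (gdb) delete 1                 # drop the breakpoint, keep the watchpoint
>   (gdb) continue
>
> gdb would then stop whenever the value changes, showing the old and
> new values each time.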
>
> It's probably beyond reasonable at this point anyway. I would continue
> with the elimination process (virtio_scsi instead of virtio_blk, no
> balloon, etc.) and then, once we know which device it is, we can maybe
> add some instrumentation to the code.
>
> Thanks again for your help with this!
>
> Fer
>
> On Fri, Jun 16, 2017 at 8:58, Ladi Prosek <address@hidden> wrote:
>
> Hi,
>
> Would you be able to enhance the error message and rebuild QEMU?
>
> --- a/hw/virtio/virtio.c
> +++ b/hw/virtio/virtio.c
> @@ -856,7 +856,7 @@ void *virtqueue_pop(VirtQueue *vq, size_t sz)
>      max = vq->vring.num;
>
>      if (vq->inuse >= vq->vring.num) {
> -        virtio_error(vdev, "Virtqueue size exceeded");
> +        virtio_error(vdev, "Virtqueue %u device %s size exceeded", vq->queue_index, vdev->name);
>          goto done;
>      }
>
> This would at least confirm the theory that it's caused by
> virtio-blk-pci. If rebuilding is not feasible, I would start by
> removing other virtio devices -- particularly balloon, which has had
> quite a few virtio-related bugs fixed recently.
>
> Does your environment involve VM migrations or saving/resuming, or
> does the crashing QEMU process always run the VM from its boot?
>
> Thanks!