I've spent the last week debugging an issue that is hitting OpenStack
with drive-mirror/block job usage.
Specifically we are seeing that a monitor command for 'query-block-jobs'
never replies to libvirt. After 3 minutes of waiting, the test harness
times out and kills the VM. When working normally, the entire test
completes in just a couple of seconds, so we don't think the 3-minute
timeout is a false positive.
[...] The failure rate of this problem is only
around 1 in 400 uses of drive-mirror. [...] No one has ever
managed to reproduce the problem outside of our automated test
system environment, even when running the same tests locally, and
we can't log into the test systems to get GDB traces or install
custom QEMU builds.
The best I can do is to collect debug logs from libvirtd, and get
stdio from QEMU. The QEMU stderr/stdout shows nothing at all. The
libvirtd log shows the following sequence of monitor interactions
with QEMU:
[...]
5. Libvirt waits for cleanup to complete:
msg={"execute":"query-block-jobs","id":"libvirt-15"}
reply={"return": [{"io-status": "ok", "device": "drive-virtio-disk0", "busy": true, "len": 25165824, "offset": 25165824,
"paused": false, "speed": 0, "type": "mirror"}], "id": "libvirt-15"}
msg={"execute":"query-block-jobs","id":"libvirt-16"}
reply={"return": [{"io-status": "ok", "device": "drive-virtio-disk0", "busy": true, "len": 25165824, "offset": 25165824,
"paused": false, "speed": 0, "type": "mirror"}], "id": "libvirt-16"}
msg={"execute":"query-block-jobs","id":"libvirt-17"}
<...hang...>
So we can see that this last 'query-block-jobs' command hangs. I've looked
at the code for handling this monitor command and am struggling to come up
with any ideas of why it might hang. My best idea was that the
bdrv_iterate() call it makes might be happening at the same time as another
thread modifies the list, but debugging on a local QEMU shows no changes to
the list at all due to drive-mirror/block jobs, so that doesn't seem to be
the cause.
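
Since the failure only shows up about once in 400 runs, it may help to
pick out the unanswered command from the libvirtd debug logs mechanically
rather than by eye. The following is only a rough sketch: it assumes each
monitor message has already been extracted onto a single line of the form
msg={...} or reply={...} as in the excerpt above (a real libvirtd log has
extra prefixes, and the replies above are wrapped by the mail client), and
the function name unanswered_commands is made up for illustration.

```python
import json
import re

# Assumed line shape, taken from the excerpt above: one complete JSON
# object per line, prefixed with "msg=" (libvirt -> QEMU) or "reply="
# (QEMU -> libvirt). Real log lines would need their prefix stripped first.
LINE_RE = re.compile(r'^\s*(msg|reply)=(\{.*\})\s*$')


def unanswered_commands(lines):
    """Return (id, command) pairs for every msg that never got a reply."""
    pending = {}
    for line in lines:
        match = LINE_RE.match(line)
        if not match:
            continue
        kind, payload = match.groups()
        try:
            obj = json.loads(payload)
        except json.JSONDecodeError:
            continue  # truncated or wrapped line, skip it
        ident = obj.get("id")
        if ident is None:
            continue  # asynchronous events carry no id
        if kind == "msg":
            pending[ident] = obj.get("execute")
        else:
            pending.pop(ident, None)
    return list(pending.items())


if __name__ == "__main__":
    sample = [
        'msg={"execute":"query-block-jobs","id":"libvirt-16"}',
        'reply={"return": [], "id": "libvirt-16"}',
        'msg={"execute":"query-block-jobs","id":"libvirt-17"}',
    ]
    print(unanswered_commands(sample))
```

Run over the collected logs from a failed job, this would list any command
whose id never appears in a reply, which for the sequence above is the
final 'query-block-jobs' with id libvirt-17.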