
Re: Open qcow2 on multiple hosts simultaneously.


From: kvaps
Subject: Re: Open qcow2 on multiple hosts simultaneously.
Date: Mon, 20 Nov 2023 13:41:02 +0100

Hey Alberto,

My article on this design has just been published.
In it I describe the chosen technologies and the ReadWriteMany implementation:

https://blog.deckhouse.io/lvm-qcow-csi-driver-shared-san-kubernetes-81455201590e

To anticipate your question: I have tested the method of exporting the volume from a single QSD instance to another node over the user network using NBD, and I ran into significant performance issues. Additionally, I would like to note that this method is overkill when the data is already accessible on a backing block device via the SAN.
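
For reference, such a single-QSD NBD export can be set up roughly as below; the device path, node names and port are only illustrative, not my exact configuration:

import subprocess

LV = "/dev/vg0/pvc-volume"  # illustrative: LV on the shared LUN carrying the qcow2 data

# One QSD instance opens the qcow2 layer and serves it to the other hosts over NBD.
subprocess.Popen([
    "qemu-storage-daemon",
    "--blockdev", f"driver=host_device,node-name=lun0,filename={LV},cache.direct=on",
    "--blockdev", "driver=qcow2,node-name=fmt0,file=lun0",
    "--nbd-server", "addr.type=inet,addr.host=0.0.0.0,addr.port=10809",
    "--export", "type=nbd,id=exp0,node-name=fmt0,writable=on",
])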

We have opted for an approach that switches cache.direct during live migration of a virtual machine, on the assumption that this is not a full-fledged ReadWriteMany mode and will be used solely for the live migration of virtual machines.
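
In rough terms, toggling cache.direct is a blockdev-reopen issued over the QMP socket of the QSD instance, along the lines of the sketch below (the socket path, node name and device path are assumptions, and blockdev-reopen expects the full option set of the node being reopened):

import json, socket

s = socket.socket(socket.AF_UNIX)
s.connect("/run/qsd-pvc.qmp")            # assumed QMP socket of the QSD instance
f = s.makefile("rw", buffering=1, encoding="utf-8")

def qmp(cmd, **args):
    f.write(json.dumps({"execute": cmd, "arguments": args}) + "\n")
    while True:
        reply = json.loads(f.readline())  # skip asynchronous QMP events
        if "return" in reply or "error" in reply:
            return reply

json.loads(f.readline())                  # consume the QMP greeting
qmp("qmp_capabilities")

# Reopen the protocol node with O_DIRECT for the duration of the migration;
# the same call with "direct": False switches it back once migration completes.
qmp("blockdev-reopen", options=[{
    "driver": "host_device",
    "node-name": "lun0",                  # assumed node name of the shared LV
    "filename": "/dev/vg0/pvc-volume",    # illustrative device path
    "cache": {"direct": True},
}])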

Best Regards,
Andrei Kvapil


On Wed, Aug 16, 2023 at 11:31 AM Alberto Faria <afaria@redhat.com> wrote:
On Mon, Jun 19, 2023 at 6:29 PM kvaps <kvapss@gmail.com> wrote:
> Hi Kevin and the community,
>
> I am designing a CSI driver for Kubernetes that allows efficient
> utilization of SAN (Storage Area Network) and supports thin
> provisioning, snapshots, and ReadWriteMany mode for block devices.
>
> To implement this, I have explored several technologies such as
> traditional LVM, LVMThin (which does not support shared mode), and
> QCOW2 on top of block devices. This is the same approach as the one
> oVirt uses for thin provisioning over a shared LUN:
>
> https://github.com/oVirt/vdsm/blob/08a656c/doc/thin-provisioning.md
>
> Based on benchmark results, I found that the performance degradation
> of block-backed QCOW2 is much lower than that of LVM and LVMThin when
> creating snapshots.
>
> https://docs.google.com/spreadsheets/d/1mppSKhEevGl5ntBhZT3ccU5t07LwxXjQz1HM2uvBIuo/edit#gid=2020746352
>
> Therefore, I have decided to use the same approach for Kubernetes.
>
> But in Kubernetes, the storage system needs to be self-sufficient and
> not dependent on the workload that uses it. Thus, unlike oVirt, we
> have no option to use the libvirt interface of the running VM to
> invoke live migration. Instead, we have to provide a pure block device
> in ReadWriteMany mode, where the block device can be writable on
> multiple hosts simultaneously.
>
> To achieve this, I decided to use the qemu-storage-daemon with the
> VDUSE backend.
>
> Other technologies, such as NBD and UBLK, were also considered, and
> their benchmark results can be seen in the same document on a
> different sheet:
>
> https://docs.google.com/spreadsheets/d/1mppSKhEevGl5ntBhZT3ccU5t07LwxXjQz1HM2uvBIuo/edit#gid=416958126
>
> Taking into account performance, stability, and versatility, I
> concluded that VDUSE is the optimal choice. To connect the device in
> Kubernetes, the virtio-vdpa interface would be used, and the entire
> scheme could look like this (a rough sketch of the per-node setup
> follows the diagram):
>
>
> +---------------------+  +---------------------+
> | node1               |  | node2               |
> |                     |  |                     |
> |    +-----------+    |  |    +-----------+    |
> |    | /dev/vda  |    |  |    | /dev/vda  |    |
> |    +-----+-----+    |  |    +-----+-----+    |
> |          |          |  |          |          |
> |     virtio-vdpa     |  |     virtio-vdpa     |
> |          |          |  |          |          |
> |        vduse        |  |        vduse        |
> |          |          |  |          |          |
> | qemu-storage-daemon |  | qemu-storage-daemon |
> |          |          |  |          |          |
> | +------- | -------+ |  | +------- | -------+ |
> | | LUN    |        | |  | | LUN    |        | |
> | |  +-----+-----+  | |  | |  +-----+-----+  | |
> | |  | LV (qcow2)|  | |  | |  | LV (qcow2)|  | |
> | |  +-----------+  | |  | |  +-----------+  | |
> | +--------+--------+ |  | +--------+--------+ |
> |          |          |  |          |          |
> |          |          |  |          |          |
> +--------- | ---------+  +--------- | ---------+
>            |                        |
>            |         +-----+        |
>            +---------| SAN |--------+
>                      +-----+
>
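> A rough sketch of the per-node setup (device path, export name and
> queue count are only illustrative; the vdpa tool comes from iproute2):
>
> import subprocess
>
> LV = "/dev/vg0/pvc-volume"  # LV (qcow2) on the shared LUN
>
> # qemu-storage-daemon exposes the qcow2 node as a VDUSE block device...
> subprocess.Popen([
>     "qemu-storage-daemon",
>     "--blockdev", f"driver=host_device,node-name=lun0,filename={LV},cache.direct=on",
>     "--blockdev", "driver=qcow2,node-name=fmt0,file=lun0",
>     "--export", "type=vduse-blk,id=vduse0,node-name=fmt0,name=pvc-volume,"
>                 "num-queues=4,writable=on",
> ])
>
> # ...and once the export is up, binding it to the virtio-vdpa bus makes
> # it appear as /dev/vdX on the node.
> subprocess.run(["vdpa", "dev", "add", "name", "pvc-volume", "mgmtdev", "vduse"],
>                check=True)
>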
> Despite two independent instances of qemu-storage-daemon running
> successfully on different hosts against the same qcow2 disk, I have
> concerns about their proper functioning. Similar to live migration, I
> think they should share state with each other.
>
> The question is how to make qemu-storage-daemon share state between
> multiple nodes, or is the qcow2 format inherently stateless and does
> not require this?
>
> --
> Best Regards,
> Andrei Kvapil

Hi Andrei,

Apologies for not getting back to you sooner.

Have you made progress on this?

AIUI, and as others have mentioned, it's not possible to safely access
a qcow2 file from more than one qemu-storage-daemon (qsd) instance at
once. Disabling caching might help ensure consistency of the image's
data, but there would still be no synchronization between the qsd
instances when they are manipulating qcow2 metadata. For example, two
instances could both allocate the same cluster for different writes,
since nothing coordinates their refcount and L2 table updates.

ReadWriteMany block volumes are something that we would eventually
like to support in Subprovisioner [1], for instance so KubeVirt live
migration can work with it. The best we have come up with is to export
the volume from a single qsd instance over the network using NBD,
whenever more than one node has the volume mounted. This means that
all but one node would be accessing the volume with degraded
performance, but that may be acceptable for use cases like KubeVirt
live migration. We would then somehow migrate the qsd instance from
the source node to the destination node whenever the former unmounts
it, so that the migrated VM can access the volume with full
performance. This may require adding live migration support to qsd
itself.
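
As a purely hypothetical sketch of the non-owning node's side (addresses,
ports and names are made up, and a second qsd instance acting as an NBD
client is only one possible way to attach the export):

import subprocess

# Instead of opening the qcow2 image itself, this node's qsd consumes the
# owning node's NBD export and re-exports it locally, e.g. as a vduse-blk
# device.
subprocess.Popen([
    "qemu-storage-daemon",
    "--blockdev", "driver=nbd,node-name=remote0,export=exp0,"
                  "server.type=inet,server.host=node1,server.port=10809",
    "--export", "type=vduse-blk,id=vduse0,node-name=remote0,name=pvc-volume,"
                "num-queues=4,writable=on",
])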

What are your thoughts on this approach?

Thanks,
Alberto

[1] https://gitlab.com/subprovisioner/subprovisioner

