Re: [Qemu-devel] [RFC] Propose the Fast Virtual Disk (FVD) image format
From: Chunqiang Tang
Subject: Re: [Qemu-devel] [RFC] Propose the Fast Virtual Disk (FVD) image format that outperforms QCOW2 by 249%
Date: Fri, 14 Jan 2011 15:56:00 -0500
> Based on my limited understanding, I think FVD shares a
> lot in common with the COW format (block/cow.c).
>
> But I think most of the advantages you mention could be considered as
> additions to either qcow2 or qed. At any rate, the right way to have
> that discussion is in the form of patches on the ML.
FVD is much more advanced than block/cow.c. I would be happy to discuss
possible ways to leverage existing code, but setting aside the details of
QCOW2, QED, and FVD, let's start with a discussion of what is needed for a
next-generation image format.
First of all, of course, we need high performance. Through extensive
benchmarking, I identified three major performance overheads in image
formats. The numbers cited below are based on the PostMark benchmark. See
the paper for more details,
http://researcher.watson.ibm.com/researcher/files/us-ctang/FVD-cow.pdf .
P1) Increased disk seek distance caused by a compact image’s distorted
data layout. Specifically, the average disk seek distance in QCOW2 is 460%
longer than that in a RAW image.
P2) Overhead of storing an image on a host file system. Specifically, a
RAW image stored on ext3 is 50-63% slower than a RAW image stored on a raw
partition.
P3) Overhead in reading or updating an image format’s on-disk metadata.
Due to this overhead, QCOW2 causes 45% more total disk I/Os (including
I/Os for accessing both data and metadata) than FVD does.
For P1), I use the term compact image instead of sparse image, because a
RAW image stored as a sparse file in ext3 is a sparse image, but is not a
compact image. A compact image stores data in such a way that the file
size of the image file is smaller than the size of the virtual disk
perceived by the VM. QCOW2 is a compact image. The disadvantage of a
compact image is that the data layout perceived by the guest OS differs
from the actual layout on the physical disk, which defeats many
optimizations in guest file systems. Consider one concrete example. When
the guest VM issues a disk I/O request to the hypervisor using a virtual
block address (VBA), QEMU’s block device driver translates the VBA into an
image block address (IBA), which specifies where the requested data are
stored in the image file, i.e., IBA is an offset in the image file. When a
guest OS creates or resizes a file system, it writes out the file system
metadata, which are all grouped together and assigned consecutive image
block addresses (IBAs) by QCOW2, despite the fact that the metadata’s
virtual block addresses (VBAs) are deliberately scattered for better
reliability and locality, e.g., co-locating inodes and file content blocks
in block groups. As a result, QCOW2's layout may cause a long disk seek
between accessing a file's metadata and accessing the file's content
blocks.
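The layout distortion can be made concrete with a minimal sketch (hypothetical code, not QCOW2's or FVD's actual allocator) of how a compact image assigns IBAs in first-write order, so VBAs that the guest deliberately scattered end up at consecutive image offsets:

```python
# Hypothetical sketch (not QCOW2's or FVD's real allocator): a compact image
# hands out image block addresses (IBAs) in first-write order, regardless of
# the virtual block addresses (VBAs) the guest uses.
BLOCK = 64 * 1024  # 64KB block size

class CompactImage:
    def __init__(self):
        self.vba_to_iba = {}  # the lookup table: VBA -> IBA
        self.next_iba = 0     # next free offset in the image file

    def write(self, vba):
        # The first write to a VBA allocates the next sequential IBA;
        # later writes reuse it.
        if vba not in self.vba_to_iba:
            self.vba_to_iba[vba] = self.next_iba
            self.next_iba += BLOCK
        return self.vba_to_iba[vba]

img = CompactImage()
# File-system metadata is deliberately scattered across the virtual disk
# (here, one VBA every 8GB, e.g., one header per block group)...
scattered_vbas = [0, 8 * 2**30, 16 * 2**30, 24 * 2**30]
ibas = [img.write(v) for v in scattered_vbas]
# ...yet the compact image packs them at consecutive offsets, so blocks the
# guest placed far apart become adjacent on the physical disk, and file
# contents written later land far away from their metadata.
print(ibas)  # [0, 65536, 131072, 196608]
```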
For P2), using a host file system is inefficient, because 1) historically
file systems are optimized for small files rather than large images, and
2) certain functions of a host file system are simply redundant with
respect to the function of a compact image, e.g., performing storage
allocation. Moreover, using a host file system not only adds overhead, but
also introduces data integrity issues. Specifically, if I/O uses O_DSYNC,
it may be too slow; if it uses O_DIRECT, it cannot guarantee data
integrity in the event of a host crash. See
http://lwn.net/Articles/348739/ .
P3) includes both the overhead of reading on-disk metadata and the
overhead of updating on-disk metadata. The former can be reduced by
minimizing the size of metadata so that they can be easily cached in
memory. Reducing the latter requires optimizations to avoid updating the
on-disk metadata whenever possible, while not compromising data integrity
in the event of a host crash.
In addition to addressing the performance overheads caused by P1-P3,
ideally the next-generation image format should meet the following
functional requirements, and perhaps more.
R1) Support storage over-commit.
R2) Support compact image, copy-on-write, copy-on-read, and adaptive
prefetching.
R3) Allow eliminating the host file system to achieve high performance.
R4) Make all these features orthogonal, i.e., each feature can be enabled
or disabled individually without affecting other features. The purpose is
to support diverse use cases. For example, a copy-on-write image can use a
RAW-image-like data layout to avoid the overhead associated with a compact
image.
Storage over-commit means that, e.g., a 100GB physical disk can be used to
host 10 VMs, each with a 20GB virtual disk. This is possible because not
every VM completely fills up its 20GB virtual disk. It is not mandatory to
use a compact image in order to support storage over-commit. For example,
RAW images stored as sparse files on ext3 support storage over-commit.
Copy-on-read and adaptive prefetching complement copy-on-write in certain
use cases, e.g., in a Cloud where the backing image is stored on
network-attached storage (NAS) while the copy-on-write image is stored on
direct-attached storage (DAS). When the VM reads a block from the backing
image, a copy of the data is saved in the copy-on-write image for later
reuse. Adaptive prefetching finds resource idle times to copy from NAS to
DAS parts of the image that have not been accessed by the VM before.
Prefetching should be conservative in that if the driver detects a
contention on any resource (including DAS, NAS, or network), it pauses
prefetching temporarily and resumes prefetching later when congestion
disappears.
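The conservative pause-and-resume policy can be sketched as follows (a hypothetical illustration; the function name, the throughput signal, and the threshold are assumptions, not FVD's actual mechanism):

```python
# Hypothetical illustration of the conservative prefetch policy described
# above: copy chunks from NAS to DAS, but pause as soon as contention on
# any resource is detected.
def prefetch(chunks, throughput_mbps, min_mbps=50):
    """Return the chunks copied before the first sign of contention.

    throughput_mbps[i] is the observed rate while copying chunks[i]; a drop
    below min_mbps is treated as contention on NAS, DAS, or the network,
    and prefetching pauses (to be resumed once the congestion clears).
    """
    copied = []
    for chunk, mbps in zip(chunks, throughput_mbps):
        if mbps < min_mbps:
            break  # pause here; a later pass would resume from this chunk
        copied.append(chunk)
    return copied

# Throughput collapses while copying chunk 2, so prefetching pauses there.
print(prefetch([0, 1, 2, 3], [120, 90, 30, 100]))  # [0, 1]
```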
Next, let me briefly describe how FVD is designed to address the
performance issues P1-P3 and the functional requirements R1-R4. FVD has
the following features.
F1) Use a bitmap to implement copy-on-write.
F2) Use a one-level lookup table to implement compact image.
F3) Use a journal to commit changes to the bitmap and the lookup table.
F4) Store a compact image on a logical volume to support storage
over-commit, and to avoid the overhead and data integrity issues of a host
file system.
For F1), a bit in the bitmap tracks the state of a block. The bit is 0 if
the block is in the base image, and the bit is 1 if the block is in the
FVD image. The default size of a block is 64KB, the same as in QCOW2. To
represent the state of a 1TB base image, FVD only needs a 2MB bitmap,
which can be easily cached in memory. The same bitmap also supports
copy-on-read and adaptive prefetching.
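The 2MB figure is easy to verify (plain arithmetic, not FVD code): one bit per 64KB block of a 1TB base image.

```python
# Checking the 2MB bitmap figure: one bit tracks the state of each 64KB
# block in a 1TB base image.
TB = 2**40
BLOCK = 64 * 1024
blocks = TB // BLOCK        # 16,777,216 blocks
bitmap_bytes = blocks // 8  # one bit per block
print(bitmap_bytes // 2**20)  # 2  (i.e., a 2MB bitmap)
```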
For F2), one entry in the table maps the virtual disk address of a chunk
to an offset in the FVD image where the chunk is stored. The default size
of a chunk is 1MB, the same as in VirtualBox VDI (VMware VMDK and Microsoft
VHD use a chunk size of 2MB). For a 1TB virtual disk, the size of the
lookup table is only 4MB. Because of this small size, there is no need for
a two-level lookup table like the one in QCOW2.
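The 4MB figure checks out the same way (plain arithmetic; the 4-byte entry size is inferred from the totals, not quoted from an FVD specification):

```python
# Checking the 4MB lookup-table figure for a 1TB virtual disk with 1MB
# chunks; the per-entry size is an inference from the stated totals.
TB = 2**40
CHUNK = 2**20               # 1MB chunks
ENTRY_BYTES = 4             # assumed per-entry size
entries = TB // CHUNK       # 1,048,576 chunks in a 1TB virtual disk
table_bytes = entries * ENTRY_BYTES
print(table_bytes // 2**20)  # 4  (i.e., a 4MB one-level table)
```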
F1) and F2) are essential. They meet the requirement R4), i.e., the
features of copy-on-write and compact image can be enabled individually.
F1) and F2) are closest to the Microsoft Virtual Hard Disk (VHD) format,
which also uses a bitmap and a one-level table. There are some key
differences though. VHD partitions the bitmap and stores a fragment of the
bitmap with every 2MB chunk. As a result, VHD does not meet the
requirement R4, because it cannot have a copy-on-write image that uses a
RAW-image-like data layout. Also because of that, a bit in VHD can only
represent the state of a 512-byte sector (if a bit represented a 64KB
block, the chunk size then has to be 2GB, which is way too large and makes
storage over-commit ineffective). For a 1TB image, the size of the bitmap
in VHD is 256MB, vs. 2MB in FVD, which makes caching more difficult.
F3) uses a journal to commit metadata updates. The journal is not
essential and alternative implementations exist, but it does help address
P3) (i.e., it reduces metadata update overhead) and it simplifies the
implementation. By default, the size of the journal is 16MB.
When the bitmap and/or the lookup table are updated by a write, the
changes are saved in the journal. When the journal is full, the entire
bitmap and the entire lookup table are flushed to the disk, and the
journal can be recycled for reuse. Because the bitmap and the lookup table
are small, the flush is quick. The journal provides several benefits.
First, updating both the bitmap and the lookup table requires only a
single write to the journal. Second, K concurrent updates to any portions
of the bitmap or the lookup table are converted into K sequential writes in
the journal, which can be merged into a single write by the host Linux
kernel. Third, it increases concurrency by avoiding locking the bitmap or
the lookup table. For example, updating one bit in the bitmap requires
writing a 512-byte sector to the on-disk bitmap. This bitmap sector covers
a total of 512*8*64K=256MB data. That is, any two writes that target that
256MB data and require updating the bitmap cannot be processed
concurrently. The journal solves this problem and eliminates locking.
For F4), it is actually quite straightforward to eliminate the host file
system. The main thing that an image format needs from the host file
system is to perform storage allocation. This function, however, is
already performed by a compact image. Using a host file system simply ends
up doing storage allocation twice, which requires updating on-disk
metadata twice and introduces a distorted data layout twice. Therefore, if
we migrate the necessary functions of a host file system into the image
format, in other words, implement a mini file system inside the image
format, then we can get rid of the host file system. This is exactly what
FVD does, by slightly enhancing the compact image function that is already
there. FVD can manage incrementally added storage space, like ZFS and
unlike ext2/3/4. For example, when FVD manages a 100GB virtual disk, it
initially gets 5GB storage space from the logical volume manager and uses
it to host many 1MB chunks. When the first 5GB is used up, FVD gets
another 5GB to host more 1MB chunks, and so forth. Unlike QCOW2 and more
like a file system, FVD need not always allocate a new chunk immediately
after the previously allocated one. Instead, it may spread out used chunks
across the storage space in order to mimic a RAW-image-like data layout.
More details will be explained in follow-up emails.
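The spread-out allocation idea might look like this (purely hypothetical: the text only says FVD may spread chunks, and this scaling policy is an illustration, not FVD's actual algorithm):

```python
# Hypothetical sketch of spreading chunks: place each 1MB chunk at an
# offset proportional to its virtual position, scaled into the space
# obtained so far from the logical volume manager.
CHUNK = 2**20
DISK_CHUNKS = 100 * 1024  # a 100GB virtual disk, in 1MB chunks

def place(chunk_index, space_chunks):
    # Map the chunk's virtual position into the currently allocated space,
    # mimicking a RAW-like layout instead of packing chunks in write order.
    return (chunk_index * space_chunks // DISK_CHUNKS) * CHUNK

# With the initial 5GB (5120 chunks) of space, the first and last virtual
# chunks land near the start and end of that space, not next to each other.
print(place(0, 5120) // CHUNK, place(DISK_CHUNKS - 1, 5120) // CHUNK)  # 0 5119
```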
The description above is long but still only a summary. More detailed
information is available on the project web site,
http://researcher.watson.ibm.com/researcher/view_project.php?id=1852 .
Hopefully I have given a summary of the problems, the requirements, and
the solutions in FVD, which can serve as the basis for a productive
discussion.
Regards,
ChunQiang (CQ) Tang, Ph.D.
Homepage: http://www.research.ibm.com/people/c/ctang