Re: [Qemu-devel] Re: Strategic decision: COW format


From: Chunqiang Tang
Subject: Re: [Qemu-devel] Re: Strategic decision: COW format
Date: Mon, 14 Mar 2011 12:32:58 -0400

> > FVD's novel use of the reference count table reduces the metadata update
> > overhead to literally zero during normal execution of a VM. This gets
> > the best of QCOW2's reference count table but without its overhead. In
> > FVD, the reference count table is only updated when creating a new
> > snapshot or deleting an existing snapshot. The reference count table is
> > never updated during normal execution of a VM.
> 
> Do you want to send out a break-down of the steps (and cost) involved in
> doing:
> 
> 1. Snapshot creation.
> 2. Snapshot deletion.
> 3. Opening an image with n snapshots.

Here is a detailed description. Relevant to the discussion of snapshots, 
FVD uses a one-level lookup table and a refcount table. FVD’s one-level 
lookup table is very similar to QCOW2’s two-level lookup table, except 
that it is much smaller, and is preallocated and hence contiguous in the 
image. 
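
To make the comparison concrete, here is a rough sketch in C of how a 
one-level lookup table turns a guest offset into an image-file offset. The 
entry width, the “unallocated” sentinel, and the names are illustrative 
assumptions for this email, not FVD’s actual on-disk format:

#include <stdint.h>

#define FVD_CHUNK_SIZE   (1ULL << 20)    /* 1MB chunks                      */
#define FVD_UNALLOCATED  0xFFFFFFFFu     /* hypothetical "not mapped" value */

/* Translate a guest virtual-disk offset to an offset in the image file
 * with a single indexing step (QCOW2 needs two table levels for this). */
static uint64_t virtual_to_image(const uint32_t *lookup, uint64_t virt_offset)
{
    uint64_t v = virt_offset / FVD_CHUNK_SIZE;      /* virtual chunk index */
    uint32_t entry = lookup[v];
    if (entry == FVD_UNALLOCATED) {
        return UINT64_MAX;                          /* not yet written     */
    }
    return (uint64_t)entry * FVD_CHUNK_SIZE + virt_offset % FVD_CHUNK_SIZE;
}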

FVD’s refcount table is almost identical to that of QCOW2, but with a key 
difference. An image consists of an arbitrary number of read-only 
snapshots, and a single writeable image front, which is the current image 
state perceived by the VM. Below, I will simply refer to the read-only 
snapshots as snapshots, and refer to the “writeable image front” as 
“writeable-front.” QCOW2’s refcount table counts clusters that are used by 
either read-only snapshots or writeable-front. Because writeable-front 
changes as the VM runs, QCOW2 needs to update the refcount table on the 
fast path of normal VM execution. 

By contrast, FVD’s refcount table only counts chunks that are used by 
read-only snapshots, and does not count chunks used by writeable-front. This 
is the key that allows FVD to entirely avoid updating the refcount table 
on the fast path of normal VM execution. 

Below are the detailed steps for different operations.

Operation: Open an image with n snapshots.

Let me introduce some basic concepts first. The storage allocation unit in 
FVD is called a chunk (like a cluster in QCOW2). The default chunk size is 
1MB, like that in VDI (VMDK and Microsoft VHD use 2MB chunks). An FVD 
image file is conceptually divided into chunks, where chunk 0 is the first 
1MB of the image file, chunk 1 is the second 1MB, …, chunk j, …, and so 
forth. The size of an image file grows as needed, just like that of QCOW2. 
The refcount table is a linear array “uint16_t refcount[]”. If a chunk j 
is referenced by s different snapshots, then refcount[j] = s. If a new 
snapshot is created and this new snapshot also uses chunk j, then 
refcount[j] is incremented to s+1.
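
As a rough sketch in C of the data structure just described (the struct 
and helper names are made up for illustration; only the “uint16_t 
refcount[]” array comes from the description above):

#include <stdint.h>

#define FVD_CHUNK_SIZE (1ULL << 20)    /* default 1MB chunk size */

/* Hypothetical container for the refcount metadata discussed here. */
typedef struct FvdMeta {
    uint16_t *refcount;   /* refcount[j] = number of snapshots using chunk j */
    uint64_t  nb_chunks;  /* number of chunks currently in the image file    */
} FvdMeta;

/* Map a byte offset within the image file to its chunk index j. */
static inline uint64_t chunk_index(uint64_t image_offset)
{
    return image_offset / FVD_CHUNK_SIZE;
}

/* Creating a snapshot that uses chunk j bumps refcount[j] from s to s+1;
 * deleting a snapshot that used chunk j decrements it again. */
static void snapshot_add_reference(FvdMeta *m, uint64_t j)
{
    m->refcount[j]++;
}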

If all snapshots together use 1TB of storage space, there are 
1TB/1MB=1,000,000 chunks, and the size of the refcount table is 2MB. 
Loading the entire 2MB refcount table from disk into memory takes about 15 
milliseconds. If the virtual disk size perceived by the VM is also 1TB, 
FVD’s one-level lookup table is 4MB. FVD’s one-level lookup table serves 
the same purpose as QCOW2’s two-level lookup table, but FVD’s one-level 
table is much smaller and is preallocated and hence contiguous in the 
image. Loading the entire 4MB lookup table from disk into memory takes 
about 20 milliseconds. These numbers mean that it is quite affordable to 
scan both tables in their entirety at VM boot time, although the scan can 
also be avoided in FVD. The optimizations are described later.
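
A quick back-of-the-envelope check of these numbers (the 2-byte refcount 
entries and 4-byte lookup entries are inferred from the 2MB and 4MB 
figures above, not read out of the FVD source):

#include <stdio.h>
#include <stdint.h>

int main(void)
{
    uint64_t disk_size  = 1ULL << 40;              /* 1TB virtual disk   */
    uint64_t chunk_size = 1ULL << 20;              /* 1MB chunks         */
    uint64_t nb_chunks  = disk_size / chunk_size;  /* ~1,000,000 chunks  */

    uint64_t refcount_table = nb_chunks * sizeof(uint16_t);  /* 2MB */
    uint64_t lookup_table   = nb_chunks * sizeof(uint32_t);  /* 4MB */

    printf("chunks: %llu, refcount table: %lluMB, lookup table: %lluMB\n",
           (unsigned long long)nb_chunks,
           (unsigned long long)(refcount_table >> 20),
           (unsigned long long)(lookup_table >> 20));
    return 0;
}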

When opening an image with n snapshots, an unoptimized version of FVD 
performs the following steps:

O1: Load the entire 2MB reference count table from disk into memory. This 
step takes about 15ms.

O2: Load the entire 4MB lookup table from disk into memory. This step 
takes about 20ms.

O3: Use the two tables to build an in-memory data structure called 
“free-chunk-bitmap.” This step takes about 2ms. The free-chunk-bitmap 
identifies free chunks that are not used by either the snapshots or 
writeable-front, and hence can be allocated for future writes. The size of the 
free-chunk-bitmap is only 125KB for a 1TB disk, and hence the memory 
overhead is negligible. The free-chunk-bitmap also supports trim 
operations. The free-chunk-bitmap does not have to be persisted on disk as 
it can always be rebuilt easily, although as an optimization it can be 
persisted on disk on VM shutdown.

O4: Compare the refcount table and the lookup table to identify chunks 
that are in both tables (i.e., shared), so that a write by the running VM 
to those chunks in writeable-front triggers copy-on-write. This step takes 
about 2ms. One bit in each lookup table entry is stolen to mark whether a 
chunk in writeable-front is shared with snapshots and hence needs 
copy-on-write upon a write.
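
Steps O3 and O4 can be sketched roughly as follows. The lookup-entry 
layout used here (one uint32_t per virtual chunk, a stolen top bit for 
“shared with a snapshot”, an all-ones value for “unallocated”) is an 
assumption made for this example, not FVD’s actual format, and for 
simplicity the number of virtual chunks and image-file chunks is taken to 
be equal:

#include <stdint.h>
#include <string.h>

#define CHUNK_SHARED_BIT  (1u << 31)   /* hypothetical stolen marker bit     */
#define CHUNK_UNALLOCATED 0xFFFFFFFFu  /* hypothetical "not mapped" sentinel */

/* Step O3: a chunk is free iff no snapshot references it and
 * writeable-front does not map it; only such chunks may be allocated
 * for future writes. */
static void build_free_chunk_bitmap(const uint16_t *refcount,
                                    const uint32_t *lookup,
                                    uint64_t nb_chunks,
                                    uint8_t *free_bitmap)
{
    memset(free_bitmap, 0xFF, (size_t)((nb_chunks + 7) / 8)); /* start all-free */

    for (uint64_t j = 0; j < nb_chunks; j++) {
        if (refcount[j] > 0) {                    /* used by a snapshot      */
            free_bitmap[j / 8] &= ~(1u << (j % 8));
        }
    }
    for (uint64_t v = 0; v < nb_chunks; v++) {    /* used by writeable-front */
        if (lookup[v] != CHUNK_UNALLOCATED) {
            uint64_t j = lookup[v] & ~CHUNK_SHARED_BIT;
            free_bitmap[j / 8] &= ~(1u << (j % 8));
        }
    }
}

/* Step O4: mark writeable-front chunks that are also referenced by a
 * snapshot, so that a later guest write to them triggers copy-on-write. */
static void mark_shared_chunks(const uint16_t *refcount,
                               uint32_t *lookup,
                               uint64_t nb_chunks)
{
    for (uint64_t v = 0; v < nb_chunks; v++) {
        uint32_t entry = lookup[v];
        if (entry != CHUNK_UNALLOCATED &&
            refcount[entry & ~CHUNK_SHARED_BIT] > 0) {
            lookup[v] = entry | CHUNK_SHARED_BIT;
        }
    }
}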

The whole process above, i.e., opening an image with n (e.g., n=1000) 
snapshots, takes about 39ms, and it is a one-time cost at VM boot. Later, 
I will describe optimizations that can further reduce this 39ms by saving 
the 125KB free-chunk-bitmap to disk on VM shutdown, but that optimization 
is more than likely an over-engineering effort, given that 39ms at VM 
boot is perhaps not a big deal.

Once the image is opened, the refcount table is discarded and never 
updated during normal execution of the VM. This is how FVD gets the best 
of QCOW2’s refcount table but without its runtime overhead. When the VM 
issues a write to a chunk in writeable-front and this chunk is shared with 
snapshots (this is known by looking at the special marker bit in lookup 
table entries), FVD uses the free-chunk-bitmap to find a free chunk, 
performs copy-on-write, and updates the lookup table to point to that new 
chunk. This metadata change to the lookup table is persisted in FVD’s 
journal and is not written directly to the writeable-front’s on-disk 
lookup table. When the VM shuts down gracefully, the entire lookup table 
is written back to the writeable-front’s on-disk lookup table (this step 
takes about 20ms), and FVD’s image header is updated to indicate that 
there is no need to recover from the journal on the next VM boot.
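
In code, that write fast path looks roughly like this. The struct layout 
and the helpers find_free_chunk(), copy_chunk(), and journal_append() are 
placeholders standing in for FVD’s real allocator, data mover, and 
journal:

#include <stdint.h>

#define CHUNK_SHARED_BIT (1u << 31)   /* same hypothetical marker bit as above */

typedef struct FvdState {
    uint32_t *lookup;       /* in-memory one-level lookup table */
    uint8_t  *free_bitmap;  /* in-memory free-chunk-bitmap      */
    uint64_t  nb_chunks;
} FvdState;

/* Placeholder helpers, not FVD's real functions. */
uint64_t find_free_chunk(uint8_t *bitmap, uint64_t nb_chunks);
void     copy_chunk(FvdState *s, uint64_t from, uint64_t to);
void     journal_append(FvdState *s, uint64_t vchunk, uint64_t new_chunk);

/* Called before writing guest data that lands in virtual chunk vchunk. */
static void prepare_write(FvdState *s, uint64_t vchunk)
{
    uint32_t entry = s->lookup[vchunk];

    if (entry & CHUNK_SHARED_BIT) {              /* shared with a snapshot  */
        uint64_t old_chunk = entry & ~CHUNK_SHARED_BIT;
        uint64_t new_chunk = find_free_chunk(s->free_bitmap, s->nb_chunks);

        copy_chunk(s, old_chunk, new_chunk);     /* copy-on-write           */
        s->lookup[vchunk] = (uint32_t)new_chunk; /* now points to own copy  */

        /* The change is persisted in the journal; the on-disk lookup table
         * itself is only rewritten on a clean shutdown. */
        journal_append(s, vchunk, new_chunk);
    }
    /* ... the guest's data write then goes to the (possibly new) chunk ... */
}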

Now let me describe an optimization that can further reduce the 39ms time 
needed for opening an image with n snapshots. When the VM shuts down 
gracefully, FVD can save the in-memory 125KB free-chunk-bitmap to disk. 
When the VM reboots, FVD can load the 125KB free-chunk-bitmap from disk 
and skip steps O1, O2, and O3. Since the lookup table is saved to disk on 
a clean shutdown and one bit in its entries marks whether a chunk is 
shared with snapshots, the scanning step in O4 can also be avoided. In 
other words, steps O1-O4 can all be avoided after a clean shutdown; they 
are needed only after a host crash. During recovery after a host crash, 
the journal is used to rebuild the lookup table, and the refcount table 
and the lookup table together are used to rebuild the free-chunk-bitmap.
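
Put together, the optimized open path looks roughly like this; every name 
here (FvdImage, the header flag, and the load/replay/build helpers) is a 
placeholder, and only the control flow mirrors the text:

typedef struct FvdImage FvdImage;

/* Placeholder helpers, not FVD's real functions. */
int  image_was_cleanly_shut_down(FvdImage *img);         /* reads the image header */
void load_free_chunk_bitmap(FvdImage *img);              /* 125KB, saved at shutdown */
void load_refcount_table(FvdImage *img);                 /* step O1 */
void load_lookup_table(FvdImage *img);                   /* step O2 */
void replay_journal(FvdImage *img);                      /* re-apply journaled lookup-table updates */
void build_free_chunk_bitmap_from_tables(FvdImage *img); /* step O3 */
void mark_shared_chunks_in_lookup(FvdImage *img);        /* step O4 */

static void fvd_open_optimized(FvdImage *img)
{
    if (image_was_cleanly_shut_down(img)) {
        /* Fast path: the on-disk lookup table (including its shared-bit
         * markers) and the saved free-chunk-bitmap are already up to date. */
        load_free_chunk_bitmap(img);
    } else {
        /* Recovery after a host crash: bring the lookup table up to date
         * from the journal, then rebuild the in-memory state as in O1-O4. */
        load_refcount_table(img);                  /* O1 */
        load_lookup_table(img);                    /* O2 */
        replay_journal(img);
        build_free_chunk_bitmap_from_tables(img);  /* O3 */
        mark_shared_chunks_in_lookup(img);         /* O4 */
    }
}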

This is a description of FVD's image open operation.

(…to be continued… I am running out of time now, and will write about 
FVD’s other snapshot operations in separate emails.) 

Regards,
ChunQiang (CQ) Tang
Homepage: http://www.research.ibm.com/people/c/ctang




