Re: [Qemu-devel] Re: Strategic decision: COW format


From: Kevin Wolf
Subject: Re: [Qemu-devel] Re: Strategic decision: COW format
Date: Mon, 14 Mar 2011 11:12:35 +0100
User-agent: Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.9.1.15) Gecko/20101027 Fedora/3.0.10-1.fc12 Thunderbird/3.0.10

On 13.03.2011 06:51, Chunqiang Tang wrote:
> After the heated debate, I thought more about the right approach to
> implementing snapshots, and it became clear to me that there are major
> limitations with both VMDK's external snapshot approach (which stores each
> snapshot as a separate CoW file) and QCOW2's internal snapshot approach
> (which stores all snapshots in one file and uses a reference count table
> to keep track of them). I just posted to the mailing list a patch that
> implements internal snapshots in FVD, but does so in a way that avoids
> the limitations of VMDK and QCOW2.
> 
> Let's first list the properties of an ideal virtual disk snapshot 
> solution, and then discuss how to achieve them.
> 
> G1: Do no harm (or avoid being a misfeature), i.e., the added snapshot
> code should not slow down the runtime performance of an image that has no
> snapshots.  This implies that an image without snapshots should not cache
> the reference count table in memory and should not update the on-disk
> reference count table.
> 
> G2: Even better, an image with 1 snapshot runs as fast as an image without
> snapshots.
> 
> G3: Even even better, an image with 1,000 snapshots runs as fast as an
> image without snapshots. This basically means getting the snapshot feature
> for free.
> 
> G4: An image with 1,000 snapshots consumes no more memory than an image
> without snapshots. This again means getting the snapshot feature for free.
> 
> G5: Regardless of the number of existing snapshots, creating a new 
> snapshot is fast, e.g., taking no more than 1 second.
> 
> G6: Regardless of the number of existing snapshots, deleting a snapshot is 
> fast, e.g., taking no more than 1 second.
> 
> Now let's evaluate VMDK and QCOW2 against these ideal properties. 
> 
> G1: VMDK good; QCOW2 poor
> G2: VMDK ok; QCOW2 poor
> G3: VMDK very poor; QCOW2 poor
> G4: VMDK very poor; QCOW2 poor
> G5: VMDK good; QCOW2 good
> G6: VMDK poor; QCOW2 good

Okay. I don't think I agree with all of these. I'm not entirely sure how
VMDK works, so I'm taking this as "random image format that uses backing
files" (so it also applies to qcow2 with backing files, which I hope
isn't too confusing).

G1: VMDK good; QCOW2 poor for cache=writethrough, ok otherwise; QCOW3 good
G2: VMDK ok; QCOW2 good
G3: VMDK poor; QCOW2 good
G4: VMDK very poor; QCOW2 ok
G5: VMDK good; QCOW2 good
G6: VMDK very poor; QCOW2 good

Also, let me add another feature which I believe is an important factor
in the decision between internal and external snapshots:

G7: Loading/Reverting to a snapshot is fast
G7: VMDK good; QCOW2 ok

> On the other hand, QCOW2's internal snapshots have two major limitations
> that hurt runtime performance: caching the reference count table in memory
> and updating the on-disk reference count table. If we can eliminate both,
> then it is an ideal solution.

It's not even necessary to get rid of it completely. What hurts is
writing the additional metadata. So if you can delay writing the
metadata and only write out a refcount block once you need to load the
next one into memory, the overhead is lost in the noise (remember, even
with 64k clusters, a refcount block covers 2 GB of virtual disk space).
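
The arithmetic behind that number, assuming a refcount block is one
cluster full of the current 16-bit refcount entries (just a back-of-the-
envelope check, not actual qcow2 code):

    #include <inttypes.h>
    #include <stdio.h>

    int main(void)
    {
        /* One refcount block is a single cluster of 16-bit refcount
         * entries; each entry covers one cluster of guest data. */
        uint64_t cluster_size = 64 * 1024;                   /* 64k clusters */
        uint64_t entries_per_block = cluster_size / 2;       /* 32768 entries */
        uint64_t covered = entries_per_block * cluster_size; /* guest bytes */

        printf("one refcount block covers %" PRIu64 " GB\n",
               covered / (1024 * 1024 * 1024));              /* -> 2 GB */
        return 0;
    }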

We already do that for qcow2 in all writeback cache modes. We can't do
it yet for cache=writethrough, but we were planning to allow using QED's
dirty flag approach, which would get rid of these writes in writethrough
mode as well.

I think this explains my estimation for G1.
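
Roughly, the dirty flag idea looks like this (made-up names; a sketch of
the approach, not the actual QED or qcow2 code):

    #include <stdbool.h>

    /* Sketch: set a dirty bit in the image header before the first
     * write that can leave refcounts stale on disk, and clear it
     * again once the metadata is consistent.  If the bit is still set
     * when the image is opened, the refcounts are rebuilt by scanning
     * the metadata instead of being trusted. */

    struct image {
        bool dirty;          /* mirrors a flag in the image header */
    };

    /* hypothetical helper: would write the flag to the on-disk header */
    static void sync_header_dirty_flag(struct image *img, bool dirty)
    {
        img->dirty = dirty;
    }

    static void before_allocating_write(struct image *img)
    {
        if (!img->dirty) {
            /* one header update per dirty period, instead of one
             * refcount block update per allocating write */
            sync_header_dirty_flag(img, true);
        }
    }

    static void on_flush_or_clean_close(struct image *img)
    {
        /* metadata written back, image consistent again */
        sync_header_dirty_flag(img, false);
    }

    static void on_open(struct image *img)
    {
        if (img->dirty) {
            /* unclean shutdown: rebuild refcounts from the L1/L2
             * tables before using the image */
        }
    }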

For G2 and G3, I'm not sure why you think that having internal snapshots
slows down operation. It's basically just data that sits in the image
file and is unused. After startup or after deleting a snapshot you
probably have to scan the whole refcount table again for cluster
allocations -- is this what you mean?

For G4, the size of snapshots in memory, the only overhead of internal
snapshots that I could think of is the snapshot table. I would hardly
rate this as "poor".

For G5 and G6 I basically agree with your estimation, except that I
think that the overhead of deleting an external snapshot is _really_
bad. This is one of the major problems we have with external snapshots
today.

> In an internal snapshot implementation, the reference count table is used
> to track used blocks and free blocks. It serves no other purpose. In FVD,
> the "static" reference count table only tracks blocks used by (static)
> snapshots, and it does not track blocks (dynamically) allocated (on a
> write) or freed (on a trim) for the running VM. This is a simple but
> fundamental difference w.r.t. QCOW2, whose reference count table tracks
> both the static content and the dynamic content. Because data blocks used
> by snapshots are static and do not change unless a snapshot is created or
> deleted, there is no need to update FVD's "static" reference count table
> while a VM runs, and actually there is not even a need to cache it in
> memory. Data blocks that are dynamically allocated or freed for a running
> VM are already tracked by FVD's one-level lookup table (which is similar
> to QCOW2's two-level table, but in FVD it is much smaller and faster) even
> before the snapshot feature is introduced, and hence this comes for free.
> Updating FVD's one-level lookup table is efficient because of FVD's
> journal.

So when is a cluster considered free? Only if both its refcount is 0 and
it's not referenced by a used lookup table entry?

How do you check the latter condition without scanning the whole lookup
table?

> When the VM boots, FVD scans the reference count table once to build a 
> so-called free-block-bitmap in memory, which identifies blocks not used by 
> static snapshots. The reference count table is then thrown away and never 
> updated when the VM runs.
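
If I read that correctly, the boot-time scan amounts to roughly the
following (made-up names and layout, just to check my understanding, not
FVD's actual code):

    #include <stdint.h>
    #include <string.h>

    /* Sketch only: build an in-memory free-block bitmap from the
     * on-disk "static" reference count table.  A block is marked used
     * if any snapshot references it; everything else starts out free. */
    void build_free_block_bitmap(const uint16_t *refcount_table,
                                 uint64_t nb_blocks,
                                 uint8_t *free_bitmap /* nb_blocks bits */)
    {
        uint64_t i;

        /* start with "all blocks free" */
        memset(free_bitmap, 0xff, (nb_blocks + 7) / 8);

        for (i = 0; i < nb_blocks; i++) {
            if (refcount_table[i] > 0) {
                /* referenced by at least one snapshot -> not free */
                free_bitmap[i / 8] &= ~(1 << (i % 8));
            }
        }

        /* Presumably blocks referenced by the current lookup table
         * would have to be marked used as well (see my question
         * above).  After that the refcount table itself is no longer
         * needed at runtime. */
    }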

This is an implementation detail and not related to the format.

Kevin


