[Qemu-devel] QCOW2 deduplication design
From: Benoît Canet
Subject: [Qemu-devel] QCOW2 deduplication design
Date: Wed, 9 Jan 2013 16:24:44 +0100
User-agent: Mutt/1.5.21 (2010-09-15)
Hello,
Here is a mail to open a discussion on QCOW2 deduplication design and
performance.
The current deduplication strategy is RAM based.
One of the goals of the project is to plan and implement an alternative way to
do the lookups from disk for bigger images.
In a first section I will enumerate the disk overheads of the RAM based lookup
strategy, then in a second section the additional costs of doing lookups in a
disk based prefix b-tree.
Comments and suggestions are welcome.
I) RAM based lookups overhead
The qcow2 read path is not modified by the deduplication patchset.
Each cluster written gets its hash computed.
Two GTrees are used to give access to the hashes: one indexed by hash and
another indexed by physical offset.
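The two-index scheme can be sketched as follows. This is a minimal Python sketch with dicts standing in for the two GTrees; the class and method names are illustrative, not the QEMU API:

```python
# Minimal sketch of the dual-index hash store. Illustrative only: the real
# implementation uses two GLib GTrees inside QEMU, not Python dicts.

class DedupStore:
    def __init__(self):
        self.by_hash = {}    # hash -> physical cluster offset
        self.by_offset = {}  # physical cluster offset -> hash

    def insert(self, h, offset):
        self.by_hash[h] = offset
        self.by_offset[offset] = h

    def lookup_by_hash(self, h):
        # Used on the write path to detect duplicated clusters.
        return self.by_hash.get(h)

    def remove_by_offset(self, offset):
        # Used on cluster removal (see I.5): the stale cluster is known only
        # by its physical offset, so the second index is needed to find its
        # hash and remove both entries.
        h = self.by_offset.pop(offset, None)
        if h is not None:
            del self.by_hash[h]
        return h
```

The second index exists purely to make removal by physical offset cheap; this is the same structure whose disk-based equivalent becomes the bottleneck in section II.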
I.0) unaligned write
When a write is unaligned or smaller than a 4KB cluster, the deduplication code
issues one or two reads to get the missing data required to build a 4KB*n
linear buffer.
The deduplication metrics code shows that this situation does not happen with
virtio and ext3 as a guest partition.
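The read-modify-write needed for such an unaligned request can be sketched as follows. The 4KB granularity is the one stated above; the function itself is an illustrative sketch, not the QEMU code:

```python
CLUSTER_SIZE = 4096  # 4KB deduplication granularity, as stated above

def align_write(offset, length):
    """Return the (read_offset, read_length) pairs needed to complete an
    unaligned write into a linear buffer of whole 4KB clusters.
    Illustrative sketch only."""
    start = (offset // CLUSTER_SIZE) * CLUSTER_SIZE
    end_unaligned = offset + length
    end = -(-end_unaligned // CLUSTER_SIZE) * CLUSTER_SIZE  # round up
    reads = []
    if start < offset:                       # head read before the write
        reads.append((start, offset - start))
    if end_unaligned < end:                  # tail read after the write
        reads.append((end_unaligned, end - end_unaligned))
    return reads
```

An aligned 4KB-multiple write yields no extra reads; a request that crosses both cluster boundaries yields the two reads mentioned in the text.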
I.1) First write overhead
The hash is computed.
The cluster is not duplicated so the hash is stored in a linked list.
After that the writev call gets a new 64KB L2 dedup hash block corresponding to
the physical sector of the written cluster.
(This can be an allocating write requiring writing the offset of the new block
in the dedup table and a flush.)
The hash is written in the L2 dedup hash block and flushed later by the
dedup_block_cache.
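The mapping from a written cluster to its 64KB dedup hash block can be sketched as follows. The 4KB cluster and 64KB block sizes come from the text; the 32-byte on-disk hash entry size is an assumption for illustration:

```python
# Sketch of locating the L2 dedup hash block covering a physical cluster.
# Assumption: each on-disk hash entry occupies 32 bytes (illustrative).

CLUSTER_SIZE = 4 * 1024
HASH_BLOCK_SIZE = 64 * 1024
HASH_ENTRY_SIZE = 32
ENTRIES_PER_BLOCK = HASH_BLOCK_SIZE // HASH_ENTRY_SIZE  # 2048 entries

def hash_block_index(physical_offset):
    """Index into the dedup table of the 64KB hash block whose entries
    cover the cluster at physical_offset."""
    cluster_index = physical_offset // CLUSTER_SIZE
    return cluster_index // ENTRIES_PER_BLOCK
```

With these assumed sizes one hash block covers 2048 consecutive clusters, which is why a first write may need an allocating write of a whole new block.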
I.2) Same cluster rewrite at the same place
The hash is computed.
qcow2_get_cluster_offset is called and the result is used to check that it is a
rewrite.
The cluster is counted as duplicated and not rewritten on disk.
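The decision between a first write (I.1), a same-place rewrite (I.2), and a duplicated write (I.3/I.4) can be sketched as follows. Only qcow2_get_cluster_offset is a real function name; here a dict stands in for it and the rest is illustrative:

```python
# Sketch of the write-classification decision in I.1-I.4 (illustrative).

def classify_write(by_hash, logical_to_physical, hash_value, logical_offset):
    """by_hash: hash -> physical offset of the first cluster with that hash.
    logical_to_physical: stand-in for qcow2_get_cluster_offset."""
    dedup_physical = by_hash.get(hash_value)
    if dedup_physical is None:
        return "first-write"         # I.1: store the hash, write the data
    current_physical = logical_to_physical.get(logical_offset)
    if current_physical == dedup_physical:
        return "same-place-rewrite"  # I.2: count as duplicated, skip write
    return "duplicated"              # I.3/I.4: point the L2 entry at the
                                     # existing cluster, bump its refcount
```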
I.3) First duplicated cluster write
The hash is computed.
qcow2_get_cluster_offset is called and we see that we are not rewriting the
same cluster at the same place.
I.3.a) The L2 entry of the first cluster written with this hash is overwritten
to remove the QCOW_OFLAG_COPIED flag.
I.3.b) The dedup hash block of the hash is overwritten to remember at the next
startup that QCOW_OFLAG_COPIED has been cleared.
A new L2 entry is created for this logical sector pointing to the physical
cluster. (Potential allocating write.)
The refcount of the physical cluster is updated.
I.4) Duplicated clusters further writes
Same as I.3 without steps I.3.a and I.3.b.
I.5) cluster removal
When an L2 entry to a cluster becomes stale the qcow2 code decrements the
refcount.
When the refcount reaches zero the L2 hash block of the stale cluster
is written to clear the hash.
This happens often and requires the second GTree to find the hash by its
physical sector number.
I.6) max refcount reached
The L2 hash block of the cluster is written in order to remember at next
startup that it must not be used anymore for deduplication. The hash is
dropped from the GTrees.
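The saturation handling can be sketched as follows. The 16-bit refcount limit is qcow2's default; the function and the dict stand-ins for the GTrees are illustrative:

```python
MAX_REFCOUNT = (1 << 16) - 1  # qcow2 refcounts are 16 bits wide by default

def ref_dedup_cluster(by_hash, by_offset, refcounts, h):
    """Take one more reference on the physical cluster behind hash h.
    Sketch only: by_hash/by_offset stand in for the two GTrees."""
    offset = by_hash[h]
    if refcounts[offset] >= MAX_REFCOUNT:
        # I.6: the hash is dropped from both trees (and marked on disk as
        # not reloadable); the next identical write allocates a fresh
        # cluster with its own, duplicate, hash entry.
        del by_hash[h]
        del by_offset[offset]
        return None
    refcounts[offset] += 1
    return offset
```

The duplicate hash entry created after saturation is exactly what makes II.6 awkward for a b-tree with unique keys.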
II) Disk based lookup additional overhead
My initial idea is to replace the RAM based GTree indexed by hash
by a prefix b-tree stored on disk.
ict.pue.udlap.mx/people/carlos/is215/papers/p11-bayer.pdf
As hashes are mostly random the prefix tree should work well.
An additional data structure would be needed to do the lookups by
physical offsets.
This additional data structure is clearly a bottleneck in this design.
Each lookup will cost O(log n) disk I/Os.
Insertion and deletion can cost twice as many I/Os (need to split and merge
leaves and nodes).
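A back-of-envelope estimate of the lookup depth helps size this cost. All figures below are assumptions for illustration: a 1TB image, the 4KB dedup granularity from section I, and 256 keys per b-tree node:

```python
import math

# Rough estimate of b-tree lookup depth. Every number here is an assumed
# figure for illustration, not something fixed by the design above.

cluster_size = 4 * 1024            # 4KB dedup granularity (from section I)
image_size = 1 * 1024**4           # assume a fully written 1TB image
n = image_size // cluster_size     # number of hashed clusters

fanout = 256                       # assumed keys per b-tree node
depth = math.ceil(math.log(n, fanout))
print(n, depth)                    # 268435456 clusters, 4 node reads/lookup
```

So with these assumptions a cold lookup costs about 4 node reads; caching the upper levels of the tree in RAM would reduce that in practice.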
II.1) First write overhead
O(log n) lookup in the b-tree for the hash.
Disk lookups by physical sector to remove stale hash nodes.
(What structure to use for this?)
When the hash is written the leaf of the failed lookup can be reused.
If the leaf is full, splitting the leaf and restructuring the tree must be done
in O(log n).
Update of a not yet defined way to look up b-tree leaves by their offset on
disk. (Potential allocating write.)
II.2) Same cluster rewrite at the same place
Lookup in the b-tree to find the hash.
II.3) First duplicated cluster write
Lookup in the b-tree to find the hash.
The leaf of the b-tree must be overwritten to remember that we have cleared
QCOW_OFLAG_COPIED.
II.4) Duplicated cluster further writes
Lookup in the b-tree to find the hash.
II.5) cluster removal
A query to find the hash b-tree leaf by sector must be done on a not yet
defined data structure.
The hash must be removed from the b-tree leaf.
Two methods can be used to bypass the need for this additional data structure:
- Read the cluster, then compute its hash and use the hash for the removal.
- Store hash and cluster forever while setting their refcount to a special
value meaning "reserved for future reuse".
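The first bypass option can be sketched as follows. The SHA-256 choice and the callback names are assumptions for illustration; the point is trading one extra cluster read for not maintaining a by-offset index on disk:

```python
import hashlib

# Sketch of the first bypass option in II.5: recompute the hash of the
# stale cluster instead of maintaining an on-disk by-offset index.
# read_cluster and btree_delete are hypothetical callbacks; SHA-256 is an
# assumed hash function.

def remove_stale_hash(btree_delete, read_cluster, offset):
    """When a cluster's refcount reaches zero, read it back, hash it, and
    delete that hash from the on-disk b-tree. Costs one extra cluster read
    plus the O(log n) b-tree deletion."""
    data = read_cluster(offset)
    h = hashlib.sha256(data).digest()
    btree_delete(h)
    return h
```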
II.6) max refcount reached
The RAM implementation just drops the hash corresponding to the written
cluster and marks on disk that this hash must not be reloaded. A duplicate
hash is then created.
I have not found yet how to handle this with a b-tree as it is not supposed
to contain duplicate keys.
Regards
Benoît