
Re: [Qemu-devel][PATCH,RFC] Zero cluster dedup


From: Kevin Wolf
Subject: Re: [Qemu-devel][PATCH,RFC] Zero cluster dedup
Date: Wed, 03 Sep 2008 14:47:16 +0200
User-agent: Thunderbird 2.0.0.12 (X11/20071114)

Shahar Frank wrote:
>>>  typedef struct QCowHeader {
>>>      uint32_t magic;
>>>      uint32_t version;
>>> @@ -140,8 +148,10 @@ typedef struct BDRVQcowState {
>>>      uint32_t refcount_table_size;
>>>      uint64_t refcount_block_cache_offset;
>>>      uint16_t *refcount_block_cache;
>>> -    int64_t free_cluster_index;
>>> +    int64_t free_cluster_index1;
>>> +    int64_t free_cluster_index2;
>> These names are not exactly intuitive. At least you should immediately
>> see which one is for big chunks and which one for single clusters.
> 
> OK. How about free_single_cluster_index and free_multi_cluster_index?

Yes, would be fine with me.
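That is, in BDRVQcowState:

    int64_t free_single_cluster_index;  /* single cluster allocations */
    int64_t free_multi_cluster_index;   /* big (multi-cluster) chunks */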


>>> -static int qcow_write(BlockDriverState *bs, int64_t sector_num,
>>> +static int write_cluster(BlockDriverState *bs, int64_t sector_num, uint64_t cluster_offset,
>>>                       const uint8_t *buf, int nb_sectors)
>>>  {
>>>      BDRVQcowState *s = bs->opaque;
>>> -    int ret, index_in_cluster, n;
>>> -    uint64_t cluster_offset;
>>> -    int n_end;
>>> +    int index_in_cluster = sector_num & (s->cluster_sectors - 1);
>>> +    int n_end = index_in_cluster + nb_sectors;
>>> +    int n = nb_sectors;
>>> +    int ret;
>>>
>>> -    while (nb_sectors > 0) {
>>> -        index_in_cluster = sector_num & (s->cluster_sectors - 1);
>>> -        n_end = index_in_cluster + nb_sectors;
>>> -        if (s->crypt_method &&
>>> -            n_end > QCOW_MAX_CRYPT_CLUSTERS * s->cluster_sectors)
>>> +    if (s->crypt_method && n_end > QCOW_MAX_CRYPT_CLUSTERS * s->cluster_sectors)
>>>              n_end = QCOW_MAX_CRYPT_CLUSTERS * s->cluster_sectors;
>>> +
>>> +    if (!cluster_offset)
>>>          cluster_offset = alloc_cluster_offset(bs, sector_num << 9,
>>>                                                index_in_cluster,
>>>                                                n_end, &n);
>>> @@ -1152,6 +1163,89 @@ static int qcow_write(BlockDriverState *bs, int64_t sector_num,
>>>          }
>>>          if (ret != n * 512)
>>>              return -1;
>>> +    return 0;
>>> +}
>> What is this change good for? I think qcow_write is small enough and
>> doesn't need further splitting.
> 
> Maybe it is not obvious from the patch, but I did it not because
> qcow_write is too big (it is not), but because I wanted to use its
> internal core without the new dedup check.

Sorry, I missed the zero cluster allocation. You're right.
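If I read the split correctly, qcow_write itself then shrinks to a loop
around write_cluster, roughly like the sketch below. (Only a sketch --
it assumes write_cluster is changed to return the number of sectors it
wrote, which the posted patch doesn't do.)

    static int qcow_write(BlockDriverState *bs, int64_t sector_num,
                          const uint8_t *buf, int nb_sectors)
    {
        while (nb_sectors > 0) {
            /* cluster_offset == 0 tells write_cluster to allocate */
            int n = write_cluster(bs, sector_num, 0, buf, nb_sectors);
            if (n <= 0)
                return -1;
            nb_sectors -= n;
            sector_num += n;
            buf += n * 512;
        }
        return 0;
    }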

>>> +
>>> +static int is_not_zero(const uint8_t *buf, int len)
>>> +{
>>> +    int i;
>>> +    len >>= 2;
>>> +    for(i = 0;i < len; i++) {
>>> +        if (((uint32_t *)buf)[i] != 0)
>>> +            return 1;
>>> +    }
>>> +    return 0;
>>> +}
>> Why negative logic? I'd prefer buf_is_zero(). Also, could probably be
>> an inline function.
>>
> 
> You are completely right. I just took this function from qemu-img, but
> this is not a very good excuse... ;)

Oh, I see. Then what about moving it into a common header file instead
of copying?
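
Something like this in a shared header would do (just a sketch; where
exactly it lives is up for discussion):

    /* e.g. in block_int.h; positive logic as suggested above */
    static inline int buf_is_zero(const uint8_t *buf, int len)
    {
        int i;

        len >>= 2;
        for (i = 0; i < len; i++) {
            if (((const uint32_t *)buf)[i] != 0)
                return 0;
        }
        return 1;
    }

The dedup check then becomes !buf_is_zero(buf, s->cluster_size).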

>>> +    uint64_t cluster_offset = sector_num << 9, poffset;
>>> +
>>> +    DEBUG("%s 0x%llx", bs->filename, cluster_offset);
>>> +    if (!zero_dedup || (sector_num & (s->cluster_sectors - 1)) ||
>>> +        *nb_sectors < cluster || is_not_zero(buf, s->cluster_size))
>>> +        return 0;          /* cannot/should not dedup */
>> Why do you check the value of nb_sectors? It's never used and only an
>> output parameter.
> 
> This is wrong. nb_sectors is an input/output parameter.

But if I'm not completely mistaken, you don't use its input value apart
from this check. I just double-checked this and I still don't see its
usage as an input parameter.

>>> +
>>> +    DEBUG("%s zero dedup 0x%llx", bs->filename, cluster_offset);
>>> +    *nb_sectors = cluster; /* allocate exactly one cluster */
>>> +
>>> +    if (!s->zero_cluster) {
>>> +        if (!(poffset = alloc_clusters_noref(bs, s->cluster_size)) ||
>>> +            write_cluster(bs, sector_num, poffset, buf, cluster) < 0)
>>> +                return 0;
>>> +        s->zero_cluster = poffset;
>>> +    }
>> I'm not sure how best to avoid this, but you are accumulating zero
>> clusters because you create a new one for each VM run.
> 
> This is true. It is not so bad, and the alternative as I see it is to
> change the qcow2 format to remember the zero_cluster. At this stage I
> don't think it justifies that. On the other hand, it would be very nice
> to add support for qcow header extensions for optional features. In
> fact my initial implementation did this (in a backwards-compatible
> way), but I intentionally left it out because it is a standalone issue.

Ok. What about leaving it as it is for now and just adding a TODO comment?
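
Right at the allocation would be a natural place, e.g.:

    if (!s->zero_cluster) {
        /* TODO The zero cluster is allocated anew on each VM run and
         * never found again on the next run, so unused zero clusters
         * accumulate. Persisting it cleanly would need something like
         * a qcow2 header extension. */
        ...
    }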

>>> @@ -1315,6 +1411,13 @@ static void qcow_aio_write_cb(void *opaque, int ret)
>>>          qemu_aio_release(acb);
>>>          return;
>>>      }
>>> +        acb->n = acb->nb_sectors;
>>> +    } while ((ret = qcow_dedup(bs, acb->sector_num, acb->buf, &acb->n)) > 0);
>> I'm not sure this loop is the right approach. When you get a big write
>> request for zeros, what you are doing is to split it up into clusters
>> and change and write the L2 table to disk for each single cluster.
>>
>> This is different from the other read/write functions which try to use
>> as big chunks as possible to minimize the overhead.
> 
> If I get your point, I tend to disagree. Even though there may be
> cases where a large write is broken up due to the zero dedup, I think
> they will be rare, and the tradeoff of saving space should be worth it.

I don't think it will be that rare. After all, as you said in your
initial comment, you are doing a cleanup of the complete disk. I'd
expect that large requests will be quite common in such a case.

Please note that I don't talk about trading off disk space against
writes here, it's about optimizing the dedup.

> For the cluster granularity issue, it is meaningless for dedup (zero
> or otherwise). Deduped clusters are rarely deduped to sequential
> clusters, and anyhow a dedup write operation just changes the metadata
> of the block (the logical offset's L2 entry plus the reference count
> of the physical cluster), so no large write optimization can be done.

The L2 table still has to be written each time, though. Say we have a
64k write: then you could save 15 L2 table writes if qcow_dedup could
handle zeroing multiple contiguous clusters at once. This could matter
for performance when wiping lots of data.
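
Roughly what I have in mind, as a sketch (link_clusters_to_zero() is a
made-up helper that would update all affected L2 entries and refcounts
in one go):

    /* Count how many whole clusters of the request are zero... */
    int nb_clusters = 0;

    while (nb_sectors >= s->cluster_sectors &&
           buf_is_zero(buf + nb_clusters * s->cluster_size,
                       s->cluster_size)) {
        nb_clusters++;
        nb_sectors -= s->cluster_sectors;
    }

    /* ...and map them all to s->zero_cluster with one L2 update
     * instead of one per cluster. */
    if (nb_clusters > 0)
        ret = link_clusters_to_zero(bs, sector_num, nb_clusters);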

Laurent, do you have an opinion on this? You should know this stuff
better than I do, as you did the corresponding patches.

Kevin



