qemu-devel

RE: [Qemu-devel][PATCH,RFC] Zero cluster dedup


From: Shahar Frank
Subject: RE: [Qemu-devel][PATCH,RFC] Zero cluster dedup
Date: Wed, 3 Sep 2008 05:05:07 -0700

> -----Original Message-----
> From: Kevin Wolf [mailto:address@hidden
> Sent: Wednesday, September 03, 2008 12:39 PM
> To: Shahar Frank
> Cc: address@hidden; Laurent Vivier
> Subject: Re: [Qemu-devel][PATCH,RFC] Zero cluster dedup
> 
> >
> > An additional included change is a small improvement for the
> > alloc_clusters_noref function. The original version supported a very
> > simple first fit algorithm, with a simple pointer to remember the
> > last position reached while searching for free space. The main
> > problem with the original function was that the pointer scheme is
> > not optimized for different allocation sizes. If, for example, you
> > allocate a large space of 64k (8 clusters), the pointer would skip
> > many smaller free spaces, such that any future allocation, even for
> > one cluster, would not be able to use the skipped space. The
> > included improvement is very simple: it uses two pointers, one for
> > single cluster allocations and one for bigger allocations. This may
> > still be too simple, but it should solve the simple one cluster
> > allocation case (which should be very common).
> 
> This is a different issue and needs to go in a separate patch, IMHO.

Agreed. I will do it soon.

> >
> > +int zero_dedup = 1;
> 
> This value is never changed, only checked in an if condition. You
> probably want to allow the user to set it (or drop it). Would
> enable_zero_dedup be a better name?

This variable is just a global flag that should be controlled from the
qemu command line parameters or otherwise. I didn't want to cover this
issue yet, so I just left it as it is.
 
enable_zero_dedup is a better name. I will change it in the next patch.

> 
> > +
> > +#if defined(DEBUG_ALLOC) || defined(DEBUG_ALLOC2)
> > +#define DEBUG(msg, args...)     fprintf(stderr, "(%d) %s: " msg "\n", getpid(),  __FUNCTION__, ##args)
> > +#else
> > +#define DEBUG(msg, args...)
> > +#endif
> > +
> >  typedef struct QCowHeader {
> >      uint32_t magic;
> >      uint32_t version;
> > @@ -140,8 +148,10 @@ typedef struct BDRVQcowState {
> >      uint32_t refcount_table_size;
> >      uint64_t refcount_block_cache_offset;
> >      uint16_t *refcount_block_cache;
> > -    int64_t free_cluster_index;
> > +    int64_t free_cluster_index1;
> > +    int64_t free_cluster_index2;
> 
> These names are not exactly intuitive. At least you should immediately
> see which one is for big chunks and which one for single clusters.

OK. How about free_single_cluster_index, and free_multi_cluster_index ?

> > -static int qcow_write(BlockDriverState *bs, int64_t sector_num,
> > +static int write_cluster(BlockDriverState *bs, int64_t sector_num, uint64_t cluster_offset,
> >                       const uint8_t *buf, int nb_sectors)
> >  {
> >      BDRVQcowState *s = bs->opaque;
> > -    int ret, index_in_cluster, n;
> > -    uint64_t cluster_offset;
> > -    int n_end;
> > +    int index_in_cluster = sector_num & (s->cluster_sectors - 1);
> > +    int n_end = index_in_cluster + nb_sectors;
> > +    int n = nb_sectors;
> > +    int ret;
> >
> > -    while (nb_sectors > 0) {
> > -        index_in_cluster = sector_num & (s->cluster_sectors - 1);
> > -        n_end = index_in_cluster + nb_sectors;
> > -        if (s->crypt_method &&
> > -            n_end > QCOW_MAX_CRYPT_CLUSTERS * s->cluster_sectors)
> > +    if (s->crypt_method && n_end > QCOW_MAX_CRYPT_CLUSTERS * s->cluster_sectors)
> >              n_end = QCOW_MAX_CRYPT_CLUSTERS * s->cluster_sectors;
> > +
> > +    if (!cluster_offset)
> >          cluster_offset = alloc_cluster_offset(bs, sector_num << 9,
> >                                                index_in_cluster,
> >                                                n_end, &n);
> > @@ -1152,6 +1163,89 @@ static int qcow_write(BlockDriverState *bs, int64_t sector_num,
> >          }
> >          if (ret != n * 512)
> >              return -1;
> > +    return 0;
> > +}
> 
> What is this change good for? I think qcow_write is small enough and
> doesn't need further splitting.

Maybe it is not obvious from the patch, but I did it not because
qcow_write is too big (it is not), but because I wanted to use its
internal core without the new dedup check.

> > +
> > +static int is_not_zero(const uint8_t *buf, int len)
> > +{
> > +    int i;
> > +    len >>= 2;
> > +    for(i = 0;i < len; i++) {
> > +        if (((uint32_t *)buf)[i] != 0)
> > +            return 1;
> > +    }
> > +    return 0;
> > +}
> 
> Why negative logic? I'd prefer buf_is_zero(). Also, could probably be
> an inline function.
>

You are completely right. I just took this function from qemu-img, but
this is not a very good excuse... ;)
 
> > +
> > +static int qcow_remap(BlockDriverState *bs, uint64_t loffset, uint64_t poffset)
> > +{
> > +    uint64_t l2_offset, *l2_table;
> > +    BDRVQcowState *s = bs->opaque;
> > +    uint64_t ooffset;
> > +    int l2_index, ret;
> > +
> > +    ret = get_cluster_table(bs, loffset, &l2_table, &l2_offset, &l2_index);
> > +    if (ret == 0)
> > +        return 0;
> > +
> > +    ooffset = be64_to_cpu(l2_table[l2_index]);
> > +    if (ooffset == poffset)
> > +        return 1;  /* nothing to do, already mapped to poffset */
> > +
> > +    if (ooffset) {
> > +        DEBUG("free old offset %llx (%llx)", ooffset, ooffset & ~QCOW_OFLAG_COPIED);
> > +        free_any_clusters(bs, ooffset & ~QCOW_OFLAG_COPIED, 1);
> > +    }
> > +
> > +    /* update L2 table */
> > +    l2_table[l2_index] = cpu_to_be64(poffset);
> > +
> > +    if (bdrv_pwrite(s->hd,
> > +                    l2_offset + l2_index * sizeof(uint64_t),
> > +                    l2_table + l2_index, sizeof(uint64_t)) != sizeof(uint64_t))
> > +        return 0;
> > +
> > +    update_refcount(bs, poffset, s->cluster_size, 1);
> 
> Might be better for robustness to update the refcount first and then
> write the L2 table.
> 
> > +    return 1;
> > +}
> > +
> > +static int qcow_dedup(BlockDriverState *bs, int64_t sector_num,
> > +                             const uint8_t *buf, int *nb_sectors)
> > +{
> > +    BDRVQcowState *s = bs->opaque;
> > +    int cluster = s->cluster_sectors;
> 
> You're using this merely as an abbreviation for s->cluster_sectors, not
> to have different semantics. Now while it may have saved you typing a
> few characters, everybody knows what s->cluster_sectors means but has to
> look up what cluster might be here.
>

OK. Will be fixed.

> > +    uint64_t cluster_offset = sector_num << 9, poffset;
> > +
> > +    DEBUG("%s 0x%llx", bs->filename, cluster_offset);
> > +    if (!zero_dedup || (sector_num & (s->cluster_sectors - 1)) ||
> > +        *nb_sectors < cluster || is_not_zero(buf, s->cluster_size))
> > +        return 0;          /* cannot/should not dedup */
> 
> Why do you check the value of nb_sectors? It's never used and only an
> output parameter.

This is wrong; nb_sectors is an input/output parameter.

> 
> > +
> > +    DEBUG("%s zero dedup 0x%llx", bs->filename, cluster_offset);
> > +    *nb_sectors = cluster; /* allocate exactly one cluster */
> > +
> > +    if (!s->zero_cluster) {
> > +        if (!(poffset = alloc_clusters_noref(bs, s->cluster_size)) ||
> > +            write_cluster(bs, sector_num, poffset, buf, cluster) < 0)
> > +                return 0;
> > +        s->zero_cluster = poffset;
> > +    }
> 
> I'm not sure how one could avoid this best, but you are accumulating
> zero clusters because you create a new one for each VM run.

This is true. It is not so bad, though, and the alternative as I see it
is to change the qcow2 format to remember the zero_cluster. At this
stage I don't think that is justified. On the other hand, it would be
very nice to add support for qcow header extensions for optional
features. In fact my initial implementation did this (in a backwards
compatible way), but I intentionally left it out because it is a
standalone issue.

> 
> > +
> > +    return qcow_remap(bs, cluster_offset, s->zero_cluster);
> > +}
> > +
> > +static int qcow_write(BlockDriverState *bs, int64_t sector_num,
> > +                     const uint8_t *buf, int nb_sectors)
> > +{
> > +    BDRVQcowState *s = bs->opaque;
> > +    int n;
> > +
> > +    while (nb_sectors > 0) {
> > +        n = nb_sectors;
> > +        if (qcow_dedup(bs, sector_num, buf, &n) <= 0 &&
> > +            write_cluster(bs, sector_num, sector_num << 9, buf, n) < 0)
> > +            return -1;
> > +
> >          nb_sectors -= n;
> >          sector_num += n;
> >          buf += n * 512;
> 
> So these are the three lines which made you split up the function?

No. See the above comments.

> 
> I'd prefer a nested if over this && usage.
> 
> > @@ -1298,6 +1392,7 @@ static void qcow_aio_write_cb(void *opaque, int ret)
> >
> >      acb->hd_aiocb = NULL;
> >
> > +    DEBUG("<<");
> 
> Left-over debug message.

True.

> 
> >      if (ret < 0) {
> >      fail:
> >          acb->common.cb(acb->common.opaque, ret);
> > @@ -1305,6 +1400,7 @@ static void qcow_aio_write_cb(void *opaque, int ret)
> >          return;
> >      }
> >
> > +    do {
> >      acb->nb_sectors -= acb->n;
> >      acb->sector_num += acb->n;
> >      acb->buf += acb->n * 512;
> 
> Indentation?

Will be fixed.

> 
> > @@ -1315,6 +1411,13 @@ static void qcow_aio_write_cb(void *opaque, int ret)
> >          qemu_aio_release(acb);
> >          return;
> >      }
> > +        acb->n = acb->nb_sectors;
> > +    } while ((ret = qcow_dedup(bs, acb->sector_num, acb->buf, &acb->n)) > 0);
> 
> I'm not sure this loop is the right approach. When you get a big write
> request for zeros, what you are doing is to split it up into clusters
> and change and write the L2 table to disk for each single cluster.
> 
> This is different from the other read/write functions which try to use
> as big chunks as possible to minimize the overhead.

If I get your point, I tend to disagree. Even though there may be cases
where a large write will be broken up due to the zero dedup, I think
they will be rare, and the tradeoff of saving space should be worth it.
As for the cluster granularity issue, it is meaningless for dedup (zero
or otherwise): deduped clusters are rarely deduped to sequential
clusters, and in any case a dedup write operation only changes the
metadata of the block (the logical offset's L2 entry plus the reference
count of the physical cluster), so no large write optimization can be
done. On the other hand, if I missed your intention, please elaborate.

> 
> > +
> > +    if (ret < 0) {
> > +        ret = -EIO;
> > +        goto fail;
> > +    }
> >
> >      index_in_cluster = acb->sector_num & (s->cluster_sectors - 1);
> >      n_end = index_in_cluster + acb->nb_sectors;
> > @@ -2172,25 +2275,33 @@ static int64_t alloc_clusters_noref(BlockDriverState *bs, int64_t size)
> >  {
> >      BDRVQcowState *s = bs->opaque;
> >      int i, nb_clusters;
> > +    uint64_t free_cluster_index = s->free_cluster_index1;
> >
> >      nb_clusters = (size + s->cluster_size - 1) >> s->cluster_bits;
> > +    if (nb_clusters > 1)
> > +        free_cluster_index = s->free_cluster_index2;
> > +
> >      for(;;) {
> > -        if (get_refcount(bs, s->free_cluster_index) == 0) {
> > -            s->free_cluster_index++;
> > +        if (get_refcount(bs, free_cluster_index) == 0) {
> > +            free_cluster_index++;
> >              for(i = 1; i < nb_clusters; i++) {
> > -                if (get_refcount(bs, s->free_cluster_index) != 0)
> > +                if (get_refcount(bs, free_cluster_index) != 0)
> >                      goto not_found;
> > -                s->free_cluster_index++;
> > +                free_cluster_index++;
> >              }
> >  #ifdef DEBUG_ALLOC2
> >              printf("alloc_clusters: size=%lld -> %lld\n",
> >                     size,
> > -                   (s->free_cluster_index - nb_clusters) << s->cluster_bits);
> > +                   (free_cluster_index - nb_clusters) << s->cluster_bits);
> >  #endif
> > -            return (s->free_cluster_index - nb_clusters) << s->cluster_bits;
> > +            if (nb_clusters > 1)
> > +                s->free_cluster_index2 = free_cluster_index;
> > +            else
> > +                s->free_cluster_index1 = free_cluster_index;
> > +            return (free_cluster_index - nb_clusters) << s->cluster_bits;
> >          } else {
> >          not_found:
> > -            s->free_cluster_index++;
> > +            free_cluster_index++;
> >          }
> >      }
> >  }
> > @@ -2374,8 +2485,13 @@ static int update_cluster_refcount(BlockDriverState *bs,
> >      refcount += addend;
> >      if (refcount < 0 || refcount > 0xffff)
> >          return -EINVAL;
> > -    if (refcount == 0 && cluster_index < s->free_cluster_index) {
> > -        s->free_cluster_index = cluster_index;
> > +    if (refcount == 0) {
> > +        if (cluster_index < s->free_cluster_index1)
> > +            s->free_cluster_index1 = cluster_index;
> > +        if (cluster_index < s->free_cluster_index2)
> > +            s->free_cluster_index2 = cluster_index;
> > +        if (cluster_index == s->zero_cluster)
> > +            s->zero_cluster = 0;
> >      }
> >      s->refcount_block_cache[block_index] = cpu_to_be16(refcount);
> 
> As said before, this last part belongs into its own patch, so I didn't
> review it now.
> 
> And then, one general thing: Recently, Laurent has done a great job
> adding comments to the functions dealing with clusters. I think it
> wouldn't hurt to add some comments to the new functions you introduce,
> too.

I agree. Comments help in this non-trivial code. Will be done.

> 
> Kevin

I will fix anything worth fixing and repost the patch.

Shahar



