qemu-devel
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [Qemu-devel] [PATCH v2 01/12] qcow2: Add new overlap check functions


From: Max Reitz
Subject: Re: [Qemu-devel] [PATCH v2 01/12] qcow2: Add new overlap check functions
Date: Wed, 04 Feb 2015 11:29:36 -0500
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:31.0) Gecko/20100101 Thunderbird/31.4.0

On 2015-02-03 at 18:08, Eric Blake wrote:
On 11/24/2014 08:56 AM, Max Reitz wrote:
The existing qcow2 metadata overlap detection function used existing
structures to determine the location of the image metadata, from plain
fields such as l1_table_offset and l1_size in the BDRVQcowState, over
image structures in memory such as the L1 table for the L2 tables'
positions, or it even read the required data directly from disk for
every requested check, such as the snapshot L1 tables for the inactive
L2 tables' positions.

These new functions instead keep a dedicated structure for keeping track
of the metadata positions in memory. It consists of two parts: First,
there is one structure which is basically a list of all metadata
structures. Each entry has a bitmask of types (because some metadata
structures may actually overlap, such as active and inactive L2 tables),
a number of clusters occupied and the offset from the previous entry in
clusters. This structure requires relatively few memory, but checking a
s/few/little/

certain point may take relatively long. Each entry is called a
"fragment".

Therefore, there is another representation which is a bitmap, or rather
a bytemap, of metadata types. The previously described list is split
into multiple windows with each describing a constant number of clusters
(WINDOW_SIZE). If the list is to be queried or changed, the respective
window is selected in constant time and the bitmap is generated from the
fragments belonging to the window. This bitmap can then be queried in
constant time and easily be changed.

Because the bitmap representation requires more memory, it is only used
as a cache. Whenever a window is removed from the cache, the fragment
list will be rebuilt from the bitmap if the latter has been modified.
Therefore, the fragment list is only used as the background
representation to save memory, whereas the bitmap is used whenever
possible.

Regarding the size of the fragment list in memory: As one refcount block
can handle cluster_size / 2 entries and one L2 table can handle
cluster_size / 8 entries, for a qcow2 image with the standard cluster
size of 64 kB, there is a ratio of data to metadata of about 1/6000
(1/32768 refblocks and 1/8192 L2 tables) if one ignores the fact that
not every cluster requires an L2 table entry. The refcount table and the
L1 table is generally negligible. At the worst, each metadata cluster
requires its own entry in the fragment list; each entry takes up four
bytes, therefore, at the worst, the fragment list should take up (for an
image with 64 kB clusters) (4 B) / (64 kB * 6000) of the image size,
which is about 1.e-8 (i.e., 11 kB for a 1 TB image, or 11 MB for a 1 PB
image).

Signed-off-by: Max Reitz <address@hidden>
---
  block/Makefile.objs   |   3 +-
  block/qcow2-overlap.c | 404 ++++++++++++++++++++++++++++++++++++++++++++++++++
  block/qcow2.h         |  13 ++
  3 files changed, 419 insertions(+), 1 deletion(-)
  create mode 100644 block/qcow2-overlap.c
Are you still hoping to get this in 2.3?

There's no reason to rush it into 2.3.

+++ b/block/qcow2-overlap.c
@@ -0,0 +1,404 @@
+/*
+ * QCOW2 runtime metadata overlap detection
+ *
+ * Copyright (c) 2014 Max Reitz <address@hidden>
Slow review means it is now 2015.

Well, if this series had been fine, 2014 would have been correct. ;-)

+/* Number of clusters which are covered by each metadata window;
+ * note that this may not exceed 2^16 as long as
+ * Qcow2MetadataFragment::relative_start is a uint16_t */
+#define WINDOW_SIZE 4096
So this says that for every 4096 clusters, we have one bytemap
representation, as well as a chain of up to 4096 fragment descriptors?

Indeed.

+
+/* Describes a fragment of a or a whole metadata range; does not necessarily
s/of a or/of or/

+ * describe the whole range because it needs to be split on window boundaries 
*/
+typedef struct Qcow2MetadataFragment {
+    /* Bitmask of QCow2MetadataOverlap values */
+    uint8_t types;
+    uint8_t nb_clusters;
+    /* Number of clusters between the start of the window and this range */
+    uint16_t relative_start;
So even I have a file with 4096 consecutive sectors all tied up in the
same purpose within a given window, I have to represent it as 16
Qcow2MetadataFragments rather than 1 fragment, because the uint8_t size
of nb_clusters limits me to at most 256 clusters per Fragment entry?

Yes, but remember that only metadata is covered, data clusters aren't. You'll have a much harder time finding a qcow2 image with 256 consecutive metadata clusters of the same type (it certainly can happen with L2 tables and refcount blocks, but it is much less likely).

And worst case, a window will have a list of 4096 fragments if every
cluster alternates between some other type.

Indeed, but again, very unlikely (at least it's very unlikely for a lot of windows to be like that; in general, we'll need one fragment per metadata cluster (at maximum) and it doesn't really matter whether they're all in the same window or spread all over the image).

If you're really concerned about DoS, I propose that we could let the user specify the maximum size to be used for the overlap structures, and if that is exceeded, no overlap checks will be performed (and we'd set that size to something that seems reasonable to us by default). There already is an option @cache_size for qcow2_create_empty_metadata_list(), which we could just use for the whole metadata structures and if it becomes impossible to adhere to that limit, just drop the checks (maybe coupled with a QMP event).

+} QEMU_PACKED Qcow2MetadataFragment;
Is QEMU_PACKED really essential here? If I'm not mistaken, this struct
is only ever kept in memory and not written out to disk.

It safes memory (albeit the compiler probably doesn't inject any padding anyway in this case).

On the other
hand, I understand that you are trying to ensure that the compiler
packed this into 32 bits rather than injecting any padding.  Would a
BUG_ON(sizeof(Qcow2MetadataFragment) == 4) be any better at representing
that fact?

I don't know. First, it wouldn't be better, because we'd need the QEMU_PACKED anyway. Second, if the compiler decides that even with QEMU_PACKED the structure should occupy five bytes, something definitely is wrong, but it doesn't really concern me here. As long as the structure as small as possible, that's fine.

So I can most certainly add that assertion, but I don't think it'll be too useful (and it won't replace the QEMU_PACKED).

+
+typedef struct Qcow2MetadataWindow {
+    Qcow2MetadataFragment *fragments;
+    int nb_fragments, fragments_array_size;
+
+    /* If not NULL, this is an expanded version of the "RLE" version given by
+     * the fragments array; there are WINDOW_SIZE entries */
+    uint8_t *bitmap;
+    bool bitmap_modified;
+
+    /* Time of last access */
+    unsigned age;
+
+    /* Index in Qcow2MetadataList::cached_windows */
+    int cached_windows_index;
+} Qcow2MetadataWindow;
+
+struct Qcow2MetadataList {
+    Qcow2MetadataWindow *windows;
+    uint64_t nb_windows;
+
+    unsigned current_age;
+
+    /* Index into the windows array */
+    int *cached_windows;
+    size_t nb_cached_windows;
+};
Is there a maximum size for nb_cached_windows before you start evicting
cached windows?

Yes, and it isn't fixed. This series does not introduce an option for the user to set it, but it would definitely be possible.

+
+/**
+ * Destroys the cached window bitmap. If it has been modified, the fragment 
list
+ * will be rebuilt accordingly.
+ */
+static void destroy_window_bitmap(Qcow2MetadataList *mdl,
+                                  Qcow2MetadataWindow *window)
+{
+    if (!window->bitmap) {
+        return;
+    }
+
+    if (window->bitmap_modified) {
+        int bitmap_i, fragment_i = 0;
+        QCow2MetadataOverlap current_types = 0;
+        int current_nb_clusters = 0;
+
+        /* Rebuild the fragment list; the case bitmap_i == WINDOW_SIZE is for
+         * entering the last fragment at the bitmap end */
+
+        for (bitmap_i = 0; bitmap_i <= WINDOW_SIZE; bitmap_i++) {
+            /* Qcow2MetadataFragment::nb_clusters is a uint8_t, so
+             * current_nb_clusters may not exceed 255 */
Wait. Why 255 and not 256? Can't you use nb_clusters==0 as a modulo for
256 consecutive clusters, as the fragments should never encode a
0-length run?

I could, and if you insist on it, I'll do. But I don't know if we'll gain a lot from it, and it'll make the code more complicated (although just how much more complicated remains to be seen). Maybe just storing nb_clusters - 1 is simpler.

That way, you can represent 4096 consecutive clusters in
16 fragments of 256 each, instead of 17 (16 of 255, and 1 of 16).

+            if (bitmap_i < WINDOW_SIZE &&
+                current_types == window->bitmap[bitmap_i] &&
+                current_nb_clusters < 255)
+            {
+                current_nb_clusters++;
+            } else {
+                if (current_types && current_nb_clusters) {
+                    if (fragment_i >= window->fragments_array_size) {
+                        window->fragments_array_size =
+                            3 * window->fragments_array_size / 2 + 1;
+
+                        /* new_nb_fragments should be small enough, and there 
is
+                         * nothing we can do on failure anyway, so do not use
+                         * g_try_renew() here */
+                        window->fragments =
+                            g_renew(Qcow2MetadataFragment, window->fragments,
+                                    window->fragments_array_size);
+                    }
+
+                    window->fragments[fragment_i++] = (Qcow2MetadataFragment){
+                        .types          = current_types,
+                        .nb_clusters    = current_nb_clusters,
+                        .relative_start = bitmap_i - current_nb_clusters,
+                    };
+                }
+
+                current_nb_clusters = 0;
+                if (bitmap_i < WINDOW_SIZE) {
+                    current_types = window->bitmap[bitmap_i];
+                }
+            }
+        }
+
+        window->nb_fragments = fragment_i;
Any need to clear window->bitmap_modified at this point? Or are you
careful to only rely on it when window->bitmap is non-NULL.

window->bitmap_modified is read only in one place, and that is right in this function which is only called if window->bitmap != NULL.

window->bitmap_modified is written in several places, but all of those are preceded by writes to the bitmap so window->bitmap is sure to be non-NULL there (if it was NULL, build_window_bitmap() is called beforehand).

+    }
+
+    g_free(window->bitmap);
+    window->bitmap = NULL;
+}
+
+/**
+ * Creates a bitmap from the fragment list.
+ */
+static void build_window_bitmap(Qcow2MetadataList *mdl,
+                                Qcow2MetadataWindow *window)
+{
+    int cache_i, oldest_cache_i = -1, i;
+    unsigned oldest_cache_age = 0;
+
+    for (cache_i = 0; cache_i < mdl->nb_cached_windows; cache_i++) {
+        unsigned age;
+
+        if (mdl->cached_windows[cache_i] < 0) {
+            break;
+        }
+
+        age = mdl->current_age - 
mdl->windows[mdl->cached_windows[cache_i]].age;
+        if (age > oldest_cache_age) {
+            oldest_cache_age = age;
+            oldest_cache_i = cache_i;
+        }
+    }
+
+    if (cache_i >= mdl->nb_cached_windows) {
+        destroy_window_bitmap(mdl,
+            &mdl->windows[mdl->cached_windows[oldest_cache_i]]);
+        cache_i = oldest_cache_i;
+    }
+
+    assert(cache_i >= 0);
+    mdl->cached_windows[cache_i] = window - mdl->windows;
+    window->cached_windows_index = cache_i;
+
+    window->age = mdl->current_age++;
+
+    window->bitmap = g_new0(uint8_t, WINDOW_SIZE);
+
+    for (i = 0; i < window->nb_fragments; i++) {
+        Qcow2MetadataFragment *fragment = &window->fragments[i];
+
+        memset(&window->bitmap[fragment->relative_start], fragment->types,
+               fragment->nb_clusters);
Hmm. If you do use my idea of nb_clusters==0 for 256, this needs a
special case. Another option would be storing number of clusters - 1 (so
a value of 0 is 1 cluster, a value of 255 is 256 clusters).

+    }
+
+    window->bitmap_modified = false;
+}
+
+/**
+ * Enters a new range into the metadata list.
+ */
+void qcow2_metadata_list_enter(BlockDriverState *bs, uint64_t offset,
+                               int nb_clusters, QCow2MetadataOverlap types)
+{
+    BDRVQcowState *s = bs->opaque;
+    uint64_t start_cluster = offset >> s->cluster_bits;
+    uint64_t end_cluster = start_cluster + nb_clusters;
+    uint64_t current_cluster = start_cluster;
+
+    types &= s->overlap_check;
+    if (!types) {
+        return;
+    }
+
+    if (offset_into_cluster(s, offset)) {
+        /* Do not enter apparently broken metadata ranges */
+        return;
+    }
+
+    while (current_cluster < end_cluster) {
+        int bitmap_i;
+        int bitmap_i_start = current_cluster % WINDOW_SIZE;
+        int bitmap_i_end = MIN(WINDOW_SIZE,
+                               end_cluster - current_cluster + bitmap_i_start);
+        uint64_t window_i = current_cluster / WINDOW_SIZE;
+        Qcow2MetadataWindow *window;
+
+        if (window_i >= s->metadata_list->nb_windows) {
+            /* This should not be happening too often, so it is fine to resize
+             * the array to exactly the required size */
+            Qcow2MetadataWindow *new_windows;
+
+            new_windows = g_try_renew(Qcow2MetadataWindow,
+                                      s->metadata_list->windows,
+                                      window_i + 1);
+            if (!new_windows) {
+                return;
+            }
+
+            memset(new_windows + s->metadata_list->nb_windows, 0,
+                   (window_i + 1 - s->metadata_list->nb_windows) *
+                   sizeof(Qcow2MetadataWindow));
+
+            s->metadata_list->windows = new_windows;
+            s->metadata_list->nb_windows = window_i + 1;
+        }
+
+        window = &s->metadata_list->windows[window_i];
+        if (!window->bitmap) {
+            build_window_bitmap(s->metadata_list, window);
+        }
+
+        for (bitmap_i = bitmap_i_start; bitmap_i < bitmap_i_end; bitmap_i++) {
+            window->bitmap[bitmap_i] |= types;
This adds in new types but keeps existing types listed.  Is that okay?

Yes. The caller is responsible for making sure all overlaps are intended (like an L2 table being both active an inactive at the same time).

+        }
+
+        window->age = s->metadata_list->current_age++;
+        window->bitmap_modified = true;
+
+        /* Go to the next window */
+        current_cluster += WINDOW_SIZE - bitmap_i_start;
+    }
+}
...

+++ b/block/qcow2.h
@@ -159,6 +159,9 @@ typedef struct QCowSnapshot {
  struct Qcow2Cache;
  typedef struct Qcow2Cache Qcow2Cache;
+struct Qcow2MetadataList;
+typedef struct Qcow2MetadataList Qcow2MetadataList;
+
  typedef struct Qcow2UnknownHeaderExtension {
      uint32_t magic;
      uint32_t len;
@@ -261,6 +264,7 @@ typedef struct BDRVQcowState {
bool discard_passthrough[QCOW2_DISCARD_MAX]; + Qcow2MetadataList *metadata_list;
      int overlap_check; /* bitmask of Qcow2MetadataOverlap values */
      bool signaled_corruption;
@@ -576,4 +580,13 @@ int qcow2_cache_get_empty(BlockDriverState *bs, Qcow2Cache *c, uint64_t offset,
      void **table);
  int qcow2_cache_put(BlockDriverState *bs, Qcow2Cache *c, void **table);
+/* qcow2-overlap.c functions */
+int qcow2_create_empty_metadata_list(BlockDriverState *bs, size_t cache_size,
+                                     Error **errp);
+void qcow2_metadata_list_destroy(BlockDriverState *bs);
+void qcow2_metadata_list_enter(BlockDriverState *bs, uint64_t offset,
+                               int nb_clusters, QCow2MetadataOverlap type);
+void qcow2_metadata_list_remove(BlockDriverState *bs, uint64_t offset,
+                                int nb_clusters, QCow2MetadataOverlap type);
+
  #endif

Looks reasonable in general.

Thanks! :-)

Max



reply via email to

[Prev in Thread] Current Thread [Next in Thread]