Re: [Qemu-devel] [RFC] qcow2 journalling draft


From: Kevin Wolf
Subject: Re: [Qemu-devel] [RFC] qcow2 journalling draft
Date: Wed, 4 Sep 2013 11:39:51 +0200
User-agent: Mutt/1.5.21 (2010-09-15)

First of all, excuse any inconsistencies in the following mail. I wrote
it from top to bottom, and there was some thought process involved in
almost every paragraph...

Am 04.09.2013 um 10:03 hat Stefan Hajnoczi geschrieben:
> On Tue, Sep 03, 2013 at 03:45:52PM +0200, Kevin Wolf wrote:
> > @@ -103,7 +107,11 @@ in the description of a field.
> >                      write to an image with unknown auto-clear features if it
> >                      clears the respective bits from this field first.
> >  
> > -                    Bits 0-63:  Reserved (set to 0)
> > +                    Bit 0:      Journal valid bit. This bit indicates that the
> > +                                image contains a valid main journal starting at
> > +                                journal_offset.
> 
> Whether the journal is used can be determined from the journal_offset
> value (header length must be large enough and journal offset must be
> valid).
> 
> Why do we need this autoclear bit?

Hm, I introduced this one first and the journal dirty incompatible bit
later, perhaps it's unnecessary now. Let's check...

The obvious thing we need to protect against is applying stale journal
data to an image that has been changed by an older version. As long as
the journal is clean, this can't happen, and the journal dirty bit will
ensure that the old version can only open the image if it is clean.

However, what if we run 'qemu-img check -r leaks' with an old qemu-img
version? It will reclaim the clusters used by the journal, and if we
continue using the journal we'll corrupt whatever new data is there
now.

Can we protect against this without using an autoclear bit?
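
For reference, this is roughly the protection the autoclear bit buys us
(just a sketch, all names are made up and this is not actual qcow2 driver
code):

    #include <stdint.h>
    #include <stdbool.h>

    /* Sketch only; names are made up for illustration. */
    #define AUTOCLEAR_JOURNAL_VALID (1ULL << 0)

    struct hdr_state {
        uint64_t autoclear_features;
        uint64_t journal_offset;
    };

    /* An implementation that doesn't know about the journal clears the
     * autoclear bit when it opens the image read-write. If the bit is
     * gone but journal_offset is still set, the journal clusters may
     * have been reclaimed as leaks and reused, so the journal must be
     * discarded instead of replayed. */
    static bool journal_may_be_replayed(const struct hdr_state *s)
    {
        return s->journal_offset != 0 &&
               (s->autoclear_features & AUTOCLEAR_JOURNAL_VALID);
    }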

> > +Journals are used to allow safe updates of metadata without impacting
> > +performance by requiring flushes to order updates to different parts of the
> > +metadata.
> 
> This sentence is hard to parse.  Maybe something shorter like this:
> 
> Journals allow safe metadata updates without the need for carefully
> ordering and flushing between update steps.

Okay, I'll update the text with your proposal.

> > +They consist of transactions, which in turn contain operations that
> > +are effectively executed atomically. A qcow2 image can have a main image
> > +journal that deals with cluster management operations, and additional specific
> > +journals can be used by other features like data deduplication.
> 
> I'm not sure if multiple journals will work in practice.  Doesn't this
> re-introduce the need to order update steps and flush between them?

This is a question for Benoît, who made this requirement. I asked him
the same a while ago and apparently his explanation made some sense to
me, or I would have remembered that I don't want it. ;-)

It might have something to do with the fact that deduplication uses the
journal more as a kind of cache for hash values that can be dropped and
rebuilt after a crash.

> > +A journal is organised in journal blocks, all of which have a reference count
> > +of exactly 1. It starts with a block containing the following journal header:
> > +
> > +    Byte  0 -  7:   Magic ("qjournal" ASCII string)
> > +
> > +          8 - 11:   Journal size in bytes, including the header
> > +
> > +         12 - 15:   Journal block size order (block size in bytes = 1 << order)
> > +                    The block size must be at least 512 bytes and must not
> > +                    exceed the cluster size.
> > +
> > +         16 - 19:   Journal block index of the descriptor for the last
> > +                    transaction that has been synced, starting with 1 for the
> > +                    journal block after the header. 0 is used for empty
> > +                    journals.
> > +
> > +         20 - 23:   Sequence number of the last transaction that has been
> > +                    synced. 0 is recommended as the initial value.
> > +
> > +         24 - 27:   Sequence number of the last transaction that has been
> > +                    committed. When replaying a journal, all transactions
> > +                    after the last synced one up to the last commit one must be
> > +                    synced. Note that this may include a wraparound of sequence
> > +                    numbers.
> > +
> > +         28 -  31:  Checksum (one's complement of the sum of all bytes in the
> > +                    header journal block except those of the checksum field)
> > +
> > +         32 - 511:  Reserved (set to 0)
> 
> I'm not sure if these fields are necessary.  They require updates (and
> maybe flush) after every commit and sync.
> 
> The fewer metadata updates, the better, not just for performance but
> also to reduce the risk of data loss.  If any metadata required to
> access the journal is corrupted, the image will be unavailable.
> 
> It should be possible to determine this information by scanning the
> journal transactions.

This is rather handwavy. Can you elaborate how this would work in detail?


For example, let's assume we get to read this journal (a journal can be
rather large, too, so I'm not sure if we want to read it in completely):

 - Descriptor, seq 42, 2 data blocks
 - Data block
 - Data block
 - Data block starting with "qjbk"
 - Data block
 - Descriptor, seq 7, 0 data blocks
 - Descriptor, seq 8, 1 data block
 - Data block

Which of these have already been synced? Which have been committed?


I guess we could introduce an is_committed flag in the descriptor, but
wouldn't correct operation look like this then:

1. Write out the descriptor with the commit flag clear, plus any data blocks
2. Flush
3. Rewrite the descriptor with the commit flag set

This ensures that the commit flag is only set if all the required data
is indeed stable on disk. What has changed compared to this proposal is
just the offset at which you write in step 3 (header vs. descriptor).

For sync I suppose the same option exists.
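
In pseudo-C, the ordering would be something like this (all types and
helpers are made up for illustration, this is not existing code):

    #include <stdint.h>

    /* Sketch only; everything here is assumed for illustration. */
    struct journal;
    struct journal_desc {
        uint16_t type;
        uint16_t size;
        uint16_t flags;
    };
    #define DESC_FLAG_COMMITTED 0x1

    int journal_write_block(struct journal *j, uint32_t index, const void *buf);
    int journal_flush(struct journal *j);

    static int journal_commit_tx(struct journal *j, uint32_t desc_index,
                                 struct journal_desc *desc,
                                 void **data, int nb_data_blocks)
    {
        int i;

        /* 1. Descriptor with the commit flag clear, plus all data blocks */
        desc->flags &= ~DESC_FLAG_COMMITTED;
        journal_write_block(j, desc_index, desc);
        for (i = 0; i < nb_data_blocks; i++) {
            journal_write_block(j, desc_index + 1 + i, data[i]);
        }

        /* 2. Make sure everything written so far is stable on disk */
        journal_flush(j);

        /* 3. Only now set the commit flag; if we crash in step 1 or 2,
         *    the flag stays clear and replay ignores the half-written
         *    transaction */
        desc->flags |= DESC_FLAG_COMMITTED;
        return journal_write_block(j, desc_index, desc);
    }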


One unrelated thing we need to take care of is this 'data block starting
with "qjbk"' thing I mentioned in the example. I can see two solutions:
The first is never creating such a journal, by always blanking the next
block after a transaction that we write, so that the descriptor chain is
terminated. The second one also avoids ever creating such a journal, by
zeroing out an initial "qjbk" in data blocks and setting a flag in the
descriptor instead.

I was thinking of the first, but I guess it would be worth at least
mentioning the problem in the spec.
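
For the first option, it would come down to something like this after each
transaction (again only a sketch, reusing the made-up helpers from the
sketch above):

    #define JOURNAL_BLOCK_SIZE 4096   /* example block size, assumed */

    /* Option 1: after writing a transaction, blank the following journal
     * block (modulo wraparound) so that a stale descriptor left over from
     * an earlier pass around the ring can't be mistaken for the next
     * transaction. */
    static int journal_terminate_chain(struct journal *j, uint32_t next_index)
    {
        static const uint8_t zeroes[JOURNAL_BLOCK_SIZE];

        /* A zeroed block doesn't start with "qjbk", so replay stops here */
        return journal_write_block(j, next_index, zeroes);
    }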

> > +A wraparound may not occur in the middle of a single transaction, but only
> > +between two transactions. For the necessary padding an empty descriptor 
> > with
> > +any number of data blocks can be used as the last entry of the ring.
> 
> Why have this limitation?

I thought it would make implementations easier if they didn't have to
read/write data blocks in two parts in some cases.

> > +All descriptors start with a common part:
> > +
> > +    Byte  0 -  1:   Descriptor type
> > +                        0 - No-op descriptor
> > +                        1 - Write data block
> > +                        2 - Copy data
> > +                        3 - Revoke
> > +                        4 - Deduplication hash insertion
> > +                        5 - Deduplication hash deletion
> > +
> > +          2 -  3:   Size of the descriptor in bytes
> 
> Data blocks are not included in the descriptor size?  I just want to
> make sure that we don't be limited to 64 KB for the actual data.

Right, this is only for the descriptors, without data. It implies a maximum
journal block size of 64k, which I think is okay.

> > +
> > +          4 -  n:   Type-specific data
> > +
> > +The following section specifies the purpose (i.e. the action that is to be
> > +performed when syncing) and type-specific data layout of each descriptor 
> > type:
> > +
> > +  * No-op descriptor: No action is to be performed when syncing this 
> > descriptor
> > +
> > +          4 -  n:   Ignored
> > +
> > +  * Write data block: Write literal data associated with this transaction from
> > +    the journal to a given offset.
> > +
> > +          4 -  7:   Length of the data to write in bytes
> > +
> > +          8 - 15:   Offset in the image file to write the data to
> > +
> > +         16 - 19:   Index of the journal block at which the data to write
> > +                    starts. The data must be stored sequentially and be fully
> > +                    contained in the data blocks associated with the
> > +                    transaction.
> > +
> > +    The type-specific data can be repeated, specifying multiple chunks of data
> > +    to be written in one operation. This means the size of the descriptor must
> > +    be 4 + 16 * n.
> 
> Why is the necessary?  Multiple data descriptors could be used, is it
> worth the additional logic and testing?

What does a typical journal transaction look like?

My assumption is that initially it has lots of L2 table and refcount
block updates and little else. All of these are data writes. Once we
implement Delayed COW using the journal, we get some data copy
operations as well. Assuming a stupid guest, we get two copies and two
writes for each cluster allocation.

The space required for these is 2*(4 + 16n) + 2*(4 + 20n) = 16 + 72n. In
one 4k descriptor block you can fit 56 cluster allocations.

If you used separate descriptors, you get 2*20n + 2*24n = 88n. In one 4k
descriptor block you could fit 46 cluster allocations.
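
Just to double-check the arithmetic, a trivial calculation (assuming 4k
journal blocks):

    #include <stdio.h>

    /* Back-of-the-envelope check of the numbers above:
     *   combined descriptors: 2*(4 + 16n) + 2*(4 + 20n) = 16 + 72n
     *   separate descriptors: 2*20n + 2*24n             = 88n       */
    int main(void)
    {
        int block_size = 4096;

        printf("combined: %d allocations per block\n", (block_size - 16) / 72);
        printf("separate: %d allocations per block\n", block_size / 88);
        return 0;
    }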

Considering that in both cases the space used by descriptors is dwarfed
by the updated tables to be written, I guess it's not worth it for the
main journal. Things may look different for the deduplication journal,
where no data blocks are used and the space taken is determined only by
the descriptors.

And in fact... All of the above sounded great, but is in fact wrong.
Typically you'd get _one_ L2 update for all (sequential) allocations, so
the COW entries without data should dominate in the main journal as well.

> > +
> > +  * Copy data: Copy data from one offset in the image to another one. This can
> > +    be used for journalling copy-on-write operations.
> 
> This reminds me to ask what the plan is for journal scope: metadata only
> or also data?  For some operations like dedupe it seems that full data
> journalling may be necessary.  But for an image without dedupe it would
> not be necessary to journal the rewrites to an already allocated
> cluster, for example.

Generally only metadata.

The COW case is a bit tricky. Here we really store only metadata in the
journal as well ("this COW is still pending"), but it directly affects
data and the qemu read/write paths for data must take such pending COWs
into consideration. This will require some ordering between the journal
and data (e.g. only free the COWed cluster after the journal is
committed), but I think that's okay.
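
In terms of ordering, I'm thinking of something like this for a single COW
(sketch only, the helpers are made up for illustration):

    #include <stdint.h>

    struct journal;   /* same made-up type as in the sketch above */
    void journal_add_copy(struct journal *j, uint64_t dst, uint64_t src,
                          uint32_t len);
    void journal_commit(struct journal *j);
    void free_clusters(uint64_t offset, uint32_t len);

    /* The old cluster may only be freed once the transaction describing
     * the copy has been committed, so that a crash before the commit
     * still finds the data at the old offset. */
    static int cow_cluster(struct journal *j, uint64_t src_offset,
                           uint64_t dst_offset, uint32_t length)
    {
        /* Record "this COW is still pending" as a copy data descriptor;
         * from now on the data read/write paths must check for it */
        journal_add_copy(j, dst_offset, src_offset, length);

        /* Make the transaction containing the copy descriptor durable */
        journal_commit(j);

        /* Only now is it safe to free the COWed cluster */
        free_clusters(src_offset, length);
        return 0;
    }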

> > +          4 -  7:   Length of the data to write in bytes
> > +
> > +          8 - 15:   Target offset in the image file
> > +
> > +         16 - 23:   Source offset in the image file
> 
> Source and target cannot overlap?

I don't think there's a use case for it, so let's forbid it.

> > +
> > +    The type-specific data can be repeated, specifying multiple chunks of data
> > +    to be copied in one operation. This means the size of the descriptor must
> > +    be 4 + 20 * n.
> > +
> > +  * Revoke: Marks operations on a given range in the imag file invalid for all
> 
> s/imag/image/
> 
> > +    earlier transactions (this does not include the transaction containing the
> > +    revoke). They must not be executed on a sync operation (e.g. because the
> > +    range in question has been freed and may have been reused for other, not
> > +    journalled data structures that must not be overwritten with stale data).
> > +    Note that this may mean that operations are to be executed partially.
> 
> Example scenario?

Once I had one... Well, let's try this:

1. Cluster allocation that allocates a new L2 table, i.e. L1 update gets
   journalled

2. Another cluster allocation, this time the L1 has to grow. The write
   to the new L1 position is journalled, the old table is freed.

3. Next allocation reuses the clusters where the old L1 table was stored

4. Journal sync. The L1 update from step 1 is written out to the
   clusters that are now used for data. Oops.
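
At sync time, honouring a revoke would then look roughly like this (only a
sketch; a real implementation would also have to split writes that are only
partially revoked):

    #include <stdint.h>
    #include <stdbool.h>

    /* Before executing a write from an earlier transaction, check it
     * against the revoked ranges collected from all later transactions up
     * to the last committed one. Only the non-overlapping part may be
     * written; this simplified version just reports whether there is an
     * overlap at all. */
    struct range {
        uint64_t offset;
        uint64_t length;
    };

    static bool write_is_revoked(const struct range *revoked, int nb_revoked,
                                 uint64_t offset, uint64_t length)
    {
        for (int i = 0; i < nb_revoked; i++) {
            if (offset < revoked[i].offset + revoked[i].length &&
                revoked[i].offset < offset + length) {
                return true;
            }
        }
        return false;
    }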

Kevin


