From: Stefan Hajnoczi
Subject: Re: [Qemu-devel] [RFC] qcow2 journalling draft
Date: Thu, 5 Sep 2013 11:21:03 +0200
User-agent: Mutt/1.5.21 (2010-09-15)

On Wed, Sep 04, 2013 at 11:39:51AM +0200, Kevin Wolf wrote:
> First of all, excuse any inconsistencies in the following mail. I wrote
> it from top to bottom, and there was some thought process involved in
> almost every paragraph...

I should add this disclaimer to all my emails ;-).

> Am 04.09.2013 um 10:03 hat Stefan Hajnoczi geschrieben:
> > On Tue, Sep 03, 2013 at 03:45:52PM +0200, Kevin Wolf wrote:
> > > @@ -103,7 +107,11 @@ in the description of a field.
> > >                      write to an image with unknown auto-clear features if it
> > >                      clears the respective bits from this field first.
> > >  
> > > -                    Bits 0-63:  Reserved (set to 0)
> > > +                    Bit 0:      Journal valid bit. This bit indicates that the
> > > +                                image contains a valid main journal starting at
> > > +                                journal_offset.
> > 
> > Whether the journal is used can be determined from the journal_offset
> > value (header length must be large enough and journal offset must be
> > valid).
> > 
> > Why do we need this autoclear bit?
> 
> Hm, I introduced this one first and the journal dirty incompatible bit
> later, perhaps it's unnecessary now. Let's check...
> 
> The obvious thing we need to protect against is applying stale journal
> data to an image that has been changed by an older version. As long as
> the journal is clean, this can't happen, and the journal dirty bit will
> ensure that the old version can only open the image if it is clean.
> 
> However, what if we run 'qemu-img check -r leaks' with an old qemu-img
> version? It will reclaim the clusters used by the journal, and if we
> continue using the journal we'll corrupt whatever new data is there
> now.
> 
> Can we protect against this without using an autoclear bit?

You are right.  It's a weird case I didn't think of but it could happen.
An autoclear bit sounds like the simplest solution.

Please document this scenario.
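
For the documentation, maybe also sketch what an implementation does on
open. Something along these lines (the flag and function names here are
made up, only the bit number is from the draft):

#include <stdint.h>
#include <stdbool.h>

#define QCOW2_AUTOCLEAR_JOURNAL_VALID (1ULL << 0)

/* On read-write open: decide whether the journal may be replayed. */
static bool qcow2_journal_usable(uint64_t autoclear_features)
{
    /* An older writer (e.g. qemu-img check -r leaks reclaiming the
     * journal clusters) clears unknown autoclear bits before touching
     * the image, so a cleared bit means the journal must not be
     * replayed or reused. */
    return autoclear_features & QCOW2_AUTOCLEAR_JOURNAL_VALID;
}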

> > > +A journal is organised in journal blocks, all of which have a reference count
> > > +of exactly 1. It starts with a block containing the following journal header:
> > > +
> > > +    Byte  0 -  7:   Magic ("qjournal" ASCII string)
> > > +
> > > +          8 - 11:   Journal size in bytes, including the header
> > > +
> > > +         12 - 15:   Journal block size order (block size in bytes = 1 << order)
> > > +                    The block size must be at least 512 bytes and must not
> > > +                    exceed the cluster size.
> > > +
> > > +         16 - 19:   Journal block index of the descriptor for the last
> > > +                    transaction that has been synced, starting with 1 for the
> > > +                    journal block after the header. 0 is used for empty
> > > +                    journals.
> > > +
> > > +         20 - 23:   Sequence number of the last transaction that has been
> > > +                    synced. 0 is recommended as the initial value.
> > > +
> > > +         24 - 27:   Sequence number of the last transaction that has been
> > > +                    committed. When replaying a journal, all transactions
> > > +                    after the last synced one up to the last committed one
> > > +                    must be synced. Note that this may include a wraparound
> > > +                    of sequence numbers.
> > > +
> > > +         28 - 31:   Checksum (one's complement of the sum of all bytes in the
> > > +                    header journal block except those of the checksum field)
> > > +
> > > +         32 - 511:  Reserved (set to 0)
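
For illustration, the layout above maps onto a 512-byte packed struct
along these lines (struct and field names are mine, and I'm assuming the
usual qcow2 big-endian encoding):

#include <stdint.h>

typedef struct __attribute__((packed)) Qcow2JournalHeader {
    char     magic[8];         /*  0 -  7: "qjournal" */
    uint32_t journal_size;     /*  8 - 11: size in bytes, incl. header */
    uint32_t block_order;      /* 12 - 15: block size = 1 << block_order */
    uint32_t synced_index;     /* 16 - 19: block index of the last synced
                                *          descriptor, 0 if journal empty */
    uint32_t synced_seq;       /* 20 - 23: seq of last synced transaction */
    uint32_t committed_seq;    /* 24 - 27: seq of last committed transaction */
    uint32_t checksum;         /* 28 - 31: one's complement of byte sum */
    uint8_t  reserved[480];    /* 32 - 511: set to 0 */
} Qcow2JournalHeader;
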
> > 
> > I'm not sure if these fields are necessary.  They require updates (and
> > maybe flush) after every commit and sync.
> > 
> > The fewer metadata updates, the better, not just for performance but
> > also to reduce the risk of data loss.  If any metadata required to
> > access the journal is corrupted, the image will be unavailable.
> > 
> > It should be possible to determine this information by scanning the
> > journal transactions.
> 
> This is rather handwavy. Can you elaborate how this would work in detail?
> 
> 
> For example, let's assume we get to read this journal (a journal can be
> rather large, too, so I'm not sure if we want to read it in completely):
> 
>  - Descriptor, seq 42, 2 data blocks
>  - Data block
>  - Data block
>  - Data block starting with "qjbk"
>  - Data block
>  - Descriptor, seq 7, 0 data blocks
>  - Descriptor, seq 8, 1 data block
>  - Data block
> 
> Which of these have already been synced? Which have been committed?
> 
> 
> I guess we could introduce an is_committed flag in the descriptor, but
> wouldn't correct operation look like this then:
> 
> 1. Write out the descriptor with the commit flag clear and any data blocks
> 2. Flush
> 3. Rewrite descriptor with commit flag set
> 
> This ensures that the commit flag is only set if all the required data
> is indeed stable on disk. What has changed compared to this proposal is
> just the offset at which you write in step 3 (header vs. descriptor).

A commit flag cannot be relied upon.  A transaction can be corrupted
after being committed, or it can be corrupted due to power failure while
writing the transaction.  In both cases we have an invalid transaction
and we must discard it.

The checksum tells us whether the transaction is valid (i.e. committed).
I see no need for a commit flag since the checksum already tells us if
the transaction is valid or not.

(If the checksum is weak then corruptions can sneak through, so we need
to choose a reasonable algorithm.)
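
For reference, the one's complement byte sum quoted above would be
something like this (sketch):

#include <stddef.h>
#include <stdint.h>

/* One's complement of the sum of all bytes in the block, with the
 * checksum field itself excluded from the sum. */
static uint32_t qjournal_checksum(const uint8_t *block, size_t block_size,
                                  size_t csum_offset)
{
    uint32_t sum = 0;

    for (size_t i = 0; i < block_size; i++) {
        if (i < csum_offset || i >= csum_offset + sizeof(uint32_t)) {
            sum += block[i];
        }
    }
    return ~sum;
}

A byte sum like this misses reordered bytes entirely, so I'd lean
towards CRC32 or similar for the final spec.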

> For sync I suppose the same option exists.

The sync process is:

1. Apply descriptor operations to the image file
2. Flush - ensure transactions have been applied to disk
3. Mark transactions synced

Transactions can be marked synced by writing a new block to the journal.
Or we could even piggyback the next transaction by including a
last_sync_seq in the header.
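
In other words, something like this (sketch; the struct and helpers are
invented, the real code would live in the qcow2 journal implementation):

#include <stdint.h>

typedef struct Qcow2Journal {
    uint32_t synced_seq;     /* last transaction applied to the image */
    uint32_t committed_seq;  /* last transaction stable in the journal */
} Qcow2Journal;

/* Hypothetical helpers standing in for the real implementation. */
int qjournal_apply_transaction(Qcow2Journal *j, uint32_t seq);
int qjournal_flush_image(Qcow2Journal *j);

static int qjournal_sync(Qcow2Journal *j)
{
    int ret;

    /* 1. Apply all committed but unsynced transactions. Sequence
     *    numbers may wrap, hence the subtraction-based comparison. */
    while ((int32_t)(j->committed_seq - j->synced_seq) > 0) {
        ret = qjournal_apply_transaction(j, j->synced_seq + 1);
        if (ret < 0) {
            return ret;
        }
        j->synced_seq++;
    }

    /* 2. Flush - ensure the applied operations are stable on disk. */
    ret = qjournal_flush_image(j);
    if (ret < 0) {
        return ret;
    }

    /* 3. Mark the transactions synced, e.g. by piggybacking
     *    last_sync_seq onto the next transaction that gets written. */
    return 0;
}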

> One unrelated thing we need to take care of is this 'data block starting
> with "qjbk"' thing I mentioned in the example. I can see two solutions:
> The first is never creating such a journal, by always blanking the next
> block after a transaction that we write, so that the descriptor chain is
> terminated. The second one is never creating such a journal, by
> zeroing an initial "qjbk" in data blocks and setting a flag in the
> descriptor instead.
> 
> I was thinking of the first, but I guess it would be worth at least
> mentioning the problem in the spec.

Agreed.
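
The first option is one extra small write per transaction, roughly
(sketch, the helper is invented):

#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: write buf at the start of journal block 'index'. */
int qjournal_write_block(uint32_t index, const void *buf, size_t len);

/* Blank the first bytes of the block following a freshly written
 * transaction so that stale "qjbk" data there can't be misread as
 * another descriptor. Block 0 is the header, so the ring wraps back
 * to block 1. */
static int qjournal_terminate_chain(uint32_t next_index, uint32_t nb_blocks)
{
    const uint8_t zeroes[4] = { 0 };
    uint32_t term = next_index >= nb_blocks ? 1 : next_index;

    return qjournal_write_block(term, zeroes, sizeof(zeroes));
}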

> > > +A wraparound may not occur in the middle of a single transaction, but only
> > > +between two transactions. For the necessary padding an empty descriptor with
> > > +any number of data blocks can be used as the last entry of the ring.
> > 
> > Why have this limitation?
> 
> I thought it would make implementations easier if they didn't have to
> read/write data blocks in two parts in some cases.

Readers must detect wrapped/truncated transactions anyway since the
journal may be corrupted.  So I don't think we save anything by adding
the padding special case.
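
Handling the wraparound is just a split read anyway, along these lines
(sketch, the helper is invented):

#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: read 'count' consecutive journal blocks. */
int qjournal_read_blocks(uint32_t index, uint32_t count, uint8_t *buf);

/* Read 'count' ring blocks starting at 'index', wrapping past the end
 * of the journal back to block 1 (block 0 is the header). */
static int qjournal_read_wrapped(uint32_t index, uint32_t count,
                                 uint32_t nb_blocks, uint32_t block_size,
                                 uint8_t *buf)
{
    uint32_t first = nb_blocks - index;   /* blocks left before the end */
    int ret;

    if (first > count) {
        first = count;
    }
    ret = qjournal_read_blocks(index, first, buf);
    if (ret < 0 || first == count) {
        return ret;
    }
    return qjournal_read_blocks(1, count - first,
                                buf + (size_t)first * block_size);
}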

> > > +
> > > +  * Copy data: Copy data from one offset in the image to another one. This can
> > > +    be used for journalling copy-on-write operations.
> > 
> > This reminds me to ask what the plan is for journal scope: metadata only
> > or also data?  For some operations like dedupe it seems that full data
> > journalling may be necessary.  But for an image without dedupe it would
> > not be necessary to journal the rewrites to an already allocated
> > cluster, for example.
> 
> Generally only metadata.

Okay, that's what I was hoping, since writing all data through the
journal would mean massive churn and definitely affect performance
(everything that goes via the journal will be written twice).

> > > +          4 -  7:   Length of the data to write in bytes
> > > +
> > > +          8 - 15:   Target offset in the image file
> > > +
> > > +         16 - 23:   Source offset in the image file
> > 
> > Source and target cannot overlap?
> 
> I don't think there's a use case for it, so let's forbid it.

Agreed.
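
A reader can then treat any overlap as an invalid descriptor, e.g.:

#include <stdbool.h>
#include <stdint.h>

/* Source and target ranges of a copy must not overlap. */
static bool qjournal_copy_ranges_ok(uint64_t source, uint64_t target,
                                    uint64_t length)
{
    return source + length <= target || target + length <= source;
}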

> > > +    earlier transactions (this does not include the transaction containing
> > > +    the revoke). They must not be executed on a sync operation (e.g. because
> > > +    the range in question has been freed and may have been reused for other,
> > > +    not journalled data structures that must not be overwritten with stale
> > > +    data). Note that this may mean that operations are to be executed
> > > +    partially.
> > 
> > Example scenario?
> 
> Once I had one... Well, let's try this:
> 
> 1. Cluster allocation that allocates a new L2 table, i.e. L1 update gets
>    journalled
> 
> 2. Another cluster allocation, this time the L1 has to grow. The write
>    to the new L1 position is journalled, the old table is freed.

Another solution is to sync all transactions before moving metadata
around or updating the qcow2 header (e.g. L1 table offset).

Since 'revoke' itself requires writing to the journal, it may not be
much more expensive to simply sync transactions.

I'm not sure which approach is best, but I wanted to mention it in case
it helps us keep the journal design and implementation simpler.

> 3. Next allocation reuses the clusters where the old L1 table was stored
> 
> 4. Journal sync. The L1 update from step 1 is written out to the
>    clusters that are now used for data. Oops.
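
With the alternative above, step 2 would drain the journal before the
old table is freed, so step 4 can never touch reused clusters. Sketch
(all helpers invented):

#include <stdint.h>

typedef struct Qcow2Journal Qcow2Journal;

/* Hypothetical helpers. */
int qjournal_sync_all(Qcow2Journal *j);
void qcow2_free_clusters(Qcow2Journal *j, uint64_t offset, uint64_t size);

static int qcow2_replace_l1(Qcow2Journal *j, uint64_t old_l1_offset,
                            uint64_t old_l1_size)
{
    /* Sync journalled writes to the old L1 location before its
     * clusters can go back to the free pool and be reused. */
    int ret = qjournal_sync_all(j);
    if (ret < 0) {
        return ret;
    }
    qcow2_free_clusters(j, old_l1_offset, old_l1_size);
    return 0;
}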

Stefan


