Re: [Qemu-devel] [RFC] qcow2 journalling draft


From: Kevin Wolf
Subject: Re: [Qemu-devel] [RFC] qcow2 journalling draft
Date: Thu, 5 Sep 2013 13:18:28 +0200
User-agent: Mutt/1.5.21 (2010-09-15)

On 05.09.2013 at 11:21, Stefan Hajnoczi wrote:
> On Wed, Sep 04, 2013 at 11:39:51AM +0200, Kevin Wolf wrote:
> > However, what if we run 'qemu-img check -r leaks' with an old qemu-img
> > version? It will reclaim the clusters used by the journal, and if we
> > continue using the journal we'll corrupt whatever new data is there
> > now.
> > 
> > Can we protect against this without using an autoclear bit?
> 
> You are right.  It's a weird case I didn't think of but it could happen.
> An autoclear bit sounds like the simplest solution.
> 
> Please document this scenario.

Okay, I've updated the description as follows:

    Bit 0:      Journal valid bit. This bit indicates that the
                image contains a valid main journal starting at
                journal_offset; it is used to mark journals
                invalid if the image was opened by older
                implementations that may have "reclaimed" the
                journal clusters that would appear as leaked
                clusters to them.
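
For illustration, opening an image would then check the bit roughly like
this (a sketch only: QCOW2_AUTOCLEAR_JOURNAL_VALID and the journal
helpers are made-up names, though autoclear_features is the existing
qcow2 header field):

    #define QCOW2_AUTOCLEAR_JOURNAL_VALID (1ULL << 0)

    /* Sketch of the open path: an old qemu-img knows nothing about this
     * bit and clears all unknown autoclear bits, so if it has possibly
     * "repaired" the leaked journal clusters, the bit is gone and we
     * must not replay or reuse the old journal. */
    static int qcow2_journal_open(BlockDriverState *bs)
    {
        BDRVQcowState *s = bs->opaque;

        if (!(s->autoclear_features & QCOW2_AUTOCLEAR_JOURNAL_VALID)) {
            /* No valid journal: start a fresh one instead of replaying */
            return qcow2_journal_create(bs);
        }

        return qcow2_journal_replay(bs);
    }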

> > > > +A journal is organised in journal blocks, all of which have a reference count
> > > > +of exactly 1. It starts with a block containing the following journal header:
> > > > +
> > > > +    Byte  0 -  7:   Magic ("qjournal" ASCII string)
> > > > +
> > > > +          8 - 11:   Journal size in bytes, including the header
> > > > +
> > > > +         12 - 15:   Journal block size order (block size in bytes = 1 << order)
> > > > +                    The block size must be at least 512 bytes and must not
> > > > +                    exceed the cluster size.
> > > > +
> > > > +         16 - 19:   Journal block index of the descriptor for the last
> > > > +                    transaction that has been synced, starting with 1 for the
> > > > +                    journal block after the header. 0 is used for empty
> > > > +                    journals.
> > > > +
> > > > +         20 - 23:   Sequence number of the last transaction that has been
> > > > +                    synced. 0 is recommended as the initial value.
> > > > +
> > > > +         24 - 27:   Sequence number of the last transaction that has been
> > > > +                    committed. When replaying a journal, all transactions
> > > > +                    after the last synced one up to the last committed one must be
> > > > +                    synced. Note that this may include a wraparound of sequence
> > > > +                    numbers.
> > > > +
> > > > +         28 - 31:   Checksum (one's complement of the sum of all bytes in the
> > > > +                    journal header block except those of the checksum field)
> > > > +
> > > > +         32 - 511:  Reserved (set to 0)
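
(As an aside, the layout above corresponds to a 512-byte on-disk struct
roughly like the following; the struct name is made up, and as with the
rest of qcow2 all multi-byte fields would be big-endian:)

    typedef struct QEMU_PACKED Qcow2JournalHeader {
        char     magic[8];         /* "qjournal" */
        uint32_t journal_size;     /* in bytes, including this header */
        uint32_t block_size_order; /* block size = 1 << order */
        uint32_t synced_index;     /* block index of last synced descriptor,
                                      0 for an empty journal */
        uint32_t synced_seq;       /* seq number of last synced transaction */
        uint32_t committed_seq;    /* seq number of last committed transaction */
        uint32_t checksum;         /* one's complement byte sum, checksum
                                      field excluded */
        uint8_t  reserved[480];    /* bytes 32 - 511, set to 0 */
    } Qcow2JournalHeader;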
> > > 
> > > I'm not sure if these fields are necessary.  They require updates (and
> > > maybe flush) after every commit and sync.
> > > 
> > > The fewer metadata updates, the better, not just for performance but
> > > also to reduce the risk of data loss.  If any metadata required to
> > > access the journal is corrupted, the image will be unavailable.
> > > 
> > > It should be possible to determine this information by scanning the
> > > journal transactions.
> > 
> > This is rather handwavy. Can you elaborate how this would work in detail?
> > 
> > 
> > For example, let's assume we get to read this journal (a journal can be
> > rather large, too, so I'm not sure if we want to read it in completely):
> > 
> >  - Descriptor, seq 42, 2 data blocks
> >  - Data block
> >  - Data block
> >  - Data block starting with "qjbk"
> >  - Data block
> >  - Descriptor, seq 7, 0 data blocks
> >  - Descriptor, seq 8, 1 data block
> >  - Data block
> > 
> > Which of these have already been synced? Which have been committed?

So what's your algorithm for this?

> > I guess we could introduce an is_committed flag in the descriptor, but
> > wouldn't correct operation look like this then:
> > 
> > 1. Write out descriptor commit flag clear and any data blocks
> > 2. Flush
> > 3. Rewrite descriptor with commit flag set
> > 
> > This ensures that the commit flag is only set if all the required data
> > is indeed stable on disk. What has changed compared to this proposal is
> > just the offset at which you write in step 3 (header vs. descriptor).
> 
> A commit flag cannot be relied upon.  A transaction can be corrupted
> after being committed, or it can be corrupted due to power failure while
> writing the transaction.  In both cases we have an invalid transaction
> and we must discard it.

No, I believe it is vitally important to distinguish these two cases.

If a transaction was corrupted due to a power failure while it was being
written, then we can indeed simply discard it.

If, however, a transaction was committed and gets corrupted after the
fact, then we have a problem because the data on the disk is laid out as
described by on-disk metadata (e.g. L2 tables) _with the journal fully
applied_. The replay, and consequently bdrv_open(), must fail in this case.

The first case is handled by any information that tells us whether the
transaction is already committed; the second should never happen, but
would be caught by a checksum.
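
In code, the distinction I'm arguing for would look roughly like this (a
sketch; all of the helpers and types are hypothetical):

    /* Replay one transaction, distinguishing the two corruption cases */
    static int qcow2_journal_replay_one(Qcow2Journal *j, Qcow2Transaction *tx)
    {
        if (!tx_is_committed(j, tx)) {
            /* Power failure while the transaction was being written: it
             * was never committed, so we can simply discard it and stop
             * the replay here. */
            return 0;
        }

        if (!tx_checksum_ok(tx)) {
            /* Committed, but corrupted after the fact: the on-disk
             * metadata already describes the image with this journal
             * fully applied, so replay (and bdrv_open()) must fail. */
            return -EIO;
        }

        return tx_apply(j, tx);
    }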

> The checksum tells us whether the transaction is valid (i.e. committed).
> I see no need for a commit flag since the checksum already tells us if
> the transaction is valid or not.
> 
> (If the checksum is weak then corruptions can sneak through, so we need
> to choose a reasonable algorithm.)

So you're essentially replacing a flush with a checksum? This is an
interesting thought, but it feels quite dangerous. It can also only work
if the only goal is to ensure that the transaction is fully written; it
does nothing for ordering. Generally, after committing the journal, we
want to rely on the journalled data for other operations. I don't think
we can save that flush.
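
To make the weakness concrete: the one's complement byte sum that the
header uses (sketch below) is permutation-invariant, i.e. swapping any
two bytes goes undetected, so a transaction checksum that is supposed to
stand in for a commit flag would probably need something stronger, e.g.
CRC32.

    /* One's complement of the sum of all bytes in the 512-byte journal
     * header block, excluding the checksum field itself (bytes 28..31) */
    static uint32_t qcow2_journal_header_checksum(const uint8_t block[512])
    {
        uint32_t sum = 0;
        int i;

        for (i = 0; i < 512; i++) {
            if (i < 28 || i > 31) {
                sum += block[i];
            }
        }

        return ~sum;
    }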

> > For sync I suppose the same option exists.
> 
> The sync process is:
> 
> 1. Apply descriptor operations to the image file
> 2. Flush - ensure transactions have been applied to disk
> 3. Mark transactions synced
> 
> Transactions can be marked synced by writing a new block to the journal.
> Or we could even piggyback the next transaction by including a
> last_sync_seq in the header.

Would this mean that there is always at least one unsynced transaction?
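
For concreteness, my reading of the piggyback idea (all names made up):

    /* Each descriptor would carry the sync state, so marking
     * transactions synced costs no extra write of its own. */
    typedef struct Qcow2JournalDesc {
        uint32_t magic;          /* descriptor magic */
        uint32_t seq;            /* sequence number of this transaction */
        uint32_t last_sync_seq;  /* everything up to this seq is synced */
        uint32_t nb_data_blocks; /* data blocks following this descriptor */
        /* ... */
    } Qcow2JournalDesc;

Since last_sync_seq only travels with the next descriptor, the sync
state of the newest transaction is never recorded until another
transaction is written, which is what prompts my question above.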

> > > > +A wraparound may not occur in the middle of a single transaction, but only
> > > > +between two transactions. For the necessary padding an empty descriptor with
> > > > +any number of data blocks can be used as the last entry of the ring.
> > > 
> > > Why have this limitation?
> > 
> > I thought it would make implementations easier if they didn't have to
> > read/write in data blocks in two parts in some cases.
> 
> Readers must detect wrapped/truncated transactions anyway since the
> journal may be corrupted.  So I don't think we save anything by adding
> the padding special case.

Fair enough. I'll change it.
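
With the padding gone, readers simply wrap block indices around the
ring, something like this (sketch; the Qcow2Journal fields are made up):

    /* Translate a logical journal block index into an offset inside the
     * journal. Block 0 is the header, so the ring of usable blocks is
     * 1..nb_blocks-1 and a transaction may wrap around the end. */
    static uint64_t journal_block_offset(Qcow2Journal *j, uint64_t index)
    {
        uint64_t nb_blocks = j->size >> j->block_size_order;

        index = 1 + (index - 1) % (nb_blocks - 1);

        return j->offset + (index << j->block_size_order);
    }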

> > > > +    earlier transactions (this does not include the transaction containing the
> > > > +    revoke). They must not be executed on a sync operation (e.g. because the
> > > > +    range in question has been freed and may have been reused for other, not
> > > > +    journalled data structures that must not be overwritten with stale data).
> > > > +    Note that this may mean that operations are to be executed partially.
> > > 
> > > Example scenario?
> > 
> > Once I had one... Well, let's try this:
> > 
> > 1. Cluster allocation that allocates a new L2 table, i.e. L1 update gets
> >    journalled
> > 
> > 2. Another cluster allocation, this time the L1 has to grow. The write
> >    to the new L1 position is journalled, the old table is freed.
> 
> Another solution is to sync all transactions before moving metadata
> around or updating the qcow2 header (e.g. L1 table offset).
> 
> Since 'revoke' itself requires writing to the journal, it may not be
> much more expensive to simply sync transactions.
> 
> I'm not sure which approach is best but I wanted to mention it if it
> helps us keep the journal design and implementation simpler.
> 
> > 3. Next allocation reuses the clusters where the old L1 table was stored
> > 
> > 4. Journal sync. The L1 update from step 1 is written out to the
> >    clusters that are now used for data. Oops.

I think revoke would be an essential part of Delayed COW anyway (not
having to actually perform the journalled COW is the whole point of
Delayed COW), so using it for simpler cases as well shouldn't hurt.

Actually I believe there's potential for simpler code with revoke:
update_refcounts() could directly notify the journalling code about any
clusters that reach a refcount of 0, so changes to them can be revoked.

Kevin


