Re: [Qemu-devel] [RFC] qcow2 journalling draft


From: Stefan Hajnoczi
Subject: Re: [Qemu-devel] [RFC] qcow2 journalling draft
Date: Thu, 5 Sep 2013 16:55:17 +0200

On Thu, Sep 5, 2013 at 1:18 PM, Kevin Wolf <address@hidden> wrote:
> Am 05.09.2013 um 11:21 hat Stefan Hajnoczi geschrieben:
>> On Wed, Sep 04, 2013 at 11:39:51AM +0200, Kevin Wolf wrote:
>> > However, what if we run 'qemu-img check -r leaks' with an old qemu-img
>> > version? It will reclaim the clusters used by the journal, and if we
>> > continue using the journal we'll corrupt whatever new data is there
>> > now.
>> >
>> > Can we protect against this without using an autoclear bit?
>>
>> You are right.  It's a weird case I didn't think of but it could happen.
>> An autoclear bit sounds like the simplest solution.
>>
>> Please document this scenario.
>
> Okay, I've updated the description as follows:
>
>     Bit 0:      Journal valid bit. This bit indicates that the
>                 image contains a valid main journal starting at
>                 journal_offset; it is used to mark journals
>                 invalid if the image was opened by older
>                 implementations that may have "reclaimed" the
>                 journal clusters that would appear as leaked
>                 clusters to them.

Great, thanks.
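
To make the scenario concrete, open-time handling of that bit could look
roughly like this (a pure sketch; the field and helper names are
placeholders, not part of the draft):

QCOW2_AUTOCLEAR_JOURNAL_VALID = 1 << 0  # bit 0 from the description above

def open_image(header):
    if header.autoclear_features & QCOW2_AUTOCLEAR_JOURNAL_VALID:
        # No older implementation has touched the image, the journal
        # clusters are still ours, so replaying is safe
        replay_journal(header.journal_offset)
    else:
        # An old qemu-img may have "reclaimed" the journal clusters as
        # leaks and cleared the bit, so never replay the old journal
        initialize_new_journal(header)
        header.autoclear_features |= QCOW2_AUTOCLEAR_JOURNAL_VALID
        write_header(header)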

>> > > > +A journal is organised in journal blocks, all of which have a reference count
>> > > > +of exactly 1. It starts with a block containing the following journal header:
>> > > > +
>> > > > +    Byte  0 -  7:   Magic ("qjournal" ASCII string)
>> > > > +
>> > > > +          8 - 11:   Journal size in bytes, including the header
>> > > > +
>> > > > +         12 - 15:   Journal block size order (block size in bytes = 1 << order)
>> > > > +                    The block size must be at least 512 bytes and must not
>> > > > +                    exceed the cluster size.
>> > > > +
>> > > > +         16 - 19:   Journal block index of the descriptor for the last
>> > > > +                    transaction that has been synced, starting with 1 for the
>> > > > +                    journal block after the header. 0 is used for empty
>> > > > +                    journals.
>> > > > +
>> > > > +         20 - 23:   Sequence number of the last transaction that has been
>> > > > +                    synced. 0 is recommended as the initial value.
>> > > > +
>> > > > +         24 - 27:   Sequence number of the last transaction that has been
>> > > > +                    committed. When replaying a journal, all transactions
>> > > > +                    after the last synced one up to the last commit one must be
>> > > > +                    synced. Note that this may include a wraparound of sequence
>> > > > +                    numbers.
>> > > > +
>> > > > +         28 - 31:   Checksum (one's complement of the sum of all bytes in the
>> > > > +                    header journal block except those of the checksum field)
>> > > > +
>> > > > +         32 - 511:  Reserved (set to 0)
>> > >
>> > > I'm not sure if these fields are necessary.  They require updates (and
>> > > maybe flush) after every commit and sync.
>> > >
>> > > The fewer metadata updates, the better, not just for performance but
>> > > also to reduce the risk of data loss.  If any metadata required to
>> > > access the journal is corrupted, the image will be unavailable.
>> > >
>> > > It should be possible to determine this information by scanning the
>> > > journal transactions.
>> >
>> > This is rather handwavy. Can you elaborate how this would work in detail?
>> >
>> >
>> > For example, let's assume we get to read this journal (a journal can be
>> > rather large, too, so I'm not sure if we want to read it in completely):
>> >
>> >  - Descriptor, seq 42, 2 data blocks
>> >  - Data block
>> >  - Data block
>> >  - Data block starting with "qjbk"
>> >  - Data block
>> >  - Descriptor, seq 7, 0 data blocks
>> >  - Descriptor, seq 8, 1 data block
>> >  - Data block
>> >
>> > Which of these have already been synced? Which have been committed?
>
> So what's your algorithm for this?

Scan the journal to find unsynced transactions, if they exist:

last_sync_seq = 0
last_seqno = 0
i = 0
while i < journal_nblocks * 2:  # scan at most two passes to avoid an infinite loop
    block = journal[i % journal_nblocks]
    i += 1
    if block.magic != 'qjbk':
        continue
    if block.seqno < last_seqno:
        # Wrapped around to the oldest transaction
        break
    elif block.seqno == last_seqno:
        # Corrupt journal, sequence numbers should be
        # monotonically increasing
        raise InvalidJournalException
    if block.last_sync_seq != last_sync_seq:
        last_sync_seq = block.last_sync_seq
    last_seqno = block.seqno

print 'Last synced seq no:', last_sync_seq
print 'Last block seq no:', last_seqno

This is rough pseudocode, but hopefully the idea makes sense.
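
For the header block quoted further up, validation would be equally simple
(again just a sketch; I'm assuming big-endian fields as everywhere else in
qcow2, and that the checksum covers the whole header journal block):

import struct

def journal_header_valid(block):
    # 'block' is the raw header journal block as a byte string
    magic, size, order, synced_idx, synced_seq, committed_seq, checksum = \
        struct.unpack_from('>8sIIIIII', block, 0)
    if magic != 'qjournal':
        return False
    # One's complement of the sum of all bytes in the header block,
    # with the checksum field (bytes 28-31) counted as zero
    total = sum(ord(c) for c in block) - sum(ord(c) for c in block[28:32])
    return checksum == (~total & 0xffffffff)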

>> > I guess we could introduce an is_committed flag in the descriptor, but
>> > wouldn't correct operation look like this then:
>> >
>> > 1. Write out descriptor commit flag clear and any data blocks
>> > 2. Flush
>> > 3. Rewrite descriptor with commit flag set
>> >
>> > This ensures that the commit flag is only set if all the required data
>> > is indeed stable on disk. What has changed compared to this proposal is
>> > just the offset at which you write in step 3 (header vs. descriptor).
>>
>> A commit flag cannot be relied upon.  A transaction can be corrupted
>> after being committed, or it can be corrupted due to power failure while
>> writing the transaction.  In both cases we have an invalid transaction
>> and we must discard it.
>
> No, I believe it is vitally important to distinguish these two cases.
>
> If a transaction was corrupted due to power failure while writing the
> transaction, then we can simply discard it indeed.

I agree that we need to detect whether the journal is corrupted partway
through or only in its last unsynced transactions.

If there is a valid transaction after an invalid transaction we know
we're in trouble because the journal was corrupted partway through.

A committed bit isn't necessary for doing that.
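
The replay scan can enforce that by itself, roughly like this (pseudocode
again; checksum_valid() stands for whatever per-transaction checksum we
end up choosing):

def transactions_to_replay(transactions):
    # 'transactions' are the unsynced transactions in journal order
    valid = []
    seen_invalid = False
    for t in transactions:
        if not checksum_valid(t):
            # Possibly torn by a power failure while it was being written
            seen_invalid = True
        elif seen_invalid:
            # A valid transaction after an invalid one means the journal
            # was corrupted partway through, so refuse to open the image
            raise InvalidJournalException
        else:
            valid.append(t)
    # Trailing invalid transactions were never committed; drop them
    return valid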

> If, however, a transaction was committed and gets corrupted after the
> fact, then we have a problem because the data on the disk is laid out as
> described by on-disk metadata (e.g. L2 tables) _with the journal fully
> applied_. The replay and consequently bdrv_open() must fail in this case.
>
> The first case is handled by any information that tells us whether the
> transaction is already committed; the second should never happen, but
> would be caught by a checksum.
>
>> The checksum tells us whether the transaction is valid (i.e. committed).
>> I see no need for a commit flag since the checksum already tells us if
>> the transaction is valid or not.
>>
>> (If the checksum is weak then corruptions can sneak through, so we need
>> to choose a reasonable algorithm.)
>
> So you're essentially replacing a flush by a checksum? This is an
> interesting thought, but feels quite dangerous. It also can only work if
> the only goal is to ensure that the transaction is fully written, but
> not for ordering. Generally after committing the journal, we want to
> rely on the journalled data for other operations. I don't think we can
> save that flush.

We definitely need to flush if we're making a promise that the
transaction is safely in the journal.

My goal here is for the journal to be append-only instead of using
random accesses to modify the header.

Imagine we're adding a transaction to the journal and just updating
the journal header block when power fails.  This could corrupt the
journal header block.

If we have an append-only solution then the final transaction is
invalid but the rest of the journal is still accessible.

The journal is no good if the entire thing can be inaccessible due to
power failure.
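
So commit would only ever append to the tail and flush, along these lines
(a sketch only; write_block() and flush() stand in for the block layer):

def commit_transaction(journal, descriptor, data_blocks):
    # Append the descriptor and its data blocks at the current tail;
    # nothing earlier in the journal is ever rewritten
    for block in [descriptor] + data_blocks:
        write_block(journal.tail_offset, block)
        journal.tail_offset += journal.block_size
    # One flush: either the whole transaction is on disk, or its
    # checksum fails on replay and it gets discarded as uncommitted
    flush()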

>> > For sync I suppose the same option exists.
>>
>> The sync process is:
>>
>> 1. Apply descriptor operations to the image file
>> 2. Flush - ensure transactions have been applied to disk
>> 3. Mark transactions synced
>>
>> Transactions can be marked synced by writing a new block to the journal.
>> Or we could even piggyback the next transaction by including a
>> last_sync_seq in the header.
>
> Would this mean that there is always at least one unsynced transaction?

Yes.  If you want to sync all transactions you need to make a nop
transaction.  Kind of like a ring buffer with a dummy element so head
and tail don't overlap.
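
A sketch of that sync step with the piggybacked last_sync_seq (once more
pseudocode; the helpers are placeholders):

def sync(journal):
    # 1. Apply all committed-but-unsynced transactions to the image file
    for t in journal.unsynced_transactions():
        apply_to_image(t)
    # 2. Make sure those updates have really reached the disk
    flush()
    # 3. Record the new last_sync_seq by appending another transaction;
    #    with nothing else pending this is a nop transaction, which is
    #    why at least one unsynced transaction always remains
    desc = make_nop_descriptor(last_sync_seq=journal.last_committed_seq)
    append_transaction(journal, desc)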


