From: Stefan Hajnoczi
Subject: Re: [Qemu-devel] Re: qcow2 performance plan
Date: Tue, 14 Sep 2010 16:14:18 +0100

On Tue, Sep 14, 2010 at 3:01 PM, Kevin Wolf <address@hidden> wrote:
> On 14.09.2010 15:07, Avi Kivity wrote:
>>   Here's a draft of a plan that should improve qcow2 performance.  It's
>> written in wiki syntax for eventual upload to wiki.qemu.org; lines
>> starting with # are numbered lists, not comments.
>>
>> = Basics =
>>
>> At a minimum, no operation should block the main thread.  This
>> could be done in two ways: extending the state machine so that each
>> blocking operation can be performed asynchronously (<code>bdrv_aio_*</code>)
>> or by threading: each new operation is handed off to a worker thread.
>> Since a full state machine is prohibitively complex, this document
>> will discuss threading.
>>
>> == Basic threading strategy ==
>>
>> A first iteration of qcow2 threading adds a single mutex to an image.
>> The existing qcow2 code is then executed within a worker thread,
>> acquiring the mutex before starting any operation and releasing it
>> after completion.  Concurrent operations will simply block until the
>> operation is complete.  For operations which are already asynchronous,
>> the blocking time will be negligible since the code will call
>> <code>bdrv_aio_{read,write}</code> and return, releasing the mutex.
>> The immediate benefit is that currently blocking operations no longer block
>> the main thread; instead they only block the block device operation, which is
>> blocking anyway.
>>
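To make that concrete, the single-mutex scheme might look roughly like the
following with pthreads (all structure and function names below are made up
for illustration; this is not the actual QEMU/qcow2 code):

    #include <pthread.h>
    #include <stdio.h>
    #include <stdlib.h>

    /* Illustrative image state: one mutex serializes all qcow2 operations. */
    typedef struct {
        pthread_mutex_t lock;
        /* ... metadata caches, file descriptor, ... */
    } Image;

    typedef struct {
        Image *img;
        long offset, length;
        void (*cb)(int ret);            /* completion callback */
    } Request;

    /* Stand-in for the existing synchronous qcow2 request code. */
    static int qcow2_do_request(Image *img, long offset, long length)
    {
        printf("request at %ld+%ld\n", offset, length);
        return 0;
    }

    static void request_done(int ret)
    {
        printf("done, ret=%d\n", ret);
    }

    /* Each request is handed to a worker thread; concurrent requests simply
     * block on the image mutex until the current operation completes. */
    static void *qcow2_worker(void *opaque)
    {
        Request *req = opaque;
        pthread_mutex_lock(&req->img->lock);
        int ret = qcow2_do_request(req->img, req->offset, req->length);
        pthread_mutex_unlock(&req->img->lock);
        req->cb(ret);                   /* real code would signal the main loop */
        free(req);
        return NULL;
    }

    int main(void)
    {
        Image img = { .lock = PTHREAD_MUTEX_INITIALIZER };
        Request *req = malloc(sizeof(*req));
        *req = (Request){ &img, 0, 65536, request_done };
        pthread_t t;
        pthread_create(&t, NULL, qcow2_worker, req);
        pthread_join(&t, NULL);
        return 0;
    }
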
>> == Eliminating the threading penalty ==
>>
>> We can eliminate pointless context switches by using the worker thread
>> context we're in to issue the I/O.  This is trivial for synchronous calls
>> (<code>bdrv_read</code> and <code>bdrv_write</code>); we simply issue
>> the I/O
>> from the same thread we're currently in.  The underlying raw block format
>> driver threading code needs to recognize that we're in a worker thread context
>> so that it doesn't spawn a worker thread of its own; perhaps by using a
>> thread-local variable to see whether it is in the main thread or an I/O worker thread.
>>
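That thread-context check could be little more than a thread-local flag,
something like this (illustrative names only, not the real block driver
entry points):

    #include <pthread.h>
    #include <stdbool.h>
    #include <stdio.h>

    /* Set in I/O worker threads, false in the main thread. */
    static __thread bool in_io_worker;

    /* Stand-in for the raw block driver's read entry point. */
    static void raw_read(long offset, long length)
    {
        if (in_io_worker) {
            /* Already in a worker: issue the preadv()/pwritev() directly
             * instead of bouncing to yet another worker thread. */
            printf("direct I/O from worker thread: %ld+%ld\n", offset, length);
        } else {
            /* Main thread: hand the request to the thread pool as before. */
            printf("queue %ld+%ld to the thread pool\n", offset, length);
        }
    }

    static void *worker_main(void *arg)
    {
        (void)arg;
        in_io_worker = true;
        raw_read(0, 4096);
        return NULL;
    }

    int main(void)
    {
        raw_read(0, 4096);              /* main-thread path */
        pthread_t t;
        pthread_create(&t, NULL, worker_main, NULL);
        pthread_join(&t, NULL);
        return 0;
    }
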
>> For asynchronous operations, this is harder.  We may add
>> <code>bdrv_queue_aio_read</code> and <code>bdrv_queue_aio_write</code>
>> to replace a
>>
>>      bdrv_aio_read()
>>      mutex_unlock(bs.mutex)
>>      return;
>>
>> sequence.  Alternatively, we can just eliminate asynchronous calls.  To
>> retain concurrency we drop the mutex while performing the operation,
>> converting a <code>bdrv_aio_read</code> to:
>>
>>      mutex_unlock(bs.mutex)
>>      bdrv_read()
>>      mutex_lock(bs.mutex)
>>
>> This allows the operations to proceed in parallel.
>>
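Purely as a sketch of the queued-AIO variant above (these functions do not
exist today; the names only follow the proposal), the idea would be to record
the request while bs.mutex is held and submit it right after the unlock:

    #include <stdio.h>

    typedef void AioCallback(void *opaque, int ret);

    /* One queued request per in-flight qcow2 operation (illustrative). */
    typedef struct {
        int pending;
        long offset, length;
        AioCallback *cb;
        void *opaque;
    } QueuedAio;

    /* Called with the image mutex held: only records the request. */
    static void bdrv_queue_aio_read(QueuedAio *q, long offset, long length,
                                    AioCallback *cb, void *opaque)
    {
        q->pending = 1;
        q->offset = offset;
        q->length = length;
        q->cb = cb;
        q->opaque = opaque;
    }

    /* Called right after the image mutex is released: actually submits the
     * I/O (stubbed here; real code would go to the AIO backend). */
    static void submit_queued_aio(QueuedAio *q)
    {
        if (q->pending) {
            q->pending = 0;
            q->cb(q->opaque, 0);
        }
    }

    static void done(void *opaque, int ret)
    {
        (void)opaque;
        printf("read completed, ret=%d\n", ret);
    }

    int main(void)
    {
        QueuedAio q = { 0 };
        /* ... with the image mutex held ... */
        bdrv_queue_aio_read(&q, 0, 4096, done, NULL);
        /* ... mutex released here ... */
        submit_queued_aio(&q);
        return 0;
    }
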
>> For asynchronous metadata operations, the code is simplified considerably.
>> Dependency lists that are maintained in metadata caches are replaced by a
>> mutex; instead of adding an operation to a dependency list, acquire the
>> mutex.
>> Then issue your metadata update synchronously.  If there is a lot of
>> contention
>> on the resource, we can batch all updates into a single write:
>>
>>     mutex_lock(l1.mutex)
>>     if not l1.dirty:
>>         l1.future = copy(l1.data)
>>         l1.dirty = True
>>     l1.future[idx] = cluster
>>     mutex_unlock(l1.mutex)
>>
>>     mutex_lock(l1.write_mutex)
>>     mutex_lock(l1.mutex)
>>     if l1.dirty:
>>         tmp = copy(l1.future)
>>         mutex_unlock(l1.mutex)
>>         bdrv_write(tmp)
>>         sync
>>         mutex_lock(l1.mutex)
>>         l1.dirty = tmp != l1.future
>>     mutex_unlock(l1.mutex)
>>     mutex_unlock(l1.write_mutex)
>>
>> == Special casing linux-aio ==
>>
>> There is one case where a worker thread approach is detrimental:
>> <code>cache=none</code> together with <code>aio=native</code>.  We can solve
>> this by checking for the case where we're ready to issue the operation with
>> no metadata I/O:
>>
>>      if mutex_trylock(bs.mutex):
>>         m = metadata_lookup(offset, length)
>>         if m:
>>             bdrv_aio_read(bs, m, offset, length, callback) # or write
>>             mutex_unlock(bs.mutex)
>>             return
>>         mutex_unlock(bs.mutex)
>>      queue_task(operation, offset, length, callback)
>>
>> = Speeding up allocation =
>>
>> When a write grows a qcow2 image, the following operations take place:
>>
>> # clusters are allocated, and the refcount table is updated to reflect this
>> # sync to ensure the allocation is committed
>> # the data is written to the clusters
>> # the L2 table is located; if it doesn't exist, it is allocated and linked
>> # the L2 table is updated
>> # sync to ensure the L2->data pointer is committed
>
> I have been thinking about changing this into:
>
> # clusters are allocated, and the refcount table is updated to reflect this
> # the data is written to the clusters
> # Return success to the caller and schedule a second part to be run
> immediately before the next sync is done for other reasons (e.g. the
> guest requesting a flush)
>
> This is done only before the next sync:
>
> # sync to ensure the allocation is committed
> # the L2 table is located; if it doesn't exist, it is allocated and linked
> # the L2 table is updated
> # One sync for everything that has accumulated
>
> This would leave us with no sync in the typical cluster allocation case.
> The downside is that we're in trouble if we get an I/O error during the
> pre-sync writes. Losing this data is actually okay, though, because the
> guest hasn't flushed yet.
>
> If you extend this to more complicated things like a COW, this will
> involve a few more syncs. We can probably have a queue like "before the
> next sync, do A, B and C; after the second one D; after the third one E
> and F".
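
One way to picture that queue is to tag each deferred action with how many
syncs still have to happen before it may run, and to drain whatever is due
whenever a sync is about to be issued for any reason. A rough sketch, with
all names invented for illustration:

    #include <stdio.h>
    #include <stdlib.h>

    typedef void DeferredFn(void *opaque);

    typedef struct Deferred {
        int syncs_to_wait;          /* 0 = run just before the next sync */
        DeferredFn *fn;
        void *opaque;
        struct Deferred *next;
    } Deferred;

    static Deferred *deferred_head;

    static void defer(int syncs_to_wait, DeferredFn *fn, void *opaque)
    {
        Deferred *d = malloc(sizeof(*d));
        d->syncs_to_wait = syncs_to_wait;
        d->fn = fn;
        d->opaque = opaque;
        d->next = deferred_head;
        deferred_head = d;
    }

    /* Called whenever a sync is about to happen (guest flush, metadata
     * ordering, ...): run everything that is due, age the rest. */
    static void run_deferred_before_sync(void)
    {
        Deferred **p = &deferred_head;
        while (*p) {
            Deferred *d = *p;
            if (d->syncs_to_wait == 0) {
                *p = d->next;
                d->fn(d->opaque);
                free(d);
            } else {
                d->syncs_to_wait--;
                p = &d->next;
            }
        }
        /* the caller issues the actual sync afterwards */
    }

    static void action(void *opaque)
    {
        printf("running: %s\n", (char *)opaque);
    }

    int main(void)
    {
        defer(0, action, "L2 update, before the next sync");
        defer(1, action, "L1 update, before the sync after that");
        run_deferred_before_sync();     /* first sync about to happen  */
        run_deferred_before_sync();     /* second sync about to happen */
        return 0;
    }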

If the metadata update code is smart enough it may be able to combine
multiple L2 updates contained in the same sector, for example.
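
For instance (a toy sketch, not real code), two pending L2 entry updates can
share a single sector write whenever their byte offsets land in the same
sector:

    #include <stdint.h>
    #include <stdio.h>

    #define SECTOR_SIZE 512

    /* A pending L2 entry update: byte offset of the entry in the image
     * file plus the new value (illustrative structure). */
    typedef struct {
        uint64_t file_offset;
        uint64_t new_value;
    } PendingUpdate;

    /* Two updates can be folded into one write if their entries live in
     * the same sector. */
    static int same_sector(const PendingUpdate *a, const PendingUpdate *b)
    {
        return a->file_offset / SECTOR_SIZE == b->file_offset / SECTOR_SIZE;
    }

    int main(void)
    {
        PendingUpdate a = { 0x10008, 0xaaaa }, b = { 0x10010, 0xbbbb };
        printf("combine into one write: %s\n", same_sector(&a, &b) ? "yes" : "no");
        return 0;
    }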

If you have support for barriers/dependencies between metadata
updates, can this mechanism also be used to update the refcount table?
For example:

# Write new data to a free cluster

Out-of-line:
# Increment the refcount
# Then, the L2 table is located; if it doesn't exist, it is allocated
# the L2 table is updated
# Then, if the L2 table was allocated, the L1 table is updated

(where "Then," denotes an ordering relationship enforced by a sync)

It feels like this route is powerful but becomes very complex, and
errors that violate data integrity are subtle and may not be noticed.

Stefan


