
[Qemu-devel] Re: qcow2 performance plan


From: Kevin Wolf
Subject: [Qemu-devel] Re: qcow2 performance plan
Date: Tue, 14 Sep 2010 16:01:26 +0200
User-agent: Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.9.1.12) Gecko/20100907 Fedora/3.0.7-1.fc12 Thunderbird/3.0.7

On 14.09.2010 15:07, Avi Kivity wrote:
>   Here's a draft of a plan that should improve qcow2 performance.  It's 
> written in wiki syntax for eventual upload to wiki.qemu.org; lines 
> starting with # are numbered lists, not comments.
> 
> = Basics =
> 
> At a minimum, no operation should block the main thread.  This
> could be done in two ways: extending the state machine so that each
> blocking operation can be performed asynchronously (<code>bdrv_aio_*</code>),
> or by threading, where each new operation is handed off to a worker thread.
> Since a full state machine is prohibitively complex, this document
> will discuss threading.
> 
> == Basic threading strategy ==
> 
> A first iteration of qcow2 threading adds a single mutex to an image.
> The existing qcow2 code is then executed within a worker thread,
> acquiring the mutex before starting any operation and releasing it
> after completion.  Concurrent operations will simply block until the
> operation is complete.  For operations which are already asynchronous,
> the blocking time will be negligible since the code will call
> <code>bdrv_aio_{read,write}</code> and return, releasing the mutex.
> The immediate benefit is that currently blocking operations no longer block
> the main thread; instead, they only block the block operation, which is
> blocking anyway.
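
Just to make the shape concrete, here is roughly how I picture this first
iteration, as a small runnable Python sketch (the class, the thread pool and
the names are mine, not part of the draft; in qemu this would of course be C
and the existing AIO infrastructure):

    import threading
    from concurrent.futures import ThreadPoolExecutor

    class QcowImage:
        def __init__(self):
            self.mutex = threading.Lock()          # one mutex per open image
            self.workers = ThreadPoolExecutor(max_workers=4)

        def submit(self, op, *args):
            # the main loop never blocks: it only hands the request to a worker
            return self.workers.submit(self._run_locked, op, *args)

        def _run_locked(self, op, *args):
            # the existing synchronous qcow2 code runs under the image mutex;
            # concurrent requests wait here, in a worker thread, instead of
            # blocking the main thread
            with self.mutex:
                return op(*args)

The future returned by submit() can carry the completion callback via
add_done_callback(), playing the role of the AIO callback.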
> 
> == Eliminating the threading penalty ==
> 
> We can eliminate pointless context switches by using the worker thread
> context we're in to issue the I/O.  This is trivial for synchronous calls
> (<code>bdrv_read</code> and <code>bdrv_write</code>); we simply issue the
> I/O from the same thread we're currently in.  The underlying raw block
> format driver's threading code needs to recognize that we're already in a
> worker thread context so that it doesn't spawn a worker thread of its own,
> perhaps by using a thread-local variable to check whether it is running in
> the main thread or in an I/O worker thread.
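
For the thread-local variable, I suppose something like this is meant
(Python's threading.local standing in for a thread-local flag; the queue and
the function names are made up):

    import queue
    import threading

    _tls = threading.local()                   # thread-local flag, set only in I/O workers
    _io_queue = queue.Queue()

    def in_io_worker():
        return getattr(_tls, "in_io_worker", False)

    def io_worker_main():
        _tls.in_io_worker = True               # mark this thread as an I/O worker
        while True:
            fn = _io_queue.get()
            if fn is None:
                break
            fn()

    def raw_submit(do_io):
        # the raw driver checks the flag: if we are already in an I/O worker,
        # issue the I/O directly and save a context switch; otherwise hand it
        # off to the worker thread as before
        if in_io_worker():
            do_io()
        else:
            _io_queue.put(do_io)

    threading.Thread(target=io_worker_main, daemon=True).start()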
> 
> For asynchronous operations, this is harder.  We may add a
> <code>bdrv_queue_aio_read</code> and <code>bdrv_queue_aio_write</code>
> to replace a
> 
>      bdrv_aio_read()
>      mutex_unlock(bs.mutex)
>      return;
> 
> sequence.  Alternatively, we can just eliminate asynchronous calls.  To
> retain concurrency, we drop the mutex while performing the operation
> and convert a <code>bdrv_aio_read</code> to:
> 
>      mutex_unlock(bs.mutex)
>      bdrv_read()
>      mutex_lock(bs.mutex)
> 
> This allows the operations to proceed in parallel.
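
In other words, a synchronous read would end up looking roughly like this
(sketch only; lookup_mapping and bdrv_read below are placeholders for the
real metadata lookup and the raw read):

    import threading
    import time

    class BlockState:
        def __init__(self):
            self.mutex = threading.Lock()

    def lookup_mapping(bs, offset):
        return offset                            # pretend L1/L2 lookup (identity mapping)

    def bdrv_read(file_offset, length):
        time.sleep(0.05)                         # stands in for a blocking preadv()
        return bytes(length)

    def qcow2_read(bs, offset, length):
        bs.mutex.acquire()
        file_offset = lookup_mapping(bs, offset) # metadata access stays under the mutex
        bs.mutex.release()                       # drop the mutex around the blocking read...
        data = bdrv_read(file_offset, length)
        bs.mutex.acquire()                       # ...and take it again for any follow-up
        bs.mutex.release()                       # metadata work before completing
        return data

    bs = BlockState()
    threads = [threading.Thread(target=qcow2_read, args=(bs, i * 4096, 4096))
               for i in range(4)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()                                 # the four reads overlap instead of serializing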
> 
> For asynchronous metadata operations, the code is simplified considerably.
> Dependency lists that are maintained in metadata caches are replaced by a
> mutex; instead of adding an operation to a dependency list, acquire the
> mutex.  Then issue your metadata update synchronously.  If there is a lot
> of contention on the resource, we can batch all updates into a single
> write:
> 
>     mutex_lock(l1.mutex)
>     if not l1.dirty:
>         l1.future = copy(l1.data)    # start a new batch of pending updates
>         l1.dirty = True
>     l1.future[idx] = cluster
>     mutex_unlock(l1.mutex)           # don't hold it while waiting for the writer
>     mutex_lock(l1.write_mutex)       # one table write at a time
>     mutex_lock(l1.mutex)
>     if l1.dirty:
>         tmp = copy(l1.future)        # snapshot the accumulated batch
>         mutex_unlock(l1.mutex)
>         bdrv_write(tmp)
>         sync
>         mutex_lock(l1.mutex)
>         l1.dirty = tmp != l1.future  # still dirty if more updates arrived meanwhile
>     mutex_unlock(l1.mutex)
>     mutex_unlock(l1.write_mutex)
> 
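
As actual code, I read the batching as something like this (runnable Python
just to check the logic; bdrv_write_table is a stub and the copies are my
reading of what l1.future is supposed to mean):

    import threading

    class L1Table:
        def __init__(self, size):
            self.data = [0] * size                # contents as currently on disk
            self.future = None                    # pending contents, valid while dirty
            self.dirty = False
            self.mutex = threading.Lock()         # protects data/future/dirty
            self.write_mutex = threading.Lock()   # serializes the actual table writes

    def bdrv_write_table(entries):
        pass                                      # stands in for write + sync of the table

    def update_l1(l1, idx, cluster):
        with l1.mutex:
            if not l1.dirty:
                l1.future = list(l1.data)         # start a new batch from the on-disk state
                l1.dirty = True
            l1.future[idx] = cluster
        with l1.write_mutex:                      # only one table writer at a time
            with l1.mutex:
                if not l1.dirty:
                    return                        # somebody else already wrote our update
                tmp = list(l1.future)             # snapshot the accumulated batch
            bdrv_write_table(tmp)                 # write + sync without holding l1.mutex
            with l1.mutex:
                l1.data = tmp
                l1.dirty = tmp != l1.future       # still dirty if more updates came in meanwhile

Two concurrent allocations then coalesce into a single table write whenever
their updates land before the first writer takes its snapshot.
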
> == Special casing linux-aio ==
> 
> There is one case where a worker thread approach is detrimental:
> <code>cache=none</code> together with <code>aio=native</code>.  We can solve
> this by checking for the case where we're ready to issue the operation with
> no metadata I/O:
> 
>      if mutex_trylock(bs.mutex):
>          m = metadata_lookup(offset, length)
>          if m:
>              bdrv_aio_read(bs, m, offset, length, callback) # or write
>              mutex_unlock(bs.mutex)
>              return
>          mutex_unlock(bs.mutex)   # no cached mapping; fall back to the worker thread
>      queue_task(operation, offset, length, callback)
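
As a runnable illustration of that fast path (made-up helpers again; in
reality the lookup would hit the metadata caches and bdrv_aio_read would go
to linux-aio):

    import queue
    import threading

    task_queue = queue.Queue()                   # requests that do need the worker thread

    class BlockState:
        def __init__(self):
            self.mutex = threading.Lock()
            self.mapping = {}                    # guest offset -> file offset, if cached

    def metadata_lookup(bs, offset, length):
        return bs.mapping.get(offset)            # None means metadata I/O would be needed

    def bdrv_aio_read(bs, file_offset, length, callback):
        callback(bytes(length))                  # stands in for a native aio submission

    def submit_read(bs, offset, length, callback):
        # fast path: mapping already cached, so submit the aio request directly
        # from the main thread and never touch a worker
        if bs.mutex.acquire(blocking=False):
            try:
                m = metadata_lookup(bs, offset, length)
                if m is not None:
                    bdrv_aio_read(bs, m, length, callback)
                    return
            finally:
                bs.mutex.release()
        # slow path: metadata I/O (or lock contention), go through the worker thread
        task_queue.put((offset, length, callback))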
> 
> = Speeding up allocation =
> 
> When a write grows a qcow2 image, the following operations take place:
> 
> # clusters are allocated, and the refcount table is updated to reflect this
> # sync to ensure the allocation is committed
> # the data is written to the clusters
> # the L2 table is located; if it doesn't exist, it is allocated and linked
> # the L2 table is updated
> # sync to ensure the L2->data pointer is committed

I have been thinking about changing this into:

# clusters are allocated, and the refcount table is updated to reflect this
# the data is written to the clusters
# Return success to the caller and schedule a second part to be run
immediately before the next sync is done for other reasons (e.g. the
guest requesting a flush)

The deferred part then runs only right before the next sync and does the following:

# sync to ensure the allocation is committed
# the L2 table is located; if it doesn't exist, it is allocated and linked
# the L2 table is updated
# One sync for everything that has accumulated

This would leave us with no sync in the typical cluster allocation case.
The downside is that we're in trouble if we get an I/O error during the
pre-sync writes. Losing this data is actually okay, though, because the
guest hasn't flushed yet.

If you extend this to more complicated things like COW, this will
involve a few more syncs. We can probably have a queue like "before the
next sync, do A, B and C; after the second one, D; after the third one, E
and F".
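
Something like this is what I have in mind for that queue, as a Python sketch
just to pin the semantics down (defer/flush and the after_syncs counting are
made up, and error handling is ignored):

    class SyncQueue:
        """Deferred metadata work, keyed by how many syncs must happen first.

        pending[0] runs right before the next sync, pending[1] only after that
        sync has completed (and before the one after it), and so on.
        """
        def __init__(self, sync):
            self.sync = sync                     # e.g. fdatasync of the image file
            self.pending = []

        def defer(self, fn, after_syncs=0):
            while len(self.pending) <= after_syncs:
                self.pending.append([])
            self.pending[after_syncs].append(fn)

        def flush(self):
            # a guest flush drains everything: run each group, sync, then move
            # on to the work that was waiting for that sync
            if not self.pending:
                self.sync()
                return
            while self.pending:
                for fn in self.pending.pop(0):
                    fn()                         # e.g. the delayed L2 update
                self.sync()

A plain cluster allocation would defer its L2 update with after_syncs=1 (one
sync has to commit the refcount update and the data first), so the common
write path itself issues no sync at all.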

Kevin


