qemu-devel

Re: [Qemu-devel] [RFC] qed: Add QEMU Enhanced Disk format


From: Kevin Wolf
Subject: Re: [Qemu-devel] [RFC] qed: Add QEMU Enhanced Disk format
Date: Mon, 13 Sep 2010 13:03:06 +0200
User-agent: Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.9.1.12) Gecko/20100907 Fedora/3.0.7-1.fc12 Thunderbird/3.0.7

On 12.09.2010 17:56, Avi Kivity wrote:
>>>> To me, the biggest burden in qcow2 is thinking through how you deal 
>>>> with shared resources.  Because you can block for a long period of 
>>>> time during write operations, it's not enough to just carry a mutex 
>>>> during all metadata operations.  You have to stage operations and 
>>>> commit them at very specific points in time.
>>>
>>> The standard way of dealing with this is to have a hash table for 
>>> metadata that contains a local mutex:
>>>
>>>     l2cache = defaultdict(L2)
>>>
>>>     def get_l2(pos):
>>>         l2 = l2cache[pos]
>>>         l2.mutex.lock()
>>>         if not l2.valid:
>>>              l2.pos = pos
>>>              l2.read()
>>>              l2.valid = True
>>>         return l2
>>>
>>>     def put_l2(l2):
>>>         if l2.dirty:
>>>             l2.write()
>>>             l2.dirty = False
>>>         l2.mutex.unlock()
>>
>> You're missing how you create entries.  That means you've got to do:
>>
>> def put_l2(l2):
>>     if l2.committed:
>>         if l2.dirty:
>>             l2.write()
>>             l2.dirty = False
>>         l2.mutex.unlock()
>>     else:
>>         l2.mutex.lock()
>>         l2cache[l2.pos] = l2
>>         l2.mutex.unlock()
> 
> The in-memory L2 is created by defaultdict().  I did omit linking L2 
> into L1, but that's a function call.  With a state machine, it's a new 
> string of states and calls.
> 
>>
>> And this really illustrates my point.  It's a harder problem than it 
>> seems.  You also are keeping l2 reads from occurring when flushing a 
>> dirty l2 entry which is less parallel than what qed achieves today.
> 
> There are standard threading primitives like shared/exclusive locks or 
> barriers that can be used to increase concurrency.  It's nowhere near as 
> brittle as modifying a state machine.
> 
>>
>> This is part of why I prefer state machines.  Acquiring a mutex is too 
>> easy and it makes it easy to not think through what all could be 
>> running.  When you are more explicit about when you are allowing 
>> concurrency, I think it's easier to be more aggressive.
>>
>> It's a personal preference really.  You can find just as many folks on 
>> the intertubes that claim Threads are Evil as claim State Machines are 
>> Evil.
> 
> The dark side of the force is tempting.
> 
>> The only reason we're discussing this is you've claimed QEMU's state 
>> machine model is the biggest inhibitor and I think that's 
>> oversimplifying things.  It's like saying QEMU's biggest problem is that 
>> too many of its developers use vi versus emacs.  You may personally 
>> believe that vi is entirely superior to emacs but by the same token, 
>> you should be able to recognize that some people are able to be 
>> productive with emacs.
>>
>> If someone wants to rewrite qcow2 to be threaded, I'm all for it.  I 
>> don't think it's really any simpler than making it a state machine.  I 
>> find it hard to believe you think there's an order of magnitude 
>> difference in development work too.
> 
> Kevin is best positioned to comment on this.

Not sure. Anthony claims there's not much difference in how hard it is
to implement either way. I'm not going to contradict because I haven't
thought a lot about the complicating details this would involve, but in
any case I don't think a state machine would be easier to implement than
a threaded model either.

However, even if we assume that it's the same amount of work, the result
of that work is a different one. The state machine model would tend to
clutter up the code even more than it already is, impacting readability.
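
For comparison, here is a minimal runnable sketch of the threaded model
quoted above (one mutex per cached L2 table). It is only an illustration,
not qcow2 code: the L2 class, its read()/write() stand-ins and the
l2cache_lock guard around the dictionary lookup are assumptions added so
that the pseudocode executes.

```python
import threading
from collections import defaultdict


class L2:
    """Hypothetical in-memory L2 table guarded by its own mutex."""

    def __init__(self):
        self.mutex = threading.Lock()
        self.pos = None
        self.valid = False
        self.dirty = False
        self.reads = 0   # counts simulated disk reads
        self.writes = 0  # counts simulated disk writes

    def read(self):
        # Stand-in for a blocking read of the table from disk.
        self.reads += 1

    def write(self):
        # Stand-in for a blocking write of the table to disk.
        self.writes += 1


l2cache = defaultdict(L2)
# Assumption: entry creation is serialized so two threads cannot each
# create an L2 object for the same position.
l2cache_lock = threading.Lock()


def get_l2(pos):
    """Return the L2 table for pos with its mutex held."""
    with l2cache_lock:
        l2 = l2cache[pos]
    l2.mutex.acquire()
    if not l2.valid:
        l2.pos = pos
        l2.read()
        l2.valid = True
    return l2


def put_l2(l2):
    """Flush the table if dirty, then release its mutex."""
    if l2.dirty:
        l2.write()
        l2.dirty = False
    l2.mutex.release()
```

A caller would do l2 = get_l2(pos), modify the table, set l2.dirty = True
and call put_l2(l2). The blocking I/O sits behind ordinary function calls
here, whereas a state machine has to split each of them into a separate
state.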

>>>> Refcount table.  See above discussion  for my thoughts on refcount 
>>>> table.
>>>
>>> Ok.  It boils down to "is fsck on startup acceptable".  Without a 
>>> freelist, you need fsck for both unclean shutdown and for UNMAP.
>>
>> To rebuild the free list on unclean shutdown.
> 
> If you have an on-disk compact freelist, you don't need that fsck.  If 
> your freelist is the L2 table, then you need that fsck to find out if 
> you have any holes in your image.
> 
> On the other hand, allocating a cluster in qcow2 as it is now requires 
> scanning the refcount table.  Not very pretty.  Kevin, how does that 
> perform?

Probably not very well in the worst case (which is: free a cluster at
the start of a fully allocated image, allocate two clusters). We
probably get away with it today because typical image sizes have only a
couple of refcount blocks and because we don't free clusters a lot anyway.

If QED can manage a free list in memory, though, there's no reason why
the same shouldn't work for qcow2.
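
A rough sketch of what such an in-memory free list could look like,
assuming the refcounts are available as a flat list with one entry per
cluster. The ClusterAllocator class and its method names are hypothetical
and do not correspond to existing qcow2 or QED code.

```python
import heapq


class ClusterAllocator:
    """Hypothetical allocator keeping free cluster indexes in a min-heap."""

    def __init__(self, refcounts):
        self.refcounts = list(refcounts)
        # One scan at open time instead of one scan per allocation.
        self.free = [i for i, rc in enumerate(self.refcounts) if rc == 0]
        heapq.heapify(self.free)

    def alloc(self):
        """Return the lowest free cluster, growing the image if none is free."""
        if self.free:
            idx = heapq.heappop(self.free)
        else:
            idx = len(self.refcounts)
            self.refcounts.append(0)
        self.refcounts[idx] = 1
        return idx

    def free_cluster(self, idx):
        """Drop one reference; remember the cluster once it becomes unused."""
        if self.refcounts[idx] == 0:
            raise ValueError("cluster %d is already free" % idx)
        self.refcounts[idx] -= 1
        if self.refcounts[idx] == 0:
            heapq.heappush(self.free, idx)
```

In the worst case described above (free a cluster at the start of a fully
allocated image, then allocate two clusters), neither allocation has to
walk the refcounts again: the first reuses the freed cluster and the
second extends the image.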

>>>> Yes, you'll want to have that regardless.  But adding new things to 
>>>> qcow2 has all the problems of introducing a new image format.
>>>
>>> Just some of them.  On mount, rewrite the image format as qcow3.  On 
>>> clean shutdown, write it back to qcow2.  So now there's no risk of 
>>> data corruption (but there is reduced usability).
>>
>> It means on unclean shutdown, you can't move images to older 
>> versions.  That means a management tool can't rely on the mobility of 
>> images which means it's a new format for all practical purposes.
>>
>> QED started its life as qcow3.  You start with qcow3, remove the 
>> features that are poorly thought out and make correctness hard, add 
>> some future proofing, and you're left with QED.
>>
>> We're fully backwards compatible with qcow2 (by virtue of qcow2 still 
>> being in tree) but new images require new versions of QEMU.  That 
>> said, we have a conversion tool to convert new images to the old 
>> format if mobility is truly required.
>>
>> So it's the same story that you're telling above from an end-user 
>> perspective.
> 
> It's not exactly the same story (you can enable it selectively, or you 
> can run fsck before moving) but I agree it isn't a good thing.

The real reason why it's not the same story is that a qcow3 would be
backwards compatible. Old images would just work as qcow3 by changing
the version number in the header. Even if they are on a block device.
Even if they are encrypted. Even if they are compressed. Even if they
have internal snapshots.
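
To illustrate the point only: a qcow2 header starts with the magic bytes
"QFI\xfb" followed by a big-endian 32-bit version number, so the upgrade
argued for here amounts to rewriting one field. The sketch below works on
a bytes buffer; bump_version is a hypothetical helper, and the assumption
that nothing else in the header changes is the one made in this mail, not
a description of any released format.

```python
import struct

QCOW_MAGIC = b"QFI\xfb"


def bump_version(header: bytes, new_version: int = 3) -> bytes:
    """Return a copy of header with only the version field rewritten."""
    magic, version = struct.unpack_from(">4sI", header, 0)
    if magic != QCOW_MAGIC:
        raise ValueError("not a qcow image")
    if version != 2:
        raise ValueError("expected a version 2 header, got %d" % version)
    # Bytes 0-3 hold the magic, bytes 4-7 the version; the rest is untouched.
    return header[:4] + struct.pack(">I", new_version) + header[8:]
```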

We would have just one driver instead of two, and we could tell people
to upgrade their images without having to care about feature loss.

Kevin


