qemu-devel
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [Qemu-devel] CoW image commit+shrink(= make_empty) support


From: Kevin Wolf
Subject: Re: [Qemu-devel] CoW image commit+shrink(= make_empty) support
Date: Mon, 11 Jun 2012 09:57:44 +0200
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:12.0) Gecko/20120430 Thunderbird/12.0.1

Am 09.06.2012 18:52, schrieb Jeff Cody:
> On 06/08/2012 05:08 PM, Kevin Wolf wrote:
>> Am 08.06.2012 20:33, schrieb Jeff Cody:
>>> On 06/08/2012 01:57 PM, Kevin Wolf wrote:
>>>> The question is what the speed parameter really means for a live commit
>>>> block job (or for an active mirror). I think it would make sense to
>>>> limit only the speed of I/O that doesn't come from the guest but is from
>>>> the background copying.
>>>>
>>>> If you did apply it to guest I/O as well, then I think the logical
>>>> consequence would be that guest I/O is throttled as well because
>>>> requests only complete when they have reached both source and target.
>>>>
>>>> (I know I'm happily mixing terminology of all kinds of live block jobs
>>>> here, hope you understand anyway what I mean ;-))
>>>
>>> I understand what you mean :) My thought is that the block commit
>>> 'speed' parameter would essentially only apply to 'old data', that is
>>> being copied in the commit coroutine, and it would be ignored (violated)
>>> for 'new data' in the mirroring-like case.
>>>
>>> I assumed we wouldn't want to throttle the guest I/O, but that may be a
>>> bad assumption on my part.
>>
>> I don't disagree, I just wanted to point out that throttling the guest
>> would be another option. Applying the limit only to old data probably
>> makes most sense indeed.
>>
> 
> Given that, perhaps it does make sense to get it implemented in two
> phases - first the straightforward block commit for images below the
> active layer, and then a follow-up patch set to implement committing in
> the top layer alongside an active mirroring approach for convergence.

Yes, perhaps that does make sense. I'm not sure how much of it we'll be
able to keep when we implement the active mirror approach, and possibly
we'll just replace the initial code wholesale then, but doing everything
at once sounds too big to be manageable.

>>>>>>> I am happy to do it that way.  I'll shift my focus to the atomic image
>>>>>>> reopen in r/w mode.  I'll go ahead and post my diagrams and other info
>>>>>>> for block-commit on the wiki, because I don't think it conflicts with we
>>>>>>> discussed above (although I will modify my diagrams to not show commit
>>>>>>> from the top-level image).  Of course, feel free to change it as
>>>>>>> necessary.
>>>>>>
>>>>>> I may have mentioned it before, but just in case: I think Supriya's
>>>>>> bdrv_reopen() patches are a good starting point. I don't know why they
>>>>>> were never completed, but I think we all agreed on the general design,
>>>>>> so it should be possible to pick that up.
>>>>>>
>>>>>> Though if you have already started with your own work on it, Jeff, I
>>>>>> expect that it won't be much different because it's basically the same
>>>>>> transactional approach that you know and that we already used for group
>>>>>> snapshots.
>>>>>
>>>>> I will definitely use parts of Supriya's as it makes sense - what I
>>>>> started work on is similar to bdrv_append() and the current transaction
>>>>> approach, so there will be plenty in common to reuse, even with some
>>>>> differences.
>>>>
>>>> I'm not sure how far a bdrv_append-like approach will work in this case.
>>>> The problem is that you must not open any image r/w twice. The policy is
>>>> that you can open an image either once r/w or multiple times read-only.
>>>
>>> I think there are 3 possibilities that would be OK: open once r/w, one
>>> or more times r/o, or once r/w + one or more times r/o.  Do you agree
>>> with this?  Is the latter case against block-layer policy?
>>
>> Yes, it is. The problem is that image formats have data cached in
>> memory, and as soon as you have a writer, the other users of the same
>> image file - no matter whether r/w or r/o - get out of sync. They might
>> mix old (cached) matadata with other new metadata, and I think you can
>> imagine what mess would result.
>>
>> Strictly speaking even raw caches things like the image size, so if the
>> r/w instance happens to resize the image, the additional r/o readers
>> would have a problem even with raw.
>>
> 
> OK, thanks. I should have qualified that with saying, for this
> application, that it is not intended to have the images open r/w and r/o
> for ongoing operations; it is just for the purpose of overlapping the
> closing of the r/o instance.

It's quite possible that it works under these circumstances. Note
however that after finishing the merge you will also want to turn it
back into an r/o image. The very least you need there is a bdrv_flush(),
and I don't think it's enough in all cases (basically you get a problem
whenever the image format uses something like QED mode where the image
becomes consistent only during a clean shutdown).

Things are becoming really tricky quite quickly, and this is why I
prefer sticking to the strict "one r/w or many r/o" rule, even though
special cases may happen to work in violation of this rule.

>>> In this special case for block commit, we need to do the reopen() to go
>>> from a r/o image to a r/w image.  If an image is opened r/o, the drv
>>> layer can do another open with a new descriptor, this time r/w, without
>>> closing the r/o instance.  If the new r/w open was successful, then the
>>> r/o instance can be closed, and some bdrv_append()-like operations done.
>>> Otherwise, like the current transactional code, we can just 'walk away'
>>> by only closing the new image (if it got that far), and the existing BDS
>>> with the r/o image is safely unchanged (and never closed, so no
>>> unrecoverable error paths).
>>
>> Yes. It's just that you need to be very careful. You would ask the qcow2
>> driver to reopen the image, and it would take care to invalidate any
>> cached data and then pass the request on to the next layer. When you
>> have drilled down through the stack to raw-posix, it would actually open
>> the second file descriptor and wait for a commit/abort that goes through
>> the whole stack again.
> 
> That makes sense, and I agree that it is a bit tricky.  The other
> question is what protocol / drivers will support this sort of operation,
> as well. And I certainly think that using Supriya's patches will prove
> useful.

It might be a bit tricky, but a whole lot less ugly than bdrv_append. ;-)

For drivers that don't implement the operation, I think we can use a
generic .close/.open sequence. Of course, this breaks the transactional
semantics, but it is very unlikely that an error occurs during opening
the image format layer. If we get any error, it's almost for sure in the
lowest layer.

Kevin



reply via email to

[Prev in Thread] Current Thread [Next in Thread]