Re: [Qemu-devel] Live migration without bdrv_drain_all()
From: Juan Quintela
Subject: Re: [Qemu-devel] Live migration without bdrv_drain_all()
Date: Wed, 28 Sep 2016 11:03:15 +0200
User-agent: Gnus/5.13 (Gnus v5.13) Emacs/25.1 (gnu/linux)
"Dr. David Alan Gilbert" <address@hidden> wrote:
> * Stefan Hajnoczi (address@hidden) wrote:
>> On Mon, Aug 29, 2016 at 06:56:42PM +0000, Felipe Franciosi wrote:
>> > Heya!
>> >
>> > > On 29 Aug 2016, at 08:06, Stefan Hajnoczi <address@hidden> wrote:
>> > >
>> > > At KVM Forum an interesting idea was proposed to avoid
>> > > bdrv_drain_all() during live migration. Mike Cui and Felipe Franciosi
>> > > mentioned running at queue depth 1. It needs more thought to make it
>> > > workable but I want to capture it here for discussion and to archive
>> > > it.
>> > >
>> > > bdrv_drain_all() is synchronous and can cause VM downtime if I/O
>> > > requests hang. We should find a better way of quiescing I/O that is
>> > > not synchronous. Up until now I thought we should simply add a
>> > > timeout to bdrv_drain_all() so it can at least fail (and live
>> > > migration would fail) if I/O is stuck instead of hanging the VM. But
>> > > the following approach is also interesting...
>> > >
>> > > During the iteration phase of live migration we could limit the queue
>> > > depth so points with no I/O requests in-flight are identified. At
>> > > these points the migration algorithm has the opportunity to move to
>> > > the next phase without requiring bdrv_drain_all() since no requests
>> > > are pending.
>> >
>> > I actually think that this "I/O quiesced state" is highly unlikely
>> > to _just_ happen on a busy guest. The main idea behind running at
>> > QD1 is to naturally throttle the guest and make it easier to
>> > "force quiesce" the VQs.
>> >
>> > In other words, if the guest is busy and we run at QD1, I would
>> > expect the rings to be quite full of pending (i.e. unprocessed)
>> > requests. At the same time, I would expect a call to
>> > bdrv_drain_all() (as part of do_vm_stop()) to complete much more
>> > quickly.
>> >
>> > Nevertheless, you mentioned that this is still problematic as that
>> > single outstanding IO could block, leaving the VM paused for
>> > longer.
>> >
>> > My suggestion is therefore that we leave the vCPUs running, but
>> > stop picking up requests from the VQs. Provided nothing blocks,
>> > you should reach the "I/O quiesced state" fairly quickly. If you
>> > don't, then the VM is at least still running (despite seeing no
>> > progress on its VQs).
>> >
>> > Thoughts on that?
>>
>> If the guest experiences a hung disk it may enter error recovery. QEMU
>> should avoid this so the guest doesn't remount file systems read-only.
>>
>> This can be solved by only quiescing the disk for, say, 30 seconds at a
>> time. If we don't reach a point where live migration can proceed during
>> those 30 seconds then the disk will service requests again temporarily
>> to avoid upsetting the guest.
>>
>> I wonder if Juan or David have any thoughts from the live migration
>> perspective?
>
> Throttling IO to reduce the time in the final drain makes sense
> to me, however:
> a) It doesn't solve the problem if the IO device dies at just the wrong
>    time, so you can still get that hang in bdrv_drain_all.
>
> b) Completely stopping guest IO sounds too drastic to me unless you can
> time it to be just at the point before the end of migration; that feels
> tricky to get right unless you can somehow tie it to an estimate of
> remaining dirty RAM (that never works that well).
>
> c) Something like a 30 second pause still feels too long; if that was
> a big hairy database workload it would effectively be 30 seconds
> of downtime.
>
> Dave
I think something like the proposed thing could work.
We can put queue depth = 1 or some such when we know we are near
completion of migration. What we need then is a way to call the
equivalent of bdrv_drain_all() that returns EAGAIN or EBUSY if it is a
bad moment. In that case, we just do another round over the whole
memory, or retry in X seconds. Anything is good for us; we just need a
way to ask for the operation without having it block.
Notice that migration is the equivalent of:
    while (true) {
        write_some_dirty_pages();
        if (dirty_pages < threshold) {
            break;
        }
    }
    bdrv_drain_all();
    write_rest_of_dirty_pages();
(Lots and lots of details omitted.)
What we really want is to issue the equivalent of the bdrv_drain_all()
call inside the while loop, so that if there is any problem we just do
another cycle.
Later, Juan.