
From: Kevin Wolf
Subject: Re: [Qemu-block] [Qemu-devel] [PATCH] migration: flush the bdrv before stopping VM
Date: Wed, 18 Mar 2015 12:17:09 +0100
User-agent: Mutt/1.5.21 (2010-09-15)

[ Cc: qemu-block ]

On 18.03.2015 at 04:19, Li, Liang Z wrote:
> > This needs further review/changes on the block layer.
> > 
> > First, an explanation of why I think this doesn't fix the full problem.
> > With this patch, we move from the problem of having a dirty block layer
> > but basically nothing dirtying the memory on the guest (we are moving the
> > 20 seconds of block layer flushing out of max_downtime) to the point
> > where we have decided that the amount of dirty memory is small enough to
> > be transferred during max_downtime.  But it is still going to take 20
> > seconds to flush the block layer, and during those 20 seconds, the amount
> > of memory that can get dirty is HUGE.
> 
> It's true.

What kind of cache is it actually that takes 20s to flush here?

Inside of qemu, we don't cache a whole lot. If you use qcow2, we do use
a writeback cache for some metadata that might need to be written back,
but it is small and shouldn't take any significant time.

Then we have the kernel page cache or, for some network protocols, caches
in the respective libraries. This sounds like the right size for a 20s
stall, but don't we require cache.direct=on for live migration anyway for
coherency, i.e. bypassing any such cache?

Lastly, there may be a disk cache, but that is far too small as well.

> > I think our options are:
> > 
> > - tell the block layer at the beginning of migration:
> >   Hey, we are migrating, could you please start flushing data now, and
> >   don't let the caches grow too much, please, pretty please.
> >   (I left the API to the block layer)
> > - Add at that point a new function:
> >   bdrv_flush_all_start()
> >   That starts the flushing, and we "hope" that by the time we have
> >   migrated all memory, it has also finished (so our last call to
> >   bdrv_flush_all() has less work to do)
> > - Add another function:
> >   int bdrv_flush_all_timeout(int timeout)
> >   that returns when the timeout passes, telling us whether everything
> >   has been flushed or the timeout expired.  So we can go back to the
> >   iterative stage if it has taken too long.

The problem is that the block layer really doesn't have an option to
control what is getting synced once the data is cached outside of qemu.
Essentially we can do an fdatasync() or we can leave it, that's the only
choice we have.

Now doing an asynchronous fdatasync() in the background is completely
reasonable in order to reduce the amount of data to be flushed later.
But the patch is doing it while holding both the BQL and the AIOContext
lock of the device, which doesn't feel right. Maybe it should schedule a
BH in the AIOContext instead and flush from there asynchronously.


The other thing is that flushing once doesn't mean that new data isn't
accumulating in the cache, especially if you decide to do the background
flush at the start of the migration.

The obvious way to avoid that would be to switch to a writethrough mode,
so any write request goes directly to the disk. This will, however,
impact performance so heavily that it's probably not a realistic option.

An alternative approach could be to repeat the background flush
periodically, either time based or after every x bytes that are written
to a device. Time based should actually be quite easy to implement.


Once we have the flushing in the background, the migration code can
apply any timeouts it wants. If the timeout is exceeded, the flush
wouldn't be aborted, but would keep running in the background, and
migration could go back to the iterative state anyway.

> > Notice that *normally* bdrv_flush_all() is very fast; the problem is that
> > sometimes it gets really, really slow (NFS decided to go slow, TCP dropped
> > a packet, whatever).
> > 
> > Right now, we don't have an interface to detect those cases and go back
> > to the iterative stage.
> 
> How about going back to the iterative stage when we detect that the
> pending_size is larger than max_size, like this:
> 
> +                /* Flushing here is aimed at shortening the VM downtime;
> +                 * bdrv_flush_all is a time-consuming operation
> +                 * when the guest has done some file writing */
> +                bdrv_flush_all();
> +                pending_size = qemu_savevm_state_pending(s->file, max_size);
> +                if (pending_size && pending_size >= max_size) {
> +                    qemu_mutex_unlock_iothread();
> +                    continue;
> +                }
>                   ret = vm_stop_force_state(RUN_STATE_FINISH_MIGRATE);
>                   if (ret >= 0) {
>                       qemu_file_set_rate_limit(s->file, INT64_MAX);
> 
> and this is quite simple.

Yes, but it is too simple. If you hold all the locks during
bdrv_flush_all(), your VM will effectively stop as soon as it performs
the next I/O access, so you don't win much. And you still don't have a
timeout for cases where the flush takes really long.

Kevin


