From: Michael S. Tsirkin
Subject: Re: [Qemu-devel] [RFC Design Doc]Speed up live migration by skipping free pages
Date: Wed, 23 Mar 2016 16:08:04 +0200

On Wed, Mar 23, 2016 at 06:05:27AM +0000, Li, Liang Z wrote:
> > > To make things easier, I wrote this doc about the possible designs and
> > > my choices. Comments are welcome!
> > 
> > Thanks for putting this together, and especially for taking the trouble to
> > benchmark existing code paths!
> > 
> > I think these numbers do show that there are gains to be had from merging
> > your code with the existing balloon device. It will probably be a bit more 
> > work,
> > but I think it'll be worth it.
> > 
> > More comments below.
> > 
> 
> Thanks for your comments!
> 
> > > 2. Why not use virtio-balloon
> > > Actually, virtio-balloon can do a similar thing by inflating the
> > > balloon before live migration, but its performance is not good: for an
> > > 8GB idle guest that has just booted, it takes about 5.7 seconds to
> > > inflate the balloon to 7GB, while it only takes 25ms to get a valid
> > > free page bitmap from the guest. There are several reasons for the bad
> > > performance of virtio-balloon:
> > > a. allocating pages (5%, 304ms)
> > 
> > Interesting. This is definitely worth improving in guest kernel.
> > Also, will it be faster if we allocate huge pages and pass those instead?
> > Might speed up madvise as well.
> 
> Maybe.
> 
> > > b. sending PFNs to host (71%, 4194ms)
> > 
> > OK, so we probably should teach the balloon to pass huge lists as bitmaps.
> > That will be beneficial for regular balloon operation as well.
> > 
> 
> Agreed. The current balloon just sends 256 PFNs at a time; that's too few and
> leads to too many virtio transmissions, which is the main reason for the bad
> performance. Changing VIRTIO_BALLOON_ARRAY_PFNS_MAX to a larger value can
> improve the performance significantly. Maybe we should increase it before
> doing the further optimization, what do you think?

We could push it up a bit higher: 256 PFNs is 1 kbyte in size,
so we can make it 3x bigger and still fit struct virtio_balloon
in a single page. But if we are going to add the bitmap variant
anyway, we probably shouldn't bother.
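
Just to spell out the arithmetic, here is a back-of-the-envelope check
(the struct below is a simplified stand-in, not the real struct
virtio_balloon from the kernel, so treat it as a sketch only):

    #include <stdint.h>
    #include <stdio.h>

    #define PAGE_SIZE          4096
    #define PFNS_MAX_CURRENT   256                      /* today's VIRTIO_BALLOON_ARRAY_PFNS_MAX */
    #define PFNS_MAX_PROPOSED  (3 * PFNS_MAX_CURRENT)   /* 768 */

    struct fake_balloon {
        uint64_t misc_state[32];             /* placeholder for the non-PFN fields */
        uint32_t pfns[PFNS_MAX_PROPOSED];    /* 768 * 4 = 3072 bytes */
    };

    int main(void)
    {
        printf("current pfn array:  %zu bytes\n",
               (size_t)PFNS_MAX_CURRENT * sizeof(uint32_t));    /* 1024 */
        printf("proposed pfn array: %zu bytes\n",
               (size_t)PFNS_MAX_PROPOSED * sizeof(uint32_t));   /* 3072 */
        printf("whole struct:       %zu bytes, page is %d\n",
               sizeof(struct fake_balloon), PAGE_SIZE);
        return 0;
    }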

> > > c. address translation and madvise() operation (24%, 1423ms)
> > 
> > How is this split between translation and madvise?  I suspect it's mostly
> > madvise, since you need translation when using a bitmap as well.
> > Correct? Could you measure this please?  Also, what if we use the new
> > MADV_FREE instead?  By how much would this help?
> > 
> For the current balloon, address translation is needed. 
> But for live migration, there is no need to do address translation.

Well you need ram address in order to clear the dirty bit.
How would you get it without translation?

> 
> I did another test and got the following data:
>    a. allocating pages (6.4%, 402ms)
>    b. sending PFNs to host (68.3%, 4263ms)
>    c. address translation (6.2%, 389ms)
>    d. madvise (19.0%, 1188ms)
> 
> The address translation is a time-consuming operation too.
> I will try MADV_FREE later.


Thanks!
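
For reference, the difference between the two madvise hints on the host
side, as a minimal standalone sketch (not the actual QEMU balloon code):

    #define _GNU_SOURCE
    #include <sys/mman.h>
    #include <string.h>

    #ifndef MADV_FREE
    #define MADV_FREE 8        /* Linux >= 4.5; define for older headers */
    #endif

    int main(void)
    {
        size_t len = 64 * 4096;
        char *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (buf == MAP_FAILED)
            return 1;
        memset(buf, 0xa5, len);

        /* MADV_DONTNEED drops the pages immediately; the next access faults
         * in fresh zero pages. This is what the balloon does today. */
        madvise(buf, len, MADV_DONTNEED);

        /* MADV_FREE only marks the pages as reclaimable; they are freed
         * lazily under memory pressure, so the call itself should be cheaper. */
        memset(buf, 0x5a, len);
        madvise(buf, len, MADV_FREE);

        munmap(buf, len);
        return 0;
    }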

> > Finally, we could teach balloon to skip madvise completely.
> > By how much would this help?
> > 
> > > Debugging shows the time spent on these operations, as listed in the
> > > brackets above. By changing VIRTIO_BALLOON_ARRAY_PFNS_MAX to a
> > > large value, such as 16384, the time spent on sending the PFNs can be
> > > reduced to about 400ms, but it's still too long.
> > > Obviously, the virtio-balloon mechanism has a bigger performance
> > > impact on the guest than the approach we are trying to implement.
> > 
> > Since, as we can see, some of the new interfaces might be beneficial to the
> > balloon as well, I am rather of the opinion that extending the balloon
> > (basically 3a) might be the right thing to do.
> > 
> > > 3. Virtio interface
> > > There are three different ways of using the virtio interface to send
> > > the free page information.
> > > a. Extend the current virtio device
> > > The virtio spec already defines a number of virtio devices, and we can
> > > extend one of them to transport the free page information. This requires
> > > modifying the virtio spec.
> > 
> > You don't have to do it all by yourself, by the way.
> > Submit the proposal to the OASIS virtio TC mailing list and we will take it
> > from there.
> > 
> That's great.
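
Just to make that concrete: the message carrying a free page bitmap could
look something along these lines. All names and the layout below are purely
illustrative, they are not an existing interface and would have to be
defined properly in the spec:

    #include <stdint.h>

    #define FREE_PAGE_BMAP_CHUNK_PAGES  (32 * 1024)   /* pages covered per chunk */

    /* hypothetical header preceding each bitmap chunk on the virtqueue */
    struct free_page_bmap_hdr {
        uint64_t start_pfn;     /* first guest PFN covered by this chunk */
        uint32_t page_shift;    /* guest page size as a shift (12 for 4K) */
        uint32_t nr_pages;      /* number of bits that follow */
    };

    struct free_page_bmap_chunk {
        struct free_page_bmap_hdr hdr;
        uint64_t bitmap[FREE_PAGE_BMAP_CHUNK_PAGES / 64];  /* 1 bit per page */
    };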
> 
> >> 4. Construct free page bitmap
> >> To minimize the space for saving free page information, it’s better to 
> >> use a bitmap to describe the free pages. There are two ways to 
> >> construct the free page bitmap.
> >> 
> >> a. Construct the free page bitmap on demand (my choice). The guest
> >> allocates memory for the free page bitmap only when it receives the
> >> request from QEMU, and sets the free page bitmap by traversing the free
> >> page list. The advantage of this approach is that it's quite simple and
> >> easy to implement. The disadvantage is that the traversal may take quite
> >> a long time when there are a lot of free pages (about 20ms for 7GB of
> >> free pages).
> >> 
> >> b. Update the free page bitmap when allocating/freeing pages. Another
> >> choice is to allocate the memory for the free page bitmap when the guest
> >> boots, and then update the free page bitmap when allocating/freeing
> >> pages. This needs more modification to the memory management code in
> >> the guest. The advantage of this approach is that the guest can respond
> >> to QEMU's request for a free page bitmap very quickly, no matter how
> >> many free pages there are in the guest. Would the kernel folks like this?
> >>
> 
> > > 8. Pseudo code
> > > Dirty page logging should be enabled before getting the free page
> > > information from the guest. This is important because, while the free
> > > pages are being collected, some of them may be reused and written by
> > > the guest; dirty page logging can track those pages. The pseudo code is
> > > like below:
> > >
> > >     -----------------------------------------------
> > >     MigrationState *s = migrate_get_current();
> > >     ...
> > >
> > >     memory_global_dirty_log_start();
> > >
> > >     if (get_guest_mem_info(&info)) {
> > >         while (!get_free_page_bmap(free_page_bitmap, drop_page_cache) &&
> > >                s->state != MIGRATION_STATUS_CANCELLING) {
> > >             usleep(1000); /* sleep for 1 ms */
> > >         }
> > >
> > >         tighten_free_page_bmap = tighten_guest_free_pages(free_page_bitmap);
> > >         filter_out_guest_free_pages(tighten_free_page_bmap);
> > >     }
> > >
> > >     migration_bitmap_sync();
> > >     ...
> > >
> > >     -----------------------------------------------
> > 
> > 
> > I don't completely agree with this part.  In my opinion, it should be
> > asynchronous, driven by getting the page lists from the guest:
> > 
> > anywhere/periodically:
> >     ...
> >     request_guest_mem_info
> >     ...
> > 
> 
> Periodically? That means filtering out guest free pages not only
> in the ram bulk stage, but during the whole process of live migration, right?
> 
> If so, it's better to use 4b to construct the free page bitmap.

That's up to the guest. I would say focus on 4a first; once it works,
experiment with 4b and see what the speedup is.
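
To give an idea of what 4a could look like on the guest side, here is a
rough sketch in kernel style. The function itself is hypothetical; real
code would have to cope with pages changing state while the bitmap is in
flight, which is exactly why dirty logging has to be enabled on the host
first:

    #include <linux/mm.h>
    #include <linux/mmzone.h>
    #include <linux/bitops.h>

    /* Rough sketch of option 4a: build the bitmap on demand by walking the
     * buddy free lists. This function does not exist in the kernel; it only
     * illustrates the traversal that costs ~20ms for 7GB of free pages. */
    static void build_free_page_bitmap(unsigned long *bitmap)
    {
        struct zone *zone;
        unsigned int order, type;

        for_each_populated_zone(zone) {
            spin_lock_irq(&zone->lock);
            for (order = 0; order < MAX_ORDER; order++) {
                for (type = 0; type < MIGRATE_TYPES; type++) {
                    struct page *page;

                    list_for_each_entry(page,
                            &zone->free_area[order].free_list[type], lru) {
                        unsigned long pfn = page_to_pfn(page);
                        unsigned long i;

                        /* mark all 2^order pages of this buddy block free */
                        for (i = 0; i < (1UL << order); i++)
                            set_bit(pfn + i, bitmap);
                    }
                }
            }
            spin_unlock_irq(&zone->lock);
        }
    }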

> > later:
> > 
> > 
> >     handle_guest_mem_info()
> >     {
> >             address_space_sync_dirty_bitmap
> >             filter_out_guest_free_pages
> >     }
> > 
> > as long as we filter with the VCPU stopped like this, we can drop the sync
> > dirty stage, or alternatively we could move filter_out_guest_free_pages into
> > a bh (bottom half) so it happens later while the VCPU is running.
> > 
> > This removes any need for waiting.
> > 
> > 
> > Introducing a delay into migration might still be beneficial, but this way
> > it is optional; we still get part of the benefit even if we don't wait long
> > enough.
> > 
> 
> Yes, I agree asynchronous mode is better and I will change it.
> From the perspective of saving resources (CPU and network bandwidth),
> waiting is not so bad. :)
> 
> Liang

Sure, all I am saying is: don't tie the logic to waiting long enough.
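
In other words, something along these lines; everything here except
migration_bitmap_sync() and test_bit() is a made-up placeholder, not an
existing QEMU API:

    /* anywhere, e.g. at migration setup or periodically: */
    static void maybe_request_free_pages(void)
    {
        request_guest_free_page_bitmap();    /* async, returns immediately */
    }

    /* bottom half / virtqueue handler, run when the guest's reply arrives */
    static void handle_guest_free_page_bitmap(const unsigned long *free_bmap,
                                              unsigned long nr_pages)
    {
        unsigned long pfn;

        /* drop pages the guest reported as free from the migration bitmap */
        for (pfn = 0; pfn < nr_pages; pfn++) {
            if (test_bit(pfn, free_bmap)) {
                clear_migration_bitmap_bit(pfn);   /* placeholder helper */
            }
        }

        /* pages that were reported free but written since dirty logging
         * started get re-added here, so nothing is lost */
        migration_bitmap_sync();
    }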

> > 
> > >
> > > --
> > > 1.9.1


