From: Li, Liang Z
Subject: Re: [Qemu-devel] [PATCH v1] migration: skip sending ram pages released by virtio-balloon driver.
Date: Fri, 11 Mar 2016 10:54:56 +0000

> >>> I wonder if it is the scanning for zeros or sending the whiteout
> >>> which affects the total migration time more.  If it is the former
> >>> (as I would expect) then a rather local change to is_zero_range()
> >>> to make use of the mapping information before scanning would get
> >>> you all the speedups without protocol changes, interfering with
> >>> postcopy etc.
> >>>
> >>> Roman.
> >>>
> >>
> >> Localizing the solution to the zero page scan check is a good idea. I too
> >> agree that most of the time is spent in scanning for zero pages, in
> >> which case we should be able to localize the solution to is_zero_range().
> >> However, in the case of ballooned out pages (which can be seen as a subset
> >> of guest zero pages), we also spend a very small portion of the total
> >> migration time in sending the control information, which can also be
> >> avoided.
> >>  From my tests for a 16GB idle guest of which 12GB was ballooned out,
> >> the zero page scan time for the 12GB of ballooned out pages was ~1789 ms, and
> >> save_page_header + qemu_put_byte(f, 0); for the same 12GB of ballooned out
> >> pages was ~556 ms. The total migration time was ~8000 ms.
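
A rough sketch of what that localized change might look like (an illustration only, not the posted patch; it assumes qemu_balloon_bitmap_test() from the proposed patchset and otherwise reuses the existing save_zero_page() logic):

/* Sketch: short-circuit the zero scan for pages the balloon bitmap marks
 * as released.  qemu_balloon_bitmap_test() is assumed to come from the
 * proposed patchset; everything else is the existing save_zero_page() path. */
static int save_zero_page(QEMUFile *f, RAMBlock *block, ram_addr_t offset,
                          uint8_t *p, uint64_t *bytes_transferred)
{
    int pages = -1;

    /* A ballooned-out (released) page is known to read back as zero, so the
     * is_zero_range() scan over TARGET_PAGE_SIZE can be skipped for it; the
     * zero-page header is still sent, so no protocol change is needed. */
    if (qemu_balloon_bitmap_test(block, offset) == 1 ||
        is_zero_range(p, TARGET_PAGE_SIZE)) {
        acct_info.dup_pages++;
        *bytes_transferred += save_page_header(f, block,
                                               offset | RAM_SAVE_FLAG_COMPRESS);
        qemu_put_byte(f, 0);
        *bytes_transferred += 1;
        pages = 1;
    }
    return pages;
}

Such a change would only avoid the scan cost (~1789 ms in the test above); the control information (~556 ms) would still be sent.
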
> >
> > How did you do the tests? ~556 ms seems too long for putting several
> > bytes into the buffer.
> > It's likely that the time you measured includes the portion spent
> > processing the other 4GB of guest memory pages.
> >
> > Liang
> >
> 
> I modified save_zero_page() as below and updated the timers only for ballooned
> out pages, so is_zero_range() should return true (and
> qemu_balloon_bitmap_test() from my patchset returned 1). With the below
> instrumentation, I got t1 = ~1789 ms and t2 = ~556 ms. Also, the total migration
> time noted (~8000 ms) is for the unmodified qemu source.

You mean the total live migration time for the unmodified qemu and the qemu
you modified for the test are almost the same?

> It seems to add up to the final migration time with the proposed patchset.
> 
> Here is the last entry from "another round" of the test; this time it is ~547 ms:
> JK: block=7f5417a345e0, offset=3ffe42020, zero_page_scan_time=1218 ns,
> save_page_header_time=184 ns, total_save_zero_page_time=1453 ns
> cumulated vals: zero_page_scan_time=1723920378 ns,
> save_page_header_time=547514618 ns,
> total_save_zero_page_time=2371059239 ns
> 
> static int save_zero_page(QEMUFile *f, RAMBlock *block, ram_addr_t offset,
>                           uint8_t *p, uint64_t *bytes_transferred)
> {
>     int pages = -1;
>     int64_t time1, time2, time3, time4;
>     static int64_t t1 = 0, t2 = 0, t3 = 0;
>
>     time1 = qemu_clock_get_ns(QEMU_CLOCK_REALTIME);
>     time2 = time3 = time1;    /* avoid uninitialized reads for non-zero pages */
>     if (is_zero_range(p, TARGET_PAGE_SIZE)) {
>         /* time2 - time1: cost of the zero page scan */
>         time2 = qemu_clock_get_ns(QEMU_CLOCK_REALTIME);
>         acct_info.dup_pages++;
>         *bytes_transferred += save_page_header(f, block,
>                                                offset | RAM_SAVE_FLAG_COMPRESS);
>         qemu_put_byte(f, 0);
>         /* time3 - time2: cost of sending the header plus the control byte */
>         time3 = qemu_clock_get_ns(QEMU_CLOCK_REALTIME);
>         *bytes_transferred += 1;
>         pages = 1;
>     }
>     time4 = qemu_clock_get_ns(QEMU_CLOCK_REALTIME);
>
>     /* accumulate and report timings only for ballooned out pages */
>     if (qemu_balloon_bitmap_test(block, offset) == 1) {
>         t1 += (time2 - time1);
>         t2 += (time3 - time2);
>         t3 += (time4 - time1);
>         fprintf(stderr, "JK: block=%lx, offset=%lx, zero_page_scan_time=%ld ns, "
>                         "save_page_header_time=%ld ns, "
>                         "total_save_zero_page_time=%ld ns\n"
>                         "cumulated vals: zero_page_scan_time=%ld ns, "
>                         "save_page_header_time=%ld ns, "
>                         "total_save_zero_page_time=%ld ns\n",
>                         (unsigned long)block, (unsigned long)offset,
>                         (time2 - time1), (time3 - time2), (time4 - time1),
>                         t1, t2, t3);
>     }
>     return pages;
> }
> 

Thanks for your description.
The issue here is that there are too many qemu_clock_get_ns() calls; the cost
of the function itself may become the main time-consuming operation. You can
measure the time consumed by the qemu_clock_get_ns() calls you added for the
test by comparing the result with a version that does not add them.
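
A minimal way to sanity-check that is to time a batch of back-to-back calls and derive the average per-call cost (clock_call_overhead_ns() below is a hypothetical helper for illustration, not existing QEMU code):

/* Hypothetical calibration helper: estimate the average cost of a single
 * qemu_clock_get_ns() call by timing a large batch of back-to-back calls.
 * The measured per-page intervals can then be corrected for the extra
 * clock reads the instrumentation itself performs. */
static int64_t clock_call_overhead_ns(void)
{
    const int iterations = 1000000;
    int64_t start, end;
    int i;

    start = qemu_clock_get_ns(QEMU_CLOCK_REALTIME);
    for (i = 0; i < iterations; i++) {
        (void)qemu_clock_get_ns(QEMU_CLOCK_REALTIME);
    }
    end = qemu_clock_get_ns(QEMU_CLOCK_REALTIME);

    return (end - start) / iterations;
}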

Liang



