

From: Jitendra Kolhe
Subject: Re: [Qemu-devel] [PATCH v1] migration: skip sending ram pages released by virtio-balloon driver.
Date: Fri, 11 Mar 2016 20:09:09 +0530
User-agent: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:38.0) Gecko/20100101 Thunderbird/38.6.0

On 3/11/2016 4:24 PM, Li, Liang Z wrote:
I wonder if it is the scanning for zeros or sending the whiteout which
affects the total migration time more.  If it is the former (as I would
expect) then a rather local change to is_zero_range() to make use of the
mapping information before scanning would get you all the speedups
without protocol changes, interfering with postcopy etc.

Roman.


Localizing the solution to the zero page scan check is a good idea. I too
agree that most of the time is spent in scanning for zero pages, in which
case we should be able to localize the solution to is_zero_range().
However, in the case of ballooned out pages (which can be seen as a subset
of guest zero pages) we also spend a very small portion of the total
migration time in sending the control information, which can also be
avoided.
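
For illustration, a minimal sketch of what such a localized check could
look like, assuming the qemu_balloon_bitmap_test(block, offset) helper
from the proposed patchset (returning 1 for ballooned out pages) and that
the destination reconstructs ballooned out pages from the migrated
bitmap; this is only a sketch, not the actual patch:

static int save_zero_page(QEMUFile *f, RAMBlock *block, ram_addr_t offset,
                          uint8_t *p, uint64_t *bytes_transferred)
{
    int pages = -1;

    /* Ballooned out pages are known to read as zero, so the scan in
     * is_zero_range() can be skipped entirely.  If the destination can
     * reconstruct such pages from the migrated balloon bitmap, the page
     * header and the zero byte need not be sent either. */
    if (qemu_balloon_bitmap_test(block, offset) == 1) {
        acct_info.dup_pages++;
        return 1;
    }

    if (is_zero_range(p, TARGET_PAGE_SIZE)) {
        acct_info.dup_pages++;
        *bytes_transferred += save_page_header(f, block,
                                               offset | RAM_SAVE_FLAG_COMPRESS);
        qemu_put_byte(f, 0);
        *bytes_transferred += 1;
        pages = 1;
    }
    return pages;
}
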
From my tests with a 16GB idle guest of which 12GB was ballooned out, the
zero page scan time for the 12GB of ballooned out pages was ~1789 ms, and
save_page_header() + qemu_put_byte(f, 0) for the same 12GB of ballooned
out pages was ~556 ms. Total migration time was ~8000 ms.

How did you do the tests? ~556 ms seems too long for putting several
bytes into the buffer.
It's likely the time you measured contains the portion spent processing
the other 4GB of guest memory pages.

Liang


I modified save_zero_page() as below and updated the timers only for
ballooned out pages, so is_zero_range() should return true (also,
qemu_balloon_bitmap_test() from my patchset returned 1). With the below
instrumentation, I got t1 = ~1789 ms and t2 = ~556 ms. Also, the total
migration time noted (~8000 ms) is for the unmodified qemu source.

You mean the total live migration time for the unmodified qemu and the
'you modified for test' qemu are almost the same?


Not sure I understand the question, but if 'you modified for test' means
the below modifications to save_zero_page(), then the answer is no. Here
is what I tried. Let's say we have 3 versions of qemu (below timings are
for a 16GB idle guest with 12GB ballooned out):

v1. Unmodified qemu - absolutely no code change - Total migration time =
~7600 ms (I rounded this one to ~8000 ms)
v2. Modified qemu 1 - with the proposed patch set (which skips both the
zero page scan and migrating control information for ballooned out pages)
- Total migration time = ~5700 ms
v3. Modified qemu 2 - only with the changes to save_zero_page() as
discussed in the previous mail (and of course using the proposed patch
set only to maintain the bitmap for ballooned out pages) - Total
migration time is irrelevant in this case.
    Total zero page scan time = ~1789 ms
    Total (save_page_header + qemu_put_byte(f, 0)) time = ~556 ms

Everything seems to add up here (may not be exact): 5700 + 1789 + 556 = ~8000 ms

I see two factors that we have not considered in this add-up: a. the
overhead of migrating the balloon bitmap to the target, and b. as you
mentioned below, the overhead of qemu_clock_get_ns().

It seems to add up to the final migration time with the proposed patchset.
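
As a rough back-of-the-envelope estimate of factor (a), assuming the
balloon bitmap carries one bit per 4KB guest page (the actual encoding in
the patchset may differ):

    /* 16GB guest / 4KB page size = 4,194,304 pages
     * 4,194,304 bits / 8         = 512KB of bitmap data to migrate,
     * which at typical migration bandwidths should cost only a small
     * fraction of the ~556 ms saved on control information. */
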

Here is the last entry for "another round" of the test; this time it's
~547 ms:

JK: block=7f5417a345e0, offset=3ffe42020, zero_page_scan_time=1218 ns,
save_page_header_time=184 ns, total_save_zero_page_time=1453 ns
cumulated vals: zero_page_scan_time=1723920378 ns,
save_page_header_time=547514618 ns,
total_save_zero_page_time=2371059239 ns

static int save_zero_page(QEMUFile *f, RAMBlock *block, ram_addr_t offset,
                          uint8_t *p, uint64_t *bytes_transferred)
{
    int pages = -1;
    int64_t time1, time2 = 0, time3 = 0, time4;
    static int64_t t1 = 0, t2 = 0, t3 = 0;

    time1 = qemu_clock_get_ns(QEMU_CLOCK_REALTIME);
    if (is_zero_range(p, TARGET_PAGE_SIZE)) {
        time2 = qemu_clock_get_ns(QEMU_CLOCK_REALTIME);
        acct_info.dup_pages++;
        *bytes_transferred += save_page_header(f, block,
                                               offset | RAM_SAVE_FLAG_COMPRESS);
        qemu_put_byte(f, 0);
        time3 = qemu_clock_get_ns(QEMU_CLOCK_REALTIME);
        *bytes_transferred += 1;
        pages = 1;
    }
    time4 = qemu_clock_get_ns(QEMU_CLOCK_REALTIME);

    /* Accumulate and print the timers only for ballooned out pages; these
     * are expected to scan as zero, so time2/time3 are set above.  The
     * pages == 1 check guards against reading them uninitialized.  Note
     * that qemu_clock_get_ns() returns nanoseconds. */
    if (pages == 1 && qemu_balloon_bitmap_test(block, offset) == 1) {
        t1 += (time2 - time1);
        t2 += (time3 - time2);
        t3 += (time4 - time1);
        fprintf(stderr, "JK: block=%lx, offset=%lx, zero_page_scan_time=%ld ns, "
                        "save_page_header_time=%ld ns, "
                        "total_save_zero_page_time=%ld ns\n"
                        "cumulated vals: zero_page_scan_time=%ld ns, "
                        "save_page_header_time=%ld ns, "
                        "total_save_zero_page_time=%ld ns\n",
                (unsigned long)block, (unsigned long)offset,
                (time2 - time1), (time3 - time2), (time4 - time1), t1, t2, t3);
    }
    return pages;
}


Thanks for your description.
The issue here is that there are too many qemu_clock_get_ns() calls; the
cost of the function itself may become the main time consuming operation.
You can measure the time consumed by the qemu_clock_get_ns() calls you
added for the test by comparing the result with a version which does not
add the qemu_clock_get_ns() calls.

Liang


Yes, we can try to measure the overhead of the qemu_clock_get_ns() calls
and see if things add up perfectly.
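
A minimal sketch of such a calibration, e.g. a hypothetical one-off
helper that could be dropped into the migration code (the iteration count
is arbitrary):

#include "qemu/osdep.h"
#include "qemu/timer.h"

/* Estimate the per-call cost of qemu_clock_get_ns() itself, so it can be
 * subtracted from the instrumented save_zero_page() numbers. */
static void calibrate_clock_overhead(void)
{
    const int iters = 1000000;   /* arbitrary sample size */
    int64_t start, end;
    int i;

    start = qemu_clock_get_ns(QEMU_CLOCK_REALTIME);
    for (i = 0; i < iters; i++) {
        (void)qemu_clock_get_ns(QEMU_CLOCK_REALTIME);
    }
    end = qemu_clock_get_ns(QEMU_CLOCK_REALTIME);

    fprintf(stderr, "qemu_clock_get_ns overhead: ~%" PRId64 " ns/call\n",
            (end - start) / iters);
}
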

Thanks,
- Jitendra


