From: Li Zhijian
Subject: Re: [Qemu-devel] [PATCH COLO-Frame (Base) v21 00/17] COarse-grain LOck-stepping(COLO) Virtual Machines for Non-stop Service (FT)
Date: Wed, 26 Oct 2016 18:14:52 +0800
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:45.0) Gecko/20100101 Thunderbird/45.3.0



On 10/26/2016 04:26 PM, Amit Shah wrote:
On (Wed) 26 Oct 2016 [14:43:30], Hailiang Zhang wrote:
Hi Amit,

On 2016/10/26 14:09, Amit Shah wrote:
Hello,

On (Tue) 18 Oct 2016 [20:09:56], zhanghailiang wrote:
This is the 21st version of the COLO frame series.

Rebase to the latest master.

I've reviewed the patchset and have some minor comments, but overall it
looks good.  The changes are contained, and common code / existing
code paths are not affected much.  We can still target merging this
for 2.8.


I really appreciate your help ;). I will fix all the issues later
and send v22. I hope we can still catch the 2.8 deadline.

Do you have any test results on how much the VM slows down / how much
downtime is incurred during checkpoints?


Yes, we tested that a long time ago; it all depends on the workload.
The downtime is determined by the time spent transferring the dirty pages
and the time spent flushing RAM from the RAM buffer.
But we do have methods to reduce the downtime.

One method is to reduce the amount of data (mainly dirty pages) transferred at
checkpoint time by sending dirty pages asynchronously while the PVM and SVM are
running (i.e. not during the checkpoint itself). Besides that, we can reuse
existing migration capabilities, such as compression.
Another method is to reduce the time spent flushing RAM by using the userfaultfd
API to turn copying RAM into marking a bitmap. We could also flush the RAM buffer
with multiple threads, as Dave suggested ...
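For illustration, a minimal userfaultfd sketch of that idea: instead of copying the
whole checkpoint buffer into guest RAM during the downtime, pages are served on
demand when the guest first touches them (in the full scheme a bitmap would record
which pages live in the buffer). The helper names are hypothetical; this is not the
actual QEMU implementation.

/* Sketch only: register a RAM region with userfaultfd and resolve a missing-page
 * fault by copying that page from the checkpoint buffer, instead of copying the
 * whole buffer up front. */
#include <fcntl.h>
#include <linux/userfaultfd.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>

static int uffd_register(void *area, size_t len)
{
    int uffd = syscall(__NR_userfaultfd, O_CLOEXEC | O_NONBLOCK);
    if (uffd < 0) {
        return -1;
    }

    struct uffdio_api api = { .api = UFFD_API };
    if (ioctl(uffd, UFFDIO_API, &api) < 0) {
        close(uffd);
        return -1;
    }

    struct uffdio_register reg = {
        .range = { .start = (uintptr_t)area, .len = len },
        .mode  = UFFDIO_REGISTER_MODE_MISSING,   /* notify us on missing pages */
    };
    if (ioctl(uffd, UFFDIO_REGISTER, &reg) < 0) {
        close(uffd);
        return -1;
    }
    return uffd;
}

/* Resolve one fault: copy the faulting page from the checkpoint buffer. */
static int uffd_serve_one(int uffd, void *ram_buffer, void *guest_base,
                          size_t page_size)
{
    struct uffd_msg msg;

    if (read(uffd, &msg, sizeof(msg)) != (ssize_t)sizeof(msg) ||
        msg.event != UFFD_EVENT_PAGEFAULT) {
        return -1;
    }

    uint64_t addr   = msg.arg.pagefault.address & ~(uint64_t)(page_size - 1);
    uint64_t offset = addr - (uint64_t)(uintptr_t)guest_base;

    struct uffdio_copy copy = {
        .dst = addr,
        .src = (uint64_t)(uintptr_t)ram_buffer + offset,
        .len = page_size,
    };
    return ioctl(uffd, UFFDIO_COPY, &copy);
}

The multi-threaded variant would simply split the flush of the RAM buffer across
several worker threads, each handling a disjoint range of pages.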

Yes, I understand that as with any migration numbers, this too depends
on what the guest is doing.  However, can you just pick some standard
workload - kernel compile or something like that - and post a few
observations?

Sure, we collected some performance data with a previous version of COLO a few months ago.
Networking configuration:
host (primary and secondary): 10000Mb/s NIC used for checkpoint traffic
client: 1000Mb/s NIC connected to the hosts

---------------------------+----------+------------------+--------------+------------------+------------
benchmark                  | guest    | case             | native       | COLO             | performance
---------------------------+----------+------------------+--------------+------------------+------------
webbench (bytes/s)         | 2vCPU 2G | 50 clients       | 105358952    | 99396093.3333333 | 94.34%
ftp put (bytes/s upload)   | 2vCPU 2G | 1GB file         | 77079.59     | 61310.20333      | 79.54%
ftp get (bytes/s download) | 2vCPU 2G | 2GB file         | 74222.26333  | 65799.19667      | 88.65%
pgbench (trans/s)          | 2vCPU 2G | 1000 clients,    | 189          | 100              | 53%
                           |          | 100 transactions |              |                  |
netperf (Mbit/s)           | 2vCPU 2G | TCP_RR           | 3413.413333  | 2078.093333      | 60.88%
                           |          | TCP_STREAM       | 941.3233333  | 860.27           | 91.39%
kernel build (seconds)     | 8vCPU 8G | make -j8         | 2m16.172     | 2m38.883         | 86%
---------------------------+----------+------------------+--------------+------------------+------------

Furthermore, in a ping test we measured roughly 1ms higher latency than native.

Note:
- pgbench generates network packets containing random values, which triggers new
checkpoints frequently (see the sketch below)
- with netperf TCP_RR, the client sees roughly *twice* the latency for each request/response
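For illustration, a minimal sketch of the comparison that makes this happen: the
proxy compares the primary VM's and secondary VM's outgoing packets, and any payload
difference (such as the random values in pgbench responses) forces an immediate
checkpoint. The names here (colo_compare_output, request_checkpoint, release_packet)
are hypothetical, not the actual colo-proxy code.

#include <stddef.h>
#include <string.h>

struct packet {
    const unsigned char *data;
    size_t len;
};

/* Hypothetical helpers, assumed to exist elsewhere. */
extern void request_checkpoint(void);
extern void release_packet(const struct packet *pkt);

void colo_compare_output(const struct packet *primary,
                         const struct packet *secondary)
{
    if (primary->len != secondary->len ||
        memcmp(primary->data, secondary->data, primary->len) != 0) {
        /* Output diverged: the secondary is no longer an exact replica, so a
         * new checkpoint must be taken before this packet can be released. */
        request_checkpoint();
    }
    release_packet(primary);
}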

--
Best regards.
Li Zhijian



Also, can you tell us how you arrived at the default checkpoint
interval?


Er, for this value we referred to Remus on the Xen platform. ;)
But after we implement COLO with the COLO proxy, this interval will be changed
to a bigger value (10s), and we will make it configurable too. Besides that, we will
add another configurable value to control the minimum interval between checkpoints.
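For illustration only, a simplified sketch of such a scheduling loop: a checkpoint
is taken either when the proxy reports diverging output or after a configurable
maximum interval, and proxy-triggered checkpoints are rate-limited by a minimum
interval. All names (colo_checkpoint_loop, wait_for_miscompare_or_timeout,
do_checkpoint, ...) are hypothetical, not QEMU's actual code.

#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helpers, assumed to exist elsewhere. */
extern uint64_t now_ms(void);
extern void sleep_ms(uint64_t ms);
extern bool wait_for_miscompare_or_timeout(uint64_t timeout_ms);
extern void do_checkpoint(void);

void colo_checkpoint_loop(uint64_t max_interval_ms, uint64_t min_interval_ms)
{
    uint64_t last = now_ms();

    for (;;) {
        /* Checkpoint when the proxy sees diverging output, or at the latest
         * after max_interval_ms of quiet running. */
        bool miscompare = wait_for_miscompare_or_timeout(max_interval_ms);

        if (miscompare) {
            uint64_t elapsed = now_ms() - last;
            if (elapsed < min_interval_ms) {
                /* Enforce the minimum gap between checkpoints. */
                sleep_ms(min_interval_ms - elapsed);
            }
        }

        do_checkpoint();
        last = now_ms();
    }
}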

OK - is there a typical value that strikes a good balance between COLO keeping
the network busy / the guest paused vs. the guest making progress?  Again this
is something that's workload-dependent, but I guess you have typical
numbers from a network-bound workload?

Thanks,

                Amit

