From: Chegu Vinod
Subject: Re: [Qemu-devel] [PATCH 00/41] Migration cleanups and latency improvements
Date: Tue, 19 Feb 2013 09:59:33 -0800
User-agent: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:17.0) Gecko/20130107 Thunderbird/17.0.2
On 2/15/2013 9:46 AM, Paolo Bonzini wrote:
I am still in the midst of reviewing the changes, but I gave them a try. Preliminary observations:

- The multi-second freezes at the start of migration of larger guests (i.e. 128 GB and higher) are no longer observable with the above changes. (The simple timer script that does a gettimeofday every 100 ms didn't complain about delays etc.)
- Noticed improvements in bandwidth utilization during the iterative pre-copy phase and during the "downtime" phase.
- The total migration time was reduced, more so for larger guests. (Note: the undesirably large actual "downtime" for larger guests is a different topic that still needs to be addressed independent of these changes.)

Some details follow below...

Thanks
Vinod

Details:
----------
Host and guest kernels are running 3.8-rc5.

Comparing upstream (QEMU 1.4.50) vs. Paolo's branch (QEMU 1.3.92 based), i.e.
git clone git://github.com/bonzini/qemu.git -b migration-thread-20130115

The first set of experiments is with [not-so-interesting] *idle* guests of different sizes. The second experiment was with an OLTP workload.
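The timer script itself was not posted; the following is a minimal sketch of such a freeze detector under my own assumptions (the check_delay/run names and the 300 ms reporting threshold are mine, not the original script's):

```python
import time

# Gaps well beyond the 100 ms sampling interval are treated as guest
# freezes; 300 ms is an assumed threshold, not from the original script.
THRESHOLD_MS = 300

def check_delay(prev_ms, now_ms, threshold_ms=THRESHOLD_MS):
    """Return a 'delay of N ms' report if the gap exceeds the threshold, else None."""
    gap = int(now_ms - prev_ms)
    return "delay of %d ms" % gap if gap > threshold_ms else None

def run():
    """Sample a monotonic clock every 100 ms and print any observed freeze."""
    prev = time.monotonic() * 1000.0
    while True:
        time.sleep(0.1)
        now = time.monotonic() * 1000.0
        msg = check_delay(prev, now)
        if msg:
            print(msg)
        prev = now
```

Running run() inside the guest during migration would produce output lines of the same shape as the "delay of ... ms" entries in the results below.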
A) Idle guests:
---------------
(The migration speed was set to 10G and the downtime was set to 2 s.)

1) 5 vcpu / 32 GB *idle* guest
   QEMU 1.4.50:     total time: 31801 ms    downtime: 2831 ms
   Paolo's branch:  total time: 29012 ms    downtime: 1987 ms

2) 10 vcpu / 64 GB *idle* guest
   QEMU 1.4.50:     total time: 62699 ms    downtime: 2506 ms
   Paolo's branch:  total time: 59174 ms    downtime: 2451 ms

3) 10 vcpu / 128 GB *idle* guest
   QEMU 1.4.50:     total time: 123179 ms   downtime: 2566 ms
   address@hidden ~]# ./timer
   delay of 3083 ms  <- freeze (at start of migration)
   delay of 1916 ms  <- freeze (due to downtime)
   Paolo's branch:  total time: 116809 ms   downtime: 2703 ms
   address@hidden ~]# ./timer
   delay of 2820 ms  <- freeze (due to downtime)

4) 20 vcpu / 256 GB *idle* guest
   QEMU 1.4.50:     total time: 277775 ms   downtime: 3718 ms
   address@hidden ~]# ./timer
   delay of 6317 ms  <- freeze (at start of migration)
   delay of 2952 ms  <- freeze (due to downtime)
   Paolo's branch:  total time: 261790 ms   downtime: 3809 ms
   address@hidden ~]# ./timer
   delay of 3982 ms  <- freeze (due to downtime)

5) 40 vcpu / 512 GB *idle* guest
   QEMU 1.4.50:     total time: 631654 ms   downtime: 7252 ms
   address@hidden ~]# ./timer
   delay of 12713 ms <- freeze (at start of migration)
   delay of 6099 ms  <- freeze (due to downtime)
   Paolo's branch:  total time: 603252 ms   downtime: 6452 ms
   address@hidden ~]# ./timer
   delay of 6724 ms  <- freeze (due to downtime)

6) 80 vcpu / 784 GB *idle* guest
   QEMU 1.4.50:     total time: 1003210 ms  downtime: 8932 ms
   address@hidden ~]# ./timer
   delay of 18941 ms <- freeze (at start of migration)
   delay of 8395 ms  <- freeze (due to downtime)
   delay of 2451 ms  <- freeze (on new host...why?)
   Paolo's branch:  total time: 959378 ms   downtime: 8416 ms
   address@hidden ~]# ./timer
   delay of 8938 ms  <- freeze (due to downtime)
   delay of 935 ms   <- freeze (on new host...why?)

B) Guest with an OLTP workload:
-------------------------------
Guest: 80 vcpu / 784 GB. (Yes, I know typical guest sizes today aren't this huge, but this is just an experiment, keeping in mind that guests are continuing to get fatter.) OLTP workload with 100 users doing writes/reads, using tmpfs, as I don't yet have access to real I/O :-(  The host was ~70% busy and the guest ~60% busy. The migration speed was set to 10G and the downtime was set to 4 s.

No guest freezes were observed, but there were significant drops in TPS at the start of migration. Observed about a 30-40% improvement in bandwidth utilization during the iterative pre-copy phase. The workload did NOT converge even after 30 minutes or so, with either upstream qemu or with Paolo's changes. (Note: the lack-of-convergence issue needs to be pursued separately, based on ideas proposed in the past.)
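For reference, the "speed 10G / downtime N" settings used in these runs correspond to QEMU's migrate_set_speed (value in bytes/sec) and migrate_set_downtime (value in seconds) monitor commands of that era. A sketch of building the equivalent QMP requests; the qmp_cmd helper is my own, and actually sending the requests over the monitor socket (including the capabilities handshake) is elided:

```python
import json

def qmp_cmd(name, **args):
    """Build one QMP request as a JSON string (helper name is an assumption)."""
    cmd = {"execute": name}
    if args:
        cmd["arguments"] = args
    return json.dumps(cmd, sort_keys=True)

# 10G maximum bandwidth: migrate_set_speed takes bytes per second...
speed = qmp_cmd("migrate_set_speed", value=10 * 1024**3)
# ...and a 2 second cap on acceptable downtime.
downtime = qmp_cmd("migrate_set_downtime", value=2)
```

Each string would be written to the QMP socket after the qmp_capabilities negotiation; the same settings can be entered interactively in the HMP monitor as "migrate_set_speed 10G" and "migrate_set_downtime 2".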