Subject: Re: [PATCH RFC 2/2] migration: abort on destination if switchover limit exceeded
From: Elena Ufimtseva
Date: Fri, 26 Jul 2024 00:41:57 -0700
On Wed, Jun 26, 2024 at 02:41:34PM -0400, Peter Xu wrote:
> On Wed, Jun 26, 2024 at 12:04:43PM +0100, Joao Martins wrote:
> > Are you thinking of something specific?
>
> Not really. I don't think I have any idea on how to make it better,
> unfortunately, but we did some measurements quite some time ago and I
> can share some below.
Hello Peter,
I apologize for the long delay in replying.
>
> >
> > Many "variables" affect this from the point we decide switchover, and in the
> > worst (likely) case it means having qemu subsystems declare empirical values
> > for how long it takes to suspend/resume/transfer-state, feeding into the
> > migration expected downtime prediction equation. Part of the reason for
> > having headroom within downtime-limit was that it is a simple 'catch-all'
> > (from our PoV) in terms of maintainability, while giving the user something
> > to fall back on for characterizing its SLA.
>
> Yes, I think this might be a way to go, by starting with something that can
> catch all.
Possibly the title "strict SLA" is not the best choice of words, as it
creates the impression that the guarantees will be met.
But essentially this switchover limit is a safeguard against the unknowns
that can contribute to the downtime during stop-copy and that may not be
easy to account for (or may even be impossible to, due to hardware
implementation or other issues).
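To illustrate what is meant by a catch-all safeguard, here is a minimal
sketch of such a check; all names are hypothetical and this is not the
actual patch code:

/* Hypothetical sketch of a switchover-limit safeguard; names are
 * illustrative and do not match the actual patch. */
#include <stdint.h>
#include <stdbool.h>

/* Stand-in for a monotonic microsecond clock, e.g. something like
 * qemu_clock_get_us(QEMU_CLOCK_REALTIME). */
extern int64_t clock_us(void);

typedef struct SwitchoverState {
    int64_t start_us; /* when the source stopped the VM */
    int64_t limit_us; /* user-configured switchover limit, 0 = off */
} SwitchoverState;

/* Returns true if the elapsed switchover time already exceeds the
 * limit, in which case migration should be aborted rather than let
 * the downtime grow unbounded. */
static bool switchover_limit_exceeded(const SwitchoverState *s)
{
    return s->limit_us > 0 && clock_us() - s->start_us > s->limit_us;
}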
To show what kind of statistics we see in our environments and what the
main contributors are, please see below.
Example 1: host migration, downtime-limit left at the default of 300 ms:
Checkpoints analysis:
checkpoint=src-downtime-start -> checkpoint=src-vm-stopped: 74244 (us)
checkpoint=src-vm-stopped -> checkpoint=src-iterable-saved: 154493 (us)
checkpoint=src-iterable-saved -> checkpoint=src-non-iterable-saved: 4746 (us)
checkpoint=src-non-iterable-saved -> checkpoint=dst-precopy-loadvm-completed: 224981 (us)
checkpoint=dst-precopy-loadvm-completed -> checkpoint=dst-precopy-bh-enter: 36 (us)
checkpoint=dst-precopy-bh-enter -> checkpoint=dst-precopy-bh-announced: 7859 (us)
checkpoint=dst-precopy-bh-announced -> checkpoint=dst-precopy-bh-vm-started: 15995 (us)
checkpoint=dst-precopy-bh-vm-started -> checkpoint=src-downtime-end: 236 (us)
Iterable device analysis:
Device SAVE of ram: 0 took 151054 (us)
Device LOAD of ram: 0 took 146855 (us)
Device SAVE of 0000:20:04.0:00.0:00.0/vfio: 0 took 2127 (us)
Device LOAD of 0000:20:04.0:00.0:00.0/vfio: 0 took 144202 (us)
Non-iterable device analysis:
Device LOAD of 0000:20:04.0:00.0:00.0/vfio: 0 took 67470 (us)
Device LOAD of 0000:00:01.0/vga: 0 took 7527 (us)
Device LOAD of 0000:00:02.0/e1000e: 0 took 1715 (us)
Device LOAD of kvm-tpr-opt: 0 took 1697 (us)
Device LOAD of 0000:00:03.0/virtio-blk: 0 took 1340 (us)
Device SAVE of 0000:00:02.0/e1000e: 0 took 1036 (us)
Device LOAD of 0000:00:00.0/mch: 0 took 1035 (us)
Device LOAD of 0000:20:04.0:00.0/pcie-root-port: 0 took 976 (us)
Device LOAD of 0000:00:1f.0/ICH9LPC: 0 took 851 (us)
Device LOAD of 0000:00:1f.2/ich9_ahci: 0 took 578 (us)
(qemu) info migrate
globals:
store-global-state: on
only-migratable: off
send-configuration: on
send-section-footer: on
decompress-error-check: on
clear-bitmap-shift: 18
Migration status: completed
total time: 5927 ms
downtime: 483 ms
setup: 78 ms
transferred ram: 883709 kbytes
throughput: 1237.71 mbps
remaining ram: 0 kbytes
total ram: 33571656 kbytes
duplicate: 8192488 pages
skipped: 0 pages
normal: 201300 pages
normal bytes: 805200 kbytes
dirty sync count: 3
page size: 4 kbytes
multifd bytes: 0 kbytes
pages-per-second: 958776
precopy ram: 480464 kbytes
downtime ram: 398313 kbytes
vfio device transferred: 4496 kbytes
Example 2: a different system than above, live migration over a 100Gbit/s
connection with 2 vfio virtual functions (the guest has no workload; the
vfio devices are not engaged in the VM and have the same amount of data to
live migrate).
Only outliers larger than 3 us are displayed.
Save:
252812@1721976657.700972:vmstate_downtime_checkpoint src-downtime-start
252812@1721976657.829180:vmstate_downtime_checkpoint src-vm-stopped
252812@1721976657.967987:vmstate_downtime_save type=iterable idstr=ram instance_id=0 downtime=138005
252812@1721976658.093218:vmstate_downtime_save type=iterable idstr=0000:00:02.0/vfio instance_id=0 downtime=125188
252812@1721976658.318101:vmstate_downtime_save type=iterable idstr=0000:00:03.0/vfio instance_id=0 downtime=224857
252812@1721976658.318125:vmstate_downtime_checkpoint src-iterable-saved
...
Load:
353062@1721976488.995582:vmstate_downtime_load type=iterable idstr=ram instance_id=0 downtime=117294
353062@1721976489.235227:vmstate_downtime_load type=iterable idstr=0000:00:02.0/vfio instance_id=0 downtime=239586
353062@1721976489.449736:vmstate_downtime_load type=iterable idstr=0000:00:03.0/vfio instance_id=0 downtime=214462
353062@1721976489.463260:vmstate_downtime_load type=non-iterable idstr=0000:00:01.0/vga instance_id=0 downtime=7522
353062@1721976489.575383:vmstate_downtime_load type=non-iterable idstr=0000:00:02.0/vfio instance_id=0 downtime=112113
353062@1721976489.686961:vmstate_downtime_load type=non-iterable idstr=0000:00:03.0/vfio instance_id=0 downtime=111545
...
(qemu) info migrate
globals:
store-global-state: on
only-migratable: off
send-configuration: on
send-section-footer: on
decompress-error-check: on
clear-bitmap-shift: 18
Migration status: completed
total time: 23510 ms
downtime: 1018 ms
setup: 380 ms
transferred ram: 5317587 kbytes
throughput: 1883.34 mbps
remaining ram: 0 kbytes
total ram: 209732424 kbytes
duplicate: 51628634 pages
skipped: 0 pages
normal: 1159697 pages
normal bytes: 4638788 kbytes
dirty sync count: 4
page size: 4 kbytes
multifd bytes: 4653988 kbytes
pages-per-second: 1150272
precopy ram: 453652 kbytes
downtime ram: 118 kbytes
vfio device transferred: 209431 kbytes
As can be seen above, the downtime limit gets violated, and the main
contributors are the vfio devices. The numbers can also vary depending on
the firmware version.
I have to note that in one out of 10 tests the RAM downtime is much larger
and becomes the outlier; this is currently being investigated.
This switchover overshoot is usually reported as timed-out queries.
And to comment on precision: yes, this overshoot safeguard is not precise.
The current implementation has the granularity of the savevm handlers, as
it only checks for a downtime overshoot after a handler has completed.
Maybe this part can be improved by delegating to some of the known abusers
the job of observing the downtime overshoot on their own.
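As a rough illustration of that finer granularity, a known abuser (e.g. a
vfio device) could poll a shared budget from inside its save/load path
instead of only being checked after the fact. A hedged sketch, with
hypothetical names; nothing like this exists in the tree today:

/* Hypothetical per-handler budget check. A handler that knows it can
 * be slow would call this periodically from inside its save/load path
 * and bail out early instead of overshooting the downtime. */
#include <stdint.h>

extern int64_t clock_us(void); /* monotonic microsecond clock */

typedef struct DowntimeBudget {
    int64_t deadline_us; /* absolute deadline for the switchover */
} DowntimeBudget;

/* Returns 0 while budget remains, -1 once the deadline has passed so
 * the handler can fail the migration cleanly. */
static int downtime_budget_check(const DowntimeBudget *b)
{
    return clock_us() < b->deadline_us ? 0 : -1;
}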
>
> > Personally, I think there's a tiny bit of a disconnect between what the user
> > desires when setting downtime-limit vs what it really does. downtime-limit
> > right now is best viewed as 'precopy-ram-downtime-limit' :)
>
> That's fair to say indeed.. QEMU can try to do better on this, it's just
> not yet straightforward to know how.
Could the better-known part of the downtime (which is also predictable,
provided the bandwidth estimate is accurate), i.e. what downtime-limit
currently covers, serve as a starting point and be named
ram-downtime-limit, with everything else getting a switchover allowance of
downtime_limit - ram-downtime-limit? The correct value of
ram-downtime-limit would take a few iterations after dirty sync to get
established.
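A minimal sketch of that split, assuming the hypothetical
ram-downtime-limit parameter described above:

/* Sketch of the proposed split: ram-downtime-limit bounds the
 * predictable RAM transfer part, and whatever remains of the total
 * downtime_limit is the switchover allowance for everything else
 * (device state, stop/start overhead, ...). Names are hypothetical. */
#include <stdint.h>

static int64_t switchover_allowance_us(int64_t downtime_limit_us,
                                       int64_t ram_downtime_limit_us)
{
    int64_t allowance = downtime_limit_us - ram_downtime_limit_us;
    return allowance > 0 ? allowance : 0;
}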
I think that is somewhat similar to what Joao is thinking below in 1) and 2)?
Thanks!
Elena
>
> > Unless the accuracy work you're thinking is just having a better
> > migration algorithm at obtaining the best possible downtime for
> > outstanding-data/RAM *even if* downtime-limit is set at a high limit,
> > like giving 1) a grace period in the beginning of migration post first
> > dirty sync
>
> Can you elaborate on this one a bit?
>
> > or 2) a measured value with continually incrementing target downtime
> > limit until max downtime-limit set by user hits ... before defaulting to
> > the current behaviour of migrating as soon as expected downtime is within
> > the downtime-limit. As discussed in the last response, this could create
> > the 'downtime headroom' for getting the enforcement/SLA better
> > honored. Is this maybe your line of thinking?
>
> Not what I was referring to, but I think such logic has existed for years, it
> was just not implemented in QEMU. I know at least OpenStack implemented
> exactly that: instead of QEMU keeping an internal smaller downtime_limit
> and increasing it, OpenStack keeps adjusting the downtime_limit
> parameter from time to time, starting with a relatively low value.
>
> That is also what I would suggest to most people who cares about downtime,
> because QEMU does treat it pretty simple: if QEMU thinks it can switchover
> within the downtime specified, QEMU will just do it, even if it's not the
> best it can do.
>
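For illustration, a sketch of that external ramp-up; set_downtime_limit()
and the other helpers are stand-ins (the real thing would issue QMP
migrate-set-parameters and poll query-migrate), and the linear step policy
is only an assumption:

/* Management-side ramp-up sketch: start with a low downtime_limit and
 * raise it over time until migration converges or the user's real
 * maximum is reached. Assumes steps > 0. */
#include <stdbool.h>
#include <stdint.h>

extern void set_downtime_limit(int64_t ms); /* QMP migrate-set-parameters */
extern bool migration_completed(void);      /* QMP query-migrate */
extern void wait_a_while(void);             /* delay between adjustments */

static void ramp_downtime_limit(int64_t start_ms, int64_t max_ms, int steps)
{
    int64_t limit = start_ms;
    int64_t step = (max_ms - start_ms) / steps;

    while (!migration_completed() && limit <= max_ms) {
        set_downtime_limit(limit);
        wait_a_while();
        limit += step;
    }
}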
> Do you think such an idea should instead be implemented in QEMU, too? Note
> that this would also not be about "making downtime accurate" but about
> "reducing downtime"; it may depend on how we define downtime_limit in this
> context, perhaps, where in OpenStack's case it simply won't directly feed
> that parameter with the real max downtime the user allows.
>
> Since that wasn't my original purpose, what I meant was simply to look for
> ways to make downtime_limit accurate, by analyzing the current downtimes (as
> you mentioned, using the downtime tracepoints; and I'd say kudos to you for
> suggesting that in a formal patch).
>
> Here's something we collected by our QE team, for example, on a pretty
> loaded system of 384 cores + 12TB:
>
> Checkpoints analysis:
>
> downtime-start -> vm-stopped: 267635.2 (us)
> vm-stopped -> iterable-saved: 3558506.2 (us)
> iterable-saved -> non-iterable-saved: 270352.2 (us)
> non-iterable-saved -> downtime-end: 144264.2 (us)
> total downtime: 4240758.0 (us)
>
> Iterable device analysis:
>
> Device SAVE of ram: 0 took 3470420 (us)
>
> Non-iterable device analysis:
>
> Device SAVE of cpu:121 took 118090 (us)
> Device SAVE of apic:167 took 6899 (us)
> Device SAVE of cpu:296 took 3795 (us)
> Device SAVE of 0000:00:02.2:00.0/virtio-blk: 0 took 638 (us)
> Device SAVE of cpu:213 took 630 (us)
> Device SAVE of 0000:00:02.0:00.0/virtio-net: 0 took 534 (us)
> Device SAVE of cpu:374 took 517 (us)
> Device SAVE of cpu: 31 took 503 (us)
> Device SAVE of cpu:346 took 497 (us)
> Device SAVE of 0000:00:02.0:00.1/virtio-net: 0 took 492 (us)
> (1341 vmsd omitted)
>
> In this case we also see the SLA violations since we specified something
> much lower than 4.2sec as downtime_limit.
>
> This might not be a good example, as I think when capturing the traces we
> still had the issue of inaccurate bw estimations, which was why I introduced
> the switchover-bandwidth parameter. I hoped that after that the result would
> be closer to downtime_limit, but we never tried to test again. I am not
> sure either whether that's the best way to address this.
>
> But let's just ignore the iterable save() huge delays (which can be
> explained, and hopefully will still be covered by downtime_limit
> calculations when it can try to get closer to right), and we can also see
> at least a few things we didn't account for:
>
> - stop vm: 268ms
> - non-iterables: 270ms
> - dest load until complete: 144ms
>
> For the last one, we did see another outlier where it can only be seen from
> dest:
>
> Non-iterable device analysis:
>
> Device LOAD of kvm-tpr-opt: 0 took 123976 (us) <----- this one
> Device LOAD of 0000:00:02.0/pcie-root-port: 0 took 6362 (us)
> Device LOAD of 0000:00:02.0:00.0/virtio-net: 0 took 4583 (us)
> Device LOAD of 0000:00:02.0:00.1/virtio-net: 0 took 4440 (us)
> Device LOAD of 0000:00:01.0/vga: 0 took 3740 (us)
> Device LOAD of 0000:00:00.0/mch: 0 took 3557 (us)
> Device LOAD of 0000:00:02.2:00.0/virtio-blk: 0 took 3530 (us)
> Device LOAD of 0000:00:02.1:00.0/xhci: 0 took 2712 (us)
> Device LOAD of 0000:00:02.1/pcie-root-port: 0 took 2046 (us)
> Device LOAD of 0000:00:02.2/pcie-root-port: 0 took 1890 (us)
>
> So we found either cpu save() taking 100+ms, or kvm-tpr-opt load() taking
> 100+ms. None of them sounds normal, and I didn't look into them.
>
> Now, with a global ratio starting to reflect "how much of downtime_limit
> should we account for data transfer", we'll also need to answer how the
> user should set that ratio value, and maybe there's a sane way to calculate
> it from the VM setup? I'm not sure, but those questions may need to be
> answered together in the next post, so that such a parameter can be
> consumable.
>
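For instance, a hypothetical transfer-ratio parameter would carve the
transfer budget out of downtime_limit like this (a sketch only, not an
existing parameter):

/* Sketch of the "global ratio" idea: budget only a fraction of
 * downtime_limit for data transfer and reserve the rest for the
 * non-transfer costs measured above (stopping the VM, non-iterables,
 * destination load). transfer_ratio is a hypothetical parameter. */
#include <stdint.h>

static int64_t transfer_budget_us(int64_t downtime_limit_us,
                                  double transfer_ratio)
{
    return (int64_t)((double)downtime_limit_us * transfer_ratio);
}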
> The answer doesn't need to be accurate, but I hope it can be based on some
> analysis similar to the above (where I didn't do it well, as I don't think I
> looked into any of the issues, and maybe they're fix-able). But just to show
> what I meant. It'll also be great if, while doing the analysis, we find
> fix-able issues; then it'll be even better to fix those issues instead.
> That's the part where I mentioned "I still prefer fixing downtime_limit
> itself".
>
> Thanks,
>
> --
> Peter Xu
>