From: Hailiang Zhang
Subject: Re: [Qemu-devel] [PATCH COLO-Frame v15 00/38] COarse-grain LOck-stepping(COLO) Virtual Machines for Non-stop Service (FT)
Date: Mon, 29 Feb 2016 20:16:19 +0800
User-agent: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:38.0) Gecko/20100101 Thunderbird/38.5.1
On 2016/2/29 17:47, Dr. David Alan Gilbert wrote:
> * Hailiang Zhang (address@hidden) wrote:
>> On 2016/2/27 0:36, Dr. David Alan Gilbert wrote:
>>> * Dr. David Alan Gilbert (address@hidden) wrote:
>>>> * zhanghailiang (address@hidden) wrote:
>>>>> From: root <address@hidden>
>>>>>
>>>>> This is the 15th version of COLO (still only supporting periodic checkpoints).
>>>>> Here is only the COLO frame part; you can get the whole code from github:
>>>>> https://github.com/coloft/qemu/commits/colo-v2.6-periodic-mode
>>>>> There are only small changes in this series, except for the network-related part.
>>>>
>>>> I was looking at the time the guest is paused during COLO and was surprised
>>>> to find that one of the larger chunks was the time to reset the guest before
>>>> loading each checkpoint; I've traced it part way. The biggest contributors
>>>> for my test VM seem to be:
>>>>
>>>>     3.8ms pcibus_reset: VGA
>>>>     1.8ms pcibus_reset: virtio-net-pci
>>>>     1.5ms pcibus_reset: virtio-blk-pci
>>>>     1.5ms qemu_devices_reset: piix4_reset
>>>>     1.1ms pcibus_reset: piix3-ide
>>>>     1.1ms pcibus_reset: virtio-rng-pci
>>>>
>>>> I've not looked deeper yet, but some of these are very silly; I'm running
>>>> with -nographic, so why it's taking 3.8ms to reset VGA is going to be
>>>> interesting. Also, my only block device is the virtio-blk, so while I
>>>> understand the standard PC machine has the IDE controller, why does it
>>>> take over a ms to reset an unused device?
>>>
>>> OK, so I've dug a bit deeper, and it appears that it's the changes in PCI
>>> BARs that actually take the time; every time we do a reset we reset all
>>> the BARs, which causes a pci_update_mappings and ends up doing a
>>> memory_region_del_subregion. Then we load the config space of the PCI
>>> device as we do the vmstate_load, and this recreates all the mappings
>>> again. I'm not sure what the fix is, but avoiding the map/remap when
>>> they're the same sounds like it would usefully speed up the checkpoints.
>>
>> Interesting, and thanks for your report.
>> We already know qemu_system_reset() is a time-consuming function. We
>> shouldn't call it here, but if we don't, there is a bug, which we reported
>> before in a previous COLO series. Below is a copy of the related patch
>> comment:
>>
>>     COLO VMstate: Load VM state into qsb before restoring it
>>
>>     We should not destroy the state of the secondary until we receive the
>>     whole state from the primary, in case the primary fails in the middle
>>     of sending the state; so here we cache the device state on the
>>     secondary before restoring it. Besides, we should call
>>     qemu_system_reset() before loading the VM state, which ensures the
>>     data is intact.
>>
>>     Note: if we drop qemu_system_reset(), there are some odd errors. For
>>     example, qemu on the slave side crashes and reports:
>>
>>     KVM: entry failed, hardware error 0x7
>>     EAX=00000000 EBX=0000e000 ECX=00009578 EDX=0000434f
>>     ESI=0000fc10 EDI=0000434f EBP=00000000 ESP=00001fca
>>     EIP=00009594 EFL=00010246 [---Z-P-] CPL=0 II=0 A20=1 SMM=0 HLT=0
>>     ES =0040 00000400 0000ffff 00009300
>>     CS =f000 000f0000 0000ffff 00009b00
>>     SS =434f 000434f0 0000ffff 00009300
>>     DS =434f 000434f0 0000ffff 00009300
>>     FS =0000 00000000 0000ffff 00009300
>>     GS =0000 00000000 0000ffff 00009300
>>     LDT=0000 00000000 0000ffff 00008200
>>     TR =0000 00000000 0000ffff 00008b00
>>     GDT=     0002dcc8 00000047
>>     IDT=     00000000 0000ffff
>>     CR0=00000010 CR2=ffffffff CR3=00000000 CR4=00000000
>>     DR0=0000000000000000 DR1=0000000000000000 DR2=0000000000000000 DR3=0000000000000000
>>     DR6=00000000ffff0ff0 DR7=0000000000000400
>>     EFER=0000000000000000
>>     Code=c0 74 0f 66 b9 78 95 00 00 66 31 d2 66 31 c0 e9 47 e0 fb 90 <f3> 90 fa fc 66 c3 66 53 66 89 c3
>>
>>     ERROR: invalid runstate transition: 'internal-error' -> 'colo'
>>
>>     The reason is that some of the device state is ignored when saving it
>>     to the slave if the corresponding value is still at its initial value,
>>     such as 0. But the device state on the slave may have left its initial
>>     value, so after one checkpoint loop the values of the device state
>>     become inconsistent.
>>     This happens when the PVM reboots or the SVM runs ahead of the PVM in
>>     the startup process.
>>
>>     Signed-off-by: zhanghailiang <address@hidden>
>>     Signed-off-by: Yang Hongyang <address@hidden>
>>     Signed-off-by: Gonglei <address@hidden>
>>     Reviewed-by: Dr. David Alan Gilbert <address@hidden>
>>
>> As described above, some values of the device state are zero and are
>> ignored during migration. This is no problem for a normal migration,
>> because on the destination VM the initial values are zero too. But COLO
>> does more than one round of migration: the related values may change from
>> non-zero to zero, and they will again be ignored in the next checkpoint,
>> leaving the VMstate inconsistent on the SVM.
>
> Yes, this doesn't really surprise me; a lot of the migration code does
> weird things like this, so reset is the safest way.
>> The above error is caused directly by a wrong value of 'async_pf_en_msr':
>>
>>     static const VMStateDescription vmstate_async_pf_msr = {
>>         .name = "cpu/async_pf_msr",
>>         .version_id = 1,
>>         .minimum_version_id = 1,
>>         .needed = async_pf_msr_needed,
>>         .fields = (VMStateField[]) {
>>             VMSTATE_UINT64(env.async_pf_en_msr, X86CPU),
>>             VMSTATE_END_OF_LIST()
>>         }
>>     };
>>
>>     static bool async_pf_msr_needed(void *opaque)
>>     {
>>         X86CPU *cpu = opaque;
>>
>>         return cpu->env.async_pf_en_msr != 0;
>>     }
>>
>> Some other VMstate registers in CPUX86State have the same problem, and we
>> can't be sure they won't cause trouble if their values are incorrect. So
>> here we simply call qemu_system_reset() to avoid the inconsistency.
>> Besides, compared with the most time-consuming operation (RAM flushed from
>> the COLO cache to the SVM), the time consumed by qemu_system_reset() seems
>> acceptable ;)
>
> I've got a patch where I've tried to multithread the flush - it's made it
> a little faster, but not as much as I hoped (~20ms down to ~16ms using 4
> cores).
Hmm, that seems to be a good idea. After switching to COLO (hybrid) mode we
will, in most cases, get many more dirtied pages than in periodic mode,
because the delay between two checkpoints is usually longer. Multi-threaded
flushing may gain much more in that case, but I suspect that in some bad
cases users still can't bear the pause time.

Actually, we have thought about this problem for a long time. In our early
tests based on the kernel COLO-proxy, we easily got more than one second of
flushing time; IMHO, users can't bear such a long VM pause if they choose
to use COLO.

We have designed another scheme based on userfault's page-miss capability.
The basic idea is to convert the flushing action into a marking action, so
the flush work is processed during the SVM's running time. For now this is
only an idea; we'd like to verify it first. (I'm not quite sure whether
userfault's page-miss feature performs well enough when we use it to mark
one page at a time as MISS.)

Thanks,
Hailiang
>> Another choice to fix the problem is to save the VMstate regardless of
>> the needed() return value, but this method is not so graceful.
>>
>>     diff --git a/migration/vmstate.c b/migration/vmstate.c
>>     index e5388f0..7d15bba 100644
>>     --- a/migration/vmstate.c
>>     +++ b/migration/vmstate.c
>>     @@ -409,7 +409,7 @@ static void vmstate_subsection_save(QEMUFile *f, const VMStateDescription *vmsd,
>>          bool subsection_found = false;
>>
>>          while (sub && sub->needed) {
>>     -        if (sub->needed(opaque)) {
>>     +        if (sub->needed(opaque) || migrate_in_colo_state()) {
>>                  const VMStateDescription *vmsd = sub->vmsd;
>>                  uint8_t len;
>
> Maybe, but I suspect this will also find other strange cases in devices.
> For example, without the reset I wouldn't really trust devices to load the
> new state in; they wouldn't have been tested with all the states they
> might have left themselves in previously.
>
> Dave

Thanks,
Hailiang

>>> I guess reset is normally off anyone's radar since it's outside the time
>>> anyone cares about, but perhaps the guys trying to make qemu start
>>> really quickly would be interested.
>>>
>>> Dave

>>>>> Patch status:
>>>>> Unreviewed: patch 21,27,28,29,33,38
>>>>> Updated: patch 31,34,35,37
>>>>>
>>>>> TODO:
>>>>> 1. Checkpoint based on proxy in qemu
>>>>> 2. The capability of continuous FT
>>>>> 3. Optimize the VM's downtime during checkpoint
>>>>>
>>>>> v15:
>>>>>  - Go on with the shutdown process if we encounter an error while
>>>>>    sending the shutdown message to the SVM. (patch 24)
>>>>>  - Rename qemu_need_skip_netfilter to qemu_netfilter_can_skip and
>>>>>    remove some useless comments. (patch 31, Jason)
>>>>>  - Call object_new_with_props() directly to add a filter in
>>>>>    colo_add_buffer_filter. (patch 34, Jason)
>>>>>  - Re-implement colo_set_filter_status() based on the COLOBufferFilters
>>>>>    list. (patch 35)
>>>>>  - Re-implement colo_flush_filter_packets() based on the
>>>>>    COLOBufferFilters list. (patch 37)
>>>>>
>>>>> v14:
>>>>>  - Re-implement the network processing based on netfilter (Jason Wang)
>>>>>  - Rename 'COLOCommand' to 'COLOMessage'. (Markus's suggestion)
>>>>>  - Split two new patches (patch 27/28) from patch 29
>>>>>  - Fix some other comments from Dave and Markus.
>>>>> v13:
>>>>>  - Refactor the colo_*_cmd helper functions to use an 'Error **errp'
>>>>>    parameter instead of a return value to indicate success or failure.
>>>>>    (patch 10)
>>>>>  - Remove the optional error message for the COLO_EXIT event. (patch 25)
>>>>>  - Use a semaphore to notify the colo/colo incoming loop that failover
>>>>>    work is finished. (patch 26)
>>>>>  - Move COLO shutdown related code to the colo.c file. (patch 28)
>>>>>  - Fix a memory leak bug in the colo incoming loop. (new patch 31)
>>>>>  - Re-use some existing helper functions to realize the process of
>>>>>    saving/loading ram and device. (patch 32)
>>>>>  - Fix some other comments from Dave and Markus.
>>>>>
>>>>> zhanghailiang (38):
>>>>>   configure: Add parameter for configure to enable/disable COLO support
>>>>>   migration: Introduce capability 'x-colo' to migration
>>>>>   COLO: migrate colo related info to secondary node
>>>>>   migration: Integrate COLO checkpoint process into migration
>>>>>   migration: Integrate COLO checkpoint process into loadvm
>>>>>   COLO/migration: Create a new communication path from destination to source
>>>>>   COLO: Implement colo checkpoint protocol
>>>>>   COLO: Add a new RunState RUN_STATE_COLO
>>>>>   QEMUSizedBuffer: Introduce two help functions for qsb
>>>>>   COLO: Save PVM state to secondary side when do checkpoint
>>>>>   COLO: Load PVM's dirty pages into SVM's RAM cache temporarily
>>>>>   ram/COLO: Record the dirty pages that SVM received
>>>>>   COLO: Load VMState into qsb before restore it
>>>>>   COLO: Flush PVM's cached RAM into SVM's memory
>>>>>   COLO: Add checkpoint-delay parameter for migrate-set-parameters
>>>>>   COLO: synchronize PVM's state to SVM periodically
>>>>>   COLO failover: Introduce a new command to trigger a failover
>>>>>   COLO failover: Introduce state to record failover process
>>>>>   COLO: Implement failover work for Primary VM
>>>>>   COLO: Implement failover work for Secondary VM
>>>>>   qmp event: Add COLO_EXIT event to notify users while exited from COLO
>>>>>   COLO failover: Shutdown related socket fd when do failover
>>>>>   COLO failover: Don't do failover during loading VM's state
>>>>>   COLO: Process shutdown command for VM in COLO state
>>>>>   COLO: Update the global runstate after going into colo state
>>>>>   savevm: Introduce two helper functions for save/find loadvm_handlers entry
>>>>>   migration/savevm: Add new helpers to process the different stages of loadvm
>>>>>   migration/savevm: Export two helper functions for savevm process
>>>>>   COLO: Separate the process of saving/loading ram and device state
>>>>>   COLO: Split qemu_savevm_state_begin out of checkpoint process
>>>>>   net/filter: Add a 'status' property for filter object
>>>>>   filter-buffer: Accept zero interval
>>>>>   net: Add notifier/callback for netdev init
>>>>>   COLO/filter: add each netdev a buffer filter
>>>>>   COLO: manage the status of buffer filters for PVM
>>>>>   filter-buffer: make filter_buffer_flush() public
>>>>>   COLO: flush buffered packets in checkpoint process or exit COLO
>>>>>   COLO: Add block replication into colo process
>>>>>
>>>>>  configure                     |  11 +
>>>>>  docs/qmp-events.txt           |  16 +
>>>>>  hmp-commands.hx               |  15 +
>>>>>  hmp.c                         |  15 +
>>>>>  hmp.h                         |   1 +
>>>>>  include/exec/ram_addr.h       |   1 +
>>>>>  include/migration/colo.h      |  42 ++
>>>>>  include/migration/failover.h  |  33 ++
>>>>>  include/migration/migration.h |  16 +
>>>>>  include/migration/qemu-file.h |   3 +-
>>>>>  include/net/filter.h          |   5 +
>>>>>  include/net/net.h             |   4 +
>>>>>  include/sysemu/sysemu.h       |   9 +
>>>>>  migration/Makefile.objs       |   2 +
>>>>>  migration/colo-comm.c         |  76 ++++
>>>>>  migration/colo-failover.c     |  83 ++++
>>>>>  migration/colo.c              | 866 ++++++++++++++++++++++++++++++++++++++++++
>>>>>  migration/migration.c         | 109 +++++-
>>>>>  migration/qemu-file-buf.c     |  61 +++
>>>>>  migration/ram.c               | 175 ++++++++-
>>>>>  migration/savevm.c            | 114 ++++--
>>>>>  net/filter-buffer.c           |  14 +-
>>>>>  net/filter.c                  |  40 ++
>>>>>  net/net.c                     |  33 ++
>>>>>  qapi-schema.json              | 104 ++++-
>>>>>  qapi/event.json               |  15 +
>>>>>  qemu-options.hx               |   4 +-
>>>>>  qmp-commands.hx               |  23 +-
>>>>>  stubs/Makefile.objs           |   1 +
>>>>>  stubs/migration-colo.c        |  54 +++
>>>>>  trace-events                  |   8 +
>>>>>  vl.c                          |  31 +-
>>>>>  32 files changed, 1908 insertions(+), 76 deletions(-)
>>>>>  create mode 100644 include/migration/colo.h
>>>>>  create mode 100644 include/migration/failover.h
>>>>>  create mode 100644 migration/colo-comm.c
>>>>>  create mode 100644 migration/colo-failover.c
>>>>>  create mode 100644 migration/colo.c
>>>>>  create mode 100644 stubs/migration-colo.c
>>>>>
>>>>> --
>>>>> 1.8.3.1

--
Dr. David Alan Gilbert / address@hidden / Manchester, UK