Re: [Qemu-devel] [PATCH v5 00/28] Migration: postcopy failure recovery
From: Peter Xu
Subject: Re: [Qemu-devel] [PATCH v5 00/28] Migration: postcopy failure recovery
Date: Wed, 6 Dec 2017 10:39:45 +0800
User-agent: Mutt/1.9.1 (2017-09-22)

On Tue, Dec 05, 2017 at 06:43:42PM +0000, Dr. David Alan Gilbert wrote:
> * Peter Xu (address@hidden) wrote:
> > Tree is pushed here for better reference and testing (online tree
> > includes monitor OOB series):
> >
> > https://github.com/xzpeter/qemu/tree/postcopy-recover-all
> >
> > This version removes quite a few patches related to migrate-incoming;
> > instead, I introduced a new command, "migrate-recover", to trigger the
> > recovery channel on the destination side, which simplifies the code.
> >
> > To test the two series together, please check out the above tree and
> > build it. Note: to test on a small, single host, one needs to disable
> > full-bandwidth postcopy migration, otherwise it will complete very fast.
> > Basically a simple patch like this would help:
> >
> > diff --git a/migration/migration.c b/migration/migration.c
> > index 4de3b551fe..c0206023d7 100644
> > --- a/migration/migration.c
> > +++ b/migration/migration.c
> > @@ -1904,7 +1904,7 @@ static int postcopy_start(MigrationState *ms, bool *old_vm_running)
> >       * will notice we're in POSTCOPY_ACTIVE and not actually
> >       * wrap their state up here
> >       */
> > -    qemu_file_set_rate_limit(ms->to_dst_file, INT64_MAX);
> > +    // qemu_file_set_rate_limit(ms->to_dst_file, INT64_MAX);
> >      if (migrate_postcopy_ram()) {
> >          /* Ping just for debugging, helps line traces up */
> >          qemu_savevm_send_ping(ms->to_dst_file, 2);
> >
> > This patch is already included in the github tree above. Please feel free
> > to drop it when you want to test on big machines and between real
> > hosts.
> >
> > Detailed Test Procedures (QMP only)
> > ===================================
> >
> > 1. start source QEMU.
> >
> > $qemu -M q35,kernel-irqchip=split -enable-kvm -snapshot \
> > -smp 4 -m 1G -qmp stdio \
> > -name peter-vm,debug-threads=on \
> > -netdev user,id=net0 \
> > -device e1000,netdev=net0 \
> > -global migration.x-max-bandwidth=4096 \
> > -global migration.x-postcopy-ram=on \
> > /images/fedora-25.qcow2
>
> I suspect -snapshot isn't doing the right thing to the storage when
> combined with the migration - I'm assuming the destination isn't using
> the same temporary file.
> (Also any reason for specifying split irqchip?)
Ah yes, sorry: we should not use "-snapshot" here. Please remove it.
I think my smoke test just never tried to fetch anything from that
temporary storage, so nothing went wrong.
And there is no reason for the split irqchip; I just took this command
line from somewhere I was testing IOMMUs. :-) Please feel free to remove
it too if you want.
(So basically I was just pasting my smoke-test command lines, not the
command lines strictly required to run the tests.)
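
For reference, here is a minimal Python sketch of the two command lines with "-snapshot" and the split irqchip dropped. The helper name build_qemu_args is hypothetical, and the binary name "qemu-system-x86_64" is an assumption taken from the stderr message quoted in step 6; every other option is copied unchanged from the quoted commands.

```python
def build_qemu_args(incoming=None, image="/images/fedora-25.qcow2"):
    """Return the QEMU argv for the source (incoming=None) or the destination.

    Hypothetical helper, for illustration only.
    """
    args = [
        "qemu-system-x86_64",            # assumed binary name
        "-M", "q35",                     # kernel-irqchip=split dropped
        "-enable-kvm",                   # -snapshot dropped
        "-smp", "4", "-m", "1G",
        "-qmp", "stdio",
        "-name", "peter-vm,debug-threads=on",
        "-netdev", "user,id=net0",
        "-device", "e1000,netdev=net0",
        "-global", "migration.x-max-bandwidth=4096",
        "-global", "migration.x-postcopy-ram=on",
    ]
    if incoming is not None:
        # Only the destination listens for the incoming migration stream.
        args += ["-incoming", incoming]
    args.append(image)
    return args


source_args = build_qemu_args()
dest_args = build_qemu_args(incoming="tcp:0.0.0.0:5555")
```

The lists can be handed to subprocess.Popen as-is, or joined with spaces to get back a shell command line.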
>
> > 2. start destination QEMU.
> >
> > $qemu -M q35,kernel-irqchip=split -enable-kvm -snapshot \
> > -smp 4 -m 1G -qmp stdio \
> > -name peter-vm,debug-threads=on \
> > -netdev user,id=net0 \
> > -device e1000,netdev=net0 \
> > -global migration.x-max-bandwidth=4096 \
> > -global migration.x-postcopy-ram=on \
> > -incoming tcp:0.0.0.0:5555 \
> > /images/fedora-25.qcow2
> >
> > 3. On source, do QMP handshake as normal:
> >
> > {"execute": "qmp_capabilities"}
> > {"return": {}}
> >
> > 4. On destination, do QMP handshake to enable OOB:
> >
> > {"execute": "qmp_capabilities", "arguments": { "enable": [ "oob" ] } }
> > {"return": {}}
> >
> > 5. On source, trigger initial migrate command, switch to postcopy:
> >
> > {"execute": "migrate", "arguments": { "uri": "tcp:localhost:5555" } }
> > {"return": {}}
> > {"execute": "query-migrate"}
> > {"return": {"expected-downtime": 300, "status": "active", ...}}
> > {"execute": "migrate-start-postcopy"}
> > {"return": {}}
> > {"timestamp": {"seconds": 1512454728, "microseconds": 768096}, "event": "STOP"}
> > {"execute": "query-migrate"}
> > {"return": {"expected-downtime": 44472, "status": "postcopy-active", ...}}
> >
> > 6. On source, manually trigger a "fake network down" using
> > the "migrate_cancel" command:
> >
> > {"execute": "migrate_cancel"}
> > {"return": {}}
> >
> > During postcopy, it'll not really cancel the migration, but pause
> > it. On both sides, we should see this on stderr:
> >
> > qemu-system-x86_64: Detected IO failure for postcopy. Migration paused.
> >
> > It means now both sides are in postcopy-pause state.
> >
> > 7. (Optional) On destination side, let's try to hang the main thread
> > using the new x-oob-test command, providing a "lock=true" param:
> >
> > {"execute": "x-oob-test", "id": "lock-dispatcher-cmd",
> > "arguments": { "lock": true } }
> >
> > After sending this command, we should not see any "return", because
> > the main thread is already blocked. But we can still use the monitor,
> > since the monitor now has a dedicated IOThread.
> >
> > 8. On destination side, provide a new incoming port using the new
> > command "migrate-recover" (note that if step 7 is carried out, we
> > _must_ use OOB form, otherwise the command will hang. With OOB,
> > this command will return immediately):
> >
> > {"execute": "migrate-recover", "id": "recover-cmd",
> > "arguments": { "uri": "tcp:localhost:5556" },
> > "control": { "run-oob": true } }
> > {"timestamp": {"seconds": 1512454976, "microseconds": 186053},
> > "event": "MIGRATION", "data": {"status": "setup"}}
> > {"return": {}, "id": "recover-cmd"}
> >
> > We can see that the command succeeds even if the main thread is
> > locked up.
> >
> > 9. (Optional) This step is only needed if step 7 is carried out. On
> > destination, let's unlock the main thread before resuming the
> > migration, this time with "lock=false" to unlock the main thread
> > (since system running needs the main thread). Note that we _must_
> > use OOB command here too:
> >
> > {"execute": "x-oob-test", "id": "unlock-dispatcher",
> > "arguments": { "lock": false }, "control": { "run-oob": true } }
> > {"return": {}, "id": "unlock-dispatcher"}
> > {"return": {}, "id": "lock-dispatcher-cmd"}
> >
> > Here the first "return" is the reply to the unlock command, the
> > second "return" is the reply to the lock command. After this
> > command, main thread is released.
> >
> > 10. On source, resume the postcopy migration:
> >
> > {"execute": "migrate", "arguments": { "uri": "tcp:localhost:5556",
> > "resume": true }}
> > {"return": {}}
> > {"execute": "query-migrate"}
> > {"return": {"status": "completed", ...}}
>
> The use of x-oob-test to lock things is a bit different to reality
> and that means the ordering is different.
> When the destination is blocked by a page request, that page won't
> become unstuck until sometime after (10) happens and delivers the page
> to the target.
>
> You could try an 'info cpus' on the destination at (7) - although it's
> not guaranteed to lock, depending whether the page needed has arrived.
Yes, "info cpus" (or "query-cpus" in QMP) would work too. The
"return" will be delayed until the resume command is sent, but it's
the same thing: here I just want to make sure the main thread is
completely hung, so that I can tell whether the new accept() port and
the whole workflow still work in that situation.
Btw, IMHO "info cpus" should guarantee a block; if it does not, we can
just do something in the guest to make sure the guest hangs, since then
at least one vcpu must be waiting for a page. Thanks!
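
To make the ordering concrete, here is a minimal Python sketch of the recovery sequence with "query-cpus" standing in for x-oob-test at step (7). It only builds the JSON lines and does not talk to a real monitor. The helper qmp() and the id "maybe-blocked" are hypothetical; the command names, the other arguments and the "control": {"run-oob": true} form are the ones quoted in the test procedure above. The source-side handshake and the initial migration (steps 3 and 5) are left out.

```python
import json


def qmp(execute, arguments=None, cmd_id=None, oob=False):
    """Build one QMP command as a JSON string (hypothetical helper)."""
    msg = {"execute": execute}
    if arguments is not None:
        msg["arguments"] = arguments
    if cmd_id is not None:
        msg["id"] = cmd_id
    if oob:
        # OOB request form as used by the monitor OOB series in this thread.
        msg["control"] = {"run-oob": True}
    return json.dumps(msg)


# (side, command) pairs in the order they would be sent.
sequence = [
    # Step 4: the destination must enable OOB during the handshake.
    ("dst", qmp("qmp_capabilities", {"enable": ["oob"]})),
    # Step 6: during postcopy this pauses the migration, it does not cancel it.
    ("src", qmp("migrate_cancel")),
    # Step 7 variant: in-band query; its reply may not come until after resume.
    ("dst", qmp("query-cpus", cmd_id="maybe-blocked")),
    # Step 8: must be OOB, since the main thread may be blocked by now.
    ("dst", qmp("migrate-recover", {"uri": "tcp:localhost:5556"},
                cmd_id="recover-cmd", oob=True)),
    # Step 10: resume from the source over the new port.
    ("src", qmp("migrate", {"uri": "tcp:localhost:5556", "resume": True})),
]

for side, line in sequence:
    print(side, line)
```

With this variant there is no explicit unlock step: if "query-cpus" does block, its "return" would be expected only after the resumed migration has delivered the page the destination was waiting for.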
--
Peter Xu