From: Leonardo Bras Soares Passos
Subject: Re: [PATCH v1 1/1] migration: Fix yank on postcopy multifd crashing guest after migration
Date: Wed, 9 Nov 2022 13:59:51 -0300
On Wed, Nov 9, 2022 at 10:31 AM Dr. David Alan Gilbert
<dgilbert@redhat.com> wrote:
>
> * Leonardo Bras (leobras@redhat.com) wrote:
> > When multifd and postcopy-ram capabilities are enabled, if a
> > migrate-start-postcopy is attempted, the migration will finish sending the
> > memory pages and then crash with the following error:
>
> How does that happen? Isn't multifd+postcopy still disabled? I see in
> migrate_caps_check:
>
> if (cap_list[MIGRATION_CAPABILITY_POSTCOPY_RAM]) {
>     ....
>     if (cap_list[MIGRATION_CAPABILITY_MULTIFD]) {
>         error_setg(errp, "Postcopy is not yet compatible with multifd");
>         return false;
>     }
> }
>
I can't find this check in upstream code (v7.2.0-rc0). Could you
please point me to the lines where it happens?
I do see cap_list[MIGRATION_CAPABILITY_MULTIFD] and
cap_list[MIGRATION_CAPABILITY_POSTCOPY_RAM] in migrate_caps_check(),
but not nested like this, so I am probably missing something.
These reproduction steps were shared by Xiaohui Li (with a few tweaks of mine):
1. Boot a guest with any QEMU command line on the source host;
2. Boot a guest with the same command line, with '-incoming defer' appended,
on the destination host;
3. Enable the multifd and postcopy-ram capabilities on both source and destination hosts:
{"execute":"migrate-set-capabilities","arguments":{"capabilities":[{"capability":"multifd","state":true}]}}
{"execute":"migrate-set-capabilities","arguments":{"capabilities":[{"capability":"postcopy-ram","state":true}]}}
4. While the migration is active, switch to postcopy mode:
{"execute":"migrate-start-postcopy"}
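Since the `capabilities` argument of migrate-set-capabilities is a list, step 3 should also work as a single command per host. The C sketch below only builds that JSON string; it does not open a QMP socket. The helper names `build_set_caps()` and `append()` are hypothetical, and the sketch assumes capability names that need no JSON escaping.

```c
#include <assert.h>
#include <stdarg.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: append formatted text at buf[*used].
 * Returns 0 on success, -1 if the output would be truncated. */
static int append(char *buf, size_t len, size_t *used, const char *fmt, ...)
{
    va_list ap;
    int n;

    if (*used >= len) {
        return -1;
    }
    va_start(ap, fmt);
    n = vsnprintf(buf + *used, len - *used, fmt, ap);
    va_end(ap);
    if (n < 0 || (size_t)n >= len - *used) {
        return -1;
    }
    *used += (size_t)n;
    return 0;
}

/* Hypothetical helper: build one migrate-set-capabilities command that
 * enables every capability in caps[] (names are not JSON-escaped).
 * Returns the command length, or -1 if buf is too small. */
int build_set_caps(char *buf, size_t len,
                   const char *const caps[], size_t ncaps)
{
    size_t used = 0;

    assert(buf != NULL);
    if (append(buf, len, &used,
               "{\"execute\":\"migrate-set-capabilities\","
               "\"arguments\":{\"capabilities\":[") < 0) {
        return -1;
    }
    for (size_t i = 0; i < ncaps; i++) {
        /* Separate list entries with a comma, except before the first. */
        if (append(buf, len, &used,
                   "%s{\"capability\":\"%s\",\"state\":true}",
                   i ? "," : "", caps[i]) < 0) {
            return -1;
        }
    }
    if (append(buf, len, &used, "]}}") < 0) {
        return -1;
    }
    return (int)used;
}
```

Passing {"multifd", "postcopy-ram"} yields one command with two entries, in the same shape as the two step 3 commands combined; it would be sent on both the source and the destination.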
Best regards,
Leo
>
> Dave
>
> > qemu-system-x86_64: ../util/yank.c:107: yank_unregister_instance: Assertion `QLIST_EMPTY(&entry->yankfns)' failed.
> >
> > This happens because, even though every multifd channel calls
> > yank_register_function(), none of them unregisters its function before
> > MIGRATION_YANK_INSTANCE itself is unregistered, causing the assertion to fail.
> >
> > Fix that by calling multifd_load_cleanup() in postcopy_ram_listen_thread()
> > before MIGRATION_YANK_INSTANCE is unregistered.
> >
> > Fixes: b5eea99ec2 ("migration: Add yank feature")
> > Reported-by: Li Xiaohui <xiaohli@redhat.com>
> > Signed-off-by: Leonardo Bras <leobras@redhat.com>
> > ---
> > migration/migration.h | 1 +
> > migration/migration.c | 18 +++++++++++++-----
> > migration/savevm.c | 2 ++
> > 3 files changed, 16 insertions(+), 5 deletions(-)
> >
> > diff --git a/migration/migration.h b/migration/migration.h
> > index cdad8aceaa..240f64efb0 100644
> > --- a/migration/migration.h
> > +++ b/migration/migration.h
> > @@ -473,6 +473,7 @@ void migration_make_urgent_request(void);
> > void migration_consume_urgent_request(void);
> > bool migration_rate_limit(void);
> > void migration_cancel(const Error *error);
> > +bool migration_load_cleanup(void);
> >
> > void populate_vfio_info(MigrationInfo *info);
> > void postcopy_temp_page_reset(PostcopyTmpPage *tmp_page);
> > diff --git a/migration/migration.c b/migration/migration.c
> > index 739bb683f3..4f363b2a95 100644
> > --- a/migration/migration.c
> > +++ b/migration/migration.c
> > @@ -486,6 +486,17 @@ void migrate_add_address(SocketAddress *address)
> > QAPI_CLONE(SocketAddress, address));
> > }
> >
> > +bool migration_load_cleanup(void)
> > +{
> > +    Error *local_err = NULL;
> > +
> > +    if (multifd_load_cleanup(&local_err)) {
> > +        error_report_err(local_err);
> > +        return true;
> > +    }
> > +    return false;
> > +}
> > +
> > static void qemu_start_incoming_migration(const char *uri, Error **errp)
> > {
> > const char *p = NULL;
> > @@ -540,8 +551,7 @@ static void process_incoming_migration_bh(void *opaque)
> > */
> > qemu_announce_self(&mis->announce_timer, migrate_announce_params());
> >
> > - if (multifd_load_cleanup(&local_err) != 0) {
> > - error_report_err(local_err);
> > + if (migration_load_cleanup()) {
> > autostart = false;
> > }
> > /* If global state section was not received or we are in running
> > @@ -646,9 +656,7 @@ fail:
> > migrate_set_state(&mis->state, MIGRATION_STATUS_ACTIVE,
> > MIGRATION_STATUS_FAILED);
> > qemu_fclose(mis->from_src_file);
> > - if (multifd_load_cleanup(&local_err) != 0) {
> > - error_report_err(local_err);
> > - }
> > + migration_load_cleanup();
> > exit(EXIT_FAILURE);
> > }
> >
> > diff --git a/migration/savevm.c b/migration/savevm.c
> > index a0cdb714f7..250caff7f4 100644
> > --- a/migration/savevm.c
> > +++ b/migration/savevm.c
> > @@ -1889,6 +1889,8 @@ static void *postcopy_ram_listen_thread(void *opaque)
> > exit(EXIT_FAILURE);
> > }
> >
> > + migration_load_cleanup();
> > +
> > migrate_set_state(&mis->state, MIGRATION_STATUS_POSTCOPY_ACTIVE,
> > MIGRATION_STATUS_COMPLETED);
> > /*
> > --
> > 2.38.1
> >
> --
> Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
>
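The ordering problem described in the commit message can be modeled in a few lines of C. This is a simplified sketch, not QEMU's util/yank.c: `struct yank_instance`, `yank_register_fn()`, `yank_unregister_fn()`, `yank_unregister_instance_checked()`, `channels_cleanup()` and `finish_incoming()` are all hypothetical. A counter stands in for the `yankfns` QLIST, and the emptiness check returns false where QEMU asserts, so both orderings can be exercised.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of a yank instance. QEMU keeps a QLIST of yank
 * functions per instance; a counter is enough to model the invariant. */
struct yank_instance {
    bool registered;
    size_t nfns;            /* yank functions still registered */
};

/* Each modeled multifd channel registers one yank function. */
static void yank_register_fn(struct yank_instance *inst)
{
    inst->nfns++;
}

static void yank_unregister_fn(struct yank_instance *inst)
{
    assert(inst->nfns > 0);  /* must have been registered first */
    inst->nfns--;
}

/* Models the invariant behind QLIST_EMPTY(&entry->yankfns): the
 * instance may only go away once all of its functions are gone.
 * Returns false (instead of aborting) when the invariant is violated. */
static bool yank_unregister_instance_checked(struct yank_instance *inst)
{
    if (inst->nfns != 0) {
        return false;        /* QEMU would hit the assertion here */
    }
    inst->registered = false;
    return true;
}

/* Stand-in for the multifd load cleanup: every channel drops its
 * yank function. */
static void channels_cleanup(struct yank_instance *inst, size_t nchannels)
{
    for (size_t i = 0; i < nchannels; i++) {
        yank_unregister_fn(inst);
    }
}

/* Models the end of an incoming postcopy migration with nchannels
 * multifd channels. cleanup_first == false reproduces the bug;
 * cleanup_first == true follows the ordering the patch introduces.
 * Returns true if the instance could be unregistered cleanly. */
bool finish_incoming(size_t nchannels, bool cleanup_first)
{
    struct yank_instance inst = { .registered = true, .nfns = 0 };

    for (size_t i = 0; i < nchannels; i++) {
        yank_register_fn(&inst);
    }
    if (cleanup_first) {
        channels_cleanup(&inst, nchannels);
    }
    return yank_unregister_instance_checked(&inst);
}
```

In this model, `finish_incoming(4, false)` corresponds to the failing path with four channels, while `finish_incoming(4, true)` corresponds to cleaning up the multifd channels before the yank instance is unregistered.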