From: Hailiang Zhang
Subject: Re: [Qemu-devel] [PATCH COLO-Frame v11 23/39] COLO: Implement failover work for Primary VM
Date: Fri, 11 Dec 2015 15:54:46 +0800
User-agent: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:38.0) Gecko/20100101 Thunderbird/38.4.0

On 2015/12/11 2:34, Dr. David Alan Gilbert wrote:
* zhanghailiang (address@hidden) wrote:
For the PVM, if there is a failover request from users,
the COLO thread will exit the loop, while the failover BH does the
cleanup work and resumes the VM.

Signed-off-by: zhanghailiang <address@hidden>
Signed-off-by: Li Zhijian <address@hidden>
---
v11:
- Don't call migration_end() in primary_vm_do_failover(),
  The cleanup work will be done in migration_thread().
- Remove vm_start() in primary_vm_do_failover(), which has also been done
   in migration_thread()
v10:
- Call migration_end() in primary_vm_do_failover()
---
  include/migration/colo.h     |  3 +++
  include/migration/failover.h |  1 +
  migration/colo-failover.c    |  7 +++++-
  migration/colo.c             | 56 ++++++++++++++++++++++++++++++++++++++++++--
  4 files changed, 64 insertions(+), 3 deletions(-)

diff --git a/include/migration/colo.h b/include/migration/colo.h
index ba27719..0b02e95 100644
--- a/include/migration/colo.h
+++ b/include/migration/colo.h
@@ -32,4 +32,7 @@ void *colo_process_incoming_thread(void *opaque);
  bool migration_incoming_in_colo_state(void);

  COLOMode get_colo_mode(void);
+
+/* failover */
+void colo_do_failover(MigrationState *s);
  #endif
diff --git a/include/migration/failover.h b/include/migration/failover.h
index 882c625..fba3931 100644
--- a/include/migration/failover.h
+++ b/include/migration/failover.h
@@ -26,5 +26,6 @@ void failover_init_state(void);
  int failover_set_state(int old_state, int new_state);
  int failover_get_state(void);
  void failover_request_active(Error **errp);
+bool failover_request_is_active(void);

  #endif
diff --git a/migration/colo-failover.c b/migration/colo-failover.c
index 1b1be24..0c525da 100644
--- a/migration/colo-failover.c
+++ b/migration/colo-failover.c
@@ -32,7 +32,7 @@ static void colo_failover_bh(void *opaque)
          error_report("Unkown error for failover, old_state=%d", old_state);
          return;
      }
-    /*TODO: Do failover work */
+    colo_do_failover(NULL);
  }

  void failover_request_active(Error **errp)
@@ -67,6 +67,11 @@ int failover_get_state(void)
      return atomic_read(&failover_state);
  }

+bool failover_request_is_active(void)
+{
+    return ((failover_get_state() != FAILOVER_STATUS_NONE));
+}
+
  void qmp_x_colo_lost_heartbeat(Error **errp)
  {
      if (get_colo_mode() == COLO_MODE_UNKNOWN) {
diff --git a/migration/colo.c b/migration/colo.c
index cedfc63..7a42fc6 100644
--- a/migration/colo.c
+++ b/migration/colo.c
@@ -41,6 +41,42 @@ bool migration_incoming_in_colo_state(void)
      return mis && (mis->state == MIGRATION_STATUS_COLO);
  }

+static bool colo_runstate_is_stopped(void)
+{
+    return runstate_check(RUN_STATE_COLO) || !runstate_is_running();
+}
+
+static void primary_vm_do_failover(void)
+{
+    MigrationState *s = migrate_get_current();
+    int old_state;
+
+    if (s->state != MIGRATION_STATUS_FAILED) {
+        migrate_set_state(&s->state, MIGRATION_STATUS_COLO,
+                          MIGRATION_STATUS_COMPLETED);
+    }

That's a little odd; it will only move to completed if the current
state is MIGRATION_STATUS_COLO, but you only do it if the state isn't
FAILED.  You could remove the if and just call migrate_set_state
like that and it would be safe as long as you really only expect
to have to deal with MIGRATION_STATUS_COLO.


Yes, you are right. I will remove the check and set the state directly.

+    old_state = failover_set_state(FAILOVER_STATUS_HANDLING,
+                                   FAILOVER_STATUS_COMPLETED);
+    if (old_state != FAILOVER_STATUS_HANDLING) {
+        error_report("Serious error while do failover for Primary VM,"
+                     "old_state: %d", old_state);

It's generally better to state the reason rather than say it's 'serious';
so something like:
    'Incorrect state (%d) while doing failover for Primary VM'
tells you more, and looks less scary!


Ha, OK, I will fix it.

+        return;
+    }
+}
+
+void colo_do_failover(MigrationState *s)
+{
+    /* Make sure vm stopped while failover */
+    if (!colo_runstate_is_stopped()) {
+        vm_stop_force_state(RUN_STATE_COLO);
+    }

That feels like a race: couldn't we be just at the end
of taking a checkpoint and about to restart when you do
the if, so that it reads as currently stopped, but
then it is restarted by the time you have a chance to
do anything?

Do you mean that after we stop the VM in failover(), but before the
failover work is done,
the migration (checkpoint) thread may start the VM in the meantime?
colo_do_failover() runs in the context of a BH, which holds
the __lock__, so the checkpoint thread has no chance to execute vm_start().

I see in patch 13 (Save PVM....) you have:
      qemu_mutex_lock_iothread();
      vm_start();
      qemu_mutex_unlock_iothread();

So maybe if that code is executed just before the
failover, it would block on the _lock_; we would
run here, but then as soon as we finish, wouldn't it vm_start
there?


But it seems that we don't need to stop the VM to do the failover work.
I'm not so sure; I will investigate whether we can do it without stopping the VM.

+    if (get_colo_mode() == COLO_MODE_PRIMARY) {
+        primary_vm_do_failover();
+    }
+}
+
  /* colo checkpoint control helper */
  static int colo_ctl_put(QEMUFile *f, uint32_t cmd, uint64_t value)
  {
@@ -122,9 +158,22 @@ static int colo_do_checkpoint_transaction(MigrationState *s,
      }

      qemu_mutex_lock_iothread();
+    if (failover_request_is_active()) {
+        qemu_mutex_unlock_iothread();
+        ret = -1;
+        goto out;
+    }
      vm_stop_force_state(RUN_STATE_COLO);
      qemu_mutex_unlock_iothread();
      trace_colo_vm_state_change("run", "stop");
+    /*
+     * failover request bh could be called after
+     * vm_stop_force_state so we check failover_request_is_active() again.
+     */
+    if (failover_request_is_active()) {
+        ret = -1;
+        goto out;
+    }

I'm confused about why the check is needed specifically here;
for example, can't that happen at any point where we're outside of the
iothread lock?  E.g. couldn't we set failover just a couple of
lines lower, let's say just after the s->params.blk = 0 ?


Yes, you are right: failover could happen at any place where we are not
holding the iothread lock. We should check the failover status after every
important step. I'll add more checks in the next version ...


      /* Disable block migration */
      s->params.blk = 0;
@@ -221,6 +270,11 @@ static void colo_process_checkpoint(MigrationState *s)
      trace_colo_vm_state_change("stop", "run");

      while (s->state == MIGRATION_STATUS_COLO) {
+        if (failover_request_is_active()) {
+            error_report("failover request");
+            goto out;
+        }
+
          current_time = qemu_clock_get_ms(QEMU_CLOCK_HOST);
          if (current_time - checkpoint_time <
              s->parameters[MIGRATION_PARAMETER_CHECKPOINT_DELAY]) {
@@ -242,8 +296,6 @@ out:
      if (ret < 0) {
          error_report("%s: %s", __func__, strerror(-ret));
      }
-    migrate_set_state(&s->state, MIGRATION_STATUS_COLO,
-                      MIGRATION_STATUS_COMPLETED);

I had to think about this; but yes, I guess the only way out is via the failover
which does the completed above.


Yes, thanks.

Hailiang

Dave

      qsb_free(buffer);
      buffer = NULL;
--
1.8.3.1


--
Dr. David Alan Gilbert / address@hidden / Manchester, UK
