From: Peter Xu
Subject: Re: [Qemu-devel] [RFC 29/29] migration: reset migrate thread vars when resumed
Date: Fri, 4 Aug 2017 16:52:16 +0800
User-agent: Mutt/1.5.24 (2015-08-30)

On Thu, Aug 03, 2017 at 02:54:35PM +0100, Dr. David Alan Gilbert wrote:
> * Peter Xu (address@hidden) wrote:
> > Firstly, MigThrError enumeration is introduced to describe the error in
> > migration_detect_error() better. This gives the migration_thread() a
> > chance to know whether a recovery has happened.
> >
> > Then, if a recovery is detected, migration_thread() will reset its local
> > variables to prepare for that.
> >
> > Signed-off-by: Peter Xu <address@hidden>
> > ---
> > migration/migration.c | 40 +++++++++++++++++++++++++++++-----------
> > 1 file changed, 29 insertions(+), 11 deletions(-)
> >
> > diff --git a/migration/migration.c b/migration/migration.c
> > index ecebe30..439bc22 100644
> > --- a/migration/migration.c
> > +++ b/migration/migration.c
> > @@ -2159,6 +2159,15 @@ static bool postcopy_should_start(MigrationState *s)
> > return atomic_read(&s->start_postcopy) || s->start_postcopy_fast;
> > }
> >
> > +typedef enum MigThrError {
> > + /* No error detected */
> > + MIG_THR_ERR_NONE = 0,
> > + /* Detected error, but resumed successfully */
> > + MIG_THR_ERR_RECOVERED = 1,
> > + /* Detected fatal error, need to exit */
> > + MIG_THR_ERR_FATAL = 2,
> > +} MigThrError;
> > +
>
> Could you move this patch earlier to when postcopy_pause is created
> so it's created with this enum?
Sure.
[...]
> > @@ -2319,6 +2327,7 @@ static void *migration_thread(void *opaque)
> > /* The active state we expect to be in; ACTIVE or POSTCOPY_ACTIVE */
> > enum MigrationStatus current_active_state = MIGRATION_STATUS_ACTIVE;
> > bool enable_colo = migrate_colo_enabled();
> > + MigThrError thr_error;
> >
> > rcu_register_thread();
> >
> > @@ -2395,8 +2404,17 @@ static void *migration_thread(void *opaque)
> > * Try to detect any kind of failures, and see whether we
> > * should stop the migration now.
> > */
> > - if (migration_detect_error(s)) {
> > + thr_error = migration_detect_error(s);
> > + if (thr_error == MIG_THR_ERR_FATAL) {
> > + /* Stop migration */
> > break;
> > + } else if (thr_error == MIG_THR_ERR_RECOVERED) {
> > + /*
> > + * Just recovered from a e.g. network failure, reset all
> > + * the local variables.
> > + */
> > + initial_time = qemu_clock_get_ms(QEMU_CLOCK_REALTIME);
> > + initial_bytes = 0;
>
> They don't seem that important to reset?
The problem is that we have this in migration_thread():
    if (current_time >= initial_time + BUFFER_DELAY) {
        uint64_t transferred_bytes = qemu_ftell(s->to_dst_file) -
                                     initial_bytes;
        uint64_t time_spent = current_time - initial_time;
        double bandwidth = (double)transferred_bytes / time_spent;
        threshold_size = bandwidth * s->parameters.downtime_limit;
        ...
    }
Here qemu_ftell() may return a very small value, since we have just
resumed. Then "qemu_ftell(s->to_dst_file) - initial_bytes" is actually
negative, and because the result is stored in a uint64_t it wraps
around, so transferred_bytes becomes extremely huge.
Then, with luck, we'll get an extremely huge "bandwidth" as well.
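
To make the wraparound concrete, here is a minimal standalone sketch of
the same arithmetic. It is not the QEMU code: bandwidth_estimate() and
bandwidth_after_reset() are hypothetical helpers, file_pos stands in for
the value returned by qemu_ftell(s->to_dst_file), and the int64_t
parameter types are an assumption made for illustration.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Bandwidth in bytes per millisecond, computed like the snippet above.
 * file_pos stands in for qemu_ftell(s->to_dst_file); the parameter
 * types are an assumption for this sketch.
 */
static double bandwidth_estimate(int64_t file_pos, int64_t initial_bytes,
                                 uint64_t time_spent_ms)
{
    /* Guard against division by zero; the caller waits BUFFER_DELAY first. */
    assert(time_spent_ms > 0);

    /*
     * If file_pos < initial_bytes the difference is negative, and
     * storing it in a uint64_t wraps it to a value close to 2^64.
     */
    uint64_t transferred_bytes = file_pos - initial_bytes;

    return (double)transferred_bytes / time_spent_ms;
}

/* Same computation after the reset on recovery: initial_bytes = 0. */
static double bandwidth_after_reset(int64_t file_pos, uint64_t time_spent_ms)
{
    return bandwidth_estimate(file_pos, 0, time_spent_ms);
}
```

With a stale initial_bytes of 1 GiB and only 4 KiB written to the new
channel, bandwidth_estimate() yields a value on the order of 1e17 bytes
per millisecond, while bandwidth_after_reset() yields roughly 41.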
--
Peter Xu