From: eslay
Subject: [Qemu-devel] [Bug 626781] [NEW] Live migration: bandwidth calculation and rate limiting not working
Date: Mon, 30 Aug 2010 12:16:07 -0000

Public bug reported:

I am using QEMU 0.12.5 to perform live migration between two Linux hosts.
One host has 6 cores and 24G RAM, the other has 2 cores and 16G RAM. On
each host I have one Ethernet interface for NFS storage, another interface
for live migration, and a third interface for the VM to communicate with
the outside network. Each interface has 1 Gb/s of bandwidth.

It is observed that a program like the one below (which dirties pages very
quickly) will hang the live migration:

#include <stdio.h>
#include <stdlib.h>

/* Continuously rewrite a 1 GB buffer so that guest pages are dirtied as
 * fast as possible. */
int main(void)
{
    unsigned char *array;
    long int i, j, k;
    unsigned char c;
    long int loop = 0;

    array = malloc(1024 * 1024 * 1024);
    if (array == NULL)
        return 1;

    while (1) {
        for (i = 0; i < 1024; i++) {
            c = 0;
            for (j = 0; j < 1024; j++) {
                c++;
                for (k = 0; k < 1024; k++) {
                    array[i * 1024 * 1024 + j * 1024 + k] = c;
                }
            }
        }
        loop++;
        if (loop % 256 == 0)
            printf("%ld\n", loop);
    }
}


It is observed that the traffic downtime (measured with "ping -f" from a third
host) depends on the RAM size of the virtual machine:

RAM Size    Traffic Downtime    Total Migration Time
1024M       0.5s                33s
2048M       0.7s                34s
4096M       2.7s                39s
8912M       5.3s                45s
16384M      7.2s                61s

Using the "migrate_set_downtime" command in the QEMU monitor does not
improve the situation.

Function ram_save_live() in "vl.c" shows that live migration has three
stages:

Stage 1 is some preparation work.
 
Stage 2 transfers the VM's RAM to the target host while keeping the VM running
on the source host. During Stage 2, the real-time migration bandwidth is
calculated (lines 3099~3117 in vl.c). At the end of each Stage 2 iteration
(line 3130), the expected remaining transfer time is calculated (remaining RAM
size / calculated bandwidth). If the expected remaining time is less than the
maximum allowed migration downtime, Stage 2 ends and Stage 3 starts.
 
Stage 3 stops the VM on the source host, transfers the remaining RAM at full
speed, and then starts the VM on the target host. The duration of Stage 3 is
supposed to be the only period during which the outside world loses its
connection to the VM.
 
This is how live migration is supposed to work.
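
In code, the Stage 2 exit test described above amounts to something like the
following minimal sketch (the function and parameter names are mine; in the
real ram_save_live() the inputs correspond to ram_save_remaining() *
TARGET_PAGE_SIZE, the calculated bandwidth, and migrate_max_downtime()):

#include <stdbool.h>
#include <stdint.h>

/* Minimal sketch of the Stage 2 exit test; names are illustrative only. */
static bool stage2_should_finish(uint64_t remaining_ram_bytes,
                                 double bandwidth_bytes_per_ns,
                                 uint64_t max_downtime_ns)
{
    if (bandwidth_bytes_per_ns <= 0.0)
        return false;   /* no measurable progress yet: keep iterating */

    /* Expected time to push the remaining RAM at the measured rate. */
    double expected_time_ns = remaining_ram_bytes / bandwidth_bytes_per_ns;

    /* Enter Stage 3 (stop the VM and flush) only when the remaining RAM
     * can be transferred within the allowed downtime. */
    return expected_time_ns <= (double)max_downtime_ns;
}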

There is a parameter max_throttle in "migration.c" which sets the maximum
allowed bandwidth for rate limiting. Its default value is 32Mb/s (if the
"migrate_set_speed" command is not used to change it). But that does not
matter, because the rate limiting function does not work anyway. There is
another parameter max_downtime in "migration.c" which sets the maximum
allowed traffic downtime for live migration. By default it is set to 30 ms
(if the "migrate_set_downtime" command is not used to change it). This value
is way too small, so with the source code above, live migration hangs:
Stage 2 never ends, because the expected remaining time never drops below
30 ms. Changing the parameter to something like 1000 ms solves the hanging
problem.
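
For example, before starting the migration the allowed downtime can be raised
to 1 second from the QEMU monitor (the target address below is only a
placeholder):

(qemu) migrate_set_downtime 1
(qemu) migrate -d tcp:<target-host>:4444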

After changing the default value of max_downtime, the long traffic downtime
problem still exists. The following faults were found:

a) The bandwidth calculation in ram_save_live() (the first attachment) is
wrong. The bandwidth should equal the amount of data transferred divided by
the transmission time, and the transmission time should be the interval
between two consecutive calls of ram_save_live(), which is usually 100 ms
(there should be a timer interrupt controlling this). However, what the code
uses is the execution time of the while loop between lines 3102 and 3109,
which is usually only 2~5 ms! This yields an unreasonably large bandwidth
(6~12 Gb/s), which in turn makes the estimated execution time of Stage 3
inaccurate. For example, if the estimated execution time of Stage 3 is 900 ms,
the actual execution time can be more like 10 s!
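
A minimal sketch of the measurement that a) calls for, i.e. bandwidth taken
over the wall-clock interval between two consecutive ram_save_live() calls
rather than over the copy loop alone, might look like this (get_clock() is
QEMU's nanosecond clock; the function and the static variable are mine, for
illustration only):

#include <stdint.h>

extern int64_t get_clock(void);     /* QEMU's ns-resolution clock */

/* Bandwidth in bytes/ns, measured over the interval since the previous call.
 * The first call sees a huge interval and therefore reports ~0, which only
 * delays the Stage 3 decision by one iteration. */
static double interval_bandwidth(uint64_t bytes_sent_this_round)
{
    static int64_t prev_call_ns;            /* time of the previous call */
    int64_t now_ns = get_clock();
    int64_t interval_ns = now_ns - prev_call_ns;    /* ~100 ms expected */

    prev_call_ns = now_ns;
    if (interval_ns <= 0)
        return 0.0;
    return (double)bytes_sent_this_round / interval_ns;
}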
 
b) The rate limiting function (qemu_file_rate_limit(), which calls
buffered_rate_limit() in "buffered_file.c") does not work at all. No matter
what parameters are set, it behaves the same: during Stage 2 the migration
bandwidth is ~400 Mb/s most of the time, but when a certain condition is
fulfilled (I don't know exactly which condition, but it is definitely not the
number of iterations), QEMU reads the VM's RAM at full speed and throws
everything onto the Ethernet link. This stalls the CPU and extends the
execution time of ram_save_live() to up to 6 seconds. Correspondingly, 6
seconds of traffic downtime are seen during Stage 2. (The algorithm actually
assumes there is no traffic downtime at all during Stage 2.)
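
For comparison, an effective Stage 2 limiter would have to enforce a
per-interval byte budget derived from the configured speed, roughly along
these lines (a sketch only; the names are not claimed to match
buffered_file.c):

#include <stdbool.h>
#include <stdint.h>

struct rate_limiter {
    uint64_t bytes_this_interval;   /* bytes already sent in this interval */
    uint64_t xfer_limit;            /* byte budget per 100 ms interval     */
};

/* Returns true when the sender must stop until the next interval begins. */
static bool rate_limit_exceeded(const struct rate_limiter *s)
{
    return s->bytes_this_interval >= s->xfer_limit;
}

/* Called by the periodic (e.g. 100 ms) timer to start a new interval. */
static void rate_limit_new_interval(struct rate_limiter *s)
{
    s->bytes_this_interval = 0;
}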
 
So the fundamental functions do not work when it comes to the traffic downtime
of live migration. At the same time, problems a) and b) make it very easy for
the algorithm to enter Stage 3: the calculated bandwidth is ridiculously large,
so as long as the max downtime is not left at a value like 30 ms and there is
no extensive memory modification during migration, the migration will finish.

I have a dirty fix for the problem. It assumes that in Stage 2 ram_save_live()
is called every 100 ms and that no more than 100 Mb of data may be transferred
per call:

static int count=0;
static int ram_save_live(Monitor *mon, QEMUFile *f, int stage, void *opaque)
{
    ram_addr_t addr;
    uint64_t bytes_transferred_last;
    double bwidth = 0;
    uint64_t expected_time = 0;
//    int64_t interval=0;
    bool flag=true;
    uint64_t bytes_transferred2=0;

    if (stage < 0) {
        cpu_physical_memory_set_dirty_tracking(0);
        return 0;
    }
//    printf("ram_save_live: stage= %d\n",stage);
    if (cpu_physical_sync_dirty_bitmap(0, TARGET_PHYS_ADDR_MAX) != 0) {
        qemu_file_set_error(f);
        return 0;
    }

    if (stage == 1) {
        bytes_transferred = 0;

        /* Make sure all dirty bits are set */
        for (addr = 0; addr < last_ram_offset; addr += TARGET_PAGE_SIZE) {
            if (!cpu_physical_memory_get_dirty(addr, MIGRATION_DIRTY_FLAG))
                cpu_physical_memory_set_dirty(addr);
        }

        /* Enable dirty memory tracking */
        cpu_physical_memory_set_dirty_tracking(1);

        qemu_put_be64(f, last_ram_offset | RAM_SAVE_FLAG_MEM_SIZE);
    }

    bytes_transferred_last = bytes_transferred;
    bwidth = get_clock();
    /* Cap each call at 1 Gbit / 10 calls per second (~12.8 MiB), i.e. assume
     * a 1 Gb/s link and a 100 ms call interval. */
    while (bytes_transferred2 <= 1024*1024*1024/8/10) {
//  while ((!qemu_file_rate_limit(f)) && (bytes_transferred2 <= 1024*1024*1024/8/10)) {
        int ret;

        ret = ram_save_block(f);
        bytes_transferred += ret * TARGET_PAGE_SIZE;
        bytes_transferred2 += ret * TARGET_PAGE_SIZE;
        if (ret == 0) /* no more blocks */
            break;
    }

    count ++;
    bwidth = get_clock()-bwidth;
    /* Enforce the 100 ms-per-call assumption: treat any shorter measured
     * interval as 100 ms (get_clock() is in ns). */
    if (bwidth < 100000000) {
        bwidth = 100000000;
        flag = false;
    }
    if (flag)
        printf("ram_save_live: interval = %ld ms, count = %d\n",
               (int64_t)bwidth / 1000000, count);
    bwidth = (bytes_transferred - bytes_transferred_last) / bwidth ;

    /* if we haven't transferred anything this round, force expected_time to
     * a very high value, but without crashing */
    if (bwidth == 0)
        bwidth = 0.000001;

    /* Clamp the calculated bandwidth (bytes/ns) to roughly 1 Gb/s; the
     * threshold must be a floating-point expression, otherwise integer
     * division truncates it to zero. */
    if (bwidth > 1024.0*1024*1024/1000000000/8)
        bwidth = 1.024/8;

    /* try transferring iterative blocks of memory */
    if (stage == 3) {
        /* flush all remaining blocks regardless of rate limiting */
        while (ram_save_block(f) != 0) {
            bytes_transferred += TARGET_PAGE_SIZE;
        }
        cpu_physical_memory_set_dirty_tracking(0);
    }

    qemu_put_be64(f, RAM_SAVE_FLAG_EOS);

    expected_time = ram_save_remaining() * TARGET_PAGE_SIZE / bwidth;

    printf("ram_save_live: stage = %d, bwidth = %lf Mb/s, expected_time
= %ld ms, migrate_max_downtime = %ld
ms\n",stage,bwidth*1000*8,expected_time/1000000,
migrate_max_downtime()/1000000);

    return (stage == 2) && (expected_time <= migrate_max_downtime());
}

For an empty (idle) VM with 15G RAM, the dirty fix extends the total migration
time to ~2 minutes, but the traffic downtime can be kept to ~1 second. The
actual migration bandwidth is ~700 Mb/s the whole time.

This fix is very environment specific (it won't work with, say, a 10G link).
A thorough fix is needed for this problem.
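
One way to make the per-call cap less environment specific would be to derive
it from the configured migration speed (max_throttle / migrate_set_speed)
instead of hard-coding a 1 Gb/s link, along these lines (a sketch under the
same 100 ms-per-call assumption; the helper is hypothetical):

#include <stdint.h>

/* ram_save_live() is assumed to run roughly 10 times per second. */
#define CALLS_PER_SECOND 10

/* Byte budget per call, derived from the configured speed in bytes/s. */
static uint64_t per_call_budget(uint64_t max_bytes_per_second)
{
    return max_bytes_per_second / CALLS_PER_SECOND;
}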

** Affects: qemu
     Importance: Undecided
         Status: New

-- 
Live migration: bandwidth calculation and rate limiting not working
https://bugs.launchpad.net/bugs/626781