qemu-block

Re: Issue with QEMU Live Migration


From: Fabiano Rosas
Subject: Re: Issue with QEMU Live Migration
Date: Fri, 23 Aug 2024 11:42:27 -0300

"Arisetty, Chakri" <carisett@akamai.com> writes:

> Hello,
>
> Here is more data if my earlier mail did not provide enough details. I 
> apologize for not providing the critical data points in my previous mail.
>
> - Created a file (dd if=/dev/urandom of=/orig.img bs=1M count=1000) before 
> starting live migration
> - Started migration with block-job-cancel command before entering into 
> pre-switchover

Is this a type of migration that you have attempted before and that used
to work? Or is this the first time you're using the mirror job for live
migration?

I was expecting something like:

- start the mirror job
- qmp_migrate
- once PRE_SWITCHOVER is reached, issue block-job-cancel
- qmp_migrate_continue
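
Written out as QMP commands, that ordering could look roughly like the
sketch below. It only builds the command list and does not talk to a
QEMU instance. The helper name, the NBD target and the migration URI are
placeholders made up for illustration; the device name is the one from
your query-block-jobs output. It also assumes that the
pause-before-switchover capability is what makes the migration stop in
PRE_SWITCHOVER.

```python
import json


def migration_sequence(device, mirror_target, migrate_uri):
    """Build the QMP commands for the ordering above (hypothetical helper)."""
    return [
        # Assumption: pause-before-switchover makes the migration stop in
        # the pre-switchover state so the block job can be dealt with.
        {"execute": "migrate-set-capabilities",
         "arguments": {"capabilities": [
             {"capability": "pause-before-switchover", "state": True}]}},
        # Start the mirror job towards the NBD export on the destination.
        {"execute": "drive-mirror",
         "arguments": {"device": device, "target": mirror_target,
                       "sync": "full", "mode": "existing"}},
        # Start the RAM migration.
        {"execute": "migrate", "arguments": {"uri": migrate_uri}},
        # The caller must wait here until query-migrate reports
        # "pre-switchover" before sending the remaining two commands.
        {"execute": "block-job-cancel", "arguments": {"device": device}},
        {"execute": "migrate-continue",
         "arguments": {"state": "pre-switchover"}},
    ]


if __name__ == "__main__":
    # Placeholder host name and ports, not taken from the report.
    for cmd in migration_sequence(
            "drive-scsi-disk-0",
            "nbd:dst.example:10809:exportname=drive-scsi-disk-0",
            "tcp:dst.example:4444"):
        print(json.dumps(cmd))
```

If your management code issues block-job-cancel at a different point
than this, that difference is the first thing I'd look at.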

To be clear, at this point I don't think probing the job status from the
migration code to wait for the job to finish is the right thing to
do. Let's first attempt to rule out any bugs or incorrect usage.
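
One check that can be done from the management side, without touching
the migration code, is to look at the query-block-jobs entry right
before issuing block-job-cancel and see whether the job reports ready
with offset equal to len. A minimal sketch, with a made-up helper name,
using the numbers from your output quoted below. Treating the trace's
total/current as len/offset is an assumption on my part.

```python
def mirror_is_synced(job):
    """Return True if a query-block-jobs entry looks fully synchronized.

    Hypothetical helper: it inspects a single snapshot of the job and
    says nothing about writes that arrive afterwards.
    """
    return bool(job.get("ready")) and job["offset"] == job["len"]


# Entry as reported in the thread, right after the job became ready.
ready_job = {"device": "drive-scsi-disk-0", "type": "mirror",
             "status": "ready", "ready": True,
             "len": 5368709120, "offset": 5368709120}

# Same job with the total/current values from the later mirror_run
# trace line (assumed to correspond to len/offset).
lagging_job = dict(ready_job, len=5371002880, offset=5368840192)

if __name__ == "__main__":
    print(mirror_is_synced(ready_job))    # True
    print(mirror_is_synced(lagging_job))  # False
```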

> - During the RAM migration, I copied the original file to new file (dd 
> of=/migration.img if=/orig.img bs=1M count=1000)
> - During the RAM migration, I also started stress-ng (stress-ng --hdd 10 
> --hdd-bytes 4G -i 8 -t 72000s)
> - Issued sync command to flush the new buffer contents into the disk. VM 
> stalled completely
> - Migration was completed successfully
> - Rebooted the VM and checked for the file (/migration.img). The file does 
> not exist. So, block device contents are NOT synced.
>
> So, we have a potential for customer data loss. This is the problem we 
> currently have.
>
> Can someone advise?
>
> Thanks
> Chakri
>
>
> On 8/23/24, 6:30 AM, "Arisetty, Chakri" <carisett@akamai.com> wrote:
>
>
> Hi,
>
>
> Thank you once again!
>
>
>> It's still not entirely clear to me what the situation is here. When the
>> migration reaches pre-switchover state the VM is stopped, so there would
>> be no more IO happening. Is this a matter of a race condition (of sorts)
>> because pre-switchover happens while the block mirror job is still
>> transferring the final blocks? Or is it instead about the data being in
>> transit over the network and not yet reaching the destination machine?
>
>
> When the migration reaches pre-switchover with the block job cancelled, there 
> are no dirty blocks. But if the block job is NOT cancelled, there are dirty 
> blocks, and those blocks are transferred to the NBD server.
>
>
> # When the block mirror job is running before entering the pre-switchover 
> state, the dirty count is '0' and the job moves from the 'running' state to 
> the 'ready' state.
> # block-job-cancel is NOT issued with the test.
> 1695226@1724348063.794485:mirror_run < s 0x55e5b9ffbe40 in_flight: 0 
> bytes_in_flight: 0 dirty count 0 active_write_bytes_in_flight 0 total 
> 5368709120 current 5368709120 deltla 1630 iostatus 0
>
>
> 1695226@1724348063.795152:job_state_transition job 0x55e5b9ffbe40 (ret: 0) 
> attempting allowed transition (running-->ready)
>
>
> # QMP command for 'query-block-jobs'
> 1695226@1724348063.845789:qmp_exit_query_block_jobs [{"auto-finalize": true, 
> "io-status": "ok", "device": "drive-scsi-disk-0", "auto-dismiss": true, 
> "busy": false, "len": 5368709120, "offset": 5368709120, "status": "ready", 
> "paused": false, "speed": 100000000, "ready": true, "type": "mirror"}] 1
>
>
> # RAM migration enters 'pre-switchover'; the dirty count keeps incrementing 
> and the NBD client keeps sending block pages to the NBD server.
>
>
> 1695226@1724348070.968831:mirror_run < s 0x55e5b9ffbe40 in_flight: 0 
> bytes_in_flight: 0 dirty count 131072 active_write_bytes_in_flight 0 total 
> 5368840192 current 5368709120 deltla 950 iostatus 0
> ...
> 1695226@1724348070.970540:mirror_run < s 0x55e5b9ffbe40 in_flight: 0 
> bytes_in_flight: 0 dirty count 2162688 active_write_bytes_in_flight 0 total 
> 5371002880 current 5368840192 deltla 1547585 iostatus 0
> ..
>
>
> RAM migration takes a very long time to go from 'pre-switchover' to 
> 'completion' for a VM with more RAM. Stopping/cancelling the block job during 
> that period causes disk writes made for its entire duration to be lost.
>
>
> Is there a way or API/callback in qemu to indicate there are no dirty blocks 
> and invoke the API from RAM migration code?
>
>
> I'd appreciate if anyone can help me with it.
>
>
> Thanks
> Chakri
>
>
>
>
> On 8/22/24, 6:47 AM, "Fabiano Rosas" <farosas@suse.de> wrote:
>
> "Arisetty, Chakri" <carisett@akamai.com> writes:
>
>
>
>
> Ugh, it seems I messed up the CC addresses, let's see if this time they
> go out right. For those new to the thread, we're discussing this bug:
>
>
>
>
> https://gitlab.com/qemu-project/qemu/-/issues/2482
>
>
>
>
>> Hi,
>>
>> Thank you for getting back to me.
>>
>> Yes, I have opened the ticket 
>> https://gitlab.com/qemu-project/qemu/-/issues/2482
>>
>>> So the core of the issue here is that the block job is transitioning to
>>> ready while the migration is still ongoing so there's still dirtying
>>> happening.
>>
>> Yes, this is the problem I have. The RAM migration state has already moved 
>> to pre-switchover and the mirror block job has moved to the "READY" state, 
>> assuming that there are no more dirty blocks.
>> But there are still dirty blocks, and these dirty blocks are being 
>> transferred to the destination host.
>>
>> I've created a small patch (attached) in mirror.c to put the mirror job 
>> back into the "RUNNING" state if there are any dirty pages.
>> But I would still like to prevent the RAM migration state from moving to 
>> pre-switchover when there are dirty blocks.
>
>
>
>
> It's still not entirely clear to me what the situation is here. When the
> migration reaches pre-switchover state the VM is stopped, so there would
> be no more IO happening. Is this a matter of a race condition (of sorts)
> because pre-switchover happens while the block mirror job is still
> transferring the final blocks? Or is it instead about the data being in
> transit over the network and not yet reaching the destination machine?
>
>
>
>
> Is the disk inactivation after the pre-switchover affecting this at all?
>
>
>
>
>>
>>> docs/interop/live-block-operations.rst? Section "QMP invocation for live
>>> storage migration with ``drive-mirror`` + NBD", point 4 says that a
>>> block-job-cancel should be issued after BLOCK_JOB_READY is
>>> reached. Although there is mention of when the migration should be
>>> performed.
>>
>> Thanks for the pointer, I've looked at this part (block-job-cancel). The 
>> problem is that QEMU on the source host is still transferring the dirty 
>> blocks.
>> That is the reason I am trying to avoid moving RAM migration state to 
>> pre-switchover when there are any dirty pages.
>>
>> Is there a way in QEMU to know whether the disk transfer has completed, and 
>> to stop dirty pages from being transferred?
>
>
>
>
> Sorry, I can't help here. We have block layer people in CC, they might
> be able to advise.
>
>
>
>
>>
>> Thanks
>> Chakri
>>
>>
>> On 8/21/24, 6:56 AM, "Fabiano Rosas" <farosas@suse.de> wrote:
>>
>>
>> "Arisetty, Chakri" <carisett@akamai.com> writes:
>>
>>
>>> Hello,
>>>
>>> I’m having trouble with live migration and I’m using QEMU 7.2.0 on Debian 
>>> 11.
>>>
>>> Migration state switches to pre-switchover state during the RAM migration.
>>>
>>> My assumption is that disks are already migrated and there are no further 
>>> dirty pages to be transferred from source host to destination host. 
>>> Therefore, NBD client on the source host closes the connection to the NBD 
>>> server on the destination host. But we observe that there are still some 
>>> dirty pages being transferred.
>>> Prematurely closing the NBD connection results in a BLOCK JOB error.
>>> In the RAM migration code (migration/migration.c), I’d like to check for 
>>> block mirror job’s status before RAM migration state is moved to 
>>> pre-switchover. I’m unable to find any block job related code in RAM 
>>> migration code.
>>>
>>> Could someone help me figure out what might be going wrong, or suggest any 
>>> troubleshooting steps or advice to get around the issue?
>>>
>>> Thanks
>>> Chakri
>>
>>
>> Hi, I believe it was you who opened this bug as well? 
>>
>>
>> https://gitlab.com/qemu-project/qemu/-/issues/2482
>>
>>
>> So the core of the issue here is that the block job is transitioning to
>> ready while the migration is still ongoing so there's still dirtying
>> happening.
>>
>>
>> Have you looked at the documentation at
>> docs/interop/live-block-operations.rst? Section "QMP invocation for live
>> storage migration with ``drive-mirror`` + NBD", point 4 says that a
>> block-job-cancel should be issued after BLOCK_JOB_READY is
>> reached. Although there is mention of when the migration should be
>> performed.
>>
>>
>>
>> diff --git a/block/mirror.c b/block/mirror.c
>> index 251adc5ae..3457afe1d 100644
>> --- a/block/mirror.c
>> +++ b/block/mirror.c
>> @@ -1089,6 +1089,10 @@ static int coroutine_fn mirror_run(Job *job, Error **errp)
>> break;
>> }
>> 
>> + if (cnt != 0 && job_is_ready(&s->common.job)) {
>> + job_transition_to_running(&s->common.job);
>> + }
>> +
>> if (job_is_ready(&s->common.job) && !should_complete) {
>> delay_ns = (s->in_flight == 0 &&
>> cnt == 0 ? BLOCK_JOB_SLICE_TIME : 0);
>> diff --git a/include/qemu/job.h b/include/qemu/job.h
>> index e502787dd..87dbef0d2 100644
>> --- a/include/qemu/job.h
>> +++ b/include/qemu/job.h
>> @@ -641,6 +641,12 @@ int job_apply_verb_locked(Job *job, JobVerb verb, Error **errp);
>> */
>> void job_early_fail(Job *job);
>> 
>> +/**
>> + * Moves the @job from READY to RUNNING.
>> + * Called with job_mutex *not* held.
>> + */
>> +void job_transition_to_running(Job *job);
>> +
>> /**
>> * Moves the @job from RUNNING to READY.
>> * Called with job_mutex *not* held.
>> diff --git a/job.c b/job.c
>> index 72d57f093..298d90817 100644
>> --- a/job.c
>> +++ b/job.c
>> @@ -62,7 +62,7 @@ bool JobSTT[JOB_STATUS__MAX][JOB_STATUS__MAX] = {
>> /* C: */ [JOB_STATUS_CREATED] = {0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 1},
>> /* R: */ [JOB_STATUS_RUNNING] = {0, 0, 0, 1, 1, 0, 1, 0, 1, 0, 0},
>> /* P: */ [JOB_STATUS_PAUSED] = {0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0},
>> - /* Y: */ [JOB_STATUS_READY] = {0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0},
>> + /* Y: */ [JOB_STATUS_READY] = {0, 0, 1, 0, 0, 1, 1, 0, 1, 0, 0},
>> /* S: */ [JOB_STATUS_STANDBY] = {0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0},
>> /* W: */ [JOB_STATUS_WAITING] = {0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0},
>> /* D: */ [JOB_STATUS_PENDING] = {0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0},
>> @@ -1035,6 +1035,12 @@ static int job_transition_to_pending_locked(Job *job)
>> return 0;
>> }
>> 
>> +void job_transition_to_running(Job *job)
>> +{
>> + JOB_LOCK_GUARD();
>> + job_state_transition_locked(job, JOB_STATUS_RUNNING);
>> +}
>> +
>> void job_transition_to_ready(Job *job)
>> {
>> JOB_LOCK_GUARD();
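
For anyone following the job.c hunk in the quoted patch: the one-bit
change to the READY row of JobSTT can be read as in the small model
below. It assumes the columns follow the JobStatus enum order
(undefined, created, running, paused, ready, standby, waiting, pending,
aborting, concluded, null); the helper name is made up, and only the
READY row from the diff is modelled.

```python
# Assumed column order of the JobSTT table (JobStatus enum order).
STATUSES = ["undefined", "created", "running", "paused", "ready",
            "standby", "waiting", "pending", "aborting", "concluded",
            "null"]

# READY row before and after the quoted patch, copied from the diff.
READY_ROW_BEFORE = [0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0]
READY_ROW_AFTER = [0, 0, 1, 0, 0, 1, 1, 0, 1, 0, 0]


def allowed_targets(row):
    """List the states a job may move to, given one row of the table."""
    return [name for name, allowed in zip(STATUSES, row) if allowed]


if __name__ == "__main__":
    print(allowed_targets(READY_ROW_BEFORE))
    print(allowed_targets(READY_ROW_AFTER))
```

Under that assumption, the only new edge the patch introduces is
ready --> running.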


