Re: [PATCH RFC] memory: Don't allow to resize RAM while migrating
From: David Hildenbrand
Subject: Re: [PATCH RFC] memory: Don't allow to resize RAM while migrating
Date: Fri, 14 Feb 2020 17:45:57 +0100
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:68.0) Gecko/20100101 Thunderbird/68.4.1
>> a) In precopy during the second migration.
>> b) In postcopy during the first migration.
>
> After reading your reply - even the 1st migration of precopy? Say,
> when source QEMU resets and found changed FW during the precopy?
I think the FW will only change during migration (depends on the other
QEMU version) - but yeah, might be possible - no expert.
>
>>
>>>
>>> And is this patch trying to fix/warn when there's a reboot during (3)
>>> so the new size is discovered at a wrong time? Is my understanding
>>> correct?
>>
>> It's trying to bail out early instead of failing at other random points
>> (with an unclear outcome).
>
> Yeah, I am just uncertain on whether in some cases it could be a
> silent success (when used_length changed, however migration still
> completed without error reported) and now we're changing it to a VM
> crash... Could that happen?
>
> - before the patch, when precopy triggers this,
>
>   - when it didn't encounter issue with the changed used_length, it
>     could get silently ignored. Lucky enough & good case.
>
>   - when it triggered an error, precopy failed. _However_, we can
>     simply restart... so still not so bad.
>
> - after the patch, when precopy detects this, we abort
>   immediately... Which is really not good...
See the other sub-thread (see below): we're thinking about canceling
pre-copy, which could work just fine.
>
> If you see, that's the major thing I was worrying about...
>
> And since used_length is growing in most cases as you said (at least
> before virtio-mem comes? :), I'm suspecting that could be the major
hah! :) The thing about virtio-mem is that it can actually decide not to
resize during migration (and I have that implemented right now) - the ACPI
code can't.
> case that there will be a silent success.
The thing is, it might not be a silent success but a very strange
error/crash. We have a data race here. But yeah, I agree that we should
at least keep precopy from crashing.
>>>> In the precopy case it would be easier to abort (although, not simple
>>>> AFAIKS), in the postcopy not so easy - because you're already partially
>>>> running on the migration target.
>>>
>>> Prior to this patch, would a precopy still survive with such an
>>> accident (asked because I _feel_ like migrating a ramblock with
>>> smaller used_length to the same ramblock with bigger used_length seems
>>> to be fine?)? Or we can stop the precopy and restart. After this
>>
>> I assume growing the region is the usual case (not shrinking). FW blobs
>> tend to get bigger.
>>
>> Migrating while growing a ram block on the source won't work. The source
>> would try to send a dirty page that's outside of the used_length on the
>> target, making e.g., ram_load_postcopy()/ram_load_precopy() fail with
>> "Illegal RAM offset...".
>
> Right.
>
>>
>> In the postcopy case, e.g., ram_dirty_bitmap_reload() will fail in case
>> there is a mismatch between ram block size on source/target.
>
> IMHO that's an extremely rare case (one example I can think of):
>
> - we start a postcopy after a precopy
> - system reset, noticed a firmware update
> - we got a network failure, postcopy interrupted
> - we try to recover a postcopy
>
> So are you using postcopy recovery? I will be surprised if so because
> then you'll be the first user I know that really used that besides QE. :)
One of my strengths is to read code and find flaws :P
Good to know that that should be "barely" affected for now :)
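To make the growing-block failure quoted above concrete, here is a minimal
toy model, not QEMU code. ToyRamBlock, toy_offset_in_block() and
toy_resize() are hypothetical names; the only assumption is that the target
rejects any page offset at or beyond its own used_length, which is what
leads to the "Illegal RAM offset..." error.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Simplified stand-in for a RAM block; NOT QEMU's real RAMBlock. */
typedef struct {
    uint64_t used_length; /* bytes currently in use */
    uint64_t max_length;  /* upper bound the block may grow to */
} ToyRamBlock;

/* Target-side check: a page is only accepted if it starts inside
 * the target's own used_length. */
static bool toy_offset_in_block(const ToyRamBlock *b, uint64_t offset)
{
    return offset < b->used_length;
}

/* Source-side resize; refuses to grow beyond max_length. */
static bool toy_resize(ToyRamBlock *b, uint64_t new_len)
{
    if (new_len > b->max_length) {
        return false;
    }
    b->used_length = new_len;
    return true;
}
```

If the source block grows from 0x20000 to 0x30000 while the target still has
used_length 0x20000, a dirty page at offset 0x28000 is valid on the source
but rejected on the target.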
>> Another issue is if the used_length changes while in ram_save_setup(),
>> just between storing ram_bytes_total_common(true) and storing
>> block->used_length. A mismatch will screw up the migration stream.
>
> Yes this seems to be another issue then. IIUC the ramblocks are
> protected by RCU, the migration code has always been with the read
> lock there so logically it should see a consistent view of system
> ramblocks in ram_save_setup(). IMHO the thing that was inconsistent
> is that RCU is not safe enough for changing used_length for a ramblock.
Yes.
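As an illustration of the ram_save_setup() mismatch only: a sketch that
reads each used_length exactly once and derives the total from those same
values, so the total and the per-block sizes written to the stream cannot
disagree. ToyBlock and toy_snapshot_lengths() are hypothetical names, and
this is not how the migration code is structured today. It also does not
help with pages sent later, after the size has changed.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

/* Toy block whose size may be changed concurrently by another thread. */
typedef struct {
    _Atomic uint64_t used_length;
} ToyBlock;

/* Read every used_length exactly once into snap[] and return the sum
 * of the values that were read. Callers should then use only snap[]
 * and the returned total, never re-read used_length. */
static uint64_t toy_snapshot_lengths(ToyBlock *blocks, size_t n,
                                     uint64_t *snap)
{
    uint64_t total = 0;

    for (size_t i = 0; i < n; i++) {
        snap[i] = atomic_load(&blocks[i].used_length);
        total += snap[i];
    }
    return total;
}
```

A resize that lands after the snapshot no longer produces an inconsistent
header, but the snapshot is then stale, which is why a resize still has to
be prevented or signalled to the migration code.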
>
>>
>> But these are just the immediately visible issues. I am more concerned
>> about used_length changing at random points in time, resulting in more
>> harm. (e.g., non-obvious load-store tearing when accessing the used length)
>>
>> Migration code is inherently racy when it comes to ram block resizes.
>> And that might become more dangerous once we want to size the migration
>> bitmaps smaller (used_length instead of max_length) or disallow access
>> to ram blocks beyond the used_length. Both are things I am working on :)
>
> Right. Now I start to wonder whether migration is the only special guy
> here. I noticed at least we've got:
>
> struct RAMBlockNotifier {
>     void (*ram_block_added)(RAMBlockNotifier *n, void *host, size_t size);
>     void (*ram_block_removed)(RAMBlockNotifier *n, void *host, size_t size);
>     QLIST_ENTRY(RAMBlockNotifier) next;
> };
>
> I suspect at least all these users could also break in some way if
> resize happens.
Hah! You should read
https://lore.kernel.org/qemu-devel/address@hidden/
:)
VFIO is indeed broken on resizes - and fixed in that series (I assume
nobody migrates ...). HAX and SEV simply pin all memory and don't care
about any used_length changes. For now, the callbacks are called with
max_length, which works but is not extensible.
See my suggestion in
https://lore.kernel.org/qemu-devel/address@hidden/
which builds upon a RAM resize notifier.
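A rough sketch of what a resize notifier could look like, as a standalone
toy and not the code from that series. ToyRamNotifier, its
ram_block_resized callback, toy_notifier_add(), toy_notify_resized() and
ToyMapper are all hypothetical; the RAMBlockNotifier quoted above only has
the added/removed callbacks.

```c
#include <assert.h>
#include <stddef.h>

typedef struct ToyRamNotifier ToyRamNotifier;

struct ToyRamNotifier {
    /* Hypothetical callback: invoked when used_length changes. */
    void (*ram_block_resized)(ToyRamNotifier *n, void *host,
                              size_t old_size, size_t new_size);
    ToyRamNotifier *next; /* singly linked list of listeners */
};

static ToyRamNotifier *toy_notifiers;

/* Register a listener at the head of the list. */
static void toy_notifier_add(ToyRamNotifier *n)
{
    n->next = toy_notifiers;
    toy_notifiers = n;
}

/* Tell every listener that implements the callback about a resize. */
static void toy_notify_resized(void *host, size_t old_size, size_t new_size)
{
    for (ToyRamNotifier *n = toy_notifiers; n; n = n->next) {
        if (n->ram_block_resized) {
            n->ram_block_resized(n, host, old_size, new_size);
        }
    }
}

/* Example listener that tracks how much of the block it has mapped. */
typedef struct {
    ToyRamNotifier notifier; /* must stay first: the cast below relies on it */
    size_t mapped_size;
} ToyMapper;

static void toy_mapper_resized(ToyRamNotifier *n, void *host,
                               size_t old_size, size_t new_size)
{
    ToyMapper *m = (ToyMapper *)n;

    (void)host;
    (void)old_size;
    m->mapped_size = new_size;
}
```

With such a callback, a listener can follow used_length instead of being
registered once with max_length, and migration code could in principle hook
in the same way to react to a resize.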
>
>>
>>> patch, it'll crash the source VM (&error_abort specified in
>>> memory_region_ram_resize()), which seems a bit more harsh?
>>
>> There seems to be no easy way to abort migration from outside the
>> migration thread. As Juan said, you actually don't want to fail
>> migration but instead soft-abort migration and continue running the
>> guest on the target on a reset. But that's not easy as well.
>>
>> One could think about extending ram block notifiers to notify migration
>> code (before) resizes, so that migration code can work around the
>> resize (how is TBD). Not easy as well :)
>
> True. But if you see my worry still stands, on whether such a patch
> would make things worse by crashing it when it could still have a
> chance to survive. Shall we loosen the penalty of that even if we want
> to warn the user earlier?
Canceling migration in precopy case should be fine. Postcopy needs more
thought.
I certainly don't want to live with strange data races in migration code
because "it could work sometimes eventually".
Thanks for all the comments and thoughts!
--
Thanks,
David / dhildenb