Re: [RFC RESEND 0/6] hugetlbfs largepage RAS project
From: David Hildenbrand
Subject: Re: [RFC RESEND 0/6] hugetlbfs largepage RAS project
Date: Thu, 12 Sep 2024 00:07:51 +0200
User-agent: Mozilla Thunderbird
Hi again,
This is a QEMU RFC to introduce the possibility to deal with hardware
memory errors impacting hugetlbfs memory backed VMs. When using
hugetlbfs large pages, any large page location being impacted by an
HW memory error results in poisoning the entire page, suddenly making
a large chunk of the VM memory unusable.

The implemented proposal is simply a memory mapping change when an HW
error is reported to QEMU: the hugetlbfs large page is transformed into
a set of standard-sized pages. The failed large page is unmapped and a
set of standard-sized pages is mapped in its place.

This mechanism is triggered when a SIGBUS/MCE_MCEERR_Ax signal is
received by QEMU and the reported location corresponds to a large page.
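To make the described flow concrete, here is a minimal, self-contained
sketch of the idea (not the actual patch: the handler name is made up,
and a real implementation would defer the remapping out of the signal
handler rather than call mmap() there):

#define _GNU_SOURCE
#include <signal.h>
#include <stdint.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

/* Illustrative SIGBUS handler: for a memory error, si_addr_lsb gives the
 * log2 size of the poisoned mapping (e.g. 30 for a 1 GiB hugetlb page).
 * The poisoned large page is replaced by anonymous standard-sized pages. */
static void handle_hwpoison_sigbus(int sig, siginfo_t *si, void *ctx)
{
    if (si->si_code != BUS_MCEERR_AR && si->si_code != BUS_MCEERR_AO) {
        _exit(EXIT_FAILURE);              /* bail out: not a memory-error SIGBUS */
    }

    size_t err_size = (size_t)1 << si->si_addr_lsb;
    uintptr_t start = (uintptr_t)si->si_addr & ~(uintptr_t)(err_size - 1);

    /* Unmap the failed large page and map standard-sized pages in place.
     * The content of the whole large page is lost. */
    if (mmap((void *)start, err_size, PROT_READ | PROT_WRITE,
             MAP_FIXED | MAP_PRIVATE | MAP_ANONYMOUS, -1, 0) == MAP_FAILED) {
        _exit(EXIT_FAILURE);
    }
}

int main(void)
{
    struct sigaction sa = {
        .sa_sigaction = handle_hwpoison_sigbus,
        .sa_flags = SA_SIGINFO,
    };
    sigemptyset(&sa.sa_mask);
    sigaction(SIGBUS, &sa, NULL);
    pause();                              /* wait for a (simulated) error */
    return 0;
}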
One clarifying question: you simply replace the hugetlb page by multiple
small pages using mmap(MAP_FIXED). So you
(a) are not able to recover any memory of the original page (as of now)
(b) no longer have a hugetlb page and, therefore, possibly suffer a
performance degradation, which is relevant for low-latency applications
that really care about the usage of hugetlb pages.
(c) run into the described inconsistency issues
Why is what you propose beneficial over just fallocate(PUNCH_HOLE)'ing the
full page and getting a fresh, non-poisoned page instead?
Sure, you have to reserve some pages if that ever happens, but what is
the big selling point over PUNCH_HOLE + realloc? (sorry if I missed it
and it was spelled out)
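For comparison, a minimal sketch of that PUNCH_HOLE alternative, assuming
the guest RAM is a MAP_SHARED mapping of a hugetlbfs file (the helper name
and parameters are illustrative):

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>

/* Free the poisoned huge page in the backing hugetlbfs file and let the
 * next access fault in a fresh one from the reserved pool. The old
 * content is lost, just as with the remap-to-small-pages approach. */
int replace_poisoned_hugepage(int fd, char *base, off_t page_offset,
                              size_t hugepage_size)
{
    if (fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
                  page_offset, hugepage_size) < 0) {
        perror("fallocate(PUNCH_HOLE)");
        return -1;
    }

    /* Touching the range now allocates a fresh, zeroed huge page,
     * provided a spare page is available in the hugetlb pool. */
    memset(base + page_offset, 0, hugepage_size);
    return 0;
}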
This gives the possibility to:
- Take advantage of a newer hypervisor kernel providing a way to retrieve
the still-valid data of the impacted poisoned hugetlbfs large page.
Reading that again, that shouldn't have to be hypervisor-specific.
Really, if someone were to extract data from a poisoned hugetlb folio,
it shouldn't be hypervisor-specific. The kernel should be able to know
which regions are accessible and could allow ways for reading these, one
way or the other.
It could just be a fairly hugetlb-special feature that would replace the
poisoned page by a fresh hugetlb page where as much page content as
possible has been recovered from the old one.
If the backend file is MAP_SHARED, we can copy the valid data into the
fresh page.
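Purely to illustrate what that copy could look like from userspace, here
is a sketch assuming the kernel exposed the still-intact 4 KiB subpages of
a poisoned hugetlb page for reading (it does not today, and all names
below are made up):

#include <stddef.h>
#include <string.h>

#define SUBPAGE_SIZE 4096UL

/* Copy everything except the poisoned standard-sized subpage from the old
 * (poisoned) large page into its freshly allocated replacement. */
void salvage_hugepage(const char *old_page, char *new_page,
                      size_t hugepage_size, size_t poisoned_offset)
{
    size_t bad = poisoned_offset & ~(SUBPAGE_SIZE - 1);

    for (size_t off = 0; off < hugepage_size; off += SUBPAGE_SIZE) {
        if (off == bad) {
            memset(new_page + off, 0, SUBPAGE_SIZE);  /* data is lost here */
        } else {
            memcpy(new_page + off, old_page + off, SUBPAGE_SIZE);
        }
    }
}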
Thank you David for this first reaction to this proposal.
How are you dealing with other consumers of the shared memory,
such as vhost-user processes,
In the current proposal, I don't deal with this aspect.
In fact, any other process sharing the changed memory will continue to
map the poisoned large page, so any access to this page will deliver a
SIGBUS to that other process.
In this situation, vhost-user processes should continue to receive
SIGBUS signals (and probably continue to die because of that).
That's ... suboptimal. :)
Assume you have a 1 GiB page. The guest OS can happily allocate buffers
in there so they can end up in vhost-user and crash that process.
Without any warning.
So I do see a real problem if two QEMU processes are sharing the
same hugetlbfs segment -- in this case, error recovery should not
occur on this piece of memory. Maybe dealing with this situation
via the "ivshmem" options is doable (marking the shared segment
"not eligible" for hugetlbfs recovery, just like hugetlbfs entries
without "share=on" are not eligible)
-- I need to think about this specific case.
Please let me know if there is a better way to deal with this
shared memory aspect and have a better system reaction.
Not creating the inconsistency in the first place :)
vm migration whereby RAM is migrated using file content,
Migration doesn't currently work with memory poisoning.
You can have a look at the following commit, which is already integrated:
06152b89db64 migration: prevent migration when VM has poisoned memory
This proposal doesn't change anything in that respect.
That commit is fairly fresh and likely missed the option to *not*
migrate RAM by reading it, but to migrate it through a shared file
instead. For example, VM live-upgrade (CPR) wants to use that (or is
already using it), to avoid RAM migration completely.
vfio that might have these pages pinned?
AFAIK even pinned memory can be impacted by a memory error and poisoned
by the kernel. Now, as I said in the cover letter, I'd like to know if
we should take extra care for IO memory, vfio-configured memory buffers...
Assume your GPU has a hugetlb folio pinned via vfio. As soon as you make
the guest RAM point at anything other than what VFIO is aware of, we end
up in the same problem we had when we learned about having to disable
balloon inflation (MADV_DONTNEED) as soon as VFIO has pinned pages.
We'd have to inform VFIO that the mapping is now different. Otherwise
it's really better to crash the VM than to have your GPU read/write
different data than your CPU reads/writes.
In general, you cannot simply replace pages by private copies
when somebody else might be relying on these pages to go to
actual guest RAM.
This is correct, but the current proposal is dealing with a specific
shared memory type: poisoned large pages. So any other process mapping
this type of page can't access it without generating a SIGBUS.
Right, and that's the issue. Because, for example, how should the VM be
aware that this memory is now special and must not be used for some
purposes without leading to problems elsewhere?
It sounds very hacky and incomplete at first.
As you can see, RAS features need to be completed.
And if this proposal is incomplete, what other changes should be
made to complete it?
I do hope we can discuss this RFC to adapt what is incorrect, or
find a better way to address this situation.
One long-term goal people are working on is to allow remapping the
hugetlb folios in smaller granularity, such that only a single affected
PTE can be marked as poisoned. (used to be called high-granularity-mapping)
However, at the same time, the focus seems to shift towards using
guest_memfd instead of hugetlb, once it supports 1 GiB pages and shared
memory. It will likely be easier to support mapping 1 GiB pages using
PTEs that way, and there are ongoing discussions on how that can be
achieved more easily.
There are also discussions [1] about not poisoning the mappings at all
and handling it differently. But I haven't yet digested what exactly that
could look like in reality.
[1] https://lkml.kernel.org/r/20240828234958.GE3773488@nvidia.com
--
Cheers,
David / dhildenb