qemu-devel
Re: [PATCH V3 01/16] machine: anon-alloc option


From: David Hildenbrand
Subject: Re: [PATCH V3 01/16] machine: anon-alloc option
Date: Thu, 7 Nov 2024 14:23:41 +0100
User-agent: Mozilla Thunderbird

On 06.11.24 21:12, Steven Sistare wrote:


On 11/4/2024 4:36 PM, David Hildenbrand wrote:
On 04.11.24 21:56, Steven Sistare wrote:
On 11/4/2024 3:15 PM, David Hildenbrand wrote:
On 04.11.24 20:51, David Hildenbrand wrote:
On 04.11.24 18:38, Steven Sistare wrote:
On 11/4/2024 5:39 AM, David Hildenbrand wrote:
On 01.11.24 14:47, Steve Sistare wrote:
Allocate anonymous memory using mmap MAP_ANON or memfd_create depending
on the value of the anon-alloc machine property.  This option applies to
memory allocated as a side effect of creating various devices. It does
not apply to memory-backend-objects, whether explicitly specified on
the command line, or implicitly created by the -m command line option.

The memfd option is intended to support new migration modes, in which the
memory region can be transferred in place to a new QEMU process, by sending
the memfd file descriptor to the process.  Memory contents are preserved,
and if the mode also transfers device descriptors, then pages that are
locked in memory for DMA remain locked.  This behavior is a pre-requisite
for supporting vfio, vdpa, and iommufd devices with the new modes.
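
To illustrate the mechanism described above (handing a memfd to another process so the memory is reused in place), here is a minimal sketch, not code from the patch series. It assumes Linux with memfd_create() and passes the descriptor over an AF_UNIX socket using SCM_RIGHTS. The names send_fd(), recv_fd() and demo_transfer() are hypothetical, and the demo runs both ends of the socket in one process purely to keep the example short.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/socket.h>
#include <sys/uio.h>
#include <unistd.h>

/* Control-message buffer, aligned for struct cmsghdr. */
union fd_cmsg {
    struct cmsghdr align;
    char buf[CMSG_SPACE(sizeof(int))];
};

/* Send one file descriptor over a connected AF_UNIX socket. */
int send_fd(int sock, int fd)
{
    char byte = 0;
    struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
    union fd_cmsg u;
    struct msghdr msg;
    struct cmsghdr *cmsg;

    assert(fd >= 0);
    memset(&u, 0, sizeof(u));
    memset(&msg, 0, sizeof(msg));
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = u.buf;
    msg.msg_controllen = sizeof(u.buf);

    cmsg = CMSG_FIRSTHDR(&msg);
    cmsg->cmsg_level = SOL_SOCKET;
    cmsg->cmsg_type = SCM_RIGHTS;
    cmsg->cmsg_len = CMSG_LEN(sizeof(int));
    memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));

    return sendmsg(sock, &msg, 0) == 1 ? 0 : -1;
}

/* Receive one file descriptor; returns the new fd or -1. */
int recv_fd(int sock)
{
    char byte;
    struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
    union fd_cmsg u;
    struct msghdr msg;
    struct cmsghdr *cmsg;
    int fd = -1;

    memset(&msg, 0, sizeof(msg));
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = u.buf;
    msg.msg_controllen = sizeof(u.buf);

    if (recvmsg(sock, &msg, 0) != 1)
        return -1;
    cmsg = CMSG_FIRSTHDR(&msg);
    if (!cmsg || cmsg->cmsg_level != SOL_SOCKET ||
        cmsg->cmsg_type != SCM_RIGHTS)
        return -1;
    memcpy(&fd, CMSG_DATA(cmsg), sizeof(int));
    return fd;
}

/* Write through one mapping, pass the fd, read through a second mapping. */
int demo_transfer(void)
{
    const size_t size = 4096;
    int sv[2], ret = -1;
    int fd = memfd_create("demo-ram", MFD_CLOEXEC);

    if (fd < 0)
        return -1;
    if (ftruncate(fd, size) < 0 ||
        socketpair(AF_UNIX, SOCK_STREAM, 0, sv) < 0) {
        close(fd);
        return -1;
    }
    char *a = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (a != MAP_FAILED) {
        strcpy(a, "guest data");
        int fd2 = send_fd(sv[0], fd) == 0 ? recv_fd(sv[1]) : -1;
        if (fd2 >= 0) {
            char *b = mmap(NULL, size, PROT_READ, MAP_SHARED, fd2, 0);
            if (b != MAP_FAILED) {
                ret = strcmp(b, "guest data") == 0 ? 0 : -1;
                munmap(b, size);
            }
            close(fd2);
        }
        munmap(a, size);
    }
    close(sv[0]);
    close(sv[1]);
    close(fd);
    return ret;
}
```

The received descriptor refers to the same memory object, so the second mapping sees the contents written through the first without any copy.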

A more portable, non-Linux specific variant of this will be using shm,
similar to backends/hostmem-shm.c.

Likely we should be using that instead of memfd, or try hiding the
details. See below.

For this series I would prefer to use memfd and hide the details.  It's a
concise (and well tested) solution albeit linux only.  The code you supply
for posix shm would be a good follow on patch to support other unices.

Unless there is reason to use memfd we should start with the more
generic POSIX variant that is available even on systems without memfd.
Factoring stuff out as I drafted does look quite compelling.

I can help with the rework, and send it out separately, so you can focus
on the "machine toggle" as part of this series.

Of course, if we find out we need the memfd internally instead under
Linux for whatever reason later, we can use that instead.

But IIUC, the main selling point for memfd are additional features
(hugetlb, memory sealing) that you aren't even using.

FWIW, I'm looking into some details, and one difference is that shm_open()
under Linux (glibc) seems to go to /dev/shm and memfd/SYSV go to the internal
tmpfs mount. There is not a big difference, but there can be some difference
(e.g., sizing of the /dev/shm mount).

Sizing is a non-trivial difference.  One can by default allocate all memory
using memfd_create.  To do so using shm_open requires configuration on the
mount.  One step harder to use.

Yes.


This is a real issue for memory-backend-ram, and becomes an issue for the
internal RAM if memory-backend-ram has hogged all the memory.

Regarding memory-backend-ram,share=on, I assume we can use memfd if available,
but then fall back to shm_open().

Yes, and if that is a good idea, then the same should be done for internal RAM
-- memfd if available and fall back to shm_open.

Yes.


I'm hoping we can find a way where it just all is rather intuitive, like

"default-ram-share=on": behave for internal RAM just like 
"memory-backend-ram,share=on"

"memory-backend-ram,share=on": use whatever mechanism we have to give us 
"anonymous" memory that can be shared using an fd with another process.

Thoughts?

Agreed, though I thought I had already landed at the intuitive specification in
my patch.  The user must explicitly configure memory-backend-* to be usable with
CPR, and anon-alloc controls everything else.  Now we're just riffing on the
details: memfd vs shm_open, spelling of options and words to describe them.

Well, yes, and making it all a bit more consistent and the "machine option" behave just 
like "memory-backend-ram,share=on".

Hi David and Peter,

I have implemented and tested the following, for both qemu_memfd_create
and qemu_shm_alloc.  This is pseudo-code, with error conditions omitted
for simplicity.

Any comments before I submit a complete patch?

----
qemu-options.hx:
      ``aux-ram-share=on|off``
          Allocate auxiliary guest RAM as an anonymous file that is
          shareable with an external process.  This option applies to
          memory allocated as a side effect of creating various devices.
          It does not apply to memory-backend-objects, whether explicitly
          specified on the command line, or implicitly created by the -m
          command line option.

          Some migration modes require aux-ram-share=on.

qapi/migration.json:
      @cpr-transfer:
           ...
           Memory-backend objects must have the share=on attribute, but
           memory-backend-epc is not supported.  The VM must be started
           with the '-machine aux-ram-share=on' option.

Define RAM_PRIVATE

Define qemu_shm_alloc(), from David's tmp patch

ram_backend_memory_alloc()
      ram_flags = backend->share ? RAM_SHARED : RAM_PRIVATE;
      memory_region_init_ram_flags_nomigrate(ram_flags)

qemu_ram_alloc_internal()
      ...
      if (!host && !(ram_flags & RAM_PRIVATE) && current_machine->aux_ram_share)
          new_block->flags |= RAM_SHARED;

      if (!host && (new_block->flags & RAM_SHARED)) {
          qemu_ram_alloc_shared(new_block);
      } else {
          new_block->fd = -1;
          new_block->host = host;
      }
      ram_block_add(new_block);

qemu_ram_alloc_shared()
      if qemu_memfd_check()
          new_block->fd = qemu_memfd_create()
      else
          new_block->fd = qemu_shm_alloc()

Yes, that way "memory-backend-ram,share=on" will just mean "give me the best shared memory for RAM to be shared with other processes, I don't care about the details", and it will work on Linux kernels even before we had memfds.

memory-backend-ram should be available on all architectures, and under Windows. qemu_anon_ram_alloc() under Linux just does nothing special, not even bail out.

MAP_SHARED|MAP_ANON was always weird, because it meant "give me memory I can share only with subprocesses", but then, *there are no subprocesses for QEMU*. I recall there was a trick to obtain the fd under Linux for these regions using /proc/self/fd/, but it's very Linux specific ...

So nobody would *actually* use that shared memory and it was only a hack for RDMA. Now we can do better.


We'll have to decide if we simply fall back to qemu_anon_ram_alloc() if no shared memory can be created (unavailable), like we do on Windows.

So maybe something like

qemu_ram_alloc_shared()
        fd = -1;

        if (qemu_memfd_available()) {
                fd = qemu_memfd_create();
                if (fd < 0)
                        ... error
        } else if (qemu_shm_available()) {
                fd = qemu_shm_alloc();
                if (fd < 0)
                        ... error
        } else {
                /*
                 * Old behavior: try fd-less shared memory. We might
                 * just end up with non-shared memory on Windows, but
                 * nobody can make use of this shared memory either way
                 * ... should we just use non-shared memory? Or should
                 * we simply bail out? But then, if there is no shared
                 * memory nobody could possibly use it.
                 */
                qemu_anon_ram_alloc(share=true)
        }
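
For readers who want to try the fallback order outside of QEMU, here is a standalone sketch under stated assumptions. It is not the QEMU implementation: alloc_shared_fd() and map_shared() are hypothetical names, the "/sketch-shm-<pid>" object name is made up for the example, and the final fd-less branch (qemu_anon_ram_alloc) from the pseudo-code above is left out. It tries memfd_create() first and falls back to POSIX shm_open(), unlinking the name immediately so the object stays anonymous. On older glibc versions shm_open() may require linking with -lrt.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Return an fd for 'size' bytes of anonymous, shareable memory,
 * or -1 on failure.
 */
int alloc_shared_fd(size_t size)
{
    int fd = -1;

    assert(size > 0);
#ifdef MFD_CLOEXEC
    /* Preferred on Linux: not limited by the /dev/shm mount. */
    fd = memfd_create("aux-ram", MFD_CLOEXEC);
#endif
    if (fd < 0) {
        /* Fallback: POSIX shared memory with a throwaway name. */
        char name[64];

        snprintf(name, sizeof(name), "/sketch-shm-%ld", (long)getpid());
        fd = shm_open(name, O_RDWR | O_CREAT | O_EXCL, 0600);
        if (fd < 0)
            return -1;
        /* Drop the name right away; the fd keeps the object alive. */
        shm_unlink(name);
    }
    if (ftruncate(fd, (off_t)size) < 0) {
        close(fd);
        return -1;
    }
    return fd;
}

/* Map the object MAP_SHARED; returns NULL on failure. */
void *map_shared(int fd, size_t size)
{
    void *p = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);

    return p == MAP_FAILED ? NULL : p;
}
```

Either way the caller ends up with an fd that another process can map after receiving it, which is the property the share=on semantics discussed above rely on.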
--
Cheers,

David / dhildenb



