Re: [PATCH v2 3/3] Lift max memory slots limit imposed by vhost-user
From: Raphael Norwitz
Subject: Re: [PATCH v2 3/3] Lift max memory slots limit imposed by vhost-user
Date: Thu, 20 Feb 2020 02:03:08 -0500
User-agent: Mutt/1.5.21 (2010-09-15)
Ping
On Sun, Feb 09, 2020 at 12:43:35PM -0500, Raphael Norwitz wrote:
>
> On Thu, Feb 06, 2020 at 03:32:38AM -0500, Michael S. Tsirkin wrote:
> >
> > On Wed, Jan 15, 2020 at 09:57:06PM -0500, Raphael Norwitz wrote:
> > > The current vhost-user implementation in QEMU imposes a limit on the
> > > maximum number of memory slots exposed to a VM using a vhost-user
> > > device. This change provides a new protocol feature
> > > VHOST_USER_F_CONFIGURE_SLOTS which, when enabled, lifts this limit and
> > > allows a VM with a vhost-user device to expose a configurable number of
> > > memory slots, up to the ACPI defined maximum. Existing backends which
> > > do not support this protocol feature are unaffected.
> >
> > Hmm ACPI maximum seems to be up to 512 - is this too much to fit in a
> > single message? So can't we just increase the number (after negotiating
> > with remote) and be done with it, instead of add/remove? Or is there
> > another reason to prefer add/remove?
> >
>
> As mentioned in my cover letter, we experimented with simply increasing
> the message size and it didn't work on our setup. We debugged down to the
> socket layer and found that on the receiving end the messages were
> truncated at around 512 bytes, or around 16 memory regions. To support
> 512 memory regions we would need a message size of around 32 <bytes per
> region> * 512 <regions> + 8 <bytes for padding and region count> ~= 16k
> packet size. That would be 64 times larger than the next largest message
> size. We thought it would be cleaner and more in line with the rest of
> the protocol to keep the message sizes smaller. In particular, we thought
> memory regions should be treated like the rings, which are sent over one
> message at a time instead of in one large message.
> Whether or not such a large message size can be made to work in our case,
> separate messages will always work on Linux, and most likely all other
> UNIX platforms QEMU is used on.
>
> > >
> > > This feature works by using three new messages,
> > > VHOST_USER_GET_MAX_MEM_SLOTS, VHOST_USER_ADD_MEM_REG and
> > > VHOST_USER_REM_MEM_REG. VHOST_USER_GET_MAX_MEM_SLOTS gets the
> > > number of memory slots the backend is willing to accept when the
> > > backend is initialized. Then, when the memory tables are set or updated,
> > > a series of VHOST_USER_ADD_MEM_REG and VHOST_USER_REM_MEM_REG messages
> > > are sent to transmit the regions to map and/or unmap instead of trying
> > > to send all the regions in one fixed size VHOST_USER_SET_MEM_TABLE
> > > message.
> > >
> > > The vhost_user struct maintains a shadow state of the VM’s memory
> > > regions. When the memory tables are modified, the
> > > vhost_user_set_mem_table() function compares the new device memory state
> > > to the shadow state and only sends regions which need to be unmapped or
> > > mapped in. The regions which must be unmapped are sent first, followed
> > > by the new regions to be mapped in. After all the messages have been
> > > sent, the shadow state is set to the current virtual device state.
> > >
> > > The current feature implementation does not work with postcopy migration
> > > and cannot be enabled if the VHOST_USER_PROTOCOL_F_REPLY_ACK feature has
> > > also been negotiated.
> >
> > Hmm what would it take to lift the restrictions?
> > conflicting features like this makes is very hard for users to make
> > an informed choice what to support.
> >
>
> We would need a setup with a backend which supports these features (REPLY_ACK
> and postcopy migration). At first glance it looks like DPDK could work but
> I'm not sure how easy it will be to test postcopy migration with the resources
> we have.
>
> > > Signed-off-by: Raphael Norwitz <address@hidden>
> > > Signed-off-by: Peter Turschmid <address@hidden>
> > > Suggested-by: Mike Cui <address@hidden>
> > > ---
> > > diff --git a/hw/virtio/vhost-user.c b/hw/virtio/vhost-user.c
> > > index af83fdd..fed6d02 100644
> > > --- a/hw/virtio/vhost-user.c
> > > +++ b/hw/virtio/vhost-user.c
> > > @@ -35,11 +35,29 @@
> > > #include <linux/userfaultfd.h>
> > > #endif
> > >
> > > -#define VHOST_MEMORY_MAX_NREGIONS 8
> > > +#define VHOST_MEMORY_LEGACY_NREGIONS 8
> >
> > Hardly legacy when this is intended to always be used e.g. with
> > postcopy, right?
> >
>
> How about 'BASELINE'?
> > > + msg->hdr.size = sizeof(msg->payload.mem_reg.padding);
> > > + msg->hdr.size += sizeof(VhostUserMemoryRegion);
> > > +
> > > + /*
> > > + * Send VHOST_USER_REM_MEM_REG for memory regions in our shadow state
> > > + * which are not found not in the device's memory state.
> >
> > double negation - could not parse this.
> >
>
> Apologies - typo here. It should say "Send VHOST_USER_REM_MEM_REG for
> memory regions in our shadow state which are not found in the device's
> memory state."
> i.e. send messages to remove regions in the shadow state but not in the
> updated device state.
>
> > > + */
> > > +    for (i = 0; i < u->num_shadow_regions; ++i) {
> > > +        reg = dev->mem->regions;
> > > +
> > > +        for (j = 0; j < dev->mem->nregions; j++) {
> > > +            reg = dev->mem->regions + j;
> > > +
> > > +            assert((uintptr_t)reg->userspace_addr == reg->userspace_addr);
> > > +            mr = memory_region_from_host((void *)(uintptr_t)reg->userspace_addr,
> > > +                                         &offset);
> > > +            fd = memory_region_get_fd(mr);
> > > +
> > > +            if (reg_equal(&u->shadow_regions[i], reg)) {
> > > +                matching = true;
> > > +                found[j] = true;
> > > +                break;
> > > +            }
> > > +        }
> > > +
> > > +        if (fd > 0 && !matching) {
> > > +            msg->hdr.request = VHOST_USER_REM_MEM_REG;
> > > +            msg->payload.mem_reg.region.userspace_addr = reg->userspace_addr;
> > > +            msg->payload.mem_reg.region.memory_size = reg->memory_size;
> > > +            msg->payload.mem_reg.region.guest_phys_addr =
> > > +                reg->guest_phys_addr;
> > > +            msg->payload.mem_reg.region.mmap_offset = offset;
> > > +
> > > +            if (vhost_user_write(dev, msg, &fd, 1) < 0) {
> > > +                return -1;
> > > +            }
> > > +        }
> > > +    }
> > > +