
Re: [Qemu-devel] [PATCH v6 0/8] Vhost and vhost-net support for userspace based backends


From: Antonios Motakis
Subject: Re: [Qemu-devel] [PATCH v6 0/8] Vhost and vhost-net support for userspace based backends
Date: Wed, 29 Jan 2014 13:04:28 +0100

Hello,

On Mon, Jan 27, 2014 at 5:49 PM, Michael S. Tsirkin <address@hidden> wrote:
>
> On Mon, Jan 27, 2014 at 05:37:02PM +0100, Antonios Motakis wrote:
> > Hello again,
> >
> >
> > On Wed, Jan 15, 2014 at 3:49 PM, Michael S. Tsirkin <address@hidden> wrote:
> > >
> > > On Wed, Jan 15, 2014 at 01:50:47PM +0100, Antonios Motakis wrote:
> > > >
> > > >
> > > >
> > > > On Wed, Jan 15, 2014 at 10:07 AM, Michael S. Tsirkin <address@hidden> wrote:
> > > >
> > > >     On Tue, Jan 14, 2014 at 07:13:43PM +0100, Antonios Motakis wrote:
> > > >     >
> > > >     >
> > > >     >
> > > >     > On Tue, Jan 14, 2014 at 12:33 PM, Michael S. Tsirkin <address@hidden> wrote:
> > > >     >
> > > >     >     On Mon, Jan 13, 2014 at 03:25:11PM +0100, Antonios Motakis wrote:
> > > >     >     > In this patch series we would like to introduce our approach
> > > >     >     > for putting a virtio-net backend in an external userspace
> > > >     >     > process. Our eventual target is to run the network backend in
> > > >     >     > the Snabbswitch ethernet switch, while receiving traffic from
> > > >     >     > a guest inside QEMU/KVM which runs an unmodified virtio-net
> > > >     >     > implementation.
> > > >     >     >
> > > >     >     > For this, we are working on extending vhost to allow
> > > >     >     > equivalent functionality for userspace. Vhost already passes
> > > >     >     > control of the data plane of virtio-net to the host kernel; we
> > > >     >     > want to realize a similar model, but for userspace.
> > > >     >     >
> > > >     >     > In this patch series the concept of a vhost-backend is
> > > >     >     > introduced.
> > > >     >     >
> > > >     >     > We define two vhost backend types - vhost-kernel and
> > > >     >     > vhost-user. The former is the interface to the current kernel
> > > >     >     > module implementation. Its control plane is ioctl based. The
> > > >     >     > data plane is the kernel directly accessing the QEMU-allocated
> > > >     >     > guest memory.
> > > >     >     >
> > > >     >     > In the new vhost-user backend, the control plane is based on
> > > >     >     > communication between QEMU and another userspace process using
> > > >     >     > a unix domain socket. This allows implementing a virtio
> > > >     >     > backend for a guest running in QEMU, inside the other
> > > >     >     > userspace process.
> > > >     >     >
> > > >     >     > We change -mem-path to QemuOpts and add prealloc, share and
> > > >     >     > unlink as properties to it. The HugeTLBFS requirements of
> > > >     >     > -mem-path are relaxed, so any valid path can be used now. The
> > > >     >     > new properties allow more fine-grained control over the guest
> > > >     >     > RAM backing store.
> > > >     >     >
> > > >     >     > The data path is realized by directly accessing the vrings and
> > > >     >     > the buffer data off the guest's memory.
> > > >     >     >
> > > >     >     > The current user of vhost-user is only vhost-net. We add a new
> > > >     >     > netdev backend that is intended to initialize vhost-net with
> > > >     >     > the vhost-user backend.
> > > >     >
> > > >     >     Some meta comments.
> > > >     >
> > > >     >     Something that makes this patch harder to review is how it's
> > > >     >     split up. Generally IMHO it's not a good idea to repeatedly
> > > >     >     edit same part of file adding stuff in patch after patch,
> > > >     >     it's only making things harder to read if you add stubs, then
> > > >     >     fill them up.
> > > >     >     (we do this sometimes when we are changing existing code, but
> > > >     >     it is generally not needed when adding new code)
> > > >     >
> > > >     >     Instead, split it like this:
> > > >     >
> > > >     >     1. general refactoring, split out linux specific and generic parts
> > > >     >        and add the ops indirection
> > > >     >     2. add new files for vhost-user with complete implementation.
> > > >     >        without command line to support it, there will be no way 
> > > > to use
> > > >     it,
> > > >     >        but should build fine.
> > > >     >     3. tie it all up with option parsing
> > > >     >
> > > >     >
> > > >     >     Generic vhost and vhost net files should be kept separate.
> > > >     >     Don't let vhost net stuff seep back into generic files,
> > > >     >     we have vhost-scsi too.
> > > >     >     I would also prefer that userspace vhost has its own files.
> > > >     >
> > > >     >
> > > >     > Ok, we'll keep this into account.
> > > >     >
> > > >     >
> > > >     >
> > > >     >     We need a small test server qemu can talk to, to verify things
> > > >     >     actually work.
> > > >     >
> > > >     >
> > > >     > We have implemented such a test app: https://github.com/virtualopensystems/vapp
> > > >     >
> > > >     > We use it for testing, and also as a reference implementation. A
> > > >     > client is also included.
> > > >     >
> > > >
> > > >     Sounds good. Can we include this in qemu and tie
> > > >     it into the qtest framework?
> > > >     From a brief look, it merely needs to be tweaked for portability,
> > > >     unless
> > > >
> > > >     >
> > > >     >     Already commented on: reuse the chardev syntax and preferably code.
> > > >     >     We already support a bunch of options there for
> > > >     >     domain sockets that will be useful here, they should
> > > >     >     work here as well.
> > > >     >
> > > >     >
> > > >     > We adapted the syntax for this to be consistent with chardev. For
> > > >     > the options we didn't use, it is not obvious at all to us how they
> > > >     > should be used; a lot of the chardev options just don't apply to us.
> > > >     >
> > > >
> > > >     Well server option should work at least.
> > > >     nowait can work too?
> > > >
> > > >     Also, if reconnect is useful it should be for chardevs too, so if we
> > > >     don't share code, we need to code it in two places to stay consistent.
> > > >
> > > >     Overall sharing some code might be better ...
> > > >
> > > >
> > > >
> > > > What you have in mind is to use the functions chardev uses from 
> > > > qemu-sockets.c
> > > > right? Chardev itself doesn't look to have anything else that can be 
> > > > shared.
> > >
> > > Yes.
> > >
> > > > The problem with reconnect is that it is implemented at the protocol 
> > > > level; we
> > > > are not just transparently reconnecting the socket. So the same 
> > > > approach would
> > > > most likely not apply for chardev.
> > >
> > > Chardev mostly just could use transparent reconnect.
> > > vhost-user could use that and get a callback to reconfigure
> > > everything after reconnect.
> > >
> > > Once you write up the protocol in some text file we can
> > > discuss this in more detail.
> > > For example I wonder how would feature negotiation work
> > > with reconnect: new connection could be from another
> > > application that does not support same features, but
> > > virtio assumes that device features never change.
> > >
> >
> > I attach the text document that we will include in the next version of
> > the series, which describes the vhost-user protocol.
> >
> > The protocol is based on and very close to the vhost kernel protocol.
> > Of note is the VHOST_USER_ECHO message, which is the only one that
> > doesn't have an equivalent ioctl in the kernel version of vhost; this
> > is the message that is being used to detect that the remote party is
> > not on the socket anymore. At that point QEMU will close the session
> > and try to initiate a new one on the same socket.
>
> What if e.g. features change in between?
> Everything just goes south, doesn't it?
>
> Is this detection and reconnect a must for your project?
>
> I think it would be simpler to
>         - generalize char unix socket handling code and reuse for vhost-user


In our next version we will completely reuse the chardev
infrastructure. In the process of doing that, we are adding the features
we need to chardev (specifically, support for ancillary data on the
socket). So the end user will use something along these lines:
    -chardev socket,path=/path,id=chr0 -netdev vhost-user,chardev=chr0

Of course, this will only be usable with a socket chardev; otherwise
we will fail gracefully.
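
For reference, passing file descriptors as ancillary data over such a
socket boils down to a sendmsg() call with an SCM_RIGHTS control message.
A minimal sketch of the sending side (the function name and the fixed
limit of 8 fds are illustrative, not the actual chardev patch):

    #include <string.h>
    #include <sys/types.h>
    #include <sys/socket.h>
    #include <sys/uio.h>

    /* Sketch: send a buffer plus up to 8 fds in the ancillary data. */
    static ssize_t send_with_fds(int sock, const void *buf, size_t len,
                                 int *fds, size_t fd_num)
    {
        struct msghdr msg = { 0 };
        struct iovec iov = { .iov_base = (void *)buf, .iov_len = len };
        char control[CMSG_SPACE(8 * sizeof(int))];
        struct cmsghdr *cmsg;

        msg.msg_iov = &iov;
        msg.msg_iovlen = 1;
        msg.msg_control = control;
        msg.msg_controllen = CMSG_SPACE(fd_num * sizeof(int));

        cmsg = CMSG_FIRSTHDR(&msg);
        cmsg->cmsg_level = SOL_SOCKET;
        cmsg->cmsg_type = SCM_RIGHTS;   /* fds travel in the control message */
        cmsg->cmsg_len = CMSG_LEN(fd_num * sizeof(int));
        memcpy(CMSG_DATA(cmsg), fds, fd_num * sizeof(int));

        return sendmsg(sock, &msg, 0);
    }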

>
>         - as a separate step, add live detection and reconnect abilities
>           to the generic code

So far we have done live detection with a special ECHO message. Is it
possible to detect if there is another listener on a unix socket in a
generic way?

Best regards,
Antonios

>
> > > >
> > > >
> > > >
> > > >     >     In particular you shouldn't require filesystem access by qemu,
> > > >     >     passing fd for domain socket should work.
> > > >     >
> > > >     >
> > > >     > We can add an option to pass an fd for the domain socket if
> > > >     > needed. However, as far as we understand, chardev doesn't do that
> > > >     > either (at least from looking at the man page). Maybe we
> > > >     > misunderstand what you mean.
> > > >
> > > >     Sorry. I got confused with e.g. tap which has this. This might be
> > > >     useful but does not have to block this patch.
> > > >
> > > >     >
> > > >     >
> > > >     >     > Example usage:
> > > >     >     >
> > > >     >     > qemu -m 1024 -mem-path /hugetlbfs,prealloc=on,share=on \
> > > >     >     >      -netdev type=vhost-user,id=net0,path=/path/to/sock,poll_time=2500 \
> > > >     >     >      -device virtio-net-pci,netdev=net0
> > > >     >
> > > >     >     It's not clear which parts of -mem-path are required for vhost-user.
> > > >     >     It should be documented somewhere, made clear in -help
> > > >     >     and should fail gracefully if misconfigured.
> > > >     >
> > > >     >
> > > >     >
> > > >     > Ok.
> > > >     >
> > > >     >
> > > >     >
> > > >     >     >
> > > >     >     > Changes from v5:
> > > >     >     >  - Split -mem-path unlink option to a separate patch
> > > >     >     >  - Fds are passed only in the ancillary data
> > > >     >     >  - Stricter message size checks on receive/send
> > > >     >     >  - Netdev vhost-user now includes path and poll_time options
> > > >     >     >  - The connection probing interval is configurable
> > > >     >     >
> > > >     >     > Changes from v4:
> > > >     >     >  - Use error_report for errors
> > > >     >     >  - VhostUserMsg has new field `size` indicating the following
> > > >     >     >    payload length. Field `flags` now has version and reply
> > > >     >     >    bits. The structure is packed.
> > > >     >     >  - Send data is of variable length (`size` field in message)
> > > >     >     >  - Receive in 2 steps, header and payload
> > > >     >     >  - Add new message type VHOST_USER_ECHO, to check connection status
> > > >     >     >
> > > >     >     > Changes from v3:
> > > >     >     >  - Convert -mem-path to QemuOpts with prealloc, share and
> > > >     >     >    unlink properties
> > > >     >     >  - Set 1 sec timeout when read/write to the unix domain socket
> > > >     >     >  - Fix file descriptor leak
> > > >     >     >
> > > >     >     > Changes from v2:
> > > >     >     >  - Reconnect when the backend disappears
> > > >     >     >
> > > >     >     > Changes from v1:
> > > >     >     >  - Implementation of vhost-user netdev backend
> > > >     >     >  - Code improvements
> > > >     >     >
> > > >     >     > Antonios Motakis (8):
> > > >     >     >   Convert -mem-path to QemuOpts and add prealloc and share properties
> > > >     >     >   New -mem-path option - unlink.
> > > >     >     >   Decouple vhost from kernel interface
> > > >     >     >   Add vhost-user skeleton
> > > >     >     >   Add domain socket communication for vhost-user backend
> > > >     >     >   Add vhost-user calls implementation
> > > >     >     >   Add new vhost-user netdev backend
> > > >     >     >   Add vhost-user reconnection
> > > >     >     >
> > > >     >     >  exec.c                            |  57 +++-
> > > >     >     >  hmp-commands.hx                   |   4 +-
> > > >     >     >  hw/net/vhost_net.c                | 144 +++++++---
> > > >     >     >  hw/net/virtio-net.c               |  42 ++-
> > > >     >     >  hw/scsi/vhost-scsi.c              |  13 +-
> > > >     >     >  hw/virtio/Makefile.objs           |   2 +-
> > > >     >     >  hw/virtio/vhost-backend.c         | 556 ++++++++++++++++++++++++++++++++++++++
> > > >     >     >  hw/virtio/vhost.c                 |  46 ++--
> > > >     >     >  include/exec/cpu-all.h            |   3 -
> > > >     >     >  include/hw/virtio/vhost-backend.h |  40 +++
> > > >     >     >  include/hw/virtio/vhost.h         |   4 +-
> > > >     >     >  include/net/vhost-user.h          |  17 ++
> > > >     >     >  include/net/vhost_net.h           |  15 +-
> > > >     >     >  net/Makefile.objs                 |   2 +-
> > > >     >     >  net/clients.h                     |   3 +
> > > >     >     >  net/hub.c                         |   1 +
> > > >     >     >  net/net.c                         |   2 +
> > > >     >     >  net/tap.c                         |  16 +-
> > > >     >     >  net/vhost-user.c                  | 177 ++++++++++++
> > > >     >     >  qapi-schema.json                  |  21 +-
> > > >     >     >  qemu-options.hx                   |  24 +-
> > > >     >     >  vl.c                              |  41 ++-
> > > >     >     >  22 files changed, 1106 insertions(+), 124 deletions(-)
> > > >     >     >  create mode 100644 hw/virtio/vhost-backend.c
> > > >     >     >  create mode 100644 include/hw/virtio/vhost-backend.h
> > > >     >     >  create mode 100644 include/net/vhost-user.h
> > > >     >     >  create mode 100644 net/vhost-user.c
> > > >     >     >
> > > >     >     > --
> > > >     >     > 1.8.3.2
> > > >     >     >
> > > >     >
> > > >     >
> > > >
> > > >
>
> > Vhost-user Protocol
> > ===================
> >
> > This protocol aims to complement the ioctl interface used to control the
> > vhost implementation in the Linux kernel. It implements the control plane
> > needed to establish virtqueue sharing with a user space process on the same
> > host. It uses communication over a Unix domain socket to share file
> > descriptors in the ancillary data of the message.
> >
> > The protocol defines 2 sides of the communication, master and slave. Master
> > is the application that shares its virtqueues, in our case QEMU. Slave is
> > the consumer of the virtqueues.
> >
> > In the current implementation QEMU is the Master, and the Slave is intended
> > to be a software ethernet switch running in user space, such as Snabbswitch.
> >
> > Master and slave can be either a client (i.e. connecting) or server
> > (listening) in the socket communication.
> >
> > Message Specification
> > ---------------------
> >
> > Note that all numbers are in the machine native byte order. A vhost-user
> > message consists of 3 header fields and a payload:
> >
> > ------------------------------------
> > | request | flags | size | payload |
> > ------------------------------------
> >
> >  * Request: 32-bit type of the request
> >  * Flags: 32-bit bit field:
> >    - Lower 2 bits are the version (currently 0x01)
> >    - Bit 2 is the reply flag - needs to be sent on each reply from the slave
> >  * Size: 32-bit size of the payload
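
As an illustration, the version and reply bits can be checked with masks
derived from the layout above (the macro and function names here are just
a sketch, not necessarily what the patches use):

    #include <stdint.h>

    #define VHOST_USER_VERSION_MASK  0x3          /* lower 2 bits: version */
    #define VHOST_USER_VERSION       0x1          /* currently 0x01 */
    #define VHOST_USER_REPLY_MASK    (0x1 << 2)   /* bit 2: reply flag */

    /* True if the flags describe a reply in the protocol version we speak. */
    static int vhost_user_flags_valid_reply(uint32_t flags)
    {
        return (flags & VHOST_USER_VERSION_MASK) == VHOST_USER_VERSION &&
               (flags & VHOST_USER_REPLY_MASK);
    }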
> >
> >
> > Depending on the request type, payload can be:
> >
> >  * A single 64-bit integer
> >    -------
> >    | u64 |
> >    -------
> >
> >    u64: a 64-bit unsigned integer
> >
> >  * A vring state description
> >    ---------------
> >    | index | num |
> >    ---------------
> >
> >    Index: a 32-bit index
> >    Num: a 32-bit number
> >
> >  * A vring address description
> >    --------------------------------------------------------------
> >    | index | flags | size | descriptor | used | available | log |
> >    --------------------------------------------------------------
> >
> >    Index: a 32-bit vring index
> >    Flags: a 32-bit vring flags
> >    Descriptor: a 64-bit user address of the vring descriptor table
> >    Used: a 64-bit user address of the vring used ring
> >    Available: a 64-bit user address of the vring available ring
> >    Log: a 64-bit guest address for logging
> >
> >  * Memory regions description
> >    ---------------------------------------------------
> >    | num regions | padding | region0 | ... | region7 |
> >    ---------------------------------------------------
> >
> >    Num regions: a 32-bit number of regions
> >    Padding: 32-bit
> >
> >    A region is:
> >    ---------------------------------------
> >    | guest address | size | user address |
> >    ---------------------------------------
> >
> >    Guest address: a 64-bit guest address of the region
> >    Size: a 64-bit size
> >    User address: a 64-bit user address
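
A C layout matching the memory regions description could look as follows
(the type and field names here are hypothetical; the actual definitions in
the patches may differ):

    #include <stdint.h>

    #define VHOST_USER_MEMORY_MAX_NREGIONS 8   /* region0 ... region7 above */

    typedef struct VhostUserMemoryRegion {
        uint64_t guest_phys_addr;   /* guest address of the region */
        uint64_t memory_size;       /* size of the region */
        uint64_t userspace_addr;    /* user address in the master (QEMU) */
    } VhostUserMemoryRegion;

    typedef struct VhostUserMemory {
        uint32_t nregions;          /* number of regions actually used */
        uint32_t padding;
        VhostUserMemoryRegion regions[VHOST_USER_MEMORY_MAX_NREGIONS];
    } VhostUserMemory;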
> >
> >
> > In QEMU the vhost-user message is implemented with the following struct:
> >
> > typedef struct VhostUserMsg {
> >     VhostUserRequest request;
> >     uint32_t flags;
> >     uint32_t size;
> >     union {
> >         uint64_t u64;
> >         struct vhost_vring_state state;
> >         struct vhost_vring_addr addr;
> >         VhostUserMemory memory;
> >     };
> > } QEMU_PACKED VhostUserMsg;
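
The request values follow the Ids listed under "Message types" below; an
enum consistent with them would look roughly like this (a sketch; the
value 0 for "no request" is an assumption):

    typedef enum VhostUserRequest {
        VHOST_USER_NONE = 0,            /* assumed "no request" value */
        VHOST_USER_ECHO = 1,
        VHOST_USER_GET_FEATURES = 2,
        VHOST_USER_SET_FEATURES = 3,
        VHOST_USER_SET_OWNER = 4,
        VHOST_USER_RESET_OWNER = 5,
        VHOST_USER_SET_MEM_TABLE = 6,
        VHOST_USER_SET_LOG_BASE = 7,
        VHOST_USER_SET_LOG_FD = 8,
        VHOST_USER_SET_VRING_NUM = 9,
        VHOST_USER_SET_VRING_ADDR = 10,
        VHOST_USER_SET_VRING_BASE = 11,
        VHOST_USER_GET_VRING_BASE = 12,
        VHOST_USER_SET_VRING_KICK = 13,
        VHOST_USER_SET_VRING_CALL = 14,
        VHOST_USER_SET_VRING_ERR = 15,
        VHOST_USER_MAX
    } VhostUserRequest;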
> >
> > Communication
> > -------------
> >
> > The protocol for vhost-user is based on the existing implementation of vhost
> > for the Linux Kernel. Most messages that can be sent via the Unix domain
> > socket implementing vhost-user have an equivalent ioctl to the kernel
> > implementation.
> >
> > The communication consists of master sending message requests and slave
> > sending message replies. Most of the requests don't require replies. Here
> > is a list of the ones that do:
> >
> >  * VHOST_USER_ECHO
> >  * VHOST_USER_GET_FEATURES
> >  * VHOST_USER_GET_VRING_BASE
> >
> > There are several messages that the master sends with file descriptors
> > passed in the ancillary data:
> >
> >  * VHOST_USER_SET_MEM_TABLE
> >  * VHOST_USER_SET_LOG_FD
> >  * VHOST_USER_SET_VRING_KICK
> >  * VHOST_USER_SET_VRING_CALL
> >  * VHOST_USER_SET_VRING_ERR
> >
> > If the Master is unable to send the full message or receives a wrong reply,
> > it will close the connection. An optional reconnection mechanism can be
> > implemented.
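
On the receiving side, the fds arrive in the same SCM_RIGHTS control
message shown in the sendmsg() sketch earlier; extracting them looks
roughly like this (a sketch only, error handling omitted, and the limit
of 8 fds is an assumption):

    #include <string.h>
    #include <sys/types.h>
    #include <sys/socket.h>
    #include <sys/uio.h>

    /* Sketch: receive a message and collect any fds from the ancillary data. */
    static ssize_t recv_with_fds(int sock, void *buf, size_t len,
                                 int *fds, size_t *fd_num)
    {
        struct msghdr msg = { 0 };
        struct iovec iov = { .iov_base = buf, .iov_len = len };
        char control[CMSG_SPACE(8 * sizeof(int))];
        struct cmsghdr *cmsg;
        ssize_t n;

        msg.msg_iov = &iov;
        msg.msg_iovlen = 1;
        msg.msg_control = control;
        msg.msg_controllen = sizeof(control);

        n = recvmsg(sock, &msg, 0);
        *fd_num = 0;
        for (cmsg = CMSG_FIRSTHDR(&msg); cmsg; cmsg = CMSG_NXTHDR(&msg, cmsg)) {
            if (cmsg->cmsg_level == SOL_SOCKET && cmsg->cmsg_type == SCM_RIGHTS) {
                *fd_num = (cmsg->cmsg_len - CMSG_LEN(0)) / sizeof(int);
                memcpy(fds, CMSG_DATA(cmsg), *fd_num * sizeof(int));
            }
        }
        return n;
    }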
> >
> > Message types
> > -------------
> >
> >  * VHOST_USER_ECHO
> >
> >       Id: 1
> >       Equivalent ioctl: N/A
> >       Master payload: N/A
> >
> >       ECHO request that is used to periodically probe the connection. When
> >       received by the slave, it is expected to send back an ECHO packet
> >       with the REPLY flag set.
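
On the slave side this can be as simple as echoing the header back with
the reply bit set (a sketch; VhostUserMsg and the flag macros are assumed
to match the definitions sketched earlier):

    #include <stddef.h>
    #include <unistd.h>

    /* Header is request + flags + size, i.e. everything before the payload. */
    #define VHOST_USER_HDR_SIZE offsetof(VhostUserMsg, u64)

    static void vhost_user_echo_reply(int sock, VhostUserMsg *msg)
    {
        msg->flags |= VHOST_USER_REPLY_MASK;   /* bit 2: mark as reply */
        msg->size = 0;                         /* ECHO carries no payload */
        write(sock, msg, VHOST_USER_HDR_SIZE);
    }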
> >
> >  * VHOST_USER_GET_FEATURES
> >
> >       Id: 2
> >       Equivalent ioctl: VHOST_GET_FEATURES
> >       Master payload: N/A
> >       Slave payload: u64
> >
> >       Get the features bitmask from the underlying vhost implementation.
> >
> >  * VHOST_USER_SET_FEATURES
> >
> >       Id: 3
> >       Equivalent ioctl: VHOST_SET_FEATURES
> >       Master payload: u64
> >
> >       Enable features in the underlying vhost implementation using a bitmask.
> >
> >  * VHOST_USER_SET_OWNER
> >
> >       Id: 4
> >       Equivalent ioctl: VHOST_SET_OWNER
> >       Master payload: N/A
> >
> >       Issued when a new connection is established. It sets the current
> >       Master as an owner of the session. This can be used on the Slave as a
> >       "session start" flag.
> >
> >  * VHOST_USER_RESET_OWNER
> >
> >       Id: 5
> >       Equivalent ioctl: VHOST_RESET_OWNER
> >       Master payload: N/A
> >
> >       Issued when a new connection is about to be closed. The Master will no
> >       longer own this connection (and will usually close it).
> >
> >  * VHOST_USER_SET_MEM_TABLE
> >
> >       Id: 6
> >       Equivalent ioctl: VHOST_SET_MEM_TABLE
> >       Master payload: memory regions description
> >
> >       Sets the memory map regions on the slave so it can translate the vring
> >       addresses. In the ancillary data there is an array of file descriptors
> >       for each memory mapped region. The size and ordering of the fds
> >       match the number and ordering of memory regions.
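
On the slave side, VHOST_USER_SET_MEM_TABLE typically results in one
mmap() per received fd; a minimal sketch (assuming the VhostUserMemory
layout sketched earlier and fds[] already extracted from the ancillary
data, ordered like the regions):

    #include <stdint.h>
    #include <sys/mman.h>

    /* Sketch: map every region shared by the master into the slave. */
    static int map_guest_regions(VhostUserMemory *mem, int *fds, void **mapped)
    {
        for (uint32_t i = 0; i < mem->nregions; i++) {
            mapped[i] = mmap(NULL, mem->regions[i].memory_size,
                             PROT_READ | PROT_WRITE, MAP_SHARED, fds[i], 0);
            if (mapped[i] == MAP_FAILED) {
                return -1;          /* caller should unmap what succeeded */
            }
        }
        return 0;
    }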
> >
> >  * VHOST_USER_SET_LOG_BASE
> >
> >       Id: 7
> >       Equivalent ioctl: VHOST_SET_LOG_BASE
> >       Master payload: u64
> >
> >       Sets the logging base address.
> >
> >  * VHOST_USER_SET_LOG_FD
> >
> >       Id: 8
> >       Equivalent ioctl: VHOST_SET_LOG_FD
> >       Master payload: N/A
> >
> >       Sets the logging file descriptor, which is passed as ancillary data.
> >
> >  * VHOST_USER_SET_VRING_NUM
> >
> >       Id: 9
> >       Equivalent ioctl: VHOST_SET_VRING_NUM
> >       Master payload: vring state description
> >
> >       Sets the size (number of descriptors) of the vring.
> >
> >  * VHOST_USER_SET_VRING_ADDR
> >
> >       Id: 10
> >       Equivalent ioctl: VHOST_SET_VRING_ADDR
> >       Master payload: vring address description
> >       Slave payload: N/A
> >
> >       Sets the addresses of the different aspects of the vring.
> >
> >  * VHOST_USER_SET_VRING_BASE
> >
> >       Id: 11
> >       Equivalent ioctl: VHOST_SET_VRING_BASE
> >       Master payload: vring state description
> >
> >       Sets the base offset in the available vring.
> >
> >  * VHOST_USER_GET_VRING_BASE
> >
> >       Id: 12
> >       Equivalent ioctl: VHOST_GET_VRING_BASE
> >       Master payload: vring state description
> >       Slave payload: vring state description
> >
> >       Get the vring base offset.
> >
> >  * VHOST_USER_SET_VRING_KICK
> >
> >       Id: 13
> >       Equivalent ioctl: VHOST_SET_VRING_KICK
> >       Master payload: N/A
> >
> >       Set the event file descriptor for adding buffers to the vring. It
> >       is passed in the ancillary data.
> >
> >  * VHOST_USER_SET_VRING_CALL
> >
> >       Id: 14
> >       Equivalent ioctl: VHOST_SET_VRING_CALL
> >       Master payload: N/A
> >
> >       Set the event file descriptor to signal when buffers are used. It
> >       is passed in the ancillary data.
> >
> >  * VHOST_USER_SET_VRING_ERR
> >
> >       Id: 15
> >       Equivalent ioctl: VHOST_SET_VRING_ERR
> >       Master payload: N/A
> >
> >       Set the event file descriptor to signal when an error occurs. It
> >       is passed in the ancillary data.
>


