
Re: [Qemu-devel] [PATCH 08/38] ivshmem: Rewrite specification document


From: Marc-André Lureau
Subject: Re: [Qemu-devel] [PATCH 08/38] ivshmem: Rewrite specification document
Date: Tue, 1 Mar 2016 12:25:22 +0100

On Mon, Feb 29, 2016 at 7:40 PM, Markus Armbruster <address@hidden> wrote:
> This started as an attempt to update ivshmem_device_spec.txt for
> clarity, accuracy and completeness while working on its code, and
> quickly became a full rewrite.  Since the diff would be useless
> anyway, I'm using the opportunity to rename the file to
> ivshmem-spec.txt.
>
> I tried hard to ensure the new text contradicts neither the old text
> nor the code.  If the new text contradicts the old text but not the
> code, it's probably a bug in the old text.  If the new text
> contradicts both, it's probably a bug in the new text.
>
> Signed-off-by: Markus Armbruster <address@hidden>

Reviewed-by: Marc-André Lureau <address@hidden>


> ---
>  docs/specs/ivshmem-spec.txt        | 244 +++++++++++++++++++++++++++++++++++++
>  docs/specs/ivshmem_device_spec.txt | 161 ------------------------
>  2 files changed, 244 insertions(+), 161 deletions(-)
>  create mode 100644 docs/specs/ivshmem-spec.txt
>  delete mode 100644 docs/specs/ivshmem_device_spec.txt
>
> diff --git a/docs/specs/ivshmem-spec.txt b/docs/specs/ivshmem-spec.txt
> new file mode 100644
> index 0000000..0835ba1
> --- /dev/null
> +++ b/docs/specs/ivshmem-spec.txt
> @@ -0,0 +1,244 @@
> += Device Specification for Inter-VM shared memory device =
> +
> +The Inter-VM shared memory device (ivshmem) is designed to share a
> +memory region between multiple QEMU processes running different guests
> +and the host.  In order for all guests to be able to pick up the
> +shared memory area, it is modeled by QEMU as a PCI device exposing
> +said memory to the guest as a PCI BAR.
> +
> +The device can use a shared memory object on the host directly, or it
> +can obtain one from an ivshmem server.
> +
> +In the latter case, the device can additionally interrupt its peers, and
> +get interrupted by its peers.
> +
> +
> +== Configuring the ivshmem PCI device ==
> +
> +There are two basic configurations:
> +
> +- Just shared memory: -device ivshmem,shm=NAME,...
> +
> +  This uses shared memory object NAME.
> +
> +- Shared memory plus interrupts: -device ivshmem,chardev=CHR,vectors=N,...
> +
> +  An ivshmem server must already be running on the host.  The device
> +  connects to the server's UNIX domain socket via character device
> +  CHR.
> +
> +  Each peer gets assigned a unique ID by the server.  IDs must be
> +  between 0 and 65535.
> +
> +  Interrupts are message-signaled by default (MSI-X).  With msi=off
> +  the device has no MSI-X capability, and uses legacy INTx instead.
> +  vectors=N configures the number of vectors to use.
> +
> +For more details on ivshmem device properties, see The QEMU Emulator
> +User Documentation (qemu-doc.*).
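
For readers who script their QEMU invocations, the two configurations above could be assembled roughly like this. It is only a sketch: the helper names are hypothetical, and defining CHR with "-chardev socket,path=...,id=..." is my assumption about how the character device gets created. The "-device ivshmem,..." properties are the ones listed in the text.

```python
def shm_only_args(shm_name):
    """Arguments for the 'just shared memory' configuration."""
    return ["-device", f"ivshmem,shm={shm_name}"]


def interrupt_args(socket_path, chardev_id, vectors, msi=True):
    """Arguments for the 'shared memory plus interrupts' configuration.

    socket_path is the ivshmem server's UNIX domain socket (assumed to
    be already listening); chardev_id is an arbitrary id for CHR.
    """
    device = f"ivshmem,chardev={chardev_id},vectors={vectors}"
    if not msi:
        # msi=off: no MSI-X capability, legacy INTx is used instead
        device += ",msi=off"
    return [
        "-chardev", f"socket,path={socket_path},id={chardev_id}",
        "-device", device,
    ]
```

The returned lists are meant to be appended to the rest of the QEMU argument vector; the socket path and ids are whatever the caller chooses.
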
> +
> +
> +== The ivshmem PCI device's guest interface ==
> +
> +The device has vendor ID 1af4, device ID 1110, revision 0.
> +
> +=== PCI BARs ===
> +
> +The ivshmem PCI device has two or three BARs:
> +
> +- BAR0 holds device registers (256 Byte MMIO)
> +- BAR1 holds MSI-X table and PBA (only when using MSI-X)
> +- BAR2 maps the shared memory object
> +
> +There are two ways to use this device:
> +
> +- If you only need the shared memory part, BAR2 suffices.  This way,
> +  you have access to the shared memory in the guest and can use it as
> +  you see fit.  Memnic, for example, uses ivshmem this way from guest
> +  user space (see http://dpdk.org/browse/memnic).
> +
> +- If you additionally need the capability for peers to interrupt each
> +  other, you need BAR0 and, if using MSI-X, BAR1.  You will most
> +  likely want to write a kernel driver to handle interrupts.  Requires
> +  the device to be configured for interrupts, obviously.
> +
> +If the device is configured for interrupts, BAR2 is initially invalid.
> +It becomes safely accessible only after the ivshmem server provided
> +the shared memory.  Guest software should wait for the IVPosition
> +register (described below) to become non-negative before accessing
> +BAR2.
> +
> +The device cannot tell guest software whether it is configured for
> +interrupts.
> +
> +=== PCI device registers ===
> +
> +BAR 0 contains the following registers:
> +
> +    Offset  Size  Access      On reset  Function
> +        0     4   read/write        0   Interrupt Mask
> +                                        bit 0: peer interrupt
> +                                        bit 1..31: reserved
> +        4     4   read/write        0   Interrupt Status
> +                                        bit 0: peer interrupt
> +                                        bit 1..31: reserved
> +        8     4   read-only   0 or -1   IVPosition
> +       12     4   write-only      N/A   Doorbell
> +                                        bit 0..15: vector
> +                                        bit 16..31: peer ID
> +       16   240   none            N/A   reserved
> +
> +Software should only access the registers as specified in column
> +"Access".  Reserved bits should be ignored on read, and preserved on
> +write.
> +
> +Interrupt Status and Mask Register together control the legacy INTx
> +interrupt when the device has no MSI-X capability: INTx is asserted
> +when the bit-wise AND of Status and Mask is non-zero.  Interrupt
> +Status Register bit 0 becomes 1
> +when an interrupt request from a peer is received.  Reading the
> +register clears it.
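
To check my reading of the INTx rules, here is a toy model of the two registers. It is a sketch under the assumptions stated in the paragraph above (no MSI-X capability, only bit 0 defined, read of Status clears it); the class and method names are made up for illustration and are not QEMU code.

```python
PEER_INTERRUPT = 1 << 0  # bit 0 in both Mask and Status


class IntxModel:
    """Toy model of Interrupt Mask / Interrupt Status without MSI-X."""

    def __init__(self):
        self.mask = 0    # value on reset
        self.status = 0  # value on reset

    def peer_interrupt(self):
        """An interrupt request from a peer arrives."""
        self.status |= PEER_INTERRUPT

    def read_status(self):
        """Guest reads Interrupt Status; the read clears the register."""
        value = self.status
        self.status = 0
        return value

    def intx_asserted(self):
        """INTx is asserted while (Status AND Mask) is non-zero."""
        return (self.status & self.mask) != 0
```

With the mask left at its reset value of 0, a peer interrupt sets Status but does not assert INTx; setting mask bit 0 makes it visible, and reading Status deasserts it again.
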
> +
> +IVPosition Register: if the device is not configured for interrupts,
> +this is zero.  Else, it's -1 for a short while after reset, then
> +changes to the device's ID (between 0 and 65535).
> +
> +There is no good way for software to find out whether the device is
> +configured for interrupts.  A positive IVPosition means interrupts,
> +but zero could be either.  The initial -1 cannot be reliably observed.
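
A small sketch of how guest software might interpret a raw 32-bit read of IVPosition, following the three cases above. The function names and the returned strings are hypothetical; the only facts used are that the register is 4 bytes wide and that -1, 0 and positive values have the meanings described.

```python
def ivposition_signed(raw):
    """Reinterpret a raw 32-bit register read as a signed integer."""
    raw &= 0xFFFFFFFF
    return raw - (1 << 32) if raw & 0x80000000 else raw


def classify_ivposition(raw):
    """Return what a single IVPosition read can tell the guest."""
    value = ivposition_signed(raw)
    if value < 0:
        return "not ready"    # still waiting for the server
    if value > 0:
        return "interrupts"   # device ID assigned by a server
    return "unknown"          # ID 0, or not configured for interrupts
```
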
> +
> +Doorbell Register: writing this register requests to interrupt a peer.
> +The written value's high 16 bits are the ID of the peer to interrupt,
> +and its low 16 bits select an interrupt vector.
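
The Doorbell encoding is simple enough to spell out. A minimal sketch, with hypothetical helper names, of composing and splitting the 32-bit value exactly as the register table describes (bits 16..31 peer ID, bits 0..15 vector):

```python
def doorbell_value(peer_id, vector):
    """Compose the 32-bit value to write to the Doorbell register."""
    if not 0 <= peer_id <= 0xFFFF:
        raise ValueError("peer ID must fit in 16 bits")
    if not 0 <= vector <= 0xFFFF:
        raise ValueError("vector must fit in 16 bits")
    return (peer_id << 16) | vector


def split_doorbell(value):
    """Split a Doorbell value into (peer ID, vector)."""
    return (value >> 16) & 0xFFFF, value & 0xFFFF
```

For example, interrupting peer 3 on vector 1 means writing 0x00030001.
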
> +
> +If the device is not configured for interrupts, the write is ignored.
> +
> +If interrupt setup hasn't completed, the write is ignored.  The
> +device cannot tell guest software whether setup is complete.
> +Interrupts can regress to this state on migration.
> +
> +If the peer with the requested ID isn't connected, or it has fewer
> +interrupt vectors connected, the write is ignored.  The device cannot
> +tell guest software what peers are connected, or how many interrupt
> +vectors are connected.
> +
> +If the peer doesn't use MSI-X, its Interrupt Status register is set to
> +1.  This asserts INTx unless masked by the Interrupt Mask register.
> +In this case the device cannot communicate the interrupt vector to
> +guest software.
> +
> +If the peer uses MSI-X, the interrupt for this vector becomes pending.
> +There is no way for software to clear the pending bit, and a polling
> +mode of operation is therefore impossible with MSI-X.
> +
> +With multiple MSI-X vectors, different vectors can be used to indicate
> +different events have occurred.  The semantics of interrupt vectors
> +are left to the application.
> +
> +
> +== Interrupt infrastructure ==
> +
> +When configured for interrupts, the peers share eventfd objects in
> +addition to shared memory.  The shared resources are managed by an
> +ivshmem server.
> +
> +=== The ivshmem server ===
> +
> +The server listens on a UNIX domain socket.
> +
> +For each new client that connects to the server, the server
> +- picks an ID,
> +- creates eventfd file descriptors for the interrupt vectors,
> +- sends the ID and the file descriptor for the shared memory to the
> +  new client,
> +- sends connect notifications for the new client to the other clients
> +  (these contain file descriptors for sending interrupts),
> +- sends connect notifications for the other clients to the new client,
> +  and
> +- sends interrupt setup messages to the new client (these contain file
> +  descriptors for receiving interrupts).
> +
> +When a client disconnects from the server, the server sends disconnect
> +notifications to the other clients.
> +
> +The next section describes the protocol in detail.
> +
> +If the server terminates without sending disconnect notifications for
> +its connected clients, the clients can elect to continue.  They can
> +communicate with each other normally, but won't receive disconnect
> +notification on disconnect, and no new clients can connect.  There is
> +no way for the clients to connect to a restarted server.  The device
> +cannot tell guest software whether the server is still up.
> +
> +Example server code is in contrib/ivshmem-server/.  Not to be used in
> +production.  It assumes all clients use the same number of interrupt
> +vectors.
> +
> +A standalone client is in contrib/ivshmem-client/.  It can be useful
> +for debugging.
> +
> +=== The ivshmem Client-Server Protocol ===
> +
> +An ivshmem device configured for interrupts connects to an ivshmem
> +server.  This section details the protocol between the two.
> +
> +The connection is one-way: the server sends messages to the client.
> +Each message consists of a single 8 byte little-endian signed number,
> +and may be accompanied by a file descriptor via SCM_RIGHTS.  Both
> +client and server close the connection on error.
> +
> +On connect, the server sends the following messages in order:
> +
> +1. The protocol version number, currently zero.  The client should
> +   close the connection on receipt of versions it can't handle.
> +
> +2. The client's ID.  This is unique among all clients of this server.
> +   IDs must be between 0 and 65535, because the Doorbell register
> +   provides only 16 bits for them.
> +
> +3. The number -1, accompanied by the file descriptor for the shared
> +   memory.
> +
> +4. Connect notifications for existing other clients, if any.  This is
> +   a peer ID (number between 0 and 65535 other than the client's ID),
> +   repeated N times.  Each repetition is accompanied by one file
> +   descriptor.  These are for interrupting the peer with that ID using
> +   vector 0,..,N-1, in order.  If the client is configured for fewer
> +   vectors, it closes the extra file descriptors.  If it is configured
> +   for more, the extra vectors remain unconnected.
> +
> +5. Interrupt setup.  This is the client's own ID, repeated N times.
> +   Each repetition is accompanied by one file descriptor.  These are
> +   for receiving interrupts from peers using vector 0,..,N-1, in
> +   order.  If the client is configured for fewer vectors, it closes
> +   the extra file descriptors.  If it is configured for more, the
> +   extra vectors remain unconnected.
> +
> +From then on, the server sends these kinds of messages:
> +
> +6. Connection / disconnection notification.  This is a peer ID.
> +
> +  - If the number comes with a file descriptor, it's a connection
> +    notification, exactly like in step 4.
> +
> +  - Else, it's a disconnection notification for the peer with that ID.
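
To convince myself the message sequence is unambiguous, I wrote down how a client could sort the stream. This is a sketch, not the QEMU implementation: messages are modelled as (number, fd) pairs with fd set to None when no descriptor came along, the function name is made up, and real descriptor handling (closing extra descriptors, vector limits) is omitted.

```python
def process_stream(messages):
    """Sort a server message stream into the client's view of the world.

    Returns (own_id, shm_fd, own_vectors, peers), where own_vectors is
    the list of descriptors for receiving interrupts and peers maps a
    peer ID to the descriptors for interrupting it, indexed by vector.
    """
    stream = iter(messages)

    version, _ = next(stream)
    if version != 0:
        raise ValueError("unsupported protocol version")

    own_id, _ = next(stream)
    if not 0 <= own_id <= 65535:
        raise ValueError("ID out of range")

    marker, shm_fd = next(stream)
    if marker != -1 or shm_fd is None:
        raise ValueError("expected -1 with the shared memory descriptor")

    own_vectors = []
    peers = {}
    for number, fd in stream:
        if fd is None:
            peers.pop(number, None)          # disconnect notification
        elif number == own_id:
            own_vectors.append(fd)           # interrupt setup
        else:
            peers.setdefault(number, []).append(fd)  # connect notification
    return own_id, shm_fd, own_vectors, peers
```

The only thing distinguishing interrupt setup from a connect notification is whether the number equals the client's own ID, and the only thing distinguishing connect from disconnect is the presence of a descriptor.
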
> +
> +Known bugs:
> +
> +* The protocol changed incompatibly in QEMU 2.5.  Before, messages
> +  were native endian long, and there was no version number.
> +
> +* The protocol is poorly designed.
> +
> +=== The ivshmem Client-Client Protocol ===
> +
> +An ivshmem device configured for interrupts receives eventfd file
> +descriptors for interrupting peers and getting interrupted by peers
> +from the server, as explained in the previous section.
> +
> +To interrupt a peer, the device writes the 8-byte integer 1 in native
> +byte order to the respective file descriptor.
> +
> +To receive an interrupt, the device reads and discards as many 8-byte
> +integers as it can.
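
The client-client side boils down to two tiny operations. A sketch with hypothetical function names: "=Q" packs an unsigned 64-bit integer in native byte order, and the receive side assumes the descriptor is non-blocking. Any descriptor works for trying this out; the real thing uses the eventfds handed out by the server.

```python
import os
import struct

SIGNAL = struct.pack("=Q", 1)  # the 8-byte integer 1, native byte order


def interrupt_peer(fd):
    """Signal a peer by writing the 8-byte integer 1 to its descriptor."""
    os.write(fd, SIGNAL)


def drain(fd):
    """Read and discard 8-byte integers until fd would block.

    Returns True if at least one integer was read, i.e. an interrupt
    was pending.  fd is assumed to be in non-blocking mode.
    """
    pending = False
    while True:
        try:
            data = os.read(fd, 8)
        except BlockingIOError:
            return pending
        if len(data) < 8:
            return pending
        pending = True
```
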
> diff --git a/docs/specs/ivshmem_device_spec.txt b/docs/specs/ivshmem_device_spec.txt
> deleted file mode 100644
> index d318d65..0000000
> --- a/docs/specs/ivshmem_device_spec.txt
> +++ /dev/null
> @@ -1,161 +0,0 @@
> -
> -Device Specification for Inter-VM shared memory device
> -------------------------------------------------------
> -
> -The Inter-VM shared memory device is designed to share a memory region (created
> -on the host via the POSIX shared memory API) between multiple QEMU processes
> -running different guests. In order for all guests to be able to pick up the
> -shared memory area, it is modeled by QEMU as a PCI device exposing said memory
> -to the guest as a PCI BAR.
> -The memory region does not belong to any guest, but is a POSIX memory object on
> -the host. The host can access this shared memory if needed.
> -
> -The device also provides an optional communication mechanism between guests
> -sharing the same memory object. More details about that in the section 'Guest to
> -guest communication' section.
> -
> -
> -The Inter-VM PCI device
> ------------------------
> -
> -From the VM point of view, the ivshmem PCI device supports three BARs.
> -
> -- BAR0 is a 1 Kbyte MMIO region to support registers and interrupts when MSI is
> -  not used.
> -- BAR1 is used for MSI-X when it is enabled in the device.
> -- BAR2 is used to access the shared memory object.
> -
> -It is your choice how to use the device but you must choose between two
> -behaviors :
> -
> -- basically, if you only need the shared memory part, you will map BAR2.
> -  This way, you have access to the shared memory in guest and can use it as you
> -  see fit (memnic, for example, uses it in userland
> -  http://dpdk.org/browse/memnic).
> -
> -- BAR0 and BAR1 are used to implement an optional communication mechanism
> -  through interrupts in the guests. If you need an event mechanism between the
> -  guests accessing the shared memory, you will most likely want to write a
> -  kernel driver that will handle interrupts. See details in the section 'Guest
> -  to guest communication' section.
> -
> -The behavior is chosen when starting your QEMU processes:
> -- no communication mechanism needed, the first QEMU to start creates the shared
> -  memory on the host, subsequent QEMU processes will use it.
> -
> -- communication mechanism needed, an ivshmem server must be started before any
> -  QEMU processes, then each QEMU process connects to the server unix socket.
> -
> -For more details on the QEMU ivshmem parameters, see qemu-doc documentation.
> -
> -
> -Guest to guest communication
> -----------------------------
> -
> -This section details the communication mechanism between the guests accessing
> -the ivhsmem shared memory.
> -
> -*ivshmem server*
> -
> -This server code is available in qemu.git/contrib/ivshmem-server.
> -
> -The server must be started on the host before any guest.
> -It creates a shared memory object then waits for clients to connect on a unix
> -socket. All the messages are little-endian int64_t integer.
> -
> -For each client (QEMU process) that connects to the server:
> -- the server sends a protocol version, if client does not support it, the client
> -  closes the communication,
> -- the server assigns an ID for this client and sends this ID to him as the first
> -  message,
> -- the server sends a fd to the shared memory object to this client,
> -- the server creates a new set of host eventfds associated to the new client and
> -  sends this set to all already connected clients,
> -- finally, the server sends all the eventfds sets for all clients to the new
> -  client.
> -
> -The server signals all clients when one of them disconnects.
> -
> -The client IDs are limited to 16 bits because of the current implementation (see
> -Doorbell register in 'PCI device registers' subsection). Hence only 65536
> -clients are supported.
> -
> -All the file descriptors (fd to the shared memory, eventfds for each client)
> -are passed to clients using SCM_RIGHTS over the server unix socket.
> -
> -Apart from the current ivshmem implementation in QEMU, an ivshmem client has
> -been provided in qemu.git/contrib/ivshmem-client for debug.
> -
> -*QEMU as an ivshmem client*
> -
> -At initialisation, when creating the ivshmem device, QEMU first receives a
> -protocol version and closes communication with server if it does not match.
> -Then, QEMU gets its ID from the server then makes it available through BAR0
> -IVPosition register for the VM to use (see 'PCI device registers' subsection).
> -QEMU then uses the fd to the shared memory to map it to BAR2.
> -eventfds for all other clients received from the server are stored to implement
> -BAR0 Doorbell register (see 'PCI device registers' subsection).
> -Finally, eventfds assigned to this QEMU process are used to send interrupts in
> -this VM.
> -
> -*PCI device registers*
> -
> -From the VM point of view, the ivshmem PCI device supports 4 registers of
> -32-bits each.
> -
> -enum ivshmem_registers {
> -    IntrMask = 0,
> -    IntrStatus = 4,
> -    IVPosition = 8,
> -    Doorbell = 12
> -};
> -
> -The first two registers are the interrupt mask and status registers.  Mask and
> -status are only used with pin-based interrupts.  They are unused with MSI
> -interrupts.
> -
> -Status Register: The status register is set to 1 when an interrupt occurs.
> -
> -Mask Register: The mask register is bitwise ANDed with the interrupt status
> -and the result will raise an interrupt if it is non-zero.  However, since 1 is
> -the only value the status will be set to, it is only the first bit of the mask
> -that has any effect.  Therefore interrupts can be masked by setting the first
> -bit to 0 and unmasked by setting the first bit to 1.
> -
> -IVPosition Register: The IVPosition register is read-only and reports the
> -guest's ID number.  The guest IDs are non-negative integers.  When using the
> -server, since the server is a separate process, the VM ID will only be set when
> -the device is ready (shared memory is received from the server and accessible
> -via the device).  If the device is not ready, the IVPosition will return -1.
> -Applications should ensure that they have a valid VM ID before accessing the
> -shared memory.
> -
> -Doorbell Register:  To interrupt another guest, a guest must write to the
> -Doorbell register.  The doorbell register is 32-bits, logically divided into
> -two 16-bit fields.  The high 16-bits are the guest ID to interrupt and the low
> -16-bits are the interrupt vector to trigger.  The semantics of the value
> -written to the doorbell depends on whether the device is using MSI or a regular
> -pin-based interrupt.  In short, MSI uses vectors while regular interrupts set
> -the status register.
> -
> -Regular Interrupts
> -
> -If regular interrupts are used (due to either a guest not supporting MSI or the
> -user specifying not to use them on startup) then the value written to the lower
> -16-bits of the Doorbell register results is arbitrary and will trigger an
> -interrupt in the destination guest.
> -
> -Message Signalled Interrupts
> -
> -An ivshmem device may support multiple MSI vectors.  If so, the lower 16-bits
> -written to the Doorbell register must be between 0 and the maximum number of
> -vectors the guest supports.  The lower 16 bits written to the doorbell is the
> -MSI vector that will be raised in the destination guest.  The number of MSI
> -vectors is configurable but it is set when the VM is started.
> -
> -The important thing to remember with MSI is that it is only a signal, no status
> -is set (since MSI interrupts are not shared).  All information other than the
> -interrupt itself should be communicated via the shared memory region.  Devices
> -supporting multiple MSI vectors can use different vectors to indicate different
> -events have occurred.  The semantics of interrupt vectors are left to the
> -user's discretion.
> --
> 2.4.3
>
>



-- 
Marc-André Lureau


