qemu-devel
[RFC 4/9] hw/misc/memexpose: Add documentation


From: i . kotrasinsk
Subject: [RFC 4/9] hw/misc/memexpose: Add documentation
Date: Tue, 4 Feb 2020 12:30:46 +0100

From: Igor Kotrasinski <address@hidden>

Signed-off-by: Igor Kotrasinski <address@hidden>
---
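Note (not part of the patch): the receive procedure from the "Handling
interrupts" section of the spec below can be modeled with a small Python
sketch. This is a toy model, not QEMU code. The names ToyInterruptQueue,
deliver and drain are made up for illustration. It assumes that the 16-entry
queue sits on the receiving device and that a read of the RX address register
pops the next queued interrupt; the spec does not state either explicitly.

```python
from collections import deque

# Register offsets copied from the "Device registers" table in the spec.
REG_ENABLE = 0x000
REG_RX_ADDR = 0x400
REG_RX_TYPE = 0x408
REG_RX_DATA = 0x410
REG_TX_ADDR = 0x800
REG_TX_TYPE = 0x808
REG_TX_DATA = 0x810
DATA_SIZE = 128      # size of the interrupt data region, in bytes
QUEUE_DEPTH = 16     # interrupts held before extras are dropped


class ToyInterruptQueue:
    """Hypothetical model of the receive side of BAR0 (assumption, not QEMU)."""

    def __init__(self):
        self.pending = deque()
        # RX type/data are undefined until a read of RX address returns 1.
        self.rx_type = None
        self.rx_data = None

    def deliver(self, intr_type, data):
        """Peer sent an interrupt. Returns False if it was silently dropped."""
        if len(self.pending) >= QUEUE_DEPTH:
            return False
        padded = bytes(data)[:DATA_SIZE].ljust(DATA_SIZE, b"\0")
        self.pending.append((intr_type, padded))
        return True

    def read_rx_addr(self):
        """Model a guest read of the RX address register (offset 0x400)."""
        if not self.pending:
            return 0  # nothing queued, RX type/data contents are undefined
        self.rx_type, self.rx_data = self.pending.popleft()
        return 1      # RX type/data now hold one interrupt


def drain(dev):
    """Guest-side loop: keep reading RX address until it returns 0."""
    received = []
    while dev.read_rx_addr() == 1:
        received.append((dev.rx_type, dev.rx_data))
    return received
```

In this model a guest interrupt handler would call drain() once per interrupt
and process every returned (type, data) pair, since several interrupts may be
queued by the time the handler runs.
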
 docs/specs/memexpose-spec.txt | 168 ++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 168 insertions(+)
 create mode 100644 docs/specs/memexpose-spec.txt

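Note (not part of the patch): the message priority rule from the "Memexpose
peer protocol" section of the spec below can be sketched as follows. This is
an illustrative Python model only. Kind, priority and run_wait_loop are
hypothetical names, and encoding a priority as a (message kind, VM rank) tuple
where a smaller tuple means higher priority is an assumption made for the
example, not the wire format.

```python
from enum import IntEnum


class Kind(IntEnum):
    INVALIDATION = 0  # highest priority
    ACCESS = 1        # lowest priority


def priority(kind, sender_is_preferred):
    """Smaller tuple = higher priority. The VM rank breaks ties per kind."""
    return (int(kind), 0 if sender_is_preferred else 1)


def run_wait_loop(own, events):
    """Model a VM that sent a message with priority `own` and now reads the
    memory socket. `events` is a list whose items are either the string
    "response" (the reply to our own message) or a (name, priority) pair
    describing a request from the peer. Returns the order of handling."""
    log = []
    deferred = []
    for event in events:
        if event == "response":
            log.append("response")
            break
        name, prio = event
        if prio < own:
            log.append(name)       # higher priority: service immediately
        else:
            deferred.append(name)  # lower priority: wait for our response
    # Deferred requests are serviced after the loop ends, which in this
    # sketch also covers the case where the event list runs out.
    log.extend(deferred)
    return log


own = priority(Kind.ACCESS, sender_is_preferred=True)
order = run_wait_loop(own, [
    ("peer-invalidation", priority(Kind.INVALIDATION, False)),
    ("peer-access", priority(Kind.ACCESS, False)),
    "response",
])
print(order)
```

Here the peer's invalidation outranks the local access request and is serviced
at once, while the peer's access request (same kind, lower sub-priority) is
held until the local request has been answered.
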
diff --git a/docs/specs/memexpose-spec.txt b/docs/specs/memexpose-spec.txt
new file mode 100644
index 0000000..60ccea6
--- /dev/null
+++ b/docs/specs/memexpose-spec.txt
@@ -0,0 +1,168 @@
+= Specification for Inter-VM memory region sharing device =
+
+The inter-VM memory region sharing device (memexpose) is designed to allow two
+QEMU devices to share arbitrary physical memory regions between one another, as
+well as pass simple interrupts. It attempts to share memory regions directly
+when feasible, falling back to MMIO via socket communication when it's not.
+
+The device is modeled by QEMU as a PCI device, as well as a memory
+region/interrupt directly usable on platforms like ARM, with an entry in the
+device tree.
+
+An example use case for memexpose is forwarding ARM Trustzone functionality
+between two VMs running different architectures - one running a rich OS on an
+x86_64 VM, the other running the trusted OS on an ARM VM. In this scenario,
+sharing arbitrary memory regions allows such forwarding to work with minimal
+changes to the trusted OS.
+
+
+== Configuring the memexpose device ==
+
+The device uses two character devices to communicate with the other VM - one for
+synchronous memory accesses, another for passing interrupts. A typical
+configuration of the PCI device looks like this:
+
+        -chardev socket,...,path=/tmp/qemu-memexpose-mem,id="mem" \
+        -chardev socket,...,path=/tmp/qemu-memexpose-intr,id="intr" \
+        -device memexpose-pci,mem_chardev="mem",intr_chardev="intr",shm_size=0xN...
+
+On the arm-virt machine, the device can be enabled like this:
+
+        -chardev socket,...,path=/tmp/qemu-memexpose-mem,id="mem-mem" \
+        -chardev socket,...,path=/tmp/qemu-memexpose-intr,id="mem-intr" \
+        -machine memexpose-ep=mem,memexpose-size=0xN...
+
+Normally one of the VMs would have 'server,nowait' options set on these
+chardevs.
+
+At the moment the memory exposed to the other device always starts at 0
+(relative to system_memory). The shm_size/memexpose-size property indicates the
+size of the exposed region.
+
+The *_chardev/memexpose-ep properties are used to point the memexpose device to
+chardevs used to communicate with the other VM.
+
+
+== Memexpose PCI device interface ==
+
+The device has vendor ID 1af4, device ID 1111, revision 0.
+
+=== PCI BARs ===
+
+The device has two BARs:
+- BAR0 holds device registers and interrupt data (0x1000 byte MMIO),
+- BAR1 maps memory from the other VM.
+
+To use the device, you must first enable it by writing 1 to BAR0 at address 0.
+This makes QEMU wait for another VM to connect. Once that is done, you can
+access the other machine's memory via BAR1.
+
+Interrupts can be sent and received by configuring the device for interrupts and
+reading and writing to registers in BAR0.
+
+=== Device registers ===
+
+BAR0 has the following registers:
+
+    Offset  Size  Access      On reset  Function
+        0     8   read/write        0   Enable/disable device
+                                        bit 0: device enabled / disabled
+                                        bit 1..63: reserved
+    0x400     8   read/write        0   Interrupt RX address
+                                        bit 0: interrupt read
+                                        bit 1..63: reserved
+    0x408     8   read-only        UD   RX Interrupt type
+    0x410   128   read-only        UD   RX Interrupt data
+    0x800     8   read/write        0   Interrupt TX address
+    0x808     8   write-only      N/A   TX Interrupt type
+    0x810   128   write-only      N/A   TX Interrupt data
+
+All other addresses are reserved.
+
+=== Handling interrupts ===
+
+To send interrupts, write to TX interrupt address. Contents of TX interrupt type
+and data regions will be sent along with the interrupt. The device holds an
+internal queue of 16 interrupts; any extra interrupts are silently dropped.
+
+To receive interrupts, read the interrupt RX address. If the value is 1, then
+RX interrupt type and data registers contain the data / type sent by the other
+VM. Otherwise (the value is 0), no more interrupts are queued and RX interrupt
+type/data register contents are undefined.
+
+
+=== Platform device protocol ===
+
+The other memexpose device type (provided on e.g. ARM via device tree) is
+essentially identical to the PCI device. It provides two memory ranges that work
+exactly like the PCI BAR regions and an interrupt for signaling an interrupt
+from the other VM.
+
+== Memexpose peer protocol ==
+
+This section describes the current memexpose protocol. It is a WIP and likely to
+change.
+
+A connection between two VMs connected via memexpose happens on two sockets - an
+interrupt socket and a memory socket. All communication on the former is
+asynchronous, while communication on the latter is synchronous.
+
+When the device is enabled, QEMU waits for memexpose's chardevs to connect. No
+messages are exchanged upon connection. After devices are connected, the
+following messages can be exchanged:
+
+1. Interrupt message, via interrupt socket. This message contains interrupt type
+   and data.
+
+2. Memory access request message, via memory socket. It contains a target
+   address, access size and value to write in case of writes.
+
+3. Memory access return message. This contains an access result (as
+   MemTxResult) and a value in case of reads. If the accessed region can be
+   shared directly, then this region's start, size and shmem file descriptor are
+   also sent.
+
+4. Memory invalidation message. This is sent when a VM's memory region changes
+   status and contains such region's start and size. The other VM is expected to
+   drop any shared regions overlapping with it.
+
+5. Memory invalidation response. This is sent in response to a memory
+   invalidation message; after receiving this, the remote VM is guaranteed to
+   have scheduled region invalidation before accessing the region again.
+
+QEMU performs memory accesses synchronously, memory invalidation should happen
+before returning to the guest OS, and both VMs might try to perform a remote
+memory access at the same time. For these reasons, all messages passed via the
+memory socket have an associated priority.
+
+At any time, only one message with a given priority is in flight. After sending
+a message, the VM reads messages on the memory socket, servicing all messages
+with a priority higher than its own. Once it receives a message with a priority
+lower than its own, it waits for a response to its own message before servicing
+it. This guarantees no deadlocks, assuming that messages don't trigger further
+messages. Message priorities, from highest to lowest, are as follows:
+
+1. Memory invalidation message/response.
+2. Memory access message/response.
+
+Additionally, one of the VMs is assigned a sub-priority higher than another, so
+that its messages of the same type have priority over the other VM's messages.
+
+Memory access messages have the lowest priority in order to guarantee that QEMU
+will not attempt to access memory while in the middle of a memory region
+listener.
+
+=== Memexpose memory sharing ===
+
+This section describes the memexpose memory sharing mechanism.
+
+Memory sharing is implemented lazily: initially, no memory regions are shared
+between devices. When a memory access is performed via a socket, the remote VM
+checks whether the underlying memory range is backed by shareable memory. If it
+is, the VM finds out the maximum contiguous flat range backed by this region and
+sends its file descriptor to the local VM, where it is mapped as a subregion.
+
+The memexpose device registers memory listeners for the memory region it's
+using. Whenever a flat range for this region (that is not this device's
+subregion) changes, that range is sent to the other VM and any directly shared
+memory region intersecting this range is scheduled for removal via a BH.
-- 
2.7.4



