Re: [Qemu-devel] [PATCH V2 0/5] hw/pvrdma: PVRDMA device implementation


From: Marcel Apfelbaum
Subject: Re: [Qemu-devel] [PATCH V2 0/5] hw/pvrdma: PVRDMA device implementation
Date: Wed, 20 Dec 2017 17:07:38 +0200
User-agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.12; rv:52.0) Gecko/20100101 Thunderbird/52.4.0

On 19/12/2017 20:05, Michael S. Tsirkin wrote:
On Sun, Dec 17, 2017 at 02:54:52PM +0200, Marcel Apfelbaum wrote:
RFC -> V2:
  - Full implementation of the pvrdma device
  - Backend is an ibdevice interface, no need for the KDBR module

General description
===================
PVRDMA is the QEMU implementation of VMware's paravirtualized RDMA device.
It works with its Linux kernel driver as is; no special guest
modifications are needed.

While it complies with the VMware device, it can also communicate with bare-metal
RDMA-enabled machines, and it does not require an RDMA HCA in the host: it
can work with Soft-RoCE (rxe).

It does not require the whole guest RAM to be pinned


Hi Michael,

What happens if the guest attempts to register all its memory?


Then we lose; it is no different from bare metal, reg_mr will pin all the RAM.
However, this is only one scenario, and hopefully not a common one
for RoCE. (I know IPoIB does that, but it doesn't make sense to use it with
RoCE.)
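
To make the pinning point concrete, below is a minimal libibverbs sketch
(device choice, buffer size and access flags are only for the example):
once ibv_reg_mr() returns, the registered range stays pinned until it is
deregistered, with pvrdma exactly as on bare metal.

/* Illustrative only: registering a buffer pins its pages for DMA.
 * A guest that registers all of its RAM therefore forces the host
 * to pin the whole guest memory. */
#include <stdio.h>
#include <stdlib.h>
#include <infiniband/verbs.h>

int main(void)
{
    int num;
    struct ibv_device **devs = ibv_get_device_list(&num);
    if (!devs || num == 0) {
        fprintf(stderr, "no RDMA device found (try Soft-RoCE/rxe)\n");
        return 1;
    }

    struct ibv_context *ctx = ibv_open_device(devs[0]);
    struct ibv_pd *pd = ctx ? ibv_alloc_pd(ctx) : NULL;
    if (!pd) {
        return 1;
    }

    size_t len = 16 * 1024 * 1024;      /* 16MB, just for the example */
    void *buf = malloc(len);

    /* The pages backing [buf, buf + len) are pinned from here on. */
    struct ibv_mr *mr = ibv_reg_mr(pd, buf, len,
                                   IBV_ACCESS_LOCAL_WRITE |
                                   IBV_ACCESS_REMOTE_READ |
                                   IBV_ACCESS_REMOTE_WRITE);
    if (!mr) {
        perror("ibv_reg_mr");
        return 1;
    }

    ibv_dereg_mr(mr);                   /* ...and unpinned here */
    ibv_dealloc_pd(pd);
    ibv_close_device(ctx);
    ibv_free_device_list(devs);
    free(buf);
    return 0;
}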

allowing memory over-commit
and, even though not implemented yet, migration support will be
possible with some HW assistance.

What does "HW assistance" mean here?

Several things:
1. We need to be able to pass resource numbers when we create
them on the destination machine.
2. We also need a way to stall previous connections while starting the new ones.
3. Last, we need the HW to pass resource states.

Can it work with any existing hardware?


Sadly no; however, we talked with Mellanox at last year's
Plumbers Conference and all of the above is in their plans.
We hope this submission will help, since now we will have
a fast way to test and use it.

For the Soft-RoCE backend it is doable, but it is best to wait first to
see how the HCAs are going to expose the changes.


  Design
  ======
  - Follows the behavior of VMware's pvrdma device, however it is not tightly
    coupled with it,

Everything seems to be in pvrdma. Since it's not coupled, could you
split the code into pvrdma-specific and generic parts?

and most of the code can be reused if we decide to
    continue with a Virtio-based RDMA device.

I suspect that without virtio we won't be able to do any future
extensions.


While I do agree it is harder to work with a 3rd-party spec, their
Linux driver is open source and we may be able to make sane
modifications.

  - It exposes 3 BARs:
     BAR 0 - MSIX, utilizes 3 vectors for command ring, async events and
             completions
     BAR 1 - Configuration registers

[...]
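
For readers not familiar with the QEMU side, the sketch below shows roughly
how such a BAR layout is expressed with the generic PCI/MSI-X/MemoryRegion
APIs. It is only an illustration; the type and function names (PVRDMADev,
pvrdma_regs_ops, ...) and the region sizes are made up for the example and
are not the code of this series.

/* Illustrative sketch, not the series code: one MSI-X BAR with 3 vectors
 * and one register BAR, following the layout described above.
 * Error handling is omitted for brevity. */
#include "qemu/osdep.h"
#include "hw/pci/pci.h"
#include "hw/pci/msix.h"

typedef struct PVRDMADev {
    PCIDevice parent_obj;
    MemoryRegion regs;                  /* BAR 1 - configuration registers */
} PVRDMADev;

static uint64_t pvrdma_regs_read(void *opaque, hwaddr addr, unsigned size)
{
    return 0;                           /* register decoding would go here */
}

static void pvrdma_regs_write(void *opaque, hwaddr addr, uint64_t val,
                              unsigned size)
{
    /* command/configuration register handling would go here */
}

static const MemoryRegionOps pvrdma_regs_ops = {
    .read = pvrdma_regs_read,
    .write = pvrdma_regs_write,
    .endianness = DEVICE_LITTLE_ENDIAN,
};

static void pvrdma_realize(PCIDevice *pdev, Error **errp)
{
    PVRDMADev *dev = (PVRDMADev *)pdev; /* QOM cast macro omitted */

    /* BAR 0: MSI-X, 3 vectors (command ring, async events, completions) */
    msix_init_exclusive_bar(pdev, 3, 0 /* bar_nr */, errp);

    /* BAR 1: configuration registers */
    memory_region_init_io(&dev->regs, OBJECT(dev), &pvrdma_regs_ops, dev,
                          "pvrdma-regs", 4096);
    pci_register_bar(pdev, 1, PCI_BASE_ADDRESS_SPACE_MEMORY, &dev->regs);
}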

The pvrdma backend is an ibdevice interface that can be exposed
either by a Soft-RoCE (rxe) device on machines with no RDMA device,
or by an HCA SRIOV function (VF/PF).
Note that ibdevice interfaces can't be shared between pvrdma devices;
each one requires a separate instance (rxe or SRIOV VF).
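
As a concrete picture of what "the backend is an ibdevice interface" means,
here is a minimal libibverbs sketch that looks up a host ibdevice by name
and opens it; the name "rxe0" is only an example, and as noted above each
pvrdma device needs its own ibdevice instance.

/* Illustrative only: resolve a host ibdevice (Soft-RoCE rxe or SRIOV VF)
 * by name and open it as the backend. */
#include <stdio.h>
#include <string.h>
#include <infiniband/verbs.h>

static struct ibv_context *open_backend(const char *name)
{
    int i, num;
    struct ibv_device **devs = ibv_get_device_list(&num);
    struct ibv_context *ctx = NULL;

    if (!devs) {
        return NULL;
    }
    for (i = 0; i < num; i++) {
        if (!strcmp(ibv_get_device_name(devs[i]), name)) {
            ctx = ibv_open_device(devs[i]);
            break;
        }
    }
    ibv_free_device_list(devs);
    return ctx;
}

int main(void)
{
    struct ibv_context *ctx = open_backend("rxe0"); /* example name */

    printf("backend %s\n", ctx ? "opened" : "not found");
    if (ctx) {
        ibv_close_device(ctx);
    }
    return ctx ? 0 : 1;
}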

So what's the advantage of this over pass-through then?


1. We can also work with the same ibdevice for multiple pvrdma
devices by using multiple GIDs; it works (tested); see the GID sketch
after this list.
The problem begins when we think about migration: the way
HCAs work today is one resource namespace per ibdevice,
not per GID. I emphasize that this can be changed, however
we don't have a timeline for it.

2. We do have advantages:
- Guest-agnostic device (we can change the host HCA)
- Memory overcommit (unless the guest registers all the memory)
- Future migration support
- A friendly migration path for RDMA VMware guests to QEMU.

3. In cases where live migration is not a must, we can
   use multiple GIDs of the same port, so we do not
   depend on SRIOV.

4. We support the Soft-RoCE backend, so people can test their
   software in a guest without RDMA HW.
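
Regarding the multiple-GIDs point in 1 and 3, the sketch below walks the
GID table of a port with plain libibverbs calls; the port number and the
indexes are illustrative, the idea being that each pvrdma device sharing
the ibdevice is bound to its own GID index.

/* Illustrative only: list the GIDs of port 1; with a shared ibdevice,
 * each pvrdma device would use a different GID index. */
#include <stdio.h>
#include <infiniband/verbs.h>

int main(void)
{
    int num;
    struct ibv_device **devs = ibv_get_device_list(&num);
    if (!devs || num == 0) {
        return 1;
    }

    struct ibv_context *ctx = ibv_open_device(devs[0]);
    struct ibv_port_attr port;

    if (!ctx || ibv_query_port(ctx, 1, &port)) {
        return 1;
    }

    for (int idx = 0; idx < port.gid_tbl_len; idx++) {
        union ibv_gid gid;

        if (ibv_query_gid(ctx, 1, idx, &gid)) {
            continue;
        }
        printf("gid[%d]: %02x%02x...%02x%02x\n", idx,
               gid.raw[0], gid.raw[1], gid.raw[14], gid.raw[15]);
    }

    ibv_close_device(ctx);
    ibv_free_device_list(devs);
    return 0;
}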


Thanks,
Marcel



Tests and performance
=====================
Tested with the Soft-RoCE backend (rxe) and with Mellanox ConnectX3
and Mellanox ConnectX4 HCAs, in the following setups:
   - VMs in the same host
   - VMs in different hosts
   - VMs to bare metal.

The best performance was achieved with the ConnectX HCAs and buffer sizes
bigger than 1MB, reaching the line rate of ~50Gb/s.
The conclusion is that with the PVRDMA device there are no
actual performance penalties compared to bare metal for big enough
buffers (which is quite common when using RDMA), while allowing
memory overcommit.

Marcel Apfelbaum (3):
   mem: add share parameter to memory-backend-ram
   docs: add pvrdma device documentation.
   MAINTAINERS: add entry for hw/net/pvrdma

Yuval Shaia (2):
   pci/shpc: Move function to generic header file
   pvrdma: initial implementation


[...]


