

From: Michael R. Hines
Subject: Re: [Qemu-devel] [RFC PATCH RDMA support v3: 03/10] documentation of RDMA protocol in docs/rdma.txt
Date: Mon, 11 Mar 2013 12:24:53 -0400
User-agent: Mozilla/5.0 (X11; Linux i686; rv:17.0) Gecko/20130106 Thunderbird/17.0.2

Excellent questions: answers inline.........

On 03/11/2013 07:51 AM, Michael S. Tsirkin wrote:
+RDMA-based live migration protocol
+==================================
+
+We use two kinds of RDMA messages:
+
+1. RDMA WRITES (to the receiver)
+2. RDMA SEND (for non-live state, like devices and CPU)
Something's missing here.
Don't you need to know remote addresses before doing RDMA writes?

Yes, it looks like I need to do some more "teaching" about Infiniband / RDMA
inside the documentation.

I was trying not to make it too long, but it seems I over-estimated
the ubiquity of RDMA and I'll have to include some background information
about the programming model and memory model used by RDMA.
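
For readers without that background, here is a rough sketch (my illustration,
not the patch's actual code) of the step the question points at: before any
RDMA WRITE can land, the receiver registers (pins) the RAM block with the HCA
and ships the resulting address and rkey to the sender over the control
channel. It uses the standard libibverbs API; the struct and function names
are hypothetical.

    #include <infiniband/verbs.h>
    #include <stddef.h>
    #include <stdint.h>

    struct remote_ram_block {
        uint64_t remote_addr;   /* receiver-side virtual address */
        uint32_t rkey;          /* key the sender places in its WRITE WRs */
    };

    /* Pin a RAM region and export the (address, rkey) pair the sender
     * needs for RDMA WRITEs.  Error handling omitted for brevity. */
    static struct ibv_mr *register_ram_block(struct ibv_pd *pd,
                                             void *host_addr, size_t len,
                                             struct remote_ram_block *out)
    {
        struct ibv_mr *mr = ibv_reg_mr(pd, host_addr, len,
                                       IBV_ACCESS_LOCAL_WRITE |
                                       IBV_ACCESS_REMOTE_WRITE);
        if (!mr) {
            return NULL;
        }
        out->remote_addr = (uint64_t)(uintptr_t)host_addr;
        out->rkey        = mr->rkey;
        /* 'out' is then sent to the migration source (e.g. in a SEND
         * message) so it can fill wr.rdma.remote_addr / wr.rdma.rkey. */
        return mr;
    }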

+
+First, migration-rdma.c does the initial connection establishment
+using the URI 'rdma:host:port' on the QMP command line.
+
+Second, the normal live migration process kicks in for 'pc.ram'.
+
+During the iterative phase of the migration, only RDMA WRITE messages
+are used. Messages are grouped into "chunks" which get pinned by
+the hardware in 64-page increments. Each chunk is acknowledged in
+the Queue Pair's completion queue (not the individual pages).
+
+During iteration of RAM, there are no messages sent, just RDMA writes.
+During the last iteration, once the device and CPU state is ready to be
+sent, we begin to use RDMA SEND messages.
It's unclear whether you are switching modes here; if yes,
assuming CPU/device state is only sent during the last iteration
would break post-migration, so it is probably not a good choice
for a protocol.

I made a bad choice of words ...... I'll correct the documentation.
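
To make the chunk mechanism quoted above a bit more concrete, here is a
simplified, illustrative sketch of how one 64-page chunk could be pushed with
a single RDMA WRITE work request, yielding one completion on the sender's
queue rather than one per page. The constant and function names are mine,
not migration-rdma.c's.

    #include <infiniband/verbs.h>
    #include <stdint.h>

    #define PAGES_PER_CHUNK 64   /* matches the 64-page pinning increment above */

    static int write_one_chunk(struct ibv_qp *qp, struct ibv_mr *local_mr,
                               void *local_addr, uint64_t remote_addr,
                               uint32_t rkey, size_t page_size)
    {
        struct ibv_sge sge = {
            .addr   = (uint64_t)(uintptr_t)local_addr,
            .length = PAGES_PER_CHUNK * page_size,
            .lkey   = local_mr->lkey,
        };
        struct ibv_send_wr wr = {
            .wr_id      = (uint64_t)(uintptr_t)local_addr, /* identifies the chunk in the CQ */
            .sg_list    = &sge,
            .num_sge    = 1,
            .opcode     = IBV_WR_RDMA_WRITE,
            .send_flags = IBV_SEND_SIGNALED,  /* one completion per chunk, not per page */
        };
        struct ibv_send_wr *bad_wr;

        wr.wr.rdma.remote_addr = remote_addr; /* from the receiver's registration */
        wr.wr.rdma.rkey        = rkey;
        return ibv_post_send(qp, &wr, &bad_wr);
    }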


+Due to the asynchronous nature of RDMA, the receiver of the migration
+must post Receive work requests in the queue *before* a SEND work request
+can be posted.
+
+To achieve this, both sides perform an initial 'barrier' synchronization.
+Before the barrier, we already know that both sides have a receive work
+request posted,
How?

While I was coding last night, I was able to eliminate this barrier.

+and then both sides exchange and block on the completion
+queue waiting for each other to know the other peer is alive and ready
+to send the rest of the live migration state (qemu_send/recv_barrier()).
How much?

The remaining migration state is typically < 100K (usually more like 17-32K)

Most of this gets sent during qemu_savevm_state_complete() during the last iteration.
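
To illustrate the pre-posting requirement described a few lines up (a Receive
work request must already be in the receiver's queue before the peer posts
the matching SEND), here is a minimal sketch using the standard libibverbs
call; the helper name and buffer handling are hypothetical.

    #include <infiniband/verbs.h>
    #include <stdint.h>

    /* Post one Receive WR for the next incoming SEND.  If this has not been
     * done before the peer posts its SEND, the peer gets a receiver-not-ready
     * NAK, which is what the barrier was guarding against. */
    static int post_control_recv(struct ibv_qp *qp, struct ibv_mr *mr,
                                 void *buf, uint32_t len)
    {
        struct ibv_sge sge = {
            .addr   = (uint64_t)(uintptr_t)buf,
            .length = len,
            .lkey   = mr->lkey,
        };
        struct ibv_recv_wr wr = {
            .wr_id   = (uint64_t)(uintptr_t)buf,
            .sg_list = &sge,
            .num_sge = 1,
        };
        struct ibv_recv_wr *bad_wr;

        return ibv_post_recv(qp, &wr, &bad_wr);
    }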

+At this point, the use of QEMUFile between both sides for communication
+proceeds as normal.
+The difference between TCP and SEND comes in migration-rdma.c: since
+we cannot simply dump the bytes into a socket, a SEND message must
+instead be preceded by one side instructing the other side *exactly* how
+many bytes the SEND message will contain.
instructing how? Presumably you use some protocol for this?

Yes, I'll be more verbose. Sorry about that =)

(Basically, the length of the SEND is stored inside the SEND message itself.)
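
As a hypothetical illustration of that point (the field names below are mine,
not the patch's), the wire format only needs a small header in front of the
payload:

    #include <stdint.h>

    /* Every control-channel SEND carries its own length, so the receiver
     * knows exactly how many QEMUFile bytes follow in this message. */
    struct rdma_control_msg {
        uint32_t len;       /* number of payload bytes in this SEND */
        uint32_t type;      /* e.g. vm state, 'ack', ... */
        uint8_t  data[];    /* 'len' bytes of the QEMUFile stream */
    };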

+Each time a SEND is received, the receiver buffers the message and
+divvies out the bytes from the SEND to the qemu_loadvm_state() function
+until all the bytes from the buffered SEND message have been exhausted.
+
+Before the SEND is exhausted, the receiver sends an 'ack' SEND back
+to the sender to let the savevm_state_* functions know that they
+can resume and start generating more SEND messages.
The above two paragraphs seem very opaque to me.
what's an 'ack' SEND, how do you know whether SEND
is exhausted?

More verbosity needed here too =). Exhaustion is detected because
the SEND bytes are copied into a local buffer; whenever the QEMUFile
functions request more bytes, we check how many bytes remain from
the last SEND message (already copied locally) and hand them back.

If there are no bytes left in the buffer, we block and wait for another SEND message.
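
A sketch of that buffering scheme, under the assumption of a simple fixed-size
staging buffer (the names and the wait_for_next_send() helper are stand-ins
for the real migration-rdma.c internals):

    #include <stdint.h>
    #include <string.h>
    #include <sys/types.h>

    struct recv_buffer {
        uint8_t data[65536];   /* local copy of the last SEND's payload */
        size_t  len;           /* bytes that SEND contained */
        size_t  pos;           /* bytes already handed to QEMUFile */
    };

    /* Assumed helper: blocks on the completion queue until the next SEND
     * arrives and refills rb->data / rb->len. */
    int wait_for_next_send(struct recv_buffer *rb);

    /* Serves QEMUFile read requests on the destination side. */
    static ssize_t rdma_get_buffer(struct recv_buffer *rb,
                                   uint8_t *dst, size_t want)
    {
        if (rb->pos == rb->len) {
            /* Last SEND exhausted: block until another one arrives. */
            if (wait_for_next_send(rb) < 0) {
                return -1;
            }
            rb->pos = 0;
        }
        size_t avail = rb->len - rb->pos;
        size_t n = want < avail ? want : avail;
        memcpy(dst, rb->data + rb->pos, n);
        rb->pos += n;
        return n;
    }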

+This ping-pong of SEND messages
BTW, if by ping-pong you mean something like this:
        source "I have X bytes"
        destination "ok send me X bytes"
        source sends X bytes
then you could put the address in the destination response and
use RDMA for sending X bytes.
It's up to you but it might simplify the protocol as
the only thing you send would be buffer management messages.
No, you can't do that because RDMA writes do not produce
completion queue (CQ) notifications on the receiver side.

Thus, there's no way for the receiver to know data was received.

You still need a regular SEND message to handle it.

+happens until the live migration completes.
Any way to tear down the connection in case of errors?

Yes, I'll add all these questions to the updated documentation ASAP.


+
+USAGE
+===============================
+
+Compiling:
+
+$ ./configure --enable-rdma --target-list=x86_64-softmmu
+
+$ make
+
+Command-line on the Source machine AND Destination:
+
+$ virsh qemu-monitor-command --hmp --cmd "migrate_set_speed 40g" # or whatever is the MAX of your RDMA device
+
+Finally, perform the actual migration:
+
+$ virsh migrate domain rdma:xx.xx.xx.xx:port
+
+PERFORMANCE
+===================
+
+Using a 40gbps Infiniband link, performing a worst-case stress test:
+
+Average worst-case throughput:
+
+1. RDMA throughput with $ stress --vm-bytes 1024M --vm 1 --vm-keep:
+   approximately 30 gbps (a little better than the paper)
+2. TCP throughput with $ stress --vm-bytes 1024M --vm 1 --vm-keep:
+   approximately 8 gbps (using IPoIB, IP over Infiniband)
+
+Average downtime (stop time) ranges between 28 and 33 milliseconds.
+
+An *exhaustive* paper (2010) shows additional performance details
+linked on the QEMU wiki:
+
+http://wiki.qemu.org/Features/RDMALiveMigration