

From: Michael S. Tsirkin
Subject: Re: [Qemu-devel] [RFC PATCH RDMA support v3: 03/10] documentation of RDMA protocol in docs/rdma.txt
Date: Mon, 11 Mar 2013 19:05:15 +0200

On Mon, Mar 11, 2013 at 12:24:53PM -0400, Michael R. Hines wrote:
> Excellent questions: answers inline.........
> 
> On 03/11/2013 07:51 AM, Michael S. Tsirkin wrote:
> >+RDMA-based live migration protocol
> >+==================================
> >+
> >+We use two kinds of RDMA messages:
> >+
> >+1. RDMA WRITES (to the receiver)
> >+2. RDMA SEND (for non-live state, like devices and CPU)
> >Something's missing here.
> >Don't you need to know remote addresses before doing RDMA writes?
> 
> Yes, it looks like I need to do some more "teaching" about infiniband / RDMA
> inside the documentation.
> 
> I was trying not to make it too long, but it seems I over-estimated
> the ubiquity of RDMA and I'll have to include some background information
> about the programming model and memory model used by RDMA.

Well, that's exactly the question. As far as I remember the
RDMA memory model, you need to know a key and an address to
execute RDMA writes. Remote memory also needs to be locked,
so you need some mechanism to lock chunks of memory,
do the RDMA write, and unlock when done.
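
For readers less familiar with the verbs API, here is a minimal sketch of
that model (not the patch's actual code; function names are illustrative):
the destination registers (pins) a chunk of guest RAM and tells the source
its address and rkey out of band, and the source then posts an RDMA WRITE
against that address/rkey.

#include <stddef.h>
#include <stdint.h>
#include <infiniband/verbs.h>

/* Destination side: pin a RAM chunk and obtain the rkey the peer needs. */
struct ibv_mr *expose_chunk(struct ibv_pd *pd, void *chunk, size_t len)
{
    return ibv_reg_mr(pd, chunk, len,
                      IBV_ACCESS_LOCAL_WRITE | IBV_ACCESS_REMOTE_WRITE);
}

/* Source side: write a local buffer into the remote chunk it was told about. */
int write_chunk(struct ibv_qp *qp, struct ibv_mr *local_mr,
                void *local_buf, uint32_t len,
                uint64_t remote_addr, uint32_t rkey)
{
    struct ibv_sge sge = {
        .addr   = (uintptr_t)local_buf,
        .length = len,
        .lkey   = local_mr->lkey,
    };
    struct ibv_send_wr wr = {
        .opcode     = IBV_WR_RDMA_WRITE,
        .sg_list    = &sge,
        .num_sge    = 1,
        .send_flags = IBV_SEND_SIGNALED,  /* completion raised on the sender's CQ */
    };
    struct ibv_send_wr *bad_wr;

    wr.wr.rdma.remote_addr = remote_addr; /* learned from the destination */
    wr.wr.rdma.rkey        = rkey;        /* learned from the destination */
    return ibv_post_send(qp, &wr, &bad_wr);
}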

> >>+
> >>+First, migration-rdma.c does the initial connection establishment
> >>+using the URI 'rdma:host:port' on the QMP command line.
> >>+
> >>+Second, the normal live migration process kicks in for 'pc.ram'.
> >>+
> >>+During the iterative phase of the migration, only RDMA WRITE messages
> >>+are used. Messages are grouped into "chunks" which get pinned by
> >>+the hardware in 64-page increments. Each chunk is acknowledged in
> >>+the Queue Pair's completion queue (not the individual pages).
> >>+
> >>+During iteration of RAM, there are no messages sent, just RDMA writes.
> >>+During the last iteration, once the devices and CPU are ready to be
> >>+sent, we begin to use the RDMA SEND messages.
> >It's unclear whether you are switching modes here, if yes
> >assuming CPU/device state is only sent during
> >the last iteration would break post-migration so
> >is probably not a good choice for a protocol.
> 
> I made a bad choice of words ...... I'll correct the documentation.
> 
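
For reference, the per-chunk acknowledgement described in the quoted text
(one completion on the sender's queue pair per 64-page chunk, not per page)
would look roughly like the sketch below on the sender side; this is
illustrative, not the patch's code.

#include <infiniband/verbs.h>

/* Sender side: block until the signaled RDMA WRITE for one chunk completes. */
static int wait_for_chunk_completion(struct ibv_cq *cq)
{
    struct ibv_wc wc;
    int n;

    do {
        n = ibv_poll_cq(cq, 1, &wc);   /* poll for one work completion */
    } while (n == 0);

    if (n < 0 || wc.status != IBV_WC_SUCCESS) {
        return -1;                     /* the chunk's WRITE failed */
    }
    return 0;                          /* whole 64-page chunk acknowledged */
}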
> >
> >>+Due to the asynchronous nature of RDMA, the receiver of the migration
> >>+must post Receive work requests in the queue *before* a SEND work request
> >>+can be posted.
> >>+
> >>+To achieve this, both sides perform an initial 'barrier' synchronization.
> >>+Before the barrier, we already know that both sides have a receive work
> >>+request posted,
> >How?
> 
> While I was coding last night, I was able to eliminate this barrier.
> 
> >>+and then both sides exchange and block on the completion
> >>+queue waiting for each other to know the other peer is alive and ready
> >>+to send the rest of the live migration state (qemu_send/recv_barrier()).
> >How much?
> 
> The remaining migration state is typically < 100K (usually more like 17-32K)
> 
> Most of this gets sent during qemu_savevm_state_complete() during
> the last iteration.
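
As background to the "receives must be posted before sends" rule quoted
above, here is a hedged sketch of pre-posting a receive work request for an
incoming SEND with the verbs API (buffer sizing and names are illustrative):

#include <stdint.h>
#include <infiniband/verbs.h>

/* Receiver side: pre-post a receive WR so an incoming SEND has a buffer to land in. */
int post_recv_for_send(struct ibv_qp *qp, struct ibv_mr *mr,
                       void *buf, uint32_t len)
{
    struct ibv_sge sge = {
        .addr   = (uintptr_t)buf,
        .length = len,          /* must cover the largest SEND we expect */
        .lkey   = mr->lkey,
    };
    struct ibv_recv_wr wr = {
        .sg_list = &sge,
        .num_sge = 1,
    };
    struct ibv_recv_wr *bad_wr;

    return ibv_post_recv(qp, &wr, &bad_wr);
}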
> 
> >>+At this point, the use of QEMUFile between both sides for communication
> >>+proceeds as normal.
> >>+The difference between TCP and SEND comes in migration-rdma.c: since
> >>+we cannot simply dump the bytes into a socket, a SEND message must be
> >>+preceded by one side instructing the other side *exactly* how many
> >>+bytes the SEND message will contain.
> >instructing how? Presumably you use some protocol for this?
> 
> Yes, I'll be more verbose. Sorry about that =)
> 
> (Basically, the length of the SEND is stored inside the SEND message itself.)
> 
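
One way that could look on the wire, purely as an illustration (the actual
field layout in migration-rdma.c may differ):

#include <stdint.h>

struct rdma_send_header {
    uint32_t len;    /* number of payload bytes that follow in this SEND */
    uint32_t type;   /* e.g. migration stream data vs. an 'ack' control message */
};
/* 'len' payload bytes follow the header inside the same SEND message */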
> >>+Each time a SEND is received, the receiver buffers the message and
> >>+divvies out the bytes from the SEND to the qemu_loadvm_state() function
> >>+until all the bytes from the buffered SEND message have been exhausted.
> >>+
> >>+Before the SEND is exhausted, the receiver sends an 'ack' SEND back
> >>+to the sender to let the savevm_state_* functions know that they
> >>+can resume and start generating more SEND messages.
> >The above two paragraphs seem very opaque to me.
> >what's an 'ack' SEND, how do you know whether SEND
> >is exhausted?
> 
> More verbosity needed here too =). Exhaustion is detected because
> the SEND bytes are copied into a buffer; whenever QEMUFile functions
> request more bytes, we check how many bytes from the last SEND message
> (which was copied locally) are still available to hand back to them.
> 
> If there are no bytes left in the buffer, we block and wait for
> another SEND message.

You need some way to make sure there's a buffer available
for that SEND message though.
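
A rough sketch of the bookkeeping described above (all names are
illustrative, not the actual migration-rdma.c identifiers); note the comment
about keeping a receive buffer posted, which is the point raised here:

#include <stddef.h>
#include <stdint.h>
#include <string.h>

struct recv_buffer {
    uint8_t *data;   /* copy of the last SEND's payload */
    size_t   len;    /* bytes copied from the last SEND */
    size_t   pos;    /* bytes already handed back to QEMUFile */
};

/* Hand up to 'want' bytes back to the QEMUFile layer, blocking for the
 * next SEND when the local copy is exhausted. */
static size_t take_bytes(struct recv_buffer *rb, uint8_t *out, size_t want,
                         int (*wait_for_next_send)(struct recv_buffer *rb))
{
    size_t avail = rb->len - rb->pos;

    if (avail == 0) {
        /* Buffer exhausted: a receive WR (with a big enough buffer) must
         * already be posted before we can block for the peer's next SEND. */
        if (wait_for_next_send(rb) < 0) {
            return 0;
        }
        avail = rb->len - rb->pos;
    }
    if (want > avail) {
        want = avail;
    }
    memcpy(out, rb->data + rb->pos, want);
    rb->pos += want;
    return want;
}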

> >>+This ping-pong of SEND messages
> >BTW, if by ping-pong you mean something like this:
> >     source "I have X bytes"
> >         destination "ok send me X bytes"
> >     source sends X bytes
> >then you could put the address in the destination response and
> >use RDMA for sending X bytes.
> >It's up to you but it might simplify the protocol as
> >the only thing you send would be buffer management messages.
> No, you can't do that because RDMA writes do not produce
> completion queue (CQ) notifications on the receiver side.
> 
> Thus, there's no way for the receiver to know data was received.
> 
> You still need a regular SEND message to handle it.
> 
> >>+happens until the live migration completes.
> >Any way to tear down the connection in case of errors?
> 
> Yes, I'll add all these questions to the update documentation ASAP.
> 
> 
> >>+
> >>+USAGE
> >>+===============================
> >>+
> >>+Compiling:
> >>+
> >>+$ ./configure --enable-rdma --target-list=x86_64-softmmu
> >>+
> >>+$ make
> >>+
> >>+Command-line on the Source machine AND Destination:
> >>+
> >>+$ virsh qemu-monitor-command --hmp --cmd "migrate_set_speed 40g"  # or whatever the MAX of your RDMA device is
> >>+
> >>+Finally, perform the actual migration:
> >>+
> >>+$ virsh migrate domain rdma:xx.xx.xx.xx:port
> >>+
> >>+PERFORMANCE
> >>+===================
> >>+
> >>+Using a 40gbps Infiniband link, performing a worst-case stress test:
> >>+
> >>+1. Average worst-case RDMA throughput with $ stress --vm-bytes 1024M --vm 1 --vm-keep:
> >>+   approximately 30 gbps (a little better than the paper)
> >>+2. Average worst-case TCP throughput with the same stress test:
> >>+   approximately 8 gbps (using IPoIB, IP over Infiniband)
> >>+
> >>+Average downtime (stop time) ranges between 28 and 33 milliseconds.
> >>+
> >>+An *exhaustive* paper (2010) with additional performance details is
> >>+linked on the QEMU wiki:
> >>+
> >>+http://wiki.qemu.org/Features/RDMALiveMigration
> >>-- 
> >>1.7.10.4


