qemu-devel

From: Michael S. Tsirkin
Subject: Re: [Qemu-devel] [RFC PATCH RDMA support v5: 03/12] comprehensive protocol documentation
Date: Wed, 10 Apr 2013 16:34:48 +0300

On Wed, Apr 10, 2013 at 09:04:44AM -0400, Michael R. Hines wrote:
> On 04/10/2013 01:27 AM, Michael S. Tsirkin wrote:
> >Below is a great high level overview. The protocol looks correct.
> >A bit more detail would be helpful, as noted below.
> >
> >The main thing I'd like to see changed is that there are already
> >two protocols here: chunk-based and non chunk based.
> >We'll need to use versioning and capabilities going forward but in the
> >first version we don't need to maintain compatibility with legacy so
> >two versions seems like unnecessary pain.  Chunk based is somewhat slower and
> >that is worth fixing longer term, but seems like the way forward. So
> >let's implement a single chunk-based protocol in the first version we
> >merge.
> >
> >Some more minor improvement suggestions below.
> Thanks.
> 
> However, IMHO restricting the policy to only use chunk-based is really
> not an acceptable choice:
> 
> Here's the reason: Using my 10gbps RDMA hardware, throughput takes a
> dive from 10gbps to 6gbps.

Who cares about the throughput really? What we do care about
is how long the whole process takes.



> But if I disable chunk-based registration altogether (forgoing
> overcommit), then performance comes back.
> 
> The reason for this is the additional control channel traffic
> needed to ask the server to register
> memory pages on demand - without this traffic, we can easily
> saturate the link.
> But with this traffic, the user needs to know (and be given the
> option) to disable the feature
> in case they want performance instead of flexibility.
> 

IMO that's just because the current control protocol is so inefficient.
You just need to pipeline the registration: request the next chunk
while remote side is handling the previous one(s).

With any protocol, you still need to:
        register all memory
        send addresses and keys to source
        get notification that write is done
What is different with chunk based?
Simply that there are several network roundtrips
before the process can start.
So part of the time you are not doing writes,
you are waiting for the next control message.

So you should be doing several in parallel.
This will complicate the protocol though, so I am not asking
for this right away.
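
To make the pipelining argument concrete, here is a rough
back-of-the-envelope model in Python. It is only a sketch: the function
name, the time units and the assumption that writes go out strictly in
order are all made up for illustration and are not taken from the patch.
depth=1 models a sender that waits for each chunk before requesting the
next registration.

```python
def total_time(n_chunks, rtt, write_time, depth):
    """Time until the last chunk's RDMA write completes.

    Hypothetical model, not QEMU code:
      rtt        - time for one registration request/response
      write_time - time to write one chunk on the link
      depth      - registration requests allowed in flight
    """
    write_end = []  # completion time of each chunk's write
    for i in range(n_chunks):
        # A request slot frees up when the write for chunk i-depth completes.
        request_at = write_end[i - depth] if i >= depth else 0
        registered_at = request_at + rtt
        # Assumption: writes go out in order, one at a time.
        prev_end = write_end[-1] if write_end else 0
        start = max(registered_at, prev_end)
        write_end.append(start + write_time)
    return write_end[-1] if write_end else 0


if __name__ == "__main__":
    for depth in (1, 2, 4):
        print(depth, total_time(n_chunks=4, rtt=2, write_time=1, depth=depth))
```

With 4 chunks, rtt=2 and write_time=1 this model gives 12 time units
for depth 1, 7 for depth 2 and 6 for depth 4: once enough requests are
in flight, only the first round trip is left on the critical path.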

But a broken pin-it-all alternative will just confuse matters.  It is
best to keep it out of tree.


> >On Mon, Apr 08, 2013 at 11:04:32PM -0400, address@hidden wrote:
> >>From: "Michael R. Hines" <address@hidden>
> >>
> >>Both the protocol and interfaces are elaborated in more detail,
> >>including the new use of dynamic chunk registration, versioning,
> >>and capabilities negotiation.
> >>
> >>Signed-off-by: Michael R. Hines <address@hidden>
> >>---
> >>  docs/rdma.txt |  313 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> >>  1 file changed, 313 insertions(+)
> >>  create mode 100644 docs/rdma.txt
> >>
> >>diff --git a/docs/rdma.txt b/docs/rdma.txt
> >>new file mode 100644
> >>index 0000000..e9fa4cd
> >>--- /dev/null
> >>+++ b/docs/rdma.txt
> >>@@ -0,0 +1,313 @@
> >>+Several changes since v4:
> >>+
> >>+- Created a "formal" protocol for the RDMA control channel
> >>+- Dynamic, chunked page registration now implemented on *both* the server and client
> >>+- Created new 'capability' for page registration
> >>+- Created new 'capability' for is_zero_page() - enabled by default
> >>+  (needed to test dynamic page registration)
> >>+- Created version-check before protocol begins at connection-time
> >>+- no more migrate_use_rdma() !
> >>+
> >>+NOTE: While dynamic registration works on both sides now,
> >>+      it does *not* work with cgroups swap limits. This functionality with infiniband
> >>+      remains broken. (It works fine with TCP). So, in order to take full
> >>+      advantage of this feature, a fix will have to be developed on the kernel side.
> >>+      Alternative proposed is to use /proc/<pid>/pagemap. Patch will be submitted.
> >You mean the idea of using pagemap to detect shared pages created by KSM
> >and/or zero pages? That would be helpful for TCP migration, thanks!
> 
> Yes, absolutely. This would *also* help the above registration problem.
> 
> We could use this to *pre-register* pages in advance, but that would be
> an entirely different patch series (which I'm willing to write and submit).
> 
> >>+
> >BTW the above comments belong outside both document and commit log,
> >after --- before diff.
> Acknowledged.
> 
> >>+Contents:
> >>+=================================
> >>+* Compiling
> >>+* Running (please read before running)
> >>+* RDMA Protocol Description
> >>+* Versioning
> >>+* QEMUFileRDMA Interface
> >>+* Migration of pc.ram
> >>+* Error handling
> >>+* TODO
> >>+* Performance
> >>+
> >>+COMPILING:
> >>+===============================
> >>+
> >>+$ ./configure --enable-rdma --target-list=x86_64-softmmu
> >>+$ make
> >>+
> >>+RUNNING:
> >>+===============================
> >>+
> >>+First, decide if you want dynamic page registration on the server-side.
> >>+This always happens on the primary-VM side, but is optional on the server.
> >>+Doing this allows you to support overcommit (such as cgroups or ballooning)
> >>+with a smaller footprint on the server-side without having to register the
> >>+entire VM memory footprint.
> >>+NOTE: This significantly slows down performance (about 30% slower).
> >Where does the overhead come from? It appears from the description that
> >you have exactly same amount of data to exchange using send messages,
> >either way?
> >Or are you using bigger chunks with upfront registration?
> 
> Answer is above.
> 
> Upfront registration registers the entire VM before migration starts
> whereas dynamic registration (on both sides) registers chunks in
> 1 MB increments as they are requested by the migration_thread.
> 
> The extra send messages required to request the server to register
> the memory means that the RDMA must block until those messages
> complete before the RDMA can begin.

So make the protocol smarter and fix this. This is not something
management needs to know about.


If you like, you can teach management to specify the max amount of
memory pinned. It should be specified at the appropriate place:
on the remote for remote, on source for source.

> >>+
> >>+$ virsh qemu-monitor-command --hmp \
> >>+    --cmd "migrate_set_capability chunk_register_destination on" # disabled by default
> >I think the right choice is to make chunk based the default, and remove
> >the non chunk based from code.  This will simplify the protocol a tiny bit,
> >and make us focus on improving chunk based long term so that it's as
> >fast as upfront registration.
> Answer above.
> 
> >>+
> >>+Next, if you decided *not* to use chunked registration on the server,
> >>+it is recommended to also disable zero page detection. While this is not
> >>+strictly necessary, zero page detection also significantly slows down
> >>+performance on higher-throughput links (by about 50%), like 40 gbps infiniband cards:
> >What is meant by performance here? downtime?
> 
> Throughput. Zero page scanning (and dynamic registration) reduces
> throughput significantly.

Again, not something management should worry about.
Do the right thing internally.

> >>+
> >>+$ virsh qemu-monitor-command --hmp \
> >>+    --cmd "migrate_set_capability check_for_zero off" # always enabled by default
> >>+
> >>+Finally, set the migration speed to match your hardware's capabilities:
> >>+
> >>+$ virsh qemu-monitor-command --hmp \
> >>+    --cmd "migrate_set_speed 40g" # or whatever is the MAX of your RDMA device
> >>+
> >>+Finally, perform the actual migration:
> >>+
> >>+$ virsh migrate domain rdma:xx.xx.xx.xx:port
> >>+
> >>+RDMA Protocol Description:
> >>+=================================
> >>+
> >>+Migration with RDMA is separated into two parts:
> >>+
> >>+1. The transmission of the pages using RDMA
> >>+2. Everything else (a control channel is introduced)
> >>+
> >>+"Everything else" is transmitted using a formal
> >>+protocol now, consisting of infiniband SEND / RECV messages.
> >>+
> >>+An infiniband SEND message is the standard ibverbs
> >>+message used by applications of infiniband hardware.
> >>+The only difference between a SEND message and an RDMA
> >>+message is that SEND messages cause completion notifications
> >>+to be posted to the completion queue (CQ) on the
> >>+infiniband receiver side, whereas RDMA messages (used
> >>+for pc.ram) do not (to behave like an actual DMA).
> >>+
> >>+Messages in infiniband require two things:
> >>+
> >>+1. registration of the memory that will be transmitted
> >>+2. (SEND/RECV only) work requests to be posted on both
> >>+   sides of the network before the actual transmission
> >>+   can occur.
> >>+
> >>+RDMA messages are much easier to deal with. Once the memory
> >>+on the receiver side is registered and pinned, we're
> >>+basically done. All that is required is for the sender
> >>+side to start dumping bytes onto the link.
> >When is memory unregistered and unpinned on send and receive
> >sides?
> Only when the migration ends completely. Will update the documentation.
> 
> >>+
> >>+SEND messages require more coordination because the
> >>+receiver must have reserved space (using a receive
> >>+work request) on the receive queue (RQ) before QEMUFileRDMA
> >>+can start using them to carry all the bytes as
> >>+a transport for migration of device state.
> >>+
> >>+To begin the migration, the initial connection setup is
> >>+as follows (migration-rdma.c):
> >>+
> >>+1. Receiver and Sender are started (command line or libvirt):
> >>+2. Both sides post two RQ work requests
> >Okay this could be where the problem is. This means with chunk
> >based receive side does:
> >
> >loop:
> >     receive request
> >     register
> >     send response
> >
> >while with non chunk based it does:
> >
> >receive request
> >send response
> >loop:
> >     register
> No, that's incorrect. With "non" chunk based, the receive side does
> *not* communicate
> during the migration of pc.ram.

It does not matter when this happens. What we care about is downtime and
total time from start of qemu on remote and until migration completes.
Not peak throughput.
If you don't count registration time on remote, that's just wrong.

> The control channel is only used for chunk registration and device
> state, not RAM.
> 
> I will update the documentation to make that more clear.

It's clear enough I think. But it seems you are measuring
the wrong things.

> >In reality each request/response requires two network round-trips
> >with the Ready credit-management messages.
> >So the overhead will likely be avoided if we add better pipelining:
> >allow multiple registration requests in the air, and add more
> >send/receive credits so the overhead of credit management can be
> >reduced.
> Unfortunately, the migration thread doesn't work that way.
> The thread only generates one page write at-a-time.

Yes but you do not have to block it. Each page is in these states:
        - unpinned not sent
        - pinned no rkey
        - pinned have rkey
        - unpinned sent

Each time you get a new page, it's in unpinned not sent state.
So you can start it on this state machine, and tell migration thread
to proceed to the next page.
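
As a rough illustration of the state machine above, here is a minimal
Python sketch. The four states are the ones listed; the event names
("pin", "rkey_received", "write_done") and the PageTracker class are
hypothetical and only show that submitting a page need not block.

```python
from enum import Enum, auto


class PageState(Enum):
    UNPINNED_NOT_SENT = auto()
    PINNED_NO_RKEY = auto()
    PINNED_HAVE_RKEY = auto()
    UNPINNED_SENT = auto()


# Legal transitions. The event names are assumptions for illustration.
TRANSITIONS = {
    (PageState.UNPINNED_NOT_SENT, "pin"): PageState.PINNED_NO_RKEY,
    (PageState.PINNED_NO_RKEY, "rkey_received"): PageState.PINNED_HAVE_RKEY,
    (PageState.PINNED_HAVE_RKEY, "write_done"): PageState.UNPINNED_SENT,
}


class PageTracker:
    """Tracks each page independently so the caller never has to wait."""

    def __init__(self):
        self.state = {}

    def submit(self, page):
        # Called by the migration thread: start the page on the state
        # machine and return immediately, without waiting for the rkey.
        self.state[page] = PageState.UNPINNED_NOT_SENT
        return self.advance(page, "pin")

    def advance(self, page, event):
        key = (self.state[page], event)
        if key not in TRANSITIONS:
            raise ValueError(f"illegal event {event!r} in {self.state[page].name}")
        self.state[page] = TRANSITIONS[key]
        return self.state[page]


if __name__ == "__main__":
    tracker = PageTracker()
    for page in (0, 1, 2):          # thread moves on after each submit
        tracker.submit(page)
    tracker.advance(1, "rkey_received")   # responses may arrive later
    tracker.advance(1, "write_done")
    print({p: s.name for p, s in tracker.state.items()})
```

The point of the sketch is only that several pages can sit in
"pinned no rkey" at once while the thread keeps producing new ones.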

> If someone were to write a patch which submits multiple
> writes at the same time, I would be very interested in
> consuming that feature and making chunk registration more
> efficient by batching multiple registrations into fewer messages.

No changes to migration core are necessary I think.
But assuming they are - your protocol design and
management API should not be driven by internal qemu APIs.

> >There's no requirement to implement these optimizations upfront
> >before merging the first version, but let's remove the
> >non-chunkbased crutch unless we see it as absolutely necessary.
> >
> >>+3. Receiver does listen()
> >>+4. Sender does connect()
> >>+5. Receiver accept()
> >>+6. Check versioning and capabilities (described later)
> >>+
> >>+At this point, we define a control channel on top of SEND messages
> >>+which is described by a formal protocol. Each SEND message has a
> >>+header portion and a data portion (but together are transmitted
> >>+as a single SEND message).
> >>+
> >>+Header:
> >>+    * Length  (of the data portion)
> >>+    * Type    (what command to perform, described below)
> >>+    * Version (protocol version validated before send/recv occurs)
> >What's the expected value for Version field?
> >Also, confusing.  Below mentions using private field in librdmacm instead?
> >Need to add # of bytes and endian-ness of each field.
> 
> Correct, those are two separate versions. One for capability negotiation
> and one for the protocol itself.
> 
> I will update the documentation.

Just drop the all-pinned version, and we'll work to improve
the chunk-based one until it has reasonable performance.
It seems to get a decent speed already: consider that
most people run migration with the default speed limit.
Supporting all-pinned will just be a pain down the road when
we fix performance for chunk based one.


> >>+
> >>+The 'type' field has 7 different command values:
> >0. Unused.
> >
> >>+    1. None
> >you mean this is unused?
> 
> Correct - will update.
> 
> >>+    2. Ready             (control-channel is available)
> >>+    3. QEMU File         (for sending non-live device state)
> >>+    4. RAM Blocks        (used right after connection setup)
> >>+    5. Register request  (dynamic chunk registration)
> >>+    6. Register result   ('rkey' to be used by sender)
> >Hmm, don't you also need a virtual address for RDMA writes?
> >
> 
> The virtual addresses are communicated at the beginning of the
> migration using command #4 "Ram blocks".

Yes, but RAM blocks are sent source to dest.
The virtual address needs to be sent dest to source, no?

> >>+    7. Register finished (registration for current iteration finished)
> >What does Register finished mean and how it's used?
> >
> >Need to add which commands have a data portion, and in what format.
> 
> Acknowledged. "finished" signals that a migration round has completed
> and that the receiver side can move to the next iteration.
> 
> 
> >>+
> >>+After connection setup is completed, we have two protocol-level
> >>+functions, responsible for communicating control-channel commands
> >>+using the above list of values:
> >>+
> >>+Logically:
> >>+
> >>+qemu_rdma_exchange_recv(header, expected command type)
> >>+
> >>+1. We transmit a READY command to let the sender know that
> >you call it Ready above, so better be consistent.
> >
> >>+   we are *ready* to receive some data bytes on the control channel.
> >>+2. Before attempting to receive the expected command, we post another
> >>+   RQ work request to replace the one we just used up.
> >>+3. Block on a CQ event channel and wait for the SEND to arrive.
> >>+4. When the send arrives, librdmacm will unblock us.
> >>+5. Verify that the command-type and version received match the ones we expected.
> >>+
> >>+qemu_rdma_exchange_send(header, data, optional response header & data):
> >>+
> >>+1. Block on the CQ event channel waiting for a READY command
> >>+   from the receiver to tell us that the receiver
> >>+   is *ready* for us to transmit some new bytes.
> >>+2. Optionally: if we are expecting a response from the command
> >>+   (that we have not yet transmitted),
> >Which commands expect result? Only Register request?
> 
> Yes, only register. In the code, the command is #define
> RDMA_CONTROL_REGISTER_RESULT
> 
> >>let's post an RQ
> >>+   work request to receive that data a few moments later.
> >>+3. When the READY arrives, librdmacm will
> >>+   unblock us and we immediately post a RQ work request
> >>+   to replace the one we just used up.
> >>+4. Now, we can actually post the work request to SEND
> >>+   the requested command type of the header we were asked for.
> >>+5. Optionally, if we are expecting a response (as before),
> >>+   we block again and wait for that response using the additional
> >>+   work request we previously posted. (This is used to carry
> >>+   'Register result' commands #6 back to the sender which
> >>+   hold the rkey needed to perform RDMA.)
> >>+
> >>+All of the remaining command types (not including 'ready')
> >>+described above use the aforementioned two functions to do the hard work:
> >>+
> >>+1. After connection setup, RAMBlock information is exchanged using
> >>+   this protocol before the actual migration begins.
> >>+2. During runtime, once a 'chunk' becomes full of pages ready to
> >>+   be sent with RDMA, the registration commands are used to ask the
> >>+   other side to register the memory for this chunk and respond
> >>+   with the result (rkey) of the registration.
> >>+3. The QEMUFile interfaces also call these functions (described below)
> >>+   when transmitting non-live state, such as devices or to send
> >>+   its own protocol information during the migration process.
> >>+
> >>+Versioning
> >>+==================================
> >>+
> >>+librdmacm provides the user with a 'private data' area to be exchanged
> >>+at connection-setup time before any infiniband traffic is generated.
> >>+
> >>+This is a convenient place to check for protocol versioning because the
> >>+user does not need to register memory to transmit a few bytes of version
> >>+information.
> >>+
> >>+This is also a convenient place to negotiate capabilities
> >>+(like dynamic page registration).
> >This would be a good place to document the format of the
> >private data field.
> 
> Acknowledged.
> 
> 
> >>+
> >>+If the version is invalid, we throw an error.
> >Which version is valid in this specification?
> Version 1. Will update.
> >>+
> >>+If the version is new, we only negotiate the capabilities that the
> >>+requested version is able to perform and ignore the rest.
> >What are these capabilities and how do we negotiate them?
> There is only one capability right now: dynamic server registration.
> 
> The client must tell the server whether or not the capability was
> enabled on the primary VM side.
> 
> Will update the documentation.

Cool, best add an exact structure format.
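
For example, something along these lines would already answer the
size and endian-ness questions. This is only a sketch in Python to
show the level of detail: the two 32-bit big-endian fields, the bit
assignment and all names below are assumptions, not what the patch
does. Only "version 1" and the single chunk-registration capability
come from this thread.

```python
import struct

# Assumed layout of the private data: u32 version, u32 capability
# flags, both in network (big-endian) byte order.
PRIVATE_DATA = struct.Struct("!II")

SUPPORTED_VERSION = 1            # "Version 1" per the reply above
CAP_CHUNK_REGISTER = 1 << 0      # hypothetical bit for dynamic registration
KNOWN_CAPS = CAP_CHUNK_REGISTER


def pack_private_data(version, caps):
    """Build the bytes the connecting side would place in private data."""
    return PRIVATE_DATA.pack(version, caps)


def negotiate(blob):
    """Return the capabilities both sides understand, or raise on a bad version."""
    version, caps = PRIVATE_DATA.unpack(blob)
    if version != SUPPORTED_VERSION:
        raise ValueError(f"unsupported protocol version {version}")
    # Capability bits we do not know about are ignored.
    return caps & KNOWN_CAPS


if __name__ == "__main__":
    blob = pack_private_data(SUPPORTED_VERSION, CAP_CHUNK_REGISTER)
    print(blob.hex(), negotiate(blob))
```

Whatever the real layout ends up being, stating it as fixed-width
fields with an explicit byte order removes the ambiguity.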

> >>+QEMUFileRDMA Interface:
> >>+==================================
> >>+
> >>+QEMUFileRDMA introduces a couple of new functions:
> >>+
> >>+1. qemu_rdma_get_buffer()  (QEMUFileOps rdma_read_ops)
> >>+2. qemu_rdma_put_buffer()  (QEMUFileOps rdma_write_ops)
> >>+
> >>+These two functions are very short and simply use the protocol
> >>+described above to deliver bytes without changing the upper-level
> >>+users of QEMUFile that depend on a bytestream abstraction.
> >>+
> >>+Finally, how do we handoff the actual bytes to get_buffer()?
> >>+
> >>+Again, because we're trying to "fake" a bytestream abstraction
> >>+using an analogy not unlike individual UDP frames, we have
> >>+to hold on to the bytes received from control-channel's SEND
> >>+messages in memory.
> >>+
> >>+Each time we receive a complete "QEMU File" control-channel
> >>+message, the bytes from SEND are copied into a small local holding area.
> >>+
> >>+Then, we return the number of bytes requested by get_buffer()
> >>+and leave the remaining bytes in the holding area until get_buffer()
> >>+comes around for another pass.
> >>+
> >>+If the buffer is empty, then we follow the same steps
> >>+listed above and issue another "QEMU File" protocol command,
> >>+asking for a new SEND message to re-fill the buffer.
> >>+
> >>+Migration of pc.ram:
> >>+===============================
> >>+
> >>+At the beginning of the migration, (migration-rdma.c),
> >>+the sender and the receiver populate the list of RAMBlocks
> >>+to be registered with each other into a structure.
> >>+Then, using the aforementioned protocol, they exchange a
> >>+description of these blocks with each other, to be used later
> >>+during the iteration of main memory. This description includes
> >>+a list of all the RAMBlocks, their offsets and lengths and
> >>+possibly includes pre-registered RDMA keys in case dynamic
> >>+page registration was disabled on the server-side, otherwise not.
> >Worth mentioning here that memory hotplug will require a protocol
> >extension. That's also true of TCP so not a big deal ...
> 
> Acknowledged.
> 
> >>+
> >>+Main memory is not migrated with the aforementioned protocol,
> >>+but is instead migrated with normal RDMA Write operations.
> >>+
> >>+Pages are migrated in "chunks" (about 1 Megabyte right now).
> >Why "about"? This is not dynamic so needs to be exactly same
> >on both sides, right?
> About is a typo =). It is hard-coded to exactly 1MB.

This, by the way, is something management *may* want to control.

> >
> >>+Chunk size is not dynamic, but it could be in a future implementation.
> >>+There's nothing to indicate that this is useful right now.
> >>+
> >>+When a chunk is full (or a flush() occurs), the memory backed by
> >>+the chunk is registered with librdmacm and pinned in memory on
> >>+both sides using the aforementioned protocol.
> >>+
> >>+After pinning, an RDMA Write is generated and transmitted
> >>+for the entire chunk.
> >>+
> >>+Chunks are also transmitted in batches: This means that we
> >>+do not request that the hardware signal the completion queue
> >>+for the completion of *every* chunk. The current batch size
> >>+is about 64 chunks (corresponding to 64 MB of memory).
> >>+Only the last chunk in a batch must be signaled.
> >>+This helps keep everything as asynchronous as possible
> >>+and helps keep the hardware busy performing RDMA operations.
> >>+
> >>+Error-handling:
> >>+===============================
> >>+
> >>+Infiniband has what is called a "Reliable, Connected"
> >>+link (one of 4 choices). This is the mode
> >>+we use for RDMA migration.
> >>+
> >>+If a *single* message fails,
> >>+the decision is to abort the migration entirely and
> >>+cleanup all the RDMA descriptors and unregister all
> >>+the memory.
> >>+
> >>+After cleanup, the Virtual Machine is returned to normal
> >>+operation the same way that would happen if the TCP
> >>+socket is broken during a non-RDMA based migration.
> >That's on sender side? Presumably this means you respond to
> >completion with error?
> >  How does receive side know
> >migration is complete?
> 
> Yes, on the sender side.
> 
> Migration "completeness" logic has not changed in this patch series.
> 
> Please recall that the entire QEMUFile protocol is still
> happening at the upper-level inside of savevm.c/arch_init.c.
> 

So basically receive side detects that migration is complete by
looking at the QEMUFile data?

> 
> >>+
> >>+TODO:
> >>+=================================
> >>+1. Currently, cgroups swap limits for *both* TCP and RDMA
> >>+   on the sender-side are broken. This is more acute for
> >>+   RDMA because RDMA requires memory registration.
> >>+   Fixing this requires infiniband page registrations to be
> >>+   zero-page aware, and this does not yet work properly.
> >>+2. Currently overcommit for the *receiver* side of
> >>+   TCP works, but not for RDMA. While dynamic page registration
> >>+   *does* work, it is only useful if the is_zero_page() capability
> >>+   remains enabled (which it is by default).
> >>+   However, leaving this capability turned on *significantly* slows
> >>+   down the RDMA throughput, particularly on hardware capable
> >>+   of transmitting faster than 10 gbps (such as 40gbps links).
> >>+3. Use of the recent /proc/<pid>/pagemap would likely solve some
> >>+   of these problems.
> >>+4. Some form of balloon-device usage tracking would also
> >>+   help alleviate some of these issues.
> >>+
> >>+PERFORMANCE
> >>+===================
> >>+
> >>+Using a 40gbps infiniband link performing a worst-case stress test:
> >>+
> >>+Average worst-case throughput:
> >>+1. RDMA throughput with $ stress --vm-bytes 1024M --vm 1 --vm-keep:
> >>+   approximately 30 gbps (a little better than the paper)
> >>+2. TCP throughput with $ stress --vm-bytes 1024M --vm 1 --vm-keep:
> >>+   approximately 8 gbps (using IPoIB, IP over Infiniband)
> >>+
> >>+Average downtime (stop time) ranges between 28 and 33 milliseconds.
> >>+
> >>+An *exhaustive* paper (2010) shows additional performance details
> >>+linked on the QEMU wiki:
> >>+
> >>+http://wiki.qemu.org/Features/RDMALiveMigration
> >>-- 
> >>1.7.10.4
