
From: Michael R. Hines
Subject: Re: [Qemu-devel] [RFC PATCH RDMA support v5: 03/12] comprehensive protocol documentation
Date: Wed, 10 Apr 2013 11:29:24 -0400
User-agent: Mozilla/5.0 (X11; Linux i686; rv:17.0) Gecko/20130106 Thunderbird/17.0.2

On 04/10/2013 09:34 AM, Michael S. Tsirkin wrote:
On Wed, Apr 10, 2013 at 09:04:44AM -0400, Michael R. Hines wrote:
On 04/10/2013 01:27 AM, Michael S. Tsirkin wrote:
Below is a great high level overview. the protocol looks correct.
A bit more detail would be helpful, as noted below.

The main thing I'd like to see changed is that there are already
two protocols here: chunk-based and non chunk based.
We'll need to use versioning and capabilities going forward but in the
first version we don't need to maintain compatibility with legacy so
two versions seems like unnecessary pain.  Chunk based is somewhat slower and
that is worth fixing longer term, but seems like the way forward. So
let's implement a single chunk-based protocol in the first version we
merge.

Some more minor improvement suggestions below.
Thanks.

However, IMHO restricting the policy to only use chunk-based registration is really
not an acceptable choice:

Here's the reason: Using my 10gbps RDMA hardware, throughput takes a
dive from 10gbps to 6gbps.
Who cares about the throughput really? What we do care about
is how long the whole process takes.


Low latency and high throughput are very important =)

Without these properties of RDMA, many workloads simply either
take too long to finish migrating or do not converge to a stopping
point altogether.

*Not* making this a configurable option would defeat the purpose of using RDMA altogether.

Otherwise, you're no better off than just using TCP.



But if I disable chunk-based registration altogether (forgoing
overcommit), then performance comes back.

The reason for this is the additional control channel traffic
needed to ask the server to register
memory pages on demand - without this traffic, we can easily
saturate the link.
But with this traffic, the user needs to know (and be given the
option) to disable the feature
in case they want performance instead of flexibility.

IMO that's just because the current control protocol is so inefficient.
You just need to pipeline the registration: request the next chunk
while remote side is handling the previous one(s).

With any protocol, you still need to:
        register all memory
        send addresses and keys to source
        get notification that write is done
What is different with chunk based?
Simply that there are several network round trips
before the process can start.
So part of the time you are not doing writes,
you are waiting for the next control message.

So you should be doing several in parallel.
This will complicate the protocol though, so I am not asking
for this right away.

But a broken pin-it-all alternative will just confuse matters.  It is
best to keep it out of tree.

There's a huge difference. (Answer continued below this one).

The devil is in the details, here: Pipelining is simply not possible
right now because the migration thread has total control over
when and which pages are requested to be migrated.

You can't pipeline page registrations if you don't know the pages are dirty -
and the only way to know that pages are dirty is if the migration thread told
you to save them.

On the other hand, advanced registration of *known* dirty pages
is very important - I will certainly be submitting a patch in the future
which attempts to handle this case.


So make the protocol smarter and fix this. This is not something
management needs to know about.


If you like, you can teach management to specify the max amount of
memory pinned. It should be specified at the appropriate place:
on the remote for remote, on source for source.


Answer below.


What is meant by performance here? downtime?
Throughput. Zero page scanning (and dynamic registration) reduces
throughput significantly.
Again, not something management should worry about.
Do the right thing internally.

I disagree with that: This is an entirely workload-specific decision,
not a system-level decision.

If I have a known memory-intensive workload that is virtualized,
then it would be "too late" to disable zero page detection *after*
the RDMA migration begins.

We have management tools already that are that smart - there's
nothing wrong with smart management knowing in advance that
a workload is memory-intensive and also knowing that an RDMA
migration is going to be issued.

There's no way for QEMU to know that in advance without some kind
of advanced heuristic that tracks the behavior of the VM over time,
which I don't think anybody wants to get into the business of writing =)

+
+SEND messages require more coordination because the
+receiver must have reserved space (using a receive
+work request) on the receive queue (RQ) before QEMUFileRDMA
+can start using them to carry all the bytes as
+a transport for migration of device state.
+
+To begin the migration, the initial connection setup is
+as follows (migration-rdma.c):
+
+1. Receiver and Sender are started (command line or libvirt):
+2. Both sides post two RQ work requests
Okay, this could be where the problem is. This means with chunk
based, the receive side does:

loop:
        receive request
        register
        send response

while with non chunk based it does:

receive request
send response
loop:
        register
No, that's incorrect. With "non" chunk based, the receive side does
*not* communicate during the migration of pc.ram.
It does not matter when this happens. What we care about is downtime and
total time from the start of qemu on the remote until migration completes.
Not peak throughput.
If you don't count registration time on remote, that's just wrong.

Answer above.


The control channel is only used for chunk registration and device
state, not RAM.

I will update the documentation to make that more clear.
It's clear enough I think. But it seems you are measuring
the wrong things.

In reality each request/response requires two network round-trips
with the Ready credit-management messages.
So the overhead will likely be avoided if we add better pipelining:
allow multiple registration requests in the air, and add more
send/receive credits so the overhead of credit management can be
reduced.
Unfortunately, the migration thread doesn't work that way.
The thread only generates one page write at a time.
Yes but you do not have to block it. Each page is in these states:
        - unpinned not sent
        - pinned no rkey
        - pinned have rkey
        - unpinned sent

Each time you get a new page, it's in the unpinned-not-sent state.
So you can start it on this state machine, and tell the migration thread
to proceed to the next page.
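
For illustration only, the four states above could be written down as a small
enum; the name RDMAPageState is hypothetical and not something in the patch:

    /* Per-page pipeline states, as described above (sketch only). */
    typedef enum {
        PAGE_UNPINNED_NOT_SENT,  /* dirty page handed over by the migration thread */
        PAGE_PINNED_NO_RKEY,     /* registration request sent, waiting for rkey    */
        PAGE_PINNED_HAVE_RKEY,   /* rkey received, RDMA write can be posted        */
        PAGE_UNPINNED_SENT,      /* write completed, chunk may be unregistered     */
    } RDMAPageState;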

Yes, I'm doing that already (documented as "batching") in the
docs file.

But the problem is more complicated than that: there is no coordination
between the migration_thread and RDMA right now because Paolo is
trying to maintain a very clean separation of function.

However we *can* do what you described in a future patch like this:

1. Migration thread says "iteration starts, how much memory is dirty?"
2. RDMA protocol says "Is there a lot of dirty memory?"
   OK, yes? Then batch all the registration messages into a single request,
            but do not write the memory until all the registrations have
            completed.

   OK, no?  Then just issue registrations with very little batching so that
            we can quickly move on to the next iteration round.

Make sense?
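
Roughly, in code, the heuristic would look something like this (the function
and threshold names are made up for illustration, not existing QEMU APIs):

    /* Sketch of the iteration-start heuristic described above. */
    static void rdma_plan_iteration(uint64_t dirty_bytes)
    {
        if (dirty_bytes > LARGE_DIRTY_THRESHOLD) {
            /* Lots of dirty memory: batch all registration requests into
             * one message; hold the RDMA writes until every rkey is back. */
            rdma_batch_registrations_then_write();
        } else {
            /* Little dirty memory: register with minimal batching so we
             * can move on to the next iteration round quickly. */
            rdma_register_and_write_incrementally();
        }
    }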

If someone were to write a patch which submits multiple
writes at the same time, I would be very interested in
consuming that feature and making chunk registration more
efficient by batching multiple registrations into fewer messages.
No changes to the migration core are necessary, I think.
But assuming they are - your protocol design and
management API should not be driven by internal qemu APIs.

Answer above.

There's no requirement to implement these optimizations upfront
before merging the first version, but let's remove the
non-chunkbased crutch unless we see it as absolutely necessary.

+3. Receiver does listen()
+4. Sender does connect()
+5. Receiver accept()
+6. Check versioning and capabilities (described later)
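
For reference, the sender side of steps 3-6 maps onto librdmacm roughly as
follows. This is only a minimal sketch with error handling omitted, not the
code from this patch:

    #include <netinet/in.h>
    #include <rdma/rdma_cma.h>

    struct rdma_event_channel *ec = rdma_create_event_channel();
    struct rdma_cm_id *id;
    struct sockaddr_in dst;                   /* destination filled in elsewhere */
    struct ibv_recv_wr recv_wr[2], *bad_wr;   /* the two RQ work requests        */
    struct rdma_conn_param param = { .retry_count = 7 };

    rdma_create_id(ec, &id, NULL, RDMA_PS_TCP);
    rdma_resolve_addr(id, NULL, (struct sockaddr *)&dst, 2000);
    /* wait for RDMA_CM_EVENT_ADDR_RESOLVED */
    rdma_resolve_route(id, 2000);
    /* wait for RDMA_CM_EVENT_ROUTE_RESOLVED, create PD/CQ/QP, then post
     * the two receive work requests before connecting: */
    ibv_post_recv(id->qp, &recv_wr[0], &bad_wr);
    ibv_post_recv(id->qp, &recv_wr[1], &bad_wr);
    rdma_connect(id, &param);
    /* wait for RDMA_CM_EVENT_ESTABLISHED; version/capability bytes can
     * ride in param.private_data */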
+
+At this point, we define a control channel on top of SEND messages
+which is described by a formal protocol. Each SEND message has a
+header portion and a data portion (but together are transmitted
+as a single SEND message).
+
+Header:
+    * Length  (of the data portion)
+    * Type    (what command to perform, described below)
+    * Version (protocol version validated before send/recv occurs)
What's the expected value for Version field?
Also, confusing.  Below mentions using private field in librdmacm instead?
Need to add # of bytes and endian-ness of each field.
Correct, those are two separate versions. One for capability negotiation
and one for the protocol itself.

I will update the documentation.
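
For what it's worth, a possible wire layout for the header would be something
like the sketch below; the 32-bit widths and network byte order are assumptions
on my part, not what the patch currently defines:

    /* Possible control-header layout (sketch; widths/endianness assumed). */
    struct rdma_control_header {
        uint32_t len;      /* length of the data portion, in bytes (big-endian) */
        uint32_t type;     /* command value, see the list of types below        */
        uint32_t version;  /* protocol version, validated before send/recv      */
    };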
Just drop the all-pinned version, and we'll work to improve
the chunk-based one until it has reasonable performance.
It seems to get a decent speed already: consider that
most people run migration with the default speed limit.
Supporting all-pinned will just be a pain down the road when
we fix performance for the chunk-based one.


The speed tops out at 6gbps; that's not good enough for a 40gbps link.

The migration could complete *much* faster by disabling chunk registration.

We have very large physical machines, where chunk registration is not as important
as migrating the workload very quickly with very little downtime.

In these cases, chunk registration just "gets in the way".

+
+The 'type' field has 7 different command values:
0. Unused.

+    1. None
you mean this is unused?
Correct - will update.

+    2. Ready             (control-channel is available)
+    3. QEMU File         (for sending non-live device state)
+    4. RAM Blocks        (used right after connection setup)
+    5. Register request  (dynamic chunk registration)
+    6. Register result   ('rkey' to be used by sender)
Hmm, don't you also need a virtual address for RDMA writes?

The virtual addresses are communicated at the beginning of the
migration using command #4 "Ram blocks".
Yes but ram blocks are sent source to dest.
virtual address needs to be sent dest to source no?

I just said that, no? =)
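
In verbs terms, the sender combines the remote virtual address learned from
RAM Blocks with the rkey returned in a Register result when posting the write.
A minimal sketch (variable names are illustrative, not from the patch):

    /* Posting one RDMA write for a chunk (sketch only). */
    struct ibv_sge sge = {
        .addr   = (uintptr_t)local_chunk_ptr,   /* locally registered chunk */
        .length = chunk_len,
        .lkey   = local_mr->lkey,
    };
    struct ibv_send_wr wr = {
        .opcode              = IBV_WR_RDMA_WRITE,
        .sg_list             = &sge,
        .num_sge             = 1,
        .send_flags          = IBV_SEND_SIGNALED,
        .wr.rdma.remote_addr = remote_block_addr + offset, /* from RAM Blocks      */
        .wr.rdma.rkey        = remote_rkey,                /* from Register result */
    };
    struct ibv_send_wr *bad;
    ibv_post_send(qp, &wr, &bad);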


There is only one capability right now: dynamic server registration.

The client must tell the server whether or not the capability was
enabled on the primary VM side.

Will update the documentation.
Cool, best add an exact structure format.

Acknowledged.
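
As a starting point, the structure could be as simple as the following, carried
in the librdmacm private_data field; this layout is only a guess on my part,
not what the patch implements:

    /* Possible capability-negotiation blob (sketch; layout assumed). */
    struct rdma_capabilities {
        uint32_t version;   /* capability-negotiation version              */
        uint32_t flags;     /* bit 0: dynamic (chunk) registration enabled */
    };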

+
+Main memory is not migrated with the aforementioned protocol,
+but is instead migrated with normal RDMA Write operations.
+
+Pages are migrated in "chunks" (about 1 Megabyte right now).
Why "about"? This is not dynamic so needs to be exactly same
on both sides, right?
About is a typo =). It is hard-coded to exactly 1MB.
This, by the way, is something management *may* want to control.

Acknowledged.

+Chunk size is not dynamic, but it could be in a future implementation.
+There's nothing to indicate that this is useful right now.
+
+When a chunk is full (or a flush() occurs), the memory backed by
+the chunk is registered with librdmacm and pinned in memory on
+both sides using the aforementioned protocol.
+
+After pinning, an RDMA Write is generated and transmitted
+for the entire chunk.
+
+Chunks are also transmitted in batches: This means that we
+do not request that the hardware signal the completion queue
+for the completion of *every* chunk. The current batch size
+is about 64 chunks (corresponding to 64 MB of memory).
+Only the last chunk in a batch must be signaled.
+This helps keep everything as asynchronous as possible
+and helps keep the hardware busy performing RDMA operations.
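
In verbs terms, the batching above just means that only the final work request
of a batch sets IBV_SEND_SIGNALED; a rough sketch (not the patch's code):

    /* Chain up a batch of chunk writes; only the last one is signaled. */
    for (i = 0; i < chunks_in_batch; i++) {
        wr[i].opcode     = IBV_WR_RDMA_WRITE;
        wr[i].send_flags = (i == chunks_in_batch - 1) ? IBV_SEND_SIGNALED : 0;
        wr[i].next       = (i == chunks_in_batch - 1) ? NULL : &wr[i + 1];
        /* sg_list / remote_addr / rkey filled in per chunk, as sketched above */
    }
    ibv_post_send(qp, &wr[0], &bad);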
+
+Error-handling:
+===============================
+
+Infiniband has what is called a "Reliable, Connected"
+link (one of 4 choices). This is the mode we use
+for RDMA migration.
+
+If a *single* message fails,
+the decision is to abort the migration entirely and
+clean up all the RDMA descriptors and unregister all
+the memory.
+
+After cleanup, the Virtual Machine is returned to normal
+operation the same way that would happen if the TCP
+socket is broken during a non-RDMA based migration.
That's on sender side? Presumably this means you respond to
completion with error?
How does receive side know migration is complete?
Yes, on the sender side.

Migration "completeness" logic has not changed in this patch series.

Please recall that the entire QEMUFile protocol is still
happening at the upper level inside of savevm.c/arch_init.c.

So basically receive side detects that migration is complete by
looking at the QEMUFile data?


That's correct - same mechanism used by TCP.
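
To tie the error-handling text above back to the verbs API: aborting on a
single failed message amounts to checking the status of each work completion
on the sender; an illustrative sketch, not the patch's code:

    /* Any non-success completion aborts the migration (sketch only). */
    struct ibv_wc wc;
    if (ibv_poll_cq(cq, 1, &wc) > 0 && wc.status != IBV_WC_SUCCESS) {
        /* tear down the QP/CQ/PD, deregister all memory, and return the
         * VM to normal operation, as with a broken TCP socket */
    }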




