From: Aviad Yehezkel
Subject: Re: [Qemu-devel] FW: [PATCH 2/2] migration: not wait RDMA_CM_EVENT_DISCONNECTED event after rdma_disconnect
Date: Thu, 17 May 2018 10:31:24 +0300
User-agent: Mozilla/5.0 (Windows NT 10.0; WOW64; rv:52.0) Gecko/20100101 Thunderbird/52.7.0
On 5/17/2018 5:42 AM, 858585 jemmy wrote:
On Wed, May 16, 2018 at 11:11 PM, Aviad Yehezkel <address@hidden> wrote:

Aviad: Hi Lidong and David, sorry for the late response; I had to ramp up on the migration code and build a setup on my side. Please see my comments on this patch below. For the RDMA post-copy patches I will comment next week, after testing on the Mellanox side too. Thanks!

On 5/16/2018 5:21 PM, Aviad Yehezkel wrote:

-----Original Message-----
From: Dr. David Alan Gilbert [mailto:address@hidden]
Sent: Wednesday, May 16, 2018 4:13 PM
To: 858585 jemmy <address@hidden>
Cc: Aviad Yehezkel <address@hidden>; Juan Quintela <address@hidden>; qemu-devel <address@hidden>; Gal Shachaf <address@hidden>; Adi Dotan <address@hidden>; Lidong Chen <address@hidden>
Subject: Re: [PATCH 2/2] migration: not wait RDMA_CM_EVENT_DISCONNECTED event after rdma_disconnect

* 858585 jemmy (address@hidden) wrote:

<snip>

Dave: I wonder why dereg_mr takes so long - I could understand if reg_mr took a long time, but why for dereg? That sounds like the easy side.

jemmy: I used perf to collect information while ibv_dereg_mr is invoked:
    -  9.95%  client2  [kernel.kallsyms]  [k] put_compound_page
       - put_compound_page
          - 98.45% put_page
               __ib_umem_release
               ib_umem_release
               dereg_mr
               mlx5_ib_dereg_mr
               ib_dereg_mr
               uverbs_free_mr
               remove_commit_idr_uobject
               _rdma_remove_commit_uobject
               rdma_remove_commit_uobject
               ib_uverbs_dereg_mr
               ib_uverbs_write
               vfs_write
               sys_write
               system_call_fastpath
               __GI___libc_write
          + 1.55% __ib_umem_release
    +  8.31%  client2  [kernel.kallsyms]  [k] compound_unlock_irqrestore
    +  7.01%  client2  [kernel.kallsyms]  [k] page_waitqueue
    +  7.00%  client2  [kernel.kallsyms]  [k] set_page_dirty
    +  6.61%  client2  [kernel.kallsyms]  [k] unlock_page
    +  6.33%  client2  [kernel.kallsyms]  [k] put_page_testzero
    +  5.68%  client2  [kernel.kallsyms]  [k] set_page_dirty_lock
    +  4.30%  client2  [kernel.kallsyms]  [k] __wake_up_bit
    +  4.04%  client2  [kernel.kallsyms]  [k] free_pages_prepare
    +  3.65%  client2  [kernel.kallsyms]  [k] release_pages
    +  3.62%  client2  [kernel.kallsyms]  [k] arch_local_irq_save
    +  3.35%  client2  [kernel.kallsyms]  [k] page_mapping
    +  3.13%  client2  [kernel.kallsyms]  [k] get_pageblock_flags_group
    +  3.09%  client2  [kernel.kallsyms]  [k] put_page

The reason is that __ib_umem_release loops over every page:

    static void __ib_umem_release(struct ib_device *dev, struct ib_umem *umem,
                                  int dirty)
    {
        struct scatterlist *sg;
        struct page *page;
        int i;

        if (umem->nmap > 0)
            ib_dma_unmap_sg(dev, umem->sg_head.sgl,
                            umem->npages,
                            DMA_BIDIRECTIONAL);

        for_each_sg(umem->sg_head.sgl, sg, umem->npages, i) {   /* <-- here: loops a lot of times, for each page */

Dave: Why "a lot of times for each page"?
Dave: I don't know this code at all, but I'd have expected once per page?

jemmy: Sorry, once per page - but a big virtual machine has a lot of pages.

Dave: Ah OK; so yes, it seems best if you can find a way to do the release in the migration thread then; still, maybe this is something some of the kernel people could look at speeding up?

jemmy: The kernel code does not seem complex, but I have no idea how to speed it up.

Dave: Me neither; but I'll ask around.

Aviad: I will ask internally and get back on this one.

Dave: With your other kernel fix, does the problem of the missing RDMA_CM_EVENT_DISCONNECTED events go away?

jemmy: Yes; after the kernel and qemu are fixed, this issue never happens again.

Dave: I'm confused; which qemu fix? My question was whether the kernel fix by itself fixed the problem of the missing event.

jemmy: This qemu fix: "migration: update index field when delete or qsort RDMALocalBlock".

Dave: OK, good; so then we shouldn't need this 2/2 patch.

Aviad: It is good that the qemu fix solved this issue, but my colleagues and I think we need the 2/2 patch anyway. According to the IB spec, once the active side sends a DREQ message it should wait for a DREP message, and only once that arrives should it trigger a DISCONNECT event. The DREP message can be dropped due to network issues. For that case the spec defines a DREP_timeout state in the CM state machine: if the DREP is dropped, we should get a timeout and a TIMEWAIT_EXIT event will be triggered. Unfortunately the current kernel CM implementation doesn't include the DREP_timeout state, so in the above scenario we will get neither a DISCONNECT nor a TIMEWAIT_EXIT event. (A similar scenario exists on the passive side.) We think it is best to apply this patch until the kernel code is enhanced.

jemmy: Hi Aviad, how long is the DREP timeout? And does RDMA have a tool like tcpdump that I can use to confirm whether the DREP message was received? Thanks.
Aviad: Are you running IB or RoCE?
jemmy: This issue is also caused by some ram blocks not being released, but I have not found the root cause.

Dave: Hmm, we should try and track that down.

jemmy: Do you think we should remove rdma_get_cm_event after rdma_disconnect?

Dave: I don't think so; if rdma_disconnect is supposed to generate the event, I think we're supposed to wait for it to know that the disconnect is really complete.

jemmy: After moving qemu_fclose to the migration thread, it will not block the main thread while waiting for the disconnection event.

Dave: I'm not sure about moving the fclose to the migration thread; it worries me with the interaction with cancel and other failures.

[rest of the __ib_umem_release function quoted above:]

            page = sg_page(sg);
            if (umem->writable && dirty)
                set_page_dirty_lock(page);
            put_page(page);
        }

        sg_free_table(&umem->sg_head);
        return;
    }

jemmy: Anyway, it should not invoke rdma_get_cm_event in the main thread, and the event channel is also destroyed in qemu_rdma_cleanup.

Signed-off-by: Lidong Chen <address@hidden>
---
 migration/rdma.c       | 12 ++----------
 migration/trace-events |  1 -
 2 files changed, 2 insertions(+), 11 deletions(-)

diff --git a/migration/rdma.c b/migration/rdma.c
index 0dd4033..92e4d30 100644
--- a/migration/rdma.c
+++ b/migration/rdma.c
@@ -2275,8 +2275,7 @@ static int qemu_rdma_write(QEMUFile *f, RDMAContext *rdma,
 
 static void qemu_rdma_cleanup(RDMAContext *rdma)
 {
-    struct rdma_cm_event *cm_event;
-    int ret, idx;
+    int idx;
 
     if (rdma->cm_id && rdma->connected) {
         if ((rdma->error_state ||
@@ -2290,14 +2289,7 @@ static void qemu_rdma_cleanup(RDMAContext *rdma)
             qemu_rdma_post_send_control(rdma, NULL, &head);
         }
 
-        ret = rdma_disconnect(rdma->cm_id);
-        if (!ret) {
-            trace_qemu_rdma_cleanup_waiting_for_disconnect();
-            ret = rdma_get_cm_event(rdma->channel, &cm_event);
-            if (!ret) {
-                rdma_ack_cm_event(cm_event);
-            }
-        }
+        rdma_disconnect(rdma->cm_id);

Dave: I'm worried whether this change could break stuff: the docs say for rdma_disconnect that it flushes any posted work requests to the completion queue; so unless we wait for the event, do we know the stuff has been flushed? In the normal non-cancel case I'm worried that means we could lose something. (But I don't know the rdma/infiniband specs well enough to know if it's really a problem.)

jemmy: The qemu_fclose function invokes qemu_fflush(f) before it invokes f->ops->close, so I think it's safe for the normal migration case. For the root cause of why the RDMA_CM_EVENT_DISCONNECTED event is not received after rdma_disconnect, I have looped in Aviad Yehezkel <address@hidden>; Aviad will help us find the root cause.

Aviad: Regarding the previous discussion on that thread: rdma_disconnect can be invoked even without invoking ibv_dereg_mr first. The common teardown flow is:

    rdma_disconnect      <-- here resources like QPs are still alive and can DMA to RAM
    rdma_destroy_cm_id   <-- all resources are destroyed, but MRs can still be active
    ibv_dereg_mr

         trace_qemu_rdma_cleanup_disconnect();
         rdma->connected = false;
     }

diff --git a/migration/trace-events b/migration/trace-events
index d6be74b..64573ff 100644
--- a/migration/trace-events
+++ b/migration/trace-events
@@ -125,7 +125,6 @@ qemu_rdma_accept_pin_state(bool pin) "%d"
 qemu_rdma_accept_pin_verbsc(void *verbs) "Verbs context after listen: %p"
 qemu_rdma_block_for_wrid_miss(const char *wcompstr, int wcomp, const char *gcompstr, uint64_t req) "A Wanted wrid %s (%d) but got %s (%" PRIu64 ")"
 qemu_rdma_cleanup_disconnect(void) ""
-qemu_rdma_cleanup_waiting_for_disconnect(void) ""
 qemu_rdma_close(void) ""
 qemu_rdma_connect_pin_all_requested(void) ""
 qemu_rdma_connect_pin_all_outcome(bool pin) "%d"
-- 
1.8.3.1

-- 
Dr. David Alan Gilbert / address@hidden / Manchester, UK