Re: [Qemu-devel] FW: [PATCH 2/2] migration: not wait RDMA_CM_EVENT_DISCONNECTED event after rdma_disconnect


From: 858585 jemmy
Subject: Re: [Qemu-devel] FW: [PATCH 2/2] migration: not wait RDMA_CM_EVENT_DISCONNECTED event after rdma_disconnect
Date: Thu, 17 May 2018 15:41:29 +0800

On Thu, May 17, 2018 at 3:31 PM, Aviad Yehezkel
<address@hidden> wrote:
>
>
> On 5/17/2018 5:42 AM, 858585 jemmy wrote:
>>
>> On Wed, May 16, 2018 at 11:11 PM, Aviad Yehezkel
>> <address@hidden> wrote:
>>>
>>> Hi Lidong and David,
>>> Sorry for the late response, I had to ramp up on the migration code and
>>> build a setup on my side.
>>>
>>> Please see my comments for this patch below.
>>> For the RDMA post-copy patches I will comment next week, after testing on
>>> the Mellanox side too.
>>>
>>> Thanks!
>>>
>>> On 5/16/2018 5:21 PM, Aviad Yehezkel wrote:
>>>>
>>>>
>>>> -----Original Message-----
>>>> From: Dr. David Alan Gilbert [mailto:address@hidden
>>>> Sent: Wednesday, May 16, 2018 4:13 PM
>>>> To: 858585 jemmy <address@hidden>
>>>> Cc: Aviad Yehezkel <address@hidden>; Juan Quintela
>>>> <address@hidden>; qemu-devel <address@hidden>; Gal Shachaf
>>>> <address@hidden>; Adi Dotan <address@hidden>; Lidong Chen
>>>> <address@hidden>
>>>> Subject: Re: [PATCH 2/2] migration: not wait RDMA_CM_EVENT_DISCONNECTED
>>>> event after rdma_disconnect
>>>>
>>>> * 858585 jemmy (address@hidden) wrote:
>>>>
>>>> <snip>
>>>>
>>>>>>>>>> I wonder why dereg_mr takes so long - I could understand if
>>>>>>>>>> reg_mr took a long time, but why for dereg, that sounds like the
>>>>>>>>>> easy side.
>>>>>>>>>
>>>>>>>>> I used perf to collect the information below while ibv_dereg_mr is invoked.
>>>>>>>>>
>>>>>>>>> -   9.95%  client2  [kernel.kallsyms]  [k] put_compound_page
>>>>>>>>>      - put_compound_page
>>>>>>>>>         - 98.45% put_page
>>>>>>>>>              __ib_umem_release
>>>>>>>>>              ib_umem_release
>>>>>>>>>              dereg_mr
>>>>>>>>>              mlx5_ib_dereg_mr
>>>>>>>>>              ib_dereg_mr
>>>>>>>>>              uverbs_free_mr
>>>>>>>>>              remove_commit_idr_uobject
>>>>>>>>>              _rdma_remove_commit_uobject
>>>>>>>>>              rdma_remove_commit_uobject
>>>>>>>>>              ib_uverbs_dereg_mr
>>>>>>>>>              ib_uverbs_write
>>>>>>>>>              vfs_write
>>>>>>>>>              sys_write
>>>>>>>>>              system_call_fastpath
>>>>>>>>>              __GI___libc_write
>>>>>>>>>              0
>>>>>>>>>         + 1.55% __ib_umem_release
>>>>>>>>> +   8.31%  client2  [kernel.kallsyms]  [k] compound_unlock_irqrestore
>>>>>>>>> +   7.01%  client2  [kernel.kallsyms]  [k] page_waitqueue
>>>>>>>>> +   7.00%  client2  [kernel.kallsyms]  [k] set_page_dirty
>>>>>>>>> +   6.61%  client2  [kernel.kallsyms]  [k] unlock_page
>>>>>>>>> +   6.33%  client2  [kernel.kallsyms]  [k] put_page_testzero
>>>>>>>>> +   5.68%  client2  [kernel.kallsyms]  [k] set_page_dirty_lock
>>>>>>>>> +   4.30%  client2  [kernel.kallsyms]  [k] __wake_up_bit
>>>>>>>>> +   4.04%  client2  [kernel.kallsyms]  [k] free_pages_prepare
>>>>>>>>> +   3.65%  client2  [kernel.kallsyms]  [k] release_pages
>>>>>>>>> +   3.62%  client2  [kernel.kallsyms]  [k] arch_local_irq_save
>>>>>>>>> +   3.35%  client2  [kernel.kallsyms]  [k] page_mapping
>>>>>>>>> +   3.13%  client2  [kernel.kallsyms]  [k] get_pageblock_flags_group
>>>>>>>>> +   3.09%  client2  [kernel.kallsyms]  [k] put_page
>>>>>>>>>
>>>>>>>>> the reason is __ib_umem_release will loop many times for each page.
>>>>>>>>>
>>>>>>>>> static void __ib_umem_release(struct ib_device *dev,
>>>>>>>>>                               struct ib_umem *umem, int dirty)
>>>>>>>>> {
>>>>>>>>>     struct scatterlist *sg;
>>>>>>>>>     struct page *page;
>>>>>>>>>     int i;
>>>>>>>>>
>>>>>>>>>     if (umem->nmap > 0)
>>>>>>>>>         ib_dma_unmap_sg(dev, umem->sg_head.sgl,
>>>>>>>>>                         umem->npages,
>>>>>>>>>                         DMA_BIDIRECTIONAL);
>>>>>>>>>
>>>>>>>>>     for_each_sg(umem->sg_head.sgl, sg, umem->npages, i) {   << loop a lot of times for each page here
>>>>>>>>
>>>>>>>> Why 'lot of times for each page'?  I don't know this code at all,
>>>>>>>> but I'd expected once per page?
>>>>>>>
>>>>>>> Sorry, once per page, but there are a lot of pages for a large virtual
>>>>>>> machine.
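
To put a rough, hypothetical number on that: a guest with 64 GiB of pinned RAM
and 4 KiB pages has 64 GiB / 4 KiB = 16,777,216 (about 16.8 million) pages, so
the loop in __ib_umem_release quoted above ends up calling
set_page_dirty_lock()/put_page() on the order of 16.8 million times for a
single deregistration.
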
>>>>>>
>>>>>> Ah OK; so yes it seems best if you can find a way to do the release
>>>>>> in the migration thread then;  still maybe this is something some of
>>>>>> the kernel people could look at speeding up?
>>>>>
>>>>> The kernel code does not seem complex, and I have no idea how to speed
>>>>> it up.
>>>>
>>>> Me neither; but I'll ask around.
>>>
>>> I will ask internally and get back on this one.
>>>>
>>>>
>>>>>>>> With your other kernel fix, does the problem of the missing
>>>>>>>> RDMA_CM_EVENT_DISCONNECTED events go away?
>>>>>>>
>>>>>>> Yes, after the kernel and qemu fixes, this issue never happens again.
>>>>>>
>>>>>> I'm confused; which qemu fix; my question was whether the kernel fix
>>>>>> by itself fixed the problem of the missing event.
>>>>>
>>>>> this qemu fix:
>>>>> migration: update index field when delete or qsort RDMALocalBlock
>>>>
>>>> OK good; so then we shouldn't need this 2/2 patch.
>>>
>>> It is good that the qemu fix solved this issue, but my colleagues and I
>>> think we need the 2/2 patch anyway.
>>> According to the IB spec, once the active side sends a DREQ message it
>>> should wait for a DREP message, and only once that arrives should it
>>> trigger a DISCONNECT event.
>>> The DREP message can be dropped due to network issues.
>>> For that case the spec defines a DREP_timeout state in the CM state
>>> machine: if the DREP is dropped we should get a timeout and a
>>> TIMEWAIT_EXIT event will be triggered.
>>> Unfortunately the current kernel CM implementation doesn't include the
>>> DREP_timeout state, so in the above scenario we will not get a DISCONNECT
>>> or TIMEWAIT_EXIT event.
>>> (A similar scenario also exists for the passive side.)
>>> We think it is best to apply this patch until we enhance the kernel code.
>>>
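
For reference, here is a minimal sketch of the bounded wait described above.
It is not code from QEMU or from this patch, and the 5000 ms timeout is an
arbitrary illustrative value; the sketch simply waits on the CM event channel
for either RDMA_CM_EVENT_DISCONNECTED or RDMA_CM_EVENT_TIMEWAIT_EXIT using
librdmacm:

#include <poll.h>
#include <rdma/rdma_cma.h>

/* Disconnect, then wait a bounded time for the CM to confirm it. */
static int disconnect_and_wait(struct rdma_cm_id *id,
                               struct rdma_event_channel *channel)
{
    struct rdma_cm_event *ev;
    struct pollfd pfd = { .fd = channel->fd, .events = POLLIN };

    if (rdma_disconnect(id)) {
        return -1;
    }
    for (;;) {
        /* Bound the wait so a lost DREP cannot hang the caller forever. */
        if (poll(&pfd, 1, 5000 /* ms, illustrative value */) <= 0) {
            return -1;                 /* timeout or poll error */
        }
        if (rdma_get_cm_event(channel, &ev)) {
            return -1;
        }
        if (ev->event == RDMA_CM_EVENT_DISCONNECTED ||
            ev->event == RDMA_CM_EVENT_TIMEWAIT_EXIT) {
            rdma_ack_cm_event(ev);
            return 0;
        }
        rdma_ack_cm_event(ev);         /* unrelated event, keep waiting */
    }
}

The 2/2 patch under discussion instead drops the wait entirely; the sketch
only shows what a bounded wait covering the lost-DREP case could look like.
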
>> Hi Aviad,
>>      How long is the DREP_timeout state?
>>      Does RDMA have a tool like tcpdump that I can use to confirm whether
>>      the DREP message has been received?
>>
>>      Thanks.
>
> Are you running IB or RoCE?

RoCE v2.

>>>>>
>>>>> This issue is also caused by some ram block not being released, but I
>>>>> have not found the root cause.
>>>>
>>>> Hmm, we should try and track that down.
>>>>
>>>>>>> Do you think we should remove rdma_get_cm_event after
>>>>>>> rdma_disconnect?
>>>>>>
>>>>>> I don't think so; if 'rdma_disconnect' is supposed to generate the
>>>>>> event I think we're supposed to wait for it to know that the
>>>>>> disconnect is really complete.
>>>>>
>>>>> After moving qemu_fclose to the migration thread, it will not block the
>>>>> main thread while waiting for the disconnection event.
>>>>
>>>> I'm not sure about moving the fclose to the migration thread; it worries
>>>> me because of the interaction with cancel and other failures.
>>>>
>>>> Dave
>>>>
>>>>>> Dave
>>>>>>
>>>>>>>> Dave
>>>>>>>>
>>>>>>>>>         page = sg_page(sg);
>>>>>>>>>         if (umem->writable && dirty)
>>>>>>>>>             set_page_dirty_lock(page);
>>>>>>>>>         put_page(page);
>>>>>>>>>     }
>>>>>>>>>
>>>>>>>>>     sg_free_table(&umem->sg_head);
>>>>>>>>>     return;
>>>>>>>>> }
>>>>>>>>>
>>>>>>>>>> Dave
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>>> Dave
>>>>>>>>>>>>
>>>>>>>>>>>>>>> Anyway, it should not invoke rdma_get_cm_event in main
>>>>>>>>>>>>>>> thread, and the event channel is also destroyed in
>>>>>>>>>>>>>>> qemu_rdma_cleanup.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Signed-off-by: Lidong Chen <address@hidden>
>>>>>>>>>>>>>>> ---
>>>>>>>>>>>>>>>    migration/rdma.c       | 12 ++----------
>>>>>>>>>>>>>>>    migration/trace-events |  1 -
>>>>>>>>>>>>>>>    2 files changed, 2 insertions(+), 11 deletions(-)
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> diff --git a/migration/rdma.c b/migration/rdma.c index
>>>>>>>>>>>>>>> 0dd4033..92e4d30 100644
>>>>>>>>>>>>>>> --- a/migration/rdma.c
>>>>>>>>>>>>>>> +++ b/migration/rdma.c
>>>>>>>>>>>>>>> @@ -2275,8 +2275,7 @@ static int
>>>>>>>>>>>>>>> qemu_rdma_write(QEMUFile *f, RDMAContext *rdma,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>    static void qemu_rdma_cleanup(RDMAContext *rdma)  {
>>>>>>>>>>>>>>> -    struct rdma_cm_event *cm_event;
>>>>>>>>>>>>>>> -    int ret, idx;
>>>>>>>>>>>>>>> +    int idx;
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>        if (rdma->cm_id && rdma->connected) {
>>>>>>>>>>>>>>>            if ((rdma->error_state ||
>>>>>>>>>>>>>>> @@ -2290,14 +2289,7 @@ static void
>>>>>>>>>>>>>>> qemu_rdma_cleanup(RDMAContext *rdma)
>>>>>>>>>>>>>>>                qemu_rdma_post_send_control(rdma, NULL,
>>>>>>>>>>>>>>> &head);
>>>>>>>>>>>>>>>            }
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> -        ret = rdma_disconnect(rdma->cm_id);
>>>>>>>>>>>>>>> -        if (!ret) {
>>>>>>>>>>>>>>> -
>>>>>>>>>>>>>>> trace_qemu_rdma_cleanup_waiting_for_disconnect();
>>>>>>>>>>>>>>> -            ret = rdma_get_cm_event(rdma->channel,
>>>>>>>>>>>>>>> &cm_event);
>>>>>>>>>>>>>>> -            if (!ret) {
>>>>>>>>>>>>>>> -                rdma_ack_cm_event(cm_event);
>>>>>>>>>>>>>>> -            }
>>>>>>>>>>>>>>> -        }
>>>>>>>>>>>>>>> +        rdma_disconnect(rdma->cm_id);
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I'm worried whether this change could break stuff:
>>>>>>>>>>>>>> The docs say for rdma_disconnect that it flushes any posted work
>>>>>>>>>>>>>> requests to the completion queue; so unless we wait for the event,
>>>>>>>>>>>>>> do we know the stuff has been flushed?  In the normal non-cancel
>>>>>>>>>>>>>> case I'm worried that means we could lose something.
>>>>>>>>>>>>>> (But I don't know the rdma/infiniband specs well enough to know
>>>>>>>>>>>>>> if it's really a problem.)
>>>>>>>>>>>>>
>>>>>>>>>>>>> The qemu_fclose function invokes qemu_fflush(f) before invoking
>>>>>>>>>>>>> f->ops->close, so I think it's safe for the normal migration case.
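
As a tiny self-contained sketch of the ordering being relied on here (the
HypoFile/HypoFileOps types are hypothetical stand-ins, not the real QEMUFile
code), the point is just that the flush runs before the transport's close
hook:

#include <stdio.h>

/* Hypothetical stand-ins for QEMUFile and its ops table; only the
 * flush-before-close ordering matters here. */
typedef struct {
    int (*close)(void *opaque);            /* e.g. the RDMA close hook */
} HypoFileOps;

typedef struct {
    const HypoFileOps *ops;
    void *opaque;
} HypoFile;

static void hypo_fflush(HypoFile *f)
{
    (void)f;
    printf("flush buffered migration data\n");  /* queued writes pushed out */
}

static int hypo_fclose(HypoFile *f)
{
    hypo_fflush(f);                        /* 1. flush first ...             */
    return f->ops->close(f->opaque);       /* 2. ... then run the close hook */
}

static int rdma_close_hook(void *opaque)
{
    (void)opaque;
    printf("rdma cleanup / disconnect\n"); /* runs only after the flush */
    return 0;
}

int main(void)
{
    const HypoFileOps ops = { .close = rdma_close_hook };
    HypoFile f = { .ops = &ops, .opaque = NULL };
    return hypo_fclose(&f);
}
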
>>>>>>>>>>>>>
>>>>>>>>>>>>> As for the root cause of why the RDMA_CM_EVENT_DISCONNECTED event
>>>>>>>>>>>>> is not received after rdma_disconnect, I have looped in
>>>>>>>>>>>>> Aviad Yehezkel <address@hidden>; Aviad will help us find the
>>>>>>>>>>>>> root cause.
>>>
>>>
>>> Regarding the previous discussion on that thread:
>>> rdma_disconnect can be invoked even without invoking ib_dereg_mr first.
>>>
>>> common teardown flow (a sketch with the corresponding library calls
>>> follows below):
>>>
>>> rdma_disconnect
>>>    <-- here resources like QPs are still alive and can DMA to RAM
>>> rdma_destroy_cm_id
>>>    <-- all resources are destroyed, but the MR can still be active
>>> rdma_dreg_mr
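
A minimal sketch of that flow with the public librdmacm/libibverbs calls
(rdma_destroy_id standing in for the rdma_destroy_cm_id step above and
ibv_dereg_mr for rdma_dreg_mr; rdma_destroy_qp is added because a QP created
with rdma_create_qp must be destroyed before its cm_id; error handling
omitted):

#include <rdma/rdma_cma.h>
#include <infiniband/verbs.h>

/* Teardown in the order described above. */
static void teardown(struct rdma_cm_id *id, struct ibv_mr *mr)
{
    rdma_disconnect(id);  /* QP still alive here; RDMA into RAM still possible */
    rdma_destroy_qp(id);  /* release the QP owned by the cm_id                 */
    rdma_destroy_id(id);  /* connection identifier is gone                     */
    ibv_dereg_mr(mr);     /* the MR may stay registered until this point       */
}
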
>>>
>>>
>>>
>>>>>>>>>>>>>> Dave
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>            trace_qemu_rdma_cleanup_disconnect();
>>>>>>>>>>>>>>>            rdma->connected = false;
>>>>>>>>>>>>>>>        }
>>>>>>>>>>>>>>> diff --git a/migration/trace-events b/migration/trace-events
>>>>>>>>>>>>>>> index d6be74b..64573ff 100644
>>>>>>>>>>>>>>> --- a/migration/trace-events
>>>>>>>>>>>>>>> +++ b/migration/trace-events
>>>>>>>>>>>>>>> @@ -125,7 +125,6 @@ qemu_rdma_accept_pin_state(bool pin) "%d"
>>>>>>>>>>>>>>>    qemu_rdma_accept_pin_verbsc(void *verbs) "Verbs context
>>>>>>>>>>>>>>> after
>>>>>>>>>>>>>>> listen: %p"
>>>>>>>>>>>>>>>    qemu_rdma_block_for_wrid_miss(const char *wcompstr, int
>>>>>>>>>>>>>>> wcomp, const char *gcompstr, uint64_t req) "A Wanted wrid %s
>>>>>>>>>>>>>>> (%d) but got %s
>>>>>>>>>>>>>>> (%" PRIu64 ")"
>>>>>>>>>>>>>>>    qemu_rdma_cleanup_disconnect(void) ""
>>>>>>>>>>>>>>> -qemu_rdma_cleanup_waiting_for_disconnect(void) ""
>>>>>>>>>>>>>>>    qemu_rdma_close(void) ""
>>>>>>>>>>>>>>>    qemu_rdma_connect_pin_all_requested(void) ""
>>>>>>>>>>>>>>>    qemu_rdma_connect_pin_all_outcome(bool pin) "%d"
>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>> 1.8.3.1
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>> --
>>>>>>>>>>>>>> Dr. David Alan Gilbert / address@hidden / Manchester, UK
>>>>>>>>>>>>
>>>>>>>>>>>> --
>>>>>>>>>>>> Dr. David Alan Gilbert / address@hidden / Manchester, UK
>>>>>>>>>>
>>>>>>>>>> --
>>>>>>>>>> Dr. David Alan Gilbert / address@hidden / Manchester, UK
>>>>>>>>
>>>>>>>> --
>>>>>>>> Dr. David Alan Gilbert / address@hidden / Manchester, UK
>>>>>>
>>>>>> --
>>>>>> Dr. David Alan Gilbert / address@hidden / Manchester, UK
>>>>
>>>> --
>>>> Dr. David Alan Gilbert / address@hidden / Manchester, UK
>>>
>>>
>


