Re: [Qemu-devel] postcopy migration hangs while loading virtio state
From: Dr. David Alan Gilbert
Subject: Re: [Qemu-devel] postcopy migration hangs while loading virtio state
Date: Wed, 26 Apr 2017 12:01:45 +0100
User-agent: Mutt/1.8.0 (2017-02-23)
* Christian Borntraeger (address@hidden) wrote:
> On 04/24/2017 09:12 PM, Dr. David Alan Gilbert wrote:
> > * Christian Borntraeger (address@hidden) wrote:
> >> On 04/24/2017 04:35 PM, Dr. David Alan Gilbert wrote:
> >>> * Christian Borntraeger (address@hidden) wrote:
> >>>> On 04/24/2017 12:53 PM, Dr. David Alan Gilbert wrote:
> >>>>> * Christian Borntraeger (address@hidden) wrote:
> >>>>>> David, Juan,
> >>>>>>
> >>>>>> I can trigger a hang of postcopy migration (when I do it early) so
> >>>>>> that both sides are in paused state. Looks like one thread is still
> >>>>>> loading vmstates for virtio and this load accesses guest memory and
> >>>>>> triggers a userfault.
> >>>>>
> >>>>> It's perfectly legal for the destination to cause a userfault as it
> >>>>> loads the virtio queue - the virtio queue should be being loaded by
> >>>>> the main thread from the buffer while the 'listening' thread is
> >>>>> waiting for the incoming page data.
> >>>>>
> >>>>> Can you turn on the following tracing please:
> >>>>> destination: postcopy_ram_fault_thread_request postcopy_place_page
> >>>>>              postcopy_place_page_zero
> >>>>>
> >>>>> source: migrate_handle_rp_req_pages ram_save_queue_pages
> >>>>>
> >>>>> You should see:
> >>>>>   virtio does the access
> >>>>>   userfault generates a fault
> >>>>>   postcopy_ram_fault_thread_request sends the request to the source
> >>>>>
> >>>>> the source sees migrate_handle_rp_req_pages queues it with
> >>>>> ram_save_queue_pages
> >>>>>
> >>>>> the destination sees the page arrive and postcopy_place_page or
> >>>>> postcopy_place_page_zero
> >>>>>
> >>>>> some of that might disappear if the page was already on its way.
> >>>>
> >>>>
> >>>> the last events on the source are
> >>>> [..]
> >>>> address@hidden:postcopy_place_page host=0x3ff92246000
> >>>> address@hidden:postcopy_place_page host=0x3ff92247000
> >>>
> >>> How do you see those on the source???
> >>
> >> It was the previous migrate (I did it in a loop, back and forth).
> >> The problem happens on the migrate back.
> >
> > OK, good.
> >
> >>>
> >>>> address@hidden:migrate_handle_rp_req_pages in s390.ram at 41d9000 len
> >>>> 1000
> >>>> address@hidden:ram_save_queue_pages s390.ram: start: 41d9000 len: 100
> >>>>
> >>>
> >>> Is that a typo? I'm expecting those two 'len' fields to be the same?
> >>
> >> Yes, it's a cut'n'paste "miss the last byte"
> >
> > Good.
> >
> > Ok, before digging further, is this a new bug or does it happen on older
> > QEMU? Have you got a rune to reproduce it on x86?
> >
> >>
> >>>
> >>>> On the target I see lots of
> >>>>
> >>>> address@hidden:postcopy_place_page_zero host=0x3ff9befa000
> >>>> address@hidden:postcopy_place_page host=0x3ff9befb000
> >>>> address@hidden:postcopy_place_page host=0x3ff9befc000
> >>>> address@hidden:postcopy_place_page host=0x3ff9befd000
> >>>> address@hidden:postcopy_place_page_zero host=0x3ff9befe000
> >>>> address@hidden:postcopy_place_page host=0x3ff9beff000
> >>>>
> >>>> So we have about 2 seconds of traffic going on after that request;
> >>>> I assume it's precopy-related.
> >>>>
> >>>> Looking back on the target history there was
> >>>> address@hidden:postcopy_ram_fault_thread_request Request for
> >>>> HVA=3ff618d9000 rb=s390.ram offset=41d9000
> >>>>
> >>>> In fact it seems to be the first and only request:
> >>>>
> >>>> # cat /tmp/test0.trace | grep -v postcopy_place_page
> >>>>
> >>>> address@hidden:postcopy_ram_fault_thread_request Request for
> >>>> HVA=3ff618d9000 rb=s390.ram offset=41d9000
> >>>
> >>> OK, does the HVA there correspond to the address that your virtio device
> >>> is blocking on?
> >>
> >> yes it is the same page.
> >>
> >>
> >>> (or at least the start of the page)
> >>> Do you see a postcopy_place_page with a host= the same HVA ?
> >>
> >> no
> >
> > Hmm, that's bad.
> > The flow is:
> >   Precopy
> >     (a) Send pages
> >   Switch to postcopy
> >     (b) Send discards for pages that have been changed after (a)
> >   Postcopy
> >     (c) Keep sending pages until we run out
> >     (d) But send pages we're asked for first
> >
> > (d) might be ignored if the source thinks the page was already sent or not
> > dirty.
> >
> > So we need to figure out:
> > 1) If the source sent the pages during (a)
> > 2) If the source discarded it during (b)
> > 3) If it sent it again any time in c/d
> > 4) If it ignored the request from d
> >
> >
> > So please turn on the traces:
> > get_queued_page_not_dirty should help with (4)
> > get_queued_page should help with (4)
> > ram_discard_range should help with (2)
> > loadvm_postcopy_ram_handle_discard should help with (2)
> > qemu_savevm_send_postcopy_ram_discard should help with (2)
> >
> > add near the top of ram_save_page in ram.c:
> > fprintf(stderr, "%s: %s:%zx\n", __func__, block->idstr, (size_t)offset);
> >
> > should help with 1, 3
>
>
> source:
> address@hidden:qemu_savevm_send_postcopy_ram_discard s390.ram: 12d
> address@hidden:qemu_savevm_send_postcopy_ram_discard s390.ram: 12d
> address@hidden:qemu_savevm_send_postcopy_ram_discard s390.ram: 12d
> address@hidden:qemu_savevm_send_postcopy_ram_discard s390.ram: 12d
> address@hidden:qemu_savevm_send_postcopy_ram_discard s390.ram: 12d
> address@hidden:qemu_savevm_send_postcopy_ram_discard s390.ram: 12d
> address@hidden:qemu_savevm_send_postcopy_ram_discard s390.ram: 12d
> address@hidden:qemu_savevm_send_postcopy_ram_discard s390.ram: 12d
> address@hidden:qemu_savevm_send_postcopy_ram_discard s390.ram: 12d
> address@hidden:qemu_savevm_send_postcopy_ram_discard s390.ram: 12d
> address@hidden:qemu_savevm_send_postcopy_ram_discard s390.ram: 12d
> address@hidden:qemu_savevm_send_postcopy_ram_discard s390.ram: 12d
> address@hidden:qemu_savevm_send_postcopy_ram_discard s390.ram: 12d
> address@hidden:qemu_savevm_send_postcopy_ram_discard s390.ram: 12d
> address@hidden:qemu_savevm_send_postcopy_ram_discard s390.ram: 12d
> address@hidden:qemu_savevm_send_postcopy_ram_discard s390.ram: 12d
> address@hidden:qemu_savevm_send_postcopy_ram_discard s390.ram: 12d
> address@hidden:qemu_savevm_send_postcopy_ram_discard s390.ram: 12d
> address@hidden:qemu_savevm_send_postcopy_ram_discard s390.ram: 12d
> address@hidden:qemu_savevm_send_postcopy_ram_discard s390.ram: 12d
> address@hidden:qemu_savevm_send_postcopy_ram_discard s390.ram: 12d
> address@hidden:qemu_savevm_send_postcopy_ram_discard s390.ram: 12d
> address@hidden:qemu_savevm_send_postcopy_ram_discard s390.ram: 12d
> address@hidden:qemu_savevm_send_postcopy_ram_discard s390.ram: 12d
> address@hidden:qemu_savevm_send_postcopy_ram_discard s390.ram: 12d
> address@hidden:qemu_savevm_send_postcopy_ram_discard s390.ram: 9d
> address@hidden:migrate_handle_rp_req_pages in s390.ram at 3c6b000 len 1000
> address@hidden:ram_save_queue_pages s390.ram: start: 3c6b000 len: 1000
> address@hidden:get_queued_page_not_dirty s390.ram/3c6b000 ram_addr=3c6b000
> (sent=0)
OK, the source is ignoring the request because it thinks the page has
already been sent and hasn't been redirtied.
> I also get
> ram_save_page: s390.ram:3c6b000
and you said in your other message that happens before the discards, which
means it's precopy, and that's why you don't see the 'place_page'.
> target (without postcopy_place)
> address@hidden:ram_discard_range s390.ram: start: 0 5dc00000
That's start of day I think, so we can ignore that (I think!)
> address@hidden:loadvm_postcopy_ram_handle_discard
....
> address@hidden:ram_discard_range s390.ram: start: 3bf6000 1000
> address@hidden:ram_discard_range s390.ram: start: 3bfd000 1000
> address@hidden:ram_discard_range s390.ram: start: 3c06000 4000
> address@hidden:loadvm_postcopy_ram_handle_discard
> address@hidden:ram_discard_range s390.ram: start: 3c79000 3000
Well, it's not discarded the 3c6b000 page, so the page we received
during precopy should still be there, and it shouldn't fault.
In migration/ram.c in ram_load, before the line:
switch (flags & ~RAM_SAVE_FLAG_CONTINUE) {
can you add:
fprintf(stderr, "%s: addr=%zx flags=%zx host=%p\n", __func__, addr, flags,
host);
that should show the incoming blocks during precopy and we should be able
to see the page coming in.
Hmm, I have a theory: if the flags field has bit 1 set, i.e.
RAM_SAVE_FLAG_COMPRESS,
then try changing ram_handle_compressed to always do the memset.
Dave
--
Dr. David Alan Gilbert / address@hidden / Manchester, UK