
Re: [Qemu-devel] [PATCH RFC 0/2] Fix migration issues


From: Fei Li
Subject: Re: [Qemu-devel] [PATCH RFC 0/2] Fix migration issues
Date: Fri, 26 Oct 2018 20:59:26 +0800
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:52.0) Gecko/20100101 Thunderbird/52.9.1



On 10/25/2018 08:55 PM, Dr. David Alan Gilbert wrote:
* Fei Li (address@hidden) wrote:
Hi,
these two patches are to fix live migration issues. The first is
about multifd, and the second is to fix some error handling.

But I have a question about using multifd migration.
In our current code, when multifd is used during migration, if there
is an error before the destination receives all new channels (I mean
multifd_recv_new_channel(ioc)), the destination does not exit but
keeps waiting (Hang in recvmsg() in qio_channel_socket_readv) until
the source exits.

My question is about the state of the destination host if the migration
fails during this period. I did a test: after applying patch [1/2], if
multifd_new_send_channel_async() fails, the destination host hangs for
a while and then pops up a window saying
     "'QEMU (...) [stopped]' is not responding.
     You may choose to wait a short while for it to continue or force
     the application to quit entirely."
But after I close the window by clicking, the qemu on the dest still
hangs there until I explicitly kill the qemu on the source.
That sounds like the main thread is blocked for some reason?
Yes, the main thread on the dst keeps looping.
But I don't
normally use the window setup;  if you try with -nographic and can see
the HMP (or a QMP) monitor, can you see if the monitor still responds?

Thanks for the `-nographic` reminder. I observed an interesting phenomenon:
if I run `migrate -d tcp:ip_addr:port` before the guest's display comes up
(while the screen is still dark), there is no hang and the guest starts up
properly later. But if I start the live migration after the guest has fully
booted, that is, when I can already use the mouse inside the guest, the hang
occurs.
This holds when `-nographic` is used for both src and dst,
and also when it is used for only the src or only the dst.


The symptom of the hang is that the dst never seems to respond (I
waited three minutes), and the cursor just keeps blinking. Only after I
explicitly kill the src does the dst quit, as follows:
(Same result if gdb is not used in src)
src:
(qemu) ...
(qemu) q
(gdb) q
dst:
(qemu) Up to now, dst has received the 0 channel
Up to now, dst has received the 1 channel

(qemu)
(qemu)

To check the migration state in the src:
(qemu) info migrate
globals:
store-global-state: on
only-migratable: off
send-configuration: on
send-section-footer: on
decompress-error-check: on
capabilities: xbzrle: off rdma-pin-all: off auto-converge: off zero-blocks: off compress: off events: off postcopy-ram: off x-colo: off release-ram: off block: off return-path: off pause-before-switchover: off x-multifd: on dirty-bitmaps: off postcopy-blocktime: off late-block-activate: off
Migration status: setup /* I added some code to set the status to "failed", but it still does not work; see below for details */
total time: 0 milliseconds

I guess the source should proactively tell the dst about the failure and
disconnect from the source side, so I tried setting the above
"Migration status" to "failed" and calling qemu_fclose(s->to_dst_file)
when multifd_new_send_channel_async() fails.
(BTW: I even tried:
 if (s->vm_was_running) {   vm_start();   }   )
But the hang is still there.
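
For reference, here is a small socket-level sketch (plain POSIX C, not QEMU
code) of the idea of disconnecting from the source side: a reader only sees
EOF on the connection whose peer was actually closed. peer_has_closed() is a
hypothetical helper written for illustration. Whether this is what happens
between s->to_dst_file and the multifd channels is an assumption that has not
been checked against the QEMU source.

```c
#include <assert.h>
#include <errno.h>
#include <sys/types.h>
#include <sys/socket.h>

/*
 * Hypothetical helper, not a QEMU function.
 *
 * Peek at the socket without blocking and without consuming data.
 * Returns 1 if the peer has closed (or shut down) its sending side and
 * no data is pending, 0 if the connection is still open, -1 on error.
 */
int peer_has_closed(int fd)
{
    char byte;
    ssize_t n;

    assert(fd >= 0);

    do {
        n = recv(fd, &byte, 1, MSG_PEEK | MSG_DONTWAIT);
    } while (n < 0 && errno == EINTR);

    if (n == 0) {
        return 1;   /* orderly shutdown: a blocked reader would get EOF */
    }
    if (n > 0) {
        return 0;   /* data is pending, the peer is still talking */
    }
    if (errno == EAGAIN || errno == EWOULDBLOCK) {
        return 0;   /* open but idle: a blocking read would keep waiting */
    }
    return -1;
}
```

With two independent connections, shutting down the sender of the first
makes peer_has_closed() return 1 for it, while the second still returns 0,
so a reader blocked on the second connection would keep waiting.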
If it doesn't then try and get a backtrace.

The monitor really shouldn't block, so it would be interesting to see.

Dave
I set two breakpoints and got the following backtrace; I hope it helps. :)

Thread 1 "qemu-system-x86" hit Breakpoint 1, multifd_recv_new_channel (
    ioc=0x555557995af0) at /build/gitcode/qemu-build/migration/ram.c:1368
1368    {
(gdb) c
Continuing.

Thread 1 "qemu-system-x86" hit Breakpoint 2, qio_channel_socket_readv (
    ioc=0x555557995af0, iov=0x5555568777d0, niov=1, fds=0x0, nfds=0x0,
    errp=0x7fffffffdb38) at io/channel-socket.c:463
463    {
(gdb) n
464        QIOChannelSocket *sioc = QIO_CHANNEL_SOCKET(ioc);
(gdb)
......
483     retry:
(gdb)
484        ret = recvmsg(sioc->fd, &msg, sflags);
(gdb) bt
#0  qio_channel_socket_readv (ioc=0x555557995af0, iov=0x5555568777d0, niov=1,
    fds=0x0, nfds=0x0, errp=0x7fffffffdb38) at io/channel-socket.c:484
#1  0x0000555555d156c5 in qio_channel_readv_full (ioc=0x555557995af0,
    iov=0x5555568777d0, niov=1, fds=0x0, nfds=0x0, errp=0x7fffffffdb38)
    at io/channel.c:65
#2  0x0000555555d15b26 in qio_channel_readv (ioc=0x555557995af0,
    iov=0x5555568777d0, niov=1, errp=0x7fffffffdb38) at io/channel.c:197
#3  0x0000555555d15853 in qio_channel_readv_all_eof (ioc=0x555557995af0,
    iov=0x7fffffffda70, niov=1, errp=0x7fffffffdb38) at io/channel.c:106
#4  0x0000555555d1595c in qio_channel_readv_all (ioc=0x555557995af0,
    iov=0x7fffffffda70, niov=1, errp=0x7fffffffdb38) at io/channel.c:142
#5  0x0000555555d15d0c in qio_channel_read_all (ioc=0x555557995af0,
    buf=0x7fffffffdad0 "\340\"zVUU", buflen=25, errp=0x7fffffffdb38)
    at io/channel.c:246
#6  0x000055555587695c in multifd_recv_initial_packet (c=0x555557995af0,
    errp=0x7fffffffdb38) at /build/gitcode/qemu-build/migration/ram.c:653
#7  0x00005555558788fb in multifd_recv_new_channel (ioc=0x555557995af0)
    at /build/gitcode/qemu-build/migration/ram.c:1374
#8  0x0000555555bc9978 in migration_ioc_process_incoming (ioc=0x555557995af0)
    at migration/migration.c:573
#9  0x0000555555bd0c69 in migration_channel_process_incoming (ioc=0x555557995af0)
    at migration/channel.c:47
#10 0x0000555555bcf7e8 in socket_accept_incoming_migration (
    listener=0x5555578dcae0, cioc=0x555557995af0, opaque=0x0)
    at migration/socket.c:166
#11 0x0000555555d2051f in qio_net_listener_channel_func (ioc=0x5555579c7180,
    condition=G_IO_IN, opaque=0x5555578dcae0) at io/net-listener.c:53
#12 0x0000555555d1c0a2 in qio_channel_fd_source_dispatch (source=0x5555568d5970,
    callback=0x555555d20473 <qio_net_listener_channel_func>,
    user_data=0x5555578dcae0) at io/channel-watch.c:84
#13 0x00007ffff6353dc5 in g_main_context_dispatch ()
   from /usr/lib64/libglib-2.0.so.0
#14 0x0000555555d7d1ad in glib_pollfds_poll () at util/main-loop.c:215
#15 0x0000555555d7d227 in os_host_main_loop_wait (timeout=0) at util/main-loop.c:238
#16 0x0000555555d7d2e0 in main_loop_wait (nonblocking=0) at util/main-loop.c:497
#17 0x00005555559cd679 in main_loop () at vl.c:1884
#18 0x00005555559d4f1e in main (argc=32, argv=0x7fffffffe0b8, envp=0x7fffffffe1c0)
    at vl.c:4618
(gdb) n

Thread 1 "qemu-system-x86" received signal SIGINT, Interrupt.
0x00007ffff5606f64 in recvmsg () from /lib64/libpthread.so.0
(gdb) c
Continuing.

After I enter the `n` above, the dst just hangs here, apparently waiting for recvmsg(sioc->fd, &msg, sflags) to return. Even when I later press ctrl+c to kill it, the dst still hangs.
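
To make the blocking behaviour concrete, here is a minimal sketch in plain
POSIX C (not QEMU code, and not a proposed patch) of a read that gives up
after a timeout instead of waiting in the receive call forever.
read_all_with_timeout() and READ_TIMED_OUT are hypothetical names introduced
only for this illustration.

```c
#include <assert.h>
#include <errno.h>
#include <poll.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/socket.h>

/* Hypothetical return code for "no data arrived in time". */
#define READ_TIMED_OUT (-2)

/*
 * Hypothetical helper, not a QEMU function.
 *
 * Try to read exactly len bytes from fd, waiting at most timeout_ms
 * for each chunk. Returns len on success, a short count (possibly 0)
 * if the peer closed the connection first, READ_TIMED_OUT if the peer
 * stayed silent, and -1 on any other error.
 */
ssize_t read_all_with_timeout(int fd, void *buf, size_t len, int timeout_ms)
{
    unsigned char *p = buf;
    size_t done = 0;

    assert(buf != NULL);

    while (done < len) {
        struct pollfd pfd = { .fd = fd, .events = POLLIN };
        int ready = poll(&pfd, 1, timeout_ms);

        if (ready < 0) {
            if (errno == EINTR) {
                continue;           /* interrupted by a signal, retry */
            }
            return -1;
        }
        if (ready == 0) {
            return READ_TIMED_OUT;  /* silent peer: report it, do not hang */
        }

        ssize_t n = recv(fd, p + done, len - done, 0);
        if (n < 0) {
            if (errno == EINTR || errno == EAGAIN) {
                continue;
            }
            return -1;
        }
        if (n == 0) {
            break;                  /* EOF: the peer closed the connection */
        }
        done += (size_t)n;
    }
    return (ssize_t)done;
}
```

Called with a 25-byte buffer (the buflen shown in frame #5) on a connection
whose peer never sends anything, this returns READ_TIMED_OUT after the
timeout, whereas a plain blocking read would stay in the receive call as in
the backtrace above.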

Have a nice day, thanks
Fei

The source host keeps running as expected, but I guess the hang
phenomenon in the dest is not right.
Would someone kindly give some suggestions on this? Thanks a lot.


Fei Li (2):
   migration: fix the multifd code
   migration: fix some error handling

  migration/migration.c    |  5 +----
  migration/postcopy-ram.c |  3 +++
  migration/ram.c          | 33 +++++++++++++++++++++++----------
  migration/ram.h          |  2 +-
  4 files changed, 28 insertions(+), 15 deletions(-)

--
2.13.7

--
Dr. David Alan Gilbert / address@hidden / Manchester, UK




