qemu-devel
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [Qemu-devel] [PATCH] Migration+TLS: Fix crash due to double cleanup


From: Dr. David Alan Gilbert
Subject: Re: [Qemu-devel] [PATCH] Migration+TLS: Fix crash due to double cleanup
Date: Tue, 1 May 2018 11:57:25 +0100
User-agent: Mutt/1.9.5 (2018-04-13)

* Daniel P. Berrangé (address@hidden) wrote:
> On Tue, May 01, 2018 at 11:00:35AM +0100, Daniel P. Berrangé wrote:
> > On Mon, Apr 30, 2018 at 07:59:43PM +0100, Dr. David Alan Gilbert (git) 
> > wrote:
> > > From: "Dr. David Alan Gilbert" <address@hidden>
> > > 
> > > During a TLS connect we see:
> > >   migration_channel_connect calls
> > >   migration_tls_channel_connect
> > >   (calls after TLS setup)
> > >   migration_channel_connect
> > > 
> > > My previous error handling fix made migration_channel_connect
> > > call migrate_fd_connect in all cases; unfortunately the above
> > > means it gets called twice and crashes doing double cleanup.
> > > 
> > > Fixes: 688a3dcba98
> > 
> > This fixes the crash, but we're still seeing error messages duplicated
> > 
> > (qemu) migrate_set_parameter tls-creds tls0
> > (qemu) migrate tcp:localhost:9000
> > qemu-system-x86_64: Certificate does not match the hostname localhost
> > qemu-system-x86_64: Certificate does not match the hostname localhost
> > 
> > git bisect points to 688a3dcba98 as the cause of these double
> > errors still.

Can you give me the full sequence to trigger that; I did try yesterday
and have failed to get that error out of the TLS code.

> FYI the stack traces look like this..
> 
> The first error message is printed in this context:
> 
> #0  0x0000555555bb6c20 in error_report (fmt=0x555555d0d77e "%s") at 
> util/qemu-error.c:273
> #1  0x0000555555bb5fa5 in error_report_err (err=0x55555749cb00) at 
> util/error.c:228
> #2  0x0000555555a6d1f8 in migrate_fd_cleanup (address@hidden) at 
> migration/migration.c:1106
> #3  0x0000555555a6e2aa in migrate_fd_connect (address@hidden, 
> error_in=0x55555768de80) at migration/migration.c:2387
> #4  0x0000555555a6f8e7 in migration_channel_connect (address@hidden, 
> address@hidden, address@hidden, error=<optimized out>) at 
> migration/channel.c:83
> #5  0x0000555555a6f2d6 in migration_tls_outgoing_handshake (task=<optimized 
> out>, opaque=0x5555569925a0) at migration/tls.c:124
> #6  0x0000555555b6d2b2 in qio_task_complete (address@hidden) at io/task.c:142
> #7  0x0000555555b68c74 in qio_channel_tls_handshake_task (address@hidden, 
> address@hidden) at io/channel-tls.c:171
> #8  0x0000555555b6975a in qio_channel_tls_handshake (address@hidden, 
> address@hidden <migration_tls_outgoing_handshake>, address@hidden, 
> address@hidden) at io/channel-tls.c:215
> #9  0x0000555555a6f6ac in migration_tls_channel_connect (address@hidden, 
> address@hidden, address@hidden "localhost", address@hidden) at 
> migration/tls.c:159
> #10 0x0000555555a6f967 in migration_channel_connect (s=0x5555569925a0, 
> address@hidden, hostname=0x555556bc0c30 "localhost", error=<optimized out>) 
> at migration/channel.c:73
> #11 0x0000555555a6e5f4 in socket_outgoing_migration (task=<optimized out>, 
> opaque=0x555556cc7de0) at migration/socket.c:85
> #12 0x0000555555b6d2b2 in qio_task_complete (address@hidden) at io/task.c:142
> #13 0x0000555555b6d3a2 in gio_task_thread_result (opaque=0x5555578eff50) at 
> io/task.c:88
> #14 0x00007ffff5da1577 in g_idle_dispatch () at /lib64/libglib-2.0.so.0
> #15 0x00007ffff5da4b77 in g_main_context_dispatch () at 
> /lib64/libglib-2.0.so.0
> #16 0x0000555555baf677 in glib_pollfds_poll () at util/main-loop.c:214
> #17 0x0000555555baf677 in os_host_main_loop_wait (timeout=<optimized out>) at 
> util/main-loop.c:261
> #18 0x0000555555baf677 in main_loop_wait (nonblocking=<optimized out>) at 
> util/main-loop.c:515
> #19 0x00005555557b9157 in main_loop () at vl.c:1935
> #20 0x00005555557b9157 in main (argc=<optimized out>, argv=<optimized out>, 
> envp=<optimized out>) at vl.c:4767

Right, that's certainly the one I'm expecting; my rule here is that
any connect path that might call migrate_fd_connect, *always* calls
migrate_fd_connect and in the error case gives it the error to report.
My original reason to do this was to make sure we clean up any
'cancelling's that are in flight - cancel+a broken connect could lead
to it getting stuck in cancelling.

> The second error message is printed in this context:
> 
> #0  0x0000555555bb6c20 in error_report (address@hidden "%s") at 
> util/qemu-error.c:273
> #1  0x0000555555904035 in hmp_migrate_status_cb (opaque=0x555556bc0ba0) at 
> hmp.c:1909
> #2  0x0000555555baefac in timerlist_run_timers (timer_list=0x555556a0d760) at 
> util/qemu-timer.c:536
> #3  0x0000555555baf1b7 in qemu_clock_run_timers (type=QEMU_CLOCK_REALTIME) at 
> util/qemu-timer.c:547
> #4  0x0000555555baf1b7 in qemu_clock_run_all_timers () at 
> util/qemu-timer.c:662
> #5  0x0000555555baf69a in main_loop_wait (nonblocking=<optimized out>) at 
> util/main-loop.c:521
> #6  0x00005555557b9157 in main_loop () at vl.c:1935
> #7  0x00005555557b9157 in main (argc=<optimized out>, argv=<optimized out>, 
> envp=<optimized out>) at vl.c:4767
> 
> 
> The second stack trace is the error reporting context that I added originally
> in
> 
>   commit d59ce6f34434bf47a9b26138c908650bf9a24be1
>   Author: Daniel P. Berrange <address@hidden>
>   Date:   Wed Apr 27 11:05:00 2016 +0100
> 
>     migration: add reporting of errors for outgoing migration
> 
> 
> So the first stack trace is the new duplicate.
>
> Which error reporting context is "better" though, I don't know ?


OK, I see why I didn't see this - I almost always use migrate -d  and
that timer only happens without the -d.


> My patch was based on the view that, although alot of code uses error_report,
> long term all migration would eventually need to be able to filter an
> 'Error *errp' back up the stack, so that we can pass it back to QMP / HMP via
> 'info migrate' / query-migrate. So I decided to leave the error_report_err
> call to the hmp.c code, as long term that's the only place that would need
> to print to the console.

I wanted it to land in the /var/log/libvirt/qemu/whatever so we could
see what happened.

Dave

> 
> Regards,
> Daniel
> -- 
> |: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
> |: https://libvirt.org         -o-            https://fstop138.berrange.com :|
> |: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|
--
Dr. David Alan Gilbert / address@hidden / Manchester, UK



reply via email to

[Prev in Thread] Current Thread [Next in Thread]