[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
Re: [PATCH] monitor: flush messages on abort
|
From: |
Steven Sistare |
|
Subject: |
Re: [PATCH] monitor: flush messages on abort |
|
Date: |
Wed, 15 Nov 2023 10:52:43 -0500 |
|
User-agent: |
Mozilla Thunderbird |
On 11/15/2023 3:41 AM, Markus Armbruster wrote:
> Daniel P. Berrangé <berrange@redhat.com> writes:
>
>> On Fri, Nov 03, 2023 at 03:51:00PM -0400, Steven Sistare wrote:
>>> On 11/3/2023 1:33 PM, Daniel P. Berrangé wrote:
>>>> On Fri, Nov 03, 2023 at 09:01:29AM -0700, Steve Sistare wrote:
>>>>> Buffered monitor output is lost when abort() is called. The pattern
>>>>> error_report() followed by abort() occurs about 60 times, so valuable
>>>>> information is being lost when the abort is called in the context of a
>>>>> monitor command.
>>>>
>>>> I'm curious, was there a particular abort() scenario that you hit ?
>>>
>>> Yes, while tweaking the suspended state, and forgetting to add transitions:
>>>
>>> error_report("invalid runstate transition: '%s' -> '%s'",
>>> abort();
>>>
>>> But I have previously hit this for other errors.
>
> Can you provide a reproducer?
I sometimes hit this when developing new code. I do not have a reproducer for
upstream
branches. The patch is aimed at helping developers, not users.
>>>> For some crude statistics:
>>>>
>>>> $ for i in abort return exit goto ; do echo -n "$i: " ; git grep --after
>>>> 1 error_report | grep $i | wc -l ; done
>>>> abort: 47
>>>> return: 512
>>>> exit: 458
>>>> goto: 177
>>>>
>>>> to me those numbers say that calling "abort()" after error_report
>>>> should be considered a bug, and we can blanket replace all the
>>>> abort() calls with exit(EXIT_FAILURE), and thus avoid the need to
>>>> special case flushing the monitor.
>>>
>>> And presumably add an atexit handler to flush the monitor ala monitor_abort.
>>> AFAICT currently no destructor is called for the monitor at exit time.
>>
>> The HMP monitor flushes at each newline, and exit() will take care of
>> flushing stdout, so I don't think there's anything else needed.
>
> Correct.
>
>>>> Also I think there's a decent case to be made for error_report()
>>>> to call monitor_flush().
>
> No, because the messages printed by error_report() all end in newline,
> and printing a newline to a monitor triggers monitor_flush_locked().
>
>>> A good start, but that would not help for monitors with skip_flush=true,
>>> which
>>> need to format the buffered string in a json response, which is the case I
>>> tripped over.
>>
>> 'skip_flush' is only set to 'true' when using a QMP monitor and invoking
>> "hmp-monitor-command".
>
> Correct.
>
>> In such a case, the error message needs to be built into a JSON error
>> reply and sent over the socket. Your patch doesn't help this case
>> since you've just printed to stderr. I don't think it is reasonable
>> to expect QMP monitors to send replies on SIG_ABRT anyway. So I don't
>> think the skip_flush=true scenario is a problem to be concerned with.
>>
>>>>> To fix, install a SIGABRT handler to flush the monitor buffer to stderr.
>>>>>
>>>>> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
>>>>> ---
>>>>> monitor/monitor.c | 38 ++++++++++++++++++++++++++++++++++++++
>>>>> 1 file changed, 38 insertions(+)
>>>>>
>>>>> diff --git a/monitor/monitor.c b/monitor/monitor.c
>>>>> index dc352f9..65dace0 100644
>>>>> --- a/monitor/monitor.c
>>>>> +++ b/monitor/monitor.c
>>>>> @@ -701,6 +701,43 @@ void monitor_cleanup(void)
>>>>> }
>>>>> }
>>>>>
>>>>> +#ifdef CONFIG_LINUX
>>>>> +
>>>>> +static void monitor_abort(int signal, siginfo_t *info, void *c)
>>>>> +{
>>>>> + Monitor *mon = monitor_cur();
>>>>> +
>>>>> + if (!mon || qemu_mutex_trylock(&mon->mon_lock)) {
>>>>> + return;
>>>>> + }
>>>>> +
>>>>> + if (mon->outbuf && mon->outbuf->len) {
>>>>> + fputs("SIGABRT received: ", stderr);
>>>>> + fputs(mon->outbuf->str, stderr);
>>>>> + if (mon->outbuf->str[mon->outbuf->len - 1] != '\n') {
>>>>> + fputc('\n', stderr);
>>>>> + }
>>>>> + }
>>>>> +
>>>>> + qemu_mutex_unlock(&mon->mon_lock);
>>>>
>>>> The SIGABRT handling does not only fire in response to abort()
>>>> calls, but also in response to bad memory scenarios, so we have
>>>> to be careful what we do in signal handlers.
>>>>
>>>> In particular using mutexes in signal handlers is a big red
>>>> flag generally. Mutex APIs are not declare async signal
>>>> safe, so this code is technically a POSIX compliance
>>>> violation.
>
> "Technically a POSIX compliance violation" sounds like something only
> pedants would care about. It's actually a recipe for deadlocks and
> crashes.
>
>>> Righto. I would need to mask all signals in the sigaction to be on the
>>> safe(r) side.
>>
>> This is still doomed, because SIGABRT could fire while 'mon_lock' is
>> already held, and so this code would deadlock trying to acquire the
>> lock.
>
> Yup.
>
> There is no way to make async signal unsafe code safe.
The handler calls trylock, not lock. If it cannot get the lock, it bails.
However, I suppose pthread_mutex_trylock could in theory take and briefly hold
some internal lock as part of its implementation.
- Steve
>>>> So I think we'd be safer just eliminating the explicit abort()
>>>> calls and adding monitor_flush call to error_report.
>>>
>>> I like adding a handler because it is future proof. No need to play
>>> whack-a-mole when
>>> developers re-introduce abort() calls in the future. A minor benefit is I
>>> would not
>>> need ack's from 50 maintainers to change 50 call sites from abort to exit.
>>
>> That's a bit of a crazy exaggeration. THe aborts() don't cover 50 different
>> subsystems, and we don't require explicit acks from every subsystem
>> maintainer
>> for trivial cleanups like this.
>>
>>> A slight risk of the exit solution is that something bad happened at the
>>> call site, so
>>> qemu state can no longer be trusted. Calling abort immediately may be
>>> safer than calling
>>> exit which will call the existing atexit handlers and could have side
>>> effects.
>>
>> If that was a real problem, then we already face it because we have
>> ~500 places already calling exit() and only 50 calling abort().
>>
>>> A third option is to define qemu_abort() which flushes the monitor, and
>>> replaces all abort
>>> calls. That avoids async-signal-mutex hand wringing, but is still subject
>>> to whack-a-mole.
>>>
>>> So: atexit, signal handler, or qemu_abort? I will go with your preference.
>>
>> Just replace abort -> exit.
>
> I disagree.
>
> Use exit(1) for fatal errors.
>
> Use abort() for programming errors (a.k.a. bugs).
>
>> I'm not seeing a need for an atexit handler on top.
>
> I'm not yet seeing a need for anything. A reproducer may change that.
>
- [PATCH] monitor: flush messages on abort, Steve Sistare, 2023/11/03
- Re: [PATCH] monitor: flush messages on abort, Daniel P . Berrangé, 2023/11/03
- Re: [PATCH] monitor: flush messages on abort, Steven Sistare, 2023/11/03
- Re: [PATCH] monitor: flush messages on abort, Steven Sistare, 2023/11/03
- Re: [PATCH] monitor: flush messages on abort, Daniel P . Berrangé, 2023/11/06
- Re: [PATCH] monitor: flush messages on abort, Markus Armbruster, 2023/11/15
- Re: [PATCH] monitor: flush messages on abort,
Steven Sistare <=
- Re: [PATCH] monitor: flush messages on abort, Markus Armbruster, 2023/11/15
- Re: [PATCH] monitor: flush messages on abort, Steven Sistare, 2023/11/15
- Re: [PATCH] monitor: flush messages on abort, Markus Armbruster, 2023/11/16
- Re: [PATCH] monitor: flush messages on abort, Steven Sistare, 2023/11/16
- Re: [PATCH] monitor: flush messages on abort, Steven Sistare, 2023/11/15
- Re: [PATCH] monitor: flush messages on abort, Markus Armbruster, 2023/11/15
- Re: [PATCH] monitor: flush messages on abort, Daniel P . Berrangé, 2023/11/16