monit-dev

Re: SIGSEGV problem


From: Martin Pala
Subject: Re: SIGSEGV problem
Date: Fri, 15 Aug 2003 20:29:06 +0200
User-agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.4) Gecko/20030714 Debian/1.4-2

Jan-Henrik Haukeland wrote:

Christian Hopp <address@hidden> writes:

On Thu, 14 Aug 2003, Jan-Henrik Haukeland wrote:

I ran a fast test with efence and managed to reproduce the SIGSEGV (it
may be more). SIGSEGV is thrown in process/common.c:connectchild()
from this line:

 parent->children[parent->children_num - 1] = (struct myprocesstree *) child;


From my gdb/efence session:

 Program received signal SIGSEGV, Segmentation fault.
 [Switching to Thread 1024 (LWP 1269)]
 0x0805b340 in connectchild (parent=0x41143fa0, child=0x41144740)
     at process/common.c:232

 (gdb) p *parent->children
 Cannot access memory at address 0x41365fcc

 (gdb) p parent->children[parent->children_num - 1]
 Cannot access memory at address 0x41365ffc

I suspect it's caused by trying to access something outside the
array. Maybe Christian can debug this since it's his code :) I'm off to
bed, it's late.

Strange... I just had a look at the code... and it is IMHO impossible to
access memory which is not allocated at this position!

I do an xcalloc of parent->children_num pointers, so it has to
be possible to access the last one (parent->children_num - 1)... or? Or is
it being deleted while this happens... some kind of race condition???
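
For reference, a minimal sketch of the allocation pattern described above. This is not monit's actual connectchild(): struct node and connect_child are hypothetical names, and plain calloc stands in for xcalloc. It only illustrates that index children_num - 1 is valid when the array was sized to children_num entries and nothing else frees or replaces the array in between.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for the process tree node; field names follow the thread. */
struct node {
    struct node **children;
    int children_num;
};

/* Append child to parent->children, growing the array by one slot.
 * Returns 0 on success, -1 if allocation fails (parent is left unchanged). */
int connect_child(struct node *parent, struct node *child) {
    size_t new_num = (size_t)parent->children_num + 1;
    struct node **grown = calloc(new_num, sizeof *grown);
    if (grown == NULL)
        return -1;
    if (parent->children_num > 0)
        memcpy(grown, parent->children,
               (size_t)parent->children_num * sizeof *grown);
    free(parent->children);
    parent->children = grown;
    parent->children_num = (int)new_num;
    /* The last valid slot of the array that was just allocated. */
    parent->children[parent->children_num - 1] = child;
    return 0;
}
```

If a second thread or a signal handler can free or reallocate parent->children between the allocation and the final store, the same line can fault even though the indexing itself is correct.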

I think it must have been a race condition of some sort. The strange
thing is that I cannot reproduce the problem after I added the signal
block code. Maybe that was it and it is fixed!? Do any of you get any
more SIGSEGV now? Martin?

I did another test and the problem remains: 152 problem occurrences in 3848 attempts => error ratio 5.34%

I'm running Debian unstable (sid) with glibc-2.3.2 and gcc-3.3.1

My configuration:

---8<---
set daemon 5
set logfile syslog
set mailserver ms2.dkm.cz
set mail-format { from: address@hidden }
set httpd port 2812 and allow 127.0.0.1 use address 127.0.0.1

check slapd with pidfile /var/run/slapd.pid
start program = "/etc/init.d/slapd start"
stop program = "/etc/init.d/slapd stop"
if failed host 127.0.0.1 port 389 protocol ldap3 then restart
if cpu usage > 2% for 5 cycles then restart
group database
if 2 restarts within 2 cycles then timeout
mode active
---8<---

To replicate the problem it is sufficient to:
1.) stop slapd
2.) change the /etc/init.d/slapd startup script so that it is not able to start slapd successfully
3.) while true; do strace -f -o monit.strace.`date +%Y%m%d%S%N` ./monit -vc /etc/monitrc validate > monit.out.`date +%Y%m%d%S%N` 2>&1; done

You will quickly obtain a few occurrences of the problem.

As I wrote, it fails in wait_start (see the attached strace):

24065 stat64("XE^^G^H/run/slapd.pid", <unfinished ...>
24063 close(3 <unfinished ...>
24064 read(4, <unfinished ...>
24065 <... stat64 resumed> 0xbf7ff93c) = -1 ENOENT (No such file or directory)
24063 <... close resumed> ) = -1 EBADF (Bad file descriptor)
24064 <... read resumed> "address@hidden,@address@hidden"..., 148) = 148
24065 --- SIGSEGV (Segmentation fault) @ 0 (0) ---


(24065 is the wait_start thread)

As you can see, instead of /var/run/slapd.pid, the string referenced by s->path is garbled (in 5% of cases, see the error ratio above). Under normal conditions (95% of cases) s->path references correct data. With the above setup, whenever the error occurs it fails in exactly the same place.

As I wrote, when I inserted some primitive fprintf-based markers around the critical code, the problem did not occur at all. I think this is because these calls slowed monit down, so the (possible) race condition did not occur.

It is possible to trigger another kind of SIGSEGV when you:

1.) stop slapd
2.) change the /etc/init.d/slapd startup script so that it is not able to start slapd successfully
3.) echo 7777 > /var/run/slapd.pid #some non-existent pid
4.) while true; do strace -f -o monit.strace.`date +%Y%m%d%S%N` ./monit -vc /etc/monitrc validate > monit.out.`date +%Y%m%d%S%N` 2>&1; done

The result is similar but the place is different (though again the same every time) - see the strace output I've sent.


I tried to run gdb on the core:

unicorn:~/cvs/monit# gdb ./monit core.24065
GNU gdb 5.3.90_2003-08-01-cvs-debian
Copyright 2003 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are
welcome to change it and/or distribute copies of it under certain conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB. Type "show warranty" for details.
This GDB was configured as "i386-linux"...
Core was generated by `./monit -vc /etc/monitrc validate'.
Program terminated with signal 11, Segmentation fault.
Reading symbols from /lib/libpthread.so.0...done.
Loaded symbols for /lib/libpthread.so.0
Reading symbols from /lib/libresolv.so.2...done.
Loaded symbols for /lib/libresolv.so.2
Reading symbols from /lib/libnsl.so.1...done.
Loaded symbols for /lib/libnsl.so.1
Reading symbols from /usr/lib/i686/cmov/libssl.so.0.9.7...done.
Loaded symbols for /usr/lib/i686/cmov/libssl.so.0.9.7
Reading symbols from /usr/lib/i686/cmov/libcrypto.so.0.9.7...done.
Loaded symbols for /usr/lib/i686/cmov/libcrypto.so.0.9.7
Reading symbols from /lib/libc.so.6...done.
Loaded symbols for /lib/libc.so.6
Reading symbols from /lib/ld-linux.so.2...done.
Loaded symbols for /lib/ld-linux.so.2
Reading symbols from /lib/libdl.so.2...done.
Loaded symbols for /lib/libdl.so.2
Reading symbols from /lib/libnss_compat.so.2...done.
Loaded symbols for /lib/libnss_compat.so.2
Reading symbols from /lib/libnss_nis.so.2...done.
Loaded symbols for /lib/libnss_nis.so.2
Reading symbols from /lib/libnss_files.so.2...done.
Loaded symbols for /lib/libnss_files.so.2
Reading symbols from /lib/libnss_dns.so.2...done.
Loaded symbols for /lib/libnss_dns.so.2
#0 0x08053b48 in is_process_running (s=0x807cc58) at util.c:880
880 memset(s->procinfo, 0, sizeof *(s->procinfo));
(gdb) bt
#0 0x08053b48 in is_process_running (s=0x807cc58) at util.c:880
#1 0x0804ba8d in wait_start (service=0x807cc58) at control.c:415
#2 0x400258be in pthread_start_thread () from /lib/libpthread.so.0
#3 0x4027b217 in clone () from /lib/libc.so.6


From util.c:

int is_process_running(Service_T s) {
  pid_t pid;
  ASSERT(s);
  errno= 0;
  if((pid= get_pid(s->path))) {
    if(( getpgid(pid) > 0 ) || ( errno == EPERM ))
      return pid;
  }
  memset(s->procinfo, 0, sizeof *(s->procinfo));
  return FALSE;
}


=> It seems we need to take care of the non-thread-safe memset in is_process_running, which resets procinfo on every call (probably move it somewhere else?)
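
One possible way to guard the reset, sketched under assumptions: the types below are simplified stand-ins, and the per-service mutex and the reset_procinfo helper are hypothetical, not part of monit as quoted here. The idea is that every reader and writer of procinfo would take the same lock, and that the reset tolerates a procinfo that has not been allocated yet.

```c
#include <assert.h>
#include <pthread.h>
#include <string.h>

/* Hypothetical, simplified stand-ins; only the fields needed for the sketch. */
struct procinfo {
    int pid;
    int cpu_percent;
};

struct service {
    pthread_mutex_t lock;        /* assumed: guards procinfo */
    struct procinfo *procinfo;   /* may be NULL if not yet allocated */
};

/* Zero the cached process info while holding the service lock.
 * Does nothing if procinfo is NULL. */
void reset_procinfo(struct service *s) {
    pthread_mutex_lock(&s->lock);
    if (s->procinfo != NULL)
        memset(s->procinfo, 0, sizeof *s->procinfo);
    pthread_mutex_unlock(&s->lock);
}
```

A NULL check cannot detect a dangling pointer, so this only helps if whatever frees or reassigns procinfo also holds the lock and clears the pointer afterwards.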

Martin

Attachment: monit.strace.2003081513446770000.gz
Description: application/gzip

