help-cfengine

cfservd thrashes, nodes fail to get anything


From: Yaroslav Halchenko
Subject: cfservd thrashes, nodes fail to get anything
Date: Sat, 7 May 2005 11:50:59 -0400
User-agent: Mutt/1.5.8i

Dear All,

Yesterday one of the users filled up /tmp on the main node with junk, which
rendered cfengine unusable. First it reported:

daemon.log:May  6 21:11:23 ravana cfservd[16657]:  Couldn't open checksum database /tmp/testDATABASEcache
daemon.log:May  6 21:11:23 ravana cfservd[16657]:  db_open: No space left on device

After that, it seems that whenever any node connects to it, cfservd
becomes extremely busy and finally fails, with the nodes reporting the
following message:

cfengine:node20: Received signal 13 (SIGPIPE) while doing [no_active_lock]
cfengine:node20: Logical start time Fri May  6 23:51:10 2005
cfengine:node20: This sub-task started really at Fri May  6 23:51:10 2005

or, as of now, for some reason without a node name:

cfengine:: Received signal 13 (SIGPIPE) while doing [pre-lock-state]
cfengine:: Logical start time Sat May  7 11:00:33 2005
cfengine:: This sub-task started really at Sat May  7 11:00:33 2005

and then another message reporting that a copy was refused:

cfengine:: Transmission refused or failed statting /etc/cfengine/inputs/CVS/Repository
Got:
cfengine:: Received signal 13 (SIGPIPE) while doing [lock.cfagent_conf.node2.copy.copy_3343]
cfengine:: Logical start time Sat May  7 04:30:29 2005
cfengine:: This sub-task started really at Sat May  7 04:30:29 2005

I've tried restarting the cfengine components on both ends; it doesn't help.
Running cfservd with -d2 while a node runs the update script (which copies
/etc/cfengine/input files across the nodes into /etc/cfengine) gave the
following:

----------------------------------------
...
Access privileges - match found
cfservd: Host node2.ravana.rutgers.edu granted access to /etc/cfengine/inputs/CVS/Root
Clocks were off by 0
StatFile(/etc/cfengine/inputs/CVS/Root)
OK: type=0
 mode=644
 lmode=0
 uid=0
 gid=0
 size=10
 atime=1115477605
 mtime=1067285389
Transaction Send[t 65][Packed text]
Attempting to send 73 bytes
SendSocketStream, sent 73
Transaction Send[t 3][Packed text]
Attempting to send 11 bytes
SendSocketStream, sent 11
RecvSocketStream(8)
    (Concatenated 8 from stream)
Transaction Receive [t 51][]
RecvSocketStream(51)
    (Concatenated 51 from stream)
Received: [MD5 /etc/cfengine/inputs/CVS/Root] on socket 5
CompareLocalChecksums(/etc/cfengine/inputs/CVS/Root,MD5=05e8d918529f204488a626792c4f8a6f)
ChecksumChanged: key /etc/cfengine/inputs/CVS/Root with data MD5=05e8d918529f204488a626792c4f8a6f

<At this point it stalls for a minute or two, although cfservd stays
busy>

IPV4 address
sockaddr_ntop(10.0.0.2)
Obtained IP address of 10.0.0.2 on socket 7 from accept

FuzzyItemIn(LIST,10.0.0.2)
Purging Old Connections...
Done purging

FuzzyItemIn(LIST,10.0.0.2)
cfservd: Denying repeated connection from 10.0.0.2
----------------------------------------

From the client (cfagent) side it looks like this:

----------------------------------------
Compare binary sums on ravana:/etc/cfengine/inputs/CVS/Root & /var/lib/cfengine2/inputs/CVS/Root
Using network md5 checksum instead
ChecksumFile(m,/var/lib/cfengine2/inputs/CVS/Root)
Send digest of /var/lib/cfengine2/inputs/CVS/Root to server, MD5=05e8d918529f204488a626792c4f8a6f
Transaction Send[t 51][Packed text]
Attempting to send 59 bytes
SendSocketStream, sent 59
RecvSocketStream(8)
<STALLS HERE, and I got bored waiting for it to die... maybe it never
dies this time>

----------------------------------------

So here are the questions:

1. How can I fix the current situation?
   Clearly something is broken in the current state, so maybe I can
   clean out the cfengine state and start from a clean one. I wouldn't
   mind if the first run takes longer ;-) Sure, I could completely
   reinstall, and I believe it would then work, but...
  
   
2. What would be a good policy to enforce on /tmp so that I don't
remove anything valuable (like ssh-agent sockets and other stuff
opened by running programs)? I'm thinking of something like this: files
and directories that are large (>1M) should be forbidden if they are
older than an hour. I'm not sure I can discard data solely on age, so
age+size sounds good to me.
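
To make the age+size idea concrete, here is a minimal sketch in Python
(not cfengine syntax). It only lists candidates and deletes nothing. The
names find_candidates, SIZE_LIMIT and AGE_LIMIT are hypothetical; taking
"1M" as 1024*1024 bytes and measuring age by modification time are
assumptions. It looks at regular files only, so sockets (such as those of
ssh-agent), FIFOs and symlinks are never reported, and directories are
not sized as a whole.

```python
import os
import stat
import time

SIZE_LIMIT = 1024 * 1024  # assumed meaning of ">1M": more than 1 MiB
AGE_LIMIT = 60 * 60       # one hour, in seconds


def find_candidates(root, size_limit=SIZE_LIMIT, age_limit=AGE_LIMIT, now=None):
    """Return sorted paths of regular files under root that are large AND old."""
    now = time.time() if now is None else now
    candidates = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                st = os.lstat(path)  # lstat: do not follow symlinks
            except OSError:
                continue  # file vanished or cannot be examined
            if not stat.S_ISREG(st.st_mode):
                continue  # skip sockets, FIFOs, symlinks, devices
            # Age is judged by mtime here (an assumption, see above).
            if st.st_size > size_limit and now - st.st_mtime > age_limit:
                candidates.append(path)
    return sorted(candidates)


if __name__ == "__main__":
    for path in find_candidates("/tmp"):
        print(path)
```

Reporting first and removing later (by hand or from a separate step)
seems the safer choice, since a large old file may still be held open by
a running program.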


-- 
Yaroslav Halchenko
Research Assistant, Psychology Department, Rutgers-Newark
Office: (973) 353-5440x263 | FWD: 82823 | Fax: (973) 353-1171
        101 Warren Str, Smith Hall, Rm 4-105, Newark NJ 07105
Student  Ph.D. @ CS Dept. NJIT



