help-cfengine

Re: cfservd thrushes, nodes fail to get anything


From: Mark Burgess
Subject: Re: cfservd thrushes, nodes fail to get anything
Date: Wed, 25 May 2005 12:14:17 +0200

I cannot see how this can occur. Any ideas from anyone else? There are
no loops. It is possible that this is internal to db.

Mark

On Sat, 2005-05-07 at 18:41 -0400, Yaroslav Halchenko wrote:
> I've found the reason, and it would probably be beneficial to adjust
> cfservd so that it doesn't get into such a situation again:
> 
> I had a leftover file 
> /tmp/__db.testDATABASEcache
> 
> so strace revealed an infinite loop of
> 
> 28731 stat64("/tmp/testDATABASEcache", 0xb7c57350) = -1 ENOENT (No such file or directory)
> 28731 open("/tmp/__db.testDATABASEcache", O_RDWR|O_CREAT|O_EXCL|O_LARGEFILE, 0644) = -1 EEXIST (File exists)
> 28731 open("/tmp/__db.testDATABASEcache", O_RDWR|O_CREAT|O_EXCL|O_LARGEFILE, 0644) = -1 EEXIST (File exists)
> 28731 open("/tmp/__db.testDATABASEcache", O_RDWR|O_CREAT|O_EXCL|O_LARGEFILE, 0644) = -1 EEXIST (File exists)
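
The trace above shows an exclusive create (O_CREAT|O_EXCL) failing with EEXIST over and over because a leftover file is already there. As an illustration only, here is a minimal Python sketch of that failure mode with a bounded retry instead of an endless one. It is not cfservd or db code; the function name `create_exclusive` and the `max_attempts` limit are hypothetical.

```python
import os
import tempfile


def create_exclusive(path, max_attempts=3):
    """Create `path` with O_CREAT|O_EXCL, giving up after `max_attempts` tries."""
    flags = os.O_RDWR | os.O_CREAT | os.O_EXCL
    for _ in range(max_attempts):
        try:
            return os.open(path, flags, 0o644)
        except FileExistsError:
            # EEXIST: a leftover file blocks exclusive creation. Retrying
            # without removing it cannot succeed, so the loop is bounded.
            continue
    raise FileExistsError(f"{path} still exists after {max_attempts} attempts")


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        stale = os.path.join(tmp, "__db.testDATABASEcache")
        open(stale, "w").close()      # simulate the leftover file
        try:
            create_exclusive(stale)
        except FileExistsError as exc:
            print("gave up:", exc)
        os.remove(stale)              # manual cleanup, as in the report
        os.close(create_exclusive(stale))
        print("created after removing the stale file")
```

With the stale file present the call fails after a fixed number of attempts; once the file is removed by hand, the same call succeeds.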
> 
> 
> Version installed  (debian unstable)
> cfengine2              2.1.14-1 
> 
> --
> Yarik
> 
> 
> On Sat, May 07, 2005 at 11:50:59AM -0400, Yaroslav Halchenko wrote:
> > Dear All,
> 
> > Yesterday one of the users filled up /tmp on a main node with junk, and it
> > rendered cfengine unusable. First it reported
> 
> > daemon.log:May  6 21:11:23 ravana cfservd[16657]:  Couldn't open checksum database /tmp/testDATABASEcache
> > daemon.log:May  6 21:11:23 ravana cfservd[16657]:  db_open: No space left on device
> 
> > and it seems that after that, whenever any node connects to it, cfservd
> > becomes extremely busy and then finally fails, with the following message
> > reported by the nodes
> 
> > cfengine:node20: Received signal 13 (SIGPIPE) while doing [no_active_lock]
> > cfengine:node20: Logical start time Fri May  6 23:51:10 2005
> > cfengine:node20: This sub-task started really at Fri May  6 23:51:10 2005
> 
> > or actually now for some reason without a node name
> 
> > cfengine:: Received signal 13 (SIGPIPE) while doing [pre-lock-state]
> > cfengine:: Logical start time Sat May  7 11:00:33 2005
> > cfengine:: This sub-task started really at Sat May  7 11:00:33 2005
> 
> > and then another stating refusal for copying
> 
> > cfengine:: Transmission refused or failed statting /etc/cfengine/inputs/CVS/Repository
> > Got:
> > cfengine:: Received signal 13 (SIGPIPE) while doing [lock.cfagent_conf.node2.copy.copy_3343]
> > cfengine:: Logical start time Sat May  7 04:30:29 2005
> > cfengine:: This sub-task started really at Sat May  7 04:30:29 2005
> 
> > I've tried restarting the cfengine components on both ends, but it doesn't help.
> > Running cfservd with -d2 gave the following while trying to run the update script
> > (copy /etc/cfengine/input files across the nodes into /etc/cfengine):
> 
> > ----------------------------------------
> > ...
> > Access privileges - match found
> > cfservd: Host node2.ravana.rutgers.edu granted access to /etc/cfengine/inputs/CVS/Root
> > Clocks were off by 0
> > StatFile(/etc/cfengine/inputs/CVS/Root)
> > OK: type=0
> >  mode=644
> >  lmode=0
> >  uid=0
> >  gid=0
> >  size=10
> >  atime=1115477605
> >  mtime=1067285389
> > Transaction Send[t 65][Packed text]
> > Attempting to send 73 bytes
> > SendSocketStream, sent 73
> > Transaction Send[t 3][Packed text]
> > Attempting to send 11 bytes
> > SendSocketStream, sent 11
> > RecvSocketStream(8)
> >     (Concatenated 8 from stream)
> > Transaction Receive [t 51][]
> > RecvSocketStream(51)
> >     (Concatenated 51 from stream)
> > Received: [MD5 /etc/cfengine/inputs/CVS/Root] on socket 5
> > CompareLocalChecksums(/etc/cfengine/inputs/CVS/Root,MD5=05e8d918529f204488a626792c4f8a6f)
> > ChecksumChanged: key /etc/cfengine/inputs/CVS/Root with data MD5=05e8d918529f204488a626792c4f8a6f
> 
> > <At this point it stalls for a minute or two, although cfservd is
> > busy the whole time>
> 
> > IPV4 address
> > sockaddr_ntop(10.0.0.2)
> > Obtained IP address of 10.0.0.2 on socket 7 from accept
> 
> > FuzzyItemIn(LIST,10.0.0.2)
> > Purging Old Connections...
> > Done purging
> 
> > FuzzyItemIn(LIST,10.0.0.2)
> > cfservd: Denying repeated connection from 10.0.0.2
> > ----------------------------------------
> 
> > From the client (cfagent) side it looks like
> 
> > ----------------------------------------
> > Compare binary sums on ravana:/etc/cfengine/inputs/CVS/Root & /var/lib/cfengine2/inputs/CVS/Root
> > Using network md5 checksum instead
> > ChecksumFile(m,/var/lib/cfengine2/inputs/CVS/Root)
> > Send digest of /var/lib/cfengine2/inputs/CVS/Root to server, MD5=05e8d918529f204488a626792c4f8a6f
> > Transaction Send[t 51][Packed text]
> > Attempting to send 59 bytes
> > SendSocketStream, sent 59
> > RecvSocketStream(8)
> > <STALLS HERE and I got bored waiting until it dies... maybe it never
> > dies this time>
> 
> > ----------------------------------------
> 
> > So here are the questions:
> 
> > 1. How can I fix the current situation?
> >    Clearly something is broken in the current state, so maybe I can
> >    clean out the cfengine state and start from a clean one. I wouldn't
> >    mind if it takes longer to run the first time ;-) Sure, I could
> >    completely reinstall and then it should work, I believe, but...
> 
> 
> > 2. What would be a nice policy to enforce over /tmp so that I don't
> > remove anything valuable (like ssh-agent sockets and other stuff
> > opened by running programs)? I'm thinking of something like: files and
> > directories that are large (>1M) should be forbidden if they are older
> > than an hour. I'm not sure if I can discard data solely on age, so
> > age+size sounds good to me.
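
The age+size idea in question 2 can be sketched as a dry run that only lists candidates. This is a rough Python illustration, not a cfengine policy: the function name `find_stale_large_files` and its parameters are hypothetical, "age" is assumed to mean time since last modification, and only regular files are considered (directories are not sized), so sockets such as those of ssh-agent are never reported.

```python
import os
import stat
import time


def find_stale_large_files(root, min_size=1024 * 1024, min_age=3600, now=None):
    """List regular files under `root` that are larger than `min_size` bytes
    and were last modified more than `min_age` seconds ago. Deletes nothing."""
    now = time.time() if now is None else now
    matches = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                st = os.lstat(path)   # lstat: do not follow symlinks
            except OSError:
                continue              # file vanished between listing and stat
            if not stat.S_ISREG(st.st_mode):
                continue              # skip sockets, FIFOs, symlinks, devices
            if st.st_size > min_size and now - st.st_mtime > min_age:
                matches.append(path)
    return sorted(matches)
```

Calling `find_stale_large_files("/tmp")` returns the paths that match both conditions; reviewing that list before removing anything keeps the policy safe to try out.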




