guix-patches
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

[bug#30637] [WIP] shepherd: Poll every 0.5s to find dead forked services


From: Ludovic Courtès
Subject: [bug#30637] [WIP] shepherd: Poll every 0.5s to find dead forked services
Date: Sun, 04 Mar 2018 23:11:28 +0100
User-agent: Gnus/5.13 (Gnus v5.13) Emacs/25.3 (gnu/linux)

Hello,

Carlo Zancanaro <address@hidden> skribis:

> On Sat, Mar 03 2018, Ludovic Courtès wrote:
>> If they’re zombies, that means nobody called waitpid(2). Presumably
>> the
>> polling code would need to do that.
>>
>> So I suppose ‘check-for-dead-services’ should do something like:
>>
>> [ ... ]
>>
>> Does that make sense?  Please check waitpid(2) carefully though,
>> because
>> it’s pretty gnarly and I might have forgotten or misinterpreted
>> something here.
>
> Unfortunately we can't do that. We fall back to the polling approach
> to handle the fact that the processes that we care about aren't our
> children (hence we don't get SIGCHLD). The waitpid system call only
> waits for processes which are children of the calling process.

Oh right!

> I looked into the zombie problem a bit more, and I found what the
> problem actually is. In the build environment a guile process is
> running as pid 1 (the *-guile-builder script for that job). This guile
> process never handles SIGCHLD, and never calls wait/waitpid, so any
> orphaned processes become zombies. I tried modifying derivations.scm,
> but it wanted to rebuild a lot of things, so I gave up. I think we
> need to add something like this to the *-guile-builder script:
>
>  (sigaction SIGCHLD
>    (lambda ()
>      (let loop ()
>        (match (waitpid WAIT_ANY WNOHANG)
>          ((0 . _) #f)
>          ((pid . _) (loop))
>          (_ #f))))
>    SA_NOCLDSTOP)

Good catch.  We could add this in gnu-build-system.scm in core-updates,
though it’s no big deal anyway since these are throw-away environments.

Thoughts?

> From e529e4035eec147f448804dd10fdbca13a17f523 Mon Sep 17 00:00:00 2001
> From: Carlo Zancanaro <address@hidden>
> Date: Sun, 4 Mar 2018 07:01:30 +1100
> Subject: [PATCH 1/3] Add prctl syscall wrapper along with with
>  PR_SET_CHILD_SUBREAPER.
>
> * configure.ac: Detect and substitute PR_SET_CHILD_SUBREAPER.
> * modules/shepherd/system.scm.in (PR_SET_CHILD_SUBREAPER): Add new variable
>   and export it.
>   (prctl): Add new procedure and export it.

Applied with a copyright line for you.

> From b43c128d8a175a9a123eb7b1af465fb3747a5393 Mon Sep 17 00:00:00 2001
> From: Carlo Zancanaro <address@hidden>
> Date: Sun, 4 Mar 2018 07:46:13 +1100
> Subject: [PATCH 2/3] Handle forked process SIGCHLD signals
>
> * Makefile.am (TESTS): Add tests/forking-service.sh.
> * modules/shepherd.scm: Set the child subreaper attribute of main shepherd
>   process (as long as we're not pid 1).
> * modules/shepherd/service.scm (root-service)[daemonize]: Set the child
>   subreaper attribute of newly forked shepherd process.

Applied.  However tests/forking-service.sh was missing, so I took it
from the previous version of this patch and added it.  I also changed
the “Bashishms” that were in that file, as discussed earlier.  Let me
know if anything’s wrong!

> From 3d3c091660bbbd529af0058b0ba9b5ddbfc6b481 Mon Sep 17 00:00:00 2001
> From: Carlo Zancanaro <address@hidden>
> Date: Wed, 21 Feb 2018 22:57:59 +1100
> Subject: [PATCH 3/3] Poll every 0.5s to find dead forked services
>
> * modules/shepherd.scm (open-server-socket): Set socket to be
>   non-blocking.
>   (main): Use select with a timeout. If prctl failed when shepherd started
>   then call check-for-dead-services between connections/timeouts.
> * modules/shepherd/service.scm (fork+exec-command): Install handle-SIGCHLD as
>   signal handler.
>   (respawn-service): Separate logic for respawning services from handling
>   SIGCHLD.
>   (handle-SIGCHLD, check-for-dead-services): New exported procedures.
> * tests/basic.sh, tests/status-sexp.sh: Replace constant integers with
>   symbols.
> * doc/shepherd.texi (Slots of services): Add note about service running slot
>   being a process id.

[...]

> --- a/modules/shepherd.scm
> +++ b/modules/shepherd.scm
> @@ -42,6 +42,8 @@
>    (with-fluids ((%default-port-encoding "UTF-8"))
>      (let ((sock    (socket PF_UNIX SOCK_STREAM 0))
>            (address (make-socket-address AF_UNIX file-name)))
> +      (fcntl sock F_SETFL (logior O_NONBLOCK
> +                                  (fcntl sock F_GETFL)))
>        (bind sock address)
>        (listen sock 10)
>        sock)))
> @@ -49,14 +51,28 @@
>
>  ;; Main program.
>  (define (main . args)
> -  (initialize-cli)
> +  (define poll-services
> +    (if (= 1 (getpid))
> +        (lambda () #f) ;; If we're pid 1 then we don't need to set
> +                       ;; PR_SET_CHILD_SUBREAPER

[...]

> +            (match (select (list sock) (list) (list) 0.5)
> +              (((sock) _ _)
> +               (read-from sock))
> +              (_
> +               #f))
> +            (poll-services)

Here everyone ends up paying some overhead (the 0.5 second timeout),
which isn’t great.

How about something like:

  (define poll-services
    (and (not (= 1 (getpid)))
         …))

  (match (select (list sock) '() '() (if poll-services 0.5 0))
    …)

?

Thanks,
Ludo’.





reply via email to

[Prev in Thread] Current Thread [Next in Thread]