guix-patches
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

[bug#30637] [WIP] shepherd: Poll every 0.5s to find dead forked services


From: Ludovic Courtès
Subject: [bug#30637] [WIP] shepherd: Poll every 0.5s to find dead forked services
Date: Sat, 03 Mar 2018 16:21:07 +0100
User-agent: Gnus/5.13 (Gnus v5.13) Emacs/25.3 (gnu/linux)

Hi Carlo,

Overall LGTM!  It’s a long reply though, but that’s because there are
lots of details to pay attention to in this Unix quagmire.  :-)

Carlo Zancanaro <address@hidden> skribis:

> I've re-written my patch, and it's attached in two commits. The first
> one adds the necessary calls to prctl, and the second adds the
> fallback to polling.

If possible I would prefer a commit that only adds prctl, and the next
one would actually use it.  I find it clearer and more convenient if we
need to bisect or revert.

> On Fri, Mar 02 2018, Ludovic Courtès wrote:
>> The ‘prctl’ procedure itself should simply throw to 'system-error on
>> GNU/Hurd. But then, in (shepherd), we could add the polling thing
>> when (not (string-contains %host-type "linux")).
>>
>> WDYT?
>
> I don't like the idea of doing this based on the host type. In my
> patch I've done it based on whether the prctl call succeeded.

Right, I actually agree with feature-based checks.  :-)

More on that inline below.

> The fallback code still fails in the guix build environment (as my
> previous patch did), but when it's using prctl it works properly. This
> means that a build on Linux pre-3.4, or on Hurd, will fail, which
> probably isn't acceptable given that shepherd is a hard dependency for
> starting a GuixSD system. As far as I can tell the test fails because
> the processes stick around as zombies,

If they’re zombies, that means nobody called waitpid(2).  Presumably the
polling code would need to do that.

So I suppose ‘check-for-dead-services’ should do something like:

          (when (integer? running)
            (catch 'system-error
              (lambda ()
                (match (waitpid* running WNOHANG)
                  (#f #f)  ;uh, not a PID?
                  ((0 . _) #f) ;ditto?
                  ((pid . _)
                   (local-output "PID ~a (~a) is dead" running (canonical-name 
service))
                   (respawn-service service))))
              (lambda args
                (or (= ECHILD (system-error-errno args))  ;wrong PID?
                    (= EPERM (system-error-errno args))   ;not a child
                    (apply throw args)))))

Does that make sense?  Please check waitpid(2) carefully though, because
it’s pretty gnarly and I might have forgotten or misinterpreted
something here.

>> We want to set the “reaper” of child processes to Shepherd itself,
>> so we must do that in child processes. The shepherd process cannot
>> be its own reaper I suppose.
>
> Reading the manpage, and then running some code, I think you're wrong
> about this. Using prctl with PR_SET_CHILD_SUBREAPER marks the calling
> process as a child subreaper. That means that any processes that are
> orphaned below the current process get reparented under the current
> process (or a closer child subreaper, if there's one further down). If
> there are no processes marked as child subreapers, then the orphaned
> process is reparented under pid 1. We should only need to call prctl
> in two places: when shepherd initially starts, or when we fork for
> daemonize.

Oh you’re right, sorry for the confusion!

> From 5f26da2ce6a26c8412368900987ac5438f3e70cd Mon Sep 17 00:00:00 2001
> From: Carlo Zancanaro <address@hidden>
> Date: Sat, 3 Mar 2018 17:26:05 +1100
> Subject: [PATCH 1/2] Handle forked process SIGCHLD signals
>
> * Makefile.am (TESTS): Add tests/forking-service.sh.
> * configure.ac: Detect and substitute PR_SET_CHILD_SUBREAPER.
> * modules/shepherd.scm: Set the child subreaper attribute of main shepherd
>   process (as long as we're not pid 1).
> * modules/shepherd/service.scm (root-service)[daemonize]: Set the child
>   subreaper attribute of newly forked shepherd process.
> * modules/shepherd/system.scm.in (PR_SET_CHILD_SUBREAPER): Add new variable
>   and export it.
>   (prctl): Add new procedure and export it.

[...]

> --- a/modules/shepherd.scm
> +++ b/modules/shepherd.scm
> @@ -50,6 +50,8 @@
>  ;; Main program.
>  (define (main . args)
>    (initialize-cli)
> +  (when (not (= 1 (getpid)))
> +    (catch-system-error (prctl PR_SET_CHILD_SUBREAPER 1)))

I think it’s a good idea to add a comment, like:

  ;; Register ourselves to get SIGCHLD when child processes terminate.
  ;; This is necessary for daemons for which we’d otherwise never get
  ;; SIGCHLD.

> +(define prctl
> +  (let ((proc (syscall->procedure long "prctl" (list int int))))

On GNU/Hurd libc doesn’t have a “prctl” symbol.  So you need something
like:

  (if (dynamic-func "prctl" (dynamic-link))
      (let ((proc …)) …)
      (lambda (process operation)
        ;; We’re running on an OS that lacks ‘prctl’, such as GNU/Hurd.
        (throw 'system-error "prctl" "~A" (list (strerror ENOSYS))
               (list ENOSYS))))

> +function cleanup() {

You need either () or “function” but not both (shells other than Bash
might complain…).

> +shepherd_pid="$(cat $pid)"

Likewise, we should use `foo` instead of $(foo) to suppose non-Bash
shells (there are several occurrences of $(foo) here.)

> From ec47fa189c7d47f1d9444d939b084850f0a7186c Mon Sep 17 00:00:00 2001
> From: Carlo Zancanaro <address@hidden>
> Date: Wed, 21 Feb 2018 22:57:59 +1100
> Subject: [PATCH 2/2] Poll every 0.5s to find dead forked services
>
> * modules/shepherd.scm (open-server-socket): Set socket to be
>   non-blocking.
>   (main): Use select with a timeout. If prctl failed when shepherd started
>   then call check-for-dead-services between connections/timeouts.
> * modules/shepherd/service.scm (fork+exec-command): Install handle-SIGCHLD as
>   signal handler.
>   (respawn-service): Separate logic for respawning services from handling
>   SIGCHLD.
>   (handle-SIGCHLD, check-for-dead-services): New exported procedures.
> * tests/basic.sh, tests/status-sexp.sh: Replace constant integers with
>   symbols.

[...]

> +  (define poll-services
> +    (if (= 1 (getpid))
> +        (lambda () #f)
> +        (catch 'system-error
> +          (lambda ()
> +            (prctl PR_SET_CHILD_SUBREAPER 1)
> +            (lambda () #f))
> +          (lambda (key . args)
> +            check-for-dead-services))))

Please add a comment in the handler saying that we resort to polling on
OSes that do not support ‘prctl’.

However, perhaps we should do:

  (lambda args
    (let ((errno (system-error-errno args)))
      (if (= ENOSYS errno)
          check-for-dead-services
          (apply throw args))))

so that important/unexpected errors are not silently ignored.

> +(define (respawn-service serv)
> +  (when serv

Please add a docstring and move ‘when’ to the caller.

> +(define (check-for-dead-services)

Docstring as well :-), and also a comment explaining that this is a last
resort for prctl-less OSes.

>  (register-services
>   (make <service>
>     #:provides '(test-loaded)
> -   #:start (const 42)
> +   #:start (const 'abc)

Perhaps with the ‘check-for-dead-services’ use of ‘waitpid’ I outlined
above we can even keep 42 here?

If not, we should add in shepherd.texi, under “Slots of services”, a few
words saying that when ‘running’ is an integer it is assumed to be a
PID.

Could you send an updated patch?

Thanks for working on this!

Ludo’.





reply via email to

[Prev in Thread] Current Thread [Next in Thread]