bug-guix
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

bug#30365: Offloading sometimes hangs


From: Ludovic Courtès
Subject: bug#30365: Offloading sometimes hangs
Date: Wed, 07 Feb 2018 14:42:22 +0100
User-agent: Gnus/5.13 (Gnus v5.13) Emacs/25.3 (gnu/linux)

address@hidden (Ludovic Courtès) skribis:

> On the build machine side, the guile process that forwards data between
> the sshd and guix-daemon¹ is stuck on:
>
>   read(0, …)

> ¹ 
> https://git.savannah.gnu.org/cgit/guix.git/tree/guix/ssh.scm?id=0362e5820ab6a1eb8eaf33bc47e592857c25f765#n102

I’ve been able to strace this Guile process.  When “things work
normally”, its loop does things like this:

--8<---------------cut here---------------start------------->8---
20456 10:52:17 read(9, "gmlo\0\0\0\0\201\0\0\0\0\0\0\0make[2]: Entering 
directory '/tmp/guix-build-make-4.2.1.drv-0/make-4.2.1/doc'\nmake[2]: Nothing 
to be done for 'install-exec-am'.\n\0\0\0\0\0\0\0", 65536) = 152
20456 10:52:17 write(1, "gmlo\0\0\0\0\201\0\0\0\0\0\0\0make[2]: Entering 
directory '/tmp/guix-build-make-4.2.1.drv-0/make-4.2.1/doc'\nmake[2]: Nothing 
to be done for 'install-exec-am'.\n\0\0\0\0\0\0\0", 152) = 152
20456 10:52:17 rt_sigprocmask(SIG_BLOCK, NULL,  <unfinished ...>
20456 10:52:17 <... rt_sigprocmask resumed> [], 8) = 0
20456 10:52:17 pselect6(10, [0 3 9], [], [], NULL, NULL <unfinished ...>
20456 10:52:17 <... pselect6 resumed> ) = 1 (in [9])
20456 10:52:17 read(9, "gmlo\0\0\0\0\223\0\0\0\0\0\0\0 
/gnu/store/m82nil6vykdywv9wmpn2k0rnq5zkw61z-bootstrap-binaries-0/bin/mkdir -p 
'/gnu/store/az7flk1jg6aaz1g9s7gpgpy0xc0igg7v-make-4.2.1/share/info'\n\0\0\0\0\0",
 65536) = 168
20456 10:52:17 write(1, "gmlo\0\0\0\0\223\0\0\0\0\0\0\0 
/gnu/store/m82nil6vykdywv9wmpn2k0rnq5zkw61z-bootstrap-binaries-0/bin/mkdir -p 
'/gnu/store/az7flk1jg6aaz1g9s7gpgpy0xc0igg7v-make-4.2.1/share/info'\n\0\0\0\0\0",
 168) = 168
20456 10:52:17 rt_sigprocmask(SIG_BLOCK, NULL,  <unfinished ...>
20456 10:52:17 <... rt_sigprocmask resumed> [], 8) = 0
20456 10:52:17 pselect6(10, [0 3 9], [], [], NULL, NULL <unfinished ...>
20456 10:52:17 <... pselect6 resumed> ) = 1 (in [9])
20456 10:52:17 read(9, "gmlo\0\0\0\0\304\0\0\0\0\0\0\0 
/gnu/store/m82nil6vykdywv9wmpn2k0rnq5zkw61z-bootstrap-binaries-0/bin/install -c 
-m 644 ./make.info ./make.info-1 ./make.info-2 
'/gnu/store/az7flk1jg6aaz1g9s7gpgpy0xc0igg7v-make-4.2.1/share/info'\n\0\0\0\0", 
65536) = 216
20456 10:52:17 write(1, "gmlo\0\0\0\0\304\0\0\0\0\0\0\0 
/gnu/store/m82nil6vykdywv9wmpn2k0rnq5zkw61z-bootstrap-binaries-0/bin/install -c 
-m 644 ./make.info ./make.info-1 ./make.info-2 
'/gnu/store/az7flk1jg6aaz1g9s7gpgpy0xc0igg7v-make-4.2.1/share/info'\n\0\0\0\0", 
216) = 216
--8<---------------cut here---------------end--------------->8---

Here FD 9 is the guix-daemon socket and 1 is stdout, which is connected
to the remote ‘guix offload’ process.

We then see a slightly different sequence leading to the read(0, …) call
that hangs:

--8<---------------cut here---------------start------------->8---
20456 10:52:18 pselect6(10, [0 3 9], [], [], NULL, NULL <unfinished ...>
20456 10:52:18 <... pselect6 resumed> ) = 1 (in [9])
20456 10:52:18 read(9, "gmlo\0\0\0\0005\0\0\0\0\0\0\0phase `validate-runpath' 
succeeded after 0.0 seconds\n\0\0\0", 65536) = 72
20456 10:52:18 write(1, "gmlo\0\0\0\0005\0\0\0\0\0\0\0phase `validate-runpath' 
succeeded after 0.0 seconds\n\0\0\0", 72) = 72
20456 10:52:18 clock_gettime(CLOCK_PROCESS_CPUTIME_ID, {tv_sec=0, 
tv_nsec=261868144}) = 0
20456 10:52:18 rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0
20456 10:52:18 madvise(0xffffb0025000, 4096, MADV_DONTNEED <unfinished ...>
20456 10:52:18 <... madvise resumed> )  = 0
20456 10:52:18 write(6, "\0", 1 <unfinished ...>
20456 10:52:18 <... write resumed> )    = 1
20456 10:52:18 clock_gettime(CLOCK_PROCESS_CPUTIME_ID,  <unfinished ...>
20456 10:52:18 <... clock_gettime resumed> {tv_sec=0, tv_nsec=268384968}) = 0
20456 10:52:18 read(0,  <unfinished ...>
--8<---------------cut here---------------end--------------->8---

Notice that there’s no pselect showing up here!

The two clock_gettime correspond to ‘get_internal_run_time’ in libguile,
and ‘madvise’ corresponds to ‘return_unused_stack_to_os’; IOW, this part
corresponds to garbage collection.

So what we have here is that the Scheme procedure ‘select’ returned
stdin as “ready for reading”.  How did that happen?  I believe this is
due to <https://bugs.gnu.org/30368>: ‘scm_i_prepare_to_wait_on_fd’
returns 1, so ‘select’ returns EINTR but it does so without clearing the
FD sets.

Ludo’.





reply via email to

[Prev in Thread] Current Thread [Next in Thread]