bug-guix

bug#41625: Sporadic guix-offload crashes due to EOF errors


From: Maxim Cournoyer
Subject: bug#41625: Sporadic guix-offload crashes due to EOF errors
Date: Fri, 24 Sep 2021 00:53:22 -0400
User-agent: Gnus/5.13 (Gnus v5.13) Emacs/27.2 (gnu/linux)

Hello!

Ludovic Courtès <ludo@gnu.org> writes:

> Hi,
>
> Maxim Cournoyer <maxim.cournoyer@gmail.com> skribis:
>
>> Now that I have root access to overdrive1, I could strace the sshd
>> process (I just did 'strace -p340', noting the process of sshd displayed
>> with 'herd status sshd'):
>>
>> pselect6(87, [3 4], NULL, NULL, NULL, NULL) = 1 (in [3])
>> accept(3, {sa_family=AF_INET, sin_port=htons(33262), sin_addr=inet_addr("66.158.152.121")}, [128->16]) = 5
>> fcntl(5, F_GETFL)                       = 0x2 (flags O_RDWR)
>> pipe2([6, 7], 0)                        = 0
>> socketpair(AF_UNIX, SOCK_STREAM, 0, [8, 9]) = 0
>> clone(child_stack=NULL, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0xffff8e0ef0e0) = 644
>> close(7)                                = 0
>> close(9)                                = 0
>> write(8, "\0\0\1\245\0", 5)             = 5
>> write(8, "\0\0\1\234\nPort 22\nPermitRootLogin no\n"..., 420) = 420
>> close(8)                                = 0
>> close(5)                                = 0
>> getpid()                                = 340
>> getpid()                                = 340
>> getpid()                                = 340
>> getpid()                                = 340
>> getpid()                                = 340
>> getpid()                                = 340
>> getpid()                                = 340
>> pselect6(87, [3 4 6], NULL, NULL, NULL, NULL) = 1 (in [6])
>> read(6, "\0", 1)                        = 1
>> pselect6(87, [3 4 6], NULL, NULL, NULL, NULL) = 1 (in [6])
>> read(6, "", 1)                          = 0
>
> OK, so it looks as if the client disconnected right away.  Hard to tell
> exactly what happened.  :-/  Perhaps turning on libssh debugging on
> the client side could help (by uncommenting “#:log-verbosity 'protocol”
> in (guix ssh)).

I was able to better understand the problem after encountering it on
another low-power ARM board.  The guile-ssh/libssh timeout causes a
channel read to return EOF.

I have one example here where it hangs at the
"(inferior-eval '(use-modules (gnu)) result)" step; Guix runs for about
1m30s, apparently loading all the package modules.  Perhaps my
GUILE_COMPILED_PATH is not set correctly and things are slower than
expected.  Not sure.

But what happens is that there is no output before the 15 s timeout that
we set for the SSH session elapses, so libssh's ssh_channel_read returns
0, which is the same value it returns when it encounters EOF.  Guile's
peek_byte_or_eof turns that zero into an EOF.  I've shared my analysis on
the guile-ssh bug tracker [0].
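
To make the ambiguity concrete, here is a minimal Python sketch.  It does
not use libssh or guile-ssh: channel_read and this peek_byte_or_eof are
hypothetical stand-ins built on a plain socket pair, written only to mimic
the behaviour described above, where a timed-out read and a closed
connection both come back as zero bytes.

```python
import socket

EOF = object()  # stand-in for Guile's end-of-file object


def channel_read(sock, count, timeout):
    """Hypothetical stand-in for a channel read with a timeout.

    Returns b"" both when the peer closed the connection and when
    nothing arrived within `timeout` seconds.
    """
    sock.settimeout(timeout)
    try:
        return sock.recv(count)
    except socket.timeout:
        return b""  # the timeout is collapsed into "0 bytes read"


def peek_byte_or_eof(sock, timeout):
    """Treat a zero-length read as end of file (assumed behaviour)."""
    data = channel_read(sock, 1, timeout)
    return EOF if len(data) == 0 else data[0]


def demo():
    # Case 1: the peer is alive but silent for longer than the timeout.
    a, b = socket.socketpair()
    slow = peek_byte_or_eof(a, timeout=0.1)

    # Case 2: the peer really closed its end.
    c, d = socket.socketpair()
    d.close()
    closed = peek_byte_or_eof(c, timeout=0.1)

    for s in (a, b, c):
        s.close()
    return slow, closed
```

In both cases demo() gets the same EOF marker back, so the caller cannot
tell a slow peer from one that went away.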

So information is lost at libssh's level, which is not so nice.  Knowing
exactly how that EOF comes into the picture, we can handle it and produce
a better diagnostic though.  I'll try reworking my original patch in that
direction.
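
As a rough illustration of what such a diagnostic could look like (this is
not the patch, only a sketch assuming the caller knows the configured
timeout and can time the read), one could compare how long the read
blocked with the timeout.  classify_eof and its slack parameter are
hypothetical names, and the result is a heuristic guess, not a certainty.

```python
def classify_eof(started, finished, timeout, slack=0.05):
    """Guess why a read reported EOF, based on how long it blocked.

    `started` and `finished` are timestamps in seconds (for example from
    time.monotonic()); `timeout` is the configured read timeout.
    """
    elapsed = finished - started
    if elapsed >= timeout - slack:
        # The read blocked for about the whole timeout: most likely the
        # peer was just slow, not gone.
        return "probable timeout after %.1f s (limit %.1f s)" % (elapsed, timeout)
    return "peer closed the connection after %.1f s" % elapsed
```

For instance, classify_eof(0.0, 15.0, 15.0) reports a probable timeout,
while classify_eof(0.0, 0.2, 15.0) reports a closed connection.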

Thanks,

Maxim

[0]  https://github.com/artyom-poptsov/guile-ssh/issues/29




