Re: Mach time device, or: I know why the network deadlocks!

On Fri, Apr 28, 2023 at 7:54 AM Sergey Bugaev <bugaevc@gmail.com> wrote:

Hello!

Here comes yet another bug description and a change proposal;
hopefully not too long this time. (Update after having written like
half of it: apparently it *is* going to be long.)

My system regularly experiences network deadlocks: the system itself
still works if I access it through the console, but all my SSH
sessions just hang, as do any other connections. This can happen soon
after startup (in like 20% of times), and later during run time at
arbitrary moments too; but in this second case it seems related to how
much network activity there is. This is one of the reasons preventing
me from running the glibc testsuite, since that spits out _a lot_ of
output, and will almost surely lead to a networking deadlock at some
point. I don't know why nobody else seems to be complaining about
this; based on the analysis below the root cause should be fairly
reproducible.

Shutting down the system normally in this case does not work, since
the dhcp client attempts to release its lease, but since the network
stack is deadlocked it never manages to. And since (on Debian
GNU/Hurd) the shutdown process is apparently synchronous and there are
no timeouts and sigkills in place, it just hangs there forever, with
no apparent way to even Ctrl-C it (which sort of makes sense since
it's not an interactive shell). If I approach it calmly, I do syncfs
--sync / (just in case) and then reboot-hurd, which works. If,
however, if I have already managed to type in 'reboot', I am left with
no choice but to forcefully power off the machine and risk corrupting
the file system; and the consequences range from minor complaints from
fsck on the next boot-up to the installation breaking so badly that it
does not boot.

So we can agree that this is slightly suboptimal.

Yesterday I got one of these hangs again, and decided to bite the
bullet & look into what's going on. There are two things that could be
locking up: either pfinet or netdde; quick inspection revealed that
pfinet is the guilty party. There is apparently a single global_lock
in pfinet; it makes sense that if something manages to hold it
indefinitely then everything else will hang. That something turned out
to be the timer thread (see pfinet/timer-emul.c:timer_function). It
grabs the lock after sleeping and proceeds to run the handlers, and
apparently never exits from that loop; so nothing else can make any
progress.

But why doesn't it exit the loop? Surely there's only a finite,
hopefully small, number of timers overall? Well, yes, but they
re-schedule themselves from inside the handlers. As an aside, whoever
wrote add_timer () in the same file obviously didn't care about
releasing the mach_thread_self () right; currently a reference is
leaked on each invocation (yay). It should either deallocate the right
or use hurd_thread_self ().

The offending function that reschedules itself over and over is
pfinet/linux-src/net/ipv4/tcp_timer.c:tcp_sltimer_handler. I found out
(which wasn't super easy when everything's <optimized out>, in
addition to me being unable to use my terminal emulator of choice and
having to squint at the Mach console which is tiny because it's not
getting appropriately scaled for my HiDPI screen [0]) that the 'now +
next' addition overflows and results in a small timestamp (number of
jiffies), which is then getting taken by timer-emul to be a timestamp
in the past, so it immediately runs the handler again, thus the
busy-loop.

[0]: https://gitlab.gnome.org/GNOME/gnome-boxes/-/issues/635

Why does it overflow? Because 'now' is huge. 'now' is set from
'jiffies' at the beginning of tcp_sltimer_handler. 'jiffies' is
#defined to fetch_jiffies () from pfinet/mapped-time.h; which uses
maptime_read () to read the current time from the Mach time device,
and then subtracts 'root_jiffies' from it. 'root_jiffies' is filled
from the Mach time device on pfinet startup. So clearly 'j -
root_jiffies' itself overflows, resulting in a huge 'jiffies' value; I
added an 'assert (j >= root_jiffies)' and sure enough, it fails at
about the same rate that the network would lock up.

In conclusion, what happens is pfinet reads a timestamp from the Mach
time device, and then reads another timestamp, and the second one
happens to be smaller than the first one, which confuses it into
busy-looping and deadlocking.

I then grepped gnumach to see how the time device is set up and
realized that the time device is supposed to be CLOCK_REALTIME, not
something like CLOCK_BOOTTIME or CLOCK_MONOTONIC. So it only makes
sense that if *something* adjusts the clock with host_set_time (), the
time as seen through the time device will experience a discontinuous
jump, possibly into the past. And that *something* appears to be
/usr/sbin/ntpdate.

This never happened to me. Probably because I don't have NTP installed

in my Hurd box?

Now, how do we fix this?

Clearly pfinet needs to implement 'jiffies' differently. A stopgap
solution would be to detect the case of j < root_jiffies and just set
root_jiffies = j; this will probably break all sort of timeouts, but
at least will solve the busy-loop / deadlock.

But also, Mach needs to expose a CLOCK_BOOTTIME-like clock in addition
to the CLOCK_REALTIME one. There is already internal tracking for
this, see kern/mach_clock.c:clock_boottime_offset; it only needs to be
exposed to the userland.

Since we have already broken ABI compatibility with the time device on
other Mach versions (if it ever existed) by adding the *64 variants to
the struct mapped_time_value, maybe we could just place the
clock_boottime_offset into the same struct? Then the userspace would
be able to pick it up and use it to calculate the boottime-relative
time.

User land can still use the old time structures but can now be updated to read the

*64 variants.

We could even attempt to use this to implement some support for
CLOCK_MONOTONIC in glibc, which I imagine would be a welcome addition.
Maybe backcompat is not that much of a concern either: the kernel
allocates a whole page for the mapped_time_value and zero-fills it; if
we see that clock_boottime_offset is zero that clearly means that
either the system was booted up on 1 January, 1970, or the Mach
version in use doesn't support clock_boottime_offset.

What do you think?

I think the approach makes sense to me. We can update the clock_boottime_offset

in the structure whenever it is updated by the kernel and then provide a

maptime_read(clockid_t clock, struct mapped_time_value *mtime, struct timeval *tv)

routine to read from the mapped memory and return either a time using CLOCK_REALTIME

or CLOCK_MONOTONIC semantics.

We could also have an RPC host_get_time that is parameterized by clockid_t to have some

symmetry with the routine above.

Sergey

Flavio

From:	Flávio Cruz
Subject:	Re: Mach time device, or: I know why the network deadlocks!
Date:	Sat, 29 Apr 2023 00:39:33 -0400