Re: [PATCH] libports: implement lockless management of threads


From: Justus Winter
Subject: Re: [PATCH] libports: implement lockless management of threads
Date: Tue, 12 Nov 2013 16:53:13 +0100
User-agent: alot/0.3.4

Quoting Neal H. Walfield (2013-11-11 22:02:46)
> Yes, this is what I was thinking of.

Awesome :)

> I recall there being type defs for appropriate atomic types.  If that
> is still the recommended approach, please update your patch
> appropriately.

Right. I knew next to nothing about the gcc atomic builtins, so I read
up on them in the gcc docs and wiki. I'll briefly document my findings.

According to [0], the __atomic* functions should be preferred over the
__sync* ones.

0: http://gcc.gnu.org/wiki/Atomic/GCCMM

According to [1], "GCC allows any integral scalar or pointer type that
is 1, 2, 4, or 8 bytes in length."

1: http://gcc.gnu.org/onlinedocs/gcc/_005f_005fatomic-Builtins.html
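
For illustration, here is a tiny sketch (not taken from the patch) of
what the builtins look like on a plain int, which at 4 bytes falls
within the supported sizes:

    /* Sketch only: an int counter updated with the __atomic* builtins
       instead of the older __sync* ones.  */
    #include <stdio.h>

    static int counter;

    int
    main (void)
    {
      /* Old style: __sync_fetch_and_add (&counter, 1);
         New style, with an explicit memory model argument:  */
      __atomic_add_fetch (&counter, 1, __ATOMIC_RELAXED);
      __atomic_add_fetch (&counter, 1, __ATOMIC_RELAXED);

      printf ("%d\n", __atomic_load_n (&counter, __ATOMIC_RELAXED));
      return 0;
    }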

> The most important thing, however, is ensuring that the semantics are
> preserved.

The __atomic* functions allow one to specify a memory model [2]. I chose
__ATOMIC_RELAXED. This model does not impose any happens-before
relation on the events; it only guarantees atomic access to the variable.

2: http://gcc.gnu.org/wiki/Atomic/GCCMM/AtomicSync
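
To make that concrete, here is a hypothetical example (not from
libports) of a counter for which relaxed ordering is enough, because
only the counter value itself is shared; the builtins make the
increments atomic, but they do not order any surrounding loads and
stores:

    /* Hypothetical example: two threads bump a shared counter with
       __ATOMIC_RELAXED.  The increments are atomic, so none are lost,
       but no happens-before relation is established for other data.  */
    #include <pthread.h>
    #include <stdio.h>

    static int requests;

    static void *
    worker (void *arg)
    {
      int i;
      for (i = 0; i < 100000; i++)
        __atomic_add_fetch (&requests, 1, __ATOMIC_RELAXED);
      return NULL;
    }

    int
    main (void)
    {
      pthread_t a, b;
      pthread_create (&a, NULL, worker, NULL);
      pthread_create (&b, NULL, worker, NULL);
      pthread_join (a, NULL);
      pthread_join (b, NULL);
      printf ("%d\n", requests);   /* always 200000 */
      return 0;
    }

(Build with -pthread, of course.)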

> That is, was the use of the values also protected by the lock?

It was. I believe that is okay, though, as the atomic operations will
ensure the consistency of the values retrieved by the different
threads: e.g. only one thread will see NREQTHREADS (tc) decremented to
zero and stick around, while the others may exit.
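
To illustrate the pattern I have in mind (a simplified sketch, not the
actual libports code, and may_exit is a made-up helper): the atomic
decrement also returns the value the calling thread saw, so exactly one
thread observes the counter reaching zero.

    /* Simplified sketch, not the actual libports code: each worker
       decrements a shared thread counter when it runs out of work.
       The decrement and the read of the new value are a single atomic
       step, so exactly one caller sees the counter reach 0.  */
    #include <stdio.h>

    static int nreqthreads = 4;   /* stand-in for NREQTHREADS (tc) */

    /* Returns nonzero if the calling thread may exit; the one caller
       that sees the counter drop to 0 must stick around.  */
    static int
    may_exit (void)
    {
      return __atomic_sub_fetch (&nreqthreads, 1, __ATOMIC_RELAXED) != 0;
    }

    int
    main (void)
    {
      int i;
      for (i = 0; i < 4; i++)
        printf ("worker %d: %s\n", i,
                may_exit () ? "may exit" : "sticks around");
      return 0;
    }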

> Does moving to atomic updates introduce a possible inconsistency?
>
> I haven't looked at the code.  Before this is checked in, however,
> someone should.

Yes please :)

I'm cautiously optimistic about this approach and the patch. I built a
Hurd package with this patch and I am currently using it on almost all
of my Hurd machines. They do seem to perform as expected.

The patch actually improves the performance of one micro benchmark I
run. The benchmark pipes(2) data from /dev/zero to /dev/null and
measures the throughput. If the data is piped byte-wise, then this is
a measurement of the RPC overhead. The general form of the test is

    dd if=/dev/zero bs=$bs count=$count 2>/dev/null \
        | pipebench -q >/dev/null

(pipebench is from the pipebench Debian package.)

Here are the results on the VIA Epia box (1.3GHz iirc) I mentioned
earlier [3]:

3: http://lists.debian.org/debian-hurd/2013/10/msg00069.html

--- without-lockless-benchmark.1384259790.log   2013-11-12 13:43:36.000000000 +0100
+++ with-lockless-benchmark.1384264893.log      2013-11-12 15:09:39.000000000 +0100
@@ -2,31 +2,31 @@
 Testing pipe throughput, blocksize 1 count 64k...
 Summary:
-Piped   64.00 kB in 00h00m10.63s:    6.02 kB/second
+Piped   64.00 kB in 00h00m06.76s:    9.46 kB/second

Here are the results of the same test on my Hurd installation running
in kvm on an 'Intel(R) Core(TM)2 Duo CPU L7500 @ 1.60GHz'. I extended
the test so that it creates more processes piping the data in parallel:

--- without-lockless-benchmark.1384264129.log      2013-11-12 14:57:13.000000000 +0100
+++ with-lockless-benchmark.1384266387.log      2013-11-12 15:31:17.000000000 +0100
@@ -3,16 +3,16 @@
 Stopping MTA: exim4_listener.
 Testing pipe throughput, blocksize 1 count 64k...
 Summary:
-Piped   64.00 kB in 00h00m30.81s:    2.07 kB/second
+Piped   64.00 kB in 00h00m26.26s:    2.43 kB/second
 Testing pipe throughput, blocksize 1 count 64k, 2 processes...
 Summary:
-Piped   64.00 kB in 00h00m45.11s:    1.41 kB/second
+Piped   64.00 kB in 00h00m41.79s:    1.53 kB/second
 Waiting for children...

So the situation seems to improve here as well. It strikes me as
strange, though, that the VIA embedded system running on bare metal
would outperform the kvm installation on the faster Intel CPU.

 Testing pipe throughput, blocksize 1 count 64k, 8 processes...
 Summary:
-Piped   64.00 kB in 00h01m57.18s:  559.00 B/second
+Piped   64.00 kB in 00h02m12.56s:  494.00 B/second
 Waiting for children...
 Testing pipe throughput, blocksize 1 count 64k, 32 processes...
 Summary:
-Piped   64.00 kB in 00h04m27.28s:  245.00 B/second
+Piped   64.00 kB in 00h01m00.39s:    1.05 kB/second
 Waiting for children...

I believe this to be an artifact of my testing methodology. I start
several dd processes, and some finish way earlier than others. I guess
that is to be expected in the absence of any fairness constraints in
msg/task scheduling decisions, but I don't know tbh.

Cheers,
Justus


