freetype-devel

[ft-devel] Fw: On atomic ops and text stack thread-safety


From: Werner LEMBERG
Subject: [ft-devel] Fw: On atomic ops and text stack thread-safety
Date: Thu, 02 Aug 2018 08:07:50 +0200 (CEST)

Behdad says this is of (future) importance to FreeType also, so I'm
forwarding his interesting message on his behalf.


    Werner

--- Begin Message ---
Subject: On atomic ops and text stack thread-safety
Date: Wed, 1 Aug 2018 22:16:29 -0700
Ok let's see if I can explain this succinctly.

For what I'm doing, performance is definitely of utmost importance, but only while maintaining correctness; portability has also been a huge part of it.

There are several issues at play.  Let's discuss them in three parts:

  1. Atomic operations in presence of multiple threads,
  2. Memory ordering in regard to compiler and CPU optimizations,
  3. Memory / cache coherency in presence of multiple CPUs.


== Atomic operations ==

By default, read and write operations on properly-aligned char, short, int, and long are atomic on all current CPUs.  C11 and C++11 guarantee that, IIRC.  What we mean by atomic is that, e.g., if we have "static int x = 0" and one thread sets it to 42, other threads reading the same variable will only ever see 0 or 42, never any other value.  That's atomicity of read / write (also known as load / store).
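
For example, C11's <stdatomic.h> spells out this guarantee; a minimal sketch:

  #include <stdatomic.h>

  static _Atomic int x = 0;

  /* One thread stores 42.  Atomicity of the store means no
     reader can ever observe a half-written value. */
  void writer (void)
  {
    atomic_store (&x, 42);
  }

  /* Any other thread reading x sees either 0 or 42, nothing
     else. */
  int reader (void)
  {
    return atomic_load (&x);
  }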

Now, when we get to more involved operations, namely, "read-modify-write", atomicity is not guaranteed.  Take FT_Reference_Library:

  FT_EXPORT_DEF( FT_Error )
  FT_Reference_Library( FT_Library  library )
  {
    if ( !library )
      return FT_THROW( Invalid_Library_Handle );
 
    library->refcount++;
 
    return FT_Err_Ok;
  }
 
As long as there's only one thread of execution, this code is fine and correct.  The compiler and CPU make sure that what gets executed has the same semantics as what you wrote.  But what if there are multiple threads involved?  Things get out of hand and the programmer needs to intervene.  In this case, "refcount++" is a read-modify-write operation.  That is, in machine code / pseudo-assembler, it involves:

  load r, refcount
  inc r, 1
  store refcount, r

Now, what happens if the first thread calling this function performs the load, reading, e.g., a refcount of 1, but before performing the rest of the operations it is pre-empted, and a second thread runs the same code, also reading a refcount of 1, incrementing it, and storing 2?  Now the first thread resumes, increments r to 2, and stores it into refcount.  What just happened is that we started with a refcount of 1, and two threads each incremented it by one, but as a result we got 2 stored in refcount instead of 3.
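
This lost-update race is easy to demonstrate (a self-contained sketch, assuming a libc that ships C11 <threads.h>; the loop counts are arbitrary):

  #include <stdio.h>
  #include <threads.h>

  static int refcount = 0;   /* plain, non-atomic */

  static int bump (void *arg)
  {
    (void) arg;
    for (int i = 0; i < 1000000; i++)
      refcount++;            /* racy load-inc-store */
    return 0;
  }

  int main (void)
  {
    thrd_t a, b;

    thrd_create (&a, bump, NULL);
    thrd_create (&b, bump, NULL);
    thrd_join (a, NULL);
    thrd_join (b, NULL);

    /* Typically prints something less than 2000000;
       increments got lost exactly as described above. */
    printf ("%d\n", refcount);
    return 0;
  }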

That's the first issue.  To resolve it, we need an atomic "fetch_and_add" operation, i.e. one that performs the "load, inc, store" sequence atomically, without another thread interrupting.  That's exactly the kind of thing that atomic operations in CPUs, compilers, and standard libraries provide.
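
With C11 atomics, for example, the refcount increment becomes a single indivisible operation (a sketch only; FreeType's actual library object is of course not declared this way):

  #include <stdatomic.h>

  static _Atomic int refcount = 1;

  /* The load, increment, and store are performed as one
     indivisible operation; concurrent callers can never lose
     an increment. */
  void reference (void)
  {
    atomic_fetch_add (&refcount, 1);
  }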

You can read more in the Linux kernel's documentation:

  https://github.com/torvalds/linux/blob/master/Documentation/core-api/atomic_ops.rst

== Memory ordering and optimizations ==

Imagine this pseudo code:

static int x = 0;
static bool ready = false;

One thread running:

  x = expensive_computation_producing_42 ();
  ready = true;

Other thread:

  if (ready)
    use (x);

Now, as discussed, we already know that reads and writes of int and bool types *are* atomic.  So *that* is not the issue here.  You would expect, however, that if the second thread gets into the body of the "if", the value of x it reads must be 42.  That's not always what will happen, though, because of legitimate compiler and CPU optimizations.

In particular, the compiler "is within its rights" to reorder the assignments of x and ready in the first thread.  The CPU is within its rights to execute them out of order, or in parallel.  These freedoms / rights allow for tons of optimizations / parallelism.  They are allowed because to single-threaded programs they don't make any observable difference: the program's results are always the same.  But if you have another thread reading those variables, this can wreak havoc.  What's worse, the second thread's compiler / CPU is also within its rights to, e.g., read "x" before / in parallel with reading "ready", because to a single-threaded program that makes no visible difference either, but allows for faster code.

Imagine the code above being transformed into machine code as:

Thread 1: 

  ready = true;
  x = expensive_computation_producing_42 ();

Other thread:

  r = x;
  if (ready)
    use (r);

Obviously there's a race condition there now.  We might read an old value of x and use it because "ready" is true.  That's a problem.

To fix this problem, we use what are called memory barriers.  There are many types of them, offering different guarantees.  For now, let's just discuss what's known as a full memory barrier, or simply a memory barrier.  What a memory barrier does, in this context, is make sure that the compiler / CPU do NOT reorder reads / writes across it.  I.e., a read / write after the memory barrier will be executed *after* any read / write before the memory barrier.

This, in general, is called memory-ordering.  Many atomic operations come with memory-ordering guarantees built in.  C11 / C++11 provide a full set of atomic operations with selectable memory-ordering guarantees:

  https://en.cppreference.com/w/c/atomic
  https://en.cppreference.com/w/cpp/atomic
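
As a concrete illustration, here is the x / ready example from above fixed with these primitives (a sketch using C11 <stdatomic.h>; a release store pairs with an acquire load):

  #include <stdatomic.h>
  #include <stdbool.h>

  extern int  expensive_computation_producing_42 (void);
  extern void use (int);

  static int x = 0;
  static atomic_bool ready = false;

  /* Thread 1: the release store makes the write to x visible
     to any thread that observes ready == true. */
  void producer (void)
  {
    x = expensive_computation_producing_42 ();
    atomic_store_explicit (&ready, true, memory_order_release);
  }

  /* Thread 2: the acquire load guarantees that if we see
     ready == true, we also see x == 42; neither the compiler
     nor the CPU may hoist the read of x above it. */
  void consumer (void)
  {
    if (atomic_load_explicit (&ready, memory_order_acquire))
      use (x);
  }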

The release-acquire memory-ordering sequence is of special interest to us.  Read:

  https://en.cppreference.com/w/c/atomic/memory_order

It takes a while to fully digest that.  For full-on deep understanding, read the first third of:

  https://www.kernel.org/doc/Documentation/memory-barriers.txt

Though maybe read that last one after you finish the third part of this message.


== Multiple CPUs ==

Ok, so let's consider the case of multiple CPUs.  While it's natural to expect that a store A performed on CPU 1 before a store B will be visible in that same order on CPU 2, this is not guaranteed on some architectures, namely DEC Alpha.

Let's see what this means in practice.  Imagine one thread initializes some data that other threads are going to consume:

static void *P = NULL;

Thread 1 on CPU 1:

  void *X = malloc (...);
  setup (X);
  memory_barrier ();
  P = X;

Thread 2 on CPU 2:

  void * Y = P;
  memory_barrier ();
  if (Y)
    Y->use_it ();

The memory_barrier()s exist to address the optimization issues discussed earlier.  Now, it would be a pity if thread 2 read P into Y as non-NULL, but the memory pointed to by Y were NOT yet visible to CPU 2.  This can happen.

Fortunately, memory barriers, when correctly used, also make sure that this situation does NOT happen.  That's what acquire-release semantics are about.  In particular, if the threads do:

Thread 1 on CPU 1:

  void *X = malloc (...);
  setup (X);
  release_barrier ();
  P = X;

Thread 2 on CPU 2:

  void * Y = P;
  acquire_barrier ();
  if (Y)
    Y->use_it ();

Then it's guaranteed that if Y sees the value of P pointing to X, any stores done on CPU 1 before the release_barrier() call are also visible to CPU 2 after the acquire_barrier() call.  This is exactly what we want.
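
In C11 terms the same publication pattern looks like this (a sketch; the object type and its field are made up for illustration):

  #include <stdatomic.h>
  #include <stdlib.h>

  typedef struct { int value; } object_t;   /* stand-in type */

  static object_t *_Atomic P = NULL;

  /* Thread 1: the release store guarantees that every store
     made while setting up *X is visible to any thread that
     later observes the new value of P. */
  void publish (void)
  {
    object_t *X = malloc (sizeof (*X));

    if (!X)
      return;
    X->value = 42;   /* stands in for setup (X) */
    atomic_store_explicit (&P, X, memory_order_release);
  }

  /* Thread 2: the acquire load pairs with the release store;
     if Y is non-NULL, the object it points to is fully set
     up. */
  int consume (void)
  {
    object_t *Y = atomic_load_explicit (&P, memory_order_acquire);

    return Y ? Y->value : -1;   /* -1: not published yet */
  }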

For full details of different types of barriers, read first third of:

  https://www.kernel.org/doc/Documentation/memory-barriers.txt

But just know that a read barrier provides stronger guarantees than an acquire barrier, a write barrier provides stronger guarantees than a release barrier, and a full memory barrier provides stronger guarantees than either a read or a write barrier.


That's the gist of it.  Hope it makes sense.  In another message I'll discuss how I use these in HarfBuzz, FontConfig, Cairo, Pango, etc., to provide lock-free thread-safety in the text stack.  FreeType, indeed, needs some work.

Cheers,

behdad



On Wed, Aug 1, 2018 at 9:01 PM, Behdad Esfahbod <address@hidden> wrote:
Hi Werner,

Glad you asked.  Yes, there's work to be done in FreeType as well.  Give me some time to write down my thoughts.

Cheers,
b

On Wed, Aug 1, 2018 at 12:29 AM, Werner LEMBERG <address@hidden> wrote:

Behdad,


>  typedef int hb_atomic_int_impl_t;

I always wondered what this `atomic' stuff is good for.  It seems that
you are working on a very low level, almost mimicking CPU opcodes.  Is
this for speed?  Or for thread safety?  And do you think that
something similar to your work would be beneficial for FreeType also?


    Werner



--- End Message ---
