Memory Model and Enforced Ordering
Both the Compiler and the CPU can reorder memory accesses, meaning that they can happen in a
different order from what's specified in the code. These reordering happen for optimization
purposes and we don't really care about them in single threaded applications However, this can
become problematic in multithreaded applications where multiple threads may need to read from
and write to shared memory simultaneously.
There are three different memory ordering models each offering different levels of restrictions and
rules. They determine the degree of freedom that the CPU and the compiler have to reorder the
operations in one thread, and to propagate changes between multiple threads. Here is a brief
description of each model:
Sequentially Consistent: This is the most strict and straightforward ordering model. It
guarantees that all the threads will agree on the same order of events, meaning that they will
establish a single total modification order of all atomic operations that are tagged with
memory_order_seq_cst .
Relaxed: This ordering poses no constraints on the CPU and the compiler. The only
guarantee with this ordering is that the operations will be carried out atomically. This
ordering applies to operations tagged with memory_order_relaxed .
Acquire/Release: If the result of a release atomic store ( atomicVar.store(value,
std::memory_order_release) ) is read by an acquire atomic load
( atomicVar.load(std::memory_order_acquire) ) in another thread, the second thread is
guaranteed to see everything that happened before the atomic store. This ordering model is a
middle ground between the other two in terms of constraints and it has better performance
than Sequentially Consistent ordering.
Before diving deep into each of these models, let's go over some formal definitions and examples.
Sequenced-before
When evaluation A happens before evaluation B in the same thread, A is sequenced-before B.
// A is sequenced-before B
int num = 10; // A
num = 30; // B
Synchronizes-with
// Thread 1
atomicFlag.store(true, std::memory_order_release);
// Thread 2
while (!atomicFlag.load(std::memory_order_acquire));
In this example, the atomic store in Thread 1 is synchronized with the atomic load in Thread 2.
The use case of this will become clear when we go over the Acquire/Release ordering in more
detail.
Inter-thread happens-before
Evaluation A inter-thread happens-before B in the following situations:
A synchronizes-with B
// Thread 1
atomicNum.store(20, std::memory_order_release); // A
// Thread 2
while (atomicNum.load(std::memory_order_acquire) != 20); // B
A synchronizes-with X and X is sequenced-before B
// Thread 1
atomicFlag.store(true, std::memory_order_release); // A
// Thread 2
while (!atomicFlag.load(std::memory_order_acquire)); // X
std::cout << "Hello"; // B
A is sequenced-before X, and X inter-thread happens-before B
// Thread 1
atomicNum.store(20, std::memory_order_relaxed); // A
atomicFlag.store(true, std::memory_order_release); // X
// Thread 2
while (!atomicFlag.load(std::memory_order_acquire));
int value = atomicNum.load(std::memory_order_relaxed); // B
A inter-thread happens-before X, and X inter-thread happens-before B (transitive property).
// Thread 1
atomicNum.store(20, std::memory_order_relaxed); // A
flag1.store(true, std::memory_order_release);
// Thread 2
while (!flag1.load(std::memory_order_acquire)); // X
flag2.store(true, std::memory_order_release);
// Thread 3
while (!flag2.load(std::memory_order_acquire));
int value = atomicNum.load(std::memory_order_relaxed); // B
Happens-before
We say A happens-before B if either A is sequenced-before B or A inter-thread happens-before B.
Sequencing Models
Sequentially Consistent Ordering
In the absence of any explicit ordering, memory_order_seq_cst is the default. The sequentially
consistent ordering is the most restrictive ordering out of all the options, and it's the most
intuitive one. It guarantees that all the threads will agree on the same sequence of events between
the atomic operations, and the atomic stores are synchronized with subsequent atomic loads .
There is only one downside with this approach. This restriction and the guarantee of ordering
comes at the cost of performance. In multicore systems, different cores may need to perform
additional expensive synchronization instructions in order to guarantee sequential consistency
between each other. One such instruction is XCHG which is discussed later on in this article.
Let's go over an example of what sequentially consistent ordering means:
// flag initially holds false
// Thread 1
atomicNumber.store(20, std::memory_order_seq_cst); // 1
flag.store(true, std::memory_order_seq_cst); // 2
// Thread 2
while (!flag.load(std::memory_order_seq_cst)); // 3
assert(atomicNumber.load() == 20); // 4
In this example, the assertion on line 4 will never fail. This is because the atomic store on line 2
in Thread 1 is synchronized with the atomic load on line 3 in Thread 2 and in Thread 1, the
atomic store in the atomicNumber variable, happens before the atomic store on the flag. Due to
the transitive property of the happens before relationship, line 1 happens before line 4 and
therefore, the assertion will always succeed.
This example is very intuitive and is what we would expect to happen. This is exactly why
memory_order_seq_cst is the default mode for all the atomic operations. Unless we have a good
reason not to, we can usually stick with this memory ordering model. As we will see in a bit, the
assertion above may fail if we used memory_order_relaxed instead.
Relaxed Ordering
memory_order_relaxed is the exact opposite side of the coin. Minimal to no guarantee is made
on the ordering of the atomic operations used with this model. Another interesting thing with
relaxed ordering is that different threads may not agree on the same order of events. Meaning that
2 threads may see 2 different values for the same atomic variable at the exact same time.
Let's see an example of what's totally valid but unexpected when memory_order_relaxed is used:
// Shared
std::atomic<int> num(0);
std::atomic<bool> flag(false);
// Thread 1
num.store(10, std::memory_order_relaxed);
num.store(20, std::memory_order_relaxed);
num.store(30, std::memory_order_relaxed);
num.store(40, std::memory_order_relaxed);
num.store(50, std::memory_order_relaxed);
flag.store(true, std::memory_order_relaxed);
// Thread 2
while (!flag.load(std::memory_order_relaxed));
int value = num.load(std::memory_order_relaxed);
In Thread 2, value can be 10, 20, 30, 40, or 50. All of these are valid outcomes because we used
memory_order_relaxed and there is no guarantee that both threads will see the same value for
the atomic variable num . Some rules still apply, even with relaxed ordering. For instance, let's
assume that thread 2 sees the value 30 the first time it calls
num.load(std::memory_order_relaxed) . In this situation, the subsequent calls to
num.load(std::memory_order_relaxed) in thread 2 cannot see any of the previous atomic stores
on num, so they will either still see 30, or they will see the subsequent atomic stores in thread 1
which are 40 and 50.
The benefit of using a relaxed ordering is that it has the best performance but it's also the trickiest
to get right. Due to the looseness of the rules enforced by memory_order_relaxed , it shouldn't be
used often unless performance if of utmost importance. A common application of
memory_order_relaxed is incrementing a counter because we don't really care about the order in
which the counter gets incremented, as long as it does eventually and atomically.
Acquire/Release Ordering
The Acquire/Release ordering model is a middle ground between the sequentially consistent and
relaxed models. It has better performance than sequentially consistent, and it provides useful
synchronization guarantees in multithreaded programs.
If a load atomic operation tagged with memory_order_acquire in Thread B, reads a value stored
in the same atomic variable with an atomic store tagged with memory_order_release in Thread
A, then Thread A synchronizes with Thread B, and they will participate in a release sequence.
This means that all atomic and non-atomic memory accesses before the release store are visible
after the acquire load. In terms of the definitions we had earlier, every memory access that
appears before the store tagged with memory_order_release , happens before the instructions that
appear after the load tagged with memory_order_acquire .
This synchronization is only established between the threads releasing and acquiring the same
atomic variable. Memory accesses may still propagate in a different order to other threads that
don't perform an acquire operation on the same atomic variable.
// Thread 1
nonAtomicNum = 20
atomicNum.store(40, std::memory_order_relaxed);
flag.store(true, std::memory_order_release); // A
// Thread 2
while (!flag.load(std::memory_order_acquire)); // B
assert(nonAtomicNum == 20); // Never fires
assert(atomicNum.load(std::memory_order_relaxed) == 40); // Never fires
In the example above, Thread 1 synchronizes with Thread 2 on flag, which is an atomic boolean.
Therefore, all the writes before the store tagged with memory_order_release are going to be
visible in Thread 2 after the synchronization. As a result, both the asserts are always guaranteed
to succeed.
Behind the Scenes
x86 vs ARM
In strongly ordered architectures like x86, release-acquire is the default and it is the minimum
amount of constraint imposed by the CPU on the ordering of the instructions. No additional
instruction is generated for memory_order_release and memory_order_acquire as opposed to
memory_order_relaxed in these architectures.
atomicNum.store(10, std::memory_order_release); // A
atomicNum.store(10, std::memory_order_relaxed); // B
Lines A and B in the example above produce the exact same assembly code. The mov instruction
guarantees release-acquire ordering in the x86 architecture (compiled with x86 GCC 13.2):
mov DWORD PTR atomicNum[rip], 10
in weakly ordered systems like ARM however, the CPU may reorder the operations that are
tagged with memory_order_release . In these systems, additional instructions are required to
enforce release-acquire ordering. Line A in the example above generates the following arm
assembly (compiled with ARM GCC 13.2.0):
; counter.store(10, std::memory_order_relaxed)
movs r0, #10
whereas Line B generates the following:
; counter.store(10, std::memory_order_release)
dmb ish
movs r2, #10
dmb stands for Data Memory Barrier which according to the arm documentation, "ensures that all
explicit memory accesses that appear in program order before the DMB instruction are observed
before any explicit memory accesses that appear in program after the DMB instruction". This is
exactly the behavior we would expect from memory_order_release . For operations that are
tagged with memory_order_acquire , the data memory barrier ( dmb ) will be placed after the main
instruction. and with memory_order_seq_cst , the main instruction will be wrapped with two data
memory barriers:
; counter.store(10, std::memory_order_seq_cst);
dbm ish
movs r2, #10
dmb ish
To ensure sequential consistency in x86 architectures, the compiler either uses mov + mfence or
mov + xchg . The mov by itself is not enough since it only has release semantics. xchg has an
implicit lock prefix which makes it a full memory barrier. I got the following assembly when
compiling with x86 GCC 13.2:
; counter.store(10, std::memory_order_seq_cst)
mov eax, 10
xchg eax, DWORD PTR counter[rip]
These additional instructions explain the performance difference between different memory models
and why they work the way they do.