Table of Contents

  1. Sequential consistency
  2. Concurrency needs sync
  3. Memory ordering is crucial
  4. Ordering techniques
  5. Object layout
  6. C++11 techniques

C++11 Memory Model

A memory model, a.k.a. memory consistency model, is a specification of the allowed behavior of multithreaded programs executing with shared memory [1]. The most basic model is sequential consistency (SC), where all instructions from all threads (appear to) form a total order that is consistent with the program order on each thread [2].

One of the most important features of C++11 is a multithreading-aware memory model, which enables writing multithreaded programs without relying on platform-specific extensions.

[1] Memory consistency primer
[2] Flight data recorder

(1) Sequential consistency ↑top

SC means that all threads agree on the order in which memory operations occurred, and that order is consistent with the order of operations in the program source code.

Some programming languages offer SC in a multiprocessor environment. In C++11, you can declare all shared variables as C++11 atomic types with the default memory ordering constraints. In Java, you can mark all shared variables as volatile [1] [2].

The compiler inserts additional instructions behind the scenes, e.g. memory fences, to enforce the order.

std::atomic<int> x(0), y(0);

//Thread 1
x = 1;
//Thread 2
y = 1;

//Thread 3
if(x==1 && y==0)
    print ("x first");
//Thread 4
if(y==1 && x==0)
    print ("y first");

C++11 atomic types guarantee SC, and thus it must be impossible for both messages to be printed.

[1] Sequential consistency_1
[2] Sequential consistency_2

(2) Concurrency needs sync ↑top

For performance gains, modern CPUs often execute instructions out of order to fully utilize resources. Since the hardware preserves the illusion of in-order execution within a single thread, we can never notice this in single-threaded execution. However, with multiple threads, this can lead to unpredictable behaviors [4].

In multi-threaded execution, uncontrolled scheduling leads to data races, where results depend on the timing of the code's execution. With some bad luck (i.e., context switches that occur at untimely points in the execution), we get a wrong result. [1]

(i) mutual exclusion (atomic)

To achieve atomicity, we can ask hardware for a few useful instructions to build mutual exclusion, which guarantees that if one thread is executing within the critical section, the others will be prevented from doing so.

(ii) waiting for another (condition variable)

There are many cases where a thread continues its execution only when some condition is met. Thus, one thread must wait for another to complete some action before it continues. [2] [3]

[1] OSTEP threads
[2] OSTEP condition variable
[3] Wiki condition variable
[4] Memory fence

(3) Memory ordering is crucial ↑top

By default, the ordering of memory operations is very relaxed: a CPU has much freedom to reorder operations, and the compiler may also arrange the instructions it emits in any order it likes, provided this doesn't affect the apparent operation of the program.

In a multi-threaded scenario, to avoid race conditions, ordering must be enforced among accesses from different threads.

(i) CPU limited guarantees

The CPUs and other devices in a system can use a variety of tricks to improve performance, including reordering, deferral and combination of memory operations; speculative loads; speculative branch prediction; and various types of caching. Independent memory operations are effectively performed in random order, but this is a problem for CPU-CPU interaction and for I/O, and hence we need ordering.

(ii) code transformations (compiler optimizations)

The compiler may likewise reorder, combine, or eliminate the memory accesses it emits, as long as the observable behavior of a single-threaded program is preserved.

(iii) ordering is a contract

You promise: to correctly synchronize your program (no race conditions)
"The system" promises: to provide the illusion of executing the program you wrote

(4) Ordering techniques ↑top

To guarantee SC, you must consider how to prevent memory reordering. Techniques include lightweight synchronization/fences, full fences, and acquire/release semantics.

A release store makes its prior accesses visible to a thread performing an acquire load that sees (pairs with) that store.

Automating acquire and release:
--> don't write fences by hand.
--> do make the compiler write barriers for you by using "critical region" abstractions: Mutexes and std::atomic<> variables.

(i) #1: use mutexes

Use mutex locks to protect code that reads/writes shared variables.
Cons: requires care at every use of the shared variables.

//Lock acquire/release:
mut_x.lock(); //"acquire" mut_x ==> ld.acq mut_x
... read/write x ...
mut_x.unlock(); //"release" mut_x ==> st.rel mut_x

(ii) #2: std::atomic<>

Special atomic types are automatically safe from reordering.
Pros: just tag the variable, not every place it's used.
Cons: writing correct atomics code is harder than it looks.

std::atomics: read=acquire, write=release
while(whose_turn != me){} //read whose_turn ==> ld.acq whose_turn
... read/write x ...
whose_turn = someone_else; //write whose_turn ==> st.rel whose_turn

(iii) #3: fences & ordered APIs

A memory fence/barrier is a class of instructions that enforces that memory loads/stores occur in the expected order. Unlike high-level mutexes and atomics, memory fences are hardware-dependent.

Lock-free programming
A lock-free program can never be stalled entirely by any single thread. The "locking up" here can take any form, e.g., deadlock, livelock, or even malicious scheduling decisions [1] [2].

[1] Sequential consistency
[2] Lock free programming
[3] Acquire and release semantics

(5) Object layout ↑top

All data in a C++ program is made up of objects, each of which is "a region of storage".
Objects can be of simple fundamental type such as int or float, and they can also be instances of user-defined classes.
Whatever its type, an object is stored in one or more memory locations. Each such memory location is either an object (or subobject) of a scalar type such as short or my_class*, or a sequence of adjacent bit-fields [1].

Given two global variables char c and char d:

//Thread 1
{ lock_guard<mutex> lock(cMutex);
  c = 1;
}
//Thread 2
{ lock_guard<mutex> lock(dMutex);
  d = 1;
}

There is no race in ideal C++11, but one is possible in real life (e.g., with false sharing [2], or when adjacent bit-fields occupy one memory location):

//system lays out c then d contiguously
char tmp[4];             //32-bit scratchpad
memcpy(&tmp[0], &c, 4);  //read 32 bits starting at c
tmp[1] = 1;              //set only the byte holding d
memcpy(&c, &tmp[0], 4);  //write all 32 bits back
//thread 2 silently also writes to c without holding cMutex

[1] Bit field
[2] False sharing

Speculation and register allocation

conditional locks:
--> Problem: your code conditionally takes a lock, but your system has a bug that speculates and changes a conditional write into an unconditional one.

(6) C++11 techniques ↑top

(i) std::lock_guard

The class lock_guard is a mutex wrapper that provides a convenient RAII-style mechanism for owning a mutex for the duration of a scoped block.
When a lock_guard object is created, it attempts to take ownership of the mutex it is given. When control leaves the scope in which the lock_guard object was created, the lock_guard is destructed and the mutex is released.

#include <thread>
#include <mutex>
#include <iostream>

int g_i = 0;
std::mutex g_i_mutex; //protects g_i

void safe_increment(){
  std::lock_guard<std::mutex> lock(g_i_mutex);
  ++g_i;
  std::cout << std::this_thread::get_id() << ": " << g_i << '\n';
  //g_i_mutex is automatically released when lock goes out of scope
}

int main(){
  std::cout << __func__ << ": " << g_i << '\n';
  std::thread t1(safe_increment);
  std::thread t2(safe_increment);
  t1.join();
  t2.join();
  std::cout << __func__ << ": " << g_i << '\n';
}

(ii) Use std::atomic for concurrency

Instantiations of the std::atomic template offer operations that are guaranteed to be seen as atomic by other threads. Once a std::atomic object has been constructed, operations on it behave as if they were inside a mutex-protected critical section, but the operations are generally implemented using special machine instructions that are more efficient than would be the case if a mutex were employed.

std::atomic<int> ai(0); //init ai to 0
ai = 10; //atomically set ai to 10
std::cout << ai; //atomically read ai's value
++ai; //atomically increment ai to 11
--ai; //atomically decrement ai to 10

During execution of these statements, other threads reading ai may see only values of 0, 10 or 11. No other values are possible (assume this is the only thread modifying ai).

std::atomic guarantees only that the read of ai is atomic; there is no guarantee that the entire statement proceeds atomically.

Between the time ai's value is read and the operator<< is invoked to write to the standard output, another thread may have modified ai's value.

Once a std::atomic object has been constructed, all member functions on it, including those comprising RMW (read-modify-write) operations, are guaranteed to be seen by other threads as atomic.

(iii) Inherently nonportable features

To support low-level programming, C++ defines some features that are machine specific.

C++ threads

(1) Managing threads

(i) basic management

[1] Thread detach
[2] Thread join

//compile: -std=c++0x -pthread
#include <iostream>
#include <string>
#include <thread>
using namespace std;

//the function we want to execute on the new thread
void task1(string msg){
    cout << "task1 says: " << msg << '\n';
}

int main(){
    //constructs the new thread and runs it
    //form: thread(Function&& f, Args&&... args);
    thread t1(task1, "Hello");
    //makes the main thread wait for the new thread to finish, then continue
    t1.join();
}

void pause_thread(int n){
    std::this_thread::sleep_for(std::chrono::seconds(n));
    std::cout << "pause of " << n << " seconds ended\n";
}

int main(){
    std::cout << "spawning and detaching 3 threads ...\n";
    std::thread(pause_thread, 1).detach();
    std::thread(pause_thread, 2).detach();
    std::thread(pause_thread, 3).detach();
    std::cout << "Done spawning threads.\n";
    //give the detached threads time to finish, but no guarantee
    pause_thread(5);
    return 0;
}