Sequential consistency
Concurrency needs sync
Memory ordering is crucial
Ordering techniques
Object layout
C++11 techniques

C++11 Memory Model

A memory model, a.k.a memory consistency model, is a specification of the allowed behavior of multithreaded programs executing with shared memory [1]. The most basic model is sequential consistency (SC), where all insructions from all threads (appear to) form a total order that is consistent with the program order on each thread [2].

One of the most important features of C++11 is to provide a multithreading-aware memory model, which enables write multithreaded programs without relying on platform-specific extensions.

[1] Memory consistency primer
[2] Flight data recorder

(1) Sequential consistency ↑top

SC means that all threads agree on the order in which memory operations occured, and that order is consistent with the oder of operations in the program source code.

Some programming languages offer SC in multiprocessor environment. In C++11, you can declare all shared variables as C++11 atomic types with default memory ordering constraints. In Java, you can mark all shared variables as volatile [1] [2].

The compiler inserts additional instructions behind the scenes, e.g. memory fences, to enforce the order.

std::atomic<int> x(0), y(0);

//thread1
x = 1;
//thread2
y = 1;

//thread3
if(x==1 && y==0)
    print ("x first");
//thread4
if(y==1 && x==0)
    print ("y first");

C++11 atomic types guarantee SC, and thus it must be impsbl to print both messages.

[1] Sequential consistency_1
[2] Sequential consistency_2

(2) Concurrency needs sync ↑top

For performance gains, modern CPUs often execute instructions out of order to fully utilize resources. Since the hardware enforces instructions integrity, we can never notice this in a single thread execution. However, for multiple threads, this can lead to unpredictable bahaviors [4].

In multi-threaded execution, uncontrolled scheduling leads to data race, where results depend on the timing execution of the code. With some bad luck (i.e., context switches that occur at untimely points in the execution), we get the wrong result. [1]

(i) mutual exclusion (atomic)

To achieve atomicity, we can ask hardware for a few useful instructions to build mutual exclusion, which guarantees that if one thread is executing within the critical section, the others will be prevented from doing so.

(ii) waiting for another (conditional variable)

There are many cases where a thread continues its execution only when some condition is met. Thus, one thread must wait for another to complete some action before it continues. [2] [3]

[1] OSTEP threads
[2] OSTEP conditional variable
[3] Wiki conditional variable
[4] Memory fence

(3) Memory ordering is crucial ↑top

Memory operations ordering in default system is very relaxed, and a CPU has much freedom to reorder the operations, and compilers may also arrange the instructions it emits in any order it likes, provided it doesn't affect the apparent operation of the program.

In multi-threaded scenario, to avoid race conditions, ordering should be enforced among accesses from different threads

(i) CPU limited guarantees

on any given CPU, dependent memory accesses will be issued in order;
overlapping loads and stores within a specific CPU will appear to be ordered within that CPU;
overlapping memory accesses may be merged or discarded.

(ii) code transformations

The CPUs and other devices in a system can use a variety of tricks to improve performance, including reordering, deferral and combination of memory operations; speculative loads; speculative branch prediction and various types of caching.

compiler optimizations

what compiler knows
--> all memory operations in this thread and exactly what they do, including data dependencies;
--> how to be conservative enough in the face of psbl aliasing
what doesn't know
--> which mem locs are "mutable shared" variables and could change asynchronously due to mem operation in another thread
--> how to be conservative enough in the face of psbl sharing
solution: tell it --> somehow identify the operations on "mutable shared" locations

Independent memory operations are effectively performed in random order, but this can be a problem for CPU-CPU interaction and for I/O, and hence we need ordering.

(iii) ordering is a contract

You promise: to correctly synchronize your program (no race conditions)
"The system" promises: to provide the illusion of executing the program you wrote

(4) Ordering techniques ↑top

To guarantee SC, you must consider how to prevent memory reordering. Ways can be lightweight sync or fence, full fence, or acquire/release semantics.

A release store makes its its prior accesses visible to a thread performing an acquire load that sees (pair with) that store.

Automating acquire and release:
--> don't write fences by hand.
--> do make the compiler write barriers for you by using "critical region" abstractions: Mutexes and std::atomic<> variables.

(i) #1: use mutexes

use mutex locks to protect code that reads/writes shared variables.
Pros:
Cons: requires care on every use of the shared vars.

//Lock acquire/release:
mut_x.lock(); //"acquire" mut_x ==> ld.acq mut_x
... read/write x ...
mut_x.unlock(); //"release" mut_x ==> st.rel mut_x

(ii) #2: std::atomic<>

special atomic types are automically safe from reordering.
Pros: just tag the var, not every place it's used
Cons: writing correct atomics code is harder than it looks.

std::atomics: read=acquire, write=release
while(whose_turn != me){} //read whose_turn ==> ld.acq whose_turn
... read/write x ...
whose_turn = someone_else; //write whose_turn ==> st.rel whose_turn

(iii) #3: fences & ordered APIs

A memory fence/barrier is a class of instructions that enforces memory loads/stores occur in expected order. Different from high level mutexes and atomics, memory fences are hardware dependent.

Lock-free programming
A lock-free program can never be stalled entirely by any single thread. The entire "locking up" can be in any way, e.g., deadlock, livelock, or even malicious scheduling decisions [1] [2].

[1] Sequential consistency
[2] Lock free programming
[3] Acquire and release semantics

(5) Object layout ↑top

All data in a C++ program is made up of objects, each of which is "a region of storage".
Objects can be of simple fundamental type such as int or float, and they can also be instances of user-defined classes.
Whenever its type, an object is stored in one or more memory locations. Each such memory location is either an object (or subobject) of a scalar type such as short or my_class* or a sequence of adjacent [1].

Every variable is an obj, including those that are members of other objs.
Every obj occupies at least one mem loc.
Variables of fundamental type (e.g., int or char) are exactly one mem loc, whatever their size, even if they're adjacent or part of an array.
Adjacent bit fields are part of the same mem loc.

Given two global variables char c and char d:

//Thread 1
{ lock_guard<mutex> lock(cMutex);
  c = 1;
}
//Thread 2
{ lock_guard<mutex> lock(dMutex);
  d = 1;
}

There is no race in ideal C++11, but it is psbl in real life (e.g., [2] and adjacent bit-fields, as one object):

//system lays out c then d contiguously
char tmp[4]; //32-bit scratchpad
memcpy(&tmp[0], &c, 4); //read 32b starting at c
tmp[1] = 1; //set only bits of d
memcpy(&c, &temp[0], 4); //write 32 bits back
//thread 2 sliently also write to c without holding cMutex

[1] Bit field
[2] False sharing

Speculation and register allocation

speculation:
the system (compiler, CPU, cache, ...) speculates that a condition may be true (e.g., branch prediction), or has reason to believe that a condition is often true (e.g., it was true the last 100 times we executed this code)
To save time, we can optimistically start further execution based on that guess. If it's right, we saved time. If it's wrong, we have to undo any speculative work.

if(cond)    |   {
lock x      |    unique_lock<mutex> hold(mut, defer_lock)
...         |     if(cond)
if(cond)    |      hold.lock();
use x       |    ...
...         |     if(cond)
if(cond)    |      use x
unlock x    |    ...
            |   }//as-if "if(cond) hold.unlock()"

The above general pattern is safe w.r.t C++11 MMs. But beware compiler bugs ...

//x is a shared var
if(cond)
 x = 42;
//cond is speculated to be true, rewrite code
r1 = x;   //read what's there
x = 42;   //oops: optimistic write is NOT conditional
if(!cond) //check if we guessed wrong
x = r1;   //oops: back-out write is NOT sc

conditional locks:
--> Problem: your code conditionally takes a lock, but your system has a bug that changes a conditional write to be unconditional.

(6) C++11 techniques ↑top

(i) std::lock_guard

The class lock_guard is a mutex wrapper that provides a convenient RAII-style mechanism for owning a mutex for the duration of a scoped block.
When a lockguard object is created, it attempts to take ownership of the mutex it is given. When control leaves the scope in which the lockguard object was created, the lock_guard is destructed and the mutex is released.

#include <thread>
#include <mutex>
#include <iostream>

int g_i = 0;
std::mutex g_i_mutex; //protects g_i

void safe_incremenet(){
  std::lock_guard<std::mutex> lock(g_i_mutex);
  ++ g_i;
  std::cout << std::this_thread::get_id() << ":" << g_i << '\n';
  
  //g_i_mutex is automically released when lock goes out of scope
}

int main(){
  std::cout << __func__ << ": " << g_i << '\n';
  
  std::thread t1(safe_increment);
  std::thread t2(safe_increment);
  
  t1.join();
  t2.join();
  
  std<<cout << __func__ << ": " << g_i << '\n';
}

(ii) Use std::atomic for concurrency

Instantiations of std::atomic template offer operations that are guaranteed to be seen as atomic by other threads. Once a std::atomic object has been constructed, operations on it behave as if they were inside a mutex-protected critical section, but the operations are generally implemented using special machine instructions that are more efficient than would be the case if a mutex were employed.

std::atomic<int> ai(0); //init ai to 0
ai = 10; //atomically set ai to 10
std::cout << ai; //atomically read ai's value
++ai; //atomically increment ai to 11
--ai; //atomically decrement ai to 10

During execution of these statements, other threads reading ai may see only values of 0, 10 or 11. No other values are possible (assume this is the only thread modifying ai).

std::atomic guarantees only that the read of ai is atomic, no guarantee that the entire statement proceeds atomically.

Between the time ai's value is read and the operator<< is invoked to write to the standard output, another thread may have modified ai's value.

once a std::atomic object has been constructed, all member functions on it, including those comprising RMW operations, are guaranteed to be seen by other threads as atomic

(iii) Inherently nonportable features

To support low-level programming, C++ defines some features that are machine specific.

Bit-fields
A bit-field holds a specified number of bits, and the memory layout of a bit-field is machine dependent.
volatile
The volatile keyword is a directive to the compiler that it should not perform optimizations on such objects.

C++ threads

(1) Managing threads

(i) basic management

detach thread [1]
detaches the thread represented by the object from the calling thread, allowing them to execute independently from each other.
Both threads continue without blocking nor synchronizing in any way.
When either one ends execution, its resources are released.

Calling detach() on a thread object leaves the thread to run in the background, and it can no longer be joined.
join thread [2]
The function returns when the thread execution has completed.
transferring ownership of a thread std::thread t2=std::move(t1)

[1] Thread detach
[2] Thread join

//compile: -std=c++0x -pthread
//the func we want to execute on the new thread
void task1(string msg){
    cout << "task1 says: " << msg;
}

int main(){
    //constructs the new thread and runs it
    //form: thread(Function&& f, Args&&... args);
    thread t1(task1, "Hello");
    
    //makes the main thread wait for the new thread to finish, then continue
    t1.join();
}

void pause_thread(int n){
    std::this_thread::sleep_for (std::chrono::seconds(n));
    std::cout << "pause of " << n << " seconds ended\n";
}

int main(){
    std::cout << "spawning and detaching 3 threads ...\n";
    std::thread(pause_thread, 1).detach();
    std::thread(pause_thread, 2).detach();
    std::thread(pause_thread, 3).detach();
    
    std::cout << "Done spawning thread.\n";
    //give the detached threads time to finish, but no guarantee
    pause_thread(5);
    return 0;
}

Table of Contents