
# memory ordering

## 1 Preface

Memory ordering describes the order of accesses to computer memory by a CPU. The term can refer either to the memory ordering generated by the compiler during compile time, or to the memory ordering generated by a CPU during runtime.

This post tries to explain memory ordering bottom-up: from hardware to software, and from the abstractions of memory order and memory model up to the language and application level.

This post covers the following questions:

1. How do hardware/processors reorder our program's instructions?
2. What are memory models and memory orders?
3. How can we understand them at a high level?
4. How do we use them (lock-free programming)?

## 3 CPU cache

A CPU cache is a hardware cache used by the central processing unit (CPU) of a computer to reduce the average cost (time or energy) to access data from the main memory. A cache is a smaller, faster memory, closer to a processor core, which stores copies of the data from frequently used main memory locations. Most CPUs have different independent caches, including instruction and data caches, where the data cache is usually organized as a hierarchy of more cache levels (L1, L2, L3, L4, etc.). …

figure 1. general CPU architecture

figure 2. Itanium CPU architecture

1. L1 is split into a data part (L1d) and an instruction part (L1i), used for exactly what the names say. It is typically 32KB, built from static memory, and is the fastest (and most expensive) level, closest to the CPU's own speed.
2. Size: L1 < L2 < L3; speed: L1 > L2 > L3; lookup order: L1, then L2, then L3.
3. L3 and higher-level caches are shared by several cores.
4. Modern CPUs are almost all multi-core, and each core has its own local cache, which improves performance but introduces extra complexity of use.

### 3.1 Cache line

Data is transferred between memory and cache in blocks of fixed size, called cache lines or cache blocks. When a cache line is copied from memory into the cache, a cache entry is created. The cache entry will include the copied data as well as the requested memory location (called a tag).

Cache row entries usually have the following structure:

+---------------+---------+-------------------+
|   flag bits   |   tag   |   data block(s)   |
+---------------+---------+-------------------+


The data block (cache line) contains the actual data fetched from the main memory.

figure 3. CPU cache line

figure 4. CPU cache line data layout

Figure 4 shows a cache block with 4K cache entries and 16-byte (128-bit) cache lines, implemented as a "direct mapped cache"; there are other cache implementations such as set associative caches and column associative caches. Check this video for more info.

• valid: whether the entry is valid
• data: the cache line data. The data part usually consists of multiple blocks, so that only part of a cache line needs to be updated; the last few bits of the address (the byte offset) select (via a Mux, multiplexer) one specific data block
• data block: the size of a single data block, usually the same as the machine word: 32-bit or 64-bit

A CPU core interacts with other cores and with memory through cache entries; in other words, the granularity of the CPU's memory operations is the cache line. This is done for efficiency. On current x86 CPUs a cache line (data) is typically 64 bytes, which is why we so often talk about memory alignment: __attribute__((aligned(64))) or the C++ keyword alignas(64) (since C++11). If memory is aligned on cache-line boundaries, many operations on the Intel CPUs we commonly use (Xeon, etc.) effectively become "atomic", because a single load/store is enough to read or update the memory.

One more thing: if you align your data structures, the performance gain in scenarios with frequent memory accesses can be very noticeable; Herb Sutter gave such an example, and a sketch of the idea follows.
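
As a hedged sketch of the effect (the struct names are illustrative, not Herb's code): when two threads update two adjacent counters, padding each counter to its own 64-byte cache line avoids false sharing, so the cores stop invalidating each other's cache line.

    #include <atomic>

    // Packed: a and b share one 64-byte cache line; two threads updating
    // them independently keep stealing the line from each other.
    struct PackedCounters {
      std::atomic<long> a{0};
      std::atomic<long> b{0};
    };

    // Padded: each counter owns a full cache line; no false sharing.
    struct PaddedCounters {
      alignas(64) std::atomic<long> a{0};
      alignas(64) std::atomic<long> b{0};
    };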

The implementation of cache lines is quite interesting. As noted above, this is all hardware circuitry, and we can implement a similar cache directly in digital logic (on an FPGA, etc.) from the published algorithms. There is an open-source Verilog implementation of a "direct mapped cache" on GitHub as extended reading for those interested.

### 3.2 Replacement policies

To make room for the new entry on a cache miss, the cache may have to evict one of the existing entries. The heuristic it uses to choose the entry to evict is called the replacement policy. The fundamental problem with any replacement policy is that it must predict which existing cache entry is least likely to be used in the future. Predicting the future is difficult, so there is no perfect method to choose among the variety of replacement policies available. One popular replacement policy, least-recently used (LRU), replaces the least recently accessed entry.
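
As a software analogy of the LRU policy (a hedged sketch of the bookkeeping, not of the hardware circuit): a map plus a recency-ordered list evicts the least recently accessed entry when capacity is exceeded.

    #include <list>
    #include <optional>
    #include <unordered_map>
    #include <utility>

    class LruCache {
      size_t capacity_;
      std::list<std::pair<int, int>> items_;  // (key, value), most recent first
      std::unordered_map<int, std::list<std::pair<int, int>>::iterator> index_;

     public:
      explicit LruCache(size_t capacity) : capacity_(capacity) {}

      std::optional<int> get(int key) {
        auto it = index_.find(key);
        if (it == index_.end()) return std::nullopt;        // miss
        items_.splice(items_.begin(), items_, it->second);  // mark as MRU
        return it->second->second;
      }

      void put(int key, int value) {
        if (auto it = index_.find(key); it != index_.end()) {
          it->second->second = value;
          items_.splice(items_.begin(), items_, it->second);
          return;
        }
        if (items_.size() == capacity_) {  // evict the least recently used
          index_.erase(items_.back().first);
          items_.pop_back();
        }
        items_.emplace_front(key, value);
        index_[key] = items_.begin();
      }
    };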

Marking some memory ranges as non-cacheable can improve performance, by avoiding caching of memory regions that are rarely re-accessed. This avoids the overhead of loading something into the cache without having any reuse. Cache entries may also be disabled or locked depending on the context.

### 3.3 Write policies

If data is written to the cache, at some point it must also be written to main memory; the timing of this write is known as the write policy. In a write-through cache, every write to the cache causes a write to main memory. Alternatively, in a write-back or copy-back cache, writes are not immediately mirrored to the main memory, and the cache instead tracks which locations have been written over, marking them as dirty. The data in these locations is written back to the main memory only when that data is evicted from the cache. For this reason, a read miss in a write-back cache may sometimes require two memory accesses to service: one to first write the dirty location to main memory, and then another to read the new location from memory. Also, a write to a main memory location that is not yet mapped in a write-back cache may evict an already dirty location, thereby freeing that cache space for the new memory location.

### 3.4 Cache coherence

Cache coherence is the uniformity of shared resource data that ends up stored in multiple local caches. When clients in a system maintain caches of a common memory resource, problems may arise with incoherent data, which is particularly the case with CPUs in a multiprocessing system.

figure 5. cache coherence

Figure 5, from Wikipedia, shows roughly how cache coherence proceeds. Cache coherence is the process of synchronizing cache lines between the cores and memory; a CPU's runtime instruction reordering and its memory ordering model are closely tied to this implementation.

Cache coherence is a very complex process, and Intel's published documents say little about how it is implemented. Nevertheless, Intel CPUs do guarantee that after one core updates a cache line, the other cores and memory reach a consistent view of that cache line's value, somewhat like consistency in distributed systems. Two questions come to mind:

1. Two cores try to modify the same cache line "simultaneously". Who will win? Can both lose? How does Intel's implementation handle this case?
2. One core tries to read a memory location that is not in its cache while another core has exclusive ownership of the cache line containing that location and tries (simultaneously) to write a value into it. Who will win? Will the cache line state first transition to Shared and then be invalidated, or go to Modified and then Shared?

Intel does not document the details of its coherence protocols, but the ordering model is described in some detail in Section 8.2 of Volume 3 of the Intel Architectures Software Developer’s Manual (Intel document 325384).

The details of the implementation of the coherence protocol (and its associated ordering model) differ across processor models, and can differ based on configuration settings (e.g., Home Snoop vs Early Snoop on Haswell EP, or 1-socket vs 2-socket vs 4-socket configurations). In addition, some coherence transactions can be handled in different ways, with the choice made dynamically at run time.

The most interesting details usually pop up in the Uncore Performance Monitoring Guides for the various processor models. Understanding the implications of these details requires significant experience in microarchitecture and microbenchmarking, though there are many aspects of Intel’s implementations that are probably too hidden to be fully understood by anyone who does not have access to the actual processor RTL (or whatever high-level-design language Intel uses these days).

The short answer to #1 is that coherence processing is serialized at the various coherence agents. For example, if two cores execute a store instruction in the same cycle and both miss in their L1 and L2 caches, then the transaction will go to the L3 slice (or CHA in some processors) responsible for that address, which will process the incoming requests sequentially. One or the other will “win” and will be granted exclusive access to the cache line to perform the store. During this period, the request from the “losing” core will be stalled or rejected, until eventually the first core completes its coherence transaction and the second core’s transaction is allowed to proceed. There are many ways to implement the details, and Intel almost certainly uses different detailed approaches in different processor models.

The short answer to #2 is also that the transactions will be serialized. Either the owning processor will complete the store, then the line will be transferred to the reading processor, or the cache line will be transferred away from the owning processor to the reading processor first, then returned to the (original) owning processor to complete the store. Again, there are many ways to implement the low-level details, and Intel processors implement several different approaches. If I am reading between the lines correctly, recent Intel chips will use different transactions for this scenario depending on how frequently it happens to a particular address. (E.g., see the description of the HitMe cache in the Xeon E5/E7 v3 processor family uncore performance monitoring guide, document 331051.)

“Dr. Bandwidth”

Cache coherence solves the problem of data visibility between cores, but it does not solve the problem of the order in which data becomes visible; that ordering problem is what this post focuses on.

### 3.5 Store buffer

A store buffer is similar to a queue, but its size is counted in entries rather than bits. The example in the figure above has 56 entries; typically the store buffer is about 2/3 the size of the load buffer, because most programs read more than they write.

The following resources are shared between two threads running in the same core:

• Cache
• Branch prediction resources
• Instruction fetch and decoding
• Execution units

Hyperthreading has no advantage if any of these resources is a limiting factor for the speed.

After a CPU core modifies data, it does not write the change to L1d immediately; it writes to the store buffer first and flushes to L1d as soon as it can. Take the following code:

int sum = 0;
for (int i = 0; i < 10086; ++i) {
  sum += i;
}


After one assignment to sum, the value is not written to L1d right away; it may sit in the store buffer for a while and only get flushed to L1d after the next assignment. This is done for efficiency: the speculative machinery can execute two additions first and then write to L1d once, effectively merging the results of +1 and +2 into a single +3 written to L1d, saving a cache write. Now consider two threads:

// global init
int a = 0;
int b = 0;

t1:                          |           t2:
a = 1;                       |           b = 1;
if (b == 0) {                |           if (a == 0) {
  // do something            |             // do something
} else {                     |           } else {
  goto t1; // retry          |             goto t2; // retry
}                            |           }


### 3.6 Conclusion of CPU cache

CPU caches speed up execution but introduce extra complications of use. The cache implementation itself is complex, and even with cache coherence in place, the store buffer means that whenever multiple cores work together, the cores can still disagree about the order in which memory changes become visible. Add compiler and CPU optimizations on top and the problem gets even more complicated: by the time the program we wrote in a high-level language actually executes, it may no longer be what we imagined.

## 4 Instruction reordering

1. compile-time: the compiler reorders the generated instructions to reduce register loads/stores, e.g. when it finds that a value in some register can be written back to memory first so that the register can be overwritten for another use
2. runtime: the CPU speculates at run time about which instructions in the same instruction stream can execute early, and also eliminates redundant loads/stores of memory or registers, e.g. speculative execution of branches

Compile-time reordering can also be called compiler reordering, i.e. static reordering; runtime reordering can also be called CPU reordering, i.e. dynamic reordering. This is not explained again below.

Thou shalt not modify the behavior of a single-threaded program.

In my view, the memory ordering problem is caused by instruction reordering: reordering changes the order in which memory operations become visible. That is harmless when the code executes single-threaded, but problems appear once it runs on multiple cores/threads; the downside of the extra complexity introduced for efficiency shows itself.

int msg = 0;
int ready = 0;

void foo() { // thread 1
  msg = 10086;
  ready = 1;
}

void bar() { // thread 2
  if (ready == 1) {
    // output may be 0, unexpected
    std::cout << msg;
  }
}


When thread 1 executes foo() single-threadedly, assigning msg first or ready first gives the same final result, which satisfies the most basic principle of reordering. So, without any constraints, both the compiler when generating assembly and the CPU when running the instruction stream are free to assign ready before msg.

### 4.1 Compile-time reordering

// cpp code                     | ;asm code
                                |
int a = 0;                      |
int b = 0;                      |
                                |
void foo() {                    | foo():
  a = b + 1;                    |         mov     eax, DWORD PTR b[rip]
  b = 5;                        |         mov     DWORD PTR b[rip], 5
}                               |         mov     DWORD PTR a[rip], eax
                                |         ret


code 2a. compile-time reordering, a = b + 1

// cpp code                               | ;asm code
                                          |
int a = 0;                                |
int b = 0;                                |
                                          |
void foo() {                              | foo():
  a = b + 1;                              |         mov     eax, DWORD PTR b[rip]
  std::atomic_thread_fence(               |         mov     DWORD PTR a[rip], eax
      std::memory_order_release);         |         mov     DWORD PTR b[rip], 5
  b = 5;                                  |         ret
}                                         |


code 2b. compile-time reordering constraint, a = b + 1

// cpp code               | ; asm code
int a = 0;                |
int b = 0;                |
void foo() {              | foo():
  a = 1;                  |         mov     DWORD PTR b[rip], 2
  b = 2;                  |         mov     DWORD PTR a[rip], 3
  a = 3;                  |         ret
}                         |

code 2c. compile-time reordering, the dead store a = 1 is eliminated


Herb Sutter's talk (atomic Weapons 2 of 2) mentions some of the considerations compilers weigh when doing reordering optimizations, and walks through several interesting reordering cases.

### 4.2 Runtime reordering

1. Different CPU architectures have different hardware memory models, i.e. different policies/restrictions on instruction reordering.
2. They roughly fall into two categories, weak & strong:
   1. weak hardware memory model: ARM, Itanium, PowerPC. These place few "restrictions" on instruction reordering. The complex cache architecture and implementation described earlier contribute a lot to the complexity of memory visibility (ordering), but precisely because of the looser restrictions, software developers have more choices for performance optimization.
   2. strong hardware memory model: the x86-64 family (Intel and AMD), which rules out many kinds of instruction reordering.

A strong hardware memory model is one in which every machine instruction comes implicitly with acquire and release semantics. As a result, when one CPU core performs a sequence of writes, every other CPU core sees those values change in the same order that they were written.

Each core has one instruction stream; suppose it contains n memory writes. When the k-th write executes, the preceding k-1 writes can already be observed by the other cores, and the order in which the other cores observe these k writes matches the order on this core. Guaranteeing this costs some performance, because the CPU cache makes such visibility harder to provide; a strong model also limits optimization to some degree.

3. Even two architectures that are both weak (or both strong) still differ a lot, because instruction sets vary widely and the combinations of instructions within one ISA are countless, so every CPU architecture has its own policy.
4. Whichever the architecture, CPU instruction reordering follows fixed rules, and these rules are written in each CPU's manual; system (OS) developers must read these manuals.
5. For Intel and other x86-family CPUs, the vendors' line of thinking is roughly: as long as single instructions execute fast enough, there is little need to worry about such optimization.

#### 4.2.1 Gotcha! Runtime reordering

int msg = 0;
int ready = 0;

void test() {
  for (;;) {
    msg = 0;
    ready = 0;

    std::thread t1([] {
      msg = 10086;
      ready = 1;
    });

    std::thread t2([] {
      if (ready == 1 && msg == 0) {
        std::cout << "gotcha" << std::endl;
      }
    });

    t1.join();
    t2.join();
  }
}


code 3. simulate runtime reordering with 2 threads

msg = 10086;
// only 1 instruction after last store/assignment


// the static instruction stream (compiled object)

  thread1                      thread2

mov [_X], 1                  mov [_Y], 1
mov r1, [_Y]                 mov r2, [_X]


X and Y are memory addresses, r1 and r2 are registers; the two threads execute similar assembly. The first statement is a store, the second is a load, and the two statements operate on different memory addresses.

X and Y are both initially zero. If CPU instruction reordering occurs in both threads, that is to say r1 == 0 && r2 == 0, the effective execution order is:

  thread1                      thread2

mov r1, [_Y]                 mov r2, [_X]
mov [_X], 1                  mov [_Y], 1


For Intel CPUs, only store-load reordering is allowed when there is no memory fence to prevent reordering; I will discuss this later in Intel's reordering specification. Compile the following demo with:

g++ -std=c++11 -lpthread


#include <semaphore.h>
#include <iostream>
#include <random>

sem_t begin_sema1, begin_sema2, end_sema;

int X, Y;
int r1, r2;

auto thread1_func = [] {
  std::mt19937 rng(std::random_device("/dev/random")());
  std::uniform_int_distribution<int> random(1, 100000000);
  for (;;) {
    sem_wait(&begin_sema1);  // Wait for signal
    // Random delay is necessary to increase the possibility of capturing CPU
    // reordering; this demo still works without the random delay
    while (random(rng) % 8 != 0) {}

    X = 1;
    // Prevent compiler reordering explicitly
    asm volatile("" ::: "memory");
    r1 = Y;

    sem_post(&end_sema);  // Notify transaction complete
  }
};
// thread2_func is symmetric: it waits on begin_sema2, stores Y, loads X into r2

code 4. thread function of capturing CPU reordering


The main function is as follows; it mainly simulates the two threads executing simultaneously on two cores.

int main(void) {
  sem_init(&begin_sema1, 0, 0);
  sem_init(&begin_sema2, 0, 0);
  sem_init(&end_sema, 0, 0);

  // Repeat the experiment ad infinitum
  int detected = 0;
  for (int iterations = 1; ; ++iterations) {
    // Reset X and Y
    X = 0;
    Y = 0;
    sem_post(&begin_sema1);
    sem_post(&begin_sema2);
    sem_wait(&end_sema);
    sem_wait(&end_sema);
    // Check if there was a simultaneous reorder
    if (r1 == 0 && r2 == 0) {
      std::cout << "gotcha! "
                << ++detected << " reorders detected after " << iterations
                << " iterations" << std::endl;
    }
  }
  return 0;  // Never returns
}



code 5. main function of capturing CPU reordering

Running this on an Intel(R) Xeon(R) CPU E5-2450 0 @ 2.10GHz machine produces output like the following; roughly 1% of the iterations capture a CPU reordering.

...
gotcha! 5505 reorders detected after 563904 iterations
gotcha! 5506 reorders detected after 565918 iterations
gotcha! 5507 reorders detected after 569906 iterations
gotcha! 5508 reorders detected after 571118 iterations
gotcha! 5509 reorders detected after 573014 iterations
gotcha! 5510 reorders detected after 573665 iterations
gotcha! 5511 reorders detected after 573871 iterations
gotcha! 5512 reorders detected after 574064 iterations
gotcha! 5513 reorders detected after 575861 iterations
gotcha! 5514 reorders detected after 576289 iterations
gotcha! 5515 reorders detected after 578107 iterations
gotcha! 5516 reorders detected after 580264 iterations
...


#### 4.2.2 Intel x86-64 family reordering specification

This section first introduces what kinds of runtime reordering are allowed and what kinds are not, and then introduces some instructions critical for core/cache synchronization.

Most of the content is copied from Intel's developer manual volume 3 §8.2. I put it here not only for reference but also as a demo of a hardware memory model; x86-64 is the most popular processor family and can stand in for other CPUs.

Different CPU architectures have different hardware memory models; the modern Intel CPU families (Intel Core 2 Duo, Intel Atom, Intel Core Duo, Pentium 4, and P6) use a so-called strong hardware memory ordering model.

Most of the following specification is copied from Intel’s specification.

In a single-processor system for memory regions defined as write-back cacheable, the memory-ordering model respects the following principles (Note the memory-ordering principles for single-processor and multiple-processor systems are written from the perspective of software executing on the processor, where the term “processor” refers to a logical processor. For example, a physical processor supporting multiple cores and/or Intel Hyper-Threading Technology is treated as a multi-processor system.):

• Writes are not reordered with older reads.
• Writes to memory are not reordered with other writes, with the following exceptions:
  • streaming stores (writes) executed with the non-temporal move instructions (MOVNTI, MOVNTQ, MOVNTDQ, MOVNTPS, and MOVNTPD); and
  • string operations (see §8.2.4.1).
• No write to memory may be reordered with an execution of the CLFLUSH instruction; a write may be reordered with an execution of the CLFLUSHOPT instruction that flushes a cache line other than the one being written.1 Executions of the CLFLUSH instruction are not reordered with each other. Executions of CLFLUSHOPT that access different cache lines may be reordered with each other. An execution of CLFLUSHOPT may be reordered with an execution of CLFLUSH that accesses a different cache line.
• Reads may be reordered with older writes to different locations but not with older writes to the same location.
• Reads or writes cannot be reordered with I/O instructions, locked instructions, or serializing instructions.
• Reads cannot pass earlier LFENCE and MFENCE instructions.
• Writes and executions of CLFLUSH and CLFLUSHOPT cannot pass earlier LFENCE, SFENCE, and MFENCE instructions.
• LFENCE instructions cannot pass earlier reads.
• SFENCE instructions cannot pass earlier writes or executions of CLFLUSH and CLFLUSHOPT.
• MFENCE instructions cannot pass earlier reads, writes, or executions of CLFLUSH and CLFLUSHOPT.

In a multiple-processor system, the following ordering principles apply:

• Individual processors use the same ordering principles as in a single-processor system.
• Writes by a single processor are observed in the same order by all processors.
• Writes from an individual processor are NOT ordered with respect to the writes from other processors.
• Memory ordering obeys causality (memory ordering respects transitive visibility).
• Any two stores are seen in a consistent order by processors other than those performing the stores.
• Locked instructions have a total order.

Check §8.2.3 “Examples Illustrating the Memory-Ordering Principles” of Intel Software Developer Manuals Combined Volumes: 3a, 3b, 3c and 3d - System programming guide for detailed instruction reordering examples.

In summary, for simple store/load instructions (setting aside string instructions, which I have not covered), the only CPU reordering case is:

Loads May Be Reordered with Earlier Stores to Different Locations
Processor 0                      |         Processor 1
|
mov [_x], 1                      |        mov [_y], 1
mov r1, [_y]                     |        mov r2, [_x]

Initially x = y = 0
r1 = 0 and r2 = 0 is allowed


code 6. case that CPU may reorder the instructions

##### 4.2.2.1 Fence instruction

The SFENCE, LFENCE, and MFENCE instructions provide a performance-efficient way of ensuring load and store memory ordering between routines that produce weakly-ordered results and routines that consume that data. The functions of these instructions are as follows:

• SFENCE — Serializes all store (write) operations that occurred prior to the SFENCE instruction in the program instruction stream, but does not affect load operations.
• LFENCE — Serializes all load (read) operations that occurred prior to the LFENCE instruction in the program instruction stream, but does not affect store operations.
• MFENCE — Serializes all store and load operations that occurred prior to the MFENCE instruction in the program instruction stream.

That is to say, we can use one of the above fence instructions to explicitly prevent the CPU from reordering.

Processor 0                        |           Processor 1
|
mov [_x], 1                        |          mov [_y], 1
mfence        ;;;full memory fence |          mfence       ;;;full memory fence
mov r1, [_y]                       |          mov r2, [_x]

Initially x = y = 0
r1 = 0 and r2 = 0 is impossible


code 7. introduce mfence to prevent CPU reordering

The above code introduces a full memory fence, a relatively heavy operation, into code 6.

##### 4.2.2.2 Locked instruction

Locked instructions lock the bus and synchronize the CPU's caches:

Intel 64 and IA-32 processors provide a LOCK# signal that is asserted automatically during certain critical memory operations to lock the system bus or equivalent link. While this output signal is asserted, requests from other processors or bus agents for control of the bus are blocked.

which is usually what a read-modify-write (RMW) instruction does; the common case is a compare-and-swap operation:

// cpp code                            | ; asm code
std::atomic<int> ai = { 0 };           |
                                       |
void cas() {                           | cas():
  int a = 0;                           |    xor     eax, eax
  ai.compare_exchange_weak(a, 10086,   |    mov     edx, 10086
      std::memory_order_relaxed);      |    lock cmpxchg  DWORD PTR ai[rip], edx
}                                      |    ret


§8.1 "LOCKED ATOMIC OPERATIONS" of Intel's developer manual volume 3 describes how locked instructions work and how atomic operations are guaranteed.

## 5 Acquire and Release semantics

• acquire semantics: at the hardware level, no instruction after this point may be reordered before it when executing; at the software level, no operation after this semantic (fence) may be reordered before it
• release semantics: at the hardware level, all stores before this point have been "released" and are visible; at the software level, no operation before this semantic (fence) may be reordered after it

Acquire semantics is a property that can only apply to operations that read from shared memory, whether they are read-modify-write operations or plain loads. The operation is then considered a read-acquire. Acquire semantics prevent memory reordering of the read-acquire with any read or write operation that follows it in program order.

Acquire and release are paired. At the hardware level, this semantic (perhaps a single CPU instruction) guarantees that all memory loads after this "instruction" see the "latest" changes made by other cores (depending on the CPU's memory model); in other words, none of the instructions after this "instruction" can be reordered before it. On ARM this "instruction" is dmb, on PowerPC it is lwsync, and on x86-64 no separate instruction is needed.

prevent memory reordering of the read-acquire with any read or write operation that follows it in program order.

Release semantics is a property that can only apply to operations that write to shared memory, whether they are read-modify-write operations or plain stores. The operation is then considered a write-release. Release semantics prevent memory reordering of the write-release with any read or write operation that precedes it in program order.

Release semantics match the literal meaning. At the hardware level, once this semantic (perhaps a single CPU instruction) executes, all memory stores/loads that happen-before it are visible to all cores; more intuitively, none of the instructions before this "instruction" may be reordered after it. On ARM this "instruction" is dmb, on PowerPC it is lwsync, and on x86-64 no separate "instruction" is needed, because x86-64 processors have a strong hardware memory model. At the software level, if the code we write contains this semantic, the compiler must respect the program's literal meaning while optimizing and, when translating to assembly, ensure the following:

prevent memory reordering of the write-release with any read or write operation that precedes it in program order

Memory order exists at both the software and the hardware level, corresponding to compile-time reordering and runtime reordering respectively.

### 5.1 Sequential Consistency

Leslie Lamport (1979) defined a multiprocessing system as sequentially consistent if:

“…the results of any execution is the same as if the operations of all the processors were executed in some sequential order, and the operations of each individual processor appear in this sequence in the order specified by its program.”

This definition was one of the first statements on what constituted “correct” behavior for a symmetric multiprocessing (SMP) system. It has been restated in various forms, among them the following:

Sequential consistency requires that a shared memory multiprocessor appear to be a multiprogramming uniprocessor system to any program running on it.

Sequential consistency requires that:

1. All instructions are executed in order, resulting in a single total order
2. Every write operation becomes instantaneously visible throughout the system

As figure 6 shows (time runs vertically): consider two threads, where thread 1 must execute A and B and thread 2 must execute C and D, and each of A, B, C, D is atomic. In a parallel system that preserves sequential consistency, the final execution order is one of the interleavings like those on the left, but never anything like the rightmost one; moreover, every thread (core) observes one and the same final execution order; all of them agree.

figure 6. shows that what is sequential consistency and what is not

std::atomic_bool x, y;
std::atomic_int z;

void write_x() {
  x.store(true, std::memory_order_seq_cst); //2
}

void write_y() {
  y.store(true, std::memory_order_seq_cst); //3
}

void read_x_then_y() {
  while (!x.load(std::memory_order_seq_cst)) {}
  if (y.load(std::memory_order_seq_cst))
    ++z;
}

void read_y_then_x() {
  while (!y.load(std::memory_order_seq_cst)) {}
  if (x.load(std::memory_order_seq_cst))
    ++z;
}

int main() {
  x = false;
  y = false;
  z = 0;
  std::thread a(write_x), b(write_y), c(read_x_then_y), d(read_y_then_x);
  a.join(); b.join(); c.join(); d.join();
  assert(z.load() != 0); //1, z must be 1 or 2
}


code 8. demo of sequential consistency in C++

Acquire or release semantics guarantee atomicity while also restricting reordering within the local core, but (like relaxed) they do not synchronize across cores; they cannot constrain the order in which visibility propagates between cores, so these two semantics alone do not achieve sequential consistency.

seq_cst guarantees relaxed's atomicity and the in-core reordering restrictions of acquire+release, and in addition, when this core synchronizes with the others, it effectively gets "exclusive" use of the bus; obviously that synchronization takes relatively long.

load-acquire and store-release can only guarantee an order within one core, not a global order: relaxed, acquire and release guarantee atomicity plus the corresponding in-core ordering, but not the order in which other cores see this core's results. seq_cst synchronizes its results with the other cores, establishing a single global order.

In the figure below, the orders as understood by c and d conflict. To see why, simplify the model: a writes x=1 at the same time b writes y=1, and the on-chip bus is shared, which is just like concurrent operations on the same data; core3 sees x=1 before y=1, while core4 sees y=1 before x=1.

thread a         thread b          thread c         thread d
+-----+          +-----+           +-----+          +-----+
|core1|          |core2|           |core3|          |core4|
|     |          |     |           |     |          |     |
+-----+          +-----+           +-----+          +-----+
x=1              y=1              x==1?            y==1?
|                |               ^ ^              ^ ^
|                |               | |              | |
\________________|_______________/ |              | |
\               |                 |              | |
\              \_________________|______________/ |
\              \                |                |
\______________\_______________|________________/
\              |
\_____________/



thread a         thread b          thread c         thread d
+-----+          +-----+           +-----+          +-----+
|core1|          |core2|           |core3|          |core4|
|     |          |     |           |     |          |     |
+-----+          +-----+           +-----+          +-----+
x=1              y=1              x==1?            y==1?
|                |               ^ ^              ^ ^
|                |               | |              | |
\________________|_______________|_|______________/ |
|                 |                |
|                 |                |
\_________________|________________/


(This example holds for almost every language that supports an SC memory model, e.g. Rust.)

Sequential consistency can sometimes be a performance bottleneck, because it must ensure that operations are globally synchronized across all threads; some operations have to queue up, so throughput drops.

C++11's default memory order is sequential consistency, described in detail later.

Transitivity/causality: x and y are std::atomic, all variables initially zero.

Thread 1        Thread 2                 Thread 3
g = 1;          if(x == 1) y = 1;        if(y == 1) assert(g == 1);
x = 1;


It must be impossible for the assertion to fail; otherwise the system wouldn't be SC.
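
The same test can be written as a hedged C++ sketch with std::atomic (thread creation omitted; the comment states the guarantee):

    #include <atomic>
    #include <cassert>

    int g = 0;                    // plain, non-atomic
    std::atomic<int> x{0}, y{0};  // seq_cst by default

    void thread1() {
      g = 1;
      x.store(1);                 // seq_cst store
    }

    void thread2() {
      if (x.load() == 1)          // seq_cst load
        y.store(1);
    }

    void thread3() {
      if (y.load() == 1)
        assert(g == 1);           // must hold: SC makes visibility transitive
    }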

Total store order: x and y are std::atomic and initially zero.

Thread 1      Thread 2      Thread 3                 Thread 4
x = 1;        y = 1;        if (x==1 && y==0)        if (y==1 && x==0)
                              print("x first");        print("y first");


It must be impossible to print both messages; otherwise the system wouldn't be SC.

### 5.2 Out of thin air

// cpp                     | ;asm code
int v = 0;                 |
                           |
void foo(int set_v) {      | foo(int):
  if (set_v) {             |   mov     eax, 10086
    v = 10086;             |   test    edi, edi                ;test condition
  }                        |   cmove   eax, DWORD PTR v[rip]   ;load maybe
}                          |   mov     DWORD PTR v[rip], eax   ;store always
                           |   ret


code 7. out of thin air

The store to v is executed unconditionally, even when set_v is false. If another thread concurrently executes

v = 10010;


the invented store can overwrite that value "out of thin air".

note: gcc has fixed this issue since 4.4.7

The concept of sequential consistency is closely related to acquire and release semantics. We know that compilation and instruction execution within a single thread follow the basic rule of reordering; acquire and release semantics control the ordering within a single thread, and by pairing them, threads can establish a synchronizes-with relation.

### 5.3 Full Fence

A full fence is a bidirectional fence which has both acquire and release semantics.

Nothing goes up and nothing goes down.

Sequential consistency is usually implemented with a full fence.

## 6 Control instruction reordering

### 6.1 Low-level reordering constraints

In code 4 we inserted an inline assembly statement into the C++ code:

    asm volatile("" ::: "memory");

which constrains compile-time (compiler) reordering only. To constrain CPU reordering as well, a real fence instruction is needed:

    asm volatile("mfence" ::: "memory");


mfence is an Intel assembly instruction. The Fence instruction section introduced the memory fence instructions of modern Intel CPUs; to recap briefly:

• sfence, store fence: in the instruction stream, no store before the sfence may be reordered after it; loads before the sfence, however, may still be reordered after it
• mfence, "full fence": in the instruction stream, no store/load before the mfence may be reordered after it

// cpp                                  | ;asm
                                        |
extern int a, b;                        | bar():
                                        |      mov     eax, DWORD PTR b[rip]
void bar() {                            |      mov     DWORD PTR a[rip], eax
  a = b;                                |      mfence
  asm volatile("mfence" ::: "memory");  |      mov     DWORD PTR b[rip], 1
  b = 1;                                |      ret
}                                       |


code 9. constrains reordering with low-level fence instruction

### 6.2 Lock, Mutex

Locks/mutexes are a system-level mechanism ensuring that only one thread runs inside a critical section at a time, which "perfectly" prevents concurrency and data races.

Thanks to the cardinal rule of reordering (single-threaded behavior is preserved), no undefined behavior is introduced by reordering here, because only one thread at a time can be inside the critical section protected by the lock.

// cpp                        | ;asm
std::mutex mtx;               | bar():
void bar() {                  |   push    rbx
  {                           |   mov     ebx, OFFSET FLAT:_ZL28__gthrw___pth...
    lock_guard<mutex> l(mtx); |   test    rbx, rbx
    a = b;                    |   je      .L9
  }                           |   mov     edi, OFFSET FLAT:mtx
  b = 1;                      |   test    eax, eax
}                             |   jne     .L21
                              | .L9:
                              |   mov     eax, DWORD PTR b[rip]
                              |   mov     DWORD PTR a[rip], eax ;;;assign a;;;;
                              |   test    rbx, rbx
                              |   je      .L10
                              |   mov     edi, OFFSET FLAT:mtx
                              | .L10:
                              |   mov     DWORD PTR b[rip], 1   ;;;assign b;;;;
                              |   pop     rbx
                              |   ret
                              | .L21:
                              |   mov     edi, eax
                              |   call    std::__throw_system_error(int)


code 10. prevent reordering with lock, compiled with x86-64 gcc 8.2 -O3

BTW, a lock carries acquire and release semantics, usually both of them. The underlying implementation of locks/mutexes usually introduces low-level full memory fences, so a lock also works as a full fence at runtime: nothing can pass the lock.

Briefly speaking: lock() carries acquire semantics, unlock() carries release semantics.

std::mutex mtx;
std::unique_lock<std::mutex> lock(mtx, std::defer_lock); // not locked yet

lock.lock();    // acquire semantics: nothing following in the critical
                // section can be reordered above this lock()

// critical section
...

lock.unlock();  // release semantics: nothing prior in the critical
                // section can be reordered below this unlock()


### 6.3 Memory Fences

Memory fence and memory barrier mean the same thing (both translate to 内存屏障 in Chinese; 内存栅栏 doesn't roll off the tongue quite as well :smirk:); they are just different English names. People who read a lot of Linux kernel code may be more used to calling it a barrier, but I prefer memory fence, because the related assembly instructions are literally called fences.

## 7 Synchronizes-with relation

The following concepts describe the ordering of operations in a multi-threaded environment; we just need to know that:

• synchronizes-with: inter-thread; described by pairing release and acquire semantics
• sequenced-before: intra-thread ordering
• happens-before: inter-thread ordering

Michael Wong's slides show the idea of sequenced-before + memory-ordering implementing happens-before and synchronizes-with.

## 8 Memory Model of C++11

What is software memory model:

• A set of rules that describe allowable semantics for memory accesses on a computer program
• Defines the expected value or values returned by any loads in the program
• Determines when a program produces undefined behavior (e.g. load from uninitialized variables)
• Per Stroustrup (The C++ Programming Language, 4th Edition, §41.2), represents a contract between the implementers and the programmers to ensure that most programmers do not have to think about the details of modern computer hardware
• Critical component of concurrent programming

The default memory order is sequentially consistent, and memory order is a very important, maybe the most important, part of the C++ memory model.

### 8.1 Memory Order of C++11

The C++ standard says that (namespace std is omitted here):

• memory_order_relaxed: allow reordering, no explicit fence

Relaxed operation: there are no synchronization or ordering constraints imposed on other reads or writes, only this operation’s atomicity is guaranteed.

• memory_order_consume: keeps data dependencies only; reordering may still happen. It is weaker than acquire and can be replaced entirely with acquire; use of consume is discouraged.

A load operation with this memory order performs a consume operation on the affected memory location: no reads or writes in the current thread dependent on the value currently loaded can be reordered before this load. Writes to data-dependent variables in other threads that release the same atomic variable are visible in the current thread. On most platforms, this affects compiler optimizations only.

• memory_order_acquire: no memory access after this point may be reordered before it; the associated atomic variables can be accessed correctly

A load operation with this memory order performs the acquire operation on the affected memory location: no reads or writes in the current thread can be reordered before this load. All writes in other threads that release the same atomic variable are visible in the current thread.

• memory_order_release: no memory access before this point may be reordered after it; all memory writes preceding this point are visible to other threads that acquire the same atomic

A store operation with this memory order performs the release operation: no reads or writes in the current thread can be reordered after this store. All writes in the current thread are visible in other threads that acquire the same atomic variable (see Release-Acquire ordering below) and writes that carry a dependency into the atomic variable become visible in other threads that consume the same atomic.

• memory_order_acq_rel: acquire + release; works for RMW operations such as fetch_add

A read-modify-write(RMW) operation with this memory order is both an acquire operation and a release operation. No memory reads or writes in the current thread can be reordered before or after this store. All writes in other threads that release the same atomic variable are visible before the modification and the modification is visible in other threads that acquire the same atomic variable.

• memory_order_seq_cst: load acquire + store release + single total order

A load operation with this memory order performs an acquire operation, a store performs a release operation, and read-modify-write performs both an acquire operation and a release operation, plus a single total order exists in which all threads observe all modifications in the same order.
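
Before looking at the API, here is a hedged sketch of the most common pairing, release/acquire message passing; the names payload and ready are illustrative, not from any library:

    #include <atomic>
    #include <cassert>
    #include <thread>

    int payload = 0;                 // plain, non-atomic data
    std::atomic<bool> ready{false};

    void producer() {
      payload = 10086;                               // 1: write the data
      ready.store(true, std::memory_order_release);  // 2: publish it
    }

    void consumer() {
      while (!ready.load(std::memory_order_acquire)) {}  // 3: wait for 2
      assert(payload == 10086);  // never fires: 3 synchronizes-with 2,
                                 // so the write at 1 is visible here
    }

    int main() {
      std::thread t1(producer), t2(consumer);
      t1.join(); t2.join();
    }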

The definition of each memory ordering is very clear; they are meant to fulfill the C++ memory model, and they can be applied to the following function families (see <atomic> for more info):

std::atomic<T>::load()
std::atomic<T>::store()
std::atomic<T>::compare_exchange_weak() / compare_exchange_strong()
std::atomic<T>::exchange()
std::atomic_store()
std::atomic_compare_exchange_weak() / std::atomic_compare_exchange_strong()
std::atomic_exchange()


(from here on in this section, namespace std may be omitted in code snippet)

We can summarize the relative strength of the memory ordering constraints as follows:

              /----------> release ----------\
             /                                \
relaxed --->                                   ---> acq_rel ---> seq_cst
             \                                /
              \---> consume ---> acquire ----/

an arrow A -> B means constraint of B is stronger than A


And we should note that, by the C++ standard's definition of acquire and release semantics, memory_order_consume and memory_order_acquire do not make sense for atomic store operations, and memory_order_release does not make sense for atomic load operations.

atomic<int> A;

A.store(10086, memory_order_acquire); // does not make sense, may be undefined behavior
A.store(10086, memory_order_consume); // does not make sense, may be undefined behavior
A.load(memory_order_release);         // does not make sense, may be undefined behavior

A.store(10086, memory_order_relaxed); // store-relaxed operation
A.store(10086, memory_order_release); // store-release operation
A.store(10086, memory_order_acq_rel); // also invalid for a plain store (the acquire part is meaningless)
A.store(10086, memory_order_seq_cst); // SC store-release operation



memory_order_consume may be a little confusing: this memory order loosens the load-acquire constraint by considering only data dependencies. To explain it in detail, here is a code snippet:

extern int X;
void foo() {
  atomic<int> A {10086};
  int B = 10010;
  int C = 10000;

  X = A.load(memory_order_consume); // a "consume" load

  B = B + 1; // no data dependency on the consume load: may be reordered before it
  C = A.load(memory_order_relaxed) + 1; // cannot be reordered before the consume
                                        // load, because it depends on A
}


Note that every consume can be replaced with acquire, but not vice versa. The data dependencies are sometimes hard to figure out, so don't be clever: do not replace acquire with consume just because you think it "may" improve performance, unless you fully understand the consequences.

On most platforms consume only affects compiler optimizations; the generated code is almost the same as for acquire.

### 8.2 Standalone fence, stronger but slower

Standalone fences are described in section 29.8 of the working draft.

Fences can have acquire semantics, release semantics, or both. A fence with acquire semantics is called an acquire fence. A fence with release semantics is called a release fence.

A fence can be generated with std::atomic_thread_fence() and corresponding std::memory_order.

A fence with acquire semantics is called an acquire fence, std::atomic_thread_fence(std::memory_order_acquire). A fence with release semantics is called a release fence, std::atomic_thread_fence(std::memory_order_release).

Let’s analyse a code snippet:

extern int tmp;

tmp = 10086;
A.store(10010, std::memory_order_release);


is it equivalent to the following?

extern int tmp;

tmp = 10086;
std::atomic_thread_fence(std::memory_order_release); // standalone release fence
// can the store be reordered, i.e. float up above the fence?
A.store(10010, std::memory_order_relaxed);


There may be a misconception that the store to A with memory_order_relaxed can be reordered upwards past that standalone release fence, due to a misunderstanding of release operations:

Release semantics is a property that can only apply to operations that write to shared memory, whether they are read-modify-write operations or plain stores. The operation is then considered a write-release. Release semantics prevent memory reordering of the write-release with any read or write operation that precedes it in program order.

Release semantics guarantee that nothing preceding the write-release is reordered after it, but they cannot prevent operations that follow the write-release from floating up past it, even past the store tmp = 10086 before it. If a standalone release fence behaved the same way, there would be no point in having standalone fences at all.

Fortunately, the C++ standard disallows that reordering across a release fence, and the same holds for acquire fences.

N3337 29.8.2 says:

A release fence A synchronizes with an acquire fence B if there exist atomic operations X and Y, both operating on some atomic object M, such that A is sequenced before X, X modifies M, Y is sequenced before B, and Y reads the value written by X or a value written by any side effect in the hypothetical release sequence X would head if it were a release operation.

It may be a little hard to understand; to explain it, I drew this figure:

                              sync                Y (M.load)
A   ===^===^===^===^===^===  ------>    B   ===v===v===v===v===v===
        X (M.store)


The constraints/consequences are:

1. A synchronizes with B
2. X cannot float above A, i.e. X cannot be reordered before the fence. (X would be allowed to be reordered before A if A were a pure release operation instead of a fence.)
3. Y cannot sink below B, i.e. Y cannot be reordered after the fence. (Y would be allowed to be reordered past B if B were a pure acquire operation instead of a fence.)

In other words by cppreference:

atomic_thread_fence imposes stronger synchronization constraints than an atomic store operation with the same std::memory_order. While an atomic store-release operation prevents all preceding writes from moving past the store-release, an atomic_thread_fence with memory_order_release ordering prevents all preceding writes from moving past all subsequent stores.

Standalone atomic_thread_fence with memory_order_release or memory_order_acquire is stricter than a store-release or load-acquire operation.
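
As a hedged sketch of the fence-to-fence pairing described above (payload and flag are illustrative names; X is the relaxed store, Y the relaxed load):

    #include <atomic>
    #include <cassert>

    int payload = 0;
    std::atomic<int> flag{0};

    void writer() {
      payload = 42;                                        // plain write
      std::atomic_thread_fence(std::memory_order_release); // A: release fence
      flag.store(1, std::memory_order_relaxed);            // X: relaxed store
    }

    void reader() {
      while (flag.load(std::memory_order_relaxed) != 1) {} // Y: relaxed load
      std::atomic_thread_fence(std::memory_order_acquire); // B: acquire fence
      assert(payload == 42);  // guaranteed: fence A synchronizes with fence B
    }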

This post by Jeff Preshing discusses the example given previously in detail.

And we should notice that standalone fences are stronger than the corresponding operations need to be, so they are sub-optimal: a performance pessimization.

### 8.3 compare_exchange_weak or compare_exchange_strong?

Sometimes it may be confusing which one to use for CAS, compare_exchange_weak or compare_exchange_strong. As always, we C++ users should analyse the difference and then choose which one to use.

Here is my conclusion on this question:

• compare_exchange_weak may fail spuriously, i.e. even when the expected value matches, for system-level or hardware-level reasons
• compare_exchange_weak is "cheaper" than compare_exchange_strong: the strong one behaves as if it retries in a loop whenever the weak one would fail spuriously
• conventionally, which to use:
  • if there is a CAS loop (spin), use compare_exchange_weak

      while (!compare_exchange_weak(:::)) { ::: }

  • if there is a single test, use compare_exchange_strong

      if (compare_exchange_strong(:::)) { ::: }
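
As a hedged, concrete instance of the loop convention (the function and variable names are illustrative): atomically double a counter with compare_exchange_weak.

    #include <atomic>

    std::atomic<int> counter{0};

    void double_counter() {
      int expected = counter.load(std::memory_order_relaxed);
      // weak is fine inside a loop: a spurious failure just retries, and on
      // any failure expected is refreshed with the current value of counter
      while (!counter.compare_exchange_weak(expected, expected * 2)) {}
    }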


## 9 volatile

volatile in C++ is only a compile-time keyword: it prevents the compiler from optimizing or reordering accesses to the marked variable (e.g. eliminating a redundant read). It is a hint only; no atomicity is guaranteed at runtime.

A typical case

extern int a;
volatile int b = a; // load a
// do something that is not related to a
...
b = a; // load a again; this "redundant" statement won't be reordered or optimized away


volatile in Java or C# kind of means atomic: writes and reads of the marked variable are "atomic", so no partial write or read will be observed.

And also, note that

• in .NET and Java the keyword volatile ensures sequentially consistent memory order, somewhat like an atomic with the default memory order in C++
• C++ std::atomic defaults to sequential consistency and lets us weaken it; I did not find a way to do the same with volatile in .NET or Java

## 10 Code generation

This section introduces the code generation that implements C++ memory model on different types of processors.

They are compiled with gcc 8.2, 8.2, 6.3 and 5.4, with options -std=c++17 -O3. The gcc versions vary because they are the newest gcc versions available for the corresponding processors.

int a = 1;
int b = 2;
std::atomic<int> a1 {3};
extern int c;

void foo() {
  a = 1;
  b = 2;
  a1.compare_exchange_strong(a, 10086, std::memory_order_relaxed);
  a = c;
}


code. code snippet for code generation

| operation                                | x86-64       | power64le                                   |
| ---------------------------------------- | ------------ | ------------------------------------------- |
| atomic.store(relaxed)                    | mov          | li;stw                                      |
| atomic.exchange(seq_cst)                 | xchg         | sync;lwarx;stwcx;bne*;isync                 |
| atomic.exchange(acq_rel)                 | xchg         | lwsync;lwarx;stwcx;bne*;isync               |
| atomic.exchange(acquire)                 | xchg         | lwarx;stwcx;bne*;isync                      |
| atomic.exchange(consume)                 | xchg         | lwarx;stwcx;bne*;isync                      |
| atomic.exchange(release)                 | xchg         | lwsync;lwarx;stwcx;bne*                     |
| atomic.exchange(relaxed)                 | xchg         | lwarx;stwcx;bne*                            |
| atomic.compare_exchange_strong(seq_cst)  | lock cmpxchg | sync;lwarx;cmpwi;bne*;stwcx.;bne*;isync     |
| atomic.compare_exchange_strong(acq_rel)  | lock cmpxchg | lwsync;lwarx;cmpwi;bne*;stwcx.;bne*;isync   |
| atomic.compare_exchange_strong(acquire)  | lock cmpxchg | lwarx;cmpwi;bne*;stwcx.;bne*;isync          |
| atomic.compare_exchange_strong(consume)  | lock cmpxchg | lwarx;cmpwi;bne*;stwcx.;bne*;isync          |
| atomic.compare_exchange_strong(release)  | lock cmpxchg | lwsync;lwarx;cmpwi;bne*;stwcx.;bne*         |
| atomic.compare_exchange_strong(relaxed)  | lock cmpxchg | lwarx;cmpwi;bne*;stwcx.;bne*                |

table. code generation part 1

fetch_add/fetch_sub generate lock add or lock xadd on x86-64 depending on whether the return value is used; if it is not used, a simple lock add is enough.
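
A hedged illustration of that difference (assuming gcc/clang-style code generation; the function names are made up):

    #include <atomic>

    std::atomic<int> n{0};

    void touch() { n.fetch_add(1); }         // result unused: a lock add suffices
    int  next()  { return n.fetch_add(1); }  // result used: needs lock xadd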

| operation                                | arm64                      | mips64                            |
| ---------------------------------------- | -------------------------- | --------------------------------- |
| atomic.store(seq_cst)                    | stlr                       | li;sync                           |
| atomic.store(acq_rel)                    | stlr                       | li;sync                           |
| atomic.store(release)                    | stlr                       | li;sync                           |
| atomic.store(relaxed)                    | str                        | li                                |
| atomic.exchange(seq_cst)                 | ldaxr;staxr;cbnz*          | sync;li;sc;beq*;sync              |
| atomic.exchange(acq_rel)                 | ldaxr;staxr;cbnz*          | sync;li;sc;beq*;sync              |
| atomic.exchange(acquire)                 | ldaxr;stxr;cbnz*           | li;sc;beq*;sync                   |
| atomic.exchange(consume)                 | ldaxr;stxr;cbnz*           | li;sc;beq*;sync                   |
| atomic.exchange(release)                 | ldxr;staxr;cbnz*           | sync;li;sc;beq*                   |
| atomic.exchange(relaxed)                 | ldxr;stxr;cbnz*            | li;sc;beq*                        |
| atomic.compare_exchange_strong(seq_cst)  | ldaxr;cmp;bne;stlxr;cbnz*  | sync;ll;bne*;li;sc;beq*;nop;sync  |
| atomic.compare_exchange_strong(acq_rel)  | ldaxr;cmp;bne;stlxr;cbnz*  | sync;ll;bne*;li;sc;beq*;nop;sync  |
| atomic.compare_exchange_strong(acquire)  | ldaxr;cmp;bne;stxr;cbnz*   | ll;bne*;li;sc;beq*;nop;sync       |
| atomic.compare_exchange_strong(consume)  | ldaxr;cmp;bne;stxr;cbnz*   | ll;bne*;li;sc;beq*;nop;sync       |
| atomic.compare_exchange_strong(release)  | ldxr;cmp;bne;stlxr;cbnz*   | sync;ll;bne*;li;sc;beq*;nop       |
| atomic.compare_exchange_strong(relaxed)  | ldxr;cmp;bne;stxr;cbnz*    | ll;bne*;li;sc;beq*;nop            |

table. code generation part 2

* means there is a loop

x86-64's code generation is much simpler than the rest thanks to its strong hardware memory model, but there is no free lunch: the lock prefix locks the whole bus, which is very expensive.

With stronger constraints, comes less flexibility for optimization.

There is no free lunch for ARM, POWER and MIPS either: although they are weaker and more flexible than x86-64, it is easier to write buggy code when doing optimization.

## 11 Lock-free programming

Definition, examples of lock-free programming, and usage of atomics and memory ordering for lock-free.

Lock-free programming is mostly doing tricks with CAS; we need to consider all of the corner cases that can arise under concurrency.

In my view, the key factor in making a lock-free program correct is ensuring that what you operate on is exactly what you intended to operate on, even though it may keep changing all the time.

### 11.1 Definition - what is lock-free

The Lock-Free property guarantees that at least some thread is doing progress on its work. In theory this means that a method may take an infinite amount of operations to complete, but in practice it takes a short amount, otherwise it won’t be much useful.

Definition: A method is Lock-Free if it guarantees that infinitely often some thread calling this method finishes in a finite number of steps.

The Wait-Free property guarantees that any given thread provided with a time-slice will be able to make some progress and eventually complete. It guarantees that the number of steps is finite, but in practice it may be extremely large and depend on the number of active threads, which is why there aren’t many practical Wait-Free data structures.

Definition: A method is Wait-Free if it guarantees that every call finishes its execution in a finite number of steps.

The above is quoted from this article, which explains it very well.

To understand it in a simple way, I rank the strengths of lock-freedom and their properties as follows:

• Not lock-free
  • explicit lock/mutex
• Lock-free, "someone makes progress"
  • no explicit lock/mutex
  • every step taken achieves global progress (for some sensible definition of progress)
  • guarantees system-wide throughput
• Wait-free, "no one ever waits"
  • every operation completes in a bounded number of steps no matter what else is going on
  • guarantees system-wide throughput and starvation-freedom
  • all wait-free algorithms are lock-free, but not vice versa

When we say that a particular data structure is Lock-Free it means that all its operations are Lock-Free, or better.

We should analyse a program to see what kind of "lock-free" it is: a program without explicit locks is lock-free, but performance varies between implementations.

#### 11.1.1 Lock-freedom examples

This is a piece of code that is lock-free but not wait-free.

size_t sleep_micro_sec = 5;
std::atomic<bool> locked_flag_ = ATOMIC_VAR_INIT(false);

void foo() {
  bool exp = false;
  while (!locked_flag_.compare_exchange_strong(exp, true)) {
    exp = false;
    if (sleep_micro_sec == 0) {
      std::this_thread::yield();  // back off (illustrative)
    } else {
      std::this_thread::sleep_for(
          std::chrono::microseconds(sleep_micro_sec));  // back off (illustrative)
    }
  }
}


This is a piece of code that is (usually) wait-free under low contention. We say "usually" because it may spin for a long time under high contention.

template<typename T>
void slist<T>::push_front(const T& t) {
  auto p = new Node;
  p->t = t;
  p->next = first.load();
  // spin until success
  while (!first.compare_exchange_weak(p->next, p)) {}
}


### 11.2 Double-checked lock pattern, DCLP

GOF’s singleton pattern, lazy evaluation

Singleton* Singleton::instance = nullptr;
std::mutex Singleton::mtx;

Singleton* Singleton::get_instance() {
  if (instance == nullptr) {
    lock_guard<mutex> lock(mtx);
    if (instance == nullptr) {
      instance = new Singleton(); // reordering or visibility of instance issue
    }
  }
  return instance;
}


This may not work, because other threads may see a partially constructed instance if Singleton() is inlined explicitly or optimized by the compiler.

The writes that initialize the Singleton object and the write to the instance field can be done or perceived out of order.

Even if the compiler does not reorder those writes, on a multiprocessor the processor or the memory system may reorder those writes, as perceived by a thread running on another processor.

To rescue that, use atomic for variable instance:

atomic<Singleton*> Singleton::instance {nullptr};


It is sequentially consistent by default: all writes in Singleton() must be done and visible before the assignment to instance, with no reordering past the assignment.

There are other approaches too, like Scott Meyers', easy and simple:

Singleton* Singleton::get_instance() {
  static Singleton a;
  return &a;
}


Here’s another variant, a lock free approach with an extra flag.

atomic<Singleton*> Singleton::instance {nullptr};
atomic<bool> Singleton::create {false};

Singleton* Singleton::get_instance() {
  if (!create.exchange(true))                 // try to win the construction
    instance = new Singleton();               // construct
  else
    while (instance.load() == nullptr) { }    // spin until constructed
  return instance;
}


The above code uses the default memory order, sequential consistency, which is too strong for this snippet.

atomic<Singleton*> Singleton::instance {nullptr};
atomic<bool> Singleton::create {false};

Singleton* Singleton::get_instance() {
  if (!create.exchange(true, memory_order_acq_rel))        // strong enough?
    instance.store(new Singleton(), memory_order_release); // construct
  else
    while (instance.load(memory_order_acquire) == nullptr) { } // spin
  return instance;
}


#### 11.2.1 Lazy evaluation

The right way to do lazy evaluation is std::call_once:

Singleton* Singleton::get_instance() {
  static std::once_flag create;
  std::call_once(create, [] { instance = new Singleton(); });
  return instance;
}


### 11.3 Lock-free queue

“Brute-force” lock-free singly-linked list, no pop.

Fixed size vector with atomic cursor and semaphore, one-to-many producer-consumer pattern.

template <typename T>
class slist {
public:
  slist();
  ~slist();
  void push_front(const T& t);
  void pop_front();
private:
  struct Node {
    T t;
    Node* next;
  };
  std::atomic<Node*> first {nullptr}; // head of the list
};


The constructor doesn't need to do anything, but the destructor needs to recycle the internal nodes.

Destructor

template<typename T>
slist<T>::~slist() {
  auto p = first.load();
  while (p != nullptr) {
    auto unlinked = p;
    p = p->next;
    delete unlinked;
  }
}


push_front() is easy with atomic weapons: to deal with concurrent insertions, we just need to CAS-prepend to exactly the head we observed.

template<typename T>
void slist<T>::push_front(const T& t) {
  auto p = new Node;
  p->t = t;
  p->next = first.load();
  // spin until success
  while (!first.compare_exchange_weak(p->next, p)) {}
}


Is pop_front() that easy too?

template<typename T>
void slist<T>::pop_front() {
  auto p = first.load();
  // spin until success
  while (p != nullptr && !first.compare_exchange_weak(p, p->next)) {}
  // problems:
  // ABA, and dereferencing the deleted p via p->next in other threads
  delete p;
}


It seems that this works fine, but there are problems:

1. The ABA problem
2. Dereferencing a deleted pointer

case for ABA problem:

Initial state:

  head -> +---+    +---+    +---+
          | A | -> | B | -> | C | -> ...
          +---+    +---+    +---+

Modify the list (time flows downwards):

  t1:  thread 1 reads head == A and A->next == B,
       and is about to CAS(A, B) ...
  t2:  thread 2 pops A:        CAS(A, B)
       thread 2 pops B:        CAS(B, C), and frees B
  t3:  thread 2 pushes A back: CAS(C, A)   // the node at address A is reused
  t4:  thread 1 resumes: CAS(A, B) succeeds, because head == A again

After t4:

  head -> +===+    +---+
          | B |    | C | -> ...
          +===+    +---+
  head now points at node B, which has been freed, and the re-pushed A
  (whose next is C) has been unlinked by mistake


case for the dereferencing-a-deleted-pointer problem (thread 2 dereferences a dangling pointer deleted by thread 1):

// for convenience of explanation, pop_front() can also be written as
template<typename T>
void slist<T>::pop_front() {
  auto p = first.load();
  Node* n = nullptr;
  do {
    if (p == nullptr) return;
    n = p->next; // dereference p, which is meant to be the head
  } while (!first.compare_exchange_weak(p, n));
  delete p;
}

// think of 2 concurrent pops; the vertical direction shows how time elapses

 |  void slist<T>::pop_front() {         |  void slist<T>::pop_front() {
 |    do {                               |    do {
 |      n = p->next;                     |
 |    } while (p != nullptr && !CAS);    |
 |    delete p;                          |
 |  }                                    |      n = p->next; // BANG!
 |                                       |    } while (p != nullptr && !CAS);
 v                                       |    delete p;
                                         |  }


There are some (common, and surely more) hints for solving the ABA problem.

We need to solve the ABA issue: two nodes with the same address, but different identities (existing at different times).

1. Option 1: Use lazy garbage collection.
   • Solves the problem. Memory can't be reused while pointers to it exist.
   • But: Not an option (yet) in portable C++ code, and destruction of nodes becomes nondeterministic.
2. Option 2: Use reference counting (garbage collection).
   • Solves the problem in cases without cycles. Again, avoids memory reuse.
3. Option 3: Make each pointer unique by appending a serial number, and increment the serial number each time it's set.
   • This way we can always distinguish between A and A'.
   • But: Requires an atomic compare-and-swap on a value that's larger than the size of a pointer. Not available on all hardware & bit-nesses.
4. Option 4: Use hazard pointers.
   • Maged Michael and Andrei Alexandrescu have covered this in detail.
   • But: It's very intricate. Tread with caution.

It is getting complicated and hard. And of course, if we don't recycle memory there is no ABA problem at all! :smirk: Think about it: if you push a lot and pop much less, simply never recycling anything will work well.

There is an attempt at options 1, 2 and 4 at https://stidio.github.io/2017/01/cpp11_atomic_and_lockfree_program/. It is a bit more involved; I will discuss it in a future post (a lock-free application series?) if I have time and this topic still looks interesting.

For option 3, here is an attempt; apparently it has a hardware limitation. The idea is to use a separate head structure that carries a version.

Comparing the content of the whole head, instead of just the raw address (which may change without notice), is the essence of solving the ABA problem:
template <typename T>
class slist {
    ...
    struct Node {
        T t;
        Node* next;
    };
    struct HeadNode { // 128 bits: a version counter plus the pointer
        size_t ver = 0;
        Node* ptr = nullptr;
    };
    std::atomic<HeadNode> h; // this may be an issue: requires a 128-bit CAS
};

template<typename T>
void slist<T>::push_front(const T& t) {
    auto p = new Node;
    p->t = t;
    HeadNode old_h = h.load();
    HeadNode new_h;
    do {
        p->next = old_h.ptr;
        new_h = { old_h.ver + 1, p }; // bump the version on every update
    } while (!h.compare_exchange_weak(old_h, new_h));
}

template<typename T>
void slist<T>::pop_front() {
    HeadNode old_h = h.load();
    HeadNode new_h;
    do { // does not dereference the pointer to data
        if (old_h.ptr == nullptr) return;
        new_h = { old_h.ver + 1, old_h.ptr->next };
    } while (!h.compare_exchange_weak(old_h, new_h));
    delete old_h.ptr;
}


It resolves the problems I mentioned before:

• every time the head is updated, the version of the head increases, so the ABA problem is resolved
• we don't dereference any pointers to the data in the linked list, so the problem of dereferencing a deleted pointer is resolved

However, it's not perfect: it may not work lock-free on all hardware. Because of the "big" HeadNode structure, some hardware may not have a wide enough atomic primitive (here 128 bits) to complete the CAS within one instruction. The simplest way to check may be calling std::atomic_is_lock_free().

The following shows that atomicity is hardware dependent (note the fallback to the __atomic_load runtime call for the larger struct).

gcc8.2 -std=c++17 -O3

// cpp
struct alignas(16) BigStruct16 {
    size_t ver;
    int64_t data;
};
struct alignas(64) BigStruct64 {
    size_t ver;
    int64_t data;
};

void test_big_struct_lock_free1() {
    std::atomic<BigStruct16> a16;
    auto c = a16.load();
}

void test_big_struct_lock_free2() {
    std::atomic<BigStruct64> a64;
    auto c = a64.load();
}

; asm (abbreviated)
test_big_struct_lock_free1():
        sub     rsp, 24
        mov     esi, 5
        mov     rdi, rsp
        call    __atomic_load_16      ; 16-byte atomic load, lock-free on x86-64
        add     rsp, 24
        ret

test_big_struct_lock_free2():
        push    rbp
        mov     ecx, 5
        mov     edi, 32
        mov     rbp, rsp
        and     rsp, -32
        sub     rsp, 64
        lea     rdx, [rsp+32]
        mov     rsi, rsp
        call    __atomic_load         ; generic (locking) runtime call
        leave
        ret

// on most modern Intel x86-64 processors there is a 128-bit CAS
// (cmpxchg16b), so BigStruct16 is lock free

##### 11.3.1.1 Conclusion

Making this list pop a single element is not that easy. To get rid of ABA, we may also simply pop all the elements at once:

template<typename T>
auto slist<T>::pop_all() {
    return head.exchange(nullptr); // detach the whole chain in one atomic step
}
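Since the whole chain is detached in one atomic exchange, the popped nodes become private to the calling thread, and no CAS loop or deferred reclamation is needed to walk and free them. A hypothetical usage (process() is a placeholder):

auto* p = list.pop_all();
while (p) {
    auto* next = p->next;
    process(p->t); // consume the value
    delete p;      // safe: no other thread can reach this node any more
    p = next;
}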


Nice, we are finally done.

(Wait, we didn’t talk about which memory order should be applied to these atomic operations yet?)

### 11.4 shared_ptr ref_count

A bug mentioned by Herb Sutter in his talk "atomic<> Weapons".

It is very interesting and helps us understand memory ordering much better.

First, we give the correct (fixed) code for shared_ptr’s reference counting.

Increment

control_block_ptr = other->control_block_ptr;
control_block_ptr->refs.fetch_add(1, memory_order_relaxed); // the actual increment


Decrement

if (control_block_ptr->refs
      .fetch_sub(1, memory_order_acq_rel) == 1) { // key: fetch_sub returns the previous value
    delete control_block_ptr;
}


The increment can be relaxed (it is not a publish operation; no one depends on this increment). The decrement must be acq_rel (both acquire and release are necessary, and probably sufficient).
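Putting the two together, a minimal sketch of such a control block (illustrative names, not the real std::shared_ptr internals):

#include <atomic>

struct control_block {
    std::atomic<long> refs{1};
};

void add_ref(control_block* cb) {
    // relaxed is enough: the increment publishes nothing,
    // and nobody acts on its result
    cb->refs.fetch_add(1, std::memory_order_relaxed);
}

void release_ref(control_block* cb) {
    // fetch_sub returns the previous value; 1 means we were the last owner
    if (cb->refs.fetch_sub(1, std::memory_order_acq_rel) == 1) {
        delete cb;
    }
}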

Now let’s analyse the buggy version, where fetch_sub uses pure memory_order_release.

This was VS2012’s bug on the ARM architecture; x86-64 is much stronger, so the same code happens to be OK there.

if (control_block_ptr->refs
      .fetch_sub(1, memory_order_release) == 1) { // buggy
    delete control_block_ptr;
}


e.g., with the wrong memory order (pure release), consider 2 threads that both need to decrement:

// Thread 1, 2->1                        |    // Thread 2, 1->0
// A: use of object                      |    :::
if (control_block_ptr->refs              |    if (control_block_ptr->refs
      .fetch_sub(                        |          .fetch_sub(
        1, memory_order_release) == 1) { |            1, memory_order_release) == 1) {
  // branch not taken                    |      delete control_block_ptr; // B
}                                        |    }


The explanation given by Herb Sutter

• No acquire/release => no coherent communication guarantee that thread 2 sees thread 1’s writes in the right order. To thread 2, line A could appear to move below thread 1’s decrement even though it’s a release(!).
• Release doesn’t keep line B below decrement in thread 2.

The second reason is obvious: because it is only a release operation, the delete below it is free to float up across it.

The first one is harder to understand at first; it raises 2 questions:

1. how can the decrement from 2 to 1 go before line A?
2. how can line A come after line B?

To figure it out, we need to look at the transformations that the compiler and CPU may perform under pure release semantics.

Transformation 1, break thread 1 into pieces (the separate load and store are for demonstration; they may actually be a single instruction on some hardware)

// Thread 1, 2->1

// A: use of object

decrement ref_cnt 2->1
store ref_cnt
"release fence"
if (ref_cnt == 0) { // ref_cnt in register
// branch not taken
}


Transformation 2, reordering happens to the decrement of ref_cnt (it moves above line A)

// Thread 1, 2->1

decrement ref_cnt 2->1 // reordered
store ref_cnt

// A: use of object

"release fence"
if (ref_cnt == 0) { // ref_cnt in register
// branch not taken
}


Transformation 3, break thread 2 into pieces

// Thread 2, 1->0
:::
decrement ref_cnt 1->0
store ref_cnt
"release fence"
if (ref_cnt == 0) { // ref_cnt in register
delete control_block_ptr; // B
}


Transformation 4, reordering happens to that delete (the guarded delete moves above the "release fence")

// Thread 2, 1->0
:::
decrement ref_cnt 1->0
store ref_cnt
if (ref_cnt == 0) { // ref_cnt in register
delete control_block_ptr; // B
}
"release fence"


Transformations 2 and 4 are both legal forms according to the standard:

N3337 29.3.2

An atomic operation A that performs a release operation on an atomic object M synchronizes with an atomic operation B that performs an acquire operation on M and takes its value from any side effect in the release sequence headed by A.

in other words, by cppreference:

All writes in the current thread are visible in other threads that acquire the same atomic variable, and writes that carry a dependency into the atomic variable become visible in other threads that consume the same atomic variable.
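To make the quoted guarantee concrete, here is the classic message-passing pattern, a self-contained sketch in which the release store synchronizes-with the acquire load:

#include <atomic>
#include <cassert>
#include <thread>

std::atomic<bool> ready{false};
int data = 0;

int main() {
    std::thread producer([] {
        data = 42;                                    // plain write
        ready.store(true, std::memory_order_release); // publish
    });
    std::thread consumer([] {
        while (!ready.load(std::memory_order_acquire)) // pairs with the release
            ;
        assert(data == 42); // guaranteed by the synchronizes-with relation
    });
    producer.join();
    consumer.join();
}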

Contrast transformation 2 with transformation 4; the vertical direction is the execution order.

// Thread 1, 2->1                      |   // Thread 2, 1->0
decrement ref_cnt 2->1                 |
store ref_cnt                          |
                                       |   decrement ref_cnt 1->0
                                       |   store ref_cnt
                                       |   if (ref_cnt == 0) { // reg
                                       |     delete control_block_ptr; // B
// A: use of object                    |   }
"release fence"                        |   "release fence"
if (ref_cnt == 0) { // reg             |
  // branch not taken                  |
}                                      |


figure: pure release does not keep the synchronizes-with relation

Thread 1’s use of the object comes after thread 2’s delete of that object. Bang! This is the situation Herb describes: when 2 threads try to decrement, there actually is a data dependency; one must "wait" for, i.e. synchronize-with, the other, and with release alone there is no such synchronizes-with semantics.

With synchronizes-with in place, i.e. with fetch_sub(memory_order_acq_rel), no transformation like the figure above is possible.

With the semantics of "acquire + release", "A: use of object" cannot move down into or past the "critical section", and "delete control_block_ptr" cannot move up into or past it; problem solved.

// Thread 1, 2->1                        |        // Thread 2, 1->0
// A: use of object (cannot move down)   |
"acquire & release fence"                |        :::
decrement ref_cnt 2->1                   |
store ref_cnt                 synchronizes-with
"acquire & release fence"   ----------------->    "acquire & release fence"
                                         |        decrement ref_cnt 1->0
                                         |        store ref_cnt
                                         |        "acquire & release fence"
if (ref_cnt == 0) {                      |        if (ref_cnt == 0) { // cannot move up
  // branch not taken                    |          delete control_block_ptr; // B
}                                        |        }


Note: putting the terms "release fence" and "acquire fence" in double quotes is to distinguish them from the standalone fences in C++11 (std::atomic_thread_fence).
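As an aside, some real implementations (Boost's sp_counted_base, for instance) keep the decrement as a pure release but insert a standalone acquire fence on the destruction path, which supplies exactly the missing "acquire half" discussed above. A sketch:

if (control_block_ptr->refs
      .fetch_sub(1, std::memory_order_release) == 1) {
    // the standalone fence orders everything the last decrementer
    // observed before the delete below
    std::atomic_thread_fence(std::memory_order_acquire);
    delete control_block_ptr;
}

This can be cheaper than acq_rel on every decrement, because the acquire part is paid only by the thread that actually destroys the object.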

### 11.5 rwlock of glibc

From version 2.28 on, the rwlock in glibc is implemented with a mostly lock-free technique, which may be faster than the shared lock provided by the C++ standard library (since C++14), although some low-level locks are still required.

This is a very difficult lock-free algorithm; if you understand the purpose of every atomic operation in it, you can fairly claim to understand lock-free programming well. :smirk:

glibc 2.28 release https://www.sourceware.org/ml/libc-alpha/2018-08/msg00003.html

nptl/pthread_rwlock_common.c contains most of the implementation.
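For intuition only, here is a toy reader-writer spinlock built from the same ingredients (CAS loops plus acquire/release); it is vastly simpler than glibc's algorithm and not a substitute for it:

#include <atomic>

class toy_rwlock {
    static constexpr unsigned WRITER = 1u << 31; // high bit: a writer holds the lock
    std::atomic<unsigned> state{0};              // low bits: reader count
public:
    void rdlock() {
        for (;;) {
            unsigned s = state.load(std::memory_order_relaxed);
            if (!(s & WRITER) &&
                state.compare_exchange_weak(s, s + 1,
                                            std::memory_order_acquire,
                                            std::memory_order_relaxed))
                return;
        }
    }
    void rdunlock() { state.fetch_sub(1, std::memory_order_release); }

    void wrlock() {
        for (;;) {
            unsigned expected = 0; // no readers and no writer
            if (state.compare_exchange_weak(expected, WRITER,
                                            std::memory_order_acquire,
                                            std::memory_order_relaxed))
                return;
        }
    }
    void wrunlock() { state.fetch_and(~WRITER, std::memory_order_release); }
};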

## 12 Recap and conclusion

This post first introduced the hardware side: caches and cache lines, which are among the sources of out-of-order behaviour. It then discussed runtime and compile-time reordering introduced by optimization, and how to control that reordering, which is where hardware and software memory models come in. After introducing memory models, we talked a lot about C++ memory ordering and the memory model of C++11. Finally we brought up lock-free programming, the real application of all of the above.

Hardware, compilers and C++ are complicated and powerful; we may not know everything they do for us, but we need to know the principles. That is what this post tries to do.

Modern C++, especially C++11, abstracts the memory model very well; the concepts and portability of <atomic> help C++ users a lot. It is very hard to get that right, but the C++ standard committee actually did it. Many thanks to them!

We talked about lock-free programming last, and found that doing general tasks with lock-free techniques is tough: we have to make a lot of trade-offs and be very, very careful when writing lock-free code.

Finally,

With Great Power Comes Lots Of Fun!

## 13 Useful resources

### 13.1 Static materials

Jeff Preshing’s posts in the "lock-free programming" series. If you are new to lock-free programming or software/hardware memory ordering, read them in the given order; they may well enlighten you greatly.

And you are welcome to discuss any of them with me.

Intel Software Developer Manuals Combined Volumes: 3a, 3b, 3c and 3d - System programming guide

Volume 3 §8.1 and §8.2 are an equally important reference: they give the detailed reordering specification for Intel x86-64 family processors, which is worth reading.

what is the benefit of instruction reordering at compile time and runtime

The above materials cover the hardware level of memory ordering.

double-checked locking is broken for java

This post shows and analyses the broken DCLP for Java at compile time; the constructor is inlined by javac.

However, I do not see anything wrong with gcc: the generated assembly code is just fine, and the memory order is correct… Maybe explicit inlining or a compiler optimization that inlines is needed; it is a complicated case.

C++ working draft

Here are the C++ Standard Draft Sources; check them for the latest C++ working draft.

C++ working draft N3337

This is the working draft closest to the published C++11 standard.

This mailing list thread about conditional access to memory covers the discussion of "out-of-thin-air" values, and how compiler optimization of such conditional accesses may affect sequential consistency.

The C++ Programming Language 4th Edition by Bjarne Stroustrup

Chapter 41 talks about concurrency, memory model and atomic, which is really helpful for understanding these concepts.

The purpose and definition of a memory model are given in §41.1.

obstruction-freedom

Lock-Free and Wait-Free, definition and examples

An article that puts the definitions of blocking, lock-free, wait-free and obstruction-free all together:

• Blocking: 1. Blocking, 2. Starvation-Free
• Obstruction-Free: 3. Obstruction-Free
• Lock-Free: 4. Lock-Free (LF)
• Wait-Free: 5. Wait-Free (WF), 6. Wait-Free Bounded (WFB), 7. Wait-Free Population Oblivious (WFPO)

rust memory ordering

Introduces memory ordering in Rust.

lock-free stack written in C11

This is a versioning solution to the ABA problem of a lock-free singly-linked list.

lock-free singly-linked list written with C++11

This is an attempt to implement a lock-free singly-linked list using the hazard-pointer and reference-counting techniques.

lock-free data structure lib in C

Learn from others! This lock-free lib is written in C; it may lack portability.

Some discussion on store buffers.

A primer on memory consistency and cache coherence

C++ Concurrency in Action, 2nd Edition

### 13.2 Videos and talks

Herb Sutter’s talk about lock-free programming: part 1 introduces some fundamentals, and part 2 is mainly about concrete examples, such as a lock-free singly-linked list with push and pop, together with some analysis.

Herb Sutter’s talk about atomics: part 1 is mainly about the fundamentals: what memory orders are and why they exist, what acquire and release semantics are, and what the compiler does for optimizations related to compile-time instruction reordering.

Part 2 is mainly about the optimizations the compiler performs around memory order. In this part Herb shows the emitted assembly code for different CPU architectures under different circumstances, which helps in understanding the software acquire and release semantics at the hardware level.

CppCon 2014: Jeff Preshing “How Ubisoft Develops Games for Multicore - Before and After C++11”

Jeff’s talk about the atomic library of C++11.

ACCU 2017 - Atomic’s memory orders, what for? - Frank Birbacher

This talk is mainly an explanation of the "synchronizes-with" relation.

The C++ Memory Model - Valentin Ziegler @ Meeting C++ 2014

This talk is about the memory model of C++11; the first half explains "Sequential Consistency" in the C++ memory model very well, while the second half is not as informative.

CppCon 2015: Michael Wong “C++11/14/17 atomics and memory model”

This talk introduces the memory model of C++11/14/17 at a high level. The speaker introduces many concepts and definitions behind the design of the C++11/14/17 memory model, like:

1. sequential consistency
2. what a memory model is
3. synchronizes-with, sequenced-before, happens-before
4. etc.

It’s worth watching, though the content arrangement seems a little messy.