Reading List

0. Three multiprocessor topics are covered: deterministic replay, cache management, and main memory;

1. The words in blue starting with '↳' under each paper title are a brief summary of the paper;

2. [Italic red words within brackets] are my classification labels for the paper;

3. Three selected papers are listed for each topic;

4. The main factors in choosing the papers are content, novelty, venue, citation count, etc.

Deterministic Replay in Multiprocessors
On multiprocessors, multithreaded programs execute non-deterministically. Deterministic replay was proposed for bug reproduction and fault tolerance: it records enough execution events during a run to replay that execution later.
A "Flight Data Recorder" for enabling full-system multiprocessor deterministic replay [hardware, offline, one-run] ↳Processor-based offline full-system deterministic replay of multiprocessor executions. FDR is a practical low-overhead hardware recorder for cache-coherent multiprocessors.	ISCA'2003 [cite'353]	pdf
PRES: Probabilistic replay with execution sketching on multiprocessors [software, offline, several-run] ↳A software-only solution that reproduces concurrency bugs on multiprocessors over multiple replay runs, which greatly lowers recording overhead.	SOSP'2009 [cite'162]	pdf
Respec: Efficient online multiprocessor replay via speculation and external determinism [software, online, one-run] ↳The first system to support low-overhead, online deterministic replay on multiprocessors without hardware support.	ASPLOS'2010 [cite'99]	pdf
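
To make the record-then-replay idea above concrete, here is a minimal Python sketch. It is not the mechanism of FDR, PRES, or Respec; it is a toy model under stated assumptions: logical "threads" are simulated in a single loop, the only non-deterministic event is which thread runs next, and the function name `run` and its parameters are hypothetical. Recording the sequence of scheduling decisions is enough to reproduce the same final state, including a lost-update race.

```python
import random


def run(num_threads, steps_per_thread, schedule=None, seed=None):
    """Run a toy multithreaded program that increments a shared counter.

    Each increment is a non-atomic read step followed by a write step, so
    the final value depends on how the threads interleave.
    If `schedule` is None, the interleaving is chosen at random and logged
    (record mode); otherwise the given log is followed exactly (replay mode).
    Returns (final_value, schedule_used).
    """
    rng = random.Random(seed)
    shared = 0
    local = [0] * num_threads                      # per-thread register
    remaining = [2 * steps_per_thread] * num_threads
    log = []
    replay = iter(schedule) if schedule is not None else None

    while any(remaining):
        if replay is not None:
            tid = next(replay)                     # replay the recorded event
        else:
            runnable = [t for t in range(num_threads) if remaining[t]]
            tid = rng.choice(runnable)             # non-deterministic choice
        log.append(tid)                            # record the event

        if remaining[tid] % 2 == 0:                # read step
            local[tid] = shared
        else:                                      # write step
            shared = local[tid] + 1
        remaining[tid] -= 1

    return shared, log


if __name__ == "__main__":
    value, recorded = run(num_threads=3, steps_per_thread=4)
    replayed_value, _ = run(3, 4, schedule=recorded)
    print(value, replayed_value)                   # the two values match
```

Real systems differ mainly in which events they log and at what cost, which is what the hardware/software and one-run/several-run labels above classify.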

Cache Management in CMP
Chip multiprocessors (CMPs) often execute a wide variety of applications with differing requirements. To maximize performance, the cache should be configured according to workload characteristics.
Token coherence: Decoupling performance and correctness [coherence] ↳A new coherence framework that separates the mechanisms ensuring correctness from those optimizing performance.	ISCA'2003 [cite'259]	pdf
ASR: Adaptive selective replication for CMP caches [partition wrt average access time] ↳Dynamically monitors workload behaviors and then adjusts the replication level to minimize average access time.	MICRO'2006 [cite'193]	pdf
Ubik: Efficient cache sharing with strict QoS for latency-critical workloads [partition wrt QoS] ↳Proposed Ubik, a dynamic partitioning technique that predicts and exploits the transient behavior of latency-critical workloads to maintain their tail latency while maximizing the cache space available to batch applications.	ASPLOS'2014 [cite'3]	pdf

Main Memory in CMP
CMPs have limited off-chip bandwidth, for which the running applications compete. The resulting interference may harm both overall system performance and the performance of individual applications.
Scaling the bandwidth wall: Challenges in and avenues for CMP scaling [general study] ↳Developed an analytical model to study the bandwidth wall problems for CMP systems.	ISCA'2009 [cite'142]	pdf
Understanding how off-chip memory bandwidth partitioning in chip multiprocessors affects system performance [partition] ↳Constructed an analytical model to understand how bandwidth partitioning affects performance, and how bandwidth and cache partitioning interact with one another.	HPCA'2010 [cite'62]	pdf
Thread cluster memory scheduling: Exploiting differences in memory access behavior [scheduling] ↳A new memory scheduling algorithm that addresses system throughput and fairness separately, with the goal of achieving the best of both.	MICRO'2010 [cite'175]	pdf