Reading List

0. Three topics are chosen around multiprocessors: deterministic replay, cache and main memory;
1. Words in blue starting with '↳' under each paper title are a brief summary of the paper;
2. [Italic red words within brackets] are my classification labels for the paper;
3. Three selected papers are listed for each topic;
4. The main factors in choosing the papers are content, novelty, venue, and citation count.

Deterministic Replay in Multiprocessors
On multiprocessors, multithreaded programs execute non-deterministically. To support bug reproduction and fault tolerance, deterministic replay records enough execution events during a run to reproduce that execution later.
A "Flight Data Recorder" for enabling full-system multiprocessor deterministic replay [hardware, offline, one-run]
↳Processor-based offline full-system deterministic replay of multiprocessor executions. FDR is a practical low-overhead hardware recorder for cache-coherent multiprocessors.
ISCA'2003 [cite'353] pdf
PRES: Probabilistic replay with execution sketching on multiprocessors [software, offline, several-run]
↳A software-only solution that reproduces concurrency bugs on multiprocessors over multiple replay runs, which greatly lowers recording overhead.
SOSP'2009 [cite'162] pdf
Respec: Efficient online multiprocessor replay via speculation and external determinism [software, online, one-run]
↳The first system to support low-overhead, online deterministic replay on multiprocessors without hardware support.
ASPLOS'2010 [cite'99] pdf


Cache Management in CMP
CMPs often execute a wide variety of applications with differing requirements. To maximize performance, the cache should be configured according to workload characteristics.
Token coherence: Decoupling performance and correctness [coherence]
↳A new coherence framework that decouples a protocol's performance from its correctness.
ISCA'2003 [cite'259] pdf
ASR: Adaptive selective replication for CMP caches [partition wrt average access time]
↳Dynamically monitors workload behaviors and then adjusts the replication level to minimize average access time.
MICRO'2006 [cite'193] pdf
Ubik: Efficient cache sharing with strict QoS for latency-critical workloads [partition wrt QoS]
↳Proposes Ubik, a dynamic partitioning technique that predicts and exploits the transient behavior of latency-critical workloads to maintain their tail latency while maximizing the cache space available to batch applications.
ASPLOS'2014 [cite'3] pdf


Main Memory in CMP
CMPs have limited off-chip bandwidth, for which the running applications compete. The resulting interference may harm both overall system performance and individual application performance.
Scaling the bandwidth wall: Challenges in and avenues for CMP scaling [general study]
↳Developed an analytical model to study the bandwidth wall problems for CMP systems.
ISCA'2009 [cite'142] pdf
Understanding how off-chip memory bandwidth partitioning in chip multiprocessors affects system performance [partition]
↳Constructed an analytical model to understand how bandwidth partitioning affects performance, and how bandwidth and cache partitioning interact with one another.
HPCA'2010 [cite'62] pdf
Thread cluster memory scheduling: Exploiting differences in memory access behavior [scheduling]
↳A new memory scheduling algorithm that addresses system throughput and fairness separately, with the goal of achieving the best of both.
MICRO'2010 [cite'175] pdf