University of Minnesota
Special Topics in Distributed Systems

Reading List (Tentative)

Note: Some papers are available through the ACM or IEEE Digital Library. These can be accessed for free from within the campus network, or from off campus via the UMN VPN. If you can’t access them, please let me know.

More papers may be added to the list later as new conference and journal papers are published. Some new papers may not be available online immediately, but links will be added when available.

Papers marked (Short) are short papers.

Background Papers

  1. (Data-Parallel Computing) MapReduce: Simplified Data Processing on Large Clusters. Jeffrey Dean and Sanjay Ghemawat. OSDI 2004.

  2. (Data-Parallel Computing) Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing. Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauley, Michael Franklin, Scott Shenker and Ion Stoica. NSDI 2012.

  3. (Stream Processing) Storm@Twitter. Toshniwal et al. SIGMOD'14.

  4. (Stream Processing) The Dataflow Model: A Practical Approach to Balancing Correctness, Latency, and Cost in Massive-Scale, Unbounded, Out-of-Order Data Processing. Tyler Akidau et al., VLDB’15.

  5. (Distributed Machine Learning) Scaling Distributed Machine Learning with the Parameter Server. Mu Li, David G. Andersen, Jun Woo Park, Alexander J. Smola, Amr Ahmed, Vanja Josifovski, James Long, Eugene J. Shekita, and Bor-Yiing Su. OSDI'14.

Geo-distributed Analytics

  1. AWStream: Adaptive Wide-Area Streaming Analytics. Ben Zhang et al., SIGCOMM'18.
  2. Multi-Query Optimization in Wide-Area Streaming Analytics. Albert Jonathan et al., SoCC’18.
  3. Bohr: Similarity Aware Geo-distributed Data Analytics. Hangyu Li et al., CoNEXT’18.
  4. Dynamic and Decentralized Global Analytics via Machine Learning, Hao Wang et al., SoCC'18.
  5. Siphon: Expediting Inter-Datacenter Coflows in Wide-Area Data Analytics, Shuhao Liu et al., ATC'18.
  6. Yugong: Geo-distributed data and job placement at scale, Yuzhen Huang et al., VLDB'19.

OS and Databases

  1. Automatic Database Management System Tuning Through Large-scale Machine Learning, Dana Van Aken et al., SIGMOD 2017
  2. The Case for Learned Index Structures, Tim Kraska et al., SIGMOD 2018
  3. ApproxJoin: Approximate Distributed Joins, Do Le Quoc et al., SoCC'18
  4. (Short) Driving Cache Replacement with ML-based LeCaR, Giuseppe Vietri et al., HotStorage'18
  5. (Short) Neural Trees: Using Neural Networks as an Alternative to Binary Comparison in Classical Search Trees, Douglas Santry, HotStorage'20
  6. (Short) Virtual Address Translation via Learned Page Table Indexes, Artemiy Margaritov et al., MLSys: Workshop on ML for Systems, NIPS 2018
  7. (Short) When is the Cache Warm? Manufacturing a Rule of Thumb, Lei Zhang et al. HotCloud'20

Video Streaming and Analytics

  1. Chameleon: Video Analytics at Scale via Adaptive Configurations and Cross-Camera Correlations, Junchen Jiang et al., SIGCOMM'18
  2. Mainstream: Dynamic Stem-Sharing for Multi-Tenant Video Processing, Angela H. Jiang et al., ATC’18
  3. Learning in situ: a randomized experiment in video streaming, Francis Y. Yan et al., NSDI'20
  4. Server-driven video streaming for Deep learning inference, Kuntai Du et al., SIGCOMM’20

Distributed ML

  1. Optimus: An Efficient Dynamic Resource Scheduler for Deep Learning Clusters, Yanghua Peng et al., EuroSys'18
  2. Gandiva: Introspective Cluster Scheduling for Deep Learning, Wencong Xiao et al., OSDI'18
  3. TIFL: A Tier-based Federated Learning System, Zheng Chai et al., HPDC'20
  4. Resource Elasticity in Distributed Deep Learning, Andrew Or et al., MLSys'20
  5. The Non-IID Data Quagmire of Decentralized Machine Learning, Kevin Hsieh et al., ICML'20


  1. Exploiting a Natural Network Effect for Scalable, Fine-grained Clock Synchronization, Yilong Geng et al., NSDI'18
  2. Taiji: managing global user traffic for large-scale internet services at the edge, David Chou et al., SOSP'19
  3. Is Big Data Performance Reproducible in Modern Cloud Networks?, Alexandru Uta et al., NSDI'20
  4. Learning Relaxed Belady for Content Distribution Network Caching, Zhenyu Song et al., NSDI’20

Data Processing Systems

  1. Model-free Control for Distributed Stream Data Processing using Deep Reinforcement Learning, Teng Li et al., VLDB'18
  2. Learning scheduling algorithms for data processing clusters, Hongzi Mao et al., SIGCOMM'19
  3. Autopilot: workload autoscaling at Google, Krzysztof Rzadca et al., EuroSys'20
  4. AutoToken: Predicting Peak Parallelism for Big Data Analytics at Microsoft, Rathijit Sen et al., VLDB'20

Whither Data-Aware Systems?

  1. AutoSys: The Design and Operation of Learning-Augmented Systems, Chieh-Jan Mike Liang et al., ATC'20
  2. Is Big Data Performance Reproducible in Modern Cloud Networks?, Alexandru Uta et al., NSDI'20
  3. Interpreting Deep learning-based networking systems, Zili Meng et al., SIGCOMM'20
  4. (Short) Seer: Leveraging Big Data to Navigate the Increasing Complexity of Cloud Debugging, Yu Gan et al., HotCloud'18
  5. (Short) Software-defined Software: A Perspective of Machine Learning-based Software Production, Rubao Lee, ICDCS'18