AI System School
System for Machine Learning, LLM (Large Language Model), GenAI (Generative AI)
Updates:
- Video Tutorials [YouTube] [bilibili] [小红书]
- We are preparing a new website [Lets Go AI] for this repo!!!
Path to System for AI [Whitepaper You Must Read]
A curated list of research in machine learning systems, with links to code where available. A team now maintains this project. Pull requests are very welcome; please use our template.
System for AI (Ordered by Category)
ML / DL Infra
- Data Processing
- Training System
- Inference System
- Machine Learning Infrastructure
LLM Infra
Domain-Specific Infra
- Video System
- AutoML System
- Edge AI
- GNN System
- Federated Learning System
- Deep Reinforcement Learning System
System for ML/LLM Conference
Conference
- OSDI
- SOSP
- SIGCOMM
- NSDI
- MLSys
- ATC
- EuroSys
- Middleware
- SoCC
- TinyML
General Resources
- Survey
- Book
- Video
- Course
- Blog
Survey
- Toward Highly Available, Intelligent Cloud and ML Systems [Slide]
- A curated list of awesome System Designing articles, videos and resources for distributed computing, AKA Big Data. [GitHub]
- awesome-production-machine-learning: A curated list of awesome open source libraries to deploy, monitor, version and scale your machine learning [GitHub]
- Opportunities and Challenges Of Machine Learning Accelerators In Production [Paper]
- Ananthanarayanan, Rajagopal, et al. 2019 USENIX Conference on Operational Machine Learning (OpML 19).
- How (and How Not) to Write a Good Systems Paper [Advice]
- Applied machine learning at Facebook: a datacenter infrastructure perspective [Paper]
- Hazelwood, Kim, et al. (HPCA 2018)
- Infrastructure for Usable Machine Learning: The Stanford DAWN Project
- Bailis, Peter, Kunle Olukotun, Christopher Ré, and Matei Zaharia. (preprint 2017)
- Hidden technical debt in machine learning systems [Paper]
- Sculley, David, et al. (NIPS 2015)
- End-to-end arguments in system design [Paper]
- Saltzer, Jerome H., David P. Reed, and David D. Clark.
- System Design for Large Scale Machine Learning [Thesis]
- Deep Learning Inference in Facebook Data Centers: Characterization, Performance Optimizations and Hardware Implications [Paper]
- Park, Jongsoo, Maxim Naumov, Protonu Basu et al. arXiv 2018
- Summary: This paper characterizes DL models and then presents new design principles for DL hardware.
- A Berkeley View of Systems Challenges for AI [Paper]
Book
- Computer Architecture: A Quantitative Approach [Must read]
- Distributed Machine Learning Patterns [Website]
- Streaming Systems [Book]
- Kubernetes in Action (start to read) [Book]
- Machine Learning Systems: Designs that scale [Website]
- Trust in Machine Learning [Website]
- Automated Machine Learning in Action [Website]
Video
- ScaledML 2020: Learn from the best minds in the machine learning community. [Video]
- Jeff Dean: "Achieving Rapid Response Times in Large Online Services" Keynote - Velocity 2014 [YouTube]
- From Research to Production with PyTorch [Video]
- Introduction to Microservices, Docker, and Kubernetes [YouTube]
- ICML Keynote: Lessons Learned from Helping 200,000 non-ML experts use ML [Video]
- Adaptive & Multitask Learning Systems [Website]
- System thinking. A TED talk. [YouTube]
- Flexible systems are the next frontier of machine learning. Jeff Dean [YouTube]
- Is It Time to Rewrite the Operating System in Rust? [YouTube]
- InfoQ: AI, ML and Data Engineering [YouTube]
- Netflix: Human-centric Machine Learning Infrastructure [InfoQ]
- SysML 2019 [YouTube]
- ScaledML 2019: David Patterson, Ion Stoica, Dawn Song and so on [YouTube]
- ScaledML 2018: Jeff Dean, Ion Stoica, Yangqing Jia and so on [YouTube] [Slides]
- A New Golden Age for Computer Architecture History, Challenges, and Opportunities. David Patterson [YouTube]
- How to Have a Bad Career. David Patterson (I am a big fan) [YouTube]
- SysML 18: Perspectives and Challenges. Michael Jordan [YouTube]
- SysML 18: Systems and Machine Learning Symbiosis. Jeff Dean [YouTube]
- AutoML Basics: Automated Machine Learning in Action. Qingquan Song, Haifeng Jin, Xia Hu [YouTube]
Course
- CS692 Seminar: Systems for Machine Learning, Machine Learning for Systems [GitHub]
- Topics in Networks: Machine Learning for Networking and Systems, Autumn 2019 [Course Website]
- CS6465: Emerging Cloud Technologies and Systems Challenges [Cornell]
- CS294: AI For Systems and Systems For AI. [UC Berkeley Spring] (Strong Recommendation) [Machine Learning Systems (Fall 2019)]
- CSE 599W: System for ML. [Chen Tianqi] [University of Washington]
- EECS 598: Systems for AI (W'21). [Mosharaf Chowdhury] [Systems for AI (W'21)]
- Tutorial code on how to build your own Deep Learning System in 2k Lines [GitHub]
- CSE 291F: Advanced Data Analytics and ML Systems. [UCSD]
- CSci 8980: Machine Learning in Computer Systems [University of Minnesota, Twin Cities]
- Mu Li (MXNet, Parameter Server): Introduction to Deep Learning [Best DL Course I think] [Book]
- 10-605: Machine Learning with Large Datasets. [CMU]
- CS 329S: Machine Learning Systems Design. [Stanford]
Blog
- Parallelizing across multiple CPU/GPUs to speed up deep learning inference at the edge [Amazon Blog]
- Building Robust Production-Ready Deep Learning Vision Models in Minutes [Blog]
- Deploy Machine Learning Models with Keras, FastAPI, Redis and Docker [Blog]
- How to Deploy a Machine Learning Model -- Creating a production-ready API using FastAPI + Uvicorn [Blog] [GitHub]
- Deploying a Machine Learning Model as a REST API [Blog]
- Continuous Delivery for Machine Learning [Blog]
- Kubernetes CheatSheets In A4 [GitHub]
- A Gentle Introduction to Kubernetes [Blog]
- Train and Deploy Machine Learning Model With Web Interface - Docker, PyTorch & Flask [GitHub]
- Learning Kubernetes, The Chinese Taoist Way [GitHub]
- Data pipelines, Luigi, Airflow: everything you need to know [Blog]
- The Deep Learning Toolset — An Overview [Blog]
- Summary of CSE 599W: Systems for ML [Chinese Blog]
- Polyaxon, Argo and Seldon for Model Training, Package and Deployment in Kubernetes [Blog]
- Overview of the different approaches to putting Machine Learning (ML) models in production [Blog]
- Being a Data Scientist does not make you a Software Engineer [Part1]; Architecting a Machine Learning Pipeline [Part2]
- Model Serving in PyTorch [Blog]
- Machine learning in Netflix [Medium]
- SciPy Conference Materials (slides, repo) [GitHub]
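- After Spark, UC Berkeley launches Ray, a next-generation AI compute engine [Blog]
- What knowledge structure do you need to understand or pursue research on machine learning / deep learning systems? [Zhihu]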
- Learn Kubernetes in Under 3 Hours: A Detailed Guide to Orchestrating Containers [Blog] [GitHub]
- data-engineer-roadmap: Learning from multiple companies in Silicon Valley. Netflix, Facebook, Google, Startups [GitHub]
- TensorFlow Serving + Docker + Tornado: fast production-grade deployment of machine learning models [Blog]
- Colossal-AI: A Unified Deep Learning System for Big Model Era [Blog] [GitHub]
- Data Engineer Roadmap [Scaler Blogs]