Spring 2025 CSCI 8980: Introduction to LLM Systems
Instructor: Zirui “Ray” Liu
Time: Spring 2025, M/W 4:00–5:15 PM
Location: Amundson Hall 158
Course Description
Recent progress in Artificial Intelligence has been largely driven by advances in large language models (LLMs) and other generative methods. The success of LLMs rests on three key factors: the scale of data, the size of models, and the computational resources available. For example, Llama 3, a cutting-edge LLM with 405 billion parameters, was trained on over 15 trillion tokens using 16,000 H100 GPUs. Training, serving, fine-tuning, and evaluating such models therefore demand sophisticated engineering practices that leverage modern hardware and software infrastructure. Building scalable systems for LLMs is essential for further advancing AI capabilities. In this course, students will learn the design principles of good machine learning systems for LLMs. Specifically, the course covers three topics: (1) the components of modern transformer-based language models; (2) GPU programming; and (3) how to train and deploy LLMs in a scalable way.
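As a taste of topic (1), below is a minimal sketch of scaled dot-product attention, the core operation inside the multi-head attention covered in Week 3. The function name and tensor shapes here are illustrative assumptions for this page, not course-provided code.

```python
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(q, k, v):
    # q, k, v: (batch, seq_len, head_dim) — shapes assumed for illustration
    d = q.size(-1)
    # Similarity of each query with every key, scaled by sqrt(head_dim)
    scores = q @ k.transpose(-2, -1) / d**0.5   # (batch, seq_len, seq_len)
    # Normalize so each query's attention weights sum to 1
    weights = F.softmax(scores, dim=-1)
    # Output is a weighted sum of the values
    return weights @ v                          # (batch, seq_len, head_dim)

# Tiny smoke test with random tensors
q = k = v = torch.randn(2, 4, 8)
out = scaled_dot_product_attention(q, k, v)
print(out.shape)  # torch.Size([2, 4, 8])
```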
Grading Policy
The grading policy is subject to minor changes.
The course will include five short quizzes and a final project. The final project can be completed individually or in groups of up to three students. Final-project presentations will take place during the last three weeks of the semester and will include a Q&A session. For the final project, students must select a research paper focused on MLSys, critically analyze it to identify its limitations or areas for improvement, and, based on this analysis, propose and develop a solution that addresses the identified issues or introduces an innovative idea to enhance the research.
Acknowledgement
Much of the course content is inspired by the excellent materials from CMU's Large Language Model Systems course and CMU's Deep Learning Systems course.
Course Schedule (tentative)
The course schedule is subject to change.
Week | Topic | Papers & Materials | Slides | Quiz |
---|---|---|---|---|
Week 1 | Course overview | - | Slide | - |
Week 2 | Automatic Differentiation – Reverse-Mode Automatic Differentiation | Wikipedia article on Automatic Differentiation; Andrej Karpathy's micrograd | Slide | - |
Week 2 | Automatic Differentiation – Computation Graph | PyTorch's blog on the computation graph; Andrej Karpathy's micrograd | - | - |
Week 3 | Transformers – Multi-Head Attention, Layer Normalization, Feed-Forward Network | Sasha Rush's notebook: The Annotated Transformer; Andrej Karpathy's minGPT | Slide | - |
Week 3 | Llama's changes over the original Transformer – RMSNorm, SwiGLU activation & gated MLP | The official Llama 1 report; PyTorch docs on RMSNorm; Noam Shazeer's report: GLU Variants Improve Transformer | - | - |
Week 4 | Llama's changes over the original Transformer – Rotary Positional Embedding (RoPE) | EleutherAI's blog on Rotary Positional Embeddings; the original RoPE paper | Slide | - |
Week 4 | Pretraining Data Curation | The official Llama 3 report; Dolma corpus report | Slide | - |
Week 5 | Post-Training Overview | TULU report | Slide | - |
Week 5 | GPU Programming Basics | CUDA C++ Programming Guide | Slide | - |
Week 6 | Overview of Parallelism – Data Parallelism, Pipeline Parallelism, Tensor Parallelism | HuggingFace's blog on multi-GPU training; GPipe | Slide | - |
Week 6 | ZeRO & Fully Sharded Data Parallel – Model & Optimizer State Sharding | ZeRO optimizer; PyTorch team's FSDP tutorial | Slide | - |
Week 7 | Guest Lecture 1 – Byron Hsu from xAI | - | - | - |
Week 7 | Guest Lecture 2 – Guanchu Wang from Rice | - | - | - |
Week 8 | Spring Break | - | - | - |
Week 9 | Guest Lecture 3 – Yuke Wang from AWS/Rice | - | - | - |
Week 9 | Grading Policy + Performance Engineering | Horace He's post: Making Deep Learning Go Brrrr | Slide | - |
Week 10 | Inference Workload Overview + Continuous Batching | NVIDIA's blog on inference optimization | Slide | Quiz on Automatic Differentiation (on 3/24) |
Week 10 | Inference Workload Overview + KV Cache | - | Slide | - |
Week 11 | Course Review | - | - | - |
Week 11 | Paged Attention | vLLM repo | - | Quiz on Positional Embedding (on 4/2) |
Week 12 | Flash-Attention | Flash-Attention repo | Slide | - |
Week 12 | Quantization | GPTQ; SmoothQuant; AWQ | Slide | - |
Week 13 | Presentation Policy + Mixture of Experts | DeepSpeed-MoE; DeepSeek-MoE | Slide | Quiz on Parallelism (on 4/14) |
Week 13 | Diffusion | DDPM; DDIM | Slide | - |