Spring 2025 CSCI 8980: Introduction to LLM Systems
Instructor: Zirui “Ray” Liu
Time: Spring 2025, M/W 4:00–5:15 PM
Location: Amundson Hall 158
Course Description
Recent progress in Artificial Intelligence has been largely driven by advances in large language models (LLMs) and other generative methods. The success of LLMs rests on three key factors: the scale of data, the size of models, and the computational resources available. For example, Llama 3, a cutting-edge LLM with 405 billion parameters, was trained on over 15 trillion tokens using 16,000 H100 GPUs. Training, serving, fine-tuning, and evaluating such models therefore demand sophisticated engineering practices that leverage modern hardware and software infrastructure. Building scalable systems for LLMs is essential for further advancing AI capabilities. In this course, students will learn the design principles of good machine learning systems for LLMs. Specifically, the course covers three topics: (1) the components of modern transformer-based language models; (2) GPU programming; and (3) how to train and deploy LLMs in a scalable way.
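As a taste of topic (1), below is a minimal sketch of scaled dot-product attention, the core operation inside the multi-head attention covered in Week 3. The function name and tensor shapes here are illustrative assumptions for this page, not course-provided code.

```python
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(q, k, v):
    # q, k, v: (batch, seq_len, head_dim) — shapes assumed for illustration
    d = q.size(-1)
    # Similarity of each query with every key, scaled by sqrt(head_dim)
    scores = q @ k.transpose(-2, -1) / d**0.5   # (batch, seq_len, seq_len)
    # Normalize so each query's attention weights sum to 1
    weights = F.softmax(scores, dim=-1)
    # Output is a weighted sum of the values
    return weights @ v                          # (batch, seq_len, head_dim)

# Tiny smoke test with random tensors
q = k = v = torch.randn(2, 4, 8)
out = scaled_dot_product_attention(q, k, v)
print(out.shape)  # torch.Size([2, 4, 8])
```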
Grading Policy
The grading policy is subject to minor changes.
The course will include five short quizzes and a final project. The final project can be completed individually or in groups of up to three students. Final-project presentations will take place during the last three weeks of the semester and will include a Q&A session. For the final project, students must select a research paper focused on MLSys, critically analyze it to identify its limitations or areas for improvement, and, based on this analysis, propose and develop a solution that addresses the identified issues or introduces an innovative idea to enhance the research.
Acknowledgement
Much of the course content is inspired by the excellent materials from CMU's Large Language Model Systems course and CMU's Deep Learning Systems course.
Course Schedule (tentative)
The course schedule is subject to change.
Week | Topic | Papers & Materials | Slides | Quiz |
---|---|---|---|---|
Week 1 | Course overview | - | Slide | - |
Week 2 | Automatic Differentiation – Reverse-Mode Automatic Differentiation | Wikipedia article on Automatic Differentiation; Andrej Karpathy's micrograd | Slide | - |
Week 2 | Automatic Differentiation – Computation Graph | PyTorch's blog on the computation graph; Andrej Karpathy's micrograd | - | - |
Week 3 | Transformers – Multi-Head Attention, Layer Normalization, Feed-Forward Network | Sasha Rush's notebook: The Annotated Transformer; Andrej Karpathy's minGPT | Slide | - |
Week 3 | Llama's changes over the original Transformer – RMSNorm, SwiGLU activation & gated MLP | The official Llama 1 report; PyTorch docs on RMSNorm; Noam Shazeer's report: GLU Variants Improve Transformer | - | - |
Week 4 | Llama's changes over the original Transformer – Rotary Positional Embedding (RoPE) | EleutherAI's blog on Rotary Positional Embeddings; the original RoPE paper | Slide | - |
Week 4 | Pretraining Data Curation | The official Llama 3 report; Dolma corpus report | Slide | - |
Week 5 | Post-Training Overview | TULU report | Slide | - |
Week 5 | GPU Programming Basics | CUDA C++ Programming Guide | Slide | - |
Week 6 | Overview of Parallelism – Data Parallelism, Pipeline Parallelism, Tensor Parallelism | HuggingFace's blog on multi-GPU training; GPipe | Slide | - |
Week 6 | ZeRO & Fully Sharded Data Parallel – Model & Optimizer State Sharding | ZeRO optimizer; PyTorch team's FSDP tutorial | Slide | - |
Week 7 | Guest Lecture 1 – Byron Hsu from xAI | - | - | - |
Week 7 | Guest Lecture 2 – Guanchu Wang from Rice | - | - | - |
Week 8 | Spring Break | - | - | - |
Week 9 | Guest Lecture 3 – Yuke Wang from AWS/Rice | - | - | - |
Week 9 | Grading Policy + Performance Engineering | Horace He's post: Making Deep Learning Go Brrrr | Slide | - |
Week 10 | Inference Workload Overview + Continuous Batching | NVIDIA's blog on inference optimization | Slide | Quiz on Automatic Differentiation (on 3/24) |
Week 10 | Inference Workload Overview + KV Cache | - | Slide | - |
Week 11 | Course Review | - | - | - |
Week 11 | Paged Attention | vLLM repo | - | Quiz on Positional Embedding (on 4/2) |
Week 12 | Flash-Attention | Flash-Attention repo | Slide | - |
Week 12 | Quantization | GPTQ; SmoothQuant; AWQ | Slide | - |
Week 13 | Presentation Policy + Mixture of Experts | DeepSpeed-MoE; DeepSeek-MoE | Slide | Quiz on Parallelism (on 4/14) |
Week 13 | Diffusion | DDPM; DDIM | Slide | - |