Spring 2025 CSCI 8980: Introduction to LLM Systems
Instructor: Zirui “Ray” Liu
Time: Spring 2025, M/W 4:00 PM - 5:15 PM
Course Description
Recent progress in Artificial Intelligence has been largely driven by advances in large language models (LLMs) and other generative methods. The success of LLMs rests on three key factors: the scale of data, the size of models, and the computational resources available. For example, Llama 3, a cutting-edge LLM with 405 billion parameters, was trained on over 15 trillion tokens using 16,000 H100 GPUs. Training, serving, fine-tuning, and evaluating such models therefore demand sophisticated engineering practices that leverage modern hardware and software infrastructure. Building scalable systems for LLMs is essential for further advancing AI capabilities. In this course, students will learn the design principles of good machine learning systems for LLMs. Specifically, the course covers three main topics: (1) the components of modern transformer-based language models; (2) GPU programming fundamentals; and (3) how to train and deploy LLMs in a scalable way.
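A rough back-of-envelope calculation makes this scale concrete. The sketch below uses the common ≈ 6 × parameters × tokens approximation for total training FLOPs, together with an assumed GPU peak throughput and utilization; those two numbers are illustrative assumptions, not figures taken from the Llama 3 report.

```python
# Back-of-envelope estimate of Llama 3 405B training compute.
# Uses the common ~6 * params * tokens approximation for training FLOPs.
# Peak throughput and utilization below are illustrative assumptions.

params = 405e9               # model parameters
tokens = 15e12               # training tokens
total_flops = 6 * params * tokens            # ~3.6e25 FLOPs

h100_peak_flops = 989e12     # approx. H100 BF16 dense peak, FLOP/s
num_gpus = 16_000
mfu = 0.40                   # assumed Model FLOPs Utilization

seconds = total_flops / (num_gpus * h100_peak_flops * mfu)
print(f"total training compute: {total_flops:.2e} FLOPs")
print(f"estimated wall-clock time: {seconds / 86_400:.0f} days")
```

Even under these optimistic assumptions, the run takes on the order of two months of continuous time on thousands of GPUs, which is why the systems topics in this course matter.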
Grading Policy
The grading policy is subject to minor change.
The course will include five short quizzes and a final project. The final project can be completed individually or collaboratively in groups of up to three students. Presentations for the final project will occur during the last three weeks of the semester and will include a Q&A session. For the final project, students must select a research paper focused on MLSys. They will critically analyze the paper to identify its limitations or areas that could be improved. Based on this analysis, students are expected to propose and develop a solution to address the identified issues or introduce an innovative idea to enhance the research.
Course Schedule (tentative)
The course schedule is subject to change.
| Week | Topic | Papers & Materials |
| --- | --- | --- |
| Week 1 | Course overview | - |
| Week 2 | Automatic Differentiation – Reverse-Mode Automatic Differentiation | Wikipedia article on automatic differentiation; Andrej Karpathy's micrograd (see the autodiff sketch below the schedule) |
| | Automatic Differentiation – Computation Graph | PyTorch's blog on the computation graph; Andrej Karpathy's micrograd |
| Week 3 | Transformers – Multi-Head Attention, Layer Normalization, Feed-Forward Network | Sasha Rush's notebook: The Annotated Transformer; Andrej Karpathy's minGPT |
| | Llama's changes over the original Transformer – RMSNorm, SwiGLU activation & gated MLP (see the RMSNorm sketch below the schedule) | The official Llama 1 report; PyTorch's doc on RMSNorm; Noam Shazeer's report: GLU Variants Improve Transformer |
Week 4 | Llama’s change over original Transformer – Rotary Positional Embedding (ROPE) | EleutherAI’s blog on Rotary Positional Embeddings The original ROPE paper |
Dissusion of the difference from Llama 1 to Llama 3 | The official Llama 1 report The official Llama 2 report The official Llama 3 report | |
| Week 5 | GPU Programming Basics & Modern Computing Servers | Get a quote for a GPU server yourself :); CUDA C++ Programming Guide |
| | Neural Network Training – Adam Optimizer & Mixed-Precision Training (see the mixed-precision sketch below the schedule) | The original Adam optimizer paper; PyTorch's Adam optimizer doc; the original mixed-precision training paper; PyTorch's blog on mixed-precision training |
| Week 6 | Distributed Large Model Training – Data Parallelism, Model Parallelism, Pipeline Parallelism | Hugging Face's blog on multi-GPU training; GPipe |
| | Distributed Large Model Training – Model & Optimizer State Sharding | ZeRO optimizer; Model FLOPs Utilization of PaLM |
| Week 7 – end | TBD | - |
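To accompany Week 2, here is a minimal sketch of reverse-mode automatic differentiation on scalars, in the spirit of (but not copied from) Karpathy's micrograd; the `Value` class and its method names are illustrative.

```python
# Minimal reverse-mode automatic differentiation on scalars.
# The Value class and its names are illustrative, not micrograd's actual API.
import math

class Value:
    def __init__(self, data, parents=(), local_grads=()):
        self.data = data
        self.grad = 0.0
        self._parents = parents          # nodes this value depends on
        self._local_grads = local_grads  # d(self)/d(parent) for each parent

    def __add__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        return Value(self.data + other.data, (self, other), (1.0, 1.0))

    def __mul__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        return Value(self.data * other.data, (self, other), (other.data, self.data))

    def tanh(self):
        t = math.tanh(self.data)
        return Value(t, (self,), (1.0 - t * t,))

    def backward(self):
        # Topologically order the computation graph, then propagate
        # gradients from the output back to the inputs (chain rule).
        order, visited = [], set()
        def visit(node):
            if node not in visited:
                visited.add(node)
                for p in node._parents:
                    visit(p)
                order.append(node)
        visit(self)
        self.grad = 1.0
        for node in reversed(order):
            for parent, local in zip(node._parents, node._local_grads):
                parent.grad += local * node.grad

# Usage: gradients of tanh(x * w + b) with respect to x, w, b
x, w, b = Value(0.5), Value(2.0), Value(0.1)
y = (x * w + b).tanh()
y.backward()
print(x.grad, w.grad, b.grad)
```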
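To accompany Weeks 3–4, a small sketch of RMSNorm as used in Llama: normalize each hidden vector by its root mean square and rescale with a learned per-feature weight (no mean subtraction, no bias). This is an illustrative implementation, not the exact code from PyTorch or the Llama repository.

```python
# Illustrative RMSNorm: divide activations by their root mean square,
# then rescale with a learned per-feature weight.
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        inv_rms = torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return x * inv_rms * self.weight

# Usage on hidden states of shape (batch, seq_len, dim)
norm = RMSNorm(dim=4096)
h = torch.randn(2, 16, 4096)
print(norm(h).shape)  # torch.Size([2, 16, 4096])
```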
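To accompany Week 5, a sketch of the standard PyTorch automatic mixed-precision training loop: autocast runs the forward and backward passes in reduced precision, while a gradient scaler prevents FP16 gradient underflow. The toy model, data, and hyperparameters are placeholders.

```python
# Sketch of mixed-precision training in PyTorch: autocast + gradient scaling.
# The model, data, and hyperparameters are placeholders.
import torch
import torch.nn as nn

device = "cuda" if torch.cuda.is_available() else "cpu"
use_amp = device == "cuda"
amp_dtype = torch.float16 if use_amp else torch.bfloat16

model = nn.Linear(1024, 1024).to(device)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
scaler = torch.cuda.amp.GradScaler(enabled=use_amp)

for step in range(10):
    x = torch.randn(32, 1024, device=device)
    target = torch.randn(32, 1024, device=device)

    optimizer.zero_grad(set_to_none=True)
    with torch.autocast(device_type=device, dtype=amp_dtype, enabled=use_amp):
        loss = nn.functional.mse_loss(model(x), target)

    scaler.scale(loss).backward()   # scale the loss so small gradients stay representable
    scaler.step(optimizer)          # unscale gradients, then take the optimizer step
    scaler.update()                 # adjust the scale factor for the next iteration
```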