Spring 2025 CSCI 8980: Introduction to LLM Systems

Instructor: Zirui “Ray” Liu

Time: Spring 2025, M/W 4:00 PM - 5:15 PM

Course Description

Recent progress in artificial intelligence has been largely driven by advances in large language models (LLMs) and other generative methods. The success of LLMs rests on three key factors: the scale of data, the size of models, and the computational resources available. For example, Llama 3, a cutting-edge LLM with 405 billion parameters, was trained on over 15 trillion tokens using 16,000 H100 GPUs. Training, serving, fine-tuning, and evaluating such models therefore demand sophisticated engineering practices that leverage modern hardware and software infrastructure. Building scalable systems for LLMs is essential for further advancing AI capabilities. In this course, students will learn the design principles of machine learning systems for LLMs. Specifically, the course covers three main topics: (1) the components of modern transformer-based language models; (2) the fundamentals of GPU programming; and (3) how to train and deploy LLMs in a scalable way.

Grading Policy

The grading policy is subject to minor changes.

The course will include five short quizzes and a final project. The final project can be completed individually or collaboratively in groups of up to three students. Presentations for the final project will occur during the last three weeks of the semester and will include a Q&A session. For the final project, students must select a research paper focused on MLSys. They will critically analyze the paper to identify its limitations or areas that could be improved. Based on this analysis, students are expected to propose and develop a solution to address the identified issues or introduce an innovative idea to enhance the research.

Course Schedule (tentative)

The course schedule is subject to change.

Week 1: Course overview
  Papers & materials: none

Week 2: Automatic Differentiation — Reverse Mode Automatic Differentiation
  Papers & materials: Wikipedia article on Automatic Differentiation; Andrej Karpathy's micrograd

Week 2: Automatic Differentiation — Computation Graph (see the sketch below)
  Papers & materials: PyTorch's blog on the computation graph; Andrej Karpathy's micrograd

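For a concrete taste of the Week 2 material, here is a minimal sketch of reverse-mode automatic differentiation over a dynamically built computation graph, in the spirit of micrograd. The Value class, its operator overloads, and the example expression are illustrative assumptions, not code from any particular repository.

```python
# A tiny scalar autograd engine: each Value remembers its parents and a local
# backward rule; backward() replays the graph in reverse topological order.
class Value:
    def __init__(self, data, parents=()):
        self.data = data
        self.grad = 0.0
        self._parents = parents
        self._backward = lambda: None  # local chain-rule step, set by each op

    def __add__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        out = Value(self.data + other.data, (self, other))
        def _backward():
            self.grad += out.grad
            other.grad += out.grad
        out._backward = _backward
        return out

    def __mul__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        out = Value(self.data * other.data, (self, other))
        def _backward():
            self.grad += other.data * out.grad
            other.grad += self.data * out.grad
        out._backward = _backward
        return out

    def backward(self):
        # Topologically sort the computation graph, then apply the chain rule
        # from the output back to the leaves.
        order, visited = [], set()
        def build(v):
            if v not in visited:
                visited.add(v)
                for p in v._parents:
                    build(p)
                order.append(v)
        build(self)
        self.grad = 1.0
        for v in reversed(order):
            v._backward()

# z = x * y + x, so dz/dx = y + 1 = 4 and dz/dy = x = 2.
x, y = Value(2.0), Value(3.0)
z = x * y + x
z.backward()
print(x.grad, y.grad)  # 4.0 2.0
```
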
Week 3: Transformers – Multi-Head Attention, Layer Normalization, Feed-Forward Network (see the sketch below)
  Papers & materials: Sasha Rush's notebook "The Annotated Transformer"; Andrej Karpathy's minGPT

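A minimal sketch of the multi-head self-attention covered in Week 3, assuming a causal (decoder-style) setting as in minGPT. The module name, the fused QKV projection, and the tensor shapes are illustrative choices.

```python
# Minimal causal multi-head self-attention (decoder-style) in PyTorch.
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiHeadSelfAttention(nn.Module):
    def __init__(self, dim, num_heads):
        super().__init__()
        assert dim % num_heads == 0
        self.num_heads, self.head_dim = num_heads, dim // num_heads
        self.qkv = nn.Linear(dim, 3 * dim)   # fused Q, K, V projections
        self.proj = nn.Linear(dim, dim)      # output projection

    def forward(self, x):
        b, t, d = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # Reshape each of Q, K, V to (batch, heads, seq, head_dim).
        q, k, v = (z.view(b, t, self.num_heads, self.head_dim).transpose(1, 2)
                   for z in (q, k, v))
        # Scaled dot-product attention with a causal mask.
        scores = (q @ k.transpose(-2, -1)) / math.sqrt(self.head_dim)
        causal = torch.triu(torch.ones(t, t, dtype=torch.bool), diagonal=1)
        scores = scores.masked_fill(causal, float("-inf"))
        attn = F.softmax(scores, dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(b, t, d)  # merge heads
        return self.proj(out)

x = torch.randn(2, 16, 512)                      # (batch, seq, hidden)
print(MultiHeadSelfAttention(512, 8)(x).shape)   # torch.Size([2, 16, 512])
```
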
Week 3: Llama's changes over the original Transformer – RMSNorm, SwiGLU activation, and gated MLP (see the sketch below)
  Papers & materials: the official Llama 1 report; PyTorch's documentation on RMSNorm; Noam Shazeer's report "GLU Variants Improve Transformer"

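A minimal sketch of the Llama-style RMSNorm and SwiGLU gated MLP discussed in Week 3, written with standard PyTorch modules. The hidden sizes and class names are illustrative.

```python
# Llama-style RMSNorm (no mean subtraction) and a SwiGLU gated MLP.
import torch
import torch.nn as nn
import torch.nn.functional as F

class RMSNorm(nn.Module):
    def __init__(self, dim, eps=1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x):
        # Scale by the reciprocal root-mean-square of the features.
        rms = torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return self.weight * x * rms

class SwiGLUMLP(nn.Module):
    def __init__(self, dim, hidden_dim):
        super().__init__()
        self.gate_proj = nn.Linear(dim, hidden_dim, bias=False)
        self.up_proj = nn.Linear(dim, hidden_dim, bias=False)
        self.down_proj = nn.Linear(hidden_dim, dim, bias=False)

    def forward(self, x):
        # SwiGLU: SiLU(x W_gate) gates x W_up elementwise, then project down.
        return self.down_proj(F.silu(self.gate_proj(x)) * self.up_proj(x))

x = torch.randn(2, 16, 512)                       # (batch, seq, hidden)
block = nn.Sequential(RMSNorm(512), SwiGLUMLP(512, 1376))
print(block(x).shape)                             # torch.Size([2, 16, 512])
```
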
Week 4: Llama's changes over the original Transformer – Rotary Positional Embedding (RoPE) (see the sketch below)
  Papers & materials: EleutherAI's blog on Rotary Positional Embeddings; the original RoPE paper

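A minimal sketch of rotary positional embeddings for Week 4, using the interleaved-pair formulation. The (batch, sequence, heads, head_dim) layout and the function name rope are assumptions made for illustration; production implementations typically precompute and cache the cos/sin tables.

```python
# Rotary positional embedding: rotate each (even, odd) feature pair by an
# angle that grows with the token position, so q.k depends on relative offset.
import torch

def rope(x, base=10000.0):
    # x: (batch, seq, heads, head_dim) with an even head_dim.
    _, seq_len, _, head_dim = x.shape
    inv_freq = 1.0 / base ** (torch.arange(0, head_dim, 2).float() / head_dim)
    angles = torch.arange(seq_len).float()[:, None] * inv_freq[None, :]
    cos = angles.cos()[None, :, None, :]   # broadcast over batch and heads
    sin = angles.sin()[None, :, None, :]
    x_even, x_odd = x[..., 0::2], x[..., 1::2]
    out = torch.empty_like(x)
    out[..., 0::2] = x_even * cos - x_odd * sin   # 2-D rotation of each pair
    out[..., 1::2] = x_even * sin + x_odd * cos
    return out

q = torch.randn(1, 16, 4, 64)   # (batch, seq, heads, head_dim)
print(rope(q).shape)            # torch.Size([1, 16, 4, 64])
```
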
Week 4: Discussion of the differences from Llama 1 to Llama 3
  Papers & materials: the official Llama 1, Llama 2, and Llama 3 reports

Week 5: GPU Programming Basics & Modern Computing Servers
  Papers & materials: get a quote for a GPU server yourself :); the CUDA C++ Programming Guide

Week 5: Neural Network Training – Adam Optimizer & Mixed Precision Training (see the sketch below)
  Papers & materials: the original Adam optimizer paper; PyTorch's Adam optimizer documentation; the original mixed precision training paper; PyTorch's blog on mixed precision training

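A minimal sketch of the Week 5 topic: a short Adam training loop under PyTorch automatic mixed precision, following the autocast plus GradScaler recipe from PyTorch's documentation. The toy model, data, and hyperparameters are illustrative, and the example falls back to full precision when no GPU is available.

```python
# Adam training steps with automatic mixed precision: selected forward/backward
# ops run in float16 while the master weights stay in float32.
import torch
import torch.nn as nn

device = "cuda" if torch.cuda.is_available() else "cpu"
use_amp = device == "cuda"
model = nn.Sequential(nn.Linear(1024, 4096), nn.GELU(), nn.Linear(4096, 1024)).to(device)
optimizer = torch.optim.Adam(model.parameters(), lr=3e-4)
scaler = torch.cuda.amp.GradScaler(enabled=use_amp)  # loss scaling for fp16 grads

x = torch.randn(8, 1024, device=device)
y = torch.randn(8, 1024, device=device)

for step in range(3):
    optimizer.zero_grad(set_to_none=True)
    with torch.autocast(device_type=device, dtype=torch.float16, enabled=use_amp):
        loss = nn.functional.mse_loss(model(x), y)
    # Scale the loss so small fp16 gradients do not underflow; the scaler
    # unscales gradients before the Adam update and adapts the scale factor.
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()
    print(step, loss.item())
```
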
Week 6: Distributed Large Model Training – Data Parallelism, Model Parallelism, Pipeline Parallelism (see the sketch below)
  Papers & materials: Hugging Face's blog on multi-GPU training; GPipe

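A minimal sketch of the core idea behind the Week 6 data-parallelism topic: every rank computes gradients on its own shard of the batch, then the gradients are averaged with an all-reduce, which is the synchronization that DistributedDataParallel automates. The gloo backend, port number, and toy model are illustrative choices so the example can run on CPUs.

```python
# Data parallelism by hand: each rank holds a full model replica, computes
# gradients on its own data shard, and gradients are averaged with all_reduce.
import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp
import torch.nn as nn

def worker(rank, world_size):
    os.environ["MASTER_ADDR"] = "127.0.0.1"
    os.environ["MASTER_PORT"] = "29500"
    dist.init_process_group("gloo", rank=rank, world_size=world_size)

    torch.manual_seed(0)                              # identical init on all ranks
    model = nn.Linear(8, 1)
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

    torch.manual_seed(rank)                           # different data per rank
    x, y = torch.randn(4, 8), torch.randn(4, 1)
    nn.functional.mse_loss(model(x), y).backward()

    # Average gradients across ranks so every replica takes the same step.
    for p in model.parameters():
        dist.all_reduce(p.grad, op=dist.ReduceOp.SUM)
        p.grad /= world_size
    optimizer.step()

    if rank == 0:
        print("weight norm after synced step:", model.weight.norm().item())
    dist.destroy_process_group()

if __name__ == "__main__":
    mp.spawn(worker, args=(2,), nprocs=2)
```
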
Week 6: Distributed Large Model Training – Model & Optimizer State Sharding (see the sketch below)
  Papers & materials: the ZeRO optimizer paper; Model FLOPs Utilization from the PaLM report

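A back-of-the-envelope sketch of why the Week 6 sharding topic matters, using the roughly 16-bytes-per-parameter accounting for mixed-precision Adam training popularized by the ZeRO paper. The model size and GPU counts are illustrative.

```python
# Rough per-GPU memory for training a 7B-parameter model with Adam in mixed
# precision: about 2 bytes (fp16 weights) + 2 bytes (fp16 grads) + 12 bytes
# (fp32 master weights, momentum, variance) = ~16 bytes per parameter.
params = 7e9
bytes_per_param = 2 + 2 + 12
total_gb = params * bytes_per_param / 1e9
for num_gpus in (1, 8, 64):
    # ZeRO-3 partitions weights, gradients, and optimizer states across GPUs,
    # so the per-GPU footprint shrinks roughly linearly with the GPU count.
    print(f"{num_gpus:>3} GPUs: replicated ~{total_gb:.0f} GB each, "
          f"fully sharded ~{total_gb / num_gpus:.0f} GB each")
```
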
Week 7 onward: TBD