Introduction to Language Models
Eve Fleisig & Kayo Yin
CS 294-162
August 28, 2023
Language Modeling
Image credit: jalammar.github.io/illustrated-word2vec/
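At its core, a language model assigns a probability to the next word given the preceding context. A minimal count-based sketch (a toy bigram model, not the neural models covered below):

```python
from collections import Counter, defaultdict

# Toy corpus; a real model is trained on billions of tokens.
corpus = "the cat sat on the mat . the dog sat on the rug .".split()

# Count bigrams: how often does `nxt` follow `prev`?
bigrams = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    bigrams[prev][nxt] += 1

def next_word_probs(prev):
    """P(next word | previous word), estimated from counts."""
    counts = bigrams[prev]
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

print(next_word_probs("the"))
# {'cat': 0.25, 'mat': 0.25, 'dog': 0.25, 'rug': 0.25}
```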
Masked Language Modeling
BERT
Image credit: jalammar.github.io/illustrated-bert/
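A minimal sketch of the BERT-style masking objective: randomly hide tokens and train the model to recover them from bidirectional context. Real BERT masks about 15% of tokens and sometimes keeps or randomizes them instead of substituting [MASK]; this simplified version only substitutes:

```python
import random

def mask_tokens(tokens, mask_prob=0.15, mask_token="[MASK]"):
    """BERT-style masking sketch: hide a random subset of tokens;
    the model is trained to predict the originals at those positions."""
    inputs, targets = [], []
    for tok in tokens:
        if random.random() < mask_prob:
            inputs.append(mask_token)
            targets.append(tok)       # loss is computed here
        else:
            inputs.append(tok)
            targets.append(None)      # no loss at unmasked positions
    return inputs, targets

random.seed(1)
print(mask_tokens("the cat sat on the mat".split()))
# (['[MASK]', 'cat', 'sat', 'on', 'the', 'mat'],
#  ['the', None, None, None, None, None])
```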
Causal Language Modeling
GPT
Image credit: jalammar.github.io/illustrated-gpt2/
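In contrast, the GPT-style causal objective predicts each token from its left context only. A sketch of how one sentence decomposes into training pairs:

```python
def causal_lm_pairs(tokens):
    """GPT-style objective sketch: at every position, predict the next
    token given only the tokens to its left."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]

for context, target in causal_lm_pairs("the cat sat on the mat".split()):
    print(context, "->", target)
# ['the'] -> cat
# ['the', 'cat'] -> sat
# ...
```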
BERT vs. GPT
● Bidirectional encoder models (BERT) do better than generative models on
non-generation tasks such as classification, given comparable training data
and model size.
● Generative models (GPT) have training-efficiency and scalability advantages
that may ultimately make them more accurate. They can also solve
downstream tasks in a zero-shot setting.
Transformer
Image credit: jalammar.github.io/illustrated-transformer/
Attention
Self-Attention
Image credit: jalammar.github.io/illustrated-gpt2/
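A minimal NumPy sketch of scaled dot-product self-attention as defined in Vaswani et al. (2017); the matrix shapes and random inputs are illustrative only:

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """Scaled dot-product self-attention.
    X: (seq_len, d_model); Wq/Wk/Wv: (d_model, d_k) projections."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])         # (seq_len, seq_len)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over keys
    return weights @ V                              # weighted sum of values

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))                         # 4 tokens, d_model=8
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))
print(self_attention(X, Wq, Wk, Wv).shape)          # (4, 8)
```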
Multi-headed Attention
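Multi-headed attention runs several attention heads in parallel, each in a smaller subspace, then concatenates and linearly mixes their outputs. A sketch with assumed toy dimensions (d_model=8, four heads of size 2):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention(X, heads, Wo):
    """Each head runs scaled dot-product attention in its own subspace;
    outputs are concatenated and mixed by an output projection Wo.
    heads: list of (Wq, Wk, Wv) per head, each (d_model, d_head)."""
    outputs = []
    for Wq, Wk, Wv in heads:
        Q, K, V = X @ Wq, X @ Wk, X @ Wv
        A = softmax(Q @ K.T / np.sqrt(K.shape[-1]))
        outputs.append(A @ V)
    return np.concatenate(outputs, axis=-1) @ Wo    # (seq_len, d_model)

rng = np.random.default_rng(0)
d_model, d_head, n_heads = 8, 2, 4
X = rng.normal(size=(5, d_model))
heads = [tuple(rng.normal(size=(d_model, d_head)) for _ in range(3))
         for _ in range(n_heads)]
Wo = rng.normal(size=(n_heads * d_head, d_model))
print(multi_head_attention(X, heads, Wo).shape)     # (5, 8)
```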
Transformer
Image credit: jalammar.github.io/illustrated-transformer/
Transformer Input
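The input to the Transformer is the sum of token embeddings and positional encodings. A sketch of the original sinusoidal encoding:

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    """Sinusoidal positional encodings from the original Transformer:
    PE[pos, 2i]   = sin(pos / 10000^(2i/d_model))
    PE[pos, 2i+1] = cos(pos / 10000^(2i/d_model))"""
    pos = np.arange(seq_len)[:, None]
    i = np.arange(0, d_model, 2)[None, :]
    angles = pos / (10000 ** (i / d_model))
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

# The transformer's input is token embeddings plus positional encodings.
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(10, 16))    # 10 tokens, d_model=16
x = embeddings + positional_encoding(10, 16)
```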
Transformer Encoder
Image credit: jalammar.github.io/illustrated-transformer/
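A single-head sketch of one encoder layer: self-attention and a position-wise feed-forward network, each wrapped in a residual connection followed by layer normalization (LayerNorm's learned scale and shift are omitted for brevity):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def layer_norm(x, eps=1e-5):
    mu, var = x.mean(-1, keepdims=True), x.var(-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def encoder_block(x, Wq, Wk, Wv, W1, W2):
    """One simplified encoder layer: self-attention, then a
    position-wise ReLU MLP, each with residual + layer norm."""
    Q, K, V = x @ Wq, x @ Wk, x @ Wv
    attn = softmax(Q @ K.T / np.sqrt(K.shape[-1])) @ V
    x = layer_norm(x + attn)              # residual + norm
    ffn = np.maximum(0, x @ W1) @ W2      # position-wise feed-forward
    return layer_norm(x + ffn)            # residual + norm

rng = np.random.default_rng(0)
d, d_ff = 8, 32
x = rng.normal(size=(5, d))
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
W1, W2 = rng.normal(size=(d, d_ff)), rng.normal(size=(d_ff, d))
print(encoder_block(x, Wq, Wk, Wv, W1, W2).shape)   # (5, 8)
```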
Adding the Decoder
Image credit: jalammar.github.io/illustrated-transformer/
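The decoder differs in two ways: its self-attention is causally masked, and a cross-attention sub-layer attends to the encoder output. A sketch of one such attention sub-layer, parameterized by whether it is masked:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def decoder_attention(x_dec, x_enc, Wq, Wk, Wv, causal):
    """One attention sub-layer of a decoder block. With causal=True and
    x_enc = x_dec this is masked self-attention (no peeking at future
    tokens); with x_enc set to the encoder output it is cross-attention."""
    Q, K, V = x_dec @ Wq, x_enc @ Wk, x_enc @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    if causal:
        # Disallow attending to positions after the query position.
        mask = np.triu(np.ones(scores.shape, dtype=bool), k=1)
        scores = np.where(mask, -1e9, scores)
    return softmax(scores) @ V

rng = np.random.default_rng(0)
d = 8
x_dec, x_enc = rng.normal(size=(4, d)), rng.normal(size=(6, d))
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
self_out  = decoder_attention(x_dec, x_dec, Wq, Wk, Wv, causal=True)
cross_out = decoder_attention(x_dec, x_enc, Wq, Wk, Wv, causal=False)
```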
BERT
Image credit: jalammar.github.io/illustrated-bert/
GPT
T5
Text-to-Text Transfer Transformer
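T5 casts every task as text-to-text, naming the task with a prefix in the input string. The (abridged) examples below follow the format shown in the T5 paper:

```python
# Every task becomes "input text -> output text"; the task is named
# by a prefix in the input.
examples = [
    ("translate English to German: That is good.", "Das ist gut."),
    ("cola sentence: The course is jumping well.", "not acceptable"),
    ("summarize: state authorities dispatched emergency crews ...",
     "six people hospitalized after a storm ..."),
]
```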
Pretraining & Fine-tuning
Pretraining & Fine-tuning
Pretraining & Fine-tuning
Unsupervised objective
Supervised objective
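A sketch contrasting the two objectives, using T5's span-corruption format (sentinel tokens <X>, <Y>, <Z>) for the unsupervised stage:

```python
# Unsupervised pretraining example (T5-style span corruption):
# random spans of the input are replaced with sentinels, and the
# target reconstructs them.
pretrain_input  = "Thank you <X> me to your party <Y> week."
pretrain_target = "<X> for inviting <Y> last <Z>"

# Supervised fine-tuning example (labeled task data):
finetune_input  = "translate English to German: Thank you."
finetune_target = "Danke."
```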
Prefixes & Prompting
Few- & Zero-Shot Learning
Few- & Zero-Shot Learning
Few- & Zero-Shot Learning
Few- & Zero-Shot Learning
Generalization to new tasks without fine-tuning enabled by:
Scaling
Data & Compute
Scaling Data
C4 (Colossal Clean Crawled Corpus): a cleaned subset of Common Crawl introduced with T5; still in use
GPT-3 training data: filtered Common Crawl, WebText2, Books1, Books2, and English Wikipedia
Scaling Data & Compute
Kaplan et al., 2020; Hoffmann et al., 2022
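Kaplan et al. (2020) suggested growing parameters faster than data; Hoffmann et al. (2022, "Chinchilla") revised this, finding that parameters and training tokens should scale roughly in proportion, at about 20 tokens per parameter. A back-of-the-envelope sketch of that heuristic:

```python
def chinchilla_optimal_tokens(n_params):
    """Rough compute-optimal heuristic from Hoffmann et al. (2022):
    train on ~20 tokens per model parameter. The exact ratio is an
    empirical fit, not a universal constant."""
    return 20 * n_params

# A 70B-parameter model is compute-optimal at roughly 1.4T tokens.
print(f"{chinchilla_optimal_tokens(70e9):.2e}")  # ~1.40e+12
```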
Reinforcement Learning from Human Feedback
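RLHF first trains a reward model on human preference pairs, then optimizes the LM policy (e.g., with PPO) to maximize reward while staying close to the pretrained model via a KL penalty. A sketch of the pairwise (Bradley-Terry) reward-model loss used in pipelines such as InstructGPT:

```python
import numpy as np

def reward_model_loss(r_chosen, r_rejected):
    """Pairwise preference loss: push the reward of the human-preferred
    response above that of the rejected one, i.e.
    -log sigmoid(r_chosen - r_rejected)."""
    return -np.log(1 / (1 + np.exp(-(r_chosen - r_rejected))))

# If the reward model already prefers the chosen response, loss is small.
print(reward_model_loss(r_chosen=2.0, r_rejected=0.5))  # ~0.20
```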
Discussion
● What are the advantages and disadvantages of different training or tuning methods
that have been tried (task-specific training, pretrain/fine-tune, prompting, RLHF)?
● What is the role of systems research in scaling up LLMs? How could advances in
systems research change scaling “laws”?
● What security issues do we need to consider when deploying LLMs in the
real world?
● How can we improve the energy efficiency and carbon footprint of LLMs?