Introduction
Ts. Nguyễn Văn Vinh - UET
Content
● The Turing Test
● Overview of LLMs
○ How do LLMs work, What LLMs can do, Limitations of LLMs, What is the future
● Course Logistics
The Imitation Game (aka The Turing Test)
Proposed in 1950 by Alan M. Turing, who is considered
the father of theoretical computer science.
Tests a machine's ability to exhibit intelligent behaviour
equivalent to, or indistinguishable from, that of a human
– via language.
Language modeling has since been proposed as a
benchmark to measure progress toward AI.
“I believe that in about fifty years’ time it will be possible to programme computers, with a storage
capacity of about 10^9, to make them play the imitation game so well that an average interrogator will
not have more than 70% chance of making the right identification after five minutes of questioning.”
— A. Turing. Computing Machinery and Intelligence. Mind, 1950.
Eras of Language Modeling
Symbolic Era (Pre-1990): Rule-based approaches; expert systems; limited generalization.
Statistical Era (1990-2010): Data-driven approaches; probabilistic models; introduction of corpora.
Scale Era (2010 onwards): Deep learning and neural nets; general-purpose LMs; massive datasets and compute.
Milestones: Turing Test (1950), ELIZA (1966), ChatGPT (2022).
ELIZA (1966)
Early NLP program developed by
Joseph Weizenbaum at MIT.
Created illusion of a conversation by
rephrasing user statements as
questions using pattern matching
and substitution methodology.
One of the first programs capable of
attempting the Turing test.
Try it out at https://web.njit.edu/~ronkowit/eliza.html
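For intuition, here is a minimal Python sketch of the pattern-matching-and-substitution idea behind ELIZA. The rules and responses below are invented for illustration and are not Weizenbaum's actual script.

```python
import re

# Invented ELIZA-style rules: a regex pattern plus a response template.
RULES = [
    (re.compile(r"\bi am (.*)", re.IGNORECASE), "Why do you say you are {0}?"),
    (re.compile(r"\bi feel (.*)", re.IGNORECASE), "How long have you felt {0}?"),
    (re.compile(r"\bmy (.*)", re.IGNORECASE), "Tell me more about your {0}."),
]

# Swap first- and second-person words so the echoed phrase reads naturally.
REFLECTIONS = {"my": "your", "your": "my", "i": "you", "me": "you", "am": "are"}

def reflect(phrase: str) -> str:
    return " ".join(REFLECTIONS.get(w.lower(), w) for w in phrase.split())

def respond(user_input: str) -> str:
    for pattern, template in RULES:
        match = pattern.search(user_input)
        if match:
            # Rephrase the user's statement as a question.
            return template.format(reflect(match.group(1).rstrip(".!?")))
    return "Please go on."  # fallback when no pattern matches

print(respond("I am worried about my exams."))
# -> Why do you say you are worried about your exams?
```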
Has AI Passed The Turing Test?
How do we even tell?
“The best-performing GPT-4 prompt passed in 49.7% of games,
outperforming ELIZA (22%) and GPT-3.5 (20%), but falling short
of the baseline set by human participants (66%).”
C. Jones and B. Bergen. Does GPT-4 pass the Turing test? 2024.
“ChatGPT-4 exhibits behavioral and personality traits that are
statistically indistinguishable from a random human from tens of
thousands of human subjects from more than 50 countries.”
Q. Mei et al. A Turing test of whether AI chatbots are behaviorally
similar to humans. PNAS, 2024.
A Social Turing Game
Chat with someone for two minutes and guess if it was a fellow human or an AI bot.
The AI bots in the game are chosen from a mix of different LLMs, including Jurassic-2,
GPT-4, Claude, and Cohere.
https://www.humanornot.ai/
Part of a larger scientific research project by AI21 Labs.
D. Jannai et al. Human or Not? A Gamified Approach to the Turing Test. 2023.
Question: Can you identify a flaw of using this game as a Turing Test?
Has AI Passed The Turing Test?
How do we even tell?
Is the test even a valid measure of AI’s capabilities?
What are the ethical implications of passing the test?
And many others …
Overview of LLMs
How do LLMs work: What is the technology underlying a chatbot like ChatGPT?
What LLMs can do: What functionality beyond chatbots does the technology enable?
Limitations of LLMs: What fundamental challenges remain to be addressed?
What is the Future: How is research addressing those challenges?
How Do LLMs Work
Let’s Take a History Tour!
“Those who cannot remember the past are condemned to repeat it.”
— George Santayana. The Life of Reason, 1905.
Linguistic Foundations
Rule-based approaches
Example rule in a chatbot based on AIML
(Artificial Intelligence Markup Language)
which was developed in 1992-2002.
AIML formed the basis for a highly extended
Eliza called A.L.I.C.E. ("Artificial Linguistic
Internet Computer Entity").
Linguistic Foundations
Semantic parsing: analyzing the linguistic structure of text
The introduction of corpora …
The Penn Treebank (PTB) corpus, developed during 1989-1996, was widely used for evaluating models for sequence labelling. The task consists of annotating each word with its Part-of-Speech tag.
[Figure: Example of constituency parsing using a context-free grammar, and the same example using dependency parsing.]
M. Marcus et al. Building a Large Annotated Corpus of English: The Penn Treebank. Computational Linguistics, 1993.
Word Embeddings
● Represent each word using a “vector” of numbers.
● Converts a “discrete” representation to “continuous”.
● Many benefits:
○ More “fine-grained” representations of words.
○ Useful computations such as cosine and Euclidean distance (see the sketch after this list).
○ Visualization and mapping of words onto a semantic space.
○ Can be learnt in a self-supervised manner from a large corpus.
● Examples:
○ Word2Vec (2013), GloVe, BERT, ELMo
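To make the "useful computations" bullet concrete, here is a small sketch with made-up toy vectors (real embeddings such as Word2Vec or GloVe typically have 100-300 dimensions):

```python
import numpy as np

# Toy 4-dimensional "word vectors", invented for illustration only.
vectors = {
    "king":  np.array([0.8, 0.6, 0.1, 0.2]),
    "queen": np.array([0.7, 0.7, 0.1, 0.3]),
    "apple": np.array([0.1, 0.2, 0.9, 0.7]),
}

def cosine_similarity(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def euclidean_distance(u, v):
    return float(np.linalg.norm(u - v))

print(cosine_similarity(vectors["king"], vectors["queen"]))  # high: related words
print(cosine_similarity(vectors["king"], vectors["apple"]))  # low: unrelated words
print(euclidean_distance(vectors["king"], vectors["queen"]))
```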
Seq2Seq Models
● Recurrent Neural Networks (RNNs)
● Long Short-Term Memory Networks (LSTMs)
● Capture dependencies between input tokens
● Gates control the flow of information
[Figure: A simple RNN shown unrolled in time. Network layers are recalculated for each time step, while weights U, V, and W are shared across all time steps.]
[Figure: A single LSTM unit displayed as a computation graph. The inputs to each unit consist of the current input x_t, the previous hidden state h_{t-1}, and the previous context c_{t-1}; the outputs are a new hidden state h_t and an updated context c_t.]
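A minimal sketch of a simple RNN unrolled in time, with toy dimensions chosen for illustration; note that U, V, and W are created once and reused at every time step:

```python
import numpy as np

d_in, d_hidden, d_out = 3, 4, 2
rng = np.random.default_rng(0)
U = rng.normal(size=(d_hidden, d_in))      # input -> hidden
W = rng.normal(size=(d_hidden, d_hidden))  # hidden -> hidden (recurrence)
V = rng.normal(size=(d_out, d_hidden))     # hidden -> output

def rnn_forward(inputs):
    h = np.zeros(d_hidden)            # initial hidden state h_0
    outputs = []
    for x_t in inputs:                # one step per token
        h = np.tanh(U @ x_t + W @ h)  # h_t depends on x_t and h_{t-1}
        outputs.append(V @ h)         # output at this time step
    return outputs, h

sequence = [rng.normal(size=d_in) for _ in range(5)]  # a toy input sequence
ys, h_final = rnn_forward(sequence)
print(len(ys), h_final.shape)  # 5 outputs, final hidden state of size d_hidden
```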
Self-Attention and Transformers
● Allows the model to “focus attention” on particular aspects of
the input while generating the output.
● Done by using a set of parameters, called "weights,"
that determine how much attention should be paid
to each input token at each time step.
● These weights are computed using a combination of
the input and the current hidden state of the model.
In encoding the word "it", one attention head is focusing most on "the animal", while another is focusing on "tired". The model's representation of the word "it" thus bakes in some of the representation of both "animal" and "tired".
A. Vaswani et al. Attention Is All You Need. NeurIPS 2017.
https://jalammar.github.io/illustrated-transformer/
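The computation itself is compact. Below is a sketch of single-head scaled dot-product self-attention with toy sizes (real Transformers use multiple heads and per-head projections):

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    """Single-head scaled dot-product self-attention (toy sketch)."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv     # project tokens to queries, keys, values
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # how much each token attends to every other
    weights = softmax(scores, axis=-1)   # attention weights; each row sums to 1
    return weights @ V, weights          # weighted sum of values

seq_len, d_model, d_head = 4, 8, 8
rng = np.random.default_rng(0)
X = rng.normal(size=(seq_len, d_model))  # embeddings for 4 input tokens
Wq, Wk, Wv = (rng.normal(size=(d_model, d_head)) for _ in range(3))
out, attn = self_attention(X, Wq, Wk, Wv)
print(out.shape, attn.shape)             # (4, 8) outputs, (4, 4) attention weights
```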
Pre-Training: Data Preparation
A typical data preparation pipeline for pre-training LLMs:
W. Zhao et al. A Survey of Large Language Models. 2023.
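As a toy illustration of two steps that typically appear in such a pipeline (quality filtering and exact deduplication), consider the sketch below; the heuristics are deliberately crude and are not taken from the cited survey.

```python
import hashlib

documents = [
    "The quick brown fox jumps over the lazy dog.",
    "The quick brown fox jumps over the lazy dog.",   # exact duplicate
    "click here click here click here",               # low-quality boilerplate
    "Language models are trained on large text corpora.",
]

def passes_quality_filter(doc: str) -> bool:
    words = doc.split()
    # crude heuristics: minimum length and limited word repetition
    return len(words) >= 5 and len(set(words)) / len(words) > 0.5

seen_hashes, cleaned = set(), []
for doc in documents:
    if not passes_quality_filter(doc):
        continue
    h = hashlib.sha256(doc.encode()).hexdigest()      # hash for exact dedup
    if h in seen_hashes:
        continue
    seen_hashes.add(h)
    cleaned.append(doc)

print(len(documents), "->", len(cleaned), "documents after filtering + dedup")
```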
Pre-Training Data Quality Reduces Reliance on Compute
S. Hooker. On the Limitations of Compute Thresholds as a Governance Strategy. 2024.
Pre-Training: Parallelism
4D Parallelism minimizes bottlenecks and maximizes efficiency by combining
Data, Context, Pipeline (Vertical), and Tensor (Horizontal) Parallelism.
● Data Parallelism parallelizes tasks to speed up data processing and model iterations.
● Context Parallelism splits input sequences into chunks to be processed separately.
K. Pijanowski and M. Galarnyk. What is Distributed Training? 2022.
S. Li et al. Sequence Parallelism: Long Sequence Training from System Perspective. 2021.
Pre-Training: Parallelism
● Pipeline Parallelism separates a model based on its layers, allowing higher throughput.
● Tensor Parallelism splits matrices across GPUs to reduce peak memory consumption.
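To see the core idea of tensor parallelism without any distributed machinery, the sketch below splits one layer's weight matrix column-wise across two simulated "devices" and recombines the partial results; the sizes are toy values.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(16, 512))           # a batch of 16 activation vectors
W = rng.normal(size=(512, 1024))         # full weight matrix of one layer

W_dev0, W_dev1 = np.split(W, 2, axis=1)  # each "device" holds half the columns
Y_dev0 = X @ W_dev0                      # computed on device 0
Y_dev1 = X @ W_dev1                      # computed on device 1
Y = np.concatenate([Y_dev0, Y_dev1], axis=1)  # gather the two output shards

assert np.allclose(Y, X @ W)             # identical to the unsharded computation
print(Y.shape, W_dev0.nbytes, W.nbytes)  # each device stores half the parameters
```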
Pre-Training: Scaling Laws
Given a fixed compute budget, what is the optimal model size and training
dataset size for training a transformer LM?
Chinchilla Scaling Law:
For every doubling of model size,
the number of training tokens must
also be doubled.
J. Hoffmann et al. Training Compute-Optimal Large Language Models. 2022.
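A back-of-the-envelope sketch, using two commonly cited rules of thumb derived from the Chinchilla paper: training compute C of roughly 6·N·D FLOPs for a model with N parameters trained on D tokens, and D of roughly 20·N at the compute-optimal point. These are approximations, not exact laws.

```python
def chinchilla_optimal(compute_budget_flops):
    # Solve C = 6 * N * (20 * N) for N, then apply the ~20 tokens/parameter rule.
    n_params = (compute_budget_flops / (6 * 20)) ** 0.5
    n_tokens = 20 * n_params
    return n_params, n_tokens

# Example: roughly the training compute of Chinchilla itself (~5.8e23 FLOPs).
n, d = chinchilla_optimal(5.8e23)
print(f"{n / 1e9:.0f}B parameters, {d / 1e12:.1f}T tokens")  # about 70B params, 1.4T tokens
```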
Post-Training: Instruction-Tuning and Alignment
● Pre-Training: Massive amounts of data from the Internet, books, etc. Problem: a model that can babble on about anything, but not aligned with what we want (e.g. Question-Answering).
● Instruction Fine-tuning: Teach model to respond to instructions.
● Reinforcement Learning from Human Feedback: Teach model to produce output closer to what humans like.
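For instruction fine-tuning, a single training example often looks roughly like the sketch below; the field names and prompt format are illustrative, not from any particular dataset or paper.

```python
example = {
    "instruction": "Summarize the following paragraph in one sentence.",
    "input": "Large language models are pre-trained on massive text corpora ...",
    "output": "LLMs learn broad language ability from large-scale text data.",
}

# During instruction fine-tuning the model is still trained with next-token
# prediction, but the loss is typically computed only on the response tokens,
# conditioned on the instruction and input.
prompt = f"{example['instruction']}\n\n{example['input']}\n\n### Response:\n"
target = example["output"]
print(prompt + target)
```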
Evaluation
● Datasets
○ GLUE, SuperGLUE (General language understanding)
○ HumanEval (Coding)
○ HellaSwag (Commonsense reasoning)
○ GSM-8K (Math)
● Human Preferences
○ Chatbot Arena: Crowdsourced platform where humans vote on pairwise
comparisons of different LLMs (akin to the Elo rating system in chess; a sketch of the update rule follows below).
● LLMs as Judges
○ LLM can approximate human preference with far lower cost!
○ L. Zheng et al. Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena.
NeurIPS 2023 Datasets and Benchmarks Track.
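The standard Elo update for a single pairwise comparison is shown below; it conveys the spirit of how arena-style leaderboards aggregate votes, but it is not Chatbot Arena's exact aggregation method.

```python
def elo_update(rating_a, rating_b, a_wins, k=32):
    expected_a = 1 / (1 + 10 ** ((rating_b - rating_a) / 400))  # predicted win rate of A
    score_a = 1.0 if a_wins else 0.0
    new_a = rating_a + k * (score_a - expected_a)
    new_b = rating_b + k * ((1 - score_a) - (1 - expected_a))
    return new_a, new_b

# Two models start at 1000; a human voter prefers model A's response.
print(elo_update(1000, 1000, a_wins=True))  # (1016.0, 984.0): A gains what B loses
```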
How Do LLMs Work: Key Topics
● Transformer Architecture
○ Self-Attention, Input/Output Processing, Architecture Variations,
Training and Inference
● Pre-Training
○ Data Preparation (Tokenization, etc.), Parallelism, Scaling Laws
● Post-Training
○ Instruction Following/Tuning, Alignment
● Evaluation
What LLMs Can Do
Evolution of LMs from Perspective of Task-Solving Capacity
W. Zhao et al. A Survey of Large Language Models. 2023.
Few-Shot Prompting
Prompt (in-context examples followed by a new query):
Q: “Elon Musk”
A: “nk”
Q: “Bill Gates”
A: “ls”
Q: “Barack Obama”
A:
The ideal output is “ka”; GPT-4’s actual output is “am”.
T. Brown et al. Language Models are Few-Shot Learners. NeurIPS 2020.
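As a concrete illustration, here is a minimal Python sketch of how such a few-shot prompt could be assembled as a plain string; the variable names are ours, not from the paper.

```python
# In-context examples demonstrating the "last letter concatenation" task.
examples = [("Elon Musk", "nk"), ("Bill Gates", "ls")]
query = "Barack Obama"

prompt = ""
for q, a in examples:
    prompt += f'Q: "{q}"\nA: "{a}"\n'
prompt += f'Q: "{query}"\nA:'
print(prompt)
# The assembled string is sent to the LLM as-is; no weights are updated, so the
# model must infer the task purely from the examples in its context.
```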
Chain-of-Thought Prompting
Prompt:
Q: “Elon Musk”
A: The last letter of "Elon" is "n". The last letter of "Musk" is "k". Concatenating "n", "k" leads to "nk". So the output is "nk".
Q: “Barack Obama”
A:
LLM output: The last letter of "Barack" is "k". The last letter of "Obama" is "a". Concatenating "k", "a" leads to "ka". So, the output is "ka".
J. Wei et al. Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. NeurIPS 2022.
CoT as an Emergent Property of Model Scale
J. Wei et al. Emergent Abilities of Large Language Models. TMLR 2022.
From Prompting to Fine-Tuning
Unlike prompting, fine-tuning actually changes the model under the hood, giving
better domain- or task-specific performance.
Source: Andrej Karpathy @karpathy (not to scale)
Case Study in Law: Harvey AI
● Startup building a custom-trained case law model for drafting documents, answering
questions about complex litigation scenarios, and identifying material discrepancies
between hundreds of contracts.
● Added 10 billion tokens worth of data to power the model, starting with case law
from Delaware, and then expanding to include all of U.S. case law.
● Attorneys from 10 large law firms preferred custom model’s output versus GPT-4’s
97% of the time. Main benefit was reduced hallucinations!
Open AI Customer Stories: Harvey. April 2024.
Parameter Efficient Fine-Tuning (PEFT)
Techniques like LoRA construct a low-rank parameterization for parameter efficiency during training: the pre-trained weights are kept frozen, and only the low-rank adapter parameters are trained.
For inference, the model can be converted to its original weight parameterization to ensure unchanged inference speed.
[Figure: GPT-3 175B validation accuracy vs. number of trainable parameters of several adaptation methods on WikiSQL. LoRA exhibits better scalability and task performance.]
E. Hu et al. LoRA: Low-Rank Adaptation of Large Language Models. ICLR 2022.
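A minimal numerical sketch of the LoRA parameterization with toy sizes (the alpha/r scaling factor from the paper is omitted for brevity): keep the pre-trained weight W frozen, train only a low-rank update B·A, and merge it back into W for inference.

```python
import numpy as np

d_out, d_in, r = 1024, 1024, 8
rng = np.random.default_rng(0)
W = rng.normal(size=(d_out, d_in))     # pre-trained weights, kept frozen
A = rng.normal(size=(r, d_in)) * 0.01  # trainable low-rank factor
B = np.zeros((d_out, r))               # trainable, zero-initialized so the
                                       # adapter starts as a no-op

x = rng.normal(size=d_in)
h = W @ x + B @ (A @ x)                # forward pass during fine-tuning

# For inference, the low-rank update can be merged back into W, so there is
# no extra latency compared to the original model.
W_merged = W + B @ A
assert np.allclose(W_merged @ x, h)

print(W.size, A.size + B.size)         # ~1M frozen vs. ~16K trainable parameters
```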
Design Spaces and The CLAM Framework
CLAM enables unlimited chaining of popular optimization techniques in parameter-efficient fine-tuning, quantization, and pruning on nearly every modern LLM.
N. Velingker et al. CLAM: Unifying Finetuning, Quantization, and Pruning through Unrestricted Chaining of
LLM Adapter Modules. 2024.
What LLMs Can Do: Key Topics
● Prompt Engineering
○ Few-Shot, Chain-of-Thought (CoT), etc.
● Adaptation (aka Fine-Tuning)
○ Parameter-Efficient Techniques (PEFT)
○ Design Spaces
○ The CLAM Framework
Limitations of LLMs
Unreliable Reasoning Even On Simple Tasks
Probably due to tokenization!
[Figure: an example input and its tokenization, generated by gpt-4o’s tokenizer.]
Try it out at: https://tiktokenizer.vercel.app/
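Tokenization can also be inspected programmatically. The sketch below assumes the tiktoken package is installed and uses the "o200k_base" encoding (associated with gpt-4o at the time of writing); the example text is ours.

```python
import tiktoken

enc = tiktoken.get_encoding("o200k_base")
text = "How many r's are in strawberry?"
token_ids = enc.encode(text)
pieces = [enc.decode([t]) for t in token_ids]
print(pieces)
# The model operates on these sub-word pieces, not on individual characters,
# which helps explain why character-level tasks (e.g. counting letters) can
# be surprisingly unreliable.
```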
Jailbreaking Can Bypass Safety
Process of manipulating prompts to
bypass an LLM’s safeguards, leading to
harmful outputs.
PAIR—which is inspired by social
engineering attacks—uses an attacker
LLM to automatically generate
jailbreaks for a separate targeted LLM.
The attacker LLM iteratively queries the
target LLM to update and refine a
candidate jailbreak, often in fewer than
twenty queries.
P. Chao et al. Jailbreaking Black Box Large Language Models in Twenty Queries. 2023.
Long Contexts Can Hurt Accuracy
Changing the location of relevant information within the model’s input context results in a U-shaped performance curve: models are better at using relevant information that occurs at the very beginning (primacy bias) or end (recency bias) of the input context, and performance degrades significantly when models must access and use information located in the middle of the input context.
N. Liu et al. Lost in the Middle: How Language Models Use Long Contexts. TACL 2023.
Limitations of LLMs: Key Topics
● Reasoning and Planning
● Hallucinations
● Limited Context
● Safety
● Interpretability
● Cost and Energy
What Is The Future (???)
Scope of Course
Areas of emphasis:
● Foundations: lectures cover broadly applicable and (relatively) established techniques
● Systems: homeworks implementing those techniques using deep learning frameworks
● Research: topics derived from recent papers in top ML conferences (NeurIPS/ICLR/ICML)
● Experimentation: team project to implement and empirically evaluate a new technique
Topics not covered:
● Application Domains: we won’t dive into specific domains like NLP, Vision, or Robotics
● Theory: limited to mathematical concepts needed to understand and implement techniques
● Classical ML: we won’t cover classical ML approaches that predate LLMs
● AI Application Dev: we won’t teach you the AI dev stack or how to build enterprise AI apps
Learning Objectives
● Analyze design decisions in modern and upcoming transformer
architectures.
● Determine the hardware, software, and data requirements for pre-training
or fine-tuning an LLM for new tasks.
● Understand where LLMs should and should not be used based on their
capability and reliability.
● Leverage a deep understanding of LLM theory and software to design
prompts and applications around them.
Homeworks
HW0: Introductory assignment comparing and analyzing outputs from
different LLMs.
HW1: Build and understand the Transformer architecture from the ground up.
HW2: Explore techniques to adapt pre-trained LLMs to new tasks in an
efficient and performant manner.
HW3: Leverage patterns in pretrained weights to compress LLMs for
memory-efficient inference and fine-tuning.
HW4: Investigate the intersection of LLMs with symbolic reasoning and apply
it to challenging reasoning tasks.
Reference
● Adapted from CIS 7000 (2024), University of Pennsylvania