Introduction and Course Overview LLMs 2025

The document provides an overview of the Turing Test, large language models (LLMs), and their evolution from early rule-based systems to modern deep learning architectures. It discusses the capabilities, limitations, and future directions of LLMs, including their applications, evaluation methods, and the challenges they face. Additionally, it outlines course logistics and learning objectives related to LLMs and their implementation.


Introduction

Dr. Nguyễn Văn Vinh - UET


Content

● The Turing Test

● Overview of LLMs

○ How do LLMs work, What LLMs can do, Limitations of LLMs, What is the future

● Course Logistics
The Imitation Game (aka The Turing Test)

Proposed in 1950 by Alan M. Turing, who is considered the father of theoretical computer science.

Tests a machine's ability to exhibit intelligent behaviour equivalent to, or indistinguishable from, that of a human – via language.

Language modeling has since been proposed as a benchmark to measure progress toward AI.

“I believe that in about fifty years’ time it will be possible to programme computers, with a storage
capacity of about 10^9, to make them play the imitation game so well that an average interrogator will
not have more than 70% chance of making the right identification after five minutes of questioning.”
— A. Turing. Computing Machinery and Intelligence. Mind, 1950.
Eras of Language Modeling

● Symbolic Era (Pre-1990): Rule-based approaches, expert systems, limited generalization.
● Statistical Era (1990-2010): Data-driven approaches, probabilistic models, introduction of corpora.
● Scale Era (2010 onwards): Deep learning and neural nets, general-purpose LMs, massive datasets and compute.

Timeline: Turing Test (1950), ELIZA (1966), ChatGPT (2022).


ELIZA (1966)

Early NLP program developed by Joseph Weizenbaum at MIT.

Created the illusion of a conversation by rephrasing user statements as questions, using a pattern matching and substitution methodology.

One of the first programs capable of attempting the Turing test.

Try it out at https://web.njit.edu/~ronkowit/eliza.html


Has AI Passed The Turing Test?

How do we even tell?

“The best-performing GPT-4 prompt passed in 49.7% of games, outperforming ELIZA (22%) and GPT-3.5 (20%), but falling short of the baseline set by human participants (66%).”

C. Jones and B. Bergen. Does GPT-4 pass the Turing test? 2024.

“ChatGPT-4 exhibits behavioral and personality traits that are statistically indistinguishable from a random human from tens of thousands of human subjects from more than 50 countries.”

Q. Mei et al. A Turing test of whether AI chatbots are behaviorally similar to humans. PNAS, 2024.
A Social Turing Game

Chat with someone for two minutes and guess if it was a fellow human or an AI bot.
The AI bots in the game are chosen from a mix of different LLMs, including Jurassic-2,
GPT-4, Claude, and Cohere.

https://www.humanornot.ai/

Part of a larger scientific research project by AI21 Labs.


D. Jannai et al. Human or Not? A Gamified Approach to the Turing Test. 2023.

Question: Can you identify a flaw of using this game as a Turing Test?
Has AI Passed The Turing Test?

How do we even tell?

Is the test even a valid measure of AI’s capabilities?

What are the ethical implications of passing the test?

And many others …


Overview of LLMs

How do LLMs work: What is the technology underlying a chatbot like ChatGPT?

What LLMs can do: What functionality beyond chatbots does the technology enable?

Limitations of LLMs: What fundamental challenges remain to be addressed?

What is the Future: How is research addressing those challenges?
How Do LLMs Work
Let’s Take a History Tour!

“Those who cannot remember the past are condemned to repeat it.”
— George Santayana. The Life of Reason, 1905.
Linguistic Foundations

Rule-based approaches

Example rule in a chatbot based on AIML (Artificial Intelligence Markup Language), which was developed in 1992-2002.

AIML formed the basis for a highly extended Eliza called A.L.I.C.E. ("Artificial Linguistic Internet Computer Entity").
Linguistic Foundations

Semantic parsing: analyzing the linguistic structure of text

The introduction of corpora …


The Penn Treebank (PTB) corpus developed during 1989-1996 was widely used for evaluating models for sequence labelling. The task consists of annotating each word with its Part-of-Speech tag.

M. Marcus et al. Building a Large Annotated Corpus of English: The Penn Treebank. Computational Linguistics, 1993.

Figures: an example of constituency parsing using a context-free grammar, and the same example using dependency parsing.
Word Embeddings

● Represent each word using a “vector” of numbers.

● Converts a “discrete” representation to “continuous”.

● Many benefits:
○ More “fine-grained” representations of words.
○ Useful computations such as cosine similarity and Euclidean distance (see the sketch below).
○ Visualization and mapping of words onto a semantic space.
○ Can be learnt in a self-supervised manner from a large corpus.

● Examples:
○ Word2Vec (2013), GloVe, BERT, ELMo
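As a small illustration of the computations mentioned above, a minimal Python sketch with made-up toy vectors (not real learned embeddings):

```python
import numpy as np

# Toy 4-dimensional "word vectors" (illustrative values, not learned embeddings).
embeddings = {
    "king":  np.array([0.80, 0.65, 0.10, 0.05]),
    "queen": np.array([0.78, 0.70, 0.12, 0.04]),
    "apple": np.array([0.05, 0.10, 0.90, 0.70]),
}

def cosine_similarity(u: np.ndarray, v: np.ndarray) -> float:
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

print(cosine_similarity(embeddings["king"], embeddings["queen"]))  # close to 1
print(cosine_similarity(embeddings["king"], embeddings["apple"]))  # much smaller
```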
Seq2Seq Models

● Recurrent Neural Networks (RNNs)
● Long Short-Term Memory Networks (LSTMs)
● Capture dependencies between input tokens
● Gates control the flow of information

Figure: A single LSTM unit displayed as a computation graph. The inputs to each unit consist of the current input x_t, the previous hidden state h_(t-1), and the previous context c_(t-1); the outputs are a new hidden state h_t and an updated context c_t.

Figure: A simple RNN shown unrolled in time. Network layers are recalculated for each time step, while weights U, V and W are shared across all time steps (see the sketch below).
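A minimal sketch of a single vanilla RNN step in Python/NumPy, using the U, V, W naming from the figure caption; the sizes, initialization, and toy sequence are illustrative only (an LSTM would add the gates mentioned above on top of this recurrence).

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes: input dimension d_in, hidden dimension d_h, output dimension d_out.
d_in, d_h, d_out = 8, 16, 8

# Shared weights across all time steps (matching the U, V, W naming in the figure):
U = rng.normal(scale=0.1, size=(d_h, d_in))    # input  -> hidden
W = rng.normal(scale=0.1, size=(d_h, d_h))     # hidden -> hidden (recurrence)
V = rng.normal(scale=0.1, size=(d_out, d_h))   # hidden -> output

def rnn_step(x_t: np.ndarray, h_prev: np.ndarray):
    """One time step of a vanilla RNN: returns the new hidden state and the output."""
    h_t = np.tanh(U @ x_t + W @ h_prev)
    y_t = V @ h_t
    return h_t, y_t

# Unroll over a short toy sequence, carrying the hidden state forward.
h = np.zeros(d_h)
for x in rng.normal(size=(5, d_in)):
    h, y = rnn_step(x, h)
print(h.shape, y.shape)  # (16,) (8,)
```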
Self-Attention and Transformers

● Allows the model to “focus attention” on particular aspects of the input while generating the output.
● Done by using a set of parameters, called "weights," that determine how much attention should be paid to each input token at each time step.
● These weights are computed using a combination of the input and the current hidden state of the model (see the sketch below).

Figure: In encoding the word "it", one attention head is focusing most on "the animal", while another is focusing on "tired". The model's representation of the word "it" thus bakes in some of the representation of both "animal" and "tired".

A. Vaswani et al. Attention Is All You Need. NeurIPS 2017.
https://jalammar.github.io/illustrated-transformer/
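A minimal NumPy sketch of single-head scaled dot-product self-attention in the style of Vaswani et al.; the projection matrices and sizes are illustrative toy values, and masking and the multi-head machinery are omitted.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, W_q, W_k, W_v):
    """Single-head scaled dot-product self-attention over a sequence X (seq_len x d_model)."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)        # how strongly each token attends to every other token
    weights = softmax(scores, axis=-1)     # each row sums to 1: the attention distribution per token
    return weights @ V, weights            # weighted mix of value vectors, plus the weights themselves

# Toy example: 4 tokens, model dimension 8, head dimension 4 (illustrative sizes).
seq_len, d_model, d_head = 4, 8, 4
X = rng.normal(size=(seq_len, d_model))
W_q, W_k, W_v = (rng.normal(scale=0.5, size=(d_model, d_head)) for _ in range(3))

output, attn = self_attention(X, W_q, W_k, W_v)
print(output.shape, attn.shape)  # (4, 4) (4, 4)
```

The attention weights in `attn` play the role of the per-token "focus" shown in the figure above.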
Pre-Training: Data Preparation

A typical data preparation pipeline for pre-training LLMs:

W. Zhao et al. A Survey of Large Language Models. 2023.
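The pipeline figure from the survey is not reproduced here; as a rough, illustrative sketch of typical stages (quality filtering, de-duplication, tokenization; the exact stages and heuristics vary from pipeline to pipeline), a toy Python outline:

```python
import hashlib

def quality_filter(doc: str) -> bool:
    """Toy heuristic filter: keep documents that are long enough and mostly alphabetic."""
    letters = sum(c.isalpha() for c in doc)
    return len(doc) > 200 and letters / max(len(doc), 1) > 0.6

def deduplicate(docs):
    """Exact de-duplication by content hash (real pipelines also use fuzzy/MinHash dedup)."""
    seen, unique = set(), []
    for doc in docs:
        h = hashlib.sha256(doc.encode("utf-8")).hexdigest()
        if h not in seen:
            seen.add(h)
            unique.append(doc)
    return unique

def tokenize(doc: str):
    """Placeholder tokenizer: whitespace split (real pipelines use subword tokenizers such as BPE)."""
    return doc.split()

def prepare(raw_docs):
    """Filter, de-duplicate, then tokenize raw documents into training sequences."""
    docs = [d for d in raw_docs if quality_filter(d)]
    docs = deduplicate(docs)
    return [tokenize(d) for d in docs]
```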


Pre-Training Data Quality Reduces Reliance on Compute

S. Hooker. On the Limitations of Compute Thresholds as a Governance Strategy. 2024.


Pre-Training: Parallelism

4D Parallelism to minimize bottlenecks and maximize efficiency: combines Data, Context, Pipeline (Vertical), and Tensor (Horizontal) Parallelism.

● Data Parallelism parallelizes tasks to speed up data processing and model iterations.
● Context Parallelism splits input sequences into chunks to be processed separately.

K. Pijanowski and M. Galarnyk. What is Distributed Training? 2022.
S. Li et al. Sequence Parallelism: Long Sequence Training from System Perspective. 2021.
Pre-Training: Parallelism

● Pipeline Parallelism separates a model based on its layers, allowing higher throughput.
● Tensor Parallelism splits matrices across GPUs to reduce peak memory consumption.
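Tensor parallelism is easiest to see with a small sketch: split a weight matrix column-wise across two "devices", compute partial outputs independently, then concatenate (an all-gather in a real system). This illustrative NumPy version uses plain arrays in place of GPUs.

```python
import numpy as np

rng = np.random.default_rng(0)

d_in, d_out = 8, 12
x = rng.normal(size=(d_in,))
W = rng.normal(size=(d_in, d_out))

# Tensor (column) parallelism: each "device" holds only half of W's columns,
# so peak memory per device is halved; each computes a partial output.
W_dev0, W_dev1 = np.split(W, 2, axis=1)
y_dev0 = x @ W_dev0   # computed on device 0
y_dev1 = x @ W_dev1   # computed on device 1

# Concatenating the partial results (an all-gather in a real system)
# reproduces the full, unsplit matmul exactly.
y_parallel = np.concatenate([y_dev0, y_dev1])
assert np.allclose(y_parallel, x @ W)
```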
Pre-Training: Scaling Laws

Given a fixed compute budget, what is the optimal model size and training
dataset size for training a transformer LM?

Chinchilla Scaling Law:

For every doubling of model size, the number of training tokens must also be doubled.

J. Hoffmann et al. Training Compute-Optimal Large Language Models. 2022.
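A rough back-of-the-envelope sketch of the compute-optimal trade-off, using two commonly cited approximations (training compute C ≈ 6·N·D FLOPs and roughly 20 training tokens per parameter); the fitted coefficients in Hoffmann et al. differ slightly, so treat the numbers as illustrative.

```python
def chinchilla_optimal(compute_flops: float, tokens_per_param: float = 20.0):
    """
    Rough compute-optimal split of a training budget, using two commonly cited
    approximations: training compute C ~= 6 * N * D FLOPs, and a compute-optimal
    ratio of roughly `tokens_per_param` training tokens per model parameter.
    (The fitted constants in Hoffmann et al. differ slightly; this is a sketch.)
    """
    # From C = 6 * N * D and D = tokens_per_param * N:
    #   C = 6 * tokens_per_param * N^2  =>  N = sqrt(C / (6 * tokens_per_param))
    n_params = (compute_flops / (6.0 * tokens_per_param)) ** 0.5
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens

# Example: a 1e23 FLOP budget (illustrative).
n, d = chinchilla_optimal(1e23)
print(f"~{n / 1e9:.1f}B parameters, ~{d / 1e9:.0f}B tokens")
```

Note that both N and D scale as the square root of C here, which is another way of stating the rule above: doubling the model size calls for doubling the training tokens.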


Post-Training: Instruction-Tuning and Alignment

● Pre-Training: Massive amounts of data from Internet, books, etc. Problem: a model that can babble on about anything, but not aligned with what we want (e.g. Question-Answering).

● Instruction Fine-tuning: Teach the model to respond to instructions.

● Reinforcement Learning from Human Feedback: Teach the model to produce output closer to what humans like.
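To make "teach the model to respond to instructions" concrete, instruction-tuning datasets typically pair a prompt with a desired response. A hypothetical example record (the field names are illustrative, not from any specific dataset):

```python
# A hypothetical instruction-tuning example (field names are illustrative).
example = {
    "instruction": "Summarize the following paragraph in one sentence.",
    "input": "The Turing Test, proposed by Alan Turing in 1950, ...",
    "output": "The Turing Test measures whether a machine's conversational "
              "behaviour is indistinguishable from a human's.",
}

# During supervised instruction fine-tuning, the instruction and input form the
# prompt, and the loss is typically computed only on the tokens of the desired output.
prompt = f"{example['instruction']}\n\n{example['input']}\n\n"
target = example["output"]
```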
Evaluation

● Datasets
○ GLUE, SuperGLUE (General language understanding)
○ HumanEval (Coding)
○ HellaSwag (Commonsense reasoning)
○ GSM-8K (Math)
● Human Preferences
○ Chatbot Arena: Crowdsourced platform where humans vote on pairwise comparisons of different LLMs (akin to the Elo rating system in Chess; see the sketch below).
● LLMs as Judges
○ LLMs can approximate human preferences at far lower cost!
○ L. Zheng et al. Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena.
NeurIPS 2023 Datasets and Benchmarks Track.
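A minimal sketch of an Elo-style rating update for such pairwise votes; the K-factor and starting ratings are illustrative, and the arena's actual rating computation may differ in detail.

```python
def expected_score(r_a: float, r_b: float) -> float:
    """Expected win probability of player A against player B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))

def elo_update(r_a: float, r_b: float, score_a: float, k: float = 32.0):
    """Update both ratings after one comparison; score_a is 1 (A wins), 0 (B wins), or 0.5 (tie)."""
    e_a = expected_score(r_a, r_b)
    r_a_new = r_a + k * (score_a - e_a)
    r_b_new = r_b + k * ((1.0 - score_a) - (1.0 - e_a))
    return r_a_new, r_b_new

# Illustrative: two models start at 1000; model A wins one head-to-head vote.
print(elo_update(1000.0, 1000.0, score_a=1.0))  # -> (1016.0, 984.0)
```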
How Do LLMs Work: Key Topics

● Transformer Architecture
○ Self-Attention, Input/Output Processing, Architecture Variations,
Training and Inference
● Pre-Training
○ Data Preparation (Tokenization, etc.), Parallelism, Scaling Laws
● Post-Training
○ Instruction Following/Tuning, Alignment
● Evaluation
What LLMs Can Do
Evolution of LMs from Perspective of Task-Solving Capacity

W. Zhao et al. A Survey of Large Language Models. 2023.


Few-Shot Prompting

Prompt to the LLM:
  Q: “Elon Musk”
  A: “nk”
  Q: “Bill Gates”
  A: “ls”
  Q: “Barack Obama”
  A:

Ideal output: “ka” (GPT-4: “am”)

T. Brown et al. Language Models are Few-Shot Learners. NeurIPS 2020.
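A small sketch of how the few-shot prompt above might be assembled programmatically; the call that actually sends it to a model is left out, since any completions API could be used.

```python
# Few-shot prompt for the "concatenate the last letters of each name" task,
# mirroring the examples on the slide.
examples = [
    ("Elon Musk", "nk"),
    ("Bill Gates", "ls"),
]
query = "Barack Obama"

prompt = ""
for question, answer in examples:
    prompt += f'Q: "{question}"\nA: "{answer}"\n\n'
prompt += f'Q: "{query}"\nA:'

print(prompt)
# The assembled prompt would then be sent to an LLM; the ideal completion is "ka".
```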


Chain-of-Thought Prompting

Prompt to the LLM:
  Q: “Elon Musk”
  A: the last letter of "Elon" is "n". The last letter of "Musk" is "k". Concatenating "n", "k" leads to "nk". So the output is "nk".
  Q: “Barack Obama”
  A:

LLM output: the last letter of "Barack" is "k". The last letter of "Obama" is "a". Concatenating "k", "a" leads to "ka". So, the output is "ka".

J. Wei et al. Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. NeurIPS 2022.
CoT as an Emergent Property of Model Scale

J. Wei et al. Emergent Abilities of Large Language Models. TMLR 2022.


From Prompting to Fine-Tuning

Unlike prompting, fine-tuning actually changes the model under the hood, giving
better domain- or task-specific performance.

Source: Andrej Karpathy @karpathy (not to scale)


Case Study in Law: Harvey AI

● Startup building a custom-trained case law model for drafting documents, answering
questions about complex litigation scenarios, and identifying material discrepancies
between hundreds of contracts.

● Added 10 billion tokens worth of data to power the model, starting with case law
from Delaware, and then expanding to include all of U.S. case law.

● Attorneys from 10 large law firms preferred custom model’s output versus GPT-4’s
97% of the time. Main benefit was reduced hallucinations!

OpenAI Customer Stories: Harvey. April 2024.


Case Study in Law: Harvey AI
Parameter Efficient Fine-Tuning (PEFT)

Techniques like LoRA construct a low-rank parameterization (frozen pre-trained weights plus small trainable adapter matrices) for parameter efficiency during training.

For inference, the model can be converted to its original weight parameterization to ensure unchanged inference speed.

Figure: GPT-3 175B validation accuracy vs. number of trainable parameters of several adaptation methods on WikiSQL. LoRA exhibits better scalability and task performance.

E. Hu et al. LoRA: Low-Rank Adaptation of Large Language Models. ICLR 2022.
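A minimal NumPy sketch of the LoRA idea: keep the pre-trained weight W frozen, learn a low-rank update B·A so only r·(d_in + d_out) parameters are trainable, and merge the update back into W for inference. The shapes and rank are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

d_in, d_out, rank = 64, 64, 4

# Frozen pre-trained weight (never updated during fine-tuning).
W = rng.normal(scale=0.02, size=(d_out, d_in))

# Trainable low-rank factors: B starts at zero so the adapted model
# initially behaves exactly like the pre-trained one.
A = rng.normal(scale=0.01, size=(rank, d_in))
B = np.zeros((d_out, rank))

def lora_forward(x: np.ndarray) -> np.ndarray:
    """Adapted forward pass: frozen weight plus the low-rank update."""
    return W @ x + B @ (A @ x)

# For inference, merge the update into a single weight matrix so there is
# no extra latency compared to the original model.
W_merged = W + B @ A

x = rng.normal(size=(d_in,))
assert np.allclose(lora_forward(x), W_merged @ x)

# Trainable vs. total parameters in this layer (illustrative numbers).
print(A.size + B.size, "trainable of", W.size, "total")
```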


Design Spaces and The CLAM Framework

CLAM enables unlimited chaining of popular optimization techniques in parameter-efficient finetuning, quantization, and pruning on nearly every modern LLM.

N. Velingker et al. CLAM: Unifying Finetuning, Quantization, and Pruning through Unrestricted Chaining of
LLM Adapter Modules. 2024.
What LLMs Can Do: Key Topics

● Prompt Engineering
○ Few-Shot, Chain-of-Thought (CoT), etc.

● Adaptation (aka Fine-Tuning)


○ Parameter-Efficient Techniques (PEFT)
○ Design Spaces
○ The CLAM Framework
Limitations of LLMs
Unreliable Reasoning Even On Simple Tasks

Probably due to tokenization!

Generated by gpt-4o’s tokenizer.

Try it out at: https://tiktokenizer.vercel.app/
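A short sketch of inspecting tokenization programmatically, assuming the tiktoken library and that gpt-4o corresponds to the o200k_base encoding (an assumption worth checking against your installed version); the exact token splits depend on the tokenizer.

```python
import tiktoken  # pip install tiktoken

# gpt-4o is generally associated with the "o200k_base" encoding (assumption; verify for your version).
enc = tiktoken.get_encoding("o200k_base")

word = "strawberry"
token_ids = enc.encode(word)
pieces = [enc.decode([t]) for t in token_ids]

# The model "sees" a handful of sub-word chunks rather than individual letters,
# which helps explain failures on letter-level tasks such as counting the r's.
print(token_ids, pieces)
```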
Jailbreaking Can Bypass Safety

Process of manipulating prompts to bypass an LLM’s safeguards, leading to harmful outputs.

PAIR, which is inspired by social engineering attacks, uses an attacker LLM to automatically generate jailbreaks for a separate targeted LLM. The attacker LLM iteratively queries the target LLM to update and refine a candidate jailbreak, often in fewer than twenty queries.

P. Chao et al. Jailbreaking Black Box Large Language Models in Twenty Queries. 2023.
Long Contexts Can Hurt Accuracy

Changing the location of relevant information within the model’s input context results in a U-shaped performance curve: models are better at using relevant information that occurs at the very beginning (primacy bias) or end (recency bias) of their input context, and performance degrades significantly when models must access and use information located in the middle of their input context.

N. Liu et al. Lost in the Middle: How Language Models Use Long Contexts. TACL 2023.
Limitations of LLMs: Key Topics

● Reasoning and Planning


● Hallucinations
● Limited Context
● Safety
● Interpretability
● Cost and Energy
What Is The Future (???)
Scope of Course

Areas of emphasis:
● Foundations: lectures cover broadly applicable and (relatively) established techniques
● Systems: homeworks implementing those techniques using deep learning frameworks
● Research: topics derived from recent papers in top ML conferences (NeurIPS/ICLR/ICML)
● Experimentation: team project to implement and empirically evaluate a new technique

Topics not covered:


● Application Domains: we won’t dive into specific domains like NLP, Vision, or Robotics
● Theory: limited to mathematical concepts needed to understand and implement techniques
● Classical ML: we won’t cover classical ML approaches that predate LLMs
● AI Application Dev: we won’t teach you the AI dev stack or how to build enterprise AI apps
Learning Objectives

● Analyze design decisions in modern and upcoming transformer architectures.
● Determine the hardware, software, and data requirements for pre-training
or fine-tuning an LLM for new tasks.
● Understand where LLMs should and should not be used based on their
capability and reliability.
● Leverage a deep understanding of LLM theory and software to design
prompts and applications around them.
Homeworks

HW0: Introductory assignment comparing and analyzing outputs from different LLMs.

HW1: Build and understand the Transformer architecture from the ground up.

HW2: Explore techniques to adapt pre-trained LLMs to new tasks in an efficient and performant manner.

HW3: Leverage patterns in pretrained weights to compress LLMs for memory-efficient inference and fine-tuning.

HW4: Investigate the intersection of LLMs with symbolic reasoning and apply it to challenging reasoning tasks.
Reference

● Adapted from CIS 7000 (2024), University of Pennsylvania
