Deep Learning Applications
in Natural Language Processing
Huan Sun, Zhen Wang
Computer Science and Engineering
TDAI Foundations of Data Science & Artificial Intelligence
Deep Learning Summer School
Acknowledgement
• Stanford CS224n (Winter 2022) by Chris Manning, Anna Goldie
• UT Austin NLP courses by Greg Durrett
• Ohio State NLP CSE5525
• Textbook: Jurafsky and Martin, Speech and Language Processing
• References on the slides
What is
Natural Language Processing
(NLP)?
Source: https://www.citizenme.com/ai-citizenme-and-you-part-3-can-ai-read-or-hear/robot-reading/
A bit history…
• 1950 – 1969
• 1969 – 1992
• 1993 – 2012
• 2013 – present
Christopher Manning. “Human Language Understanding & Reasoning” in Daedalus, Spring 2022
A bit history…
• 1950 – 1969
• Machine translation (word-level lookups, rule-based mechanisms)
• 1969 – 1992
• Rule-based NLP demonstration systems
• Start to model the complexity of human language understanding
• 1993 – 2012
• Constructing annotated linguistic resources
• Supervised machine learning
• 2013 – present
• Deep learning
Christopher Manning. “Human Language Understanding & Reasoning” in Daedalus, Spring 2022
A bit history…
• 1950 – 1969
• Machine translation (word-level lookups, rule-based mechanisms)
• 1969 – 1992
• Rule-based NLP demonstration systems
• Start to model the complexity of human language understanding
• 1993 – 2012
• Constructing annotated linguistic resources
• Supervised machine learning
• 2013 – present
• Deep learning (2013 – 2017)
• Pre-trained self-supervised models (2018 – present)
Christopher Manning. “Human Language Understanding & Reasoning” in Daedalus, Spring 2022
— Christopher Manning, “Human Language Understanding & Reasoning,” Daedalus, Spring 2022
Why do we care in TDAI?
• Text data is everywhere
• Scientific articles
• Clinical texts
• Social media posts
• Financial news
• NLP: A key component in interdisciplinary collaboration
Tutorial Structure
Part I (~75 mins):
• Tasks
• Deep Learning Models
Break (~15 mins)
Part II (~45 mins):
• Large Language Models
• Demo
QA (~15 mins)
Popular Tasks
• Classification (language understanding)
  • Sentiment analysis
• Sequence labeling (language understanding)
  • Part of Speech (POS) tagging
  • Named entity recognition (NER)
• Sequence-to-sequence problems (language generation)
  • Language modeling
  • Machine translation
  • Text summarization
  • Dialogue response generation
• …
Application areas: Bioinformatics, Political Science, Cheminformatics, Business Intelligence, …
Sentiment Analysis
Source: https://twitter.com/friends_quotes1/status/649997787199873024
Sentiment Analysis
Given a piece of text, predict its label:
Classification: binary or multiclass
Named Entity Recognition (NER)
Example Source: https://monkeylearn.com/blog/named-entity-recognition/
Named Entity Recognition (NER)
Sequence labeling: BIO tagging scheme
O B-ORG O B-PER I-PER O O B-LOC O O B-MV I-MV
Ousted WeWork founder Adam Neumann lists his Manhattan penthouse for $37.5 million
Named Entity Recognition (NER)
Sequence labeling: BIO tagging scheme
B-PER: beginning of a PERSON entity; I-PER: inside of a PERSON entity
O B-ORG O B-PER I-PER O O B-LOC O O B-MV I-MV
Ousted WeWork founder Adam Neumann lists his Manhattan penthouse for $37.5 million
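Under the BIO scheme, each token gets exactly one tag, and entities are read off as maximal B-/I- runs. A minimal sketch (illustrative Python, not from the slides) that decodes the example above back into entity spans:

tokens = "Ousted WeWork founder Adam Neumann lists his Manhattan penthouse for $37.5 million".split()
tags = "O B-ORG O B-PER I-PER O O B-LOC O O B-MV I-MV".split()

def decode_bio(tokens, tags):
    # Collect (entity_type, text) spans from a BIO-tagged token sequence.
    entities, current = [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            current = [tag[2:], [token]]
            entities.append(current)
        elif tag.startswith("I-") and current is not None:
            current[1].append(token)
        else:
            current = None
    return [(etype, " ".join(words)) for etype, words in entities]

print(decode_bio(tokens, tags))
# [('ORG', 'WeWork'), ('PER', 'Adam Neumann'), ('LOC', 'Manhattan'), ('MV', '$37.5 million')]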
Popular Tasks
• Classification (language understanding)
  • Sentiment analysis
• Sequence labeling (language understanding)
  • Part of Speech (POS) tagging
  • Named entity recognition (NER)
• Sequence-to-sequence problems (language generation)
  • Language modeling
  • Machine translation
  • Text summarization
  • Dialogue response generation
• …
Application areas: Bioinformatics, Political Science, Cheminformatics, Business Intelligence, …
Language Modeling
Credit: Stanford CS224n, Winter 2022
Language Modeling
• We use language models every day!
Credit: Stanford CS224n, Winter 2022
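For reference (the standard definition, consistent with the CS224n treatment): a language model assigns a probability to a word sequence by predicting each word from the words before it,

P(w_1, \dots, w_T) \;=\; \prod_{t=1}^{T} P(w_t \mid w_1, \dots, w_{t-1})

so "using a language model" amounts to repeatedly ranking or sampling likely next words, which is exactly what autocomplete and predictive keyboards do.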
Dialogue Response Generation
[Sun et al., NAACL’21]
Tutorial Structure
Part I (~75 mins):
• Tasks
• Deep Learning Models
Break (~15 mins)
Part II (~45 mins):
• Large Language Models
• Demo
QA (~15 mins)
Deep Learning Models for NLP
• How to model a word?
• How to model a sequence of words?
• What is a “pre-trained” model?
How to Model a Word?
• Distributional semantics: A word’s meaning is given by the words
that frequently appear close-by
• “You shall know a word by the company it keeps” (J. R. Firth 1957: 11)
• When a word w appears in a text, its context is the set of words that appear
nearby (within a fixed-size window)
• We use the many contexts of w to build up a representation of w
Credit: Stanford CS224n, Winter 2022
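As a concrete illustration of "context within a fixed-size window" (a minimal sketch; the function and window size are illustrative, not from the slides):

from collections import defaultdict

def build_contexts(tokens, window=2):
    # For each word, gather the words within +/- `window` positions of it.
    contexts = defaultdict(list)
    for i, word in enumerate(tokens):
        contexts[word].extend(tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window])
    return contexts

contexts = build_contexts("you shall know a word by the company it keeps".split())
print(contexts["word"])  # ['know', 'a', 'by', 'the']

Aggregating such contexts over a large corpus is what Word2Vec-style models exploit to learn word vectors.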
How to Model a Word?
• Word2Vec [Mikolov et al., 2013]
How to Model a Word?
• Skip-gram [Mikolov et al., 2013]
Credit: Stanford CS224n, Winter 2022
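In the skip-gram formulation (the standard Word2Vec setup, stated here for completeness), each center word w_t is used to predict the words within a window of size m around it, by minimizing

J(\theta) = -\frac{1}{T}\sum_{t=1}^{T}\ \sum_{\substack{-m \le j \le m \\ j \neq 0}} \log P(w_{t+j} \mid w_t),
\qquad
P(o \mid c) = \frac{\exp(u_o^{\top} v_c)}{\sum_{w \in V} \exp(u_w^{\top} v_c)}

where v_c and u_o are the "center" and "context" vectors of the words; in practice the softmax is approximated, e.g., with negative sampling.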
How to Model a Sequence of Words?
• (Simple/Vanilla/Elman) Recurrent Neural Network (RNN) [Elman, 1990]
[Figure: an RNN processing the word sequence “OSU is beautiful”]
Image source: Jurafsky & Martin
How to Model a Sequence of Words?
• Recurrent Neural Network (RNN) [Elman, 1990]
A feedforward network
Image source: Jurafsky & Martin
How to Model a Sequence of Words?
• Recurrent Neural Network (RNN) [Elman, 1990]
Source: https://stanford.edu/~shervine/teaching/cs-230/cheatsheet-recurrent-neural-networks#architecture
How to Model a Sequence of Words?
• Recurrent Neural Network (RNN) [Elman, 1990]
How to Model a Sequence of Words?
• Recurrent Neural Network (RNN) [Elman, 1990]
• What are the commonly used activation functions?
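Commonly, tanh (and sometimes ReLU) is used for the hidden-state update, with a softmax over the vocabulary or label set at the output; gated variants (LSTM/GRU) additionally use sigmoids in their gates. A minimal sketch of one Elman RNN step (illustrative NumPy, with made-up dimensions):

import numpy as np

def rnn_step(x_t, h_prev, W_xh, W_hh, b_h):
    # h_t = tanh(x_t W_xh + h_{t-1} W_hh + b): the same weights are reused at every time step.
    return np.tanh(x_t @ W_xh + h_prev @ W_hh + b_h)

d_in, d_h = 4, 3
rng = np.random.default_rng(0)
W_xh, W_hh, b_h = rng.normal(size=(d_in, d_h)), rng.normal(size=(d_h, d_h)), np.zeros(d_h)
h = np.zeros(d_h)
for x_t in rng.normal(size=(5, d_in)):  # unroll over a 5-step input sequence
    h = rnn_step(x_t, h, W_xh, W_hh, b_h)
print(h)  # final hidden state summarizing the sequence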
Applications of RNNs
Tx (Ty): Number of timesteps on the input (output) side.
Source: https://stanford.edu/~shervine/teaching/cs-230/cheatsheet-recurrent-neural-networks#architecture
Applications of RNNs
(sequence-to-sequence)
Source: https://stanford.edu/~shervine/teaching/cs-230/cheatsheet-recurrent-neural-networks#architecture
Loss function of RNNs
Type: Many-to-one (e.g., input “The movie … great” → label “Positive”)
Example: Sentiment Analysis
Loss: Negative log likelihood of the gold label
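In one common formulation (assuming the final hidden state h_T feeds a softmax classifier with weights W), the loss for a gold label y* is

\mathcal{L} = -\log P(y^{*} \mid x_{1:T}), \qquad P(y \mid x_{1:T}) = \operatorname{softmax}(W h_T)_y .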
Loss function of RNNs
Type: Many-to-many (e.g., input “Ousted WeWork founder …” → tags “O B-ORG O …”)
Example: Named Entity Recognition
Loss: Negative log likelihood of the gold labels, summed over all time steps
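For sequence labeling, the same idea is applied at every position: with gold tags y_1*, …, y_T*, the loss is the sum of per-step negative log likelihoods,

\mathcal{L} = -\sum_{t=1}^{T} \log P(y_t^{*} \mid x_{1:t}) .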
Optimization of RNNs
Image source: Jurafsky & Martin
Optimization of RNNs:Vanishing/Exploding Gradient
Source: https://web.stanford.edu/class/cs224n/slides/cs224n-2022-lecture06-fancy-rnn.pdf
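The standard backpropagation-through-time argument (summarized here, not quoted from the slides): the gradient at step t reaches an earlier step k through a product of Jacobians,

\frac{\partial h_t}{\partial h_k} = \prod_{i=k+1}^{t} \frac{\partial h_i}{\partial h_{i-1}},

so over long spans this product tends to shrink toward zero (vanishing) or blow up (exploding), which is what gated architectures and gradient clipping address.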
Other Variants of RNNs
Source: https://stanford.edu/~shervine/teaching/cs-230/cheatsheet-recurrent-neural-networks#architecture
Transformer
Vaswani et al., “Attention is all you need,” 2017.
Used in (almost) every state-of-the-art NLP method!
Source: https://movieweb.com/transformers-projects-annoucement-paramount/
Transformer
Source: https://jalammar.github.io/illustrated-gpt2/
Transformer
A key design: Self-attention
Credit: Stanford CS224n, Winter 2022; https://jalammar.github.io/illustrated-gpt2/
Multi-head attention?
• High-level idea: perform self-attention multiple (i.e., h) times in parallel and combine the results
Credit: Stanford CS224n, Winter 2022
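A minimal sketch of (single-head) scaled dot-product self-attention, assuming an input matrix X of shape (sequence length, model dim) and learned projection matrices W_q, W_k, W_v (names illustrative):

import numpy as np

def self_attention(X, W_q, W_k, W_v):
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    scores = Q @ K.T / np.sqrt(K.shape[-1])                   # how much each position attends to every other
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ V                                        # each output mixes information from all positions

Multi-head attention runs h such heads in parallel with separate projections, concatenates their outputs, and applies a final linear layer.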
A high-level view of transformer encoder
Input: a sequence of word vectors
Output: a sequence of “contextualized” word vectors
Figure credit: https://www.arxiv-vanity.com/papers/1908.04211/
A high-level view of transformer decoder
Input: A sequence of words
Output: Probability distribution over the next word
Source: https://jalammar.github.io/illustrated-gpt2/
By now, we should know
• How to model a word
• How to model a sequence of words
Next
• How to model a word?
• How to model a sequence of words?
• What is a “pre-trained” model?
Recall: What is a language model (LM)?
Credit: Stanford CS224n, Winter 2022
A Transformer Decoder based Language Model
Source: https://jalammar.github.io/illustrated-gpt2/
“Pre-training” a Transformer Decoder based Language Model
• Generative Pre-trained Transformer (GPT)
• GPT, GPT-2, GPT-3, …
“self-supervision”, “downstream task agnostic”
Source: https://jalammar.github.io/illustrated-gpt2/ & Wikipedia
Pre-training using Masked Language Modeling (MLM)
BERT [Devlin et al., 2019]: a Transformer encoder that learns bidirectional representations
MLM example: given the input “Columbus is [MASK] in Ohio.”, BERT predicts “located” for the masked position.
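To see masked language modeling in action, the Hugging Face transformers library exposes pre-trained BERT through a fill-mask pipeline; a quick sketch (assuming transformers and the model download are available):

from transformers import pipeline

unmasker = pipeline("fill-mask", model="bert-base-uncased")
# Returns the most likely fillers for the [MASK] position, e.g., "located".
print(unmasker("Columbus is [MASK] in Ohio."))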
Pre-training + Fine-tuning
BERT [Devlin et al., 2019]
Pre-training: self-supervision based on natural sentences → Fine-tuning: task-specific data
Denoising Sequence-to-Sequence Pre-training
BART [Lewis et al., 2019]: pre-training sequence-to-sequence models
“Foundation Models” (BERT, RoBERTa, BART, T5, GPT-3, PaLM, …)
• Pre-trained on broad data (usually with self-supervised data at scale)
• Adaptable to a wide range of downstream tasks with minimal effort
“On the Opportunities and Risks of Foundation Models,” Stanford HAI, 2021.
Tutorial Structure
Part I (~75 mins):
• Tasks
• Deep Learning Models
Break (~15 mins)
Part II (~45 mins):
• Large Language Models
• Demo
QA (~15 mins)
Part II:
Large Language Models
& Demo
Outline: Further Discussion on Large Language Models
• An overview of popular large language models
• A general recipe of training large language models
• What can large language models do now?
• Promising future directions
Three Types of Language Models
Source: https://movieweb.com/transformers-projects-annoucement-paramount/; https://jalammar.github.io/illustrated-gpt2
Recap: Three Types of Large Language Models
• Encoder-only (e.g., BERT and its many variants such as RoBERTa and ALBERT; XLNet, ELECTRA)
  1. Gets bidirectional context – can condition on the future!
  2. Good at Natural Language Understanding (NLU)
• Decoder-only (e.g., GPT/GPT-2/GPT-3, LaMDA, Gopher, PaLM)
  1. Predicts the next word
  2. Good at Natural Language Generation (NLG)
• Encoder-decoder (e.g., T5, BART, Meena)
  • Suitable for sequence-to-sequence tasks
Source: Stanford CS224N: NLP with Deep Learning
Further Discussion on Large Language Models
• An overview of popular large language models
• A general recipe of training large language models
• What can large language models do now?
• Promising future directions
How to Train Large Language Models?
A Recipe for Modern LLMs!
Foundation Models:
• Big Neural Network (more parameters)
• Big Computer (more GPUs)
• Big Dataset (more data)
Recipe Credit: Ilya Sutskever’s talk on HAI Spring Conference 2022: Foundation Models
More Parameters: An Exponential Growth
Image credit: EI Seminar - Luke Zettlemoyer - Large Language Models: Will they keep getting bigger?
More GPUs: Computation Cost for Training LLMs
More Data: MassiveText Dataset
• Many huge datasets are collected
• MassiveText
• A diverse, 10-lingual textual dataset composed of web pages, GitHub, news, Wikipedia, books, and C4
• Disk size: 10.5 TB
• Token count: around 5T tokens
• Document count: 2.32B, with about 2k tokens per document on average
Table credit: https://vaclavkosar.com/ml/massivetext-dataset-pretraining-deepminds-gopher
Further Discussion on Large Language Models
• An overview of popular large language models
• A general recipe of training large language models
• What can large language models do now?
• Promising future directions
What Can Large Language Models Do Now?
Backbone model for nearly all NLP tasks now
• Small or medium language models: pre-training & fine-tuning paradigm
In-context learning without gradient updates
• Very large language models: generalization with natural language instructions
Multimodal learning
• Language, vision, speech
Pre-training & Fine-tuning: Superior Performance
Source: https://medium.com/synapse-dev/understanding-bert-transformer-attention-isnt-all-you-need-5839ebd396db
A New Paradigm: In-context Learning or Prompting
GPT-3 (Brown et al., 2020)
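The idea is that a frozen model can pick up a task purely from the prompt. An illustrative few-shot prompt in the style of the GPT-3 paper (the model is expected to continue with the French translation, with no gradient updates):

Translate English to French:
sea otter => loutre de mer
peppermint => menthe poivrée
cheese =>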
Generating Coherent Story
GPT-3 (Brown et al., 2020)
Source: https://www.buildgpt3.com/post/88/
Chain-of-thought Prompting
PaLM (Chowdhery et al., 2022)
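In chain-of-thought prompting, the in-context examples include intermediate reasoning steps, not just answers. An illustrative exemplar in the style of Wei et al. (2022):

Q: Roger has 5 tennis balls. He buys 2 more cans of tennis balls. Each can has 3 tennis balls. How many tennis balls does he have now?
A: Roger started with 5 balls. 2 cans of 3 tennis balls each is 6 tennis balls. 5 + 6 = 11. The answer is 11.

Given a new question after such exemplars, the model tends to produce its own reasoning chain before the final answer, which substantially improves multi-step arithmetic and reasoning accuracy.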
GitHub Copilot: Writing Useable Code
• Synthesizes functionally correct programs from docstrings for 28.8% of problems
Codex (Chen et al., 2021)
Creating Images based on Text Captions
• A teddy bear on a skateboard in times square
DALL-E 2 (Ramesh et al., 2022)
Creating Images based on Text Captions
• An astronaut riding a horse in a photorealistic style.
DALL-E 2 (Ramesh et al., 2022)
Creating Images based on Text Captions
• A dramatic renaissance painting of Elon Musk buying Twitter
DALL-E 2 (Ramesh et al., 2022)
Creating Images based on Text Captions
• Teddy bears working on new AI research on the moon in the 1980s
DALL-E 2 (Ramesh et al., 2022)
— Christopher Manning, “Human Language Understanding & Reasoning,” Daedalus, Spring 2022
Further Discussion on Large Language Models
• An overview of popular large language models
• A general recipe of training large language models
• What can large language models do now?
• Promising future directions
The Future of Large Language Models
Social Responsibility:
• Benchmarking foundation models
• Documenting the ecosystem
• Economic impact on writing jobs
• Homogenization of outcomes
• Reducing model biases
• Enhancing model fairness
• Reducing negative impacts on the environment (Green AI)
Technical Advances:
• Diffusion models
• Retrieval-based models
• Efficient training
• Lightweight fine-tuning
• Decentralized training
• Understanding in-context learning
• Understanding the role of data
• Approximating optimal representations
• Structured state space sequence models
Applications:
• Domain adaptation
• Differential privacy
• Writing assistance
• Prototyping social spaces
• Robotics (video, control)
• Audio (speech, music)
• Neuroscience
• Medicine (images, text)
• Bioinformatics
• Chemistry
• Law
Partially adapted from Percy Liang’s talk on HAI Spring Conference 2022: Foundation Models
Demo
I. Sentiment Analysis with BERT
II. Text Generation with GPT-3
Demo I: Sentiment Analysis with BERT
§ We will show how to fine-tune BERT for sentiment analysis
§ Colab: TDAI Summer School Tutorial
§ Adapted from Venelin Valkov’s Tutorial
§ Data: Google Play app reviews dataset with five review scores
§ ~16K samples in total
§ We normalize scores to three classes (negative, neutral, positive)
Hands on: Fine-tuning BERT on Sentiment Analysis
Key Points:
§ Keep the main body of BERT unchanged
§ Add a linear output layer on top of the model
Fine-tuning Procedure:
§ Tokenize the review text and map tokens to their vocabulary ids
§ Feed the tokens into BERT and extract the last hidden state at the [CLS] token
§ Pass the [CLS] hidden state through the linear output layer with a softmax to obtain class probabilities
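A condensed PyTorch sketch of this setup (assuming the Hugging Face transformers package; class and variable names are illustrative, not the exact notebook code):

import torch.nn as nn
from transformers import BertModel

class SentimentClassifier(nn.Module):
    def __init__(self, n_classes=3, model_name="bert-base-cased"):
        super().__init__()
        self.bert = BertModel.from_pretrained(model_name)               # main body of BERT, unchanged
        self.out = nn.Linear(self.bert.config.hidden_size, n_classes)   # added linear output layer

    def forward(self, input_ids, attention_mask):
        outputs = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        cls_state = outputs.last_hidden_state[:, 0]   # last hidden state at the [CLS] position
        return self.out(cls_state)                    # logits; softmax / cross-entropy applied at training time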
Hands on: Fine-tuning BERT on Sentiment Analysis
Key Steps:
§ Data Preprocessing (sketched in code after this list)
1. Tokenization
2. Truncate and pad
3. Special tokens
4. Attention masking
5. Convert to ids
§ Model Building
§ Load original BERT
§ Add a linear output layer
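A rough sketch of the preprocessing steps above with the matching tokenizer (parameter values are illustrative):

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-cased")
encoding = tokenizer.encode_plus(
    "Best app I have ever used!",   # a hypothetical review text
    max_length=160,                 # truncate / pad to a fixed length
    truncation=True,
    padding="max_length",
    add_special_tokens=True,        # adds [CLS] and [SEP]
    return_attention_mask=True,     # distinguishes real tokens from padding
    return_tensors="pt",            # ids as PyTorch tensors
)
print(encoding["input_ids"].shape, encoding["attention_mask"].shape)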
Hands on: Fine-tuning BERT on Sentiment Analysis
Key Steps:
§ Training and Inference
§ Fine-tuning hyper-parameters
§ AdamW optimizer
§ Fine-tune for 3 epochs
§ Learning rate: 2e-5 to 0
§ Linear schedule
§ Linearly increase to the learning rate
§ Linearly decrease to 0
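A sketch of the optimizer and linear warmup/decay schedule described above (model, train_loader, and the warmup length are assumed to exist elsewhere in the notebook):

from torch.optim import AdamW
from transformers import get_linear_schedule_with_warmup

optimizer = AdamW(model.parameters(), lr=2e-5)
total_steps = len(train_loader) * 3        # 3 fine-tuning epochs
scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=0,                    # optionally warm up: linearly increase to lr
    num_training_steps=total_steps,        # then linearly decay to 0
)
# In the training loop: loss.backward(); optimizer.step(); scheduler.step(); optimizer.zero_grad()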
Demo II: Text Generation with GPT-3
• We will show how to generate coherent text with the OpenAI API
• https://beta.openai.com/playground
Goals:
• Learn important generation parameters
• Get a sense of how to craft prompts for GPT-3
Hands on: Text Generation with GPT-3
Important generation parameters
• Engine – different GPT-3 models
• Temperature – control generation randomness
• Maximum length
Example
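A minimal sketch of using the same settings through the openai Python package's (then-current) Completions endpoint; the engine name and prompt are placeholders, not taken from the demo:

import openai

openai.api_key = "YOUR_API_KEY"
response = openai.Completion.create(
    engine="text-davinci-002",     # "Engine": which GPT-3 model to use
    prompt="Write a short, upbeat product description for a reusable water bottle.",
    temperature=0.7,               # higher values give more random generations
    max_tokens=150,                # "Maximum length" of the completion
)
print(response["choices"][0]["text"])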
Tutorial Structure
Part I (~75 mins):
• Tasks
• Deep Learning Models
Break (~15 mins)
Part II (~45 mins):
• Large Language Models
• Demo
QA (~15 mins)
Thank You
& QA