CS4740/5740 Introduction To NLP
Fall 2017
Neural Language Models and Classifiers
Final report: due via Gradescope & CMS by Friday 12/1 11:59PM
1 Overview
In this project you will gain familiarity with, and a working knowledge of, neural language models. This time we provide the bulk of the code in Python, using the dynet and pytorch libraries. You will have to understand the code, extend it slightly, and run various experiments for the report. Choose one of the two libraries, and make sure you have the latest release installed. We recommend dynet for NLP tasks, since it is both easier to code for and about 2-3x faster on CPUs.
Useful reading.
• Chapter 8 in Jurafsky & Martin, 3rd ed., especially 8.3 and 8.5. (Note that Chapter 8 is required reading anyway, so do read it.) https://web.stanford.edu/~jurafsky/slp3/8.pdf
• Graham Neubig’s slides (http://phontron.com/class/nn4nlp2017/assets/slides/nn4nlp-02-lm.pdf) and lecture notes (http://phontron.com/class/mtandseq2seq2017/mt-spring2017.chapter5.pdf)
2 Dataset
We’re back to movie reviews for this assignment, but this time we are providing a larger version of the Pang & Lee dataset. Do not download the full dataset from its on-line site: we are providing you with a slightly reorganized version via CMS, in order to make for a more interesting project.
Task 1. Download the data and the code from CMS. Look through the
preprocess.py file and understand what it does.a What is the size
of the preprocessed vocabulary? Explain how you obtained your
answer.
a Along the way, try to identify some things you were making overcomplicated in
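For Task 1, one possible way to check the vocabulary size is to inspect whatever vocabulary object preprocess.py pickles. The file name and structure below are assumptions, not necessarily what the provided code actually writes:

    import os
    import pickle

    # Hypothetical file name: adjust to whatever preprocess.py really produces.
    with open(os.path.join('processed', 'vocab.pkl'), 'rb') as f:
        vocab = pickle.load(f)

    # Assuming vocab maps words to indices, its length is the vocabulary size.
    print(len(vocab))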
3 Sentiment classification
We will first train a neural sentiment classifier using a very simple model: a deep averaging network (DAN) with one hidden layer.1 Importantly, we’re using this model as a generic introduction to how neural networks are constructed, trained and evaluated.
\[
\sigma(z)_i = \frac{\exp(2 z_i) - 1}{\exp(2 z_i) + 1}
\]
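As an aside (this snippet is illustrative and not part of the provided code), the elementwise nonlinearity above is just the hyperbolic tangent, which a quick numerical check confirms:

    import numpy as np

    z = np.linspace(-3.0, 3.0, 7)
    sigma = (np.exp(2 * z) - 1) / (np.exp(2 * z) + 1)
    # The formula above coincides with tanh, elementwise.
    assert np.allclose(sigma, np.tanh(z))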
1 Mohit Iyyer, Varun Manjunatha, Jordan Boyd-Graber, and Hal Daumé III. “Deep unordered composition rivals syntactic methods for text classification.” In: ACL 2015.
A common approach is to limit the vocabulary size to a value |V| and learn a matrix of word vectors E ∈ ℝ^{|V|×n}. Then, we can use E_w ∈ ℝ^n as a vector representation of word w.
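As a rough illustration (a sketch only, not the provided code; the sizes and variable names below are made up), this is how such an embedding matrix is commonly represented and indexed in pytorch:

    import torch
    import torch.nn as nn

    vocab_size, emb_dim = 10000, 50            # hypothetical |V| and n
    embed = nn.Embedding(vocab_size, emb_dim)  # the word-vector matrix E, shape |V| x n

    word_ix = torch.tensor([42])               # integer index of some word w
    e_w = embed(word_ix)                       # E_w: a row of E, i.e. a vector in R^n
    print(e_w.shape)                           # torch.Size([1, 50])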
Training. We want to find the best model parameters (E, W, b, W_out, b_out). To do this, we optimize with respect to the binary logistic loss

\[
\ell(\text{sentence}, y) = -y \log(p) - (1 - y) \log(1 - p)
\]
where y is the true label: y = 1 if the sentence is truly positive, else y = 0.
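For concreteness, a minimal sketch of this per-sentence loss in pytorch (here p is assumed to already be the model's predicted probability that the sentence is positive; the values are made up):

    import torch

    p = torch.tensor(0.8)  # hypothetical model output, P(positive | sentence)
    y = torch.tensor(1.0)  # gold label: 1.0 for positive, 0.0 for negative

    # Binary logistic loss, exactly as in the formula above.
    loss = -(y * torch.log(p) + (1 - y) * torch.log(1 - p))
    # torch.nn.functional.binary_cross_entropy(p, y) computes the same value.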
In this assignment we use mini-batch training, which is faster: we group all of the training data into sets of (say) 32 sentences. We then compute the batch-level loss:

\[
\ell(\text{batch}) = \sum_{(\text{sentence},\, y) \in \text{batch}} \ell(\text{sentence}, y)
\]
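A sketch of that summation (sentence_loss here is a hypothetical stand-in for whatever function computes the per-sentence loss in your code):

    def batch_loss(batch, sentence_loss):
        # batch is a list of (sentence, y) pairs; the batch-level loss is
        # simply the sum of the per-sentence logistic losses.
        return sum(sentence_loss(sentence, y) for sentence, y in batch)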
Evaluation. We will monitor the accuracy on the validation set after every full pass over all training batches. (A full pass is commonly called an epoch.) Keeping with the “batching” paradigm, we need a function that takes a validation batch and returns the number of correctly classified sentences (i.e., sentences with p > 0.5 when the gold standard says the sentence is positive, and sentences with p ≤ 0.5 when the gold standard says it is negative). A stub of this function is provided, but it always returns 0.
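A minimal sketch of what such a function might look like (this is not the provided stub's exact signature; predict_proba is a hypothetical stand-in for however your model produces p):

    def count_correct(batch, predict_proba):
        # batch is a list of (sentence, y) pairs with y in {0, 1};
        # predict_proba(sentence) is assumed to return p = P(positive | sentence).
        correct = 0
        for sentence, y in batch:
            p = predict_proba(sentence)
            predicted = 1 if p > 0.5 else 0
            correct += int(predicted == y)
        return correct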
next words. Both these reasons mean that the language model will be
much slower to train than the classifier!
At position k > 0 in a sentence, a simple bigram feed-forward language model produces, from the previous word w_{k−1}, a vector z of unnormalized scores over the vocabulary, and normalizes it with a softmax:

\[
P(w_k = j \mid w_{k-1}) = \frac{\exp(z_j)}{\sum_{i=0}^{|V|} \exp(z_i)}
\]

\[
\log P(w_k = j \mid w_{k-1}) = z_j - \log \sum_{i=0}^{|V|} \exp(z_i)
\]

The word-level loss is

\[
\ell_k(\text{sentence}) = -\log P(w_k = y \mid w_{k-1}),
\]

where y is the index of the actual word observed at position k. This is also known as the negative log-likelihood (NLL). The batch loss is the sum of all word-level losses in the batch:

\[
\ell(\text{batch}) = \sum_{\text{sentence} \in \text{batch}} \sum_{k=1}^{\text{len}(\text{sentence})} \ell_k(\text{sentence})
\]

\[
\text{NLL}(\text{dataset}) = \sum_{\text{batch} \in \text{dataset}} \ell(\text{batch}) = \sum_{\text{sentence} \in \text{dataset}} \sum_{k=1}^{\text{len}(\text{sentence})} \ell_k(\text{sentence})
\]
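For orientation only (a sketch under assumed shapes, not the provided code), this is how the word-level NLL is typically obtained from the score vector z in pytorch, and how a perplexity can be derived from a total NLL:

    import math

    import torch
    import torch.nn.functional as F

    vocab_size = 10000                   # hypothetical |V|
    z = torch.randn(1, vocab_size)       # scores for one position k (a "batch" of one)
    y = torch.tensor([42])               # index of the word actually observed at position k

    log_probs = F.log_softmax(z, dim=1)  # log P(w_k = j | w_{k-1}) for every j
    loss_k = F.nll_loss(log_probs, y)    # -log P(w_k = y | w_{k-1})

    # Perplexity is the exponentiated average per-token NLL (natural log assumed).
    def perplexity(total_nll, num_tokens):
        return math.exp(total_nll / num_tokens)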
Using unlabeled data. An observant reader will notice that we are not
using the sentiment labels at all in training our language model.2 This
means we can throw the unlabeled data in the mix and maybe get a better
language model.
Task 5. Load the unlabeled data (use the commented code to help you)
and combine it with the labeled data, then train a language model on
the resulting bigger dataset. Plot the validation perplexity for the two
cases (i.e., with and without the extra unlabeled data) on the same
graph, and discuss what you observe. Do not plot the training perplexities on the same plot: they are not comparable to each other.
Why are they not comparable?
classifier. In dynet, use clf.embed.populate(filename, "/embed"),
and in pytorch use clf.embed.weight = torch.load(filename).
6 Report
The required tasks are highlighted with borders in this PDF. Follow the
instructions in the boxes to the letter. Make sure to answer all additional
questions: every question mark inside a border of this PDF must have a
clearly-worded answer in your report.
You will not submit any code. Instead, every time a task requires
you to modify the code, indicate in your report what changes you made.
(Don’t rely on line numbers, because as you modify the code the offsets
can change. Keep changes minimal, and don't include plotting code.) For instance, to explain how you load the test set, you'd write something like
we added
with open(os.path.join('processed', 'test_ix.pkl'), 'rb') as f:
    test_ix = pickle.load(f)