CS4740/5740 Introduction To NLP
Fall 2017
Neural Language Models and Classifiers
Final report: due via Gradescope & CMS by Friday 12/1 11:59PM
1 Overview
In this project you will gain familiarity with, and a working knowledge of, neural language models. This time we provide the bulk of the code in Python, using the dynet and pytorch libraries. You will have to understand the code, extend it slightly, and run various experiments for the report. Choose one of the two libraries, and make sure you have the latest release installed. We recommend dynet for NLP tasks, since it is both easier to code for and about 2-3x faster on CPUs.
Useful reading.
• Chapter 8 in Jurafsky & Martin, 3rd ed., especially 8.3 and 8.5. (Note that Chapter 8 is required reading anyway, so do read it.) https://web.stanford.edu/~jurafsky/slp3/8.pdf
• Graham Neubig’s slides (http://phontron.com/class/nn4nlp2017/assets/slides/nn4nlp-02-lm.pdf) and lecture notes (http://phontron.com/class/mtandseq2seq2017/mt-spring2017.chapter5.pdf)
2 Dataset
We’re back to movie reviews for this assignment, but this time we are providing a larger version of the Pang & Lee dataset. Do not download the full dataset from its on-line site: we are providing you with a slightly reorganized version via CMS, in order to make for a more interesting project.
Task 1. Download the data and the code from CMS. Look through the
preprocess.py file and understand what it does.a What is the size
of the preprocessed vocabulary? Explain how you obtained your
answer.
a Along the way, try to identify some things you were making overcomplicated in
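For Task 1, one possible way to check the vocabulary size is to inspect whatever vocabulary object preprocess.py pickles. The file name and structure below are assumptions, not necessarily what the provided code actually writes:

    import os
    import pickle

    # Hypothetical file name: adjust to whatever preprocess.py really produces.
    with open(os.path.join('processed', 'vocab.pkl'), 'rb') as f:
        vocab = pickle.load(f)

    # Assuming vocab maps words to indices, its length is the vocabulary size.
    print(len(vocab))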
3 Sentiment classification
We will first train a neural sentiment classifier using a very simple model: a deep averaging network (DAN) with one hidden layer.1 Importantly, we’re using this model as a generic introduction to how neural networks are constructed, trained and evaluated.
\[
\sigma(z)_i = \frac{\exp(2 z_i) - 1}{\exp(2 z_i) + 1}
\]
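As an aside (this snippet is illustrative and not part of the provided code), the elementwise nonlinearity above is just the hyperbolic tangent, which a quick numerical check confirms:

    import numpy as np

    z = np.linspace(-3.0, 3.0, 7)
    sigma = (np.exp(2 * z) - 1) / (np.exp(2 * z) + 1)
    # The formula above coincides with tanh, elementwise.
    assert np.allclose(sigma, np.tanh(z))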
1 Mohit Iyyer, Varun Manjunatha, Jordan Boyd-Graber, and Hal Daumé III. “Deep unordered composition rivals syntactic methods for text classification.” In: ACL 2015.
A common approach is to limit the vocabulary size to a value |V| and learn a matrix of word vectors E ∈ ℝ^{|V|×n}. Then, we can use E_w ∈ ℝ^n as a vector representation of word w.
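As a rough illustration (a sketch only, not the provided code; the sizes and variable names below are made up), this is how such an embedding matrix is commonly represented and indexed in pytorch:

    import torch
    import torch.nn as nn

    vocab_size, emb_dim = 10000, 50            # hypothetical |V| and n
    embed = nn.Embedding(vocab_size, emb_dim)  # the word-vector matrix E, shape |V| x n

    word_ix = torch.tensor([42])               # integer index of some word w
    e_w = embed(word_ix)                       # E_w: a row of E, i.e. a vector in R^n
    print(e_w.shape)                           # torch.Size([1, 50])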
Training. We want to find the best model parameters (E, W, b, W_out, b_out). To do this, we optimize with respect to the binary logistic loss

\[
\ell(\text{sentence}, y) = -y \log(p) - (1 - y) \log(1 - p)
\]
where y is the true label: y = 1 if the sentence is truly positive, else y = 0.
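For concreteness, a minimal sketch of this per-sentence loss in pytorch (here p is assumed to already be the model's predicted probability that the sentence is positive; the values are made up):

    import torch

    p = torch.tensor(0.8)  # hypothetical model output, P(positive | sentence)
    y = torch.tensor(1.0)  # gold label: 1.0 for positive, 0.0 for negative

    # Binary logistic loss, exactly as in the formula above.
    loss = -(y * torch.log(p) + (1 - y) * torch.log(1 - p))
    # torch.nn.functional.binary_cross_entropy(p, y) computes the same value.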
In this assignment we use mini-batch training, which is faster: we group all of the training data into sets of (say) 32 sentences. We then compute the batch-level loss:

\[
\ell(\text{batch}) = \sum_{(\text{sentence},\, y) \in \text{batch}} \ell(\text{sentence}, y)
\]
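A sketch of that summation (sentence_loss here is a hypothetical stand-in for whatever function computes the per-sentence loss in your code):

    def batch_loss(batch, sentence_loss):
        # batch is a list of (sentence, y) pairs; the batch-level loss is
        # simply the sum of the per-sentence logistic losses.
        return sum(sentence_loss(sentence, y) for sentence, y in batch)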
Evaluation. We will monitor the accuracy on the validation set after every full pass over all training batches. (A full pass is commonly called an epoch.) Keeping with the “batching” paradigm, we need a function that takes a validation batch and returns the number of correctly classified sentences (i.e., sentences with p > 0.5 when the gold standard says the sentence is positive, and sentences with p ≤ 0.5 when the gold standard says it is negative). A stub of this function is provided, but it always returns 0.
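A minimal sketch of what such a function might look like (this is not the provided stub's exact signature; predict_proba is a hypothetical stand-in for however your model produces p):

    def count_correct(batch, predict_proba):
        # batch is a list of (sentence, y) pairs with y in {0, 1};
        # predict_proba(sentence) is assumed to return p = P(positive | sentence).
        correct = 0
        for sentence, y in batch:
            p = predict_proba(sentence)
            predicted = 1 if p > 0.5 else 0
            correct += int(predicted == y)
        return correct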
next words. Both these reasons mean that the language model will be
much slower to train than the classifier!
At position k > 0 in a sentence, a simple bigram feed-forward language model produces, from the previous word w_{k−1}, a vector z of unnormalized scores over the vocabulary, and normalizes it with a softmax:

\[
P(w_k = j \mid w_{k-1}) = \frac{\exp(z_j)}{\sum_{i=0}^{|V|} \exp(z_i)}
\]

\[
\log P(w_k = j \mid w_{k-1}) = z_j - \log \sum_{i=0}^{|V|} \exp(z_i)
\]

The word-level loss is

\[
\ell_k(\text{sentence}) = -\log P(w_k = y \mid w_{k-1}),
\]

where y is the index of the actual word observed at position k. This is also known as the negative log-likelihood (NLL). The batch loss is the sum of all word-level losses in the batch:

\[
\ell(\text{batch}) = \sum_{\text{sentence} \in \text{batch}} \sum_{k=1}^{\text{len}(\text{sentence})} \ell_k(\text{sentence})
\]

\[
\text{NLL}(\text{dataset}) = \sum_{\text{batch} \in \text{dataset}} \ell(\text{batch}) = \sum_{\text{sentence} \in \text{dataset}} \sum_{k=1}^{\text{len}(\text{sentence})} \ell_k(\text{sentence})
\]
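For orientation only (a sketch under assumed shapes, not the provided code), this is how the word-level NLL is typically obtained from the score vector z in pytorch, and how a perplexity can be derived from a total NLL:

    import math

    import torch
    import torch.nn.functional as F

    vocab_size = 10000                   # hypothetical |V|
    z = torch.randn(1, vocab_size)       # scores for one position k (a "batch" of one)
    y = torch.tensor([42])               # index of the word actually observed at position k

    log_probs = F.log_softmax(z, dim=1)  # log P(w_k = j | w_{k-1}) for every j
    loss_k = F.nll_loss(log_probs, y)    # -log P(w_k = y | w_{k-1})

    # Perplexity is the exponentiated average per-token NLL (natural log assumed).
    def perplexity(total_nll, num_tokens):
        return math.exp(total_nll / num_tokens)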
Using unlabeled data. An observant reader will notice that we are not
using the sentiment labels at all in training our language model.2 This
means we can throw the unlabeled data in the mix and maybe get a better
language model.
Task 5. Load the unlabeled data (use the commented code to help you)
and combine it with the labeled data, then train a language model on
the resulting bigger dataset. Plot the validation perplexity for the two
cases (i.e., with and without the extra unlabeled data) on the same
graph, and discuss what you observe. Do not plot the training perplexities on the same plot: they are not comparable to each other.
Why are they not comparable?
classifier. In dynet, use clf.embed.populate(filename, "/embed"),
and in pytorch use clf.embed.weight = torch.load(filename).
6 Report
The required tasks are highlighted with borders in this PDF. Follow the
instructions in the boxes to the letter. Make sure to answer all additional
questions: every question mark inside a border of this PDF must have a
clearly-worded answer in your report.
You will not submit any code. Instead, every time a task requires
you to modify the code, indicate in your report what changes you made.
(Don’t rely on line numbers, because as you modify the code the offsets
can change. Keep changes minimal, and don't include plotting code.) For instance, to explain how you load the test set, you'd write something like
we added
with open(os.path.join('processed', 'test_ix.pkl'), 'rb') as f:
    test_ix = pickle.load(f)