[go: up one dir, main page]

0% found this document useful (0 votes)
199 views2 pages

Transform Raw Texts Into Training and Development Data: Instructor: Nikos Aletras

This document describes an assignment to develop a feedforward neural network for topic classification. The tasks involve implementing text processing methods to transform raw text into input vectors, building a feedforward network with an embedding layer, hidden layer and output layer, training the network using stochastic gradient descent with backpropagation, tuning hyperparameters, analyzing model performance, retraining the network using pre-trained embeddings, extending the network with additional hidden layers, and providing well-documented code with discussions of design choices. The goal is to classify news articles into one of three topic classes. Training, development and test datasets are provided in CSV format to be loaded and preprocessed.
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
199 views2 pages

Transform Raw Texts Into Training and Development Data: Instructor: Nikos Aletras

This document describes an assignment to develop a feedforward neural network for topic classification. The tasks involve implementing text processing methods to transform raw text into input vectors, building a feedforward network with an embedding layer, hidden layer and output layer, training the network using stochastic gradient descent with backpropagation, tuning hyperparameters, analyzing model performance, retraining the network using pre-trained embeddings, extending the network with additional hidden layers, and providing well-documented code with discussions of design choices. The goal is to classify news articles into one of three topic classes. Training, development and test datasets are provided in CSV format to be loaded and preprocessed.
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 2

[COM6513] Assignment 2: Topic Classification with a

Feedforward Network
Instructor: Nikos Aletras
The goal of this assignment is to develop a Feedforward neural network for topic classification.

For that purpose, you will implement:

Text processing methods for transforming raw text data into input vectors for your network (1 mark)

A Feedforward network consisting of:


One-hot input layer mapping words into an Embedding weight matrix (1 mark)
One hidden layer computing the mean embedding vector of all words in input followed by a ReLU activation function (1 mark)
Output layer with a softmax activation. (1 mark)

The Stochastic Gradient Descent (SGD) algorithm with back-propagation to learn the weights of your Neural network. Your algorithm
should:
Use (and minimise) the Categorical Cross-entropy loss function (1 mark)
Perform a Forward pass to compute intermediate outputs (3 marks)
Perform a Backward pass to compute gradients and update all sets of weights (6 marks)
Implement and use Dropout after each hidden layer for regularisation (2 marks)

Discuss how did you choose hyperparameters? You can tune the learning rate (hint: choose small values), embedding size {e.g. 50, 300,
500}, the dropout rate {e.g. 0.2, 0.5} and the learning rate. Please use tables or graphs to show training and validation performance for
each hyperparameter combination (2 marks).

After training a model, plot the learning process (i.e. training and validation loss in each epoch) using a line plot and report accuracy.
Does your model overfit, underfit or is about right? (1 mark).

Re-train your network by using pre-trained embeddings (GloVe) trained on large corpora. Instead of randomly initialising the
embedding weights matrix, you should initialise it with the pre-trained weights. During training, you should not update them (i.e.
weight freezing) and backprop should stop before computing gradients for updating embedding weights. Report results by
performing hyperparameter tuning and plotting the learning process. Do you get better performance? (3 marks).

Extend you Feedforward network by adding more hidden layers (e.g. one more or two). How does it affect the performance? Note: You
need to repeat hyperparameter tuning, but the number of combinations grows exponentially. Therefore, you need to choose a subset
of all possible combinations (4 marks)

Provide well documented and commented code describing all of your choices. In general, you are free to make decisions about text
processing (e.g. punctuation, numbers, vocabulary size) and hyperparameter values. We expect to see justifications and discussion for
all of your choices (2 marks).

Provide efficient solutions by using Numpy arrays when possible. Executing the whole notebook with your code should not take more
than 10 minutes on any standard computer (e.g. Intel Core i5 CPU, 8 or 16GB RAM) excluding hyperparameter tuning runs and loading
the pretrained vectors. You can find tips in Lab 1 (2 marks).

Data
The data you will use for the task is a subset of the AG News Corpus and you can find it in the ./data_topic folder in CSV format:

data_topic/train.csv : contains 2,400 news articles, 800 for each class to be used for training.
data_topic/dev.csv : contains 150 news articles, 50 for each class to be used for hyperparameter selection and monitoring the
training process.
data_topic/test.csv : contains 900 news articles, 300 for each class to be used for testing.

Pre-trained Embeddings
You can download pre-trained GloVe embeddings trained on Common Crawl (840B tokens, 2.2M vocab, cased, 300d vectors, 2.03 GB
download) from here. No need to unzip, the file is large.

Save Memory
To save RAM, when you finish each experiment you can delete the weights of your network using del W followed by Python's garbage
collector gc.collect()

Submission Instructions
You should submit a Jupyter Notebook file (assignment2.ipynb) and an exported PDF version (you can do it from Jupyter: File-
>Download as->PDF via Latex ).

You are advised to follow the code structure given in this notebook by completing all given funtions. You can also write any
auxilliary/helper functions (and arguments for the functions) that you might need but note that you can provide a full solution without any
such functions. Similarly, you can just use only the packages imported below but you are free to use any functionality from the Python
Standard Library, NumPy, SciPy (excluding built-in softmax funtcions) and Pandas. You are not allowed to use any third-party library
such as Scikit-learn (apart from metric functions already provided), NLTK, Spacy, Keras, Pytorch etc.. You should mention if you've used
Windows to write and test your code because we mostly use Unix based machines for marking (e.g. Ubuntu, MacOS).

There is no single correct answer on what your accuracy should be, but correct implementations usually achieve F1-scores around 80\% or
higher. The quality of the analysis of the results is as important as the accuracy itself.

This assignment will be marked out of 30. It is worth 30\% of your final grade in the module.

The deadline for this assignment is 23:59 on Mon, 9 May 2022 and it needs to be submitted via Blackboard. Standard departmental
penalties for lateness will be applied. We use a range of strategies to detect unfair means, including Turnitin which helps detect
plagiarism. Use of unfair means would result in getting a failing grade.

First of all the notebook was tested in google colab as I am very micuh famlier with google colab then I have used Windows to test my
code. My windows configuration is Core i5 6th gen, 8 GB Ram.

In [6]: import pandas as pd


import numpy as np
from collections import Counter
import re
import matplotlib.pyplot as plt
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
import random
from time import localtime, strftime
from scipy.stats import spearmanr,pearsonr
import zipfile
import gc

# fixing random seed for reproducibility


random.seed(10)
np.random.seed(10)

Transform Raw texts into training and development data


First, you need to load the training, development and test sets from their corresponding CSV files (tip: you can use Pandas dataframes).

In [7]: # Load dateset


data_tr = pd.read_csv("data_topic/train.csv", header=None, names=['label','text'])
data_te = pd.read_csv("data_topic/test.csv", header=None, names=['label','text'])
data_dev = pd.read_csv("data_topic/dev.csv", header=None, names=['label','text'])

In [8]: # transform test to list, and label to numpy arrays


def transform(df):
return list(df['text']), df['label'].to_numpy().reshape(-1,1)-1

X_tr_raw, Y_tr = transform(data_tr)


X_te_raw, Y_te = transform(data_te)
X_dev_raw, Y_dev = transform(data_dev)

Create input representations


To train your Feedforward network, you first need to obtain input representations given a vocabulary. One-hot encoding requires large
memory capacity. Therefore, we will instead represent documents as lists of vocabulary indices (each word corresponds to a vocabulary
index).

Text Pre-Processing Pipeline


To obtain a vocabulary of words. You should:

tokenise all texts into a list of unigrams (tip: you can re-use the functions from Assignment 1)
remove stop words (using the one provided or one of your preference)
remove unigrams appearing in less than K documents
use the remaining to create a vocabulary of the top-N most frequent unigrams in the entire corpus.

In [9]: #Improved Unimportant Stop word list


stop_words = ['a','aaa', 'ad', 'after', 'again', 'all', 'also', 'am', 'an', 'and', 'any',
'are', 'as', 'at', 'be', 'because', 'been', 'being', 'between', 'both',
'but', 'by', 'can', 'could', 'does', 'each', 'ed', 'eg', 'either', 'etc',
'even', 'ever', 'every', 'for', 'from', 'had', 'has', 'have', 'he', 'her',
'hers', 'herself', 'him', 'himself', 'his', 'i', 'ie', 'if', 'in', 'inc',
'into', 'is', 'it', 'its', 'itself', 'ltd', 'may', 'maybe',
'me', 'might', 'mine', 'minute', 'minutes', 'must', 'my', 'myself',
'neither', 'nor', 'now', 'of', 'on', 'only', 'or', 'other', 'our', 'ours',
'ourselves', 'own', 'same', 'seem', 'seemed', 'shall', 'she', 'some',
'somehow', 'something', 'sometimes', 'somewhat', 'somewhere', 'spoiler',
'spoilers', 'such', 'suppose', 'that', 'the', 'their', 'theirs', 'them',
'themselves', 'there', 'these', 'they', 'this', 'those', 'thus', 'to',
'today', 'tomorrow', 'us', 've', 'vs', 'was', 'we', 'were', 'what',
'whatever', 'when', 'where', 'which', 'who', 'whom', 'whose', 'will',
'with', 'yesterday', 'you', 'your', 'yours', 'yourself', 'yourselves']

Unigram extraction from a document


You first need to implement the extract_ngrams function. It takes as input:

x_raw : a string corresponding to the raw text of a document


ngram_range : a tuple of two integers denoting the type of ngrams you want to extract, e.g. (1,2) denotes extracting unigrams and
bigrams.
token_pattern : a string to be used within a regular expression to extract all tokens. Note that data is already tokenised so you
could opt for a simple white space tokenisation.
stop_words : a list of stop words
vocab : a given vocabulary. It should be used to extract specific features.

and returns:

a list of all extracted features.

In [10]: def extract_ngrams(x_raw, ngram_range=(1,3), token_pattern=r'\b[A-Za-z][A-Za-z]+\b', stop_words=[], vocab=set()

tokenRE = re.compile(token_pattern)

# first extract all unigrams by tokenising


x_uni = [w for w in tokenRE.findall(str(x_raw).lower(),) if w not in stop_words]

# this is to store the ngrams to be returned


x = []

if ngram_range[0]==1:
x = x_uni

# generate n-grams from the available unigrams x_uni


ngrams = []
for n in range(ngram_range[0], ngram_range[1]+1):

# ignore unigrams
if n==1: continue

# pass a list of lists as an argument for zip


arg_list = [x_uni]+[x_uni[i:] for i in range(1, n)]

# extract tuples of n-grams using zip


x_ngram = list(zip(*arg_list))
ngrams.append(x_ngram)

for n in ngrams:
for t in n:
x.append(t)

if len(vocab)>0:
x = [w for w in x if w in vocab]

return x

Create a vocabulary of n-grams


Then the get_vocab function will be used to (1) create a vocabulary of ngrams; (2) count the document frequencies of ngrams; (3) their
raw frequency. It takes as input:

X_raw : a list of strings each corresponding to the raw text of a document


ngram_range : a tuple of two integers denoting the type of ngrams you want to extract, e.g. (1,2) denotes extracting unigrams and
bigrams.
token_pattern : a string to be used within a regular expression to extract all tokens. Note that data is already tokenised so you
could opt for a simple white space tokenisation.
stop_words : a list of stop words
min_df : keep ngrams with a minimum document frequency.
keep_topN : keep top-N more frequent ngrams.

and returns:

vocab : a set of the n-grams that will be used as features.


df : a Counter (or dict) that contains ngrams as keys and their corresponding document frequency as values.
ngram_counts : counts of each ngram in vocab

In [11]: def get_vocab(X_raw, ngram_range=(1,3), token_pattern=r'\b[A-Za-z][A-Za-z]+\b',


min_df=0, keep_topN=0, stop_words=[]):

tokenRE = re.compile(token_pattern)

df = Counter()
ngram_counts = Counter()
vocab = set()

# interate through each raw text


for x in X_raw:

x_ngram = extract_ngrams(x, ngram_range=ngram_range, token_pattern=token_pattern, stop_words=stop_words

#update doc and ngram frequencies


df.update(list(set(x_ngram)))
ngram_counts.update(x_ngram)

# obtain a vocabulary as a set.


# Keep elements with doc frequency > minimum doc freq (min_df)
vocab = set([w for w in df if df[w]>=min_df])

# keep the top N most freqent


if keep_topN>0:
vocab = set([w[0] for w in ngram_counts.most_common(keep_topN) if w[0] in vocab])

return vocab, df, ngram_counts

Now you should use get_vocab to create your vocabulary and get document and raw frequencies of unigrams:

In [12]: #using the vocabulary of training for training. Choosing the top 5000 vocabulary only
vocab, df, ngram_counts = get_vocab(X_tr_raw, ngram_range=(1,1), keep_topN=5000, stop_words=stop_words)
print(len(vocab))
print()
print(list(sorted(vocab))[:100])
print()
print(df.most_common()[:10])

5000

['aaron', 'abandon', 'abandoned', 'abby', 'abdullah', 'aber', 'able', 'aboard', 'about', 'above', 'abroad', 'ab
solute', 'abu', 'abuja', 'ac', 'accept', 'accepted', 'accepting', 'access', 'accessories', 'accident', 'accordi
ng', 'account', 'accounting', 'accusations', 'accused', 'accuser', 'accuses', 'accusing', 'ace', 'acknowledge
d', 'acquire', 'acquisition', 'acquisitions', 'across', 'act', 'action', 'actions', 'activated', 'activist', 'a
ctivists', 'activities', 'activity', 'actors', 'actress', 'adam', 'add', 'added', 'adding', 'additional', 'adju
sted', 'adjusters', 'administration', 'administrator', 'admission', 'adopted', 'adults', 'advance', 'advanced',
'advantage', 'advertisers', 'advertising', 'adviser', 'advising', 'aegis', 'affair', 'afford', 'afghan', 'afgha
nistan', 'afghans', 'afp', 'africa', 'african', 'africans', 'aftermath', 'afternoon', 'ag', 'against', 'agass
i', 'age', 'agencies', 'agency', 'agent', 'ago', 'agony', 'agree', 'agreed', 'agreement', 'agreements', 'agricu
ltural', 'ahead', 'ahmed', 'aid', 'aided', 'aides', 'ailing', 'aimed', 'aiming', 'air', 'aircraft']

[('reuters', 631), ('said', 432), ('tuesday', 413), ('wednesday', 344), ('new', 325), ('ap', 275), ('athens', 2
45), ('monday', 221), ('first', 210), ('two', 187)]

Then, you need to create vocabulary id -> word and word -> vocabulary id dictionaries for reference:

In [13]: # Calculate the vocab and df of different data sets. Only choosing the top 5000 vocab for every dataset. after
#To large Vocab size also runs slow in my cpu especially in the pre-trained embedding
vocab_tr, df_tr, ngram_counts_tr = get_vocab(X_tr_raw, ngram_range=(1,1), keep_topN=5000, stop_words=stop_words
vocab_te, df_te, ngram_counts_te = get_vocab(X_te_raw, ngram_range=(1,1), keep_topN=5000, stop_words=stop_words
vocab_dev, df_dev, ngram_counts_dev = get_vocab(X_dev_raw, ngram_range=(1,1), keep_topN=5000, stop_words=stop_w
# create reference with sorted vocab to avoid getting random accuracy every time we restart the kernal
id2vocab = dict(enumerate(sorted(vocab_tr)))
vocab2id = dict(zip(id2vocab.values(),id2vocab.keys()))

Convert the list of unigrams into a list of vocabulary indices


Storing actual one-hot vectors into memory for all words in the entire data set is prohibitive. Instead, we will store word indices in the
vocabulary and look-up the weight matrix. This is equivalent of doing a dot product between an one-hot vector and the weight matrix.

First, represent documents in train, dev and test sets as lists of words in the vocabulary:

In [14]: # extract n-grams


X_uni_tr = [extract_ngrams(line,ngram_range=(1,1),stop_words=stop_words) for line in X_tr_raw]
X_uni_te = [extract_ngrams(line,ngram_range=(1,1),stop_words=stop_words) for line in X_te_raw]
X_uni_dev = [extract_ngrams(line,ngram_range=(1,1),stop_words=stop_words) for line in X_dev_raw]

In [15]: X_uni_tr[0]

['reuters',
Out[15]:
'venezuelans',
'turned',
'out',
'early',
'large',
'numbers',
'sunday',
'vote',
'historic',
'referendum',
'remove',
'left',
'wing',
'president',
'hugo',
'chavez',
'office',
'give',
'new',
'mandate',
'govern',
'next',
'two',
'years']

Then convert them into lists of indices in the vocabulary:

In [16]: def words2indices(words,index = vocab2id):


#Convert a list of words to a list of index
words = [word for word in words if word in index.keys()]
return list(map(lambda x:index[x],words))

X_tr = [words2indices(line) for line in X_uni_tr]


X_te = [words2indices(line) for line in X_uni_te]
X_dev = [words2indices(line) for line in X_uni_dev]

In [17]: X_tr[0]

[3734,
Out[17]:
4754,
4655,
3082,
1388,
2470,
2991,
4333,
4801,
2041,
3619,
3668,
2512,
4909,
3360,
2102,
761,
3013,
1856,
2947,
2657,
1892,
2953,
4665,
4985]
Put the labels Y for train, dev and test sets into arrays:

In [18]: # Already transform to numpy at the beginning of reading CSV


_, Y_tr = transform(data_tr)
_, Y_te = transform(data_te)
_, Y_dev = transform(data_dev)

Network Architecture
Your network should pass each word index into its corresponding embedding by looking-up on the embedding matrix and then compute
the first hidden layer h : 1

1
e
h1 = ∑W ,i ∈ x
i
|x|
i

where |x| is the number of words in the document and W is an embedding matrix |V | × d, |V | is the size of the vocabulary and d the
e

embedding size.

Then h should be passed through a ReLU activation function:


1

a1 = relu(h1 )

Finally the hidden layer is passed to the output layer:

y = softmax(a1 W )

where W is a matrix d × |Y|, |Y| is the number of classes.

During training, a should be multiplied with a dropout mask vector (elementwise) for regularisation before it is passed to the output layer.
1

You can extend to a deeper architecture by passing a hidden layer to another one:

hi = ai−1 Wi

ai = relu(hi )

Network Training
First we need to define the parameters of our network by initiliasing the weight matrices. For that purpose, you should implement the
network_weights function that takes as input:

vocab_size : the size of the vocabulary


embedding_dim : the size of the word embeddings
hidden_dim : a list of the sizes of any subsequent hidden layers. Empty if there are no hidden layers between the average embedding
and the output layer
num_classes : the number of the classes for the output layer

and returns:

W : a dictionary mapping from layer index (e.g. 0 for the embedding matrix) to the corresponding weight matrix initialised with small
random numbers (hint: use numpy.random.uniform with from -0.1 to 0.1)

Make sure that the dimensionality of each weight matrix is compatible with the previous and next weight matrix, otherwise you won't be
able to perform forward and backward passes. Consider also using np.float32 precision to save memory.

In [19]: def network_weights(vocab_size=1500, embedding_dim=300,


hidden_dim=[], num_classes=3, init_val = 0.5):
# fixing random seed for reproducibility
np.random.seed(10)
W_emb = np.random.uniform(low = -1*init_val, high = init_val, size = (vocab_size,embedding_dim))

W_h = list()
pt = embedding_dim
for layer in hidden_dim:
W_h.append(np.random.uniform(low = -1*init_val, high = init_val, size = (pt,layer)))
pt = layer
W_out = np.random.uniform(low = -1*init_val, high = init_val, size = (pt,num_classes))
W = [W_emb,*W_h,W_out]

return W

Then you need to develop a softmax function (same as in Assignment 1) to be used in the output layer.

It takes as input z (array of real numbers) and returns sig (the softmax of z )

In [20]: def softmax(z):

#Calculate softmax results

#In order to avoid data overflow, the corresponding data boundary is added.
z = np.minimum(z,709.782)
sig = np.exp(z) / np.sum(np.exp(z),axis=z.ndim-1,keepdims=True)

return sig

Now you need to implement the categorical cross entropy loss by slightly modifying the function from Assignment 1 to depend only on
the true label y and the class probabilities vector y_preds :

In [21]: def categorical_loss(y, y_preds):

Y = np.array(y)
assert type(y_preds) == np.ndarray
try:
n,d = y_preds.shape
except:
n=1
d = y_preds.shape[0]

Y = np.eye(d,d)[Y].reshape(n,d)
y_preds = np.maximum(y_preds,np.finfo(np.float64).eps)
assert np.all(y_preds>0)
l1 = -np.sum(Y*np.log(y_preds),1)
l=np.mean(l1)
assert l>=0
return l

Then, implement the relu function to introduce non-linearity after each hidden layer of your network (during the forward pass):

relu(zi ) = max(zi , 0)

and the relu_derivative function to compute its derivative (used in the backward pass):

relu_derivative(z )=0, if z <=0, 1 otherwise.


i i

Note that both functions take as input a vector z

Hint use .copy() to avoid in place changes in array z

In [22]: def relu(z):

#ReLu activation function

a = np.fmax(z, 0)
return a

def relu_derivative(z):

#Derivative of ReLu activation function


dz = np.fmax(z,0)
np.sign(dz,out=dz)
return dz

During training you should also apply a dropout mask element-wise after the activation function (i.e. vector of ones with a random
percentage set to zero). The dropout_mask function takes as input:

size : the size of the vector that we want to apply dropout


dropout_rate : the percentage of elements that will be randomly set to zeros

and returns:

dropout_vec : a vector with binary values (0 or 1)

In [23]: def dropout_mask(size, dropout_rate):

#Dropout matrix

dropout_vec = np.full(size,1.0)
index = np.arange(size)
np.random.shuffle(index)
dropout_vec[index[:int(size*dropout_rate)]]=0

return dropout_vec

Now you need to implement the forward_pass function that passes the input x through the network up to the output layer for
computing the probability for each class using the weight matrices in W . The ReLU activation function should be applied on each hidden
layer.

x : a list of vocabulary indices each corresponding to a word in the document (input)


W : a list of weight matrices connecting each part of the network, e.g. for a network with a hidden and an output layer: W[0] is the
weight matrix that connects the input to the first hidden layer, W[1] is the weight matrix that connects the hidden layer to the output
layer.
dropout_rate : the dropout rate that is used to generate a random dropout mask vector applied after each hidden layer for
regularisation.

and returns:

out_vals : a dictionary of output values from each layer: h (the vector before the activation function), a (the resulting vector after
passing h from the activation function), its dropout mask vector; and the prediction vector (probability for each class) from the output
layer.

In [24]: def initial_h0(x,W):

#mapping words' index into an Embedding weight matrix

if type(x[0])!=list:
h = np.zeros(W.shape[1])
for index in range(len(x)):
np.add(W[x[index]]/len(x),h,out=h) # with mean value
else:
h = np.zeros((len(x),W.shape[1]))
for i in range(len(x)):
for index in range(len(x[i])):
h[i,:] += W[x[i][index]]/len(x[i])
return h

def forward_linear(A_prev, W, dropout_rate):

'''Linear forward:
1. h = a*W
2. a = relu(h)
'''

h = np.dot(A_prev, W)
a = relu(h)
dropout = dropout_mask(a.shape[a.ndim-1],dropout_rate)
return h,a,dropout

def forward_pass(x, W, dropout_rate=0.2):


''' Feed forward network
'''
out_vals = {}

h_vecs = []
a_vecs = []
dropout_vecs = []

# one-hot input layer to embedding weight matrix (with mean)


h = initial_h0(x,W[0])
A = relu(h) # ReLu activation
dropout = dropout_mask(A.shape[A.ndim-1],dropout_rate) # Dropout
h_vecs.append(h);a_vecs.append(A);dropout_vecs.append(dropout)
A = np.multiply(A,dropout)

# Hidden layers
L = len(W)
for l in range(1,L-1):
A_prev = A
# Activation, dropout
h,A,dropout = forward_linear(A_prev,W[l],dropout_rate)
h_vecs.append(h);a_vecs.append(A);dropout_vecs.append(dropout)
A = np.multiply(A,dropout)

# Output layer
A_prev = A
h,A,dropout = forward_linear(A_prev,W[-1],dropout_rate)
np.multiply(A,dropout,out=A)
y = softmax(A) # softmax function mapping vector to probability

out_vals['h'] = h_vecs
out_vals['a'] = a_vecs
out_vals['dropout_vec'] = dropout_vecs
out_vals['y'] = y

return out_vals

The backward_pass function computes the gradients and updates the weights for each matrix in the network from the output to the
input. It takes as input

x : a list of vocabulary indices each corresponding to a word in the document (input)


y : the true label
W : a list of weight matrices connecting each part of the network, e.g. for a network with a hidden and an output layer: W[0] is the
weight matrix that connects the input to the first hidden layer, W[1] is the weight matrix that connects the hidden layer to the output
layer.
out_vals : a dictionary of output values from a forward pass.
learning_rate : the learning rate for updating the weights.
freeze_emb : boolean value indicating whether the embedding weights will be updated.

and returns:

W : the updated weights of the network.

Hint: the gradients on the output layer are similar to the multiclass logistic regression.

In [25]: def backward_linear(dh,A_prev,W):


''' linear back-propagation
'''
A_prev = A_prev.reshape(1,-1)
dW = np.dot(A_prev.T, dh)
dA_prev = np.dot(dh, W.T)
return dW,dA_prev

def backward_dropout(dA, A_prev, W, dropout):


''' dropout back-propagation
'''
dW,dA_prev = backward_linear(dA, A_prev,W)
np.multiply(dA_prev,dropout,out=dA_prev)
return dW,dA_prev

def backward_activation(dA, h):


''' ReLu back-propagation
'''
dh = relu_derivative(h)*dA
return dh

def backward_pass(x, Y, W, out_vals, lr=0.001, freeze_emb=False):


''' Back-propagation
'''
# output layer
y_preds = out_vals['y']
dh = y_preds-Y # gradient dL/dh
dh = backward_activation(dh,y_preds)

# hidden layers
L = len(W)
for l in range(1,L):
A_prev = out_vals['a'][-1*l]
# Gradient, Dropout
dW,dA_prev = backward_dropout(dh, A_prev,W[-1*l],out_vals['dropout_vec'][-1*l])
# Activation
dh = backward_activation(dA_prev,out_vals['h'][-1*l])

np.multiply(lr,dW,out = dW)
np.subtract(W[-1*l],dW,out = W[-1*l])

# input layers
if not freeze_emb:
X = np.array(x).reshape(-1,1)/len(x)
dW = np.dot(X,dh)

np.multiply(lr,dW,out=dW)
for index in range(len(x)):
W[0][x[index]] -= dW[index]

return W

Finally you need to modify SGD to support back-propagation by using the forward_pass and backward_pass functions.

The SGD function takes as input:

X_tr : array of training data (vectors)


Y_tr : labels of X_tr
W : the weights of the network (dictionary)
X_dev : array of development (i.e. validation) data (vectors)
Y_dev : labels of X_dev
lr : learning rate
dropout : regularisation strength
epochs : number of full passes over the training data
tolerance : stop training if the difference between the current and previous validation loss is smaller than a threshold
freeze_emb : boolean value indicating whether the embedding weights will be updated (to be used by the backward pass function).
print_progress : flag for printing the training progress (train/validation loss)

and returns:

weights : the weights learned


training_loss_history : an array with the average losses of the whole training set after each epoch
validation_loss_history : an array with the average losses of the whole development set after each epoch

In [26]: def SGD(X_tr, Y_tr, W, X_dev=[], Y_dev=[], lr=0.001,


dropout=0.2, epochs=5, tolerance=0.001, freeze_emb=False, print_progress=True):
np.random.seed(10)
training_loss_history = []
validation_loss_history = []

# Stage 1: transform label into vector


num_classes = W[-1].shape[1]
Y_tr_pre = np.eye(num_classes,num_classes)[Y_tr].reshape(len(Y_tr),-1)

# Stage 2: Init stochastic value


idx_list = np.array(range(len(X_tr)))

t_forward = 0
t_backward = 0
for epoch in range(epochs):
for i in idx_list:
# get single data
X_tr_i, Y_tr_i = X_tr[i],Y_tr_pre[i]
# Forward pass
out_vals = forward_pass(X_tr_i, W, dropout_rate=dropout)
# Backward pass
W = backward_pass(X_tr_i, Y_tr_i.reshape(1,-1), W, out_vals, lr=lr, freeze_emb=False)

# training loss
y_preds = forward_pass(X_tr, W, dropout_rate=0)['y']
loss_tr = categorical_loss(Y_tr, y_preds)
# evaluation loss
y_preds = forward_pass(X_dev, W, dropout_rate=0)['y']
loss_dev = categorical_loss(Y_dev, y_preds)

# Add history
training_loss_history.append(loss_tr)
validation_loss_history.append(loss_dev)

if print_progress == True:
print("Epoch: %d| Training loss: %f| Validation loss: %f"%(epoch+1,loss_tr,loss_dev))

if epoch >1 and (validation_loss_history[-2]-validation_loss_history[-1]) <= tolerance:


break

return W, training_loss_history, validation_loss_history

Now you are ready to train and evaluate your neural net. First, you need to define your network using the network_weights function
followed by SGD with backprop:

In [27]: #initializing the network with fine tune parameters


W = network_weights(vocab_size=len(vocab),embedding_dim=650,
hidden_dim=[], num_classes=3)

for i in range(len(W)):
print('Shape W'+str(i), W[i].shape)

W, loss_tr, dev_loss = SGD(X_tr, Y_tr,


W,
X_dev=X_dev,
Y_dev=Y_dev,
lr=0.0001,
dropout=0.2,
freeze_emb=False,
tolerance=0.01,
epochs=50)

Shape W0 (5000, 650)


Shape W1 (650, 3)
Epoch: 1| Training loss: 0.705052| Validation loss: 0.814691
Epoch: 2| Training loss: 0.493439| Validation loss: 0.615941
Epoch: 3| Training loss: 0.378705| Validation loss: 0.506333
Epoch: 4| Training loss: 0.305828| Validation loss: 0.444826
Epoch: 5| Training loss: 0.253243| Validation loss: 0.404934
Epoch: 6| Training loss: 0.212256| Validation loss: 0.376268
Epoch: 7| Training loss: 0.181997| Validation loss: 0.356622
Epoch: 8| Training loss: 0.158664| Validation loss: 0.342787
Epoch: 9| Training loss: 0.138073| Validation loss: 0.332984
Plot the learning process:

In [28]: %matplotlib inline


fig = plt.figure()
plt.plot(range(len(loss_tr)),loss_tr,label='training loss')
plt.plot(range(len(dev_loss)),dev_loss,label='validation loss')

plt.xlabel('Epochs')
plt.ylabel('Loss')
plt.title('Training Monitoring')

plt.legend()
plt.show()
By seeing the above graph, We can say that our model trained well beacuse as we can is both the training and validation loss continues to
fall.

Compute accuracy, precision, recall and F1-Score:

In [29]: preds_te = [np.argmax(forward_pass(x, W, dropout_rate=0.0)['y']) for x,y in zip(X_te,Y_te)]

print('Accuracy:', accuracy_score(Y_te,preds_te))
print('Precision:', precision_score(Y_te,preds_te,average='macro'))
print('Recall:', recall_score(Y_te,preds_te,average='macro'))
print('F1-Score:', f1_score(Y_te,preds_te,average='macro'))

Accuracy: 0.8355555555555556
Precision: 0.8363954119098828
Recall: 0.8355555555555556
F1-Score: 0.8350263621620005

Discuss how did you choose model hyperparameters ?


Choose Model Hyperparameters:

Choose Model Hyperparameters: For this model, emberdding and dropout has been used for fine tuning and learning rate is constant 0.01.
For hyperparameter tuning gird search method has been used. This method is straight forword but time consuming. Becuase of the
restriction of using different librarios K-Fold cross validation is not being used which is more better. For embedding 4 values 50, 350, 650
and 950 and for dropout 3 values 0.2,0.4,0.6 has been used that means in total 12 model combination.

In [30]: # Initial hyperparameters


fig = plt.figure()
embeddings = [50,350,650,950]
dropouts = [0.2,0.4,0.6]
embeddings_num = len(embeddings)
dropouts_num = len(dropouts)

# Run trainings on different sets of hyperparameters


for i in range(dropouts_num):
# Dropout Tune
dropout = dropouts[i]
print("For Dropout:",dropout)
tmp = list()
for j in range(embeddings_num):
# Embedding Tune
embedding = embeddings[j]
print("For Embedding: ",embedding)
W = network_weights(vocab_size=len(vocab),embedding_dim=embedding,hidden_dim=[], num_classes=3)
W, _, _ = SGD(X_tr, Y_tr,
W,
X_dev=X_dev,
Y_dev=Y_dev,
lr=0.0001,
dropout=dropout,
freeze_emb=False,
tolerance=0.01,
epochs=50,
print_progress=False)
preds_dev = [np.argmax(forward_pass(x, W, dropout_rate=0.0)['y']) for x,y in zip(X_dev,Y_dev)]
score = f1_score(Y_dev,preds_dev,average='macro')
print(score)
tmp.append(score)
plt.plot(embeddings,tmp,label='dropout: {}'.format(dropout))

plt.xlabel('Embedding Size')
plt.ylabel('F1 Score')
plt.title('Hyperparameters: embedding & dropout')

plt.legend()
plt.show()

For Dropout: 0.2


For Embedding: 50
0.9005595629556868
For Embedding: 350
0.8869874736583122
For Embedding: 650
0.9073091481593138
For Embedding: 950
0.9069062865799077
For Dropout: 0.4
For Embedding: 50
0.8872056202580744
For Embedding: 350
0.8807540905666156
For Embedding: 650
0.8803675361828662
For Embedding: 950
0.8744230559560009
For Dropout: 0.6
For Embedding: 50
0.8871976969295737
For Embedding: 350
0.8744104398249659
For Embedding: 650
0.8803675361828662
For Embedding: 950
0.8737713013070513

Chart description:

As shown in the figure above, the X axis represents the embedding size, and the Y axis refers to the F1-Score. And different colored
polylines indicate different dropout.

Embedding: For the embedding parameter in emebding 50 it gaves around 90% F1-score after that it decresed on embedding size 350 and
again start increasing after this. Highest peek is on embedding size 650.

Dropout: From the graph we can say that as the dropout increases, the performance also gradually declines. Only dropout 02. performs
well.

After fine tuning the best parameter is: Dropout: 0.2 Embedding Size: 650

Use Pre-trained Embeddings


Now re-train the network using GloVe pre-trained embeddings. You need to modify the backward_pass function above to stop
computing gradients and updating weights of the embedding matrix.

Use the function below to obtain the embedding martix for your vocabulary. Generally, that should work without any problem. If you get
errors, you can modify it.

Sorted Voabulary is very important if we want to get the same result even after restarting the kernal every time.

In [31]: #function for loading embedding file


def get_glove_embeddings(f_zip, f_txt, word2id, emb_size=300):

w_emb = np.zeros((len(word2id), emb_size))

with zipfile.ZipFile(f_zip) as z:
with z.open(f_txt) as f:
for line in f:
line = line.decode('utf-8')
word = line.split()[0]

if word in sorted(vocab):
emb = np.array(line.strip('\n').split()[1:]).astype(np.float32)
w_emb[word2id[word]] +=emb
return w_emb

In [33]: w_glove = get_glove_embeddings("data_topic/glove.840B.300d.zip","glove.840B.300d.txt",vocab2id)

First, initialise the weights of your network using the network_weights function. Second, replace the weigths of the embedding matrix
with w_glove . Finally, train the network by freezing the embedding weights:

In [34]: #training with Pre-embedding network


W = network_weights(vocab_size=len(vocab),embedding_dim=300,hidden_dim=[], num_classes=3)
W[0] = w_glove
W, loss_tr, dev_loss = SGD(X_tr, Y_tr,
W,
X_dev=X_dev,
Y_dev=Y_dev,
lr=0.001,
dropout=0.2,
freeze_emb=True,
tolerance=0.01,
epochs=50)

Epoch: 1| Training loss: 0.941820| Validation loss: 1.075626


Epoch: 2| Training loss: 0.502158| Validation loss: 0.580066
Epoch: 3| Training loss: 0.316189| Validation loss: 0.407243
Epoch: 4| Training loss: 0.220960| Validation loss: 0.335605
Epoch: 5| Training loss: 0.161608| Validation loss: 0.291945
Epoch: 6| Training loss: 0.124208| Validation loss: 0.273915
Epoch: 7| Training loss: 0.099348| Validation loss: 0.259843
Epoch: 8| Training loss: 0.078236| Validation loss: 0.248790
Epoch: 9| Training loss: 0.059555| Validation loss: 0.251154

Plot the learning process

In [35]: %matplotlib inline


fig = plt.figure()
plt.plot(range(len(loss_tr)),loss_tr,label='training loss')
plt.plot(range(len(dev_loss)),dev_loss,label='validation loss')

plt.xlabel('Epochs')
plt.ylabel('Loss')
plt.title('Training Monitoring')

plt.legend()
plt.show()

By seeing the above graph, We can say that our pretrained average embedding model trained well beacuse as we can is both the training
and validation loss continues to fall together. The gap between the validation and training loss is also not that much.

In [36]: preds_te = [np.argmax(forward_pass(x, W, dropout_rate=0.0)['y'])


for x,y in zip(X_te,Y_te)]

print('Accuracy:', accuracy_score(Y_te,preds_te))
print('Precision:', precision_score(Y_te,preds_te,average='macro'))
print('Recall:', recall_score(Y_te,preds_te,average='macro'))
print('F1-Score:', f1_score(Y_te,preds_te,average='macro'))

Accuracy: 0.8633333333333333
Precision: 0.8632796695811602
Recall: 0.8633333333333333
F1-Score: 0.8629290151920235
We got better result then before. Using pretrained embeddings helps us to improve the model.

Discuss how did you choose model hyperparameters ?


Like the average embedding without pre-trained before, I tune the learning rate first, then change the dropout to get better convergence.
The embedding size is a constant number wich is 300 and it cannot be changed. Because the embedding matrix is pre-trained rather than
random values. 3 learning rate values and 3 dropout values in total 9 combination model.

In [37]: # Initial hyperparameters


fig = plt.figure()
lrs = [0.001,0.0001,0.00001]
dropouts = [0.2,0.4,0.6]
lr_num = len(lrs)
dropouts_num = len(dropouts)

# Run trainings on different sets of hyperparameters


for i in range(dropouts_num):
# Dropout Tune
dropout = dropouts[i]
print("For dropout:",dropout)
tmp = list()
for j in range(lr_num):
# Learning rate Tune
lr = lrs[j]
print("For learning rate: ",lr)
W = network_weights(vocab_size=len(vocab),embedding_dim=300,hidden_dim=[], num_classes=3)
W[0] = w_glove
W, _, _ = SGD(X_tr, Y_tr,
W,
X_dev=X_dev,
Y_dev=Y_dev,
lr=lr,
dropout=dropout,
freeze_emb=True,
tolerance=0.01,
epochs=50,
print_progress=False)
preds_dev = [np.argmax(forward_pass(x, W, dropout_rate=0.0)['y']) for x,y in zip(X_dev,Y_dev)]
score = f1_score(Y_dev,preds_dev,average='macro')
print(score)
tmp.append(score)
plt.plot(lrs,tmp,label='dropout: {}'.format(dropout))

plt.xlabel('Learning Rate')
plt.ylabel('F1 Score')
plt.title('Hyperparameters: learning rate & dropout')

plt.legend()
plt.show()

For dropout: 0.2


For learning rate: 0.001
0.901186299081036
For learning rate: 0.0001
0.901186299081036
For learning rate: 1e-05
0.901186299081036
For dropout: 0.4
For learning rate: 0.001
0.8094729344729344
For learning rate: 0.0001
0.9008700406876698
For learning rate: 1e-05
0.8877657052349103
For dropout: 0.6
For learning rate: 0.001
0.8606209150326798
For learning rate: 0.0001
0.9008700406876698
For learning rate: 1e-05
0.8876880507199655

Chart description:

As shown in the figure above, the X axis represents the learning, and the Y axis refers to the F1-Score. And different colored polylines
indicate different dropout.

Only for drop out 0.2 the learning rate didnt decresead but for rest of the dropout 0.4 and 0.6 as the leraning rate increased the F1-Score
decreasd. The best dropout is 0.2 followed by best learning rate is 0.001.

After fine tuning the best parameter is: Dropout: 0.2 Learning rate: 0.001

Extend to support deeper architectures


Extend the network to support back-propagation for more hidden layers. You need to modify the backward_pass function above to
compute gradients and update the weights between intermediate hidden layers. Finally, train and evaluate a network with a deeper
architecture. Do deeper architectures increase performance?

For extendin the network two hidden layers has been used which are 100 and 30. After using different combination of hidden layers this
two layes has been selected. Basically uniformaly coming from 300 to 100 then 100 to 30 and finally 30 to 3. SO the the network with look
like this 5000(vocab size) X 300 X 100 X 30 X 3.

In [67]: #Train with extend deeped architectures


W = network_weights(vocab_size=len(vocab),embedding_dim=300,hidden_dim=[100,30], num_classes=3)
W[0] = w_glove
W, loss_tr, dev_loss = SGD(X_tr, Y_tr,
W,
X_dev=X_dev,
Y_dev=Y_dev,
lr=0.001,
dropout=0.2,
freeze_emb=True,
tolerance=0.01,
epochs=50)

Epoch: 1| Training loss: 0.219675| Validation loss: 0.576614


Epoch: 2| Training loss: 0.166958| Validation loss: 0.611724
Epoch: 3| Training loss: 0.099358| Validation loss: 0.720356

In [68]: fig = plt.figure()


plt.plot(range(len(loss_tr)),loss_tr,label='training loss')
plt.plot(range(len(dev_loss)),dev_loss,label='validation loss')

plt.xlabel('Epochs')
plt.ylabel('Loss')
plt.title('Training Monitoring')

plt.legend()
plt.show()

By seeing the above graph we can say that by exteding out network with 2 hidden layers it didnt help us that much. Our model overfitted.
In the above grpah it can be seen that the gap between the validation and training loss. Also when the training loss is decresing the
validation loss start to get worsen which is the charactistics of a overfit model. So this model is overfitted.

In [69]: preds_te = [np.argmax(forward_pass(x, W, dropout_rate=0.0)['y'])


for x,y in zip(X_te,Y_te)]

print('Accuracy:', accuracy_score(Y_te,preds_te))
print('Precision:', precision_score(Y_te,preds_te,average='macro'))
print('Recall:', recall_score(Y_te,preds_te,average='macro'))
print('F1-Score:', f1_score(Y_te,preds_te,average='macro'))

Accuracy: 0.8366666666666667
Precision: 0.8375991318052716
Recall: 0.8366666666666666
F1-Score: 0.8358287405211274
Adding two additation hidden layers didnt help us to improve the model any further. The accuracy is lower then before. The accuracy for
this model is almost simalr with the first model so adding this wo hidden layers didnt benifited the model.

Discuss how did you choose model hyperparameters ?


Like the average embedding without pre-trained before, I tune the learning rate first, then change the dropout to get better convergence.
The embedding size is a constant number and it cannot be changed. Because the embedding matrix is pre-trained rather than random
values.

In [66]: # Initial hyperparameters


fig = plt.figure()
lrs = [0.001,0.0001,0.00001]
dropouts = [0.1,0.2,0.3]
lr_num = len(lrs)
dropouts_num = len(dropouts)

# Run trainings on different sets of hyperparameters


for i in range(dropouts_num):
# Dropout Tune
dropout = dropouts[i]
print("For dropout: ",dropout)
tmp = list()
for j in range(lr_num):
# Learning rate Tune
lr = lrs[j]
print("For learning rate: ",lr)
W = network_weights(vocab_size=len(vocab),embedding_dim=300,hidden_dim=[100,30], num_classes=3)
W[0] = w_glove
W,_,_ = SGD(X_tr, Y_tr,
W,
X_dev=X_dev,
Y_dev=Y_dev,
lr=lr,
dropout=dropout,
freeze_emb=True,
tolerance=0.01,
epochs=50,
print_progress=False)
preds_dev = [np.argmax(forward_pass(x, W, dropout_rate=0.0)['y']) for x,y in zip(X_dev,Y_dev)]
score = f1_score(Y_dev,preds_dev,average='macro')
print(score)
tmp.append(score)
plt.plot(lrs,tmp,label='dropout: {}'.format(dropout))

plt.xlabel('Learning Rate')
plt.ylabel('F1 Score')
plt.title('Hyperparameters: learning rate & dropout')

plt.legend()
plt.show()

For dropout: 0.1


For learning rate: 0.001
0.9135301538224733
For learning rate: 0.0001
0.9072338404246733
For learning rate: 1e-05
0.9072338404246733
For dropout: 0.2
For learning rate: 0.001
0.9267399267399267
For learning rate: 0.0001
0.9137872970970567
For learning rate: 1e-05
0.907413338671471
For dropout: 0.3
For learning rate: 0.001
0.9199266593325999
For learning rate: 0.0001
0.9199266593325999
For learning rate: 1e-05
0.9072338404246733

As shown in the figure above, the X axis represents the learning, and the Y axis refers to the F1-Score. And different colored polylines
indicate different dropout.

So for all the drop out as the learning rate increseas the performance of the model increase also. The best dropout parameter for this
model is 0.2 followed by 0.001 learning rate.

After fine tuning the best parameter is: Dropout: 0.2 Learning rate: 0.001

Full Results
Model Precision Recall F1-Score Accuracy

Average Embedding 83.55% 83.63% 83.55% 83.50%

Average Embedding (Pre-trained) 86.33% 86.32% 86.33% 86.29%

Average Embedding (Pre-trained) + X hidden layers 83.66% 83.75% 83.66% 83.58%

Pre-trained Average Embedding > Pre-trained Average Embedding + hidden layers > Average Embedding

Pre-trained Average Embedding perfomed the best with 86.29% F1-score and Average Embedding perfomed the lowest with 83.50% F1-
Score. After extending the hidden layers our pretrained average embedding didnt perform well. It might be because there are numerous
parameters in the multilayer neural network, making convergence to the global lowest point more challenging. As a result, the overall F1
score is not exceptionally high. Also, when SGD is employed instead of GD, the gradient in SGD does not always point to the global lowest
point, but rather to the lowest point under the current data point. Thats why it didnt perform well and had lower F1-Score then previous
model. In resepect to the Question all the model performs well (above 80%) which was mentioned in the qestion also.

In [ ]: !ipython nbconvert assignment2.ipynb --to=pdfviahtml

In [ ]:

You might also like