Transform Raw Texts Into Training and Development Data: Instructor: Nikos Aletras
Transform Raw Texts Into Training and Development Data: Instructor: Nikos Aletras
Feedforward Network
Instructor: Nikos Aletras
The goal of this assignment is to develop a Feedforward neural network for topic classification.
Text processing methods for transforming raw text data into input vectors for your network (1 mark)
The Stochastic Gradient Descent (SGD) algorithm with back-propagation to learn the weights of your Neural network. Your algorithm
should:
Use (and minimise) the Categorical Cross-entropy loss function (1 mark)
Perform a Forward pass to compute intermediate outputs (3 marks)
Perform a Backward pass to compute gradients and update all sets of weights (6 marks)
Implement and use Dropout after each hidden layer for regularisation (2 marks)
Discuss how did you choose hyperparameters? You can tune the learning rate (hint: choose small values), embedding size {e.g. 50, 300,
500}, the dropout rate {e.g. 0.2, 0.5} and the learning rate. Please use tables or graphs to show training and validation performance for
each hyperparameter combination (2 marks).
After training a model, plot the learning process (i.e. training and validation loss in each epoch) using a line plot and report accuracy.
Does your model overfit, underfit or is about right? (1 mark).
Re-train your network by using pre-trained embeddings (GloVe) trained on large corpora. Instead of randomly initialising the
embedding weights matrix, you should initialise it with the pre-trained weights. During training, you should not update them (i.e.
weight freezing) and backprop should stop before computing gradients for updating embedding weights. Report results by
performing hyperparameter tuning and plotting the learning process. Do you get better performance? (3 marks).
Extend you Feedforward network by adding more hidden layers (e.g. one more or two). How does it affect the performance? Note: You
need to repeat hyperparameter tuning, but the number of combinations grows exponentially. Therefore, you need to choose a subset
of all possible combinations (4 marks)
Provide well documented and commented code describing all of your choices. In general, you are free to make decisions about text
processing (e.g. punctuation, numbers, vocabulary size) and hyperparameter values. We expect to see justifications and discussion for
all of your choices (2 marks).
Provide efficient solutions by using Numpy arrays when possible. Executing the whole notebook with your code should not take more
than 10 minutes on any standard computer (e.g. Intel Core i5 CPU, 8 or 16GB RAM) excluding hyperparameter tuning runs and loading
the pretrained vectors. You can find tips in Lab 1 (2 marks).
Data
The data you will use for the task is a subset of the AG News Corpus and you can find it in the ./data_topic folder in CSV format:
data_topic/train.csv : contains 2,400 news articles, 800 for each class to be used for training.
data_topic/dev.csv : contains 150 news articles, 50 for each class to be used for hyperparameter selection and monitoring the
training process.
data_topic/test.csv : contains 900 news articles, 300 for each class to be used for testing.
Pre-trained Embeddings
You can download pre-trained GloVe embeddings trained on Common Crawl (840B tokens, 2.2M vocab, cased, 300d vectors, 2.03 GB
download) from here. No need to unzip, the file is large.
Save Memory
To save RAM, when you finish each experiment you can delete the weights of your network using del W followed by Python's garbage
collector gc.collect()
Submission Instructions
You should submit a Jupyter Notebook file (assignment2.ipynb) and an exported PDF version (you can do it from Jupyter: File-
>Download as->PDF via Latex ).
You are advised to follow the code structure given in this notebook by completing all given funtions. You can also write any
auxilliary/helper functions (and arguments for the functions) that you might need but note that you can provide a full solution without any
such functions. Similarly, you can just use only the packages imported below but you are free to use any functionality from the Python
Standard Library, NumPy, SciPy (excluding built-in softmax funtcions) and Pandas. You are not allowed to use any third-party library
such as Scikit-learn (apart from metric functions already provided), NLTK, Spacy, Keras, Pytorch etc.. You should mention if you've used
Windows to write and test your code because we mostly use Unix based machines for marking (e.g. Ubuntu, MacOS).
There is no single correct answer on what your accuracy should be, but correct implementations usually achieve F1-scores around 80\% or
higher. The quality of the analysis of the results is as important as the accuracy itself.
This assignment will be marked out of 30. It is worth 30\% of your final grade in the module.
The deadline for this assignment is 23:59 on Mon, 9 May 2022 and it needs to be submitted via Blackboard. Standard departmental
penalties for lateness will be applied. We use a range of strategies to detect unfair means, including Turnitin which helps detect
plagiarism. Use of unfair means would result in getting a failing grade.
First of all the notebook was tested in google colab as I am very micuh famlier with google colab then I have used Windows to test my
code. My windows configuration is Core i5 6th gen, 8 GB Ram.
tokenise all texts into a list of unigrams (tip: you can re-use the functions from Assignment 1)
remove stop words (using the one provided or one of your preference)
remove unigrams appearing in less than K documents
use the remaining to create a vocabulary of the top-N most frequent unigrams in the entire corpus.
and returns:
tokenRE = re.compile(token_pattern)
if ngram_range[0]==1:
x = x_uni
# ignore unigrams
if n==1: continue
for n in ngrams:
for t in n:
x.append(t)
if len(vocab)>0:
x = [w for w in x if w in vocab]
return x
and returns:
tokenRE = re.compile(token_pattern)
df = Counter()
ngram_counts = Counter()
vocab = set()
Now you should use get_vocab to create your vocabulary and get document and raw frequencies of unigrams:
In [12]: #using the vocabulary of training for training. Choosing the top 5000 vocabulary only
vocab, df, ngram_counts = get_vocab(X_tr_raw, ngram_range=(1,1), keep_topN=5000, stop_words=stop_words)
print(len(vocab))
print()
print(list(sorted(vocab))[:100])
print()
print(df.most_common()[:10])
5000
['aaron', 'abandon', 'abandoned', 'abby', 'abdullah', 'aber', 'able', 'aboard', 'about', 'above', 'abroad', 'ab
solute', 'abu', 'abuja', 'ac', 'accept', 'accepted', 'accepting', 'access', 'accessories', 'accident', 'accordi
ng', 'account', 'accounting', 'accusations', 'accused', 'accuser', 'accuses', 'accusing', 'ace', 'acknowledge
d', 'acquire', 'acquisition', 'acquisitions', 'across', 'act', 'action', 'actions', 'activated', 'activist', 'a
ctivists', 'activities', 'activity', 'actors', 'actress', 'adam', 'add', 'added', 'adding', 'additional', 'adju
sted', 'adjusters', 'administration', 'administrator', 'admission', 'adopted', 'adults', 'advance', 'advanced',
'advantage', 'advertisers', 'advertising', 'adviser', 'advising', 'aegis', 'affair', 'afford', 'afghan', 'afgha
nistan', 'afghans', 'afp', 'africa', 'african', 'africans', 'aftermath', 'afternoon', 'ag', 'against', 'agass
i', 'age', 'agencies', 'agency', 'agent', 'ago', 'agony', 'agree', 'agreed', 'agreement', 'agreements', 'agricu
ltural', 'ahead', 'ahmed', 'aid', 'aided', 'aides', 'ailing', 'aimed', 'aiming', 'air', 'aircraft']
[('reuters', 631), ('said', 432), ('tuesday', 413), ('wednesday', 344), ('new', 325), ('ap', 275), ('athens', 2
45), ('monday', 221), ('first', 210), ('two', 187)]
Then, you need to create vocabulary id -> word and word -> vocabulary id dictionaries for reference:
In [13]: # Calculate the vocab and df of different data sets. Only choosing the top 5000 vocab for every dataset. after
#To large Vocab size also runs slow in my cpu especially in the pre-trained embedding
vocab_tr, df_tr, ngram_counts_tr = get_vocab(X_tr_raw, ngram_range=(1,1), keep_topN=5000, stop_words=stop_words
vocab_te, df_te, ngram_counts_te = get_vocab(X_te_raw, ngram_range=(1,1), keep_topN=5000, stop_words=stop_words
vocab_dev, df_dev, ngram_counts_dev = get_vocab(X_dev_raw, ngram_range=(1,1), keep_topN=5000, stop_words=stop_w
# create reference with sorted vocab to avoid getting random accuracy every time we restart the kernal
id2vocab = dict(enumerate(sorted(vocab_tr)))
vocab2id = dict(zip(id2vocab.values(),id2vocab.keys()))
First, represent documents in train, dev and test sets as lists of words in the vocabulary:
In [15]: X_uni_tr[0]
['reuters',
Out[15]:
'venezuelans',
'turned',
'out',
'early',
'large',
'numbers',
'sunday',
'vote',
'historic',
'referendum',
'remove',
'left',
'wing',
'president',
'hugo',
'chavez',
'office',
'give',
'new',
'mandate',
'govern',
'next',
'two',
'years']
In [17]: X_tr[0]
[3734,
Out[17]:
4754,
4655,
3082,
1388,
2470,
2991,
4333,
4801,
2041,
3619,
3668,
2512,
4909,
3360,
2102,
761,
3013,
1856,
2947,
2657,
1892,
2953,
4665,
4985]
Put the labels Y for train, dev and test sets into arrays:
Network Architecture
Your network should pass each word index into its corresponding embedding by looking-up on the embedding matrix and then compute
the first hidden layer h : 1
1
e
h1 = ∑W ,i ∈ x
i
|x|
i
where |x| is the number of words in the document and W is an embedding matrix |V | × d, |V | is the size of the vocabulary and d the
e
embedding size.
a1 = relu(h1 )
y = softmax(a1 W )
During training, a should be multiplied with a dropout mask vector (elementwise) for regularisation before it is passed to the output layer.
1
You can extend to a deeper architecture by passing a hidden layer to another one:
hi = ai−1 Wi
ai = relu(hi )
Network Training
First we need to define the parameters of our network by initiliasing the weight matrices. For that purpose, you should implement the
network_weights function that takes as input:
and returns:
W : a dictionary mapping from layer index (e.g. 0 for the embedding matrix) to the corresponding weight matrix initialised with small
random numbers (hint: use numpy.random.uniform with from -0.1 to 0.1)
Make sure that the dimensionality of each weight matrix is compatible with the previous and next weight matrix, otherwise you won't be
able to perform forward and backward passes. Consider also using np.float32 precision to save memory.
W_h = list()
pt = embedding_dim
for layer in hidden_dim:
W_h.append(np.random.uniform(low = -1*init_val, high = init_val, size = (pt,layer)))
pt = layer
W_out = np.random.uniform(low = -1*init_val, high = init_val, size = (pt,num_classes))
W = [W_emb,*W_h,W_out]
return W
Then you need to develop a softmax function (same as in Assignment 1) to be used in the output layer.
It takes as input z (array of real numbers) and returns sig (the softmax of z )
#In order to avoid data overflow, the corresponding data boundary is added.
z = np.minimum(z,709.782)
sig = np.exp(z) / np.sum(np.exp(z),axis=z.ndim-1,keepdims=True)
return sig
Now you need to implement the categorical cross entropy loss by slightly modifying the function from Assignment 1 to depend only on
the true label y and the class probabilities vector y_preds :
Y = np.array(y)
assert type(y_preds) == np.ndarray
try:
n,d = y_preds.shape
except:
n=1
d = y_preds.shape[0]
Y = np.eye(d,d)[Y].reshape(n,d)
y_preds = np.maximum(y_preds,np.finfo(np.float64).eps)
assert np.all(y_preds>0)
l1 = -np.sum(Y*np.log(y_preds),1)
l=np.mean(l1)
assert l>=0
return l
Then, implement the relu function to introduce non-linearity after each hidden layer of your network (during the forward pass):
relu(zi ) = max(zi , 0)
and the relu_derivative function to compute its derivative (used in the backward pass):
a = np.fmax(z, 0)
return a
def relu_derivative(z):
During training you should also apply a dropout mask element-wise after the activation function (i.e. vector of ones with a random
percentage set to zero). The dropout_mask function takes as input:
and returns:
#Dropout matrix
dropout_vec = np.full(size,1.0)
index = np.arange(size)
np.random.shuffle(index)
dropout_vec[index[:int(size*dropout_rate)]]=0
return dropout_vec
Now you need to implement the forward_pass function that passes the input x through the network up to the output layer for
computing the probability for each class using the weight matrices in W . The ReLU activation function should be applied on each hidden
layer.
and returns:
out_vals : a dictionary of output values from each layer: h (the vector before the activation function), a (the resulting vector after
passing h from the activation function), its dropout mask vector; and the prediction vector (probability for each class) from the output
layer.
if type(x[0])!=list:
h = np.zeros(W.shape[1])
for index in range(len(x)):
np.add(W[x[index]]/len(x),h,out=h) # with mean value
else:
h = np.zeros((len(x),W.shape[1]))
for i in range(len(x)):
for index in range(len(x[i])):
h[i,:] += W[x[i][index]]/len(x[i])
return h
'''Linear forward:
1. h = a*W
2. a = relu(h)
'''
h = np.dot(A_prev, W)
a = relu(h)
dropout = dropout_mask(a.shape[a.ndim-1],dropout_rate)
return h,a,dropout
h_vecs = []
a_vecs = []
dropout_vecs = []
# Hidden layers
L = len(W)
for l in range(1,L-1):
A_prev = A
# Activation, dropout
h,A,dropout = forward_linear(A_prev,W[l],dropout_rate)
h_vecs.append(h);a_vecs.append(A);dropout_vecs.append(dropout)
A = np.multiply(A,dropout)
# Output layer
A_prev = A
h,A,dropout = forward_linear(A_prev,W[-1],dropout_rate)
np.multiply(A,dropout,out=A)
y = softmax(A) # softmax function mapping vector to probability
out_vals['h'] = h_vecs
out_vals['a'] = a_vecs
out_vals['dropout_vec'] = dropout_vecs
out_vals['y'] = y
return out_vals
The backward_pass function computes the gradients and updates the weights for each matrix in the network from the output to the
input. It takes as input
and returns:
Hint: the gradients on the output layer are similar to the multiclass logistic regression.
# hidden layers
L = len(W)
for l in range(1,L):
A_prev = out_vals['a'][-1*l]
# Gradient, Dropout
dW,dA_prev = backward_dropout(dh, A_prev,W[-1*l],out_vals['dropout_vec'][-1*l])
# Activation
dh = backward_activation(dA_prev,out_vals['h'][-1*l])
np.multiply(lr,dW,out = dW)
np.subtract(W[-1*l],dW,out = W[-1*l])
# input layers
if not freeze_emb:
X = np.array(x).reshape(-1,1)/len(x)
dW = np.dot(X,dh)
np.multiply(lr,dW,out=dW)
for index in range(len(x)):
W[0][x[index]] -= dW[index]
return W
Finally you need to modify SGD to support back-propagation by using the forward_pass and backward_pass functions.
and returns:
t_forward = 0
t_backward = 0
for epoch in range(epochs):
for i in idx_list:
# get single data
X_tr_i, Y_tr_i = X_tr[i],Y_tr_pre[i]
# Forward pass
out_vals = forward_pass(X_tr_i, W, dropout_rate=dropout)
# Backward pass
W = backward_pass(X_tr_i, Y_tr_i.reshape(1,-1), W, out_vals, lr=lr, freeze_emb=False)
# training loss
y_preds = forward_pass(X_tr, W, dropout_rate=0)['y']
loss_tr = categorical_loss(Y_tr, y_preds)
# evaluation loss
y_preds = forward_pass(X_dev, W, dropout_rate=0)['y']
loss_dev = categorical_loss(Y_dev, y_preds)
# Add history
training_loss_history.append(loss_tr)
validation_loss_history.append(loss_dev)
if print_progress == True:
print("Epoch: %d| Training loss: %f| Validation loss: %f"%(epoch+1,loss_tr,loss_dev))
Now you are ready to train and evaluate your neural net. First, you need to define your network using the network_weights function
followed by SGD with backprop:
for i in range(len(W)):
print('Shape W'+str(i), W[i].shape)
plt.xlabel('Epochs')
plt.ylabel('Loss')
plt.title('Training Monitoring')
plt.legend()
plt.show()
By seeing the above graph, We can say that our model trained well beacuse as we can is both the training and validation loss continues to
fall.
print('Accuracy:', accuracy_score(Y_te,preds_te))
print('Precision:', precision_score(Y_te,preds_te,average='macro'))
print('Recall:', recall_score(Y_te,preds_te,average='macro'))
print('F1-Score:', f1_score(Y_te,preds_te,average='macro'))
Accuracy: 0.8355555555555556
Precision: 0.8363954119098828
Recall: 0.8355555555555556
F1-Score: 0.8350263621620005
Choose Model Hyperparameters: For this model, emberdding and dropout has been used for fine tuning and learning rate is constant 0.01.
For hyperparameter tuning gird search method has been used. This method is straight forword but time consuming. Becuase of the
restriction of using different librarios K-Fold cross validation is not being used which is more better. For embedding 4 values 50, 350, 650
and 950 and for dropout 3 values 0.2,0.4,0.6 has been used that means in total 12 model combination.
plt.xlabel('Embedding Size')
plt.ylabel('F1 Score')
plt.title('Hyperparameters: embedding & dropout')
plt.legend()
plt.show()
Chart description:
As shown in the figure above, the X axis represents the embedding size, and the Y axis refers to the F1-Score. And different colored
polylines indicate different dropout.
Embedding: For the embedding parameter in emebding 50 it gaves around 90% F1-score after that it decresed on embedding size 350 and
again start increasing after this. Highest peek is on embedding size 650.
Dropout: From the graph we can say that as the dropout increases, the performance also gradually declines. Only dropout 02. performs
well.
After fine tuning the best parameter is: Dropout: 0.2 Embedding Size: 650
Use the function below to obtain the embedding martix for your vocabulary. Generally, that should work without any problem. If you get
errors, you can modify it.
Sorted Voabulary is very important if we want to get the same result even after restarting the kernal every time.
with zipfile.ZipFile(f_zip) as z:
with z.open(f_txt) as f:
for line in f:
line = line.decode('utf-8')
word = line.split()[0]
if word in sorted(vocab):
emb = np.array(line.strip('\n').split()[1:]).astype(np.float32)
w_emb[word2id[word]] +=emb
return w_emb
First, initialise the weights of your network using the network_weights function. Second, replace the weigths of the embedding matrix
with w_glove . Finally, train the network by freezing the embedding weights:
plt.xlabel('Epochs')
plt.ylabel('Loss')
plt.title('Training Monitoring')
plt.legend()
plt.show()
By seeing the above graph, We can say that our pretrained average embedding model trained well beacuse as we can is both the training
and validation loss continues to fall together. The gap between the validation and training loss is also not that much.
print('Accuracy:', accuracy_score(Y_te,preds_te))
print('Precision:', precision_score(Y_te,preds_te,average='macro'))
print('Recall:', recall_score(Y_te,preds_te,average='macro'))
print('F1-Score:', f1_score(Y_te,preds_te,average='macro'))
Accuracy: 0.8633333333333333
Precision: 0.8632796695811602
Recall: 0.8633333333333333
F1-Score: 0.8629290151920235
We got better result then before. Using pretrained embeddings helps us to improve the model.
plt.xlabel('Learning Rate')
plt.ylabel('F1 Score')
plt.title('Hyperparameters: learning rate & dropout')
plt.legend()
plt.show()
Chart description:
As shown in the figure above, the X axis represents the learning, and the Y axis refers to the F1-Score. And different colored polylines
indicate different dropout.
Only for drop out 0.2 the learning rate didnt decresead but for rest of the dropout 0.4 and 0.6 as the leraning rate increased the F1-Score
decreasd. The best dropout is 0.2 followed by best learning rate is 0.001.
After fine tuning the best parameter is: Dropout: 0.2 Learning rate: 0.001
For extendin the network two hidden layers has been used which are 100 and 30. After using different combination of hidden layers this
two layes has been selected. Basically uniformaly coming from 300 to 100 then 100 to 30 and finally 30 to 3. SO the the network with look
like this 5000(vocab size) X 300 X 100 X 30 X 3.
plt.xlabel('Epochs')
plt.ylabel('Loss')
plt.title('Training Monitoring')
plt.legend()
plt.show()
By seeing the above graph we can say that by exteding out network with 2 hidden layers it didnt help us that much. Our model overfitted.
In the above grpah it can be seen that the gap between the validation and training loss. Also when the training loss is decresing the
validation loss start to get worsen which is the charactistics of a overfit model. So this model is overfitted.
print('Accuracy:', accuracy_score(Y_te,preds_te))
print('Precision:', precision_score(Y_te,preds_te,average='macro'))
print('Recall:', recall_score(Y_te,preds_te,average='macro'))
print('F1-Score:', f1_score(Y_te,preds_te,average='macro'))
Accuracy: 0.8366666666666667
Precision: 0.8375991318052716
Recall: 0.8366666666666666
F1-Score: 0.8358287405211274
Adding two additation hidden layers didnt help us to improve the model any further. The accuracy is lower then before. The accuracy for
this model is almost simalr with the first model so adding this wo hidden layers didnt benifited the model.
plt.xlabel('Learning Rate')
plt.ylabel('F1 Score')
plt.title('Hyperparameters: learning rate & dropout')
plt.legend()
plt.show()
As shown in the figure above, the X axis represents the learning, and the Y axis refers to the F1-Score. And different colored polylines
indicate different dropout.
So for all the drop out as the learning rate increseas the performance of the model increase also. The best dropout parameter for this
model is 0.2 followed by 0.001 learning rate.
After fine tuning the best parameter is: Dropout: 0.2 Learning rate: 0.001
Full Results
Model Precision Recall F1-Score Accuracy
Pre-trained Average Embedding > Pre-trained Average Embedding + hidden layers > Average Embedding
Pre-trained Average Embedding perfomed the best with 86.29% F1-score and Average Embedding perfomed the lowest with 83.50% F1-
Score. After extending the hidden layers our pretrained average embedding didnt perform well. It might be because there are numerous
parameters in the multilayer neural network, making convergence to the global lowest point more challenging. As a result, the overall F1
score is not exceptionally high. Also, when SGD is employed instead of GD, the gradient in SGD does not always point to the global lowest
point, but rather to the lowest point under the current data point. Thats why it didnt perform well and had lower F1-Score then previous
model. In resepect to the Question all the model performs well (above 80%) which was mentioned in the qestion also.
In [ ]: