
DEEP LEARNING & APPLICATIONS

Master MBD / CI LSI

Pr. Lotfi ELAACHAK


Département Génie Informatique

2023 — 2024
Table of Contents

0.1 Introduction
    0.1.1 Machine learning vs Deep learning
    0.1.2 History of Deep Learning
    0.1.3 Application Fields
    0.1.4 Tensors

1 Linear Models
    1.1 Multi Linear Regression
        1.1.1 The Model
        1.1.2 Cost/Loss Function
        1.1.3 Gradient Descent
        1.1.4 Probabilistic Interpretation (Cost Function)
        1.1.5 Multi Linear Regression with Pytorch
    1.2 Perceptron
        1.2.1 Perceptron Model
        1.2.2 Learning Algorithm
        1.2.3 Perceptron with Pytorch
    1.3 Logistic Regression
        1.3.1 Logistic Regression Model
        1.3.2 Probability Interpretation (Cost Function)
        1.3.3 Logistic Regression Pytorch
    1.4 SoftMax Regression
        1.4.1 MultiClass Classification
        1.4.2 SoftMax Model
        1.4.3 Probability Interpretation (Cost Function)
        1.4.4 SoftMax Pytorch

2 Deep Neural Networks
    2.1 Computational Graphs
        2.1.1 Types of computational graphs
        2.1.2 Forward computation
        2.1.3 Backward computation
    2.2 Multi-Layer Perceptrons
    2.3 Loss functions
        2.3.1 Supervised Learning Tasks
        2.3.2 Loss Function for Regression
        2.3.3 Loss Function for Classification
        2.3.4 Binary cross-entropy
        2.3.5 Cross-entropy / Multi-Class
        2.3.6 Kullback-Leibler Divergence
    2.4 Activation functions
        2.4.1 Sigmoid
        2.4.2 Tanh
        2.4.3 ReLU
        2.4.4 Leaky ReLU
        2.4.5 Exponential Linear Units (ELU)
        2.4.6 How to choose an activation function
    2.5 ANN Simulator

3 Optimization and Regularization
    3.1 Optimization
        3.1.1 Optimization Challenges
        3.1.2 Optimization Algorithms
        3.1.3 Optimization Strategies
    3.2 Regularization
        3.2.1 Capacity of the model
        3.2.2 L1 and L2 Regularization
        3.2.3 Early Stop
        3.2.4 Dropout
    3.3 DNN Pytorch
        3.3.1 Regression MLP Pytorch
        3.3.2 Binary Classification MLP Pytorch
        3.3.3 Multiclass Classification MLP Pytorch
    3.4 DNN Keras
        3.4.1 Regression MLP Keras
        3.4.2 Binary Classification MLP Keras
        3.4.3 Multiclass Classification MLP Keras

4 Convolutional Neural Networks
    4.1 The first approach
    4.2 Convolutional neural network
    4.3 Architecture of a traditional CNN
        4.3.1 Convolutional Layer
        4.3.2 Kernel hyperparameters
        4.3.3 Convolutions Over Volume
        4.3.4 The Conv Layer process
        4.3.5 Pooling Layer
        4.3.6 Fully-Connected Layer
    4.4 CNN Explainer
    4.5 Classic Networks
        4.5.1 LeNet-5
        4.5.2 AlexNet
        4.5.3 VGG-16
    4.6 CNN Pytorch
    4.7 CNN Keras

5 Sequence Models (Recurrent Neural Networks and LSTM)
    5.1 Recurrent Neural Networks
        5.1.1 Basic Recurrent Neural Network
        5.1.2 Architecture of a traditional RNN
        5.1.3 Different types of RNNs
        5.1.4 Backpropagation through time
        5.1.5 RNN Problems
        5.1.6 Multi-Layer RNNs
        5.1.7 The Problem of Long-Term Dependencies
    5.2 Gated Recurrent Unit
        5.2.1 Gates
        5.2.2 GRU Unit
    5.3 Long Short-Term Memory
        5.3.1 The Core Idea Behind LSTMs
        5.3.2 Step-by-Step LSTM Walk Through
        5.3.3 Example of LSTM Forecasting
    5.4 RNN/LSTM/GRU Pytorch
        5.4.1 Simple RNN/LSTM/GRU Pytorch One To One
        5.4.2 LSTM One to Many: Image Captioning
        5.4.3 LSTM Many to One: Sentiment Analysis
        5.4.4 LSTM Many to Many: Text Generation
        5.4.5 Time series Pytorch
    5.5 RNN Keras
        5.5.1 Basic RNN
        5.5.2 RNN GRU LSTM Keras
        5.5.3 RNN Many To One Keras
        5.5.4 RNN Many To Many Keras
        5.5.5 RNN Time Series Keras

6 Transformers
    6.1 Attention Is All You Need
        6.1.1 Attention
    6.2 Transformers Architecture
        6.2.1 Embedding & Positional Encoding
        6.2.2 Multi-Head Attention
        6.2.3 Add & Norm
        6.2.4 Global Architecture
    6.3 Transformers Implementation

7 Auto-encoders
    7.1 Autoencoders
        7.1.1 Latent Variable Models
        7.1.2 Generative Latent Variable Models
        7.1.3 Autoencoders Architecture
        7.1.4 Regularization in autoencoders
        7.1.5 Feed Forward Autoencoders
        7.1.6 AE Pytorch
        7.1.7 AE Keras
    7.2 Variational Autoencoders
        7.2.1 Variational Autoencoders Architecture
        7.2.2 Loss function
        7.2.3 VAE Pytorch
        7.2.4 VAE Keras

8 Generative Adversarial Networks
        8.0.1 Why GANs
    8.1 Generative Models: Review
    8.2 Applications of Generative Adversarial Networks
    8.3 GANs Model & Architecture
        8.3.1 Generator Model
        8.3.2 Discriminator Model
        8.3.3 Advantages of Generative Adversarial Networks (GANs)
        8.3.4 Disadvantages of Generative Adversarial Networks (GANs)
    8.4 GANs Implementation
        8.4.1 GANs Pytorch


0.1 Introduction
Deep Learning is a subfield of machine learning concerned with algorithms inspired by the structure and function of the brain, called artificial neural networks. Deep learning is a machine learning technique that teaches computers to do what comes naturally to humans: learn by example.

0.1.1 Machine learning vs Deep learning


Deep Learning can essentially do everything that machine learning does, but not the other way
around. For instance, machine learning is useful when the dataset is small and well-curated,
which means that the data is carefully preprocessed.
Generally, machine learning is alternatively termed shallow learning because it is very effective for smaller datasets. Deep learning, on the other hand, is extremely powerful when the dataset is large.
It can learn complex patterns from the data and can draw accurate conclusions on its own. In fact, deep learning is so powerful that it can even process unstructured data, i.e. data that is not neatly arranged, such as text corpora, social media activity, etc.



Furthermore, it can also generate new data samples and find anomalies that machine learning
algorithms and human eyes can miss.



0.1.2 History of Deep Learning



0.1.3 Application Fields
— Automated Driving: Automotive researchers are using deep learning to automatically detect objects such as stop signs and traffic lights. Besides, deep learning is used to detect pedestrians, which helps decrease accidents.

— Aerospace and Defense: Deep learning is used to identify objects from satellites that locate areas of interest, and identify safe or unsafe zones for troops.

— Medical Research: Cancer researchers are using deep learning to automatically detect cancer cells. Teams at UCLA built an advanced microscope that yields a high-dimensional data set used to train a deep learning application to accurately identify cancer cells.

— Industrial Automation: Deep learning is helping to improve worker safety around heavy machinery by automatically detecting when people or objects are within an unsafe distance of machines.

— Electronics: Deep learning is being used in automated hearing and speech translation. For example, home assistance devices that respond to your voice and know your preferences are powered by deep learning applications.



0.1.4 Tensors
A tensor is a generalization of vectors and matrices and is easily understood as a multidimensional array. A vector is a one-dimensional or first-order tensor and a matrix is a two-dimensional or second-order tensor.

Tensor notation is much like matrix notation, with a capital letter representing a tensor and lowercase letters with subscript integers representing scalar values within the tensor.

Tensor Python
# create tensor
from numpy import array
T = array([
[[1,2,3], [4,5,6], [7,8,9]],
[[11,12,13], [14,15,16], [17,18,19]],
[[21,22,23], [24,25,26], [27,28,29]],
])
print(T.shape)
print(T)



Pytorch Tensors

Tensors Pytorch
import torch
import math

x = torch.empty(3, 4)
print(type(x))
print(x)

zeros = torch.zeros(2, 3)
print(zeros)

ones = torch.ones(2, 3)
print(ones)

torch.manual_seed(1729)
random = torch.rand(2, 3)
print(random)

ones = torch.zeros(2, 2) + 1
twos = torch.ones(2, 2) * 2
threes = (torch.ones(2, 2) * 7 - 1) / 2
fours = twos ** 2
sqrt2s = twos ** 0.5

print(ones)
print(twos)
print(threes)
print(fours)
print(sqrt2s)

powers2 = twos ** torch.tensor([[1, 2], [3, 4]])


print(powers2)

fives = ones + fours


print(fives)

dozens = threes * fours


print(dozens)

t1 = torch.tensor([1, 2, 3, 4])
t2 = torch.tensor([5, 6, 7, 8])



# adding two tensors
print("tensor2 + tensor1")
print(torch.add(t2, t1))

# subtracting two tensor


print("\ntensor2 - tensor1")
print(torch.sub(t2, t1))

# multiplying two tensors


print("\ntensor2 * tensor1")
print(torch.mul(t2, t1))

# diving two tensors


print("\ntensor2 / tensor1")
print(torch.div(t2, t1))

TensorFlow Tensors

Tensors TensorFlow
import tensorflow as tf
import numpy as np

a = tf.constant([[1, 2],[3, 4]])


b = tf.constant([[1, 1],[1, 1]]) # Could have also said `tf.ones([2,2])`

print(tf.add(a, b), "\n")


print(tf.multiply(a, b), "\n")
print(tf.matmul(a, b), "\n")



Chapter 1
Linear Models
1.1 Multi Linear Regression

1.1.1 The Model

x ∈ R^d , y ∈ R, for n such examples.

h_\theta(x) = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \dots + \theta_d x_d

h_\theta(x) = \sum_{i=1}^{d} \theta_i x_i + \theta_0

x_0 = 1 → intercept

h_\theta(x) = \theta_0 x_0 + \theta_1 x_1 + \theta_2 x_2 + \dots + \theta_d x_d

h_\theta(x) = \sum_{i=0}^{d} \theta_i x_i = \theta^T x

1.1.2 Cost/Loss Function

J(\theta) = \frac{1}{2} \sum_{i=1}^{n} \left(h_\theta(x^{(i)}) - y^{(i)}\right)^2

\hat{\theta} = \arg\min_\theta J(\theta) = \arg\min_\theta \frac{1}{2} \sum_{i=1}^{n} \left(h_\theta(x^{(i)}) - y^{(i)}\right)^2

Note: you can add a 1/n factor to average over the n examples.

1.1.3 Gradient Descent
Least Mean Squares Algorithm

\theta \leftarrow \text{initialization}

\theta_j := \theta_j - \alpha \frac{\partial J(\theta)}{\partial \theta_j}

(This update is simultaneously performed for all values of j = 0, . . . , d.) Here, α is called the learning rate. This is a very natural algorithm that repeatedly takes a step in the direction of steepest decrease of J.

Let's first work it out for the case where we have only one training example (x, y), so that we can neglect the sum in the definition of J. We have:

\frac{\partial J(\theta)}{\partial \theta_j} = \frac{\partial}{\partial \theta_j} \frac{1}{2}\left(h_\theta(x) - y\right)^2

\frac{\partial J(\theta)}{\partial \theta_j} = 2 \cdot \frac{1}{2}\left(h_\theta(x) - y\right) \cdot \frac{\partial}{\partial \theta_j}\left(h_\theta(x) - y\right)

\frac{\partial J(\theta)}{\partial \theta_j} = \left(h_\theta(x) - y\right) \cdot \frac{\partial}{\partial \theta_j}\left(\sum_{i=0}^{d} \theta_i x_i - y\right)

\frac{\partial J(\theta)}{\partial \theta_j} = \left(h_\theta(x) - y\right) x_j

For a single training example, this gives the update rule:

\theta_j := \theta_j + \alpha \left(y^{(i)} - h_\theta(x^{(i)})\right) x_j^{(i)}

We'd derived the LMS rule for when there was only a single training example. There are several ways to modify this method for a training set of more than one example. The first is to replace it with the following algorithm:

Repeat until convergence {

\theta_j := \theta_j + \alpha \sum_{i=1}^{n} \left(y^{(i)} - h_\theta(x^{(i)})\right) x_j^{(i)}, \quad \text{(for every } j\text{)} \qquad (1)

}

By grouping the updates of the coordinates into an update of the vector θ, we can rewrite update (1) in a slightly more succinct way:

\theta := \theta + \alpha \sum_{i=1}^{n} \left(y^{(i)} - h_\theta(x^{(i)})\right) x^{(i)}



see: https://www.mladdict.com/linear-regression-simulator
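To make the batch update above concrete, here is a minimal NumPy sketch (the synthetic data, variable names and hyperparameters are illustrative, not from the course; the full Pytorch implementation follows in section 1.1.5):

Batch Gradient Descent NumPy (sketch)
import numpy as np

# Synthetic data: n examples, d features, with an intercept column x_0 = 1
rng = np.random.default_rng(0)
n, d = 100, 3
X = np.hstack([np.ones((n, 1)), rng.normal(size=(n, d))])
true_theta = np.array([4.0, 2.0, -1.0, 0.5])
y = X @ true_theta + 0.1 * rng.normal(size=n)

theta = np.zeros(d + 1)   # theta_0 ... theta_d
alpha = 0.01              # learning rate

for epoch in range(2000):
    residual = y - X @ theta                        # y^(i) - h_theta(x^(i)) for every example
    theta = theta + alpha * (X.T @ residual) / n    # batch LMS update, averaged over n

print(theta)   # should be close to true_theta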

1.1.4 Probabilistic interpretation (Cost Function)


When faced with a regression problem, why might linear regression, and specifically why might
the least-squares cost function J, be a reasonable choice ?
Let us assume that the target variables and the inputs are related via the equation :

y^{(i)} = \theta^T x^{(i)} + \epsilon^{(i)}

where \epsilon^{(i)} is Gaussian noise. Let us further assume that the \epsilon^{(i)} are distributed IID (independently and identically distributed) according to a Gaussian distribution (also called a normal distribution) with mean zero and some variance \sigma^2. We can write this assumption as \epsilon^{(i)} \sim \mathcal{N}(0, \sigma^2). The resulting density of y^{(i)} given x^{(i)} is:

p(y^{(i)} \mid x^{(i)}; \theta) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left(-\frac{(y^{(i)} - \theta^T x^{(i)})^2}{2\sigma^2}\right)



The notation p(y^{(i)} \mid x^{(i)}; \theta) indicates that this is the distribution of y^{(i)} given x^{(i)} and parameterized by θ.

One of the most commonly encountered ways of thinking in machine learning is the maximum likelihood point of view. This is the concept that, when working with a probabilistic model with unknown parameters (in our case the vector θ), the parameters which make the data have the highest probability are the most likely ones.

This resulting conditional probability is referred to as the likelihood of observing the data given the model parameters, and is written using the notation L(·) to denote the likelihood function.

L(\theta) = L(\theta; X, \vec{y}) = p(\vec{y} \mid X; \theta).

As such, the likelihood factorizes. According to the principle of maximum likelihood, the best values of the parameters are those that maximize the likelihood of the entire dataset:

L(\theta) = \prod_{i=1}^{n} p(y^{(i)} \mid x^{(i)}; \theta)

L(\theta) = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left(-\frac{(y^{(i)} - \theta^T x^{(i)})^2}{2\sigma^2}\right)

Now, given this probabilistic model relating the y (i) ’s and the x(i) ’s, what is a reasonable way of
choosing our best guess of the parameters θ ?

The principle of maximum likelihood says that we should choose θ so as to make the data as probable as possible.

Given the common use of log in the likelihood function, it is referred to as a log-likelihood
function. It is also common in optimization problems to prefer to minimize the cost function
rather than to maximize.

\ell(\theta) = \log L(\theta)

\ell(\theta) = \log \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left(-\frac{(y^{(i)} - \theta^T x^{(i)})^2}{2\sigma^2}\right)

\ell(\theta) = \sum_{i=1}^{n} \log \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left(-\frac{(y^{(i)} - \theta^T x^{(i)})^2}{2\sigma^2}\right)

For reasons of increased computational ease, it is often easier to minimise the negative of the log-likelihood rather than maximise the log-likelihood itself. Hence, we can "stick a minus sign in front of the log-likelihood" to give us the negative log-likelihood (NLL):

NLL(\theta) = -\sum_{i=1}^{n} \log \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left(-\frac{(y^{(i)} - \theta^T x^{(i)})^2}{2\sigma^2}\right)

NLL(\theta) = -\sum_{i=1}^{n} \frac{1}{2} \log\frac{1}{2\pi\sigma^2} + \frac{1}{2\sigma^2} \sum_{i=1}^{n} \left(y^{(i)} - \theta^T x^{(i)}\right)^2

NLL(\theta) = -\frac{n}{2} \log\frac{1}{2\pi\sigma^2} + \frac{1}{2\sigma^2} \sum_{i=1}^{n} \left(y^{(i)} - \theta^T x^{(i)}\right)^2

The term \sum_{i=1}^{n} (y^{(i)} - \theta^T x^{(i)})^2 is the Residual Sum of Squares, also known as the Sum of Squared Errors (SSE). Since the first term does not depend on θ, minimizing the NLL is equivalent to minimizing the least-squares cost J(θ).



1.1.5 Multi Linear Regression with Pytorch

Pytorch Linear Regression


#https://www.kaggle.com/code/joseguzman/multiple-regression-explained-with-pytorch/notebook
import pandas as pd

data = pd.read_csv("datasets/Advertising.csv")
data.head()
data.shape[0]

import seaborn as sns
import matplotlib.pyplot as plt   # needed for plt.show() below


data_sales = pd.DataFrame(data['Sales'])

sns.displot(data_sales, x="Sales",kde=True)

plt.show()

import torch
import torch.nn as nn
from tqdm import tqdm #progress Bar

tv = torch.tensor( data = data.TV.values, dtype = torch.float ) # x_1


radio = torch.tensor( data = data.Radio.values, dtype = torch.float) # x_2
news = torch.tensor( data = data.Newspaper.values, dtype = torch.float) # x_3

sales = torch.tensor( data = data.Sales.values, dtype = torch.float ) # targets

theta0 = torch.randn(1, requires_grad = True) # start with a random number from a normal distribution
theta1 = torch.randn(1, requires_grad = True)
theta2 = torch.randn(1, requires_grad = True)
theta3 = torch.randn(1, requires_grad = True)

def mylnmodel(tv: torch.Tensor, radio: torch.Tensor, news: torch.Tensor):
    """
    Computes f(x; theta0, theta1, theta2, theta3) = theta0 + theta1*x_1 + theta2*x_2 + theta3*x_3,
    for independent variables x_1, x_2 and x_3.

    Arguments:
        tv (tensor) with the values of tv investment (x_1)
        radio (tensor) with the values of radio investment (x_2)
        news (tensor) with the newspaper investment (x_3)

    Note: coefficients theta0, theta1, theta2 and theta3 must be previously
    defined as tensors with requires_grad = True.

    Returns a tensor with the backward() method.
    """
    return theta0 + theta1*tv + theta2*radio + theta3*news

# generate the first prediction


predicted = mylnmodel(tv, radio, news)
predicted.shape

# compare it with targets


sales.shape

import matplotlib.pyplot as plt


plt.figure(figsize=(3,3))
plt.scatter(sales, predicted.detach(), c='k', s=4)
plt.xlabel('sales'), plt.ylabel('predicted');
x = y = range(100)
plt.plot(x,y, c='brown')
plt.xlim(0,100), plt.ylim(0,120);
plt.text(60,50, f'theta0 = {theta0.item():2.4f}', fontsize=10);
plt.text(60,40, f'tv = {theta1.item():2.4f}', fontsize=10);
plt.text(60,30, f'radio = {theta2.item():2.4f}', fontsize=10);
plt.text(60,20, f'news = {theta3.item():2.4f}', fontsize=10);

def MSE(y_predicted: torch.Tensor, y_target: torch.Tensor):
    """
    Returns a single-value tensor with the mean squared error (MSE)
    between the predicted and target values.
    """
    error = y_predicted - y_target              # element-wise subtraction
    return torch.sum(error**2) / error.numel()  # mean (sum/n)

predicted = mylnmodel(tv,radio,news)
loss = MSE(y_predicted = predicted, y_target=sales)
print(loss)

# initial values for the coefficients is random, gradients are not calculated
print(f'theta0 = {float(theta0.item()):+2.4f}, df(a)/da = {theta0.grad}')
print(f'theta1 = {float(theta1.item()):+2.4f}, df(b)/da = {theta1.grad}')
print(f'theta2 = {float(theta2.item()):+2.4f}, df(c)/dc = {theta2.grad}')
print(f'theta3 = {float(theta3.item()):+2.4f}, df(d)/dd = {theta3.grad}')

loss.backward()



print(f'theta0 = {float(theta0.item()):+2.4f}, df(a)/da = {theta0.grad}')
print(f'theta1 = {float(theta1.item()):+2.4f}, df(b)/da = {theta1.grad}')
print(f'theta2 = {float(theta2.item()):+2.4f}, df(c)/dc = {theta2.grad}')
print(f'theta3 = {float(theta3.item()):+2.4f}, df(d)/dd = {theta3.grad}')

## Use gradient descent

myMSE = list()
for i in tqdm(range(5_000)):
    theta0.grad.zero_()
    theta1.grad.zero_()
    theta2.grad.zero_()
    theta3.grad.zero_()

    predicted = mylnmodel(tv, radio, news)                  # forward pass (compute results)
    loss = MSE(y_predicted = predicted, y_target = sales)   # calculate MSE
    loss.backward()                                         # compute gradients
    myMSE.append(loss.item())                               # append loss

    with torch.no_grad():
        theta0 -= theta0.grad * 1e-6
        theta1 -= theta1.grad * 1e-6
        theta2 -= theta2.grad * 1e-6
        theta3 -= theta3.grad * 1e-6

plt.plot(myMSE);
plt.xlabel('Epoch (#)'), plt.ylabel('Mean squared Errors')

plt.figure(figsize=(3,3))
plt.scatter(sales, predicted.detach(), c='k', s=4)
plt.xlabel('sales'), plt.ylabel('predicted');
x = y = range(30)
plt.plot(x,y, c='brown')
plt.xlim(0,35), plt.ylim(0,35);
plt.text(25, 15, f'theta0 = {theta0.item():2.4f}', fontsize=8)
plt.text(25, 12, f'tv = {theta1.item():2.4f}', fontsize=8)
plt.text(25, 9, f'radio = {theta2.item():2.4f}', fontsize=8)
plt.text(25, 6, f'news = {theta3.item():2.4f}', fontsize=8)

Pytorch Linear Regression Matrix version


#https://www.kaggle.com/code/joseguzman/multiple-regression-explained-with-pytorch/notebook
import pandas as pd

data = pd.read_csv("datasets/Advertising.csv")



data.head()

# matrix form
# costum_data: a 200 x 4 matrix (intercept column + the three features)
costum_data = data.loc[:,['TV','Radio','Newspaper']]

costum_data.insert(loc=0,
column='theta0',
value=1)

X = torch.tensor(costum_data.values)
y = torch.tensor(data.Sales.values)
print(X.shape)

theta = torch.rand(4, dtype=torch.double, requires_grad = True)


theta

def model(X: torch.Tensor):
    """
    Performs the matrix-vector multiplication X @ theta
    """
    assert len(X.shape) == 2
    return X @ theta.T

predicted = model(X)
loss = MSE(y_predicted = predicted, y_target=y)
print(loss)

## Use gradient descent

myMSE = list()
for i in tqdm(range(10_000)):

    predicted = model(X)                                # forward pass (compute results)
    loss = MSE(y_predicted = predicted, y_target = y)   # calculate MSE
    loss.backward()                                     # compute gradients
    myMSE.append(loss.item())                           # append loss (as a plain float)

    with torch.no_grad():
        theta -= theta.grad * 1e-6

    theta.grad.zero_()



plt.plot(myMSE);
plt.xlabel('Epoch (#)'), plt.ylabel('Mean squared Errors')

plt.figure(figsize=(3,3))
plt.scatter(sales, predicted.detach(), c='gray', s=4)
plt.xlabel('sales'), plt.ylabel('predicted');
x = y = range(30)
plt.plot(x,y, c='brown')
plt.xlim(0,35), plt.ylim(0,35);

theta0 , tv, radio, news = theta.T

plt.text(25, 15, f'theta0 = {theta0:2.4f}', fontsize=8)


plt.text(25, 12, f'tv = {tv:2.4f}', fontsize=8)
plt.text(25, 9, f'radio = {radio:2.4f}', fontsize=8)
plt.text(25, 6, f'newspaper = {news:2.4f}', fontsize=8)

1.2 Perceptron

1.2.1 Perceptron Model

x ∈ R^d , y ∈ {0, 1}

h_\theta(x) = g(\theta^T x)

g(z) = \begin{cases} 1 & \text{if } z \ge 0 \\ 0 & \text{if } z < 0 \end{cases} \qquad (1.1)

Binary step function:



Perceptron schema

If we then let h_\theta(x) = g(\theta^T x) as before, but using this modified definition of g, and if we use the update rule

\theta_j := \theta_j + \alpha \left(y^{(i)} - h_\theta(x^{(i)})\right) x_j^{(i)}

then we have the perceptron learning algorithm.

1.2.2 Learning Algorithm

\vec{\theta} \leftarrow init(\vec{0})
For i in 1, 2, ..., n:
    \theta_j := \theta_j + \alpha \left(y^{(i)} - h_\theta(x^{(i)})\right) x_j^{(i)}
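As an illustration of this learning loop (a minimal NumPy sketch on synthetic, linearly separable data; variable names and the separate bias term are illustrative, not the course's reference code; the full Pytorch implementation follows in 1.2.3):

Perceptron NumPy (sketch)
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(float)   # linearly separable toy labels

theta = np.zeros(2)
bias = 0.0
alpha = 1.0

def g(z):
    # binary step function
    return (z >= 0).astype(float)

for epoch in range(20):
    for i in range(len(y)):
        error = y[i] - g(X[i] @ theta + bias)   # y^(i) - h_theta(x^(i))
        theta += alpha * error * X[i]           # perceptron update
        bias  += alpha * error

print((g(X @ theta + bias) == y).mean())        # training accuracy, close to 1.0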



1.2.3 Perceptron with Pytorch

Pytorch Perceptron
import numpy as np
import matplotlib.pyplot as plt
import torch
import pandas as pd
%matplotlib inline

def custom_where(cond, x_1, x_2):
    return (cond * x_1) + ((~(cond)) * x_2)

class Perceptron():
    def __init__(self, num_features):
        self.num_features = num_features
        self.weights = torch.zeros(num_features, 1, dtype=torch.float32, device=device)
        self.bias = torch.zeros(1, dtype=torch.float32, device=device)

    def forward(self, x):
        linear = torch.add(torch.mm(x, self.weights), self.bias)
        predictions = custom_where(linear > 0., 1, 0).float()
        return predictions

    def backward(self, x, y):
        predictions = self.forward(x)
        errors = y - predictions
        return errors

    def train(self, x, y, epochs):
        for e in range(epochs):
            for i in range(y.size()[0]):
                # use view because backward expects a matrix (i.e., 2D tensor)
                errors = self.backward(x[i].view(1, self.num_features), y[i]).view(-1)
                self.weights += (errors * x[i]).view(self.num_features, 1)
                self.bias += errors

    def evaluate(self, x, y):
        predictions = self.forward(x).view(-1)
        accuracy = torch.sum(predictions == y).float() / y.size()[0]
        return accuracy

data = pd.read_csv("datasets/diabetes.csv")
data

array = data.values
X, y = array[:,0:2] , array[:,8]

from sklearn.model_selection import train_test_split


X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.2)

from sklearn.preprocessing import StandardScaler


sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform (X_test)

plt.scatter(X_train[y_train==0, 0], X_train[y_train==0, 1], label='class 0', marker='o')
plt.scatter(X_train[y_train==1, 0], X_train[y_train==1, 1], label='class 1', marker='s')
plt.xlabel('feature 1')
plt.ylabel('feature 2')
plt.legend()
plt.show()

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

ppn = Perceptron(num_features=2)

X_train_tensor = torch.tensor(X_train, dtype=torch.float32, device=device)


y_train_tensor = torch.tensor(y_train, dtype=torch.float32, device=device)

ppn.train(X_train_tensor, y_train_tensor, epochs=5)



print('Model parameters:')
print(' Weights: %s' % ppn.weights)
print(' Bias: %s' % ppn.bias)

X_test_tensor = torch.tensor(X_test, dtype=torch.float32, device=device)


y_test_tensor = torch.tensor(y_test, dtype=torch.float32, device=device)

test_acc = ppn.evaluate(X_test_tensor, y_test_tensor)


print('Test set accuracy: %.2f%%' % (test_acc*100))

w, b = ppn.weights, ppn.bias

x_min = -2
y_min = ( (-(w[0] * x_min) - b[0])
/ w[1] )

x_max = 2
y_max = ( (-(w[0] * x_max) - b[0])
/ w[1] )

fig, ax = plt.subplots(1, 2, sharex=True, figsize=(7, 3))

ax[0].plot([x_min, x_max], [y_min, y_max])


ax[1].plot([x_min, x_max], [y_min, y_max])

ax[0].scatter(X_train[y_train==0, 0], X_train[y_train==0, 1], label='class 0', marker='o')
ax[0].scatter(X_train[y_train==1, 0], X_train[y_train==1, 1], label='class 1', marker='s')

ax[1].scatter(X_test[y_test==0, 0], X_test[y_test==0, 1], label='class 0', marker='o')
ax[1].scatter(X_test[y_test==1, 0], X_test[y_test==1, 1], label='class 1', marker='s')

ax[1].legend(loc='upper left')
plt.show()



1.3 Logistic Regression

1.3.1 Logistic Regression Model

x ∈ Rd , y ∈ {0, 1}

y (i) = 1 Positive example. y (i) = 0 Negative example.

h_\theta(x) = g(\theta^T x)

We will now derive the function g.

Log odds play an important role in logistic regression, as they convert the LR model from a probability-based to a likelihood-based model. Both probability and log odds have their own set of properties; however, log odds make interpreting the output easier. Thus, using log odds is slightly more advantageous than using probability.

Odds: Simply put, odds are the chances of success divided by the chances of failure, represented in the form of a ratio (as shown in the equation below):

\text{odds} = \frac{p}{1 - p}

where p is the probability of success and 1 − p the probability of failure.

The log of the odds is given by:

\log(\text{odds}) = \log\left(\frac{p}{1 - p}\right)

We assume:

\log\left(\frac{p}{1 - p}\right) = \theta^T x

\frac{p}{1 - p} = \exp(\theta^T x)

p = \frac{\exp(\theta^T x)}{1 + \exp(\theta^T x)}

p = \frac{1}{1 + \exp(-\theta^T x)}

g(z) = \frac{1}{1 + e^{-z}}
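As a quick sanity check (a minimal Pytorch sketch; the values of z are arbitrary), the sigmoid g(z) and the log-odds are inverses of each other:

Sigmoid / Log-Odds Pytorch (sketch)
import torch

z = torch.tensor([-2.0, 0.0, 3.0])   # z = theta^T x
p = torch.sigmoid(z)                 # p = 1 / (1 + exp(-z))
log_odds = torch.log(p / (1 - p))    # recovers z

print(p)
print(log_odds)                      # approximately tensor([-2., 0., 3.])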



Logistic Regression Architecture

1.3.2 Probability Interpretation (Cost Function)


In the logistic model, the output variable y^{(i)} is a Bernoulli random variable (it can take only two values, either 1 or 0) and:

p(y^{(i)} = 1 \mid x^{(i)}; \theta) = h_\theta(x^{(i)})
p(y^{(i)} = 0 \mid x^{(i)}; \theta) = 1 - h_\theta(x^{(i)})

p(y \mid x; \theta) = [h_\theta(x)]^{y}\,[1 - h_\theta(x)]^{(1-y)}



The likelihood of the observations can be written as:

L(\theta) = \prod_{i=1}^{n} p(y^{(i)} \mid x^{(i)}; \theta)

L(\theta) = \prod_{i=1}^{n} [h_\theta(x^{(i)})]^{y^{(i)}}\,[1 - h_\theta(x^{(i)})]^{(1-y^{(i)})}

\ell(\theta) = \log L(\theta) = \sum_{i=1}^{n} y^{(i)} \log\left(h_\theta(x^{(i)})\right) + (1 - y^{(i)}) \log\left(1 - h_\theta(x^{(i)})\right)

with h_\theta(x) = g(\theta^T x).

How do we maximize the likelihood? Similar to our derivation in the case of linear regression, we can use gradient ascent. Written in vectorial notation, our updates will therefore be given by θ := θ + α∇_θ ℓ(θ). For a single example:

\nabla_\theta \ell(\theta) = \left[y - h_\theta(x)\right] x

For one example ("SGD"):

\theta_j := \theta_j + \alpha \left[y^{(i)} - h_\theta(x^{(i)})\right] x_j^{(i)}

For all examples:

\theta_j := \theta_j + \alpha \sum_{i=1}^{n} \left[y^{(i)} - h_\theta(x^{(i)})\right] x_j^{(i)}

You can add 1/n to average over the examples:

\theta_j := \theta_j + \alpha\,\frac{1}{n} \sum_{i=1}^{n} \left[y^{(i)} - h_\theta(x^{(i)})\right] x_j^{(i)}

1.3.3 Logistic Regression Pytorch

Pytorch Logistic Regression

import pandas as pd   # needed for pd.read_csv below
import seaborn as sns
import numpy as np
from sklearn.model_selection import train_test_split
import matplotlib
from tqdm import tqdm
import torch
import matplotlib.pyplot as plt

data = pd.read_csv("datasets/diabetes.csv")

array = data.values
X, y = array[:,0:2] , array[:,8]

class LogisticRegression(torch.nn.Module):
    def __init__(self, input_dim, output_dim):
        super(LogisticRegression, self).__init__()
        self.linear = torch.nn.Linear(input_dim, output_dim)

    def forward(self, x):
        outputs = torch.sigmoid(self.linear(x))
        return outputs

epochs = 200000
input_dim = 2 # Two inputs x1 and x2
output_dim = 1 # Single binary output
learning_rate = 0.01

model = LogisticRegression(input_dim,output_dim)
criterion = torch.nn.BCELoss()# binary cross entropy
optimizer = torch.optim.SGD(model.parameters(), lr=learning_rate)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)

X_train, X_test = torch.Tensor(X_train),torch.Tensor(X_test)


y_train, y_test = torch.Tensor(y_train),torch.Tensor(y_test)

losses = []
losses_test = []
Iterations = []

iter = 0
for epoch in tqdm(range(int(epochs)),desc='Training Epochs'):
x = X_train
labels = y_train
optimizer.zero_grad() # Setting our stored gradients equal to zero
outputs = model(X_train)
loss = criterion(torch.squeeze(outputs), labels) # [200,1] -squeeze-> [200]

loss.backward() # Computes the gradient of the given tensor w.r.t. graph leaves

optimizer.step() # Updates weights and biases with the optimizer (SGD)

iter+=1
if iter%10000==0:
# calculate Accuracy
with torch.no_grad():
# Calculating the loss and accuracy for the test dataset



correct_test = 0
total_test = 0
outputs_test = torch.squeeze(model(X_test))
loss_test = criterion(outputs_test, y_test)

predicted_test = outputs_test.round().detach().numpy()
total_test += y_test.size(0)
correct_test += np.sum(predicted_test == y_test.detach().numpy())
accuracy_test = 100 * correct_test/total_test
losses_test.append(loss_test.item())

# Calculating the loss and accuracy for the train dataset


total = 0
correct = 0
total += y_train.size(0)
correct += np.sum(torch.squeeze(outputs).round().detach().numpy() ==
y_train.detach().numpy())
accuracy = 100 * correct/total
losses.append(loss.item())
Iterations.append(iter)

print(f"Iteration: {iter}. \nTest - Loss: {loss_test.item()}. Accuracy:


{accuracy_test}")
print(f"Train - Loss: {loss.item()}. Accuracy: {accuracy}\n")

def model_plot(model, X, y, title):
    parm = {}
    b = []
    for name, param in model.named_parameters():
        parm[name] = param.detach().numpy()

    w = parm['linear.weight'][0]
    b = parm['linear.bias'][0]
    plt.scatter(X[:, 0], X[:, 1], c=y, cmap='jet')
    u = np.linspace(X[:, 0].min(), X[:, 0].max(), 2)
    plt.plot(u, (0.5 - b - w[0]*u) / w[1])
    plt.xlim(X[:, 0].min() - 0.5, X[:, 0].max() + 0.5)
    plt.ylim(X[:, 1].min() - 0.5, X[:, 1].max() + 0.5)
    plt.xlabel(r'$x_1$', fontsize=16)
    plt.ylabel(r'$x_2$', fontsize=16)
    plt.title(title)
    plt.show()

# Train Data
model_plot(model,X_train,y_train,'Train Data')



# Test Dataset Results
model_plot(model,X_test,y_test,'Test Data')

1.4 SoftMax Regression

1.4.1 MultiClass Classification

We have three decision boundaries according to the three θ vectors:

\theta_{[1]}^T x, \quad \theta_{[2]}^T x, \quad \theta_{[3]}^T x

Each class has its own θ vector. We compute the dot product between the sample and all the θ vectors, then we choose the highest value. For example, if the highest value is \theta_{[1]}^T x, then the sample belongs to class 1.

1.4.2 SoftMax Model


Softmax regression (or multinomial logistic regression) is a generalization of logistic regression
to the case where we want to handle multiple classes.

Consider a classification problem which involves k classes. Let x be the feature vector and y the corresponding class; our predicted variable follows a multinomial distribution, that is y ∈ {1, 2, ..., k}.

Now, we would like to model the probability of y given x, P(y|x), which is a vector of the probabilities of y being each of the classes given the features:

 
P(y \mid x) = \begin{bmatrix} P(y = 1 \mid x) = \phi_1 \\ P(y = 2 \mid x) = \phi_2 \\ P(y = 3 \mid x) = \phi_3 \\ \vdots \\ P(y = k \mid x) = \phi_k \end{bmatrix} \qquad (1.2)



with:

\sum_{i=1}^{k} \phi_i = 1

We can assume that the log-odds of y = i with respect to y = k have a linear relationship with the independent variable x, for i = 1, 2, ..., k:

\ln(\text{odd}_i) = \theta_i^T x

\ln\left(\frac{P(y = i \mid x)}{P(y = k \mid x)}\right) = \theta_i^T x

\frac{P(y = i \mid x)}{P(y = k \mid x)} = e^{\theta_i^T x}

P(y = i \mid x) = e^{\theta_i^T x}\, P(y = k \mid x)

Since the sum of P(y = j | x) for j = 1, 2, 3, ..., k is equal to 1:

\sum_{j=1}^{k} P(y = j \mid x) = \sum_{j=1}^{k} e^{\theta_j^T x}\, P(y = k \mid x) = P(y = k \mid x) \sum_{j=1}^{k} e^{\theta_j^T x} = 1

P(y = k \mid x) = \frac{1}{\sum_{j=1}^{k} e^{\theta_j^T x}}

By substitution:

P(y = i \mid x) = e^{\theta_i^T x}\, P(y = k \mid x) = \frac{e^{\theta_i^T x}}{\sum_{j=1}^{k} e^{\theta_j^T x}}

SoftMax Architecture
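As an illustration (a minimal Pytorch sketch; the class scores are arbitrary), the softmax turns the k scores \theta_i^T x into probabilities that sum to 1:

SoftMax Pytorch (sketch)
import torch

scores = torch.tensor([[2.0, 1.0, 0.1]])   # theta_i^T x for k = 3 classes
probas = torch.softmax(scores, dim=1)      # e^{theta_i^T x} / sum_j e^{theta_j^T x}

print(probas)                 # e.g. tensor([[0.6590, 0.2424, 0.0986]])
print(probas.sum())           # tensor(1.)
print(probas.argmax(dim=1))   # predicted class: tensor([0])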

1.4.3 Probability Interpretation (Cost Function)


Before we proceed, let's introduce the indicator function, which outputs 1 if its argument is true and 0 otherwise:

1\{y = i\} = \begin{cases} 1 & \text{if } y = i \text{ is true} \\ 0 & \text{otherwise} \end{cases} \qquad (1.3)



To get the likelihood on the training data, we need to compute the probability of y^{(i)} given x^{(i)} for i = 1, 2, 3, ..., m:

P(y^{(i)} \mid x^{(i)}) = \prod_{l=1}^{k} \left(P(y^{(i)} = l \mid x^{(i)})\right)^{1\{y^{(i)} = l\}}

We can compute the likelihood function L(θ) as follows:

L(\theta) = \prod_{i=1}^{m} P(y^{(i)} \mid x^{(i)})

L(\theta) = \prod_{i=1}^{m} \prod_{l=1}^{k} \left(P(y^{(i)} = l \mid x^{(i)})\right)^{1\{y^{(i)} = l\}}

L(\theta) = \prod_{i=1}^{m} \prod_{l=1}^{k} \left(\frac{e^{\theta_l^T x^{(i)}}}{\sum_{j=1}^{k} e^{\theta_j^T x^{(i)}}}\right)^{1\{y^{(i)} = l\}}

The log-likelihood function:

\ell(\theta) = \ln L(\theta)

\ell(\theta) = \sum_{i=1}^{m} \sum_{l=1}^{k} 1\{y^{(i)} = l\} \ln\left(\frac{e^{\theta_l^T x^{(i)}}}{\sum_{j=1}^{k} e^{\theta_j^T x^{(i)}}}\right)

The partial derivative of the loss function with respect to any element of the weight matrix is:

\frac{\partial(-\ell(\theta))}{\partial \theta_{gh}} = \sum_{i=1}^{m} \left(\frac{e^{\theta_g^T x^{(i)}}}{\sum_{j=1}^{k} e^{\theta_j^T x^{(i)}}} - 1\{y^{(i)} = g\}\right) x_h^{(i)}

The update rule for each iteration of gradient descent:

\theta_{gh} := \theta_{gh} - \alpha \sum_{i=1}^{m} \left(\frac{e^{\theta_g^T x^{(i)}}}{\sum_{j=1}^{k} e^{\theta_j^T x^{(i)}}} - 1\{y^{(i)} = g\}\right) x_h^{(i)}

1.4.4 SoftMax Pytorch

Pytorch Softmax Regression


%matplotlib inline
import matplotlib.pyplot as plt
import pandas as pd
import torch
import torch.nn.functional as F

df = pd.read_csv('datasets/Iris.csv')
df.head()

array = df.values
X, y = array[:,1:5] , array[:,5]



y = df['Species'].map({'Iris-setosa':0 , 'Iris-versicolor':1 , 'Iris-virginica':2})

from sklearn.model_selection import train_test_split


X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.2)

from sklearn.preprocessing import StandardScaler


sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform (X_test)

fig, ax = plt.subplots(1, 2, figsize=(7, 2.5))


ax[0].scatter(X_train[y_train == 2, 0], X_train[y_train == 2, 1])
ax[0].scatter(X_train[y_train == 1, 0], X_train[y_train == 1, 1], marker='v')
ax[0].scatter(X_train[y_train == 0, 0], X_train[y_train == 0, 1], marker='s')
ax[1].scatter(X_test[y_test == 2, 0], X_test[y_test == 2, 1])
ax[1].scatter(X_test[y_test == 1, 0], X_test[y_test == 1, 1], marker='v')
ax[1].scatter(X_test[y_test == 0, 0], X_test[y_test == 0, 1], marker='s')
plt.show()

# dataframe / array to tensor


X_train_tensor, X_test_tensor = torch.Tensor(X_train),torch.Tensor(X_test)
y_train_tensor, y_test_tensor = torch.Tensor(y_train.values), torch.Tensor(y_test.values)

class SoftmaxRegression(torch.nn.Module):

    def __init__(self, num_features, num_classes):
        super(SoftmaxRegression, self).__init__()
        self.linear = torch.nn.Linear(num_features, num_classes)
        # initialize weights to zeros here,
        # since we used zero weights in the
        # manual approach
        self.linear.weight.detach().zero_()
        self.linear.bias.detach().zero_()
        # Note: the trailing underscore
        # means "in-place operation" in the context
        # of PyTorch

    def forward(self, x):
        logits = self.linear(x)
        probas = F.softmax(logits, dim=1)
        return logits, probas

    def predict_labels(self, x):
        logits, probas = self.forward(x)
        labels = torch.argmax(probas, dim=1)
        return labels

    def evaluate(self, x, y):
        labels = self.predict_labels(x).float()
        accuracy = torch.sum(labels.view(-1) == y.float()).item() / y.size(0)
        return accuracy

DEVICE = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")   # select GPU if available

model2 = SoftmaxRegression(num_features=4, num_classes=3).to(DEVICE)
optimizer = torch.optim.SGD(model2.parameters(), lr=0.1)

def comp_accuracy(true_labels, pred_labels):
    accuracy = torch.sum(true_labels.view(-1).float() == pred_labels.float()).item() / true_labels.size(0)
    return accuracy

num_epochs = 50
for epoch in range(num_epochs):

#### Compute outputs ####


logits, probas = model2(X_train_tensor)

#### Compute gradients ####


cost = F.cross_entropy(logits, y_train_tensor.long())
optimizer.zero_grad()
cost.backward()

#### Update weights ####


optimizer.step()

#### Logging ####


logits, probas = model2(X_train_tensor)
acc = comp_accuracy(y_train_tensor, torch.argmax(probas, dim=1))
print('Epoch: %03d' % (epoch + 1), end="")
print(' | Train ACC: %.3f' % acc, end="")
print(' | Cost: %.3f' % F.cross_entropy(logits, y_train_tensor.long()))

print('\nModel parameters:')
print(' Weights: %s' % model2.linear.weight)
print(' Bias: %s' % model2.linear.bias)



test_acc = model2.evaluate(X_test_tensor, y_test_tensor)
print('Test set accuracy: %.2f%%' % (test_acc*100))



Chapter 2
Deep Neural networks
2.1 Computational Graphs
A computational graph is a way to represent a math function in the language of graph theory.
Recall the premise of graph theory : nodes are connected by edges, and everything in the graph
is either a node or an edge.

In a computational graph, nodes are either input values or functions for combining values. Edges receive their weights as the data flows through the graph. Outbound edges from an input node are weighted with that input value; outbound edges from a function node are weighted by combining the weights of the inbound edges using the specified function.

These can be instantiated for two types of computations :


— Forward computation
— Backward computation

A few essential terms of a computational graph are explained below.

— Node : A node in a graph is used to indicate a variable. The variable may be a scalar,
vector, matrix, tensor, or even a variable of another type.
— Edge : An edge represents a function argument and also data dependency. These are just
like pointers to nodes.
— Operation : An operation is a simple function of one or more variables. There is a fixed
set of allowable operations. Functions more complicated than these operations in this set
may be described by composing many operations together.

2.1.1 Types of computational graphs


Though both libraries (Pytorch & TensorFlow) employ a directed acyclic graph (or DAG) for representing their machine learning and deep learning models, there is still a big difference in how they let their data and calculations flow through the graph.
The subtle difference between the two libraries is that while TensorFlow (v < 2.0) allows static graph computations, Pytorch allows dynamic graph computations.

Type 1 : Static Computational Graphs

Properties of nodes & edges : The nodes represent the operations that are applied directly on
the data flowing in and out through the edges. For the above set of equations, we can keep the
following things in mind while implementing it in TensorFlow :
— Involves two phases:
  Phase 1: make a plan for your architecture.
  Phase 2: to train the model and generate predictions, feed it a lot of data.
— Since the inputs act as the edges of the graph, we can use the tf.placeholder() object, which can take any input of the desired datatype.
— For calculating the output ‘c’, we define a simple multiplication operation and start a TensorFlow session where we pass in the required input values through the feed_dict attribute in the session.run() method for calculating the outputs and the gradients.
Static Graph TF1
# Importing tensorflow version 1
import tensorflow.compat.v1 as tf
tf.disable_v2_behavior()

# Initializing placeholder variables of


# the graph
a = tf.placeholder(tf.float32)
b = tf.placeholder(tf.float32)

# Defining the operation


c = tf.multiply(a, b)

# Instantiating a tensorflow session


with tf.Session() as sess:

    # Computing the output of the graph by giving
    # respective input values
    out = sess.run(c, feed_dict={a: [15.0], b: [20.0]})[0]

    # Computing the gradient of the output with
    # respect to the input 'a'
    derivative_out_a = sess.run(tf.gradients(c, a), feed_dict={a: [15.0], b: [20.0]})[0][0]

    # Computing the gradient of the output with
    # respect to the input 'b'
    derivative_out_b = sess.run(tf.gradients(c, b), feed_dict={a: [15.0], b: [20.0]})[0][0]

# Displaying the outputs


print(f'c = {out}')
print(f'Derivative of c with respect to a = {derivative_out_a}')
print(f'Derivative of c with respect to b = {derivative_out_b}')

Advantages :

— Since the graph is static, it provides many possibilities of optimizations in structure and
resource distribution.
— The computations are slightly faster than a dynamic graph because of the fixed structure.



Disadvantages :

— Scales poorly to variable dimension inputs. For example, A CNN(Convolutional Neural


network) architecture with a static computation graph trained on 28×28 images wouldn’t
perform well on images of different sizes like 100×100 without a lot of pre-processing
boilerplate code.
— Poor debugging. These are very difficult to debug, primarily because the user doesn’t have
any access to how the information flow occurs.

Type 2 : Dynamic Computational Graphs

Properties of nodes & edges : The nodes represent the data(in form of tensors) and the edges
represent the operations applied to the input data.

For the equations given in the Introduction, we can keep the following things in mind while
implementing it in Pytorch

— As the forward computation is performed, the graph is implicitly defined.


— Since everything in Pytorch is created dynamically, we don’t need any placeholders and
can define our inputs and operations on the fly.
— After defining the inputs and computing the output ‘c’, we call the backward() method,
which calculates the corresponding partial derivatives with respect to the two inputs ac-
cessible through the .grad specifier.
Dynamic Graph Pytorch
# Importing torch
import torch

# Initializing input tensors


a = torch.tensor(15.0, requires_grad=True)
b = torch.tensor(20.0, requires_grad=True)

# Computing the output


c = a * b

# Computing the gradients


c.backward()

# Collecting the output gradient of the


# output with respect to the input 'a'
derivative_out_a = a.grad

# Collecting the output gradient of the


# output with respect to the input 'b'
derivative_out_b = b.grad

# Displaying the outputs


print(f'c = {c}')
print(f'Derivative of c with respect to a = {derivative_out_a}')



print(f'Derivative of c with respect to b = {derivative_out_b}')

Advantages :

— Scalability to different dimensional inputs : Scales very well for different dimensional inputs
as a new pre-processing layer can be dynamically added to the network itself.
— Ease in debugging : These are very easy to debug and are one of the reasons why many
people are shifting from Tensorflow to Pytorch. As the nodes are created dynamically
before any information flows through them, the error becomes very easy to spot as the
user is in complete control of the variables used in the training process.
Disadvantages :

— Allows very little room for graph optimization because a new graph needs to be created
for each training instance/batch.

2.1.2 Forward computation


The forward propagation refers to the flow of data from the input to the output of the network. So after forward propagation for an input x, you get an output ŷ.

For example, consider the relatively simple expression f(x, y, z) = (x + y) * z. This is how we would represent that function as a computational graph:



Computational Graph Pytorch
import torch
from IPython.display import display, Math
# Define the graph a,b,c,d are leaf nodes and e is the root node
# The graph is constructed with every line since the
# computational graphs are dynamic in PyTorch
a = torch.tensor([2.0])
b = torch.tensor([3.0])
c = torch.tensor([5.0])
d = torch.tensor([10.0])
u = a*b
t = torch.log10(d)
v = t*c
e = u+v

print(f'a.is_leaf: {a.is_leaf}')
print(f'a.grad_fn: {a.grad_fn}')
print(f'a.grad: {a.grad}')
print()

print(f'b.is_leaf: {b.is_leaf}')
print(f'b.grad_fn: {b.grad_fn}')
print(f'b.grad: {b.grad}')
print()

print(f'c.is_leaf: {c.is_leaf}')
print(f'c.grad_fn: {c.grad_fn}')
print(f'c.grad: {c.grad}')
print()

print(f'd.is_leaf: {d.is_leaf}')



print(f'd.grad_fn: {d.grad_fn}')
print(f'd.grad: {d.grad}')
print()

print(f'e.is_leaf: {e.is_leaf}')
print(f'e.grad_fn: {e.grad_fn}')
print(f'e.grad: {e.grad}')
print()

print(f'u.is_leaf: {u.is_leaf}')
print(f'u.grad_fn: {u.grad_fn}')
print(f'u.grad: {u.grad}')
print()

print(f'v.is_leaf: {v.is_leaf}')
print(f'v.grad_fn: {v.grad_fn}')
print(f'v.grad: {v.grad}')
print()

print(f't.is_leaf: {t.is_leaf}')
print(f't.grad_fn: {t.grad_fn}')
print(f't.grad: {t.grad}')

Computational Graph TensorFlow


import tensorflow as tf
from IPython.display import display, Math

print(tf.__version__)

a=tf.constant(2.0)
b=tf.constant(3.0)
c=tf.constant(5.0)
d=tf.constant(10.0)

def log10(x):
    x1 = tf.math.log(x)
    x2 = tf.math.log(10.0)
    return x1 / x2

u = tf.multiply(a,b,name='u')
t = log10(d)
v = tf.multiply(t,c,name='v')
e = tf.add(u,v,name='e')

print(u,t,v,e,sep='\n')



2.1.3 Backward computation
Loop over the nodes in reverse topological order, starting with the final goal node, and compute the derivatives of the final goal node's value with respect to each edge's tail node.


The back-propagation algorithm allows the calculation of the gradient required for the optimization techniques.

Chain Rule of Calculus

In calculus, the chain rule is a formula for computing the derivative of the composition of two or more functions. The chain rule may be written in Leibniz's notation in the following way. If a variable z depends on the variable y, which itself depends on the variable x, so that y and z are therefore dependent variables, then z, via the intermediate variable y, depends on x as well. The chain rule is as follows:

\frac{dz}{dx} = \frac{dz}{dy} \cdot \frac{dy}{dx}

Back Propagation

Forward Pass:
y = x^2
L = 2y
Loss: L = 2x^2

Backward Pass (applying the chain rule):
\frac{dL}{dx} = \frac{dL}{dy} \cdot \frac{dy}{dx} = 2 \cdot 2x = 4x
Examples

Forward Pass :



(1) y = y(x)
(2) u = u(y)
(2) v = v(y)
(3) L = L(u, v)

Loss: L(u(y(x)), v(y(x)))

Backward Pass (chain rule, summing over the two paths through u and v):
\frac{dL}{dx} = \left(\frac{\partial L}{\partial u}\frac{du}{dy} + \frac{\partial L}{\partial v}\frac{dv}{dy}\right)\frac{dy}{dx}



As you can see, once the backward graph is built, calculating derivatives is straightforward and
is heavily optimized in the deep learning frameworks.
Backward computation Pytorch
import torch

from IPython.display import display, Math

a = torch.tensor([2.0],requires_grad=True) #automatic differentiation


b = torch.tensor([3.0],requires_grad=True) #automatic differentiation
c = torch.tensor([5.0],requires_grad=True) #automatic differentiation
d = torch.tensor([10.0],requires_grad=True) #automatic differentiation

u = a*b

t = torch.log10(d)

v = t*c

t.retain_grad()
e = u+v

print(e)

e.backward()
display(Math(fr'\frac{{\partial e}}{{\partial a}} = {a.grad.item()}'))



print()
display(Math(fr'\frac{{\partial e}}{{\partial b}} = {b.grad.item()}'))
print()
display(Math(fr'\frac{{\partial e}}{{\partial c}} = {c.grad.item()}'))
print()
display(Math(fr'\frac{{\partial e}}{{\partial d}} = {d.grad.item()}'))

Autograd Module: The autograd module provides the functionality of easy calculation of gradients, without explicitly implementing the forward and backward passes manually for all layers.
Backward computation TensorFlow
import tensorflow as tf
import os
from IPython.display import display, Math
import numpy as np

def log10(x):
    x1 = tf.math.log(x)
    x2 = tf.math.log(10.0)
    return x1 / x2

#Create input placeholders for the graph variables


a=tf.Variable(2.0)
b=tf.Variable(3.0)
c=tf.Variable(5.0)
d=tf.Variable(10.0)

# Create graphs

with tf.GradientTape() as tape:
    u = tf.multiply(a, b, name='u')
    t = log10(d)
    v = tf.multiply(t, c, name='v')
    e = tf.add(u, v, name='e')

grad = tape.gradient(e, [a, b, c, d])

for g in grad:
    print(g)



Tensorflow graph example

2.2 Multi-Layer Perceptrons

— MLPs are feedforward neural networks (no feedback connections).

— They compose several non-linear functions:

f(x) = y(h_3(h_2(h_1(x))))

— The data specifies only the behavior of the output layer.

— Each layer i comprises multiple neurons j, which are implemented as affine transformations (a^T x + b) followed by non-linear activation functions g:

h_i^j = g(a_i^j\, h_{i-1} + b_i^j)

— Each neuron in each layer is fully connected to all neurons of the previous layer.

— The overall length of the chain is the depth of the model, hence "Deep Learning". (A minimal Pytorch sketch of an MLP is given below.)
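As an illustration (a minimal sketch; the layer sizes and names are arbitrary, not the course's reference implementation), such a chain of affine transformations and activations can be written in Pytorch as:

MLP Pytorch (sketch)
import torch
import torch.nn as nn

class MLP(nn.Module):
    def __init__(self, in_dim=4, hidden=16, out_dim=3):
        super().__init__()
        # each nn.Linear is an affine transformation a^T x + b;
        # ReLU is the non-linear activation g applied after it
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, out_dim),   # output layer (no activation here)
        )

    def forward(self, x):
        return self.net(x)

model = MLP()
x = torch.randn(8, 4)     # batch of 8 examples with 4 features
print(model(x).shape)     # torch.Size([8, 3])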



2.3 Loss functions
The loss function in a neural network quantifies the difference between the expected outcome
and the outcome produced by the machine learning model. From the loss function, we can derive
the gradients which are used to update the weights. The average over all losses constitutes the
cost.

— The output layer is the last layer in a neural network, which computes the output.
— The loss function compares the result of the output layer to the target value(s).
— The choice of output layer and loss function depends on the task (discrete, continuous, ...).

What is the goal of optimizing the loss function ?



— Tries to make the model output (= prediction) similar to the target (= data)
— Think of the loss function as a measure of the cost being paid for a prediction

How to design a good loss function ?


— A loss function can be any differentiable function that we wish to optimize
— Deriving the cost function from the maximum likelihood principle removes the burden of
manually designing the cost function for each model
— Consider the output of the neural network as parameters of a distribution over yi

Log-Likelihood :

w_ML = argmax_w p_model(y|X, w)

w_ML = argmax_w ∏_{i=1}^{N} p_model(y_i|x_i, w)

w_ML = argmax_w Σ_{i=1}^{N} log p_model(y_i|x_i, w)



— Neural network f_w(x) predicts the mean µ of a Gaussian distribution over y :

p(y|x, w) = 1/√(2πσ²) · exp(−(y − f_w(x))² / (2σ²))

— We want to maximize the probability of the target y under this distribution.

2.3.1 Supervised Learning Tasks


Regression

Mapping :
f_w : R^(W×H) → R

Binary Classification

Mapping :
f_w : R^(W×H) → {0, 1} ('Beach', 'No Beach')

Multi-Class Classification



Mapping :
f_w : R^(W×H) → {0, 1, 2, 3} ('Beach', 'Tree', 'Person', ...)

2.3.2 Loss Function for Regression


Mean Squared Error /L2 Loss

Gaussian Distribution :
p(y|x, w) = 1/√(2πσ²) · exp(−(y − µ)² / (2σ²))

— µ : mean.
— σ : standard deviation.
— The distribution has thin “tails” :

p(y) → 0 quickly as y → ∞

— It thus penalizes outliers strongly.


Let p(y|x, w) = 1/√(2πσ²) · exp(−(y − f_w(x))² / (2σ²)) be a Gaussian distribution. We obtain :

w_ML = argmax_w Σ_{i=1}^{N} log p_model(y_i|x_i, w)

w_ML = argmax_w − Σ_{i=1}^{N} (1/2) log(2πσ²) − Σ_{i=1}^{N} (f_w(x_i) − y_i)² / (2σ²)

w_ML = argmax_w − Σ_{i=1}^{N} (f_w(x_i) − y_i)²

w_ML = argmin_w Σ_{i=1}^{N} (f_w(x_i) − y_i)²

MSE = (1/N) Σ_{i=1}^{N} (f_w(x_i) − y_i)²

We minimize the squared loss (=L2 loss), affected strongly by outliers.



Mean Absolute Error /L1 Loss

Laplace Distribution :
p(y) = 1/(2b) · exp(−|y − µ| / b)
— µ : location.
— b : scale.
— The distribution has heavy “tails” :

p(y) → 0 slowly as y → ∞

— It thus penalizes outliers less.


— Thus often preferred in practice.

Let p(y|x, w) = 1/(2b) · exp(−|y − f_w(x)| / b) be a Laplace distribution. We obtain :

w_ML = argmax_w Σ_{i=1}^{N} log p_model(y_i|x_i, w)

w_ML = argmax_w − Σ_{i=1}^{N} log(2b) − Σ_{i=1}^{N} (1/b) |f_w(x_i) − y_i|

w_ML = argmax_w − Σ_{i=1}^{N} |f_w(x_i) − y_i|

w_ML = argmin_w Σ_{i=1}^{N} |f_w(x_i) − y_i|

MAE = (1/N) Σ_{i=1}^{N} |f_w(x_i) − y_i|

We minimize the absolute loss (= L1 loss), which is more robust to outliers than the L2 loss.
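A small sketch of this difference (the values are made up; the last target plays the role of an outlier):

import torch
import torch.nn as nn

y_pred = torch.tensor([1.0, 2.0, 3.0, 4.0])
y_true = torch.tensor([1.1, 1.9, 3.2, 14.0])   # last target is an outlier

mse = nn.MSELoss()(y_pred, y_true)   # L2: the outlier contributes (4-14)^2 = 100
mae = nn.L1Loss()(y_pred, y_true)    # L1: the outlier contributes |4-14| = 10
print(mse.item(), mae.item())        # about 25.0 vs 2.6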

Predicting all Parameters

Let p(y|x, w) = 1/(2g_w(x)) · exp(−|y − f_w(x)| / g_w(x)) be a Laplace distribution. We obtain :

w_ML = argmax_w Σ_{i=1}^{N} log p_model(y_i|x_i, w)

w_ML = argmax_w − Σ_{i=1}^{N} log(2g_w(x_i)) − Σ_{i=1}^{N} (1/g_w(x_i)) |f_w(x_i) − y_i|

In this case, we predict both the location µ and the scale b with the neural network, i.e., the network also predicts its uncertainty (variance/scale).

Mixture Density Networks

To represent multi-modal distributions, we can also model mixture densities :


p(y|X, W) = Σ_{m=1}^{M} π_m · 1/(2g_W^(m)(X)) · exp(−|y − f_W^(m)(X)| / g_W^(m)(X))

— Mixture of Laplace distributions.
— π_m ∈ [0, 1] : weight of mode m, with Σ_m π_m = 1 ; this can easily be achieved by passing the outputs for the mixing coefficients through a softmax layer.
— Location µ_m and scale b_m of each mode m are modeled by the neural network (a minimal sketch follows this list).
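A minimal sketch of such a mixture head, assuming Laplace components and M = 3 modes (all layer sizes and names are illustrative):

import torch
import torch.nn as nn

class LaplaceMixtureHead(nn.Module):
    def __init__(self, in_features, n_modes=3):
        super().__init__()
        self.pi = nn.Linear(in_features, n_modes)         # mixing coefficients
        self.loc = nn.Linear(in_features, n_modes)        # locations mu_m
        self.log_scale = nn.Linear(in_features, n_modes)  # log of scales b_m

    def forward(self, h):
        pi = torch.softmax(self.pi(h), dim=-1)   # guarantees sum_m pi_m = 1
        loc = self.loc(h)
        scale = torch.exp(self.log_scale(h))     # guarantees b_m > 0
        return pi, loc, scale

def mixture_nll(pi, loc, scale, y):
    # negative log-likelihood of y under the Laplace mixture
    log_comp = -torch.log(2 * scale) - torch.abs(y - loc) / scale
    return -torch.logsumexp(torch.log(pi) + log_comp, dim=-1).mean()

head = LaplaceMixtureHead(16)
h = torch.randn(8, 16)   # features produced by some backbone network
y = torch.randn(8, 1)    # targets, broadcast against the M modes
pi, loc, scale = head(h)
print(mixture_nll(pi, loc, scale, y))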

2.3.3 Loss Function for Classification


2.3.4 Binary cross-entropy
Bernoulli Distribution :
p(y) = µ^y (1 − µ)^(1−y)



— µ : probability for y = 1.
— Handles only two classes e.g. (“cats” vs. “dogs”).

Let p(y|X, W) = f_W(X)^y (1 − f_W(X))^(1−y) be a Bernoulli distribution. We obtain :

w_ML = argmax_w Σ_{i=1}^{N} log p_model(y_i|x_i, w)

w_ML = argmax_w Σ_{i=1}^{N} log[ f_W(x_i)^{y_i} (1 − f_W(x_i))^{1−y_i} ]

w_ML = argmin_w Σ_{i=1}^{N} −y_i log f_W(x_i) − (1 − y_i) log(1 − f_W(x_i))

In other words, we minimize the binary cross-entropy (BCE) loss.


Remark : the last layer of f_w(x) can be a sigmoid function such that f_w(x) ∈ [0, 1].
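A small sketch checking that nn.BCELoss matches the formula above (the probabilities and labels are made up):

import torch
import torch.nn as nn

p = torch.tensor([0.9, 0.2, 0.7])   # sigmoid outputs f_w(x), in [0, 1]
y = torch.tensor([1.0, 0.0, 1.0])   # binary targets

bce = nn.BCELoss()(p, y)
manual = -(y * torch.log(p) + (1 - y) * torch.log(1 - p)).mean()
print(bce.item(), manual.item())    # identical up to floating point

# In practice nn.BCEWithLogitsLoss, applied to the raw scores before the sigmoid,
# is usually preferred for numerical stability.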

2.3.5 Cross-entropy / Multi-Class


Categorical distribution :
p(y = c) = µc
— µc : probability for class c.
— Multiple classes, multiple modes.

Alternative notation :

p(y) = ∏_{c=1}^{C} µ_c^{y_c}

— y : “one-hot” vector with yc ∈ {1, 0}


— y = (0, ..., 0, 1, 0, ..., 0)T with all zeros except for one (the true class)



Let p(y|X, W) = ∏_{c=1}^{C} f_w^(c)(x)^{y_c} be a Categorical distribution. We obtain :

w_ML = argmax_w Σ_{i=1}^{N} log p_model(y_i|x_i, w)

w_ML = argmax_w Σ_{i=1}^{N} log[ ∏_{c=1}^{C} f_w^(c)(x_i)^{y_{i,c}} ]

w_ML = argmin_w Σ_{i=1}^{N} Σ_{c=1}^{C} −y_{i,c} log f_w^(c)(x_i)

In other words, we minimize the cross-entropy (CE) loss.


The target y = (0, . . . , 0, 1, 0, . . . , 0) is a “one-hot” vector with yc its c’th element.
How can we ensure that fw(c) (x) predicts a valid Categorical (discrete) distribution ?

— We must guarantee (1) f_w^(c)(x) ∈ [0, 1] and (2) Σ_{c=1}^{C} f_w^(c)(x) = 1.
— An element-wise sigmoid as output function would ensure (1) but not (2).
— Solution : the softmax function guarantees both (1) and (2) :

softmax(x) = ( e^{x_1} / Σ_{k=1}^{C} e^{x_k} , ...., e^{x_C} / Σ_{k=1}^{C} e^{x_k} )

f_w^(c)(x) = e^{x_c} / Σ_{k=1}^{C} e^{x_k}
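A small sketch relating the softmax to the CE loss in PyTorch; note that nn.CrossEntropyLoss expects raw scores (logits) and applies the log-softmax internally (the tensors below are made up):

import torch
import torch.nn as nn

logits = torch.randn(4, 3)            # raw outputs for 4 samples and C = 3 classes
targets = torch.tensor([0, 2, 1, 2])  # true class indices

ce = nn.CrossEntropyLoss()(logits, targets)

# equivalent by hand: softmax, then -log of the probability of the true class
probs = torch.softmax(logits, dim=1)
manual = -torch.log(probs[torch.arange(4), targets]).mean()
print(ce.item(), manual.item())       # same value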

2.3.6 Kullback Leibler Divergence


K-L divergence is a measure of how different a specific probability distribution is from a reference distribution. However, it is not a true statistical metric like variation of information, which measures the distance between two clusterings.

D_KL(P || Q) = Σ_{x∈X} P(x) log( P(x) / Q(x) )

where D(K-L) is the divergence of Q from P.



The difference between cross-entropy and KL-divergence is that cross-entropy measures the total number of bits required to represent an event when using the distribution q instead of p, while the KL-divergence measures the extra number of bits required when using q instead of p.

— D(p || q) is always greater than or equal to 0.


— D(p || q) is not equal to D(q || p). The KL-divergence is not symmetric.
— If p=q, then D(p || q) is 0.

The K-L divergence is an important feature in a variety of machine learning models. One in
particular is the Variational Autoencoder (VAE).

Example

Suppose we have two probability distributions, P & Q. and we want to find the difference between
the two probabilities, we can simply apply the KL divergence as shown below.

D(q||p) = (1/3) log(0.333/0.36) + (1/3) log(0.333/0.48) + (1/3) log(0.333/0.16)

D(q||p) = 0.096 nats

“nats” is simply the unit of information obtained by using the natural logarithm (ln(x)).

Python code

KL Divergence
# box =[P(green),P(blue),P(red),P(yellow)]
box_1 = [0.25, 0.33, 0.23, 0.19]
box_2 = [0.21, 0.21, 0.32, 0.26]

import numpy as np
from scipy.special import rel_entr

def kl_divergence(a, b):
    return sum(a[i] * np.log(a[i]/b[i]) for i in range(len(a)))

print('KL-divergence(box_1 || box_2): %.3f ' % kl_divergence(box_1,box_2))


print('KL-divergence(box_2 || box_1): %.3f ' % kl_divergence(box_2,box_1))

# D( p || p) =0
print('KL-divergence(box_1 || box_1): %.3f ' % kl_divergence(box_1,box_1))

print("Using Scipy rel_entr function")


box_1 = np.array(box_1)
box_2 = np.array(box_2)

print('KL-divergence(box_1 || box_2): %.3f ' % sum(rel_entr(box_1,box_2)))


print('KL-divergence(box_2 || box_1): %.3f ' % sum(rel_entr(box_2,box_1)))
print('KL-divergence(box_1 || box_1): %.3f ' % sum(rel_entr(box_1,box_1)))

2.4 Activation functions


An Activation Function decides whether a neuron should be activated or not. This means that
it will decide whether the neuron’s input to the network is important or not in the process of
prediction using simpler mathematical operations.

The role of the Activation Function is to derive output from a set of input values fed to a node.
— Hidden layer hi = g(Ai hi−1 + bi ) with activation function g(·) and weights Ai , bi
— The activation function is frequently applied element-wise to its input
— Activation functions must be non-linear to learn non-linear mappings
— Some of them are not differentiable everywhere (but still ok for training)

2.4.1 Sigmoid

g(x) = 1 / (1 + e^(−x))
— Maps input to range [0, 1].
— Neuroscience interpretation as saturating “firing rate” of neurons.



Problems

Saturation “kills” gradients (gradients vanishing) :

— Downstream gradient becomes zero when input x is saturated : g’(x) = 0


— No learning if x is very small (<-10)
— No learning if x is very large (>10)

Non zero-centered outputs :

— The output is always between 0 and 1


— The gradient updates go too far in different directions which makes optimization harder.

2.4.2 Tanh

g(x) = 2 / (1 + e^(−2x)) − 1 = (e^x − e^(−x)) / (e^x + e^(−x))
— Maps input to range [−1, 1].
— Zero-centered.

Problems

Again, saturation “kills” gradients.



2.4.3 ReLU

g(x) = max(0, x)

— Does not saturate (for x > 0).


— Leads to fast convergence.
— Computationally efficient

Problems

— Not zero-centered
— No learning for x < 0, dead ReLUs

2.4.4 Leaky ReLU

g(x) = max(0.01x, x)

— Does not saturate (i.e., will not die).


— Closer to zero-centered outputs.
— Leads to fast convergence.
— Computationally efficient.

Note : there is also an alternative Parametric ReLU : g(x) = max(αx, x) with the same advantages
as Leaky ReLU and Parameter α learned from data.



2.4.5 Exponential Linear Units (ELU)


g(x) = x if x > 0 , and g(x) = α(e^x − 1) if x ≤ 0    (2.1)
— All benefits of Leaky ReLU
— Adds some robustness to noise
— Default α = 1
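A quick sketch evaluating the activations above on the same inputs (α for ELU left at its default value of 1):

import torch
import torch.nn.functional as F

x = torch.tensor([-10.0, -1.0, 0.0, 1.0, 10.0])
print(torch.sigmoid(x))        # saturates near 0 and 1 at the extremes
print(torch.tanh(x))           # zero-centered, saturates near -1 and 1
print(F.relu(x))               # zero for x < 0
print(F.leaky_relu(x, 0.01))   # small slope 0.01 for x < 0
print(F.elu(x, alpha=1.0))     # smooth exponential branch for x <= 0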

2.4.6 How to choose an activation function

2.5 ANN Simulator


See : https://playground.tensorflow.org/#activation=tanh&batchSize=10&dataset=circle&regDataset=
reg-plane&learningRate=0.03&regularizationRate=0&noise=0&networkShape=4,2&seed=0.72180&
showTestData=false&discretize=false&percTrainData=50&x=true&y=true&xTimesY=false&xSquared=
false&ySquared=false&cosX=false&sinX=false&cosY=false&sinY=false&collectStats=false&problem=
classification&initZero=false&hideText=false



Chapitre 3
Optimization and Regularization
3.1 Optimization
Machine/Deep learning involves using an algorithm to learn and generalize from historical data
in order to make predictions on new data.

This problem can be described as approximating a function that maps examples of inputs to
examples of outputs. Approximating a function can be solved by framing the problem as func-
tion optimization.

Machine/Deep learning optimization is the process of adjusting hyperparameters in order to minimize the cost function by using one of the optimization techniques. It is important to minimize the cost function because it describes the discrepancy between the true value of the estimated parameter and what the model has predicted.

3.1.1 Optimization Challenges


Local/Global minima

Gradient Descent :
w0 = winit
wt+1 = wt − α∇w L(wt )
— Neural network loss L(w) is not convex wrt. the network parameters w
— There exist multiple local minima, but we will find only one through optimization
— Example : we can permute all hidden units in a layer and get the same solution
— it is known that many local minima in deep networks are good ones

Learning rate

— Choosing the learning rate too low leads to very slow progress
— Choosing the learning rate too high might lead to divergence

Exploding gradients

— Steep cliffs can pose great challenges to optimization


— Very high derivatives catapult the parameters w very far off
— Gradient clipping is a common heuristics to counteract such effects

Saddle point

— At a saddle point, we have ∇w L(w) = 0, but we are not at a minimum.


— Many saddle points in DL, but only problematic if we exactly “hit” the saddle point.



Gradient vanishing

— A flat region is called a plateau where ∇w L(w) ≈ 0, Slow progress.


— Example : Saturated sigmoid activation function, dead ReLUs, etc.

3.1.2 Optimization Algorithms


Gradient Descent

Gradient descent is an optimization algorithm which is commonly-used to train machine learning


models and neural networks. Training data helps these models learn over time, and the cost
function within gradient descent specifically acts as a barometer, gauging its accuracy with each
iteration of parameter updates. Until the function is close to or equal to zero, the model will
continue to adjust its parameters to yield the smallest possible error.
Algorithm :

1. Initialize weights w0 and pick learning rate η.


2. For all data points i ∈ 1, ..., N do :



(a) Forward propagate xi through network to calculate prediction ŷi
(b) Backpropagate to obtain gradient∇w Li (wt ) = ∇w Li (ŷi , yi , wt )
3. Update the weights : w_{t+1} = w_t − η (1/N) Σ_i ∇_w L_i(w_t).
4. If validation error decreases, go to step 2, otherwise stop.

— Typically, millions of parameter dim(w) = 1 million or more


— Typically, millions of training points N = 1 million or more
— Becomes extremely expensive to compute and doesn’t fit into memory

Stochastic Gradient Descent

The word ‘stochastic‘ means a system or process linked with a random probability. Hence, in
Stochastic Gradient Descent, a few samples are selected randomly instead of the whole data set
for each iteration. In Gradient Descent, there is a term called “batch” which denotes the total
number of samples from a dataset that is used for calculating the gradient for each iteration.
Algorithm :

1. Initialize weights w0 and pick learning rate η and minibatch size |Xbatch |.
2. Draw a random minibatch (x_1, y_1), ..., (x_B, y_B) ⊆ X (with B ≪ N)
3. For all minibatch elements b ∈ 1, ..., B do :
(a) Forward propagate xi through network to calculate prediction ŷi
(b) Backpropagate to obtain gradient∇w Li (wt ) = ∇w Li (ŷi , yi , wt )
4. Update the weights : w_{t+1} = w_t − η (1/B) Σ_b ∇_w L_b(w_t).
5. If validation error decreases, go to step 2, otherwise stop.

— We call (x1 , y1 ), ..., (xB , yB ) ⊆ X a “minibatch”


— You should choose B as large as your (GPU) memory allows. Typically B ≪ N, e.g., B = 8,
16, 32, 64, 128
— Smaller batch sizes lead to larger variance in the gradients (noisy updates)
— Batches can be chosen randomly or by partitioning the dataset

L(w) = (0.1 w_1)² + w_2²

∇_w L(w) = (0.02 w_1 , 2 w_2)^T + N(0, 0.03)

To simulate minibatches, we have added Gaussian noise to the gradient.



Choosing the learning rate η too small does not lead to convergence (in 100 steps).

Choosing the learning rate η too high leads to oscillations (also no convergence).

A good learning rate η works better, but still slow, inefficient and no convergence.



Remark : Due to stochasticity, a fixed learning rate η will never lead to convergence.

Convergence of SGD :

A series is the sum of terms of an infinite sequence of numbers, s_n = Σ_{k=1}^{n} a_k.

A series is convergent if there exists a number s∗ such that for every arbitrarily small positive
number  :

| sn − s∗ |< 

The SGD update leads to the following parameter series :

wt+1 = wt − ηΣi ∇w Li (wt )


w1 = w0 − ηΣi ∇w L0
w2 = w1 − ηΣi ∇w L1 = w0 − ηΣi ∇w L0 − ηΣi ∇w L1

w3 = w2 − ηΣi ∇w L2 = w0 − ηΣi ∇w L0 − ηΣi ∇w L1 − ηΣi ∇w L2

Optimization converges if there exists a vector w∗ such that for every arbitrarily small positive
number , there exists an integer T such that for all t ≥ T :

|| wt − w∗ ||< 

Problems of SGD

— Requires conservative learning rate to avoid divergence.


— However, in this case the updates become very small, slow progress.
— Finding a good learning rate is difficult.

SGD with Momentum

Momentum is an extension to the gradient descent optimization algorithm that allows the search
to build inertia in a direction in the search space and overcome the oscillations of noisy gradients
and coast across flat spots of the search space.



Motivation for SGD with Momentum :

— SGD oscillates along w2 axis, we should dampen, e.g., by averaging over time.
— SGD makes slow progress along w1 axis, we like to accelerate in this direction.
— Idea of momentum : update weights with exponential moving average of gradients.

mt+1 = β1 mt − η∇w LB (wt )

wt+1 = wt + mt+1

With velocity m and momentumβ1 , typically β1 = 0.9.

Exponential Moving Average :

We can write the expression of mt+1 as below :

mt+1 = β1 mt − (1 − β1 )∇w LB (wt )

wt+1 = wt + ηmt+1
Let us abbreviate the gradient at iteration t with gt ≡ ∇w LB (wt ). We have :

mt+1 = β1 mt − (1 − β1 )gt

with (m0 = 0).


m_1 = β_1 m_0 − (1 − β_1) g_0 = −(1 − β_1) g_0

m_2 = β_1 m_1 − (1 − β_1) g_1 = −β_1 (1 − β_1) g_0 − (1 − β_1) g_1
m_3 = β_1 m_2 − (1 − β_1) g_2 = −β_1² (1 − β_1) g_0 − β_1 (1 − β_1) g_1 − (1 − β_1) g_2
We see that the weight of each gradient decays exponentially :

m_t = −(1 − β_1) Σ_{i=0}^{t−1} β_1^{t−i−1} g_i

Example :

t_1, t_2, t_3, ......, t_n
b_1, b_2, b_3, ......, b_n

with 0 ≤ γ ≤ 1.

V_{t1} = b_1
V_{t2} = γ V_{t1} + b_2
V_{t3} = γ V_{t2} + b_3 = γ² V_{t1} + γ b_2 + b_3

SGD with Nesterov momentum leads to faster dampening and has significantly increased the performance of RNNs on a variety of tasks.

RMSprop

Root Mean Squared Propagation, or RMSProp, is an extension of gradient descent and of the AdaGrad version of gradient descent that uses a decaying average of partial gradients in the adaptation of the step size for each parameter. The main idea is to "divide the gradient by a running average of its recent magnitude".
Motivation for RMSprop :

— Gradient distribution is very uneven (not equal)and thus requires conservative learning
rates.
— In this SGD example, gradients are very large in w2 but small in w1 .
— Idea of RMSprop : divide learning rate by moving average of squared gradients.



v_{t+1} = β_2 v_t + (1 − β_2) ∇_w² L_B(w_t)

w_{t+1} = w_t − η ∇_w L_B(w_t) / (√(v_{t+1}) + ε)

With uncentered variance of the gradient v, momentum β_2 = 0.999 and ε = 10^(−8).

Adam

Adaptive Moment Estimation (Adam) is a method that computes adaptive learning rates for each
parameter. It stores both the decaying average of the past gradients mt , similar to momentum
and also the decaying average of the past squared gradients vt , similar to RMSprop and Adadelta.
Thus, it combines the advantages of both the methods. Adam is the default choice of the optimizer
for any application in general.

m_{t+1} = β_1 m_t + (1 − β_1) ∇_w L_B(w_t)

v_{t+1} = β_2 v_t + (1 − β_2) ∇_w² L_B(w_t)

w_{t+1} = w_t − α m_{t+1} / (√(v_{t+1}) + ε)

Adam combines the benefits of Momentum and RMSprop.
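A sketch of how these optimizers are instantiated in PyTorch (the model and the hyperparameter values are placeholders):

import torch
import torch.nn as nn

model = nn.Linear(10, 1)   # any model

sgd      = torch.optim.SGD(model.parameters(), lr=0.01)
momentum = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
rmsprop  = torch.optim.RMSprop(model.parameters(), lr=0.001, alpha=0.999, eps=1e-8)
adam     = torch.optim.Adam(model.parameters(), lr=0.001, betas=(0.9, 0.999), eps=1e-8)

# one generic update step (identical for all of them)
x, y = torch.randn(32, 10), torch.randn(32, 1)
loss = nn.MSELoss()(model(x), y)
adam.zero_grad()
loss.backward()
adam.step()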



Optimizers Comparison

Visualization : https://github.com/Jaewan-Yun/optimizer-visualization

3.1.3 Optimization Strategies


Learning Rate Schedules

— Fixed learning rate (not a good idea : too slow in the beginning and fast in the end)
— Inverse proportional decay : ηt = η/t (Robbins and Monro)
— Exponential decay : η_t = η · α^t
— Step decay : η ← αη (every K iterations/epochs, common in practice : α = 0.5)
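A sketch of the step and exponential decays with torch.optim.lr_scheduler (the values are illustrative):

import torch

params = [torch.nn.Parameter(torch.zeros(1))]
optimizer = torch.optim.SGD(params, lr=0.1)

# step decay: multiply the learning rate by alpha = 0.5 every K = 10 epochs
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.5)
# exponential decay would be: torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.95)

for epoch in range(30):
    # ... one training epoch (optimizer.step() on each minibatch) ...
    optimizer.step()
    scheduler.step()

print(optimizer.param_groups[0]['lr'])   # 0.1 * 0.5**3 = 0.0125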

Monitoring the Training Process

Underfitting : Model does not have enough capacity to decrease losses.



Not converged : Model requires more iterations to converge.

Overfitting : Training loss decreases but validation loss increases.

Example of train and validation curves showing a good fit.



Noisy validation curves : the validation set might be too small.

Might also happen : Validation set is easier than training set.

Hyperparameter Search

Hyperparameters :
— Network architecture
— Number of iterations
— Batch size
— Learning rate schedule
— Regularization
Methods :
— Manual search
— Most common
— Build intuitions
— Grid search
— Define ranges
— Systematically evaluate
— Requires large resources
— Random search
— Like grid search but
— hyperparameters selected
— based on random draws



How to Start

1. Start with single training sample and use a small network


— First verify that the output is correct
— Then overfit, accuracy should be 100%, fast training/debug cycles
— Choose a good learning rate (0.1, 0.01, 0.001, ..)
2. Increase to 10 training samples
— Again, verify that the output is correct
— Measure time for one iteration (< 1s) –> identify bottlenecks (e.g., data loading)
— Overfit to 10 samples, accuracy should be near 100%
3. Increase to 100, 1000, 10000 samples and increase network size
— Plot train and validation error –> now you should start to see generalization
— Important : Make only one change at a time to identify causes

3.2 Regularization

3.2.1 Capacity of the model

— Underfitting : Model too simple, does not achieve low error on training set.
— Overfitting : Training error small, but test error (= generalization error) large.



Bias and Variance

Bias : while making predictions, a difference occurs between the values predicted by the model and the actual/expected values; this difference is known as the bias error (error due to bias).

— Low Bias : A low bias model will make fewer assumptions about the form of the target
function.
— High Bias : A model with a high bias makes more assumptions, and the model becomes
unable to capture the important features of our dataset. A high bias model also cannot
perform well on new data.

Ways to reduce High Bias :

— Increase the input features as the model is underfitted.


— Decrease the regularization term.
— Use more complex models, such as including some polynomial features.

Variance : variance tells how much a random variable differs from its expected value. Ideally, a model should not vary too much from one training dataset to another, which means the algorithm should be good at capturing the hidden mapping between input and output variables.
Ways to reduce High Variance :

— Reduce the input features or number of parameters as a model is overfitted.


— Do not use an overly complex model.
— Increase the training data.
— Increase the Regularization term.



1. Low-Bias, Low-Variance : The combination of low bias and low variance shows an ideal
machine learning model. However, it is not possible practically.
2. Low-Bias, High-Variance : With low bias and high variance, model predictions are in-
consistent and accurate on average. This case occurs when the model learns with a large
number of parameters and hence leads to an overfitting
3. High-Bias, Low-Variance : With High bias and low variance, predictions are consistent
but inaccurate on average. This case occurs when a model does not learn well with the
training dataset or uses few numbers of the parameter. It leads to underfitting problems
in the model.
4. High-Bias, High-Variance : With high bias and high variance, predictions are inconsistent
and also inaccurate on average.

How to identify High variance or High Bias ?

3.2.2 L1 and L2 Regularization


Regularization refers to a set of different techniques that lower the complexity of a neural network
model during training, and thus prevent the overfitting.



Let X = (X, y) denote the dataset and w the model parameters. We can limit the model capacity
by adding a parameter norm penalty R to the loss L.

L̂(X, w) = L(X, w) + αR(w)

where α ∈ [0, ∞) controls the strength of the regularizer.


There are three very popular and efficient regularization techniques called L1, L2, and dropout.

L2 Regularization

The L2 regularization is the most common type of all regularization techniques and is also com-
monly known as weight decay or Ridge Regression.

L2 regularization, or the L2 norm, or Ridge (in regression problems), combats overfitting by


forcing weights to be small, but not making them exactly 0.

So, if we’re predicting house prices again, this means the less significant features for predicting
the house price would still have some influence over the final prediction, but it would only be a
small influence.

L̂(X, w) = L(X, w) + α (1/2) ||w||₂²

L̂(X, w) = L(X, w) + α Σ_{i=1}^{n} w_i²

L1 Regularization

L1 regularization, also known as L1 norm or Lasso (in regression problems), combats overfitting
by shrinking the parameters towards 0. This makes some features obsolete.
It’s a form of feature selection, because when we assign a feature with a 0 weight, we’re multi-
plying the feature values by 0 which returns 0, eradicating the significance of that feature.

L̂(X, w) = L(X, w) + α ||w||₁

L̂(X, w) = L(X, w) + α Σ_{i=1}^{n} |w_i|

L1 vs L2

— L1 regularization penalizes the sum of absolute values of the weights, whereas L2 regula-
rization penalizes the sum of squares of the weights.
— The L1 regularization solution is sparse. The L2 regularization solution is non-sparse.
— L2 regularization doesn’t perform feature selection, since weights are only reduced to
values near 0 instead of 0. L1 regularization has built-in feature selection.
— L1 regularization is robust to outliers, L2 regularization is not.
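A sketch of both penalties in PyTorch: L2 is usually added through the optimizer's weight_decay argument, while L1 can be added to the loss by hand (the α values are illustrative):

import torch
import torch.nn as nn

model = nn.Linear(20, 1)
x, y = torch.randn(64, 20), torch.randn(64, 1)

# L2 / weight decay: handled directly by the optimizer
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, weight_decay=1e-4)

# L1: add alpha * sum |w_i| to the loss manually
alpha = 1e-4
l1_penalty = sum(p.abs().sum() for p in model.parameters())
loss = nn.MSELoss()(model(x), y) + alpha * l1_penalty

optimizer.zero_grad()
loss.backward()
optimizer.step()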

3.2.3 Early Stop


— Most commonly used form of regularization in deep learning
— Effective, simple and computationally efficient form of regularization



— Training time can be viewed as hyperparameter, model selection problem
— Efficient as a single training run tests all hyperparameters (unlike weight decay)
— Only cost : periodically evaluate validation error on validation set
— Validation set can be small, and evaluation less frequently
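A sketch of the usual patience-based early-stopping loop (the tiny synthetic model and data below only serve to make the loop runnable):

import torch
import torch.nn as nn

model = nn.Linear(5, 1)
opt = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.MSELoss()
x_tr, y_tr = torch.randn(100, 5), torch.randn(100, 1)   # training set
x_va, y_va = torch.randn(50, 5), torch.randn(50, 1)     # validation set

best_val, patience, wait = float('inf'), 10, 0
for epoch in range(1000):
    opt.zero_grad()
    loss_fn(model(x_tr), y_tr).backward()
    opt.step()
    with torch.no_grad():
        val = loss_fn(model(x_va), y_va).item()
    if val < best_val:          # validation error decreased: keep going
        best_val, wait = val, 0
        best_state = {k: v.clone() for k, v in model.state_dict().items()}
    else:
        wait += 1
        if wait >= patience:    # no improvement for `patience` epochs: stop
            break

model.load_state_dict(best_state)   # restore the best weights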

3.2.4 Dropout
Dropout means that, during training, each neuron of the neural network gets turned off with some probability p. A minimal sketch follows.
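A small sketch: nn.Dropout zeroes activations with probability p during training and is disabled in evaluation mode:

import torch
import torch.nn as nn

drop = nn.Dropout(p=0.5)
x = torch.ones(10)

drop.train()
print(drop(x))   # roughly half the entries are 0, the rest are scaled by 1/(1-p) = 2

drop.eval()
print(drop(x))   # identity: all ones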

3.3 DNN Pytorch


A model has a life-cycle, and this very simple knowledge provides the backbone for both modeling
a dataset and understanding the PyTorch API.
The five steps in the life-cycle are as follows :

— Prepare the Data.


— Define the Model.
— Train the Model.
— Evaluate the Model.
— Make Predictions.

3.3.1 Regression MLP Pytorch

DNN Pytorch
from numpy import vstack
from numpy import sqrt
from pandas import read_csv
from sklearn.metrics import mean_squared_error
from torch.utils.data import Dataset
from torch.utils.data import DataLoader
from torch.utils.data import random_split
from torch import Tensor
from torch.nn import Linear
from torch.nn import Sigmoid
from torch.nn import Module
from torch.optim import SGD
from torch.nn import MSELoss
from torch.nn.init import xavier_uniform_



from tqdm import tqdm

# dataset definition
class CSVDataset(Dataset):
# load the dataset
def __init__(self, path):
# load the csv file as a dataframe
df = read_csv(path, header=None)
# store the inputs and outputs
self.X = df.values[:, :-1].astype('float32')
self.y = df.values[:, -1].astype('float32')
# ensure target has the right shape
self.y = self.y.reshape((len(self.y), 1))

# number of rows in the dataset


def __len__(self):
return len(self.X)

# get a row at an index


def __getitem__(self, idx):
return [self.X[idx], self.y[idx]]

# get indexes for train and test rows


def get_splits(self, n_test=0.33):
# determine sizes
test_size = round(n_test * len(self.X))
train_size = len(self.X) - test_size
# calculate the split
return random_split(self, [train_size, test_size])
# model definition
class MLP(Module):
# define model elements
def __init__(self, n_inputs):
super(MLP, self).__init__()
# input to first hidden layer
self.hidden1 = Linear(n_inputs, 10)
xavier_uniform_(self.hidden1.weight)
self.act1 = Sigmoid()
# second hidden layer
self.hidden2 = Linear(10, 8)
xavier_uniform_(self.hidden2.weight)
self.act2 = Sigmoid()
# third hidden layer and output
self.hidden3 = Linear(8, 1)
xavier_uniform_(self.hidden3.weight)



# forward propagate input
def forward(self, X):
# input to first hidden layer
X = self.hidden1(X)
X = self.act1(X)
# second hidden layer
X = self.hidden2(X)
X = self.act2(X)
# third hidden layer and output
X = self.hidden3(X)
return X
# prepare the dataset
def prepare_data(path):
# load the dataset
dataset = CSVDataset(path)
# calculate split
train, test = dataset.get_splits()
# prepare data loaders
train_dl = DataLoader(train, batch_size=32, shuffle=True)
test_dl = DataLoader(test, batch_size=1024, shuffle=False)
return train_dl, test_dl

# train the model


def train_model(train_dl, model):
size = len(train_dl.dataset)
# define the optimization
criterion = MSELoss()
optimizer = SGD(model.parameters(), lr=0.01, momentum=0.9)
# enumerate epochs
for epoch in tqdm(range(100),desc='Training Epochs'):
print(f"Epoch {epoch+1}\n-------------------------------")
# enumerate mini batches
for batch, (inputs, targets) in enumerate(train_dl):
# clear the gradients
optimizer.zero_grad()
# compute the model output
yhat = model(inputs)
# calculate loss
loss = criterion(yhat, targets)
# credit assignment
loss.backward()
# update model weights
optimizer.step()

#if batch % 100 == 0:



loss, current = loss.item(), batch * len(inputs)
print(f"loss: {loss:>7f} [{current:>5d}/{size:>5d}]")

# evaluate the model


def evaluate_model(test_dl, model):
predictions, actuals = list(), list()
for i, (inputs, targets) in enumerate(test_dl):
# evaluate the model on the test set
yhat = model(inputs)
# retrieve numpy array
yhat = yhat.detach().numpy()
actual = targets.numpy()
actual = actual.reshape((len(actual), 1))
# store
predictions.append(yhat)
actuals.append(actual)
predictions, actuals = vstack(predictions), vstack(actuals)
# calculate mse
mse = mean_squared_error(actuals, predictions)
return mse

# make a class prediction for one row of data


def predict(row, model):
# convert row to data
row = Tensor([row])
# make prediction
yhat = model(row)
# retrieve numpy array
yhat = yhat.detach().numpy()
return yhat

# prepare the data


path = 'datasets/housing.csv'
train_dl, test_dl = prepare_data(path)
print(len(train_dl.dataset), len(test_dl.dataset))
# define the network
model = MLP(13)
# train the model
train_model(train_dl, model)
# evaluate the model
mse = evaluate_model(test_dl, model)
print('MSE: %.3f, RMSE: %.3f' % (mse, sqrt(mse)))
# make a single prediction
row = [0.00632,18.00,2.310,0,0.5380,6.5750,65.20,4.0900,1,296.0,15.30,396.90,4.98]
yhat = predict(row, model)
print('Predicted: %.3f' % yhat)



3.3.2 Binary Classification MLP Pytorch

DNN Pytorch
# pytorch mlp for binary classification
from numpy import vstack
from pandas import read_csv
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import accuracy_score
from torch.utils.data import Dataset
from torch.utils.data import DataLoader
from torch.utils.data import random_split
from torch import Tensor
from torch.nn import Linear
from torch.nn import ReLU
from torch.nn import Sigmoid
from torch.nn import Module
from torch.optim import SGD
from torch.nn import BCELoss
from torch.nn.init import kaiming_uniform_
from torch.nn.init import xavier_uniform_
from tqdm import tqdm

# dataset definition
class CSVDataset(Dataset):
# load the dataset
def __init__(self, path):
# load the csv file as a dataframe
df = read_csv(path, header=None)
# store the inputs and outputs
self.X = df.values[:, :-1]
self.y = df.values[:, -1]
# ensure input data is floats
self.X = self.X.astype('float32')
# label encode target and ensure the values are floats
self.y = LabelEncoder().fit_transform(self.y)
self.y = self.y.astype('float32')
self.y = self.y.reshape((len(self.y), 1))

# number of rows in the dataset


def __len__(self):
return len(self.X)

# get a row at an index


def __getitem__(self, idx):
return [self.X[idx], self.y[idx]]



# get indexes for train and test rows
def get_splits(self, n_test=0.33):
# determine sizes
test_size = round(n_test * len(self.X))
train_size = len(self.X) - test_size
# calculate the split
return random_split(self, [train_size, test_size])

# model definition
class MLP(Module):
# define model elements
def __init__(self, n_inputs):
super(MLP, self).__init__()
# input to first hidden layer
self.hidden1 = Linear(n_inputs, 10)
kaiming_uniform_(self.hidden1.weight, nonlinearity='relu')
self.act1 = ReLU()
# second hidden layer
self.hidden2 = Linear(10, 8)
kaiming_uniform_(self.hidden2.weight, nonlinearity='relu')
self.act2 = ReLU()
# third hidden layer and output
self.hidden3 = Linear(8, 1)
xavier_uniform_(self.hidden3.weight)
self.act3 = Sigmoid()

# forward propagate input


def forward(self, X):
# input to first hidden layer
X = self.hidden1(X)
X = self.act1(X)
# second hidden layer
X = self.hidden2(X)
X = self.act2(X)
# third hidden layer and output
X = self.hidden3(X)
X = self.act3(X)
return X

# prepare the dataset


def prepare_data(path):
# load the dataset
dataset = CSVDataset(path)
# calculate split
train, test = dataset.get_splits()
# prepare data loaders



train_dl = DataLoader(train, batch_size=32, shuffle=True)
test_dl = DataLoader(test, batch_size=1024, shuffle=False)
return train_dl, test_dl

# train the model


def train_model(train_dl, model):
size = len(train_dl.dataset)
# define the optimization
criterion = BCELoss()
optimizer = SGD(model.parameters(), lr=0.01, momentum=0.9)
# enumerate epochs
for epoch in tqdm(range(100),desc='Training Epochs'):
print(f"Epoch {epoch+1}\n-------------------------------")
# enumerate mini batches
for batch, (inputs, targets) in enumerate(train_dl):
# clear the gradients
optimizer.zero_grad()
# compute the model output
yhat = model(inputs)
# calculate loss
loss = criterion(yhat, targets)
# credit assignment
loss.backward()
# update model weights
optimizer.step()

#if batch % 100 == 0:


loss, current = loss.item(), batch * len(inputs)
print(f"loss: {loss:>7f} [{current:>5d}/{size:>5d}]")

# evaluate the model


def evaluate_model(test_dl, model):
predictions, actuals = list(), list()
for i, (inputs, targets) in enumerate(test_dl):
# evaluate the model on the test set
yhat = model(inputs)
# retrieve numpy array
yhat = yhat.detach().numpy()
actual = targets.numpy()
actual = actual.reshape((len(actual), 1))
# round to class values
yhat = yhat.round()
# store
predictions.append(yhat)
actuals.append(actual)
predictions, actuals = vstack(predictions), vstack(actuals)



# calculate accuracy
acc = accuracy_score(actuals, predictions)
return acc

# make a class prediction for one row of data


def predict(row, model):
# convert row to data
row = Tensor([row])
# make prediction
yhat = model(row)
# retrieve numpy array
yhat = yhat.detach().numpy()
return yhat

#prepare the data


path = 'datasets/ionosphere.csv'
train_dl, test_dl = prepare_data(path)
print(len(train_dl.dataset), len(test_dl.dataset))
# define the network
model = MLP(34)
# train the model
train_model(train_dl, model)
# evaluate the model
acc = evaluate_model(test_dl, model)
print('Accuracy: %.3f' % acc)
# make a single prediction (expect class=1)
row = [1,0,0.99539,-0.05889,0.85243,0.02306,0.83398,-0.37708,1,0.03760,0.85243,-0.1
7755,0.59755,-0.44945,0.60536,-0.38223,0.84356,-0.38542,0.58212,-0.32192,0.56971
,-0.29674,0.36946,-0.47357,0.56811,-0.51171,0.41078,-0.46168,0.21266,-0.34090,0.
42267,-0.54487,0.18641,-0.45300]
yhat = predict(row, model)
print('Predicted: %.3f (class=%d)' % (yhat, yhat.round()))

3.3.3 Multiclass Classification MLP Pytorch

DNN Pytorch
# pytorch mlp for multiclass classification
from numpy import vstack
from numpy import argmax
from pandas import read_csv
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import accuracy_score
from torch import Tensor
from torch.utils.data import Dataset
from torch.utils.data import DataLoader



from torch.utils.data import random_split
from torch.nn import Linear
from torch.nn import ReLU
from torch.nn import Softmax
from torch.nn import Module
from torch.optim import SGD
from torch.nn import CrossEntropyLoss
from torch.nn.init import kaiming_uniform_
from torch.nn.init import xavier_uniform_
from tqdm import tqdm

# dataset definition
class CSVDataset(Dataset):
# load the dataset
def __init__(self, path):
# load the csv file as a dataframe
df = read_csv(path, header=None)
# store the inputs and outputs
self.X = df.values[:, :-1]
self.y = df.values[:, -1]
# ensure input data is floats
self.X = self.X.astype('float32')
# label encode target and ensure the values are floats
self.y = LabelEncoder().fit_transform(self.y)

# number of rows in the dataset


def __len__(self):
return len(self.X)

# get a row at an index


def __getitem__(self, idx):
return [self.X[idx], self.y[idx]]

# get indexes for train and test rows


def get_splits(self, n_test=0.33):
# determine sizes
test_size = round(n_test * len(self.X))
train_size = len(self.X) - test_size
# calculate the split
return random_split(self, [train_size, test_size])

# model definition
class MLP(Module):
# define model elements
def __init__(self, n_inputs):



super(MLP, self).__init__()
# input to first hidden layer
self.hidden1 = Linear(n_inputs, 10)
kaiming_uniform_(self.hidden1.weight, nonlinearity='relu')
self.act1 = ReLU()
# second hidden layer
self.hidden2 = Linear(10, 8)
kaiming_uniform_(self.hidden2.weight, nonlinearity='relu')
self.act2 = ReLU()
# third hidden layer and output
self.hidden3 = Linear(8, 3)
xavier_uniform_(self.hidden3.weight)
self.act3 = Softmax(dim=1)

# forward propagate input


def forward(self, X):
# input to first hidden layer
X = self.hidden1(X)
X = self.act1(X)
# second hidden layer
X = self.hidden2(X)
X = self.act2(X)
# output layer
X = self.hidden3(X)
X = self.act3(X)
return X

# prepare the dataset


def prepare_data(path):
# load the dataset
dataset = CSVDataset(path)
# calculate split
train, test = dataset.get_splits()
# prepare data loaders
train_dl = DataLoader(train, batch_size=32, shuffle=True)
test_dl = DataLoader(test, batch_size=1024, shuffle=False)
return train_dl, test_dl

# train the model


def train_model(train_dl, model):
size = len(train_dl.dataset)
# define the optimization
criterion = CrossEntropyLoss()
optimizer = SGD(model.parameters(), lr=0.01, momentum=0.9)



# enumerate epochs
for epoch in tqdm(range(500),desc='Training Epochs'):
print(f"Epoch {epoch+1}\n-------------------------------")
# enumerate mini batches
for batch, (inputs, targets) in enumerate(train_dl):
# clear the gradients
optimizer.zero_grad()
# compute the model output
yhat = model(inputs)
# calculate loss
loss = criterion(yhat, targets)
# credit assignment
loss.backward()
# update model weights
optimizer.step()

#if batch % 100 == 0:


loss, current = loss.item(), batch * len(inputs)
print(f"loss: {loss:>7f} [{current:>5d}/{size:>5d}]")

# evaluate the model


def evaluate_model(test_dl, model):
predictions, actuals = list(), list()
for i, (inputs, targets) in enumerate(test_dl):
# evaluate the model on the test set
yhat = model(inputs)
# retrieve numpy array
yhat = yhat.detach().numpy()
actual = targets.numpy()
# convert to class labels
yhat = argmax(yhat, axis=1)
# reshape for stacking
actual = actual.reshape((len(actual), 1))
yhat = yhat.reshape((len(yhat), 1))
# store
predictions.append(yhat)
actuals.append(actual)
predictions, actuals = vstack(predictions), vstack(actuals)
# calculate accuracy
acc = accuracy_score(actuals, predictions)
return acc

# make a class prediction for one row of data


def predict(row, model):
# convert row to data



row = Tensor([row])
# make prediction
yhat = model(row)
# retrieve numpy array
yhat = yhat.detach().numpy()
return yhat

# prepare the data


path = 'datasets/Iris.csv'
train_dl, test_dl = prepare_data(path)
print(len(train_dl.dataset), len(test_dl.dataset))
# define the network
model = MLP(4)
# train the model
train_model(train_dl, model)
# evaluate the model
acc = evaluate_model(test_dl, model)
print('Accuracy: %.3f' % acc)
# make a single prediction
row = [5.1,3.5,1.4,0.2]
yhat = predict(row, model)
print('Predicted: %s (class=%d)' % (yhat, argmax(yhat)))

3.4 DNN Keras

3.4.1 Regression MLP Keras

DNN Keras
# Regression Example With Boston Dataset: Standardized
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import sklearn
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

# Import necessary modules


from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from math import sqrt
# load dataset
dataframe = pd.read_csv("datasets/housing.csv", header=None)
print(dataframe.shape)
#dataframe.head()



dataset = dataframe.values
# split into input (X) and output (Y) variables
X = dataset[:,0:13]
Y = dataset[:,13]

X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.30, random_state=40)

# define base model


def baseline_model():
# create model
# Define model
model = Sequential()
model.add(Dense(500, input_dim=13, activation= "relu"))
model.add(Dense(100, activation= "relu"))
model.add(Dense(50, activation= "relu"))
model.add(Dense(1))
model.compile(loss="mean_squared_error", optimizer="adam", metrics=["mean_squared_error"])
return model

model = baseline_model()
model.fit(X_train, y_train, epochs=20)

pred_train= model.predict(X_train)
print(np.sqrt(mean_squared_error(y_train,pred_train)))

pred= model.predict(X_test)
print(np.sqrt(mean_squared_error(y_test,pred)))

3.4.2 Binary Classification MLP Keras

DNN Keras
# first neural network with keras tutorial
from numpy import loadtxt
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
# load the dataset
dataset = loadtxt('datasets/pima-indians-diabetes.csv', delimiter=',')
# split into input (X) and output (y) variables
X = dataset[:,0:8]
y = dataset[:,8]
# define the keras model
model = Sequential()
model.add(Dense(12, input_shape=(8,), activation='relu'))



model.add(Dense(8, activation='relu'))
model.add(Dense(1, activation='sigmoid'))
# compile the keras model
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
# fit the keras model on the dataset
model.fit(X, y, epochs=150, batch_size=10)
# evaluate the keras model
_, accuracy = model.evaluate(X, y)
print('Accuracy: %.2f' % (accuracy*100))

3.4.3 Multiclass Classification MLP Keras

DNN Keras
# multi-class classification with Keras
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import sklearn
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.wrappers.scikit_learn import KerasClassifier
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import KFold
from sklearn.preprocessing import MinMaxScaler
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import LabelEncoder
from sklearn.pipeline import Pipeline
from keras.utils import to_categorical

# Random seed for reproducibility


seed = 10
np.random.seed(seed)

# load dataset
dataframe = pd.read_csv("datasets/Iris.csv", header=None)
dataset = dataframe.values
X = dataset[:,0:4].astype(float)
Y = dataset[:,4]
# encode class values as integers
encoder = LabelEncoder()
encoder.fit(Y)
encoded_Y = encoder.transform(Y)
# convert integers to dummy variables (i.e. one hot encoded)
dummy_y = to_categorical(encoded_Y)



# define baseline model
def baseline_model():
# create model
model = Sequential()
model.add(Dense(8, input_dim=4, activation='relu'))
model.add(Dense(3, activation='softmax'))
# Compile model
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
return model

estimator = KerasClassifier(build_fn = baseline_model, epochs = 100, batch_size = 10, verbose = 0)
# KFold Cross Validation (try different values of n_splits, e.g., 10)
kfold = KFold(n_splits = 5, shuffle = True, random_state = seed)
# Object to describe the result
results = cross_val_score(estimator, X, dummy_y, cv = kfold)
# Result
print("Result: %.2f%% (%.2f%%)" % (results.mean()*100, results.std()*100))



Chapitre 4
Convolution neural networks
4.1 The first approach

The tools we have had so far are expensive to learn, will not generalize well, and do not exploit the order and local relations in the data!

4.2 Convolutional neural network


Convolutional Neural Networks are very similar to ordinary Neural Networks from the previous
chapter : they are made up of neurons that have learnable weights and biases. Each neuron
receives some inputs, performs a dot product and optionally follows it with a non-linearity. The
whole network still expresses a single differentiable score function : from the raw image pixels on
one end to class scores at the other. And they still have a loss function (e.g. SVM/Softmax) on
the last (fully-connected) layer and all the tips/tricks we developed for learning regular Neural
Networks still apply.

Advantage of Convolutional Neural Networks

Consequently, the biggest advantage of a convolutional neural network, when compared to a fully
connected neural network, is a smaller number of parameters.

For example, if the input I has dimension 32 × 32 and we apply 10 filters of dimension 3 × 3, the output will be a tensor of shape 30 × 30 × 10. Every filter has 3 · 3 = 9 parameters plus one bias element, which gives in total, for 10 filters, 10 × 10 = 100 parameters. On the other hand, with a fully connected neural network, we would need to flatten the input matrix into a 32 · 32 = 1024 dimensional vector. In order to obtain an output with the same dimension as above (30 · 30 · 10 = 9000 units), we would need 1024 × 30 × 30 × 10 = 9 216 000 parameters, or weights.

4.3 Architecture of a traditional CNN



VGG Net

4.3.1 Convolutional Layer


The convolution layer (CONV) uses filters that perform convolution operations as it is scanning
the input II with respect to its dimensions. Its hyperparameters include the filter size F and
stride S. The resulting output O is called feature map or activation map.
Convolution is defined as :
g(x, y) = w ∗ f(x, y) = Σ_{s=s_min}^{s_max} Σ_{t=t_min}^{t_max} w(s, t) f(x + s, y + t)

where f(x, y) is the input image and w is the filter or kernel. More intuitively, we can imagine
this process looking at the illustration below :

The value of Pixel(1,3) = 1*1 + 1*0 + 0*1 + 1*0 + 1*1 + 1*0 + 0*1 + 1*0 + 1*1 = 1 + 1 + 1 = 3.
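A small sketch reproducing this sliding-window computation with F.conv2d (the 3 × 3 kernel and the 5 × 5 input values are illustrative):

import torch
import torch.nn.functional as F

# 1 image, 1 channel, 5x5 input and a 3x3 kernel
img = torch.arange(25, dtype=torch.float32).reshape(1, 1, 5, 5)
kernel = torch.tensor([[1., 0., 1.],
                       [0., 1., 0.],
                       [1., 0., 1.]]).reshape(1, 1, 3, 3)

out = F.conv2d(img, kernel)   # no padding, stride 1 -> 3x3 output
print(out.shape)              # torch.Size([1, 1, 3, 3])
print(out[0, 0, 0, 0])        # tensor(30.): the top-left patch weighted by the kernel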

4.3.2 Kernel hyperparameters


Kernel

A filter provides a measure for how close a patch or a region of the input resembles a feature. It
acts as a single template or pattern, which, when convolved across the input, finds similarities



between the stored template & different locations/regions in the input image. In our case is an
edge detector.
Dimensions of a filter

A filter of size F × F applied to an input containing C channels is an F × F × C volume that performs convolutions on an input of size I × I × C and produces an output feature map (also called activation map) of size O × O × 1.

Padding

Zero-padding denotes the process of adding P zeroes to each side of the boundaries of the input.
This value can either be manually specified or automatically set through one of the three modes
detailed below :

an input of 6 X 6 dimension with a 3 X 3 filter results in 4 X 4 output. We can generalize it :


— Input : n X n
— Filter/Kernel size : f X f
— Output : (n-f+1) X (n-f+1)

Every time we apply a convolutional operation, the size of the image shrinks. Pixels present in
the corner of the image are used only a few number of times during convolution as compared
to the central pixels. Hence, we do not focus too much on the corners since that can lead to
information loss.

To overcome these issues, we can pad the image with an additional border, i.e., we add one pixel
all around the edges. This means that the input will be an 8 X 8 matrix (instead of a 6 X 6
matrix). Applying convolution of 3 X 3 on it will result in a 6 X 6 matrix which is the original
shape of the image. This is where padding comes to the fore :

— Input : n X n
— Filter/Kernel size : f X f
— Padding : p
— Output : (n+2p-f+1) X (n+2p-f+1)



Stride

For a convolutional or a pooling operation, the stride S denotes the number of pixels by which
the window moves after each operation.

The dimensions for stride s will be :


— Input : n X n
— Padding : p
— Stride : s
— Filter size : f X f
— Output : [(n+2p-f)/s+1] X [(n+2p-f)/s+1]

4.3.3 Convolutions Over Volume


Suppose, instead of a 2-D image, we have a 3-D input image of shape 6 X 6 X 3. How will we
apply convolution on this image ? We will use a 3 X 3 X 3 filter instead of a 3 X 3 filter. Let’s
look at an example :

— Input : 6 X 6 X 3
— Filter : 3 X 3 X 3

Instead of using just a single filter, we can use multiple filters as well. Let’s say the first filter
will detect vertical edges and the second filter will detect horizontal edges from the image. If we
use multiple filters, the output dimension will change. So, instead of having a 4 X 4 output as in
the above example, we would have a 4 X 4 X 2 output (if we have used 2 filters) :

Generalized dimensions can be given as :



— Input : n X n X nc
— Filter : f X f X nc
— Padding : p
— Stride : s
— Output : [(n+2p-f)/s+1] X [(n+2p-f)/s+1] X nc’

Here, nc is the number of channels in the input and filter, while nc’ is the number of filters.

4.3.4 The Conv Layer process


— Accepts a volume of size W1 × H1 × D1
— Requires four hyperparameters :
— Number of filters K,
— their spatial extent F,
— the stride S,
— the amount of zero padding P.
— Produces a volume of size W2 × H2 × D2 where :
— W2 = (W1 − F + 2P)/S + 1
— H2 = (H1 − F + 2P)/S + 1
— D2 = K
— With parameter sharing, it introduces F ·F ·D1 weights per filter, for a total of (F ·F ·D1 )·K
weights and K biases.
— In the output volume, the d-th depth slice (of size W 2 × H2) is the result of performing
a valid convolution of the d-th filter over the input volume with a stride of S, and then
offset by d-th bias.

A common setting of the hyperparameters is F=3,S=1,P=1.
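A quick sketch checking the output-size formula with nn.Conv2d for this common setting F=3, S=1, P=1 (the input size is illustrative):

import torch
import torch.nn as nn

# W1 = H1 = 32, D1 = 3 input channels, K = 10 filters, F = 3, S = 1, P = 1
conv = nn.Conv2d(in_channels=3, out_channels=10, kernel_size=3, stride=1, padding=1)
x = torch.randn(1, 3, 32, 32)
print(conv(x).shape)   # torch.Size([1, 10, 32, 32]); (32 - 3 + 2*1)/1 + 1 = 32

# weights and biases match the parameter count given above
print(sum(p.numel() for p in conv.parameters()))   # 3*3*3*10 + 10 = 280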

4.3.5 Pooling Layer


The pooling layer (POOL) is a downsampling operation, typically applied after a convolution
layer, which does some spatial invariance. In particular, max and average pooling are special
kinds of pooling where the maximum and average value is taken, respectively.

It is common to periodically insert a Pooling layer in-between successive Conv layers in a Conv-
Net architecture. Its function is to progressively reduce the spatial size of the representation to
reduce the amount of parameters and computation in the network.

The Pooling Layer operates independently on every depth slice of the input and resizes it spa-
tially, using the MAX operation.



— Accepts a volume of size W1×H1×D1
— Requires two hyperparameters :
— their spatial extent F,
— the stride S,
— Produces a volume of size W2×H2×D2 where :
— W2 = (W1 − F)/S + 1
— H2 = (H1 − F)/S + 1
— D2 = D1
— Introduces zero parameters since it computes a fixed function of the input
— For Pooling layers, it is not common to pad the input using zero-padding.

4.3.6 Fully-Connected Layer


The fully connected layer (FC) operates on a flattened input where each input is connected to
all neurons. If present, FC layers are usually found towards the end of CNN architectures and
can be used to optimize objectives such as class scores.

4.4 CNN Explainer

https://poloclub.github.io/cnn-explainer/



4.5 Classic Networks

4.5.1 LeNet-5

The total number of parameters in LeNet-5 are :


— Parameters : 60k
— Layers flow : Conv -> Pool -> Conv -> Pool -> FC -> FC -> Output
— Activation functions : Sigmoid/tanh and ReLu

4.5.2 AlexNet

This network is similar to LeNet-5 with just more convolution and pooling layers :
— Parameters : 60 million
— Activation function : ReLu

4.5.3 VGG-16

It is a bigger network, the number of parameters are also more.


— Parameters : 138 million



4.6 CNN Pytorch
CNN Pytorch
import torch
import torchvision
import torchvision.transforms as transforms

import ssl
ssl._create_default_https_context = ssl._create_unverified_context

transform = transforms.Compose(
[transforms.ToTensor(),
transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))])

batch_size = 4

trainset = torchvision.datasets.CIFAR10(root='./data', train=True,


download=True, transform=transform)
trainloader = torch.utils.data.DataLoader(trainset, batch_size=batch_size,
shuffle=True, num_workers=2)

testset = torchvision.datasets.CIFAR10(root='./data', train=False,


download=True, transform=transform)
testloader = torch.utils.data.DataLoader(testset, batch_size=batch_size,
shuffle=False, num_workers=2)

classes = ('plane', 'car', 'bird', 'cat',


'deer', 'dog', 'frog', 'horse', 'ship', 'truck')

import matplotlib.pyplot as plt


import numpy as np

# functions to show an image

def imshow(img):
img = img / 2 + 0.5 # unnormalize
npimg = img.numpy()
plt.imshow(np.transpose(npimg, (1, 2, 0)))
plt.show()
# get some random training images
dataiter = iter(trainloader)
images, labels = next(dataiter)

# show images



imshow(torchvision.utils.make_grid(images))
# print labels
print(' '.join(f'{classes[labels[j]]:5s}' for j in range(batch_size)))

import torch.nn as nn
import torch.nn.functional as F

class Net(nn.Module):
def __init__(self):
super().__init__()
self.conv1 = nn.Conv2d(3, 6, 5)
self.pool = nn.MaxPool2d(2, 2)
self.conv2 = nn.Conv2d(6, 16, 5)
self.fc1 = nn.Linear(16 * 5 * 5, 120)
self.fc2 = nn.Linear(120, 84)
self.fc3 = nn.Linear(84, 10)

def forward(self, x):


x = self.pool(F.relu(self.conv1(x)))
x = self.pool(F.relu(self.conv2(x)))
x = torch.flatten(x, 1) # flatten all dimensions except batch
x = F.relu(self.fc1(x))
x = F.relu(self.fc2(x))
x = self.fc3(x)
return x

net = Net()

import torch.optim as optim

criterion = nn.CrossEntropyLoss()
optimizer = optim.SGD(net.parameters(), lr=0.001, momentum=0.9)

for epoch in range(2): # loop over the dataset multiple times

running_loss = 0.0
for i, data in enumerate(trainloader, 0):
# get the inputs; data is a list of [inputs, labels]
inputs, labels = data

# zero the parameter gradients


optimizer.zero_grad()

# forward + backward + optimize



outputs = net(inputs)
loss = criterion(outputs, labels)
loss.backward()
optimizer.step()

# print statistics
running_loss += loss.item()
if i % 2000 == 1999: # print every 2000 mini-batches
print(f'[{epoch + 1}, {i + 1:5d}] loss: {running_loss / 2000:.3f}')
running_loss = 0.0

print('Finished Training')

dataiter = iter(testloader)
images, labels = next(dataiter)

# print images
imshow(torchvision.utils.make_grid(images))
print('GroundTruth: ', ' '.join(f'{classes[labels[j]]:5s}' for j in range(4)))

outputs = net(images)

_, predicted = torch.max(outputs, 1)

print('Predicted: ', ' '.join(f'{classes[predicted[j]]:5s}' for j in range(4)))

correct = 0
total = 0
# since we're not training, we don't need to calculate the gradients for our outputs
with torch.no_grad():
for data in testloader:
images, labels = data
# calculate outputs by running images through the network
outputs = net(images)
# the class with the highest energy is what we choose as prediction
_, predicted = torch.max(outputs.data, 1)
total += labels.size(0)
correct += (predicted == labels).sum().item()

print(f'Accuracy of the network on the 10000 test images: {100 * correct // total} %')

# prepare to count predictions for each class


correct_pred = {classname: 0 for classname in classes}
total_pred = {classname: 0 for classname in classes}



# again no gradients needed
with torch.no_grad():
for data in testloader:
images, labels = data
outputs = net(images)
_, predictions = torch.max(outputs, 1)
# collect the correct predictions for each class
for label, prediction in zip(labels, predictions):
if label == prediction:
correct_pred[classes[label]] += 1
total_pred[classes[label]] += 1

# print accuracy for each class


for classname, correct_count in correct_pred.items():
accuracy = 100 * float(correct_count) / total_pred[classname]
print(f'Accuracy for class: {classname:5s} is {accuracy:.1f} %')

4.7 CNN Keras

CNN Keras
import keras
from keras.datasets import mnist
from keras.models import Sequential
from keras.layers import Dense, Dropout, Flatten
from keras.layers import Conv2D, MaxPooling2D
from keras import backend as K
import numpy as np

import ssl
ssl._create_default_https_context = ssl._create_unverified_context
(x_train, y_train), (x_test, y_test) = mnist.load_data()

img_rows, img_cols = 28, 28

if K.image_data_format() == 'channels_first':
x_train = x_train.reshape(x_train.shape[0], 1, img_rows, img_cols)
x_test = x_test.reshape(x_test.shape[0], 1, img_rows, img_cols)
input_shape = (1, img_rows, img_cols)
else:
x_train = x_train.reshape(x_train.shape[0], img_rows, img_cols, 1)
x_test = x_test.reshape(x_test.shape[0], img_rows, img_cols, 1)
input_shape = (img_rows, img_cols, 1)



x_train = x_train.astype('float32')
x_test = x_test.astype('float32')
x_train /= 255
x_test /= 255

y_train = keras.utils.to_categorical(y_train, 10)


y_test = keras.utils.to_categorical(y_test, 10)

model = Sequential()
model.add(Conv2D(32, kernel_size = (3, 3), activation = 'relu', input_shape =
input_shape))
model.add(Conv2D(64, (3, 3), activation = 'relu'))
model.add(MaxPooling2D(pool_size = (2, 2)))
model.add(Dropout(0.25))
model.add(Flatten())
model.add(Dense(128, activation = 'relu'))
model.add(Dropout(0.5))
model.add(Dense(10, activation = 'softmax'))

model.compile(loss = keras.losses.categorical_crossentropy,
optimizer = keras.optimizers.Adadelta(), metrics = ['accuracy'])

model.fit(
x_train, y_train,
batch_size = 128,
epochs = 12,
verbose = 1,
validation_data = (x_test, y_test)
)

score = model.evaluate(x_test, y_test, verbose = 0)

print('Test loss:', score[0])


print('Test accuracy:', score[1])

pred = model.predict(x_test)
pred = np.argmax(pred, axis = 1)[:5]
label = np.argmax(y_test,axis = 1)[:5]

print(pred)
print(label)



Chapitre 5
Sequence models (Recurrent Neural Networks and LSTM)
Sequence models are motivated by the analysis of sequential data such as text sentences, time series and other discrete sequence data. These models are specifically designed to handle sequential information, whereas convolutional neural networks are better suited to processing spatial information.

Why sequence models ? Examples of sequence data :


— Speech recognition
— Music generation
— Sentiment classification
— DNA sequence analysis
— Machine translation
— Video activity recognition
— Named entity recognition
— Image captioning

5.1 Recurrent Neural Networks


A recurrent neural network (RNN) is a special type of an artificial neural network adapted to
work for time series data or data that involves sequences.

Ordinary feed forward neural networks are only meant for data points, which are independent
of each other. However, if we have data in a sequence such that one data point depends upon
the previous data point, we need to modify the neural network to incorporate the dependencies
between these data points. RNNs have the concept of ‘memory’ that helps them store the states
or information of previous inputs to generate the next output of the sequence.

— Core idea : update hidden state h based on input and previous hidden state using same
update rule (same/shared parameters) at each time step.

— Allows for processing sequences of variable length, not only fixed-sized vectors.
— Infinite memory : h is function of all previous inputs (long-term dependencies).

5.1.1 Basic Recurrent Neural Network

ht = tanh(Ah ht−1 + Ax xt + b)
ŷt = Ay ht
— Hidden state ht = linear combination of input xt and previous hidden state ht−1 .
— Output ŷt = linear prediction based on current hidden state ht .
— tanh(·) is commonly used as activation function (data is in the range [-1, 1]).
— Parameters Ah , Ax , Ay , b are constant over time (sequences may vary in length).
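As an illustration, here is a minimal sketch of this update rule in PyTorch (toy dimensions and random parameters chosen purely for illustration, not taken from any later example) :

import torch

# toy sizes, chosen only for illustration
D, H = 3, 4                      # input size, hidden size
A_x = torch.randn(H, D) * 0.1    # input-to-hidden weights
A_h = torch.randn(H, H) * 0.1    # hidden-to-hidden weights
A_y = torch.randn(1, H) * 0.1    # hidden-to-output weights
b = torch.zeros(H)

def rnn_step(x_t, h_prev):
    # h_t = tanh(A_h h_{t-1} + A_x x_t + b)
    h_t = torch.tanh(A_h @ h_prev + A_x @ x_t + b)
    # y_t = A_y h_t
    y_t = A_y @ h_t
    return h_t, y_t

h = torch.zeros(H)
for x_t in torch.randn(5, D):    # a sequence of length 5
    h, y = rnn_step(x_t, h)      # the same parameters are reused at every time step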

5.1.2 Architecture of a traditional RNN

For each timestep t, the activation a<t> and the output y <t> are expressed as follows :

a<t> = g1 (Waa a<t−1> + Wax x<t> + ba )


y <t> = g2 (Wya a<t> + by )
where Wax , Waa , Wya , ba , by are coefficients that are shared temporally and g1 , g2 activation func-
tions.



5.1.3 Different types of RNNs
RNN models are mostly used in the fields of natural language processing and speech recognition.

One-to-one Tx = Ty = 1

One-to-many Tx = 1, Ty > 1

Many-to-one Tx > 1, Ty = 1

Many-to-many Tx > 1, Ty > 1, Tx = Ty



Many-to-many Tx > 1, Ty > 1, Tx ≠ Ty

5.1.4 Backpropagation through time


Backpropagation is done at each point in time. At timestep T, the derivative of the loss L with
respect to weight matrix W is expressed as follows :

∂L(T) / ∂W = Σ_{t=1}^{T} ∂L(T) / ∂W |_(t)

— To train RNNs, we backpropagate gradients through time.


— As all hidden RNN cells share their parameters, gradients get accumulated.
— However, this quickly becomes intractable (in memory) for longer sequences.

5.1.5 RNN Problems


Vanishing/exploding gradients : the vanishing and exploding gradient phenomena are often encountered in the context of RNNs. They happen because it is difficult to capture long-term dependencies : the multiplicative gradient can decrease or increase exponentially with respect to the number of layers.



Gradient clipping : it is a technique used to cope with the exploding gradient problem sometimes
encountered when performing backpropagation. By capping the maximum value for the gradient,
this phenomenon is controlled in practice.
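A minimal sketch of gradient clipping in PyTorch (the model, data and optimizer here are placeholders, not taken from the course examples) :

import torch
import torch.nn as nn

model = nn.RNN(input_size=1, hidden_size=32, batch_first=True)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
criterion = nn.MSELoss()

x = torch.randn(8, 20, 1)      # (batch, seq_len, input_size)
y = torch.randn(8, 20, 32)     # dummy target with hidden_size features

out, _ = model(x)
loss = criterion(out, y)
optimizer.zero_grad()
loss.backward()
# cap the global gradient norm before the update to control exploding gradients
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
optimizer.step()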

5.1.6 Multi-Layer RNNs

h1t = tanh(A1h h1t−1 + A1x xt + b1 )

h2t = tanh(A2h h2t−1 + A2x h1t + b2 )


ŷt = Ay h2t
— Deeper multi-layer RNNs can be constructed by stacking RNN layers.
— An alternative is to make each individual computation (=RNN cell) deeper.
— Today, often combined with residual connections in vertical direction.

5.1.7 The Problem of Long-Term Dependencies


One of the appeals of RNNs is the idea that they might be able to connect previous information to the present task ; for example, previous video frames might inform the understanding of the present frame. If RNNs could do this, they’d be extremely useful. But can they ? It depends.

Sometimes, we only need to look at recent information to perform the present task. For example,
consider a language model trying to predict the next word based on the previous ones. If we are
trying to predict the last word in “the clouds are in the sky,” we don’t need any further context
– it’s pretty obvious the next word is going to be sky. In such cases, where the gap between
the relevant information and the place that it’s needed is small, RNNs can learn to use the past
information.

But there are also cases where we need more context. Consider trying to predict the last word
in the text “I grew up in France. . . I speak fluent French.” Recent information suggests that the
next word is probably the name of a language, but if we want to narrow down which language,
we need the context of France, from further back. It’s entirely possible for the gap between the
relevant information and the point where it is needed to become very large.

Unfortunately, as that gap grows, RNNs become unable to learn to connect the information.

5.2 Gated Recurrent Unit


The Gated Recurrent Unit is one of the ideas that has enabled RNNs to become much better at capturing very long-range dependencies and has made them much more effective ; it deals with the vanishing gradient problem encountered by traditional RNNs.

5.2.1 Gates
In order to remedy the vanishing gradient problem, specific gates are used in some types of RNNs
and usually have a well-defined purpose. They are usually noted Γ "Gamma" and are equal to :

Γ = σ(W x<t> + U a<t−1> + b)

where W, U, b are coefficients specific to the gate and σ is the sigmoid function. The main ones are summed up below :



— Update gate Γu , How much past should matter now ? , GRU , LSTM.
— Relevance gate Γr , Drop previous information ? , GRU , LSTM.
— Forget gate Γf , Erase a cell or not ? , LSTM.
— Output gate Γo , How much to reveal of a cell ? , LSTM.

5.2.2 GRU Unit

rt = Γr = σ(Wrh ht−1 + Wrx xt + br )

ut = Γu = σ(Wuh ht−1 + Wux xt + bu)


st = tanh(Wsh (rt ht−1 ) + Wsx xt + bs )
ht = ut ht−1 + (1 − ut ) st
— Reset gate controls which parts of the state are used to compute next target state.
— Update gate controls how much information to pass from previous time step.
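A short sketch mapping these equations to code (toy dimensions, random weights ; the products between gates and states are element-wise, written * below) :

import torch

D, H = 3, 4
W_rx, W_rh, b_r = torch.randn(H, D), torch.randn(H, H), torch.zeros(H)
W_ux, W_uh, b_u = torch.randn(H, D), torch.randn(H, H), torch.zeros(H)
W_sx, W_sh, b_s = torch.randn(H, D), torch.randn(H, H), torch.zeros(H)

def gru_step(x_t, h_prev):
    r = torch.sigmoid(W_rh @ h_prev + W_rx @ x_t + b_r)      # reset gate
    u = torch.sigmoid(W_uh @ h_prev + W_ux @ x_t + b_u)      # update gate
    s = torch.tanh(W_sh @ (r * h_prev) + W_sx @ x_t + b_s)   # candidate state
    return u * h_prev + (1 - u) * s                          # new hidden state

h_t = gru_step(torch.randn(D), torch.zeros(H))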

5.3 Long Short-Term Memory


Long Short-Term Memory units (LSTM) deal with the vanishing gradient problem encountered
by traditional RNNs, with LSTM being a generalization of GRU.



ft = σ(Wf h ht−1 + Wf x xt + bf )

it = σ(Wih ht−1 + Wix xt + bi )


ot = σ(Woh ht−1 + Wox xt + bo )
st = tanh(Wsh ht−1 + Wsx xt + bs )
ct = ft ct−1 + it st
ht = ot tanh(ct )
Passes along an additional cell state c in addition to the hidden state h. Has 3 gates :

— Forget gate determines information to erase from cell state.


— Input gate determines which values of cell state to update.
— Output gate determines which elements of cell state to reveal at time t.
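A brief sketch using PyTorch's built-in nn.LSTMCell (toy sizes assumed), showing that an LSTM carries both a hidden state h and a cell state c through the sequence :

import torch
import torch.nn as nn

cell = nn.LSTMCell(input_size=3, hidden_size=4)   # gates f, i, o are computed internally
h, c = torch.zeros(1, 4), torch.zeros(1, 4)       # hidden state and cell state
for x_t in torch.randn(5, 1, 3):                  # sequence of length 5, batch of 1
    h, c = cell(x_t, (h, c))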

5.3.1 The Core Idea Behind LSTMs


The key to LSTMs is the cell state, the horizontal line running through the top of the diagram.

The cell state is kind of like a conveyor belt. It runs straight down the entire chain, with only
some minor linear interactions. It’s very easy for information to just flow along it unchanged.

The LSTM does have the ability to remove or add information to the cell state, carefully regu-
lated by structures called gates.

Gates are a way to optionally let information through. They are composed out of a sigmoid
neural net layer and a pointwise multiplication operation.



The sigmoid layer outputs numbers between zero and one, describing how much of each com-
ponent should be let through. A value of zero means “let nothing through,” while a value of one
means “let everything through !”

An LSTM has three of these gates, to protect and control the cell state.

5.3.2 Step-by-Step LSTM Walk Through


The first step in our LSTM is to decide what information we’re going to throw away from the cell state. This decision is made by a sigmoid layer called the "forget gate layer". It looks at ht−1 and xt , and outputs a number between 0 and 1 for each number in the cell state Ct−1 : a 1 represents "completely keep this" while a 0 represents "completely get rid of this".

Let’s go back to our example of a language model trying to predict the next word based on all the
previous ones. In such a problem, the cell state might include the gender of the present subject,
so that the correct pronouns can be used. When we see a new subject, we want to forget the
gender of the old subject.

The next step is to decide what new information we’re going to store in the cell state. This has
two parts. First, a sigmoid layer called the “input gate layer” decides which values we’ll update.
Next, a tanh layer creates a vector of new candidate values, C̃t , that could be added to the state.
In the next step, we’ll combine these two to create an update to the state.

In the example of our language model, we’d want to add the gender of the new subject to the
cell state, to replace the old one we’re forgetting.



It’s now time to update the old cell state, Ct−1 , into the new cell state Ct . The previous steps
already decided what to do, we just need to actually do it.

We multiply the old state by ft , forgetting the things we decided to forget earlier. Then we add it ∗ C̃t , the new candidate values scaled by how much we decided to update each state value.
In the case of the language model, this is where we’d actually drop the information about the
old subject’s gender and add the new information, as we decided in the previous steps.

Finally, we need to decide what we’re going to output. This output will be based on our cell
state, but will be a filtered version. First, we run a sigmoid layer which decides what parts of the
cell state we’re going to output. Then, we put the cell state through tanh(to push the values to
be between -1 and 1 ) and multiply it by the output of the sigmoid gate, so that we only output
the parts we decided to.

For the language model example, since it just saw a subject, it might want to output information
relevant to a verb, in case that’s what is coming next. For example, it might output whether
the subject is singular or plural, so that we know what form a verb should be conjugated into if
that’s what follows next.



5.3.3 Example of LSTM Forecasting

The same weights are used in all LSTM cells.

5.4 RNN/LSTM/GRU Pytorch

5.4.1 Simple RNN/LSTM/GRU Pytorch One To One

Simple RNN/LSTM/GRU Pytorch


import torch
from torch import nn



import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

# data generation

seq_length = 20
time_steps = np.linspace(0, np.pi, seq_length + 1)
data = np.sin(time_steps)
data.resize((seq_length + 1, 1))
#size becomes (seq_length+1, 1), adds an input_size dimension

x = data[:-1] # all but the last piece of data


y = data[1:] # all but the first

class RNN(nn.Module):
def __init__(self, input_size, output_size, hidden_dim, n_layers):
super(RNN, self).__init__()

self.hidden_dim=hidden_dim
# batch_first means that the first dim of the input and output will be the batch_size
self.rnn = nn.RNN(input_size, hidden_dim, n_layers, batch_first=True)
# last, fully-connected layer
self.fc = nn.Linear(hidden_dim, output_size)

def forward(self, x, hidden):


# x (batch_size, seq_length, input_size)
# hidden (n_layers, batch_size, hidden_dim)
# r_out (batch_size, time_step, hidden_size)
batch_size = x.size(0)

# get RNN outputs


r_out, hidden = self.rnn(x, hidden)
# shape output to be (batch_size*seq_length, hidden_dim)
r_out = r_out.view(-1, self.hidden_dim)

# get final output


output = self.fc(r_out)

return output, hidden

input_size=1
output_size=1
hidden_dim=32



n_layers=1

rnn = RNN(input_size, output_size, hidden_dim, n_layers)


print(rnn)

criterion = nn.MSELoss()
optimizer = torch.optim.Adam(rnn.parameters(), lr=0.01)

def train(rnn, n_steps, print_every):

# initialize the hidden state


hidden = None

for batch_i, step in enumerate(range(n_steps)):


# defining the training data
time_steps = np.linspace(step * np.pi, (step+1)*np.pi, seq_length + 1)
data = np.sin(time_steps)
data.resize((seq_length + 1, 1)) # input_size=1

x = data[:-1]
y = data[1:]

x_tensor = torch.Tensor(x).unsqueeze(0) # unsqueeze adds a batch_size dimension of 1
y_tensor = torch.Tensor(y)

# outputs from the rnn


prediction, hidden = rnn(x_tensor, hidden)

# Representing Memory #
# make a new variable for hidden and detach the hidden state from its history
# this way, we don't backpropagate through the entire history
hidden = hidden.data

# calculate the loss


loss = criterion(prediction, y_tensor)
# zero gradients
optimizer.zero_grad()
# perform backprop and update weights
loss.backward()
optimizer.step()

# display loss and predictions


if batch_i%print_every == 0:
print('Loss: ', loss.item())



plt.plot(time_steps[1:], x, 'r.') # input
plt.plot(time_steps[1:], prediction.data.numpy().flatten(), 'b.') # predictions
plt.show()

return rnn

n_steps = 10
print_every = 2
train(rnn, n_steps, print_every)

5.4.2 LSTM One to Many : Image Captioning

https://medium.com/@deepeshrishu09/automatic-image-captioning-with-pytorch-cf576c98d319

5.4.3 LSTM Many to One : Sentiment Analysis


https://www.kaggle.com/code/arunmohan003/sentiment-analysis-using-lstm-pytorch

5.4.4 LSTM Many to Many : Text Generation


https://www.kaggle.com/code/krishanudb/lstm-character-word-pos-tag-model-pytorch

5.4.5 Time series Pytorch

Time Series Pytorch


import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import torch
import torch.nn as nn
import torch.optim as optim
import torch.utils.data as data



df = pd.read_csv('airline-passengers.csv')
timeseries = df[["Passengers"]].values.astype('float32')

# train-test split for time series


train_size = int(len(timeseries) * 0.67)
test_size = len(timeseries) - train_size
train, test = timeseries[:train_size], timeseries[train_size:]

def create_dataset(dataset, lookback):


"""Transform a time series into a prediction dataset

Args:
dataset: A numpy array of time series, first dimension is the time steps
lookback: Size of window for prediction
"""
X, y = [], []
for i in range(len(dataset)-lookback):
feature = dataset[i:i+lookback]
target = dataset[i+1:i+lookback+1]
X.append(feature)
y.append(target)
return torch.tensor(X), torch.tensor(y)

lookback = 4
X_train, y_train = create_dataset(train, lookback=lookback)
X_test, y_test = create_dataset(test, lookback=lookback)

class AirModel(nn.Module):
def __init__(self):
super().__init__()
self.lstm = nn.LSTM(input_size=1, hidden_size=50, num_layers=1, batch_first=True)
self.linear = nn.Linear(50, 1)
def forward(self, x):
x, _ = self.lstm(x)
x = self.linear(x)
return x

model = AirModel()
optimizer = optim.Adam(model.parameters())
loss_fn = nn.MSELoss()
loader = data.DataLoader(data.TensorDataset(X_train, y_train), shuffle=True,
batch_size=8)

n_epochs = 2000
for epoch in range(n_epochs):



model.train()
for X_batch, y_batch in loader:
y_pred = model(X_batch)
loss = loss_fn(y_pred, y_batch)
optimizer.zero_grad()
loss.backward()
optimizer.step()
# Validation
if epoch % 100 != 0:
continue
model.eval()
with torch.no_grad():
y_pred = model(X_train)
train_rmse = np.sqrt(loss_fn(y_pred, y_train))
y_pred = model(X_test)
test_rmse = np.sqrt(loss_fn(y_pred, y_test))
print("Epoch %d: train RMSE %.4f, test RMSE %.4f" % (epoch, train_rmse,
test_rmse))

with torch.no_grad():
# shift train predictions for plotting
train_plot = np.ones_like(timeseries) * np.nan
y_pred = model(X_train)
y_pred = y_pred[:, -1, :]
train_plot[lookback:train_size] = model(X_train)[:, -1, :]
# shift test predictions for plotting
test_plot = np.ones_like(timeseries) * np.nan
test_plot[train_size+lookback:len(timeseries)] = model(X_test)[:, -1, :]
# plot
plt.plot(timeseries)
plt.plot(train_plot, c='r')
plt.plot(test_plot, c='g')
plt.show()

5.5 RNN Keras

5.5.1 Basic RNN

RNN Basic Keras


import tensorflow as tf
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

print('Tensorflow: {}'.format(tf.__version__))



plt.rcParams['figure.figsize'] = (16, 10)
plt.rc('font', size=15)

h = [1, 0, 0, 0]
e = [0, 1, 0, 0]
l = [0, 0, 1, 0]
o = [0, 0, 0, 1]
# In tensorflow, to insert this data, we need to reshape it as (batch_size, sequence_length, sequence_width)
X_data = np.array([[h]], dtype=np.float32)
X_data.shape

hidden_size = 2 # 2 nodes
cell = tf.keras.layers.SimpleRNNCell(units=hidden_size)
rnn = tf.keras.layers.RNN(cell, return_sequences=True, return_state=True)
outputs, states = rnn(X_data)

print("X_data: {}, shape: {}".format(X_data, X_data.shape))


print("output: {}, shape: {}".format(outputs, outputs.shape))
print("states: {}, shape: {}".format(states, states.shape))

rnn_2 = tf.keras.layers.SimpleRNN(units=hidden_size, return_sequences=True,


return_state=True)
outputs, states = rnn_2(X_data)

print("X_data: {}, shape: {}".format(X_data, X_data.shape))


print("output: {}, shape: {}".format(outputs, outputs.shape))
print("states: {}, shape: {}".format(states, states.shape))

# Unfolding multiple sequences


X_data = np.array([[h, e, l, l, o]], dtype=np.float32)
X_data.shape

cell = tf.keras.layers.SimpleRNNCell(units=hidden_size)
rnn = tf.keras.layers.RNN(cell, return_sequences=True, return_state=True)
outputs, states = rnn(X_data)

print("X_data: {}, shape: {}\n".format(X_data, X_data.shape))


print("output: {}, shape: {}\n".format(outputs, outputs.shape))
#hidden states
print("states: {}, shape: {}".format(states, states.shape))

5.5.2 RNN GRU LSTM Keras



RNN GRU LSTM Keras
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential()
# each exmple is a vector of 28
#Gru
model.add(layers.GRU(64, input_shape=(None, 28)))
# Simple RNN
#model.add(layers.SimpleRNN(64, input_shape=(None, 28)))
#LSTM
#model.add(layers.LSTM(64, input_shape=(None, 28)))
model.add(layers.BatchNormalization())
model.add(layers.Dense(10))
print(model.summary())

mnist = keras.datasets.mnist
(x_train, y_train), (x_test, y_test) = mnist.load_data()
x_train, x_test = x_train/255.0, x_test/255.0
x_validate, y_validate = x_test[:-10], y_test[:-10]
x_test, y_test = x_test[-10:], y_test[-10:]

model.compile(
loss=keras.losses.SparseCategoricalCrossentropy(from_logits=True),
optimizer="sgd",
metrics=["accuracy"],
)

model.fit(
x_train, y_train, validation_data=(x_test, y_test), batch_size=64, epochs=3
)

for i in range(10):
result = tf.argmax(model.predict(tf.expand_dims(x_test[i], 0)), axis=1)
print(result.numpy(), y_test[i])

5.5.3 RNN Many To One Keras

RNN MTO Keras

import tensorflow as tf
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd



print('Tensorflow: {}'.format(tf.__version__))

plt.rcParams['figure.figsize'] = (16, 10)


plt.rc('font', size=15)

words = ['good', 'bad', 'worse', 'so good']


y = [1, 0, 0, 1]

# we can generate the token dictionary, which is the mapping table for each character
# the format (or shape) of the input data must be fixed, so we need the concept of padding: words with the same length
char_set = ['<pad>'] + sorted(list(set(''.join(words))))
idx2char = {idx:char for idx, char in enumerate(char_set)}
char2idx = {char:idx for idx, char in enumerate(char_set)}

char_set

X = list(map(lambda word: [char2idx.get(char) for char in word], words))


X_len = list(map(lambda word: len(word), X))

from tensorflow.keras.preprocessing.sequence import pad_sequences

# Padding the sequence of indices


max_sequence=10

X = pad_sequences(X, maxlen=max_sequence, padding='post', truncating='post')

train_ds = tf.data.Dataset.from_tensor_slices((X, y)).shuffle(buffer_size=4).batch(


batch_size=2)
print(train_ds)

input_dim = len(char2idx)
output_dim = len(char2idx)

input_dim
output_dim

from tensorflow.keras.models import Sequential


from tensorflow.keras.layers import Embedding, SimpleRNN, Dense

model = Sequential([
Embedding(input_dim=input_dim, output_dim=output_dim,



mask_zero=True, input_length=max_sequence,
trainable=False, embeddings_initializer=tf.keras.initializers.
random_normal()),
SimpleRNN(units=10),
Dense(2)
])

model.summary()

def loss_fn(model, X, y):


return tf.reduce_mean(tf.keras.losses.sparse_categorical_crossentropy(y_true=y,
y_pred=
model(X),

from_logits=True))

optimizer = tf.keras.optimizers.Adam(learning_rate=0.01)

tr_loss_hist = []

for e in range(30):
avg_tr_loss = 0
tr_step = 0

for x_mb, y_mb in train_ds:


with tf.GradientTape() as tape:
tr_loss = loss_fn(model, x_mb, y_mb)

grads = tape.gradient(tr_loss, sources=model.variables)


optimizer.apply_gradients(grads_and_vars=zip(grads, model.variables))
avg_tr_loss += tr_loss
tr_step += 1

avg_tr_loss /= tr_step
tr_loss_hist.append(avg_tr_loss)

if (e + 1) % 5 == 0:
print('epoch: {:3}, tr_loss: {:3f}'.format(e + 1, avg_tr_loss))

y_pred = model.predict(X)
y_pred = np.argmax(y_pred, axis=-1)

y_pred
X

print('acc: {:.2%}'.format(np.mean(y_pred == y)))



plt.figure()
plt.plot(tr_loss_hist)
plt.show()

5.5.4 RNN Many To Many Keras

RNN MTM Keras


import tensorflow as tf
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

print('Tensorflow: {}'.format(tf.__version__))

plt.rcParams['figure.figsize'] = (16, 10)


plt.rc('font', size=15)

sentences = [['I', 'feel', 'hungry'],


['tensorflow', 'is', 'very', 'difficult'],
['tensorflow', 'is', 'a', 'framework', 'for', 'deep', 'learning'],
['tensorflow', 'is', 'very', 'fast', 'changing']]

pos = [['pronoun', 'verb', 'adjective'],


['noun', 'verb', 'adverb', 'adjective'],
['noun', 'verb', 'determiner', 'noun', 'preposition', 'adjective', 'noun'],
['noun', 'verb', 'adverb', 'adjective', 'verb']]

word_list =['<pad>'] + sorted(set(sum(sentences, [])))


word2idx = {word:idx for idx, word in enumerate(word_list)}
idx2word = {idx:word for idx, word in enumerate(word_list)}

print(word_list)
print(word2idx)
print(idx2word)

pos_list = ['<pad>'] + sorted(set(sum(pos, [])))


pos2idx = {pos:idx for idx, pos in enumerate(pos_list)}
idx2pos = {idx:pos for idx, pos in enumerate(pos_list)}

print(pos_list)
print(pos2idx)
print(idx2pos)



X = list(map(lambda sentence: [word2idx.get(token) for token in sentence],
sentences))
y = list(map(lambda sentence: [pos2idx.get(token) for token in sentence], pos))

print(X)
print(y)

from tensorflow.keras.preprocessing.sequence import pad_sequences

X = pad_sequences(X, maxlen=10, padding='post')


X_mask = (X != 0).astype(np.float32)
X_len = np.array(list((map(lambda sentence: len(sentence), sentences))), dtype=np.
float32)

print(X)
print(X_mask)
print(X_len)

y = pad_sequences(y, maxlen=10, padding='post')

print(y)

train_ds = tf.data.Dataset.from_tensor_slices((X, y, X_len)).shuffle(buffer_size=4).batch(batch_size=2)

print(train_ds)

num_classes = len(pos2idx)
input_dim = len(word2idx)
output_dim = len(word2idx)

from tensorflow.keras.models import Sequential


from tensorflow.keras.layers import Embedding, TimeDistributed, Dense, SimpleRNN

model = Sequential([
Embedding(input_dim=input_dim, output_dim=output_dim,
mask_zero=True, trainable=False, input_length=10,
embeddings_initializer=tf.keras.initializers.random_normal()),
SimpleRNN(units=10, return_sequences=True),
TimeDistributed(Dense(units=num_classes))
])

model.summary()

def loss_fn(model, x, y, x_len, max_sequence):


masking = tf.sequence_mask(x_len, maxlen=max_sequence, dtype=tf.float32)



sequence_loss = tf.keras.losses.sparse_categorical_crossentropy(
y_true=y, y_pred=model(x), from_logits=True
) * masking
sequence_loss = tf.reduce_mean(tf.reduce_sum(sequence_loss, axis=1) / x_len)
return sequence_loss

optimizer = tf.keras.optimizers.Adam(learning_rate=0.1)

tr_loss_hist = []

for e in range(30):
avg_tr_loss = 0
tr_step = 0

for x_mb, y_mb, x_mb_len in train_ds:


with tf.GradientTape() as tape:
tr_loss = loss_fn(model, x_mb, y_mb, x_mb_len, max_sequence=10)
grads = tape.gradient(tr_loss, model.trainable_variables)
optimizer.apply_gradients(grads_and_vars=zip(grads, model.
trainable_variables))
avg_tr_loss += tr_loss
tr_step += 1
avg_tr_loss /= tr_step
tr_loss_hist.append(avg_tr_loss)

if (e + 1) % 5 == 0:
print('Epoch: {:3}, tr_loss: {:.3f}'.format(e+1, avg_tr_loss))

y_pred = model.predict(X)
y_pred = np.argmax(y_pred, axis=-1) * X_mask

y_pred

from pprint import pprint

y_pred_pos = list(map(lambda row: [idx2pos.get(elm) for elm in row], y_pred.astype(


np.int32).tolist()))
x_pred_pos = list(map(lambda row: [idx2word.get(elm) for elm in row], X.astype(np.
int32).tolist()))

pprint(y_pred_pos)

#pprint(pos)

x_pred_pos



plt.figure()
plt.plot(tr_loss_hist)
plt.title('Training loss for many-to-many model')
plt.show()

5.5.5 RNN Time Serie Keras

RNN Time Serie


#time series dataset
from pandas import read_csv
import numpy as np
from keras.models import Sequential
from keras.layers import Dense, SimpleRNN
from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics import mean_squared_error
import math
import matplotlib.pyplot as plt

def create_RNN(hidden_units, dense_units, input_shape, activation):


model = Sequential()
model.add(SimpleRNN(hidden_units, input_shape=input_shape,
activation=activation[0]))
model.add(Dense(units=dense_units, activation=activation[1]))
model.compile(loss='mean_squared_error', optimizer='adam')
return model

demo_model = create_RNN(2, 1, (3,1), activation=['linear', 'linear'])

wx = demo_model.get_weights()[0]
wh = demo_model.get_weights()[1]
bh = demo_model.get_weights()[2]
wy = demo_model.get_weights()[3]
by = demo_model.get_weights()[4]

print('wx = ', wx, ' wh = ', wh, ' bh = ', bh, ' wy =', wy, 'by = ', by)

x = np.array([1, 2, 3])
# Reshape the input to the required sample_size x time_steps x features
x_input = np.reshape(x,(1, 3, 1))
y_pred_model = demo_model.predict(x_input)

m = 2
h0 = np.zeros(m)



h1 = np.dot(x[0], wx) + h0 + bh
h2 = np.dot(x[1], wx) + np.dot(h1,wh) + bh
h3 = np.dot(x[2], wx) + np.dot(h2,wh) + bh
o3 = np.dot(h3, wy) + by

print('h1 = ', h1,'h2 = ', h2,'h3 = ', h3)

print("Prediction from network ", y_pred_model)


print("Prediction from our computation ", o3)

# Parameter split_percent defines the ratio of training examples


def get_train_test(url, split_percent=0.8):
df = read_csv(url, usecols=[1], engine='python')
data = np.array(df.values.astype('float32'))
scaler = MinMaxScaler(feature_range=(0, 1))
data = scaler.fit_transform(data).flatten()
n = len(data)
# Point for splitting data into train and test
split = int(n*split_percent)
train_data = data[range(split)]
test_data = data[split:]
return train_data, test_data, data

import ssl
ssl._create_default_https_context = ssl._create_unverified_context
sunspots_url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/monthly-sunspots.csv'
train_data, test_data, data = get_train_test(sunspots_url)

# Prepare the input X and target Y


def get_XY(dat, time_steps):
# Indices of target array
Y_ind = np.arange(time_steps, len(dat), time_steps)
Y = dat[Y_ind]
# Prepare X
rows_x = len(Y)
X = dat[range(time_steps*rows_x)]
X = np.reshape(X, (rows_x, time_steps, 1))
return X, Y

time_steps = 12
trainX, trainY = get_XY(train_data, time_steps)
testX, testY = get_XY(test_data, time_steps)

model = create_RNN(hidden_units=3, dense_units=1, input_shape=(time_steps,1),


activation=['tanh', 'tanh'])



model.fit(trainX, trainY, epochs=20, batch_size=1, verbose=2)

def print_error(trainY, testY, train_predict, test_predict):


# Error of predictions
train_rmse = math.sqrt(mean_squared_error(trainY, train_predict))
test_rmse = math.sqrt(mean_squared_error(testY, test_predict))
# Print RMSE
print('Train RMSE: %.3f RMSE' % (train_rmse))
print('Test RMSE: %.3f RMSE' % (test_rmse))

# make predictions
train_predict = model.predict(trainX)
test_predict = model.predict(testX)
# Mean square error
print_error(trainY, testY, train_predict, test_predict)

# Plot the result


def plot_result(trainY, testY, train_predict, test_predict):
actual = np.append(trainY, testY)
predictions = np.append(train_predict, test_predict)
rows = len(actual)
plt.figure(figsize=(15, 6), dpi=80)
plt.plot(range(rows), actual)
plt.plot(range(rows), predictions)
plt.axvline(x=len(trainY), color='r')
plt.legend(['Actual', 'Predictions'])
plt.xlabel('Observation number after given time steps')
plt.ylabel('Sunspots scaled')
plt.title('Actual and Predicted Values. The Red Line Separates The Training And Test Examples')
plot_result(trainY, testY, train_predict, test_predict)



Chapitre 6
Transformers
The Transformer is an architecture that aims to solve sequence-to-sequence tasks while easily handling long-distance dependencies. It computes the input and output representations without using sequence-aligned RNNs or convolutions, relying entirely on self-attention.
The Transformer architecture follows an encoder-decoder structure, see the paper "Attention Is All You Need".
Encoder : The encoder is responsible for stepping through the input time steps and encoding the
entire sequence into a fixed-length vector called a context vector.
Decoder : The decoder is responsible for stepping through the output time steps while reading
from the context vector.

Let’s see how this setup of the encoder and the decoder stack works :

6.1 Attention Is All You Need

6.1.1 Attention
The attention mechanism describes a recent new group of layers in neural networks that has
attracted a lot of interest in the past few years, especially in sequence tasks. There are a lot of
different possible definitions of “attention” in the literature, but the one we will use here is the
following : the attention mechanism describes a weighted average of (sequence) elements with
the weights dynamically computed based on an input query and elements’ keys.

— The Attention mechanism enables the transformers to have extremely long term memory.
— A transformer model can attend or focus on all previous tokens that have been generated.

The goal is to take an average over the features of multiple elements. However, instead of weighting each element equally, we want to weight them depending on their actual values. In other words, we want to dynamically decide on which inputs we want to “attend” more than others. In particular, an attention mechanism usually has four parts we need to specify :

— Query : The query is a feature vector that describes what we are looking for in the
sequence, i.e. what would we maybe want to pay attention to.
— Keys : For each input element, we have a key which is again a feature vector. This feature
vector roughly describes what the element is “offering”, or when it might be important.
The keys should be designed such that we can identify the elements we want to pay
attention to based on the query.
— Values : For each input element, we also have a value vector. This feature vector is the
one we want to average over.
— Score function : To rate which elements we want to pay attention to, we need to specify
a score function . The score function takes the query and a key as input, and output the
score/attention weight of the query-key pair. It is usually implemented by simple simila-
rity metrics like a dot product, or a small MLP.

The weights of the average are calculated by a softmax over all score function outputs. Hence, we
assign those value vectors a higher weight whose corresponding key is most similar to the query.
If we try to describe it with pseudo-math, we can write :
 
αi = exp(fattn (keyi , query)) / Σj exp(fattn (keyj , query)) ,    out = Σi αi · valuei     (6.1)
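A minimal sketch of this weighted average with a dot-product score function (toy tensors, a single query ; the softmax produces the αi ) :

import torch
import torch.nn.functional as F

d = 8
query = torch.randn(d)
keys = torch.randn(5, d)           # one key per sequence element
values = torch.randn(5, d)         # one value per sequence element

scores = keys @ query              # f_attn(key_i, query) as a dot product
alpha = F.softmax(scores, dim=0)   # attention weights, summing to 1
out = alpha @ values               # weighted average of the value vectors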

Visually, we can show the attention over a sequence of words as follows :



6.2 Transformers Architecture

6.2.1 Embedding & Positional Encoding


Word Embedding

In NLP, word embedding is a projection of a word, consisting of characters into meaningful vec-
tors of real numbers. Conceptually it involves a mathematical embedding from a dimension N (all
words in a given corpus)-often a simple one-hot encoding is used- to a continuous vector space
with a much lower dimensionality, typically 128 or 256 dimensions are used. Word embedding is
a crucial preprocessing step for training a neural network.

— one hot encoding


— embedding layer
— word2vec
— Glove
— FastText
— ELMo

Embedding Layer

An embedding layer is a type of hidden layer in a neural network. In one sentence, this layer
maps input information from a high-dimensional to a lower-dimensional space, allowing the net-
work to learn more about the relationship between inputs and to process the data more efficiently.

For example, in natural language processing (NLP), we often represent words and phrases as one-hot vectors, where each dimension corresponds to a different word in the vocabulary. These vectors are high-dimensional and sparse, which makes them difficult to work with.

The type of embedding layer depends on the neural network and the embedding process. There
are several types of embedding that exist :
— text embedding
— Image embedding
— Graph embedding and others

Text embedding

A standard approach is to feed the one-hot encoded tokens (mostly words, or sentences) into an embedding layer. During training the model tries to find a suitable embedding (of lower dimensionality than the input layer). The position of a word within the vector space is learned from text and is based on the words that surround the word when it is used. In some cases it can be useful to use a pretrained embedding, trained on a huge corpus.
— Input : one-hot encoding of the word in a vocabulary
— Output : one vector of N dimensions (given by the user, probably tuned with hyperpara-
meter tuning)
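As a small illustration (the vocabulary size and embedding dimension below are arbitrary), an embedding layer in PyTorch simply maps token indices to dense vectors :

import torch
import torch.nn as nn

vocab_size, embed_dim = 10000, 128
embedding = nn.Embedding(vocab_size, embed_dim)

token_ids = torch.tensor([[12, 7, 431, 0]])   # one tokenized sentence (0 = padding index)
vectors = embedding(token_ids)                # shape (1, 4, 128) : one dense vector per token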



Positional Encoding

Positional encoding describes the location or position of an entity in a sequence so that each
position is assigned a unique representation. There are many reasons why a single number, such
as the index value, is not used to represent an item’s position in transformer models. For long
sequences, the indices can grow large in magnitude. If you normalize the index value to lie bet-
ween 0 and 1, it can create problems for variable length sequences as they would be normalized
differently.

Transformers use a smart positional encoding scheme, where each position/index is mapped to
a vector. Hence, the output of the positional encoding layer is a matrix, where each row of the
matrix represents an encoded object of the sequence summed with its positional information. An
example of the matrix that encodes only the positional information is shown in the figure below.

Positional Encoding : is to inject positional information into the embeddings (information about
the positions).



Here :

— k : position of an object in the input sequence, 0 ≤ k < L/2
— d : dimension of the output embedding space
— P(k, j) : position function, mapping a position k in the input sequence to the (k, j) index of the positional matrix
— n : user-defined scalar, set to 10,000 by the authors of Attention Is All You Need.
— i : used for mapping to column indices, 0 ≤ i < d/2 ; a single value of i maps to both the sine and cosine functions.
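A small sketch of the sinusoidal scheme described above (loop-based for readability ; L, d and n are the symbols just defined) :

import numpy as np

def positional_encoding(L, d, n=10000):
    P = np.zeros((L, d))
    for k in range(L):                    # position in the sequence
        for i in range(d // 2):           # each i fills one sine and one cosine column
            angle = k / n ** (2 * i / d)
            P[k, 2 * i] = np.sin(angle)
            P[k, 2 * i + 1] = np.cos(angle)
    return P

P = positional_encoding(L=4, d=6)         # one row per position, added to the embeddings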



What Is the Final Output of the Positional Encoding Layer ?

6.2.2 Multi-Head Attention


— The attention mechanism calculates the attention, or relevance, between “queries” and “keys”.
— The encoder applies a specific attention mechanism called self-attention (Understanding
the context).
— Self-attention allows the models to associate each word in the input, to other words. (in
the context)



Multi-Head Attention (Query, Key, and Value Vectors)

The idea behind (Q, K, V) is similar to the search engine that will map the query against a set
of keys (video title, description etc.) associated with candidate videos in the database, then
present the best matched videos (values).
— Query : what I am looking for
— Key : what I can offer
— Value : what I actually offer



Multi-Head Attention (Where are Q and K from)

The transformer encoder training builds the weight parameter matrices WQ and WK .

The calculation goes like below where x is a sequence of position-encoded word embedding vectors
that represents an input sentence.

1. Q = X · WQ T
2. K = X · WK T
3. For each (q, k) pair, their relation strength is calculated using a dot product : q_to_k_similarity_scores = matmul(Q, K T )
4. The weight matrices WQ and WK are trained via back-propagation during the Transformer training.

Multi-Head Attention (V is created using Q and K)

Multi-Head Attention (Scaled product)




The dot products are scaled by √dk , the square root of the dimension of the query and key vectors.
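Putting the pieces together, a hedged sketch of single-head scaled dot-product attention (toy sizes, random projection matrices chosen here only for illustration) :

import torch
import torch.nn.functional as F

L, d_model, d_k = 5, 16, 8
X = torch.randn(L, d_model)        # position-encoded word embeddings
W_Q = torch.randn(d_model, d_k)
W_K = torch.randn(d_model, d_k)
W_V = torch.randn(d_model, d_k)

Q, K, V = X @ W_Q, X @ W_K, X @ W_V
scores = Q @ K.T / d_k ** 0.5      # scaled dot product
A = F.softmax(scores, dim=-1)      # attention weights, one row per query
out = A @ V                        # contextualised representations, shape (L, d_k)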

Multi-Head Attention (Full process)

6.2.3 Add & Norm


Add & Norm are in fact two separate steps. The add step is a residual connection.

It means that we sum the output of a layer with its input, F(x) + x. The idea was introduced with the ResNet model and is one of the solutions to the vanishing gradient problem.

The norm step is layer normalization ; it is another way of normalizing activations and one of the many computational tricks that make training the model easier, improving performance and training time.
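A minimal sketch of the Add & Norm step in PyTorch (the sub-layer here is a plain linear layer standing in for attention or the feed-forward block) :

import torch
import torch.nn as nn

d_model = 16
sublayer = nn.Linear(d_model, d_model)   # stands in for attention or feed-forward
layer_norm = nn.LayerNorm(d_model)

x = torch.randn(5, d_model)
out = layer_norm(x + sublayer(x))        # residual connection F(x) + x, then layer normalization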

6.2.4 Global Architecture



6.3 Transformers Implementation
— Attention (Colab)
— Transformers (Kaggle)
— Transformers/ASAG (Kaggle)
— Bert/ASAG (Kaggle)



Chapitre 7
Auto-encoders
7.1 Autoencoders
Autoencoders are an unsupervised learning technique in which we leverage neural networks for
the task of representation learning. Specifically, we’ll design a neural network architecture such
that we impose a bottleneck in the network which forces a compressed knowledge representation
of the original input. If the input features were each independent of one another, this compression
and subsequent reconstruction would be a very difficult task.

We can take an unlabeled dataset and frame it as a supervised learning problem tasked with outputting x̂, a reconstruction of the original input x. This network can be trained by minimizing the reconstruction error, L(x, x̂), which measures the differences between our original input and the consequent reconstruction.

The bottleneck is a key attribute of our network design ; without the presence of an information bottleneck, our network could easily learn to simply memorize the input values by passing these values along through the network.

7.1.1 Latent Variable Models


LVMs map between observation space x ∈ RD and latent space z ∈ RQ :

fw : x → z,    gw : z → x̂
— One latent variable gets associated with each data point in the training set.
— The latent vectors are smaller than the observations (Q < D), compression.
— Models are linear or non-linear, deterministic or stochastic, with/without encoder.

Note : a latent space is defined as an abstract multi-dimensional space that encodes a meaningful
internal representation of externally observed events, compressed understanding of the world to
a computer through a spatial representation.

7.1.2 Generative Latent Variable Models


Generative modeling is a broad area of machine learning which deals with models of probability
distributions p(x) over data points x.

Some generative models (e.g., normalizing flows) allow for computing p(x).Others (e.g., VAEs)
only approximate p(x), but allow to draw samples from p(x).
Generative latent variable models often consider a simple Bayesian model :
p(x) = ∫z p(z) p(x | z) dz = Ez∼p(z) [p(x | z)]

— p(z) is the prior over the latent variable z ∈ RQ .


— p(x|z) is the likelihood (= decoder that transforms z into a distribution over x).
— p(x) is the marginal of the joint distribution p(x, z).



7.1.3 Autoencoders Architecture
Autoencoders comprise an encoder fw as well as a decoder gw :

fw : x → z,    gw : z → x̂

hi = g(xi )

Where hi ∈ RQ (the latent feature representation) is the output of the encoder block .

x̃i = f (hi ) = f (g(xi ))

Where x̃i ∈ Rn . Training an autoencoder simply means finding the functions g(·) and f(·) that
satisfy :
arg min_{f,g} < ∆(xi , f (g(xi ))) >  =  arg min_{f,g} < ∆(xi , x̃i ) >

where ∆ indicates a measure of how the input and the output of the autoencoder differ (basically
our loss function will penalize the difference between input and output) and < · > indicates the
average over all observations.



7.1.4 Regularization in autoencoders
Intuitively Regularization means enforcing sparsity in the latent feature output. The simplest
way of achieving this is to add a l1 orl2 regularization term to the loss function. That will look
like this for the l2 regularization term :

arg min_{f,g} E[∆(xi , x̃i )] + λ Σi θi²

In the formula the θi are the parameters in the functions f(·) and g(·) (you can imagine that
in the case where the functions are neural networks, the parameters will be the weights).
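In PyTorch, this penalty can be added explicitly to the loss, or (for the l2 case) delegated to the optimizer's weight_decay argument, as in the AE listing further below ; a sketch assuming model, loss_function, image and reconstructed are defined as in that listing :

import torch

l2_lambda = 1e-5
# explicit l2 penalty on all parameters, added to the reconstruction loss
l2_penalty = sum((p ** 2).sum() for p in model.parameters())
loss = loss_function(reconstructed, image) + l2_lambda * l2_penalty

# equivalently for l2, let the optimizer apply the penalty at each update
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=l2_lambda)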

7.1.5 Feed Forward Autoencoders


A Feed-Forward Autoencoder (FFA) is a neural network made of dense layers with a specific
architecture, as can be schematically seen in Figure.

Activation Function of the Output Layer

Common choices are ReLU, sigmoid or a linear activation function, selected to match the range of the input data.

Loss Function

E[∆(xi , x̃i )]

Remember that an autoencoder is trying to learn an approximation of the identity function ; therefore, we want to find the weights in the network that give the smallest difference, according to some metric ∆(·), between xi and x̃i . Two loss functions are widely used for autoencoders : Mean Squared Error (MSE) and Binary Cross-Entropy (BCE).

7.1.6 AE Pytorch

AE Pytorch
import torch
from torchvision import datasets
from torchvision import transforms
import matplotlib.pyplot as plt



# Transforms images to a PyTorch Tensor
tensor_transform = transforms.ToTensor()

# Download the MNIST Dataset


dataset = datasets.MNIST(root = "./data",
train = True,
download = True,
transform = tensor_transform)

# DataLoader is used to load the dataset


# for training
loader = torch.utils.data.DataLoader(dataset = dataset,
batch_size = 32,
shuffle = True)

# Cretae autoencoder class


#28*28 = 784 ==> 128 ==> 64 ==> 36 ==> 18 ==> 9

# Creating a PyTorch class


# 28*28 ==> 9 ==> 28*28
class AE(torch.nn.Module):
def __init__(self):
super().__init__()

# Building an linear encoder with Linear


# layer followed by Relu activation function
# 784 ==> 9
self.encoder = torch.nn.Sequential(
torch.nn.Linear(28 * 28, 128),
torch.nn.ReLU(),
torch.nn.Linear(128, 64),
torch.nn.ReLU(),
torch.nn.Linear(64, 36),
torch.nn.ReLU(),
torch.nn.Linear(36, 18),
torch.nn.ReLU(),
torch.nn.Linear(18, 9)
)

# Building an linear decoder with Linear


# layer followed by Relu activation function
# The Sigmoid activation function
# outputs the value between 0 and 1
# 9 ==> 784
self.decoder = torch.nn.Sequential(



torch.nn.Linear(9, 18),
torch.nn.ReLU(),
torch.nn.Linear(18, 36),
torch.nn.ReLU(),
torch.nn.Linear(36, 64),
torch.nn.ReLU(),
torch.nn.Linear(64, 128),
torch.nn.ReLU(),
torch.nn.Linear(128, 28 * 28),
torch.nn.Sigmoid()
)

def forward(self, x):


encoded = self.encoder(x)
decoded = self.decoder(encoded)
return decoded

# Model Initialization
model = AE()

# Validation using MSE Loss function


loss_function = torch.nn.MSELoss()

# Using an Adam Optimizer with lr = 0.1


optimizer = torch.optim.Adam(model.parameters(),
lr = 1e-1,
weight_decay = 1e-8)
epochs = 20
outputs = []
losses = []
for epoch in range(epochs):
for (image, _) in loader:
image = image.reshape(-1, 28*28)
reconstructed = model(image)
loss = loss_function(reconstructed, image)
optimizer.zero_grad()
loss.backward()
optimizer.step()
losses.append(loss.item())
outputs.append((epoch, image, reconstructed))
print("Epoch :",epoch,"Loss====>",loss)

#Defining the Plot Style


plt.style.use('fivethirtyeight')
plt.xlabel('Iterations')



plt.ylabel('Loss')

# Plotting the last 100 values


with torch.no_grad():
plt.plot(losses[-100:])

with torch.no_grad():
for i, item in enumerate(image):
item = item.reshape(-1, 28, 28)
plt.imshow(item[0])
plt.show()

with torch.no_grad():
for i, item in enumerate(reconstructed):
item = item.reshape(-1, 28, 28)
plt.imshow(item[0])
plt.show()

7.1.7 AE Keras

AE Keras
import numpy as np
import matplotlib.pyplot as plt
from random import randint
from keras import backend as K
from keras.layers import Input, Dense, Conv2D, MaxPooling2D, UpSampling2D
from keras.models import Model
from keras.datasets import mnist
from keras.callbacks import TensorBoard

def load_data():
# defining the input image size
input_image = Input(shape =(28, 28, 1))

# Loading the data and dividing the data into training and testing sets
(X_train, _), (X_test, _) = mnist.load_data()

# Cleaning and reshaping the data as required by the model


X_train = X_train.astype('float32') / 255.
X_train = np.reshape(X_train, (len(X_train), 28, 28, 1))
X_test = X_test.astype('float32') / 255.
X_test = np.reshape(X_test, (len(X_test), 28, 28, 1))

return X_train, X_test, input_image



def build_network(input_image):

# Building the encoder of the Auto-encoder


x = Conv2D(16, (3, 3), activation ='relu', padding ='same')(input_image)
x = MaxPooling2D((2, 2), padding ='same')(x)
x = Conv2D(8, (3, 3), activation ='relu', padding ='same')(x)
x = MaxPooling2D((2, 2), padding ='same')(x)
x = Conv2D(8, (3, 3), activation ='relu', padding ='same')(x)
encoded_layer = MaxPooling2D((2, 2), padding ='same')(x)

# Building the decoder of the Auto-encoder


x = Conv2D(8, (3, 3), activation ='relu', padding ='same')(encoded_layer)
x = UpSampling2D((2, 2))(x)
x = Conv2D(8, (3, 3), activation ='relu', padding ='same')(x)
x = UpSampling2D((2, 2))(x)
x = Conv2D(16, (3, 3), activation ='relu')(x)
x = UpSampling2D((2, 2))(x)
decoded_layer = Conv2D(1, (3, 3), activation ='sigmoid', padding ='same')(x)

return decoded_layer

def build_auto_encoder_model(X_train, X_test, input_image, decoded_layer):

# Defining the parameters of the Auto-encoder


autoencoder = Model(input_image, decoded_layer)
autoencoder.compile(optimizer ='adadelta', loss ='binary_crossentropy')

# Training the Auto-encoder


autoencoder.fit(X_train, X_train,
epochs = 15,
batch_size = 256,
shuffle = True,
validation_data =(X_test, X_test))

return autoencoder

def visualize(model, X_test):

# Reconstructing the encoded images


reconstructed_images = model.predict(X_test)

plt.figure(figsize =(20, 4))


for i in range(1, 11):

# Generating a random to get random results


rand_num = randint(0, 10001)



# To display the original image
ax = plt.subplot(2, 10, i)
plt.imshow(X_test[rand_num].reshape(28, 28))
plt.gray()
ax.get_xaxis().set_visible(False)
ax.get_yaxis().set_visible(False)

# To display the reconstructed image


ax = plt.subplot(2, 10, i + 10)
plt.imshow(reconstructed_images[rand_num].reshape(28, 28))
plt.gray()
ax.get_xaxis().set_visible(False)
ax.get_yaxis().set_visible(False)

# Displaying the plot


plt.show()

X_train, X_test, input_image = load_data()


decoded_layer = build_network(input_image)
auto_encoder_model = build_auto_encoder_model(X_train,
X_test,
input_image,
decoded_layer)
visualize(auto_encoder_model, X_test)

7.2 Variational Auto encoders


The variational autoencoder was proposed in 2013 by Kingma and Welling at Google and Qualcomm.
A variational autoencoder (VAE) provides a probabilistic manner for describing an observation
in latent space. Thus, rather than building an encoder which outputs a single value to describe
each latent state attribute, we’ll formulate our encoder to describe a probability distribution for
each latent attribute.

Variational autoencoder is different from autoencoder in a way such that it provides a statis-
tic manner for describing the samples of the dataset in latent space. Therefore, in variational
autoencoder, the encoder outputs a probability distribution in the bottleneck layer instead of a
single output value.
In autoencoder the encoder outputs are single output value :



Using a variational autoencoder, we can describe latent attributes in probabilistic terms.

With this approach, we’ll now represent each latent attribute for a given input as a probability
distribution.



7.2.1 Variational Auto encoders Architecture

7.2.2 Loss function


Variational autoencoder uses KL-divergence as its loss function, the goal of this is to minimize
the difference between a supposed distribution and original distribution of dataset.

Suppose we have a distribution z and we want to generate the observation x from it. In other
words, we want to calculate
p(z | x) = p(x | z) p(z) / p(x)

But the calculation of p(x) can be quite difficult :

p(x) = ∫ p(x | z) p(z) dz

This usually makes it an intractable distribution. Hence, we need to approximate p(z|x) by a tractable distribution q(z|x). To make q(z|x) as close as possible to p(z|x), we minimize the KL-divergence loss, which measures how similar two distributions are :

min KL(q(z | x) || p(z | x))

By simplifying, the above minimization problem is equivalent to the following maximization problem :

Eq(z|x) [log p(x | z)] − KL(q(z | x) || p(z | x))

The first term represents the reconstruction likelihood and the other term ensures that our lear-
ned distribution q is similar to the true prior distribution p.

Thus our total loss consists of two terms, one is reconstruction error and other is KL-divergence
loss :

Loss = L(x, x̂) + ΣKL(q(z | x)||p(z | x))
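A sketch of this loss in PyTorch, assuming a Gaussian encoder that outputs mu and log_var and a decoder whose output x_hat lies in [0, 1] (the names are illustrative, not from a specific listing) :

import torch
import torch.nn.functional as F

def vae_loss(x_hat, x, mu, log_var):
    # reconstruction term L(x, x_hat)
    recon = F.binary_cross_entropy(x_hat, x, reduction='sum')
    # closed-form KL(q(z|x) || N(0, I)) for a Gaussian encoder
    kl = -0.5 * torch.sum(1 + log_var - mu.pow(2) - log_var.exp())
    return recon + kl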

7.2.3 VAE Pytorch


see : https ://avandekleut.github.io/vae/

7.2.4 VAE Keras


see : https ://keras.io/examples/generative/vae/



Chapitre 8
Generative Adversarial Networks
A Generative Adversarial Network (GAN) is a deep learning architecture that consists of two
neural networks competing against each other in a zero-sum game framework. The goal of GANs
is to generate new, synthetic data that resembles some known data distribution.

8.0.1 Why GANs


When noise is added to its input, a model can have higher confidence in a wrong prediction than it had when it predicted correctly. The reason for such adversarial behaviour is that most machine learning models learn from a limited amount of data, which is a huge drawback, as it makes them prone to overfitting. Moreover, the mapping between the input and the output is almost linear : although the boundaries of separation between the various classes may seem linear, they are in reality composed of linearities, and even a small change of a point in the feature space might lead to the misclassification of data.

8.1 Generative Models : Review

Generative classifiers :
— Assume some functional form for P(Y), P(X|Y)
— Estimate parameters of P(X|Y), P(Y) directly from training data
— Use Bayes rule to calculate P(Y |X)

Discriminative Classifiers :
— Assume some functional form for P(Y|X)
— Estimate parameters of P(Y|X) directly from training data

— Generative models can generate new data instances.
— Discriminative models discriminate between different kinds of data instances.
A generative model could generate new photos of animals that look like real animals, while a
discriminative model could tell a dog from a cat. GANs are just one kind of generative model.

More formally, given a set of data instances X and a set of labels Y :

— Generative models capture the joint probability p(X, Y), or just p(X) if there are no labels.
— Discriminative models capture the conditional probability p(Y | X).

8.2 Application Of Generative Adversarial Networks


GANs, or Generative Adversarial Networks, have many uses in many different fields. Here are
some of the widely recognized uses of GANs :
— Image Synthesis and Generation : GANs are often used for picture synthesis and generation
tasks, They may create fresh, lifelike pictures that mimic training data by learning the
distribution that explains the dataset. The development of lifelike avatars, high-resolution
photographs, and fresh artwork have all been facilitated by these types of generative
networks.
— Image-to-Image Translation : GANs may be used for problems involving image-to-image
translation, where the objective is to convert an input picture from one domain to another
while maintaining its key features. GANs may be used, for instance, to change pictures
from day to night, transform drawings into realistic images, or change the creative style
of an image.



— Text-to-Image Synthesis : GANs have been used to create visuals from descriptions in
text. GANs may produce pictures that translate to a description given a text input, such
as a phrase or a caption. This application might have an impact on how realistic visual
material is produced using text-based instructions.
— Data Augmentation : GANs can augment present data and increase the robustness and
generalizability of machine-learning models by creating synthetic data samples.
— Super-Resolution : GANs can enhance the resolution and quality of low-resolution images. By training on pairs of low-resolution and high-resolution images, GANs can generate high-resolution images from low-resolution inputs, enabling improved image quality in applications such as medical imaging, satellite imaging, and video enhancement.
— Style Transfer and Editing : GANs have been employed for style transfer and editing
in images and videos. They can learn the style of a reference image or video and apply
that style to other images or videos, enabling artistic transformations, such as converting
photographs into paintings or altering the appearance of videos.

8.3 GANs Model & Architecture


Generative Adversarial Networks (GANs) can be broken down into three parts :
— Generative : To learn a generative model, which describes how data is generated in terms
of a probabilistic model.
— Adversarial : The training of a model is done in an adversarial setting.
— Networks : Use deep neural networks as artificial intelligence (AI) algorithms for training
purposes.

In GANs, there is a Generator and a Discriminator. The Generator generates fake samples of
data(be it an image, audio, etc.) and tries to fool the Discriminator. The Discriminator, on
the other hand, tries to distinguish between the real and fake samples. The Generator and the
Discriminator are both Neural Networks and they both run in competition with each other in the
training phase. The steps are repeated several times and in this, the Generator and Discriminator
get better and better in their respective jobs after each repetition. The work can be visualized
by the diagram given below :

The generative model captures the distribution of the data and is trained in such a manner that it tries to maximize the probability of the Discriminator making a mistake. The Discriminator, on the other hand, is based on a model that estimates the probability that the sample it receives comes from the training data and not from the Generator. The GANs are formulated as a minimax game, where the Discriminator tries to maximize its reward V(D, G) and the Generator tries to minimize it, in other words to maximize the Discriminator’s loss. It can be mathematically described by the formula below :

min_G max_D V(D, G) = E_{x∼p_data(x)}[log D(x)] + E_{z∼p_z(z)}[log(1 − D(G(z)))]

Where :
— G = Generator network
— D = Discriminator network
— p_data(x) = distribution of the real data
— p_z(z) = prior distribution of the noise fed to the Generator
— x = sample drawn from p_data(x)
— z = noise sample drawn from p_z(z)
— D(x) = the Discriminator's estimate of the probability that x is real
— G(z) = the sample produced by the Generator from the noise z
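
As a concrete illustration, both sides of this value function can be written with the binary cross-entropy loss in Pytorch. The snippet below is only a minimal sketch : the names D, G, real_images and latent_dim are placeholders for any discriminator network, generator network, batch of real samples and noise dimension, and the helper gan_losses is not part of any library. Maximizing V(D, G) over D amounts to minimizing the cross-entropy of D on real samples (label 1) and fake samples (label 0) ; for the Generator, the widely used non-saturating variant maximizes log D(G(z)) instead of minimizing log(1 − D(G(z))), which is also what the full listing in section 8.4 does.

import torch
import torch.nn as nn

bce = nn.BCELoss()  # binary cross-entropy, i.e. the log terms of V(D, G)

def gan_losses(D, G, real_images, latent_dim):
    batch = real_images.size(0)
    ones = torch.ones(batch, 1)    # target "real"
    zeros = torch.zeros(batch, 1)  # target "fake"

    z = torch.randn(batch, latent_dim)  # z ~ p_z(z)
    fake_images = G(z)                  # G(z)

    # Discriminator side: maximize E[log D(x)] + E[log(1 - D(G(z)))],
    # i.e. minimize BCE on real (label 1) and fake (label 0) samples.
    d_loss = bce(D(real_images), ones) + bce(D(fake_images.detach()), zeros)

    # Generator side (non-saturating): maximize E[log D(G(z))],
    # i.e. minimize BCE of D(G(z)) against the "real" label.
    g_loss = bce(D(fake_images), ones)
    return d_loss, g_loss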

8.3.1 Generator Model


The Generator is trained while the Discriminator is kept fixed. The Generator produces a batch of fake samples, the frozen Discriminator scores them, and that feedback is back-propagated into the Generator alone, so that with each update it gets better than its previous state at fooling the Discriminator.
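
As a rough sketch of one Generator update (generator, discriminator, adversarial_loss, optimizer_G, latent_dim and device are assumed to be defined as in the full listing of section 8.4, and batch_size is an arbitrary batch size) :

# Only optimizer_G takes a step; the Discriminator merely scores the fake batch.
optimizer_G.zero_grad()
z = torch.randn(batch_size, latent_dim, device=device)
fake_images = generator(z)
# The Generator is rewarded when the Discriminator labels its output as real (1).
g_loss = adversarial_loss(discriminator(fake_images),
                          torch.ones(batch_size, 1, device=device))
g_loss.backward()   # gradients flow back through D into G, but only G is updated
optimizer_G.step()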

8.3.2 Discriminator Model


The Discriminator is trained while the Generator is kept fixed. In this phase the Generator is only forward-propagated and receives no weight updates. The Discriminator is trained on real data, checking that it correctly predicts it as real, and on the fake data produced by the Generator, checking that it correctly predicts it as fake.
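
A minimal sketch of one Discriminator update, under the same assumptions as the previous snippet (real_images being a batch of training images) : the call to detach() cuts the computation graph at the Generator's output, so the Generator is indeed only forward-propagated here.

# Real images should be scored 1, generated images 0; only optimizer_D is stepped.
optimizer_D.zero_grad()
valid = torch.ones(real_images.size(0), 1, device=device)
fake = torch.zeros(real_images.size(0), 1, device=device)
real_loss = adversarial_loss(discriminator(real_images), valid)
fake_images = generator(torch.randn(real_images.size(0), latent_dim, device=device))
# detach() keeps the Generator frozen: no gradient reaches its weights.
fake_loss = adversarial_loss(discriminator(fake_images.detach()), fake)
d_loss = (real_loss + fake_loss) / 2
d_loss.backward()
optimizer_D.step()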



8.3.3 Advantages of Generative Adversarial Networks (GANs)
— Synthetic data generation : GANs can generate new, synthetic data that resembles some
known data distribution, which can be useful for data augmentation, anomaly detection,
or creative applications.
— High-quality results : GANs can produce high-quality, photorealistic results in image syn-
thesis, video synthesis, music synthesis, and other tasks.
— Unsupervised learning : GANs can be trained without labeled data, making them suitable
for unsupervised learning tasks, where labeled data is scarce or difficult to obtain.
— Versatility : GANs can be applied to a wide range of tasks, including image synthesis, text-
to-image synthesis, image-to-image translation, anomaly detection, data augmentation,
and others.

8.3.4 Disadvantages of Generative Adversarial Networks (GANs)


— Training Instability : GANs can be difficult to train, with the risk of instability, mode collapse, or failure to converge.
— Computational Cost : GANs can require a lot of computational resources and can be slow to train, especially for high-resolution images or large datasets.
— Overfitting : GANs can overfit the training data, producing synthetic data that is too
similar to the training data and lacking diversity.
— Bias and Fairness : GANs can reflect the biases and unfairness present in the training
data, leading to discriminatory or biased synthetic data.
— Interpretability and Accountability : GANs can be opaque and difficult to interpret or
explain, making it challenging to ensure accountability, transparency, or fairness in their
applications.

8.4 GANs Implementation

8.4.1 GANs Pytorch

GANs Pytorch
import torch
import torch.nn as nn
import torch.optim as optim
import torchvision
from torchvision import datasets, transforms
import matplotlib.pyplot as plt
import numpy as np

# Set device
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# Normalize CIFAR-10 images to [-1, 1] to match the generator's Tanh output
transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))
])

train_dataset = datasets.CIFAR10(root='./data', train=True,
                                 download=True, transform=transform)
dataloader = torch.utils.data.DataLoader(train_dataset,
                                         batch_size=32, shuffle=True)


# Hyperparameters
latent_dim = 100
lr = 0.0002
beta1 = 0.5
beta2 = 0.999
num_epochs = 10

# Define the generator


class Generator(nn.Module):
def __init__(self, latent_dim):
super(Generator, self).__init__()

self.model = nn.Sequential(
nn.Linear(latent_dim, 128 * 8 * 8),
nn.ReLU(),
nn.Unflatten(1, (128, 8, 8)),
nn.Upsample(scale_factor=2),
nn.Conv2d(128, 128, kernel_size=3, padding=1),
nn.BatchNorm2d(128, momentum=0.78),
nn.ReLU(),
nn.Upsample(scale_factor=2),
nn.Conv2d(128, 64, kernel_size=3, padding=1),
nn.BatchNorm2d(64, momentum=0.78),
nn.ReLU(),
nn.Conv2d(64, 3, kernel_size=3, padding=1),
nn.Tanh()
)

def forward(self, z):


img = self.model(z)
return img

# Define the discriminator


class Discriminator(nn.Module):
def __init__(self):
super(Discriminator, self).__init__()

self.model = nn.Sequential(
nn.Conv2d(3, 32, kernel_size=3, stride=2, padding=1),
nn.LeakyReLU(0.2),
nn.Dropout(0.25),
nn.Conv2d(32, 64, kernel_size=3, stride=2, padding=1),
nn.ZeroPad2d((0, 1, 0, 1)),
nn.BatchNorm2d(64, momentum=0.82),
nn.LeakyReLU(0.25),
nn.Dropout(0.25),
nn.Conv2d(64, 128, kernel_size=3, stride=2, padding=1),
nn.BatchNorm2d(128, momentum=0.82),
nn.LeakyReLU(0.2),
nn.Dropout(0.25),
nn.Conv2d(128, 256, kernel_size=3, stride=1, padding=1),
nn.BatchNorm2d(256, momentum=0.8),
nn.LeakyReLU(0.25),
nn.Dropout(0.25),
nn.Flatten(),
nn.Linear(256 * 5 * 5, 1),
nn.Sigmoid()
)

def forward(self, img):


validity = self.model(img)
return validity

# Define the generator and discriminator


# Initialize generator and discriminator
generator = Generator(latent_dim).to(device)
discriminator = Discriminator().to(device)

# Loss function
adversarial_loss = nn.BCELoss()

# Optimizers
optimizer_G = optim.Adam(generator.parameters()\
, lr=lr, betas=(beta1, beta2))
optimizer_D = optim.Adam(discriminator.parameters()\
, lr=lr, betas=(beta1, beta2))

# Training loop
for epoch in range(num_epochs):
for i, batch in enumerate(dataloader):
# Real images from the current batch (batch = [images, labels])
real_images = batch[0].to(device)

# Adversarial ground truths


valid = torch.ones(real_images.size(0), 1, device=device)
fake = torch.zeros(real_images.size(0), 1, device=device)

# Configure input
real_images = real_images.to(device)

# ---------------------
# Train Discriminator
# ---------------------

optimizer_D.zero_grad()

# Sample noise as generator input


z = torch.randn(real_images.size(0), latent_dim, device=device)

# Generate a batch of images


fake_images = generator(z)

# Measure discriminator's ability


# to classify real and fake images
real_loss = adversarial_loss(discriminator\
(real_images), valid)
fake_loss = adversarial_loss(discriminator\
(fake_images.detach()), fake)
d_loss = (real_loss + fake_loss) / 2

# Backward pass and optimize


d_loss.backward()
optimizer_D.step()

# -----------------
# Train Generator
# -----------------

optimizer_G.zero_grad()

# Generate a batch of images


gen_images = generator(z)

# Adversarial loss
g_loss = adversarial_loss(discriminator(gen_images), valid)

# Backward pass and optimize


g_loss.backward()
optimizer_G.step()

# ---------------------
# Progress Monitoring
# ---------------------

if (i + 1) % 100 == 0:
    print(
        f"Epoch [{epoch+1}/{num_epochs}] "
        f"Batch {i+1}/{len(dataloader)} "
        f"Discriminator Loss: {d_loss.item():.4f} "
        f"Generator Loss: {g_loss.item():.4f}"
    )

# Display a grid of generated images every 10 epochs


if (epoch + 1) % 10 == 0:
with torch.no_grad():
z = torch.randn(16, latent_dim, device=device)
generated = generator(z).detach().cpu()
grid = torchvision.utils.make_grid(generated,\
nrow=4, normalize=True)
plt.imshow(np.transpose(grid, (1, 2, 0)))
plt.axis("off")
plt.show()
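
Once training is finished, only the Generator is needed to produce new images. A minimal sketch of saving and reloading it (the file name generator_cifar10.pt is an arbitrary choice, not a Pytorch convention) :

# Persist the trained Generator, then reload it for inference only.
torch.save(generator.state_dict(), "generator_cifar10.pt")

sampler = Generator(latent_dim).to(device)
sampler.load_state_dict(torch.load("generator_cifar10.pt", map_location=device))
sampler.eval()
with torch.no_grad():
    new_images = sampler(torch.randn(16, latent_dim, device=device))  # 16 fresh samples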


