Deep Learning Course
2023 — 2024
Table of Contents

0.1 Introduction
0.1.1 Machine learning vs Deep learning
0.1.2 History of Deep Learning
0.1.3 Application Fields
0.1.4 Tensors
1 Linear Models
1.1 Multi Linear Regression
1.1.1 The Model
1.1.2 Cost/Loss Function
1.1.3 Gradient Descent
1.1.4 Probabilistic interpretation (Cost Function)
1.1.5 Multi Linear Regression with Pytorch
1.2 Perceptron
1.2.1 Perceptron Model
1.2.2 Learning Algorithm
1.2.3 Perceptron with Pytorch
1.3 Logistic Regression
1.3.1 Logistic Regression Model
1.3.2 Probability Interpretation (Cost Function)
1.3.3 Logistic Regression Pytorch
1.4 SoftMax Regression
1.4.1 MultiClass Classification
1.4.2 SoftMax Model
1.4.3 Probability Interpretation (Cost Function)
1.4.4 SoftMax Pytorch
2.3.6 Kullback Leibler Divergence
2.4 Activation functions
2.4.1 Sigmoid
2.4.2 Tanh
2.4.3 ReLU
2.4.4 Leaky ReLU
2.4.5 Exponential Linear Units (ELU)
2.4.6 How to choose an activation function
2.5 ANN Simulator
6 Transformers
6.1 Attention Is All You Need
6.1.1 Attention
6.2 Transformers Architecture
6.2.1 Embedding & Positional Encoding
6.2.2 Multi-Head Attention
6.2.3 Add & Norm
6.2.4 Global Architecture
6.3 Transformers Implementation
7 Auto-encoders
7.1 Autoencoders
— Aerospace and Defense : Deep learning is used to identify objects from satellites that locate areas of interest, and identify safe or unsafe zones for troops.
— Medical Research : Cancer researchers are using deep learning to automatically detect cancer cells. Teams at UCLA built an advanced microscope that yields a high-dimensional data set used to train a deep learning application to accurately identify cancer cells.
— Industrial Automation : Deep learning is helping to improve worker safety around heavy machinery by automatically detecting when people or objects are within an unsafe distance of machines.
— Electronics : Deep learning is being used in automated hearing and speech translation. For example, home assistance devices that respond to your voice and know your preferences are powered by deep learning applications.
Tensor notation is much like matrix notation with a capital letter representing a tensor and
lowercase letters with subscript integers representing scalar values within the tensor.
Tensor Python
# create tensor
from numpy import array
T = array([
[[1,2,3], [4,5,6], [7,8,9]],
[[11,12,13], [14,15,16], [17,18,19]],
[[21,22,23], [24,25,26], [27,28,29]],
])
print(T.shape)
print(T)
Tensors Pytorch
import torch
import math
x = torch.empty(3, 4)
print(type(x))
print(x)
zeros = torch.zeros(2, 3)
print(zeros)
ones = torch.ones(2, 3)
print(ones)
torch.manual_seed(1729)
random = torch.rand(2, 3)
print(random)
ones = torch.zeros(2, 2) + 1
twos = torch.ones(2, 2) * 2
threes = (torch.ones(2, 2) * 7 - 1) / 2
fours = twos ** 2
sqrt2s = twos ** 0.5
print(ones)
print(twos)
print(threes)
print(fours)
print(sqrt2s)
t1 = torch.tensor([1, 2, 3, 4])
t2 = torch.tensor([5, 6, 7, 8])
Tensors TensorFlow
import tensorflow as tf
import numpy as np
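The two lines above only import the libraries ; as a minimal sketch (not from the original listing), basic TensorFlow tensors can be created in the same spirit as the PyTorch examples above :

zeros = tf.zeros((2, 3))                        # 2x3 tensor of zeros
ones = tf.ones((2, 3))                          # 2x3 tensor of ones
random = tf.random.uniform((2, 3), seed=1729)   # uniform random values
t = tf.constant(np.array([[1, 2, 3], [4, 5, 6]]), dtype=tf.float32)  # from a NumPy array
print(t.shape)
print(t)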
x ∈ ℝ^d, y ∈ ℝ, for n such examples.

The hypothesis is a linear function of the inputs :

h_θ(x) = θ_0 + θ_1 x_1 + θ_2 x_2 + ... + θ_d x_d

h_θ(x) = Σ_{i=1}^{d} θ_i x_i + θ_0

Setting x_0 = 1 (the intercept term) :

h_θ(x) = θ_0 x_0 + θ_1 x_1 + θ_2 x_2 + ... + θ_d x_d = Σ_{i=0}^{d} θ_i x_i = θ^T x

The cost/loss function is the sum of squared errors over the n training examples :

J(θ) = (1/2) Σ_{i=1}^{n} (h_θ(x^{(i)}) − y^{(i)})^2

θ̂ = argmin_θ J(θ)

Note : a factor 1/n can be added to normalize by the number of examples.
1.1.3 Gradient Descent
Least Mean Squares Algorithm
θ ← initialization
θ_j := θ_j − α ∂J(θ)/∂θ_j
(This update is simultaneously performed for all values of j = 0, . . . , d.) Here, α is called the
learning rate. This is a very natural algorithm that repeatedly takes a step in the direction of
steepest decrease of J.
Let's first work it out for the case where we have only one training example (x, y), so that we can neglect the sum in the definition of J. We have :
∂J(θ)/∂θ_j = ∂/∂θ_j [ (1/2) (h_θ(x) − y)^2 ]

∂J(θ)/∂θ_j = 2 · (1/2) · (h_θ(x) − y) · ∂/∂θ_j (h_θ(x) − y)

∂J(θ)/∂θ_j = (h_θ(x) − y) · ∂/∂θ_j ( Σ_{i=0}^{d} θ_i x_i − y )

∂J(θ)/∂θ_j = (h_θ(x) − y) x_j
We derived the LMS rule for the case of a single training example. There are several ways to modify this method for a training set of more than one example. The first is to replace it with the following algorithm :

Repeat until convergence {
    θ_j := θ_j + α Σ_{i=1}^{n} (y^{(i)} − h_θ(x^{(i)})) x_j^{(i)},   (for every j)        (1)
}

By grouping the updates of the coordinates into an update of the vector θ, we can rewrite update (1) in a slightly more succinct way :

θ := θ + α Σ_{i=1}^{n} (y^{(i)} − h_θ(x^{(i)})) x^{(i)}
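As a minimal sketch (not from the course code) of batch gradient descent for linear regression, with illustrative data and names :

import torch

# toy data : n = 100 examples, d = 3 features (values are illustrative)
X = torch.randn(100, 3)
true_theta = torch.tensor([2.0, -1.0, 0.5])
y = X @ true_theta + 0.1 * torch.randn(100)

theta = torch.zeros(3, requires_grad=True)
alpha = 0.1  # learning rate

for step in range(200):
    # J(theta) = 1/2 * mean squared error of h_theta(x) = theta^T x
    loss = 0.5 * ((X @ theta - y) ** 2).mean()
    loss.backward()                      # computes dJ/dtheta
    with torch.no_grad():
        theta -= alpha * theta.grad      # gradient descent update
        theta.grad.zero_()

print(theta)  # close to true_theta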
We assume y^{(i)} = θ^T x^{(i)} + ε^{(i)}, where ε^{(i)} is Gaussian noise. Let us further assume that the ε^{(i)} are distributed IID (independently and identically distributed) according to a Gaussian distribution (also called a Normal distribution) with mean zero and some variance σ^2. We can write this assumption as ε^{(i)} ∼ N(0, σ^2). The density of ε^{(i)} implies that

p(y^{(i)} | x^{(i)} ; θ) = 1/(√(2π) σ) · exp( −(y^{(i)} − θ^T x^{(i)})^2 / (2σ^2) )
As such, the likelihood factorizes. According to the principle of maximum likelihood, the best values of the parameters θ are those that maximize the likelihood of the entire dataset :

L(θ) = Π_{i=1}^{n} p(y^{(i)} | x^{(i)} ; θ)

L(θ) = Π_{i=1}^{n} 1/(√(2π) σ) · exp( −(y^{(i)} − θ^T x^{(i)})^2 / (2σ^2) )
Now, given this probabilistic model relating the y (i) ’s and the x(i) ’s, what is a reasonable way of
choosing our best guess of the parameters θ ?
The principle of maximum likelihood says that we should choose θ so as to make the data as probable as possible.
Given the common use of log in the likelihood function, it is referred to as a log-likelihood
function. It is also common in optimization problems to prefer to minimize the cost function
rather than to maximize.
l(θ) = log L(θ)

l(θ) = log Π_{i=1}^{n} 1/(√(2π) σ) · exp( −(y^{(i)} − θ^T x^{(i)})^2 / (2σ^2) )

l(θ) = Σ_{i=1}^{n} log [ 1/(√(2π) σ) · exp( −(y^{(i)} − θ^T x^{(i)})^2 / (2σ^2) ) ]
For reasons of increased computational ease, it is often easier to minimise the negative of the
log-likelihood rather than maximise the log-likelihood itself. Hence, we can "stick a minus sign
in front of the log-likelihood" to give us the negative log-likelihood (NLL) :
NLL(θ) = − Σ_{i=1}^{n} log [ 1/(√(2π) σ) · exp( −(y^{(i)} − θ^T x^{(i)})^2 / (2σ^2) ) ]

NLL(θ) = (n/2) log(2πσ^2) + (1/(2σ^2)) Σ_{i=1}^{n} (y^{(i)} − θ^T x^{(i)})^2

The term Σ_{i=1}^{n} (y^{(i)} − θ^T x^{(i)})^2 is the Residual Sum of Squares, also known as the Sum of Squared Errors (SSE). Minimizing the negative log-likelihood is therefore equivalent to minimizing the least-squares cost J(θ) defined above.
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

data = pd.read_csv("datasets/Advertising.csv")
data.head()
data.shape[0]
sns.displot(data, x="Sales", kde=True)
plt.show()
import torch
import torch.nn as nn
from tqdm import tqdm #progress Bar
def mylnmodel(tv, radio, news):
    """
    Linear model (signature and body reconstructed from the surrounding code):
    sales = theta0 + theta1*tv + theta2*radio + theta3*news

    Arguments:
    tv (tensor) with the values of tv investment (x_1)
    radio (tensor) with the values of radio investment (x_2)
    news (tensor) with the newspaper investment (x_3).
    """
    return theta0 + theta1 * tv + theta2 * radio + theta3 * news

def MSE(y_predicted, y_target):
    error = y_predicted - y_target  # element-wise subtraction
    return torch.sum(error**2) / error.numel()  # mean (sum/n)
predicted = mylnmodel(tv,radio,news)
loss = MSE(y_predicted = predicted, y_target=sales)
print(loss)
# initial values for the coefficients is random, gradients are not calculated
print(f'theta0 = {float(theta0.item()):+2.4f}, df(a)/da = {theta0.grad}')
print(f'theta1 = {float(theta1.item()):+2.4f}, df(b)/db = {theta1.grad}')
print(f'theta2 = {float(theta2.item()):+2.4f}, df(c)/dc = {theta2.grad}')
print(f'theta3 = {float(theta3.item()):+2.4f}, df(d)/dd = {theta3.grad}')
loss.backward()
plt.plot(myMSE);
plt.xlabel('Epoch (#)'), plt.ylabel('Mean squared Errors')
plt.figure(figsize=(3,3))
plt.scatter(sales, predicted.detach(), c='k', s=4)
plt.xlabel('sales'), plt.ylabel('predicted');
x = y = range(30)
plt.plot(x,y, c='brown')
plt.xlim(0,35), plt.ylim(0,35);
plt.text(25, 15, f'theta0 = {theta0.item():2.4f}', fontsize=8)
plt.text(25, 12, f'tv = {theta1.item():2.4f}', fontsize=8)
plt.text(25, 9, f'radio = {theta2.item():2.4f}', fontsize=8)
plt.text(25, 6, f'news = {theta3.item():2.4f}', fontsize=8)
data = pd.read_csv("datasets/Advertising.csv")
# matrix form
# costum_data is a 200 x 4 matrix (intercept column + the 3 features)
costum_data = data.loc[:,['TV','Radio','Newspaper']]
costum_data.insert(loc=0,
column='theta0',
value=1)
X = torch.tensor(costum_data.values)
y = torch.tensor(data.Sales.values)
print(X.shape)
def model(X:torch.Tensor):
"""
Performs the matrix vector multiplication
"""
assert len(X.shape) == 2
return X @ theta.T
predicted = model(X)
loss = MSE(y_predicted = predicted, y_target=y)
print(loss)
theta.grad.zero_()
plt.figure(figsize=(3,3))
plt.scatter(sales, predicted.detach(), c='gray', s=4)
plt.xlabel('sales'), plt.ylabel('predicted');
x = y = range(30)
plt.plot(x,y, c='brown')
plt.xlim(0,35), plt.ylim(0,35);
1.2 Perceptron
x ∈ ℝ^d, y ∈ {0, 1}

h_θ(x) = g(θ^T x)

g(z) = 1 if z ≥ 0, 0 if z < 0        (1.1)
Binary step function :
If we then let h_θ(x) = g(θ^T x) as before, but using this modified definition of g, we use the update rule :

θ ← init(0)
For i in 1, 2, ..., n :
    θ_j := θ_j + α (y^{(i)} − h_θ(x^{(i)})) x_j^{(i)}
Pytorch Perceptron
import numpy as np
import matplotlib.pyplot as plt
import torch
import pandas as pd
%matplotlib inline
device = torch.device('cpu')  # device used by the Perceptron class below
class Perceptron():
    def __init__(self, num_features):
        self.num_features = num_features
        self.weights = torch.zeros(num_features, 1, dtype=torch.float32, device=device)
        self.bias = torch.zeros(1, dtype=torch.float32, device=device)
for i in range(y.size()[0]):
    # use view because backward expects a matrix (i.e., 2D tensor)
    errors = self.backward(x[i].view(1, self.num_features), y[i]).view(-1)
    self.weights += (errors * x[i]).view(self.num_features, 1)
    self.bias += errors
data = pd.read_csv("datasets/diabetes.csv")
data
array = data.values
X, y = array[:,0:2] , array[:,8]
ppn = Perceptron(num_features=2)
w, b = ppn.weights, ppn.bias
x_min = -2
y_min = (-(w[0] * x_min) - b[0]) / w[1]
x_max = 2
y_max = (-(w[0] * x_max) - b[0]) / w[1]
ax[1].legend(loc='upper left')
plt.show()
x ∈ ℝ^d, y ∈ {0, 1}

h_θ(x) = g(θ^T x)
Odds : Simply put, odds are the chances of success divided by the chances of failure, represented in the form of a ratio (as shown in the equations below) :

odds = p / (1 − p)

log( p / (1 − p) ) = θ^T x

p / (1 − p) = exp(θ^T x)

p = exp(θ^T x) / (1 + exp(θ^T x))

p = 1 / (1 + exp(−θ^T x))

g(z) = 1 / (1 + e^{−z})
L(θ) = Π_{i=1}^{n} p(y^{(i)} | x^{(i)} ; θ)

L(θ) = Π_{i=1}^{n} [h_θ(x^{(i)})]^{y^{(i)}} [1 − h_θ(x^{(i)})]^{(1 − y^{(i)})}
How do we maximize the likelihood ? Similar to our derivation in the case of linear regression, we can use gradient ascent. Written in vectorial notation, our updates are therefore given by θ := θ + α ∇_θ l(θ), with

∇_θ l(θ) = [y − h_θ(x)] · x

θ_j := θ_j + α (1/n) Σ_{i=1}^{n} [y^{(i)} − h_θ(x^{(i)})] · x_j^{(i)}
data = pd.read_csv("datasets/diabetes.csv")
array = data.values
X, y = array[:,0:2] , array[:,8]
class LogisticRegression(torch.nn.Module):
    def __init__(self, input_dim, output_dim):
        super(LogisticRegression, self).__init__()
        # single linear layer + sigmoid (reconstructed from the usage below: BCELoss and 'linear.weight')
        self.linear = torch.nn.Linear(input_dim, output_dim)

    def forward(self, x):
        return torch.sigmoid(self.linear(x))
epochs = 200000
input_dim = 2 # Two inputs x1 and x2
output_dim = 1 # Single binary output
learning_rate = 0.01
model = LogisticRegression(input_dim,output_dim)
criterion = torch.nn.BCELoss()# binary cross entropy
optimizer = torch.optim.SGD(model.parameters(), lr=learning_rate)
losses = []
losses_test = []
Iterations = []
iter = 0
for epoch in tqdm(range(int(epochs)),desc='Training Epochs'):
x = X_train
labels = y_train
optimizer.zero_grad() # Setting our stored gradients equal to zero
outputs = model(X_train)
loss = criterion(torch.squeeze(outputs), labels) # [200,1] -squeeze-> [200]
loss.backward() # Computes the gradient of the given tensor w.r.t. graph leaves
optimizer.step() # Updates the parameters
iter+=1
if iter%10000==0:
# calculate Accuracy
with torch.no_grad():
# Calculating the loss and accuracy for the test dataset
predicted_test = outputs_test.round().detach().numpy()
total_test += y_test.size(0)
correct_test += np.sum(predicted_test == y_test.detach().numpy())
accuracy_test = 100 * correct_test/total_test
losses_test.append(loss_test.item())
def model_plot(model,X,y,title):
parm = {}
b = []
for name, param in model.named_parameters():
parm[name]=param.detach().numpy()
w = parm['linear.weight'][0]
b = parm['linear.bias'][0]
plt.scatter(X[:, 0], X[:, 1], c=y,cmap='jet')
u = np.linspace(X[:, 0].min(), X[:, 0].max(), 2)
plt.plot(u, (0.5-b-w[0]*u)/w[1])
plt.xlim(X[:, 0].min()-0.5, X[:, 0].max()+0.5)
plt.ylim(X[:, 1].min()-0.5, X[:, 1].max()+0.5)
plt.xlabel(r'$x_1$', fontsize=16)
plt.ylabel(r'$x_2$', fontsize=16)
plt.title(title)
plt.show()
# Train Data
model_plot(model,X_train,y_train,'Train Data')
Consider a classification problem that involves k classes. Let x be the feature vector and y the corresponding class; our predictor variable follows a multinomial distribution, that is y ∈ {1, 2, ..., k}.
Now, we would like to model the probability of y given x, P(y | x), which is a vector of the probabilities of y being each of the classes given the features :
P(y = 1 | x) = φ_1
P(y = 2 | x) = φ_2
P(y = 3 | x) = φ_3
...
P(y = k | x) = φ_k        (1.2)

Σ_{i=1}^{k} φ_i = 1
We can assume that the log-odds of y = i with respect to y = k has a linear relationship with the independent variable x, for i = 1, 2, ..., k :

ln(odds_i) = θ_i^T x

ln( P(y = i | x) / P(y = k | x) ) = θ_i^T x

P(y = i | x) / P(y = k | x) = e^{θ_i^T x}

P(y = i | x) = e^{θ_i^T x} · P(y = k | x)

Since the sum of P(y = j | x) for j = 1, 2, ..., k is equal to 1 :

Σ_{j=1}^{k} P(y = j | x) = Σ_{j=1}^{k} e^{θ_j^T x} · P(y = k | x) = P(y = k | x) · Σ_{j=1}^{k} e^{θ_j^T x} = 1

P(y = k | x) = 1 / Σ_{j=1}^{k} e^{θ_j^T x}

By substitution :

P(y = i | x) = e^{θ_i^T x} · P(y = k | x) = e^{θ_i^T x} / Σ_{j=1}^{k} e^{θ_j^T x}
SoftMax Architecture
P(y^{(i)} | x^{(i)}) = Π_{l=1}^{k} ( P(y^{(i)} = l | x^{(i)}) )^{1(y^{(i)} = l)}

L(θ) = Π_{i=1}^{m} Π_{l=1}^{k} ( P(y^{(i)} = l | x^{(i)}) )^{1(y^{(i)} = l)}

L(θ) = Π_{i=1}^{m} Π_{l=1}^{k} ( e^{θ_l^T x^{(i)}} / Σ_{j=1}^{k} e^{θ_j^T x^{(i)}} )^{1(y^{(i)} = l)}
The log-likelihood function :

l(θ) = ln L(θ)

l(θ) = ln Π_{i=1}^{m} Π_{l=1}^{k} ( e^{θ_l^T x^{(i)}} / Σ_{j=1}^{k} e^{θ_j^T x^{(i)}} )^{1(y^{(i)} = l)}

l(θ) = Σ_{i=1}^{m} Σ_{l=1}^{k} 1(y^{(i)} = l) · ln( e^{θ_l^T x^{(i)}} / Σ_{j=1}^{k} e^{θ_j^T x^{(i)}} )
The partial derivative of the (negative) log-likelihood with respect to any element θ_{gh} of the weight matrix is :

∂(−l(θ))/∂θ_{gh} = Σ_{i=1}^{m} ( e^{θ_g^T x^{(i)}} / Σ_{j=1}^{k} e^{θ_j^T x^{(i)}} − 1(y^{(i)} = g) ) · x_h^{(i)}
df = pd.read_csv('datasets/Iris.csv')
df.head()
array = df.values
X, y = array[:,1:5] , array[:,5]
class SoftmaxRegression(torch.nn.Module):
    def __init__(self, num_features, num_classes):
        # constructor signature and linear layer reconstructed from context
        super(SoftmaxRegression, self).__init__()
        self.linear = torch.nn.Linear(num_features, num_classes)
        self.linear.weight.detach().zero_()
        self.linear.bias.detach().zero_()
        # Note: the trailing underscore
        # means "in-place operation" in the context
        # of PyTorch

    def forward(self, x):
        return self.linear(x)
num_epochs = 50
for epoch in range(num_epochs):
print('\nModel parameters:')
print(' Weights: %s' % model2.linear.weight)
print(' Bias: %s' % model2.linear.bias)
In a computational graph, nodes are either input values or functions for combining values. Edges receive their weights as the data flows through the graph. Outbound edges from an input node are weighted with that input value ; outbound edges from a function node are weighted by combining the weights of the inbound edges using the specified function.
— Node : A node in a graph is used to indicate a variable. The variable may be a scalar,
vector, matrix, tensor, or even a variable of another type.
— Edge : An edge represents a function argument and also data dependency. These are just
like pointers to nodes.
— Operation : An operation is a simple function of one or more variables. There is a fixed
set of allowable operations. Functions more complicated than these operations in this set
may be described by composing many operations together.
Properties of nodes & edges : The nodes represent the operations that are applied directly on the data flowing in and out through the edges. For the above set of equations, we can keep the following things in mind while implementing it in TensorFlow :
— It involves two phases :
Phase 1 : make a plan for your architecture (build the graph).
Phase 2 : to train the model and generate predictions, feed it a lot of data.
— Since the inputs act as the edges of the graph, we can use the tf.placeholder() object, which can take any input of the desired datatype.
— For calculating the output 'c', we define a simple multiplication operation and start a TensorFlow session where we pass in the required input values through the feed_dict attribute in the session.run() method for calculating the outputs and the gradients.
Static Graph TF1
# Importing tensorflow version 1
import tensorflow.compat.v1 as tf
tf.disable_v2_behavior()
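The listing above only enables TF1 compatibility mode ; a minimal sketch of the static-graph workflow described above (variable names a, b, c are illustrative) could look like this :

# Phase 1 : build the graph with placeholders for the inputs
a = tf.placeholder(tf.float32, name='a')
b = tf.placeholder(tf.float32, name='b')
c = tf.multiply(a, b, name='c')          # output node
grads = tf.gradients(c, [a, b])          # symbolic gradients dc/da and dc/db

# Phase 2 : run the graph, feeding concrete values through feed_dict
with tf.Session() as sess:
    c_val, grad_vals = sess.run([c, grads], feed_dict={a: 2.0, b: 3.0})
    print(c_val)       # 6.0
    print(grad_vals)   # [3.0, 2.0]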
Advantages :
— Since the graph is static, it provides many possibilities of optimizations in structure and
resource distribution.
— The computations are slightly faster than a dynamic graph because of the fixed structure.
Properties of nodes & edges : The nodes represent the data(in form of tensors) and the edges
represent the operations applied to the input data.
For the equations given in the Introduction, we can keep the following things in mind while
implementing it in Pytorch
Advantages :
— Scalability to different dimensional inputs : Scales very well for different dimensional inputs
as a new pre-processing layer can be dynamically added to the network itself.
— Ease in debugging : These are very easy to debug and are one of the reasons why many
people are shifting from Tensorflow to Pytorch. As the nodes are created dynamically
before any information flows through them, the error becomes very easy to spot as the
user is in complete control of the variables used in the training process.
Disadvantages :
— Allows very little room for graph optimization because a new graph needs to be created
for each training instance/batch.
For example, consider the relatively simple expression : f(x, y, z) = (x + y) * z. This is how we would represent that function as a computational graph :
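As a minimal sketch (not from the course code) of this expression built as a dynamic PyTorch graph, with illustrative values :

import torch

# f(x, y, z) = (x + y) * z, the graph is built as the operations execute
x = torch.tensor(2.0, requires_grad=True)
y = torch.tensor(3.0, requires_grad=True)
z = torch.tensor(4.0, requires_grad=True)

q = x + y          # intermediate node q
f = q * z          # output node f

f.backward()       # backpropagate through the recorded graph
print(f.item())                 # 20.0
print(x.grad, y.grad, z.grad)   # df/dx = z = 4, df/dy = z = 4, df/dz = q = 5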
print(f'a.is_leaf: {a.is_leaf}')
print(f'a.grad_fn: {a.grad_fn}')
print(f'a.grad: {a.grad}')
print()
print(f'b.is_leaf: {b.is_leaf}')
print(f'b.grad_fn: {b.grad_fn}')
print(f'b.grad: {b.grad}')
print()
print(f'c.is_leaf: {c.is_leaf}')
print(f'c.grad_fn: {c.grad_fn}')
print(f'c.grad: {c.grad}')
print()
print(f'd.is_leaf: {d.is_leaf}')
print(f'e.is_leaf: {e.is_leaf}')
print(f'e.grad_fn: {e.grad_fn}')
print(f'e.grad: {e.grad}')
print()
print(f'u.is_leaf: {u.is_leaf}')
print(f'u.grad_fn: {u.grad_fn}')
print(f'u.grad: {u.grad}')
print()
print(f'v.is_leaf: {v.is_leaf}')
print(f'v.grad_fn: {v.grad_fn}')
print(f'v.grad: {v.grad}')
print()
print(f't.is_leaf: {t.is_leaf}')
print(f't.grad_fn: {t.grad_fn}')
print(f't.grad: {t.grad}')
print(tf.__version__)
a=tf.constant(2.0)
b=tf.constant(3.0)
c=tf.constant(5.0)
d=tf.constant(10.0)
def log10(x):
x1 = tf.math.log(x)
x2 = tf.math.log(10.0)
return x1/ x2
u = tf.multiply(a,b,name='u')
t = log10(d)
v = tf.multiply(t,c,name='v')
e = tf.add(u,v,name='e')
print(u,t,v,e,sep='\n')
In calculus, the chain rule is a formula for computing the derivative of the composition of two
or more functions. The chain rule may be written in Leibniz’s notation in the following way. If a
variable z depends on the variable y, which itself depends on the variable x, so that y and z are
therefore dependent variables, then z, via the intermediate variable of y, depends on x as well.
The chain rule is as follows.
dz/dx = (dz/dy) · (dy/dx)
Back Propagation
Forward Pass :
y = x^2, L = 2y, so the loss is L = 2x^2.
Backward Pass :
∂L/∂y = 2, ∂y/∂x = 2x, hence ∂L/∂x = (∂L/∂y)(∂y/∂x) = 4x.
Examples
Forward Pass :
(1) y = y(x)
(2) u = u(y), v = v(y)
(3) L = L(u, v)
Loss : L(u(y(x)), v(y(x)))
Backward Pass :
∂L/∂x = ∂L/∂u · ∂u/∂y · ∂y/∂x + ∂L/∂v · ∂v/∂y · ∂y/∂x
# leaf tensors (values mirror the TensorFlow constants defined earlier)
a = torch.tensor(2.0, requires_grad=True)
b = torch.tensor(3.0, requires_grad=True)
c = torch.tensor(5.0, requires_grad=True)
d = torch.tensor(10.0, requires_grad=True)
u = a*b
t = torch.log10(d)
v = t*c
t.retain_grad()
e = u+v
print(e)
e.backward()
display(Math(fr'\frac{{\partial e}}{{\partial a}} = {a.grad.item()}'))
Autograd Module : autograd provides the functionality for easy calculation of gradients, without explicitly implementing the forward and backward pass for all layers by hand.
Backward computation TensorFlow
import tensorflow as tf
import os
from IPython.display import display, Math
import numpy as np
def log10(x):
x1 = tf.math.log(x)
x2 = tf.math.log(10.0)
return x1/ x2
# Create the graph (reconstructed sketch: same graph as above, recorded with a GradientTape)
a, b, c, d = tf.constant(2.0), tf.constant(3.0), tf.constant(5.0), tf.constant(10.0)
with tf.GradientTape() as tape:
    tape.watch([a, b, c, d])  # constants are not watched automatically
    u = tf.multiply(a, b, name='u')
    t = log10(d)
    v = tf.multiply(t, c, name='v')
    e = tf.add(u, v, name='e')
grad = tape.gradient(e, [a, b, c, d])
for g in grad:
    print(g)
Each neuron computes g(a^T x + b), a non-linearity applied to an affine function of its inputs.
— Each neuron in each layer is fully connected to all neurons of the previous layer.
— The overall length of the chain is the depth of the model – "Deep Learning".
— The output layer is the last layer in a neural network; it computes the output.
— The loss function compares the result of the output layer to the target value(s).
— The choice of output layer and loss function depends on the task (discrete, continuous, ...).
Log-Likelihood :

w_ML = argmax_w p_model(y | X, w)

w_ML = argmax_w Π_{i=1}^{N} p_model(y_i | x_i, w)

w_ML = argmax_w Σ_{i=1}^{N} log p_model(y_i | x_i, w)
Mapping :
f_w : ℝ^{W×H} → ℝ
Binary Classification
Mapping :
f_w : ℝ^{W×H} → {1, 0}, i.e. {'Beach', 'No Beach'}
Multi-Class Classification
Gaussian Distribution :

p(y | x, w) = 1/√(2πσ^2) · exp( −(y − µ)^2 / (2σ^2) )

— µ : mean.
— σ : standard deviation.
— The distribution has thin "tails" : p(y) → 0 quickly as y → ∞

w_ML = argmax_w Σ_{i=1}^{N} log p_model(y_i | x_i, w)

w_ML = argmax_w − Σ_{i=1}^{N} (1/2) log(2πσ^2) − Σ_{i=1}^{N} (1/(2σ^2)) (f_w(x_i) − y_i)^2

w_ML = argmax_w − Σ_{i=1}^{N} (f_w(x_i) − y_i)^2

w_ML = argmin_w Σ_{i=1}^{N} (f_w(x_i) − y_i)^2

MSE = (1/N) Σ_{i=1}^{N} (f_w(x_i) − y_i)^2
Laplace Distribution :

p(y) = 1/(2b) · exp( −|y − µ| / b )

— µ : location.
— b : scale.
— The distribution has heavy "tails" : p(y) → 0 slowly as y → ∞

w_ML = argmax_w Σ_{i=1}^{N} log p_model(y_i | x_i, w)

w_ML = argmax_w − Σ_{i=1}^{N} log(2b) − Σ_{i=1}^{N} (1/b) |f_w(x_i) − y_i|

w_ML = argmax_w − Σ_{i=1}^{N} |f_w(x_i) − y_i|

w_ML = argmin_w Σ_{i=1}^{N} |f_w(x_i) − y_i|

MAE = (1/N) Σ_{i=1}^{N} |f_w(x_i) − y_i|
We minimize the absolute loss (=L1 loss) which is more robust than L2 .
Let p(y | x, w) = 1/(2 g_w(x)) · exp( −|y − f_w(x)| / g_w(x) ) be a Laplace distribution. We obtain :

w_ML = argmax_w − Σ_{i=1}^{N} log(2 g_w(x_i)) − Σ_{i=1}^{N} (1/g_w(x_i)) |f_w(x_i) − y_i|

In this case, we predict both the location µ and the scale b with a neural network, i.e. the network also predicts its uncertainty (variance/scale).
w_ML = argmax_w Σ_{i=1}^{N} log[ f_w(x_i)^{y_i} (1 − f_w(x_i))^{(1 − y_i)} ]

w_ML = argmin_w Σ_{i=1}^{N} −y_i log f_w(x_i) − (1 − y_i) log(1 − f_w(x_i))
Alternative notation :

p(y) = Π_{c=1}^{C} µ_c^{y_c}

w_ML = argmax_w Σ_{i=1}^{N} log[ Π_{c=1}^{C} f_w^{(c)}(x_i)^{y_{i,c}} ]

w_ML = argmin_w Σ_{i=1}^{N} Σ_{c=1}^{C} −y_{i,c} log f_w^{(c)}(x_i)

softmax(x) = ( e^{x_1} / Σ_{k=1}^{C} e^{x_k}, ..., e^{x_C} / Σ_{k=1}^{C} e^{x_k} )

f_w^{(c)}(x) = e^{x_c} / Σ_{k=1}^{C} e^{x_k}
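As a minimal sketch (not from the course code) of how this categorical cross-entropy maps onto PyTorch, where F.cross_entropy combines the softmax and the negative log-likelihood :

import torch
import torch.nn.functional as F

logits = torch.tensor([[2.0, 0.5, -1.0]])   # network outputs for one example, C = 3
target = torch.tensor([0])                   # index of the true class

# explicit computation : softmax, then negative log-probability of the true class
probs = F.softmax(logits, dim=1)
manual_ce = -torch.log(probs[0, target])

# built-in : applies log-softmax + NLL in one numerically stable call
builtin_ce = F.cross_entropy(logits, target)

print(manual_ce.item(), builtin_ce.item())   # both values agree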
D_KL(P || Q) = Σ_{x ∈ X} P(x) · log( P(x) / Q(x) )
The K-L divergence is an important feature in a variety of machine learning models. One in
particular is the Variational Autoencoder (VAE).
Example
Suppose we have two probability distributions, P and Q, and we want to measure the difference between them ; we can simply apply the KL divergence as shown below.
D(q || p) = 0.096 nats
"nats" is simply the unit of information obtained by using the natural logarithm (ln(x)).
Python code
KL Divergence
# box =[P(green),P(blue),P(red),P(yellow)]
box_1 = [0.25, 0.33, 0.23, 0.19]
box_2 = [0.21, 0.21, 0.32, 0.26]
import numpy as np
from scipy.special import rel_entr
def kl_divergence(p, q):
    # D(p || q) = sum_x p(x) * log(p(x)/q(x)); rel_entr gives the element-wise terms
    return np.sum(rel_entr(p, q))

# D( p || p) = 0
print('KL-divergence(box_1 || box_1): %.3f ' % kl_divergence(box_1,box_1))
The role of the Activation Function is to derive output from a set of input values fed to a node.
— Hidden layer hi = g(Ai hi−1 + bi ) with activation function g(·) and weights Ai , bi
— The activation function is frequently applied element-wise to its input
— Activation functions must be non-linear to learn non-linear mappings
— Some of them are not differentiable everywhere (but still ok for training)
2.4.1 Sigmoid
g(x) = 1 / (1 + e^{−x})
— Maps input to range [0, 1].
— Neuroscience interpretation as saturating “firing rate” of neurons.
2.4.2 Tanh
g(x) = 2/(1 + e^{−2x}) − 1 = (e^x − e^{−x}) / (e^x + e^{−x})

— Maps input to range [−1, 1].
— Zero-centered.
Problems
2.4.3 ReLU
g(x) = max(0, x)
Problems
— Not zero-centered
— No learning for x < 0 ("dead ReLUs")
2.4.4 Leaky ReLU
g(x) = max(0.01x, x)
Note : there is also an alternative, Parametric ReLU : g(x) = max(αx, x), with the same advantages as Leaky ReLU and the parameter α learned from data.
2.4.5 Exponential Linear Units (ELU)
g(x) = x if x > 0, α(e^x − 1) if x ≤ 0        (2.1)
— All benefits of Leaky ReLU
— Adds some robustness to noise
— Default α = 1
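A minimal sketch (not from the course code) comparing these activation functions using PyTorch's built-in versions :

import torch
import torch.nn.functional as F

x = torch.linspace(-3, 3, 7)

print(torch.sigmoid(x))          # 2.4.1 : maps to (0, 1)
print(torch.tanh(x))             # 2.4.2 : maps to (-1, 1), zero-centered
print(F.relu(x))                 # 2.4.3 : max(0, x)
print(F.leaky_relu(x, 0.01))     # 2.4.4 : max(0.01x, x)
print(F.elu(x, alpha=1.0))       # 2.4.5 : x for x > 0, alpha*(exp(x)-1) otherwise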
This problem can be described as approximating a function that maps examples of inputs to examples of outputs. Approximating a function can be solved by framing the problem as function optimization.
Gradient Descent :

w_0 = w_init
w_{t+1} = w_t − α ∇_w L(w_t)

— The neural network loss L(w) is not convex w.r.t. the network parameters w
— There exist multiple local minima, but we will find only one through optimization
— Example : we can permute all hidden units in a layer and get the same solution
— It is known that many local minima in deep networks are good ones
Learning rate
— Choosing the learning rate too low leads to very slow progress
— Choosing the learning rate too high might lead to divergence
Exploding gradients
Saddle point
The word ‘stochastic‘ means a system or process linked with a random probability. Hence, in
Stochastic Gradient Descent, a few samples are selected randomly instead of the whole data set
for each iteration. In Gradient Descent, there is a term called “batch” which denotes the total
number of samples from a dataset that is used for calculating the gradient for each iteration.
Algorithm :
1. Initialize weights w0 and pick learning rate η and minibatch size |Xbatch |.
2. Draw random minibatch (x1 , y1 ), ..., (xB , yB ) ⊆ X(withB << N )
3. For all minibatch elements b ∈ 1, ..., B do :
(a) Forward propagate xi through network to calculate prediction ŷi
(b) Backpropagate to obtain gradient∇w Li (wt ) = ∇w Li (ŷi , yi , wt )
4. Update gradients :wt+1 = wt − η N1 Σi ∇w Li (wt ).
5. If validation error decreases, go to step 2, otherwise stop.
Choosing the learning rate η too high leads to oscillations (also no convergence).
A good learning rate η works better, but still slow, inefficient and no convergence.
Convergence of SGD :
A sequence is convergent if there exists a number s* such that, for every arbitrarily small positive number ε, |s_n − s*| < ε for all sufficiently large n.
Optimization converges if there exists a vector w* such that, for every arbitrarily small positive number ε, there exists an integer T such that for all t ≥ T :
|| w_t − w* || < ε
Problems of SGD
Momentum is an extension to the gradient descent optimization algorithm that allows the search
to build inertia in a direction in the search space and overcome the oscillations of noisy gradients
and coast across flat spots of the search space.
— SGD oscillates along w2 axis, we should dampen, e.g., by averaging over time.
— SGD makes slow progress along w1 axis, we like to accelerate in this direction.
— Idea of momentum : update weights with exponential moving average of gradients.
w_{t+1} = w_t + η m_{t+1}

Let us abbreviate the gradient at iteration t with g_t ≡ ∇_w L_B(w_t). We have :

m_{t+1} = β_1 m_t − (1 − β_1) g_t

Unrolling the recursion (with m_0 = 0) gives :

m_t = −(1 − β_1) Σ_{i=0}^{t−1} β_1^{t−i−1} g_i
Example : for time steps t_1, t_2, t_3, ..., t_n with values b_1, b_2, b_3, ..., b_n and 0 ≤ γ ≤ 1 :

V_{t_1} = b_1
V_{t_2} = γ V_{t_1} + b_2
V_{t_3} = γ V_{t_2} + b_3 = γ^2 V_{t_1} + γ b_2 + b_3
SGD with Nesterov Momentum : leads to faster dampening and has significantly increased the performance of RNNs on a variety of tasks.
RMSprop
Root Mean Squared Propagation, or RMSprop, is an extension of gradient descent and of the AdaGrad version of gradient descent that uses a decaying average of partial gradients in the adaptation of the step size for each parameter. The main idea is to "divide the gradient by a running average of its recent magnitude".
Motivation for RMSprop :
— The gradient distribution is very uneven (not equal) and thus requires conservative learning rates.
— In this SGD example, gradients are very large in w_2 but small in w_1.
— Idea of RMSprop : divide the learning rate by a moving average of squared gradients.
w_{t+1} = w_t − η ∇_w L_B(w_t) / ( √(v_{t+1}) + ε )
Adam
Adaptive Moment Estimation (Adam) is a method that computes adaptive learning rates for each
parameter. It stores both the decaying average of the past gradients mt , similar to momentum
and also the decaying average of the past squared gradients vt , similar to RMSprop and Adadelta.
Thus, it combines the advantages of both the methods. Adam is the default choice of the optimizer
for any application in general.
w_{t+1} = w_t − α m_{t+1} / ( √(v_{t+1}) + ε )
— Fixed learning rate (not a good idea : too slow in the beginning and fast in the end)
— Inverse proportional decay : ηt = η/t (Robbins and Monro)
— Exponential decay : η_t = η α^t
— Step decay : η ← αη (every K iterations/epochs, common in practice : α = 0.5)
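As a minimal sketch (not from the course code) of how these optimizers and a step-decay schedule are typically instantiated in PyTorch (the model and hyperparameters are illustrative) :

import torch

model = torch.nn.Linear(10, 1)   # placeholder model

# the optimizers discussed above
opt_sgd      = torch.optim.SGD(model.parameters(), lr=0.01)
opt_momentum = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9, nesterov=True)
opt_rmsprop  = torch.optim.RMSprop(model.parameters(), lr=0.01, alpha=0.99, eps=1e-8)
opt_adam     = torch.optim.Adam(model.parameters(), lr=0.001, betas=(0.9, 0.999), eps=1e-8)

# step decay : multiply the learning rate by alpha = 0.5 every K = 30 epochs
scheduler = torch.optim.lr_scheduler.StepLR(opt_adam, step_size=30, gamma=0.5)
# in the training loop : optimizer.step() after each batch, scheduler.step() once per epoch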
Hyperparameter Search
Hyperparameters :
— Network architecture
— Number of iterations
— Batch size
— Learning rate schedule
— Regularization
Methods :
— Manual search
— Most common
— Build intuitions
— Grid search
— Define ranges
— Systematically evaluate
— Requires large resources
— Random search
— Like grid search, but hyperparameters are selected based on random draws
3.2 Regularization
— Underfitting : Model too simple, does not achieve low error on training set.
— Overfitting : Training error small, but test error (= generalization error) large.
Bias : While making predictions, a difference occurs between the values predicted by the model and the actual/expected values; this difference is known as the bias error (or error due to bias).
— Low Bias : A low bias model will make fewer assumptions about the form of the target
function.
— High Bias : A model with a high bias makes more assumptions, and the model becomes
unable to capture the important features of our dataset. A high bias model also cannot
perform well on new data.
Variance : Variance tells how much a random variable differs from its expected value.
Ideally, a model should not vary too much from one training dataset to another, which means
the algorithm should be good in understanding the hidden mapping between inputs and output
variables.
Ways to reduce High Variance :
L2 Regularization
L2 regularization is the most common type of all regularization techniques and is also commonly known as weight decay or Ridge Regression.
So, if we're predicting house prices again, this means the less significant features for predicting the house price would still have some influence over the final prediction, but it would only be a small influence.

L̂(X, w) = L(X, w) + (α/2) ||w||_2^2
L1 Regularization
L1 regularization, also known as L1 norm or Lasso (in regression problems), combats overfitting by shrinking the parameters towards 0. This makes some features obsolete.
It's a form of feature selection, because when we assign a feature a 0 weight, we're multiplying the feature values by 0, which returns 0, eradicating the significance of that feature.

L̂(X, w) = L(X, w) + (α/2) ||w||_1
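A minimal sketch (not from the course code) of both penalties in PyTorch : L2 via the optimizer's weight_decay argument, L1 added to the loss by hand :

import torch

model = torch.nn.Linear(10, 1)          # placeholder model
criterion = torch.nn.MSELoss()

# L2 / weight decay : handled directly by the optimizer
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, weight_decay=1e-4)

x, y = torch.randn(32, 10), torch.randn(32, 1)   # dummy batch
alpha = 1e-4

loss = criterion(model(x), y)
# L1 penalty : alpha times the sum of absolute parameter values
l1_penalty = sum(p.abs().sum() for p in model.parameters())
loss = loss + alpha * l1_penalty

optimizer.zero_grad()
loss.backward()
optimizer.step()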
L1 vs L2
— L1 regularization penalizes the sum of absolute values of the weights, whereas L2 regularization penalizes the sum of squares of the weights.
— The L1 regularization solution is sparse. The L2 regularization solution is non-sparse.
— L2 regularization doesn’t perform feature selection, since weights are only reduced to
values near 0 instead of 0. L1 regularization has built-in feature selection.
— L1 regularization is robust to outliers, L2 regularization is not.
3.2.4 Dropout
Dropout means that, during training, each neuron of the network is turned off with some probability p. Let's look at a visual example.
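Complementing the visual example, a minimal sketch (not from the course code) of dropout in a PyTorch model ; model.eval() disables dropout at inference time :

import torch
import torch.nn as nn

net = nn.Sequential(
    nn.Linear(20, 64),
    nn.ReLU(),
    nn.Dropout(p=0.5),   # each hidden unit is zeroed with probability 0.5 during training
    nn.Linear(64, 1),
)

x = torch.randn(4, 20)
net.train()
print(net(x))   # dropout active : outputs vary between calls
net.eval()
print(net(x))   # dropout disabled : deterministic output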
DNN Pytorch
from numpy import vstack
from numpy import sqrt
from pandas import read_csv
from sklearn.metrics import mean_squared_error
from torch.utils.data import Dataset
from torch.utils.data import DataLoader
from torch.utils.data import random_split
from torch import Tensor
from torch.nn import Linear
from torch.nn import Sigmoid
from torch.nn import Module
from torch.optim import SGD
from torch.nn import MSELoss
from torch.nn.init import xavier_uniform_
# dataset definition
class CSVDataset(Dataset):
# load the dataset
def __init__(self, path):
# load the csv file as a dataframe
df = read_csv(path, header=None)
# store the inputs and outputs
self.X = df.values[:, :-1].astype('float32')
self.y = df.values[:, -1].astype('float32')
# ensure target has the right shape
self.y = self.y.reshape((len(self.y), 1))
DNN Pytorch
# pytorch mlp for binary classification
from numpy import vstack
from pandas import read_csv
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import accuracy_score
from torch.utils.data import Dataset
from torch.utils.data import DataLoader
from torch.utils.data import random_split
from torch import Tensor
from torch.nn import Linear
from torch.nn import ReLU
from torch.nn import Sigmoid
from torch.nn import Module
from torch.optim import SGD
from torch.nn import BCELoss
from torch.nn.init import kaiming_uniform_
from torch.nn.init import xavier_uniform_
from tqdm import tqdm
# dataset definition
class CSVDataset(Dataset):
# load the dataset
def __init__(self, path):
# load the csv file as a dataframe
df = read_csv(path, header=None)
# store the inputs and outputs
self.X = df.values[:, :-1]
self.y = df.values[:, -1]
# ensure input data is floats
self.X = self.X.astype('float32')
# label encode target and ensure the values are floats
self.y = LabelEncoder().fit_transform(self.y)
self.y = self.y.astype('float32')
self.y = self.y.reshape((len(self.y), 1))
# model definition
class MLP(Module):
# define model elements
def __init__(self, n_inputs):
super(MLP, self).__init__()
# input to first hidden layer
self.hidden1 = Linear(n_inputs, 10)
kaiming_uniform_(self.hidden1.weight, nonlinearity='relu')
self.act1 = ReLU()
# second hidden layer
self.hidden2 = Linear(10, 8)
kaiming_uniform_(self.hidden2.weight, nonlinearity='relu')
self.act2 = ReLU()
# third hidden layer and output
self.hidden3 = Linear(8, 1)
xavier_uniform_(self.hidden3.weight)
self.act3 = Sigmoid()
DNN Pytorch
# pytorch mlp for multiclass classification
from numpy import vstack
from numpy import argmax
from pandas import read_csv
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import accuracy_score
from torch import Tensor
from torch.utils.data import Dataset
from torch.utils.data import DataLoader
# dataset definition
class CSVDataset(Dataset):
# load the dataset
def __init__(self, path):
# load the csv file as a dataframe
df = read_csv(path, header=None)
# store the inputs and outputs
self.X = df.values[:, :-1]
self.y = df.values[:, -1]
# ensure input data is floats
self.X = self.X.astype('float32')
# label encode target and ensure the values are floats
self.y = LabelEncoder().fit_transform(self.y)
# model definition
class MLP(Module):
# define model elements
def __init__(self, n_inputs):
DNN Keras
# Regression Example With Boston Dataset: Standardized
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import sklearn
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
model = baseline_model()
model.fit(X_train, y_train, epochs=20)
pred_train= model.predict(X_train)
print(np.sqrt(mean_squared_error(y_train,pred_train)))
pred= model.predict(X_test)
print(np.sqrt(mean_squared_error(y_test,pred)))
DNN Keras
# first neural network with keras tutorial
from numpy import loadtxt
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
# load the dataset
dataset = loadtxt('datasets/pima-indians-diabetes.csv', delimiter=',')
# split into input (X) and output (y) variables
X = dataset[:,0:8]
y = dataset[:,8]
# define the keras model
model = Sequential()
model.add(Dense(12, input_shape=(8,), activation='relu'))
DNN Keras
# multi-class classification with Keras
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import sklearn
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.wrappers.scikit_learn import KerasClassifier
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import KFold
from sklearn.preprocessing import MinMaxScaler
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import LabelEncoder
from sklearn.pipeline import Pipeline
from keras.utils import to_categorical
# load dataset
dataframe = pd.read_csv("datasets/Iris.csv", header=None)
dataset = dataframe.values
X = dataset[:,0:4].astype(float)
Y = dataset[:,4]
# encode class values as integers
encoder = LabelEncoder()
encoder.fit(Y)
encoded_Y = encoder.transform(Y)
# convert integers to dummy variables (i.e. one hot encoded)
dummy_y = to_categorical(encoded_Y)
The tools we have had so far : expensive to learn ! Will not generalize well ! Do not exploit the order and local relations in the data !
Consequently, the biggest advantage of a convolutional neural network, when compared to a fully
connected neural network, is a smaller number of parameters.
For example, if the input I has dimension 32 × 32 and we apply 10 filters of dimension 3 × 3, the output will be a tensor of shape 30 × 30 × 10. Every filter has 3 · 3 = 9 parameters plus one bias element, which in total for 10 filters gives 10 × 10 = 100 parameters. On the other hand, with a fully connected neural network, we would need to flatten the input matrix into a 32 · 32 = 1024 dimensional vector. In order to produce an output of the same dimension as above (30 · 30 · 10 = 9000 values), we would need 1024 × 9000 = 9,216,000 parameters, or weights.
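A quick sketch (not from the course code) verifying these parameter counts with PyTorch :

import torch
import torch.nn as nn

conv = nn.Conv2d(in_channels=1, out_channels=10, kernel_size=3)   # 10 filters of size 3x3
print(sum(p.numel() for p in conv.parameters()))   # (3*3 + 1) * 10 = 100 parameters

x = torch.randn(1, 1, 32, 32)                      # one 32x32 single-channel image
print(conv(x).shape)                               # torch.Size([1, 10, 30, 30])

fc = nn.Linear(32 * 32, 30 * 30 * 10)              # fully connected equivalent
print(sum(p.numel() for p in fc.parameters()))     # 1024*9000 weights + 9000 biases = 9,225,000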
The convolution operation can be written as (f ∗ w)(x, y) = Σ_s Σ_t f(s, t) · w(x − s, y − t), where f(x, y) is the input image and w is the filter or kernel. More intuitively, we can imagine this process looking at the illustration below :
A filter provides a measure for how close a patch or a region of the input resembles a feature. It
acts as a single template or pattern, which, when convolved across the input, finds similarities
Padding
Zero-padding denotes the process of adding P zeroes to each side of the boundaries of the input.
This value can either be manually specified or automatically set through one of the three modes
detailed below :
Every time we apply a convolutional operation, the size of the image shrinks. Pixels in the corners of the image are used only a few times during convolution compared to the central pixels. Hence, the network does not focus much on the corners, which can lead to information loss.
To overcome these issues, we can pad the image with an additional border, i.e., we add one pixel
all around the edges. This means that the input will be an 8 X 8 matrix (instead of a 6 X 6
matrix). Applying convolution of 3 X 3 on it will result in a 6 X 6 matrix which is the original
shape of the image. This is where padding comes to the fore :
— Input : n X n
— Filter/Kernel size : f X f
— Padding : p
— Output : (n+2p-f+1) X (n+2p-f+1)
For a convolutional or a pooling operation, the stride S denotes the number of pixels by which
the window moves after each operation.
— Input : 6 X 6 X 3
— Filter : 3 X 3 X 3
Instead of using just a single filter, we can use multiple filters as well. Let’s say the first filter
will detect vertical edges and the second filter will detect horizontal edges from the image. If we
use multiple filters, the output dimension will change. So, instead of having a 4 X 4 output as in
the above example, we would have a 4 X 4 X 2 output (if we have used 2 filters) :
Here, nc is the number of channels in the input and filter, while nc’ is the number of filters.
It is common to periodically insert a Pooling layer in between successive Conv layers in a ConvNet architecture. Its function is to progressively reduce the spatial size of the representation to reduce the amount of parameters and computation in the network.
The Pooling Layer operates independently on every depth slice of the input and resizes it spatially, using the MAX operation.
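A minimal sketch (not from the course code) checking these shape rules with PyTorch layers :

import torch
import torch.nn as nn

x = torch.randn(1, 3, 6, 6)                           # input : 6 x 6 x 3

conv = nn.Conv2d(3, 2, kernel_size=3, padding=0)      # two 3 x 3 x 3 filters, no padding
print(conv(x).shape)                                  # n - f + 1 = 4  ->  [1, 2, 4, 4]

conv_pad = nn.Conv2d(3, 2, kernel_size=3, padding=1)  # padding p = 1
print(conv_pad(x).shape)                              # n + 2p - f + 1 = 6  ->  [1, 2, 6, 6]

pool = nn.MaxPool2d(kernel_size=2, stride=2)          # max pooling halves the spatial size
print(pool(conv_pad(x)).shape)                        # [1, 2, 3, 3]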
https://poloclub.github.io/cnn-explainer/
4.5.1 LeNet-5
4.5.2 AlexNet
This network is similar to LeNet-5 with just more convolution and pooling layers :
— Parameters : 60 million
— Activation function : ReLU
4.5.3 VGG-16
import ssl
ssl._create_default_https_context = ssl._create_unverified_context
import torch
import torchvision
import torchvision.transforms as transforms
import torch.optim as optim
import numpy as np
import matplotlib.pyplot as plt
transform = transforms.Compose(
[transforms.ToTensor(),
transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))])
batch_size = 4
def imshow(img):
img = img / 2 + 0.5 # unnormalize
npimg = img.numpy()
plt.imshow(np.transpose(npimg, (1, 2, 0)))
plt.show()
# get some random training images
dataiter = iter(trainloader)
images, labels = next(dataiter)
# show images
import torch.nn as nn
import torch.nn.functional as F
class Net(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(3, 6, 5)
        self.pool = nn.MaxPool2d(2, 2)
        self.conv2 = nn.Conv2d(6, 16, 5)
        self.fc1 = nn.Linear(16 * 5 * 5, 120)
        self.fc2 = nn.Linear(120, 84)
        self.fc3 = nn.Linear(84, 10)

    def forward(self, x):
        # forward pass (reconstructed; matches the standard PyTorch CIFAR-10 tutorial this listing follows)
        x = self.pool(F.relu(self.conv1(x)))
        x = self.pool(F.relu(self.conv2(x)))
        x = torch.flatten(x, 1)  # flatten all dimensions except the batch dimension
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        x = self.fc3(x)
        return x
net = Net()
criterion = nn.CrossEntropyLoss()
optimizer = optim.SGD(net.parameters(), lr=0.001, momentum=0.9)
for epoch in range(2):  # loop over the dataset multiple times
    running_loss = 0.0
    for i, data in enumerate(trainloader, 0):
        # get the inputs; data is a list of [inputs, labels]
        inputs, labels = data
        # zero the parameter gradients, then forward, backward, optimize
        optimizer.zero_grad()
        outputs = net(inputs)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()
        # print statistics
        running_loss += loss.item()
        if i % 2000 == 1999:    # print every 2000 mini-batches
            print(f'[{epoch + 1}, {i + 1:5d}] loss: {running_loss / 2000:.3f}')
            running_loss = 0.0
print('Finished Training')
dataiter = iter(testloader)
images, labels = next(dataiter)
# print images
imshow(torchvision.utils.make_grid(images))
print('GroundTruth: ', ' '.join(f'{classes[labels[j]]:5s}' for j in range(4)))
outputs = net(images)
_, predicted = torch.max(outputs, 1)
correct = 0
total = 0
# since we're not training, we don't need to calculate the gradients for our outputs
with torch.no_grad():
for data in testloader:
images, labels = data
# calculate outputs by running images through the network
outputs = net(images)
# the class with the highest energy is what we choose as prediction
_, predicted = torch.max(outputs.data, 1)
total += labels.size(0)
correct += (predicted == labels).sum().item()
print(f'Accuracy of the network on the 10000 test images: {100 * correct // total} %')
CNN Keras
import keras
from keras.datasets import mnist
from keras.models import Sequential
from keras.layers import Dense, Dropout, Flatten
from keras.layers import Conv2D, MaxPooling2D
from keras import backend as K
import numpy as np
import ssl
ssl._create_default_https_context = ssl._create_unverified_context
(x_train, y_train), (x_test, y_test) = mnist.load_data()
img_rows, img_cols = 28, 28  # MNIST image dimensions
if K.image_data_format() == 'channels_first':
x_train = x_train.reshape(x_train.shape[0], 1, img_rows, img_cols)
x_test = x_test.reshape(x_test.shape[0], 1, img_rows, img_cols)
input_shape = (1, img_rows, img_cols)
else:
x_train = x_train.reshape(x_train.shape[0], img_rows, img_cols, 1)
x_test = x_test.reshape(x_test.shape[0], img_rows, img_cols, 1)
    input_shape = (img_rows, img_cols, 1)
# scale pixels to [0, 1] and one-hot encode the labels (needed by categorical_crossentropy below)
x_train = x_train.astype('float32') / 255.0
x_test = x_test.astype('float32') / 255.0
y_train = keras.utils.to_categorical(y_train, 10)
y_test = keras.utils.to_categorical(y_test, 10)
model = Sequential()
model.add(Conv2D(32, kernel_size = (3, 3), activation = 'relu', input_shape =
input_shape))
model.add(Conv2D(64, (3, 3), activation = 'relu'))
model.add(MaxPooling2D(pool_size = (2, 2)))
model.add(Dropout(0.25))
model.add(Flatten())
model.add(Dense(128, activation = 'relu'))
model.add(Dropout(0.5))
model.add(Dense(10, activation = 'softmax'))
model.compile(loss = keras.losses.categorical_crossentropy,
optimizer = keras.optimizers.Adadelta(), metrics = ['accuracy'])
model.fit(
x_train, y_train,
batch_size = 128,
epochs = 12,
verbose = 1,
validation_data = (x_test, y_test)
)
pred = model.predict(x_test)
pred = np.argmax(pred, axis = 1)[:5]
label = np.argmax(y_test,axis = 1)[:5]
print(pred)
print(label)
Ordinary feed forward neural networks are only meant for data points, which are independent
of each other. However, if we have data in a sequence such that one data point depends upon
the previous data point, we need to modify the neural network to incorporate the dependencies
between these data points. RNNs have the concept of ‘memory’ that helps them store the states
or information of previous inputs to generate the next output of the sequence.
— Core idea : update hidden state h based on input and previous hidden state using same
update rule (same/shared parameters) at each time step.
— Allows for processing sequences of variable length, not only fixed-sized vectors.
— Infinite memory : h is function of all previous inputs (long-term dependencies).
ht = tanh(Ah ht−1 + Ax xt + b)
ŷt = Ay ht
— Hidden state ht = linear combination of input xt and previous hidden state ht−1 .
— Output ŷt = linear prediction based on current hidden state ht .
— tanh(·) is commonly used as activation function (data is in the range [-1, 1]).
— Parameters Ah , Ax , Ay , b are constant over time (sequences may vary in length).
For each timestep t, the activation a^{<t>} and the output y^{<t>} are expressed as follows :

a^{<t>} = g_1(W_{aa} a^{<t−1>} + W_{ax} x^{<t>} + b_a),    y^{<t>} = g_2(W_{ya} a^{<t>} + b_y)

where W_{aa}, W_{ax}, W_{ya}, b_a, b_y are coefficients shared over time and g_1, g_2 are activation functions.
One-to-one Tx = Ty = 1
One-to-many Tx = 1, Ty > 1
Many-to-one Tx > 1, Ty = 1
The gradient of the loss with respect to the shared weight matrix W is accumulated over all time steps :

∂L/∂W = Σ_{t=1}^{T} (∂L/∂W) |_{(t)}
Sometimes, we only need to look at recent information to perform the present task. For example,
consider a language model trying to predict the next word based on the previous ones. If we are
trying to predict the last word in “the clouds are in the sky,” we don’t need any further context
– it’s pretty obvious the next word is going to be sky. In such cases, where the gap between
the relevant information and the place that it’s needed is small, RNNs can learn to use the past
information.
But there are also cases where we need more context. Consider trying to predict the last word
in the text “I grew up in France. . . I speak fluent French.” Recent information suggests that the
next word is probably the name of a language, but if we want to narrow down which language,
we need the context of France, from further back. It’s entirely possible for the gap between the
relevant information and the point where it is needed to become very large.
Unfortunately, as that gap grows, RNNs become unable to learn to connect the information.
5.2.1 Gates
In order to remedy the vanishing gradient problem, specific gates are used in some types of RNNs and usually have a well-defined purpose. They are usually noted Γ ("Gamma") and are equal to :

Γ = σ(W x^{<t>} + U a^{<t−1>} + b)

where W, U, b are coefficients specific to the gate and σ is the sigmoid function. The main ones are summed up below :
The cell state is kind of like a conveyor belt. It runs straight down the entire chain, with only
some minor linear interactions. It’s very easy for information to just flow along it unchanged.
The LSTM does have the ability to remove or add information to the cell state, carefully regu-
lated by structures called gates.
Gates are a way to optionally let information through. They are composed out of a sigmoid
neural net layer and a pointwise multiplication operation.
An LSTM has three of these gates, to protect and control the cell state.
Let’s go back to our example of a language model trying to predict the next word based on all the
previous ones. In such a problem, the cell state might include the gender of the present subject,
so that the correct pronouns can be used. When we see a new subject, we want to forget the
gender of the old subject.
The next step is to decide what new information we're going to store in the cell state. This has two parts. First, a sigmoid layer called the "input gate layer" decides which values we'll update. Next, a tanh layer creates a vector of new candidate values, C̃_t, that could be added to the state. In the next step, we'll combine these two to create an update to the state.
In the example of our language model, we’d want to add the gender of the new subject to the
cell state, to replace the old one we’re forgetting.
We multiply the old state by f_t, forgetting the things we decided to forget earlier. Then we add i_t ∗ C̃_t, the new candidate values scaled by how much we decided to update each state value.
In the case of the language model, this is where we’d actually drop the information about the
old subject’s gender and add the new information, as we decided in the previous steps.
Finally, we need to decide what we’re going to output. This output will be based on our cell
state, but will be a filtered version. First, we run a sigmoid layer which decides what parts of the
cell state we’re going to output. Then, we put the cell state through tanh(to push the values to
be between -1 and 1 ) and multiply it by the output of the sigmoid gate, so that we only output
the parts we decided to.
For the language model example, since it just saw a subject, it might want to output information
relevant to a verb, in case that’s what is coming next. For example, it might output whether
the subject is singular or plural, so that we know what form a verb should be conjugated into if
that’s what follows next.
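The gates described above are commonly summarized by the following standard LSTM equations (standard notation, not taken verbatim from the course) :

f_t = σ(W_f x_t + U_f h_{t−1} + b_f)        (forget gate)
i_t = σ(W_i x_t + U_i h_{t−1} + b_i)        (input gate)
C̃_t = tanh(W_C x_t + U_C h_{t−1} + b_C)     (candidate values)
C_t = f_t ∗ C_{t−1} + i_t ∗ C̃_t             (cell state update)
o_t = σ(W_o x_t + U_o h_{t−1} + b_o)        (output gate)
h_t = o_t ∗ tanh(C_t)                        (hidden state / output)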
import numpy as np
import torch
import torch.nn as nn
import matplotlib.pyplot as plt

# data generation
seq_length = 20
time_steps = np.linspace(0, np.pi, seq_length + 1)
data = np.sin(time_steps)
data.resize((seq_length + 1, 1))
# size becomes (seq_length+1, 1), adds an input_size dimension
class RNN(nn.Module):
    def __init__(self, input_size, output_size, hidden_dim, n_layers):
        super(RNN, self).__init__()
        self.hidden_dim = hidden_dim
        # batch_first means that the first dim of the input and output will be the batch_size
        self.rnn = nn.RNN(input_size, hidden_dim, n_layers, batch_first=True)
        # last, fully-connected layer
        self.fc = nn.Linear(hidden_dim, output_size)
input_size=1
output_size=1
hidden_dim=32
criterion = nn.MSELoss()
optimizer = torch.optim.Adam(rnn.parameters(), lr=0.01)
x = data[:-1]
y = data[1:]
# Representing Memory #
# make a new variable for hidden and detach the hidden state from its
history
# this way, we don't backpropagate through the entire history
hidden = hidden.data
return rnn
n_steps = 10
print_every = 2
train(rnn, n_steps, print_every)
https://medium.com/@deepeshrishu09/automatic-image-captioning-with-pytorch-cf576c98d319
def create_dataset(dataset, lookback):
    """
    Args:
        dataset: A numpy array of time series, first dimension is the time steps
        lookback: Size of window for prediction
    """
    X, y = [], []
    for i in range(len(dataset)-lookback):
        feature = dataset[i:i+lookback]
        target = dataset[i+1:i+lookback+1]
        X.append(feature)
        y.append(target)
    return torch.tensor(X), torch.tensor(y)
lookback = 4
X_train, y_train = create_dataset(train, lookback=lookback)
X_test, y_test = create_dataset(test, lookback=lookback)
class AirModel(nn.Module):
def __init__(self):
super().__init__()
self.lstm = nn.LSTM(input_size=1, hidden_size=50, num_layers=1, batch_first
=True)
self.linear = nn.Linear(50, 1)
def forward(self, x):
x, _ = self.lstm(x)
x = self.linear(x)
return x
model = AirModel()
optimizer = optim.Adam(model.parameters())
loss_fn = nn.MSELoss()
loader = data.DataLoader(data.TensorDataset(X_train, y_train), shuffle=True,
batch_size=8)
n_epochs = 2000
for epoch in range(n_epochs):
with torch.no_grad():
# shift train predictions for plotting
train_plot = np.ones_like(timeseries) * np.nan
y_pred = model(X_train)
y_pred = y_pred[:, -1, :]
train_plot[lookback:train_size] = model(X_train)[:, -1, :]
# shift test predictions for plotting
test_plot = np.ones_like(timeseries) * np.nan
test_plot[train_size+lookback:len(timeseries)] = model(X_test)[:, -1, :]
# plot
plt.plot(timeseries)
plt.plot(train_plot, c='r')
plt.plot(test_plot, c='g')
plt.show()
print('Tensorflow: {}'.format(tf.__version__))
h = [1, 0, 0, 0]
e = [0, 1, 0, 0]
l = [0, 0, 1, 0]
o = [0, 0, 0, 1]
# In tensorflow, to insert this data, we need to reshape it in this order:
# (batch_size, sequence_length, sequence_width)
X_data = np.array([[h]], dtype=np.float32)
X_data.shape
hidden_size = 2 # 2 nodes
cell = tf.keras.layers.SimpleRNNCell(units=hidden_size)
rnn = tf.keras.layers.RNN(cell, return_sequences=True, return_state=True)
outputs, states = rnn(X_data)
cell = tf.keras.layers.SimpleRNNCell(units=hidden_size)
rnn = tf.keras.layers.RNN(cell, return_sequences=True, return_state=True)
outputs, states = rnn(X_data)
from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential()
# each example is a vector of 28 (one row of a 28x28 MNIST image per time step)
#Gru
model.add(layers.GRU(64, input_shape=(None, 28)))
# Simple RNN
#model.add(layers.SimpleRNN(64, input_shape=(None, 28)))
#LSTM
#model.add(layers.LSTM(64, input_shape=(None, 28)))
model.add(layers.BatchNormalization())
model.add(layers.Dense(10))
print(model.summary())
mnist = keras.datasets.mnist
(x_train, y_train), (x_test, y_test) = mnist.load_data()
x_train, x_test = x_train/255.0, x_test/255.0
x_validate, y_validate = x_test[:-10], y_test[:-10]
x_test, y_test = x_test[-10:], y_test[-10:]
model.compile(
loss=keras.losses.SparseCategoricalCrossentropy(from_logits=True),
optimizer="sgd",
metrics=["accuracy"],
)
model.fit(
x_train, y_train, validation_data=(x_test, y_test), batch_size=64, epochs=3
)
for i in range(10):
result = tf.argmax(model.predict(tf.expand_dims(x_test[i], 0)), axis=1)
print(result.numpy(), y_test[i])
import tensorflow as tf
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
# we can generate the token dictionary, which is the mapping table for each character
# the format (or shape) of the input data must be fixed, so we need padding: all words get the same length
char_set = ['<pad>'] + sorted(list(set(''.join(words))))
idx2char = {idx:char for idx, char in enumerate(char_set)}
char2idx = {char:idx for idx, char in enumerate(char_set)}
char_set
input_dim = len(char2idx)
output_dim = len(char2idx)
input_dim
output_dim
model = Sequential([
Embedding(input_dim=input_dim, output_dim=output_dim,
model.summary()
from_logits=True))
optimizer = tf.keras.optimizers.Adam(learning_rate=0.01)
tr_loss_hist = []
for e in range(30):
avg_tr_loss = 0
tr_step = 0
avg_tr_loss /= tr_step
tr_loss_hist.append(avg_tr_loss)
if (e + 1) % 5 == 0:
print('epoch: {:3}, tr_loss: {:3f}'.format(e + 1, avg_tr_loss))
y_pred = model.predict(X)
y_pred = np.argmax(y_pred, axis=-1)
y_pred
X
print('Tensorflow: {}'.format(tf.__version__))
print(word_list)
print(word2idx)
print(idx2word)
print(pos_list)
print(pos2idx)
print(idx2pos)
print(X)
print(y)
print(X)
print(X_mask)
print(X_len)
print(y)
print(train_ds)
num_classes = len(pos2idx)
input_dim = len(word2idx)
output_dim = len(word2idx)
model = Sequential([
Embedding(input_dim=input_dim, output_dim=output_dim,
mask_zero=True, trainable=False, input_length=10,
embeddings_initializer=tf.keras.initializers.random_normal()),
SimpleRNN(units=10, return_sequences=True),
TimeDistributed(Dense(units=num_classes))
])
model.summary()
optimizer = tf.keras.optimizers.Adam(learning_rate=0.1)
tr_loss_hist = []
for e in range(30):
avg_tr_loss = 0
tr_step = 0
if (e + 1) % 5 == 0:
print('Epoch: {:3}, tr_loss: {:.3f}'.format(e+1, avg_tr_loss))
y_pred = model.predict(X)
y_pred = np.argmax(y_pred, axis=-1) * X_mask
y_pred
pprint(y_pred_pos)
#pprint(pos)
x_pred_pos
wx = demo_model.get_weights()[0]
wh = demo_model.get_weights()[1]
bh = demo_model.get_weights()[2]
wy = demo_model.get_weights()[3]
by = demo_model.get_weights()[4]
print('wx = ', wx, ' wh = ', wh, ' bh = ', bh, ' wy =', wy, 'by = ', by)
x = np.array([1, 2, 3])
# Reshape the input to the required sample_size x time_steps x features
x_input = np.reshape(x,(1, 3, 1))
y_pred_model = demo_model.predict(x_input)
m = 2
h0 = np.zeros(m)
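Using the extracted weights, the SimpleRNN output can be recomputed by hand. The sketch below assumes the demo_model uses the default tanh activation; if it was built with activation='linear', drop the np.tanh call:

# recompute the RNN output step by step with the extracted weights
h = h0
for t in range(3):
    x_t = np.reshape(x[t], (1, 1))              # one time step, one feature
    h = np.tanh(x_t @ wx + h @ wh + bh)         # hidden state update
y_manual = h @ wy + by                          # dense output layer
print('manual prediction =', y_manual, ' keras prediction =', y_pred_model)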
import ssl
ssl._create_default_https_context = ssl._create_unverified_context
sunspots_url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/monthly-sunspots.csv'
train_data, test_data, data = get_train_test(sunspots_url)
time_steps = 12
trainX, trainY = get_XY(train_data, time_steps)
testX, testY = get_XY(test_data, time_steps)
# make predictions
train_predict = model.predict(trainX)
test_predict = model.predict(testX)
# Mean square error
print_error(trainY, testY, train_predict, test_predict)
Let’s see how this setup of the encoder and the decoder stack works :
6.1.1 Attention
The attention mechanism describes a recently developed class of layers in neural networks that has attracted a lot of interest in the past few years, especially in sequence tasks. There are many possible definitions of "attention" in the literature, but the one we will use here is the following : the attention mechanism describes a weighted average of (sequence) elements, with the weights dynamically computed based on an input query and the elements' keys.
— The attention mechanism enables transformers to have an extremely long-term memory.
— A transformer model can attend to, or focus on, all previous tokens that have been generated.
The goal is to take an average over the features of multiple elements. However, instead of weighting each element equally, we want to weight them depending on their actual values. In other words, we want to dynamically decide which inputs we want to "attend" to more than others.
In particular, an attention mechanism usually has four parts we need to specify :
— Query : the query is a feature vector that describes what we are looking for in the sequence, i.e. what we might want to pay attention to.
— Keys : for each input element, we have a key, which is again a feature vector. This feature vector roughly describes what the element is "offering", or when it might be important. The keys should be designed such that we can identify the elements we want to pay attention to based on the query.
— Values : for each input element, we also have a value vector. This feature vector is the one we want to average over.
— Score function : to rate which elements we want to pay attention to, we need to specify a score function. The score function takes a query and a key as input and outputs the score/attention weight of the query-key pair. It is usually implemented by a simple similarity metric such as a dot product, or by a small MLP.
The weights of the average are calculated by a softmax over all score function outputs. Hence, we
assign those value vectors a higher weight whose corresponding key is most similar to the query.
If we try to describe it with pseudo-math, we can write :
$$\alpha_i = \frac{\exp\!\big(f_{attn}(\mathrm{key}_i, \mathrm{query})\big)}{\sum_j \exp\!\big(f_{attn}(\mathrm{key}_j, \mathrm{query})\big)}, \qquad \mathrm{out} = \sum_i \alpha_i \cdot \mathrm{value}_i \tag{6.1}$$
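To make the formula concrete, here is a minimal NumPy sketch of this weighted average, using the dot product as the score function. Array shapes and names are illustrative, not taken from the course code:

import numpy as np

def dot_product_attention(query, keys, values):
    """query: (d,), keys: (T, d), values: (T, d_v) -- one query over T elements."""
    scores = keys @ query                      # f_attn(key_i, query) = key_i . query
    alphas = np.exp(scores - scores.max())     # softmax (shifted for numerical stability)
    alphas = alphas / alphas.sum()
    out = alphas @ values                      # weighted average of the value vectors
    return out, alphas

# toy example: 3 elements with 4-dimensional keys/values
keys = np.random.randn(3, 4)
values = np.random.randn(3, 4)
query = np.random.randn(4)
out, alphas = dot_product_attention(query, keys, values)
print(alphas, out.shape)   # the weights sum to 1, the output has the value dimension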
In NLP, a word embedding is a projection of a word (a sequence of characters) into a meaningful vector of real numbers. Conceptually, it involves a mathematical embedding from a space of dimension N (the number of words in a given corpus), where often a simple one-hot encoding is used, to a continuous vector space with a much lower dimensionality ; typically 128 or 256 dimensions are used. Word embedding is a crucial preprocessing step for training a neural network.
Embedding Layer
An embedding layer is a type of hidden layer in a neural network. In one sentence, this layer maps input information from a high-dimensional space to a lower-dimensional space, allowing the network to learn more about the relationships between inputs and to process the data more efficiently. For example, in natural language processing (NLP), we often represent words and phrases as dense, low-dimensional vectors instead of sparse one-hot encodings.
The type of embedding layer depends on the neural network and the embedding process. Several types of embedding exist :
— Text embedding
— Image embedding
— Graph embedding, and others
Text embedding
A standard approach is to feed the one-hot encoded tokens (mostly words, or sentences) into an embedding layer. During training, the model tries to find a suitable embedding (of lower dimensionality than the input layer). The position of a word within the vector space is learned from text and is based on the words that surround it when it is used. In some cases it can be useful to use a pretrained embedding, which was trained on a huge corpus. A minimal Keras example is sketched after the list below.
— Input : one-hot encoding of the word in a vocabulary
— Output : one vector of N dimensions (N is chosen by the user, possibly tuned as a hyperparameter)
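As an illustration, a Keras Embedding layer maps integer word indices to dense vectors. The vocabulary size and output dimension below are made-up values:

import numpy as np
import tensorflow as tf

vocab_size = 1000      # assumption: size of the vocabulary (one-hot dimension)
embedding_dim = 128    # assumption: dimensionality of the dense word vectors

embedding = tf.keras.layers.Embedding(input_dim=vocab_size, output_dim=embedding_dim)

# a batch of 2 sentences, each encoded as 5 word indices
word_ids = np.array([[4, 25, 7, 0, 0],
                     [12, 3, 99, 45, 6]])
vectors = embedding(word_ids)
print(vectors.shape)   # (2, 5, 128): one 128-dimensional vector per token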
Positional encoding describes the location or position of an entity in a sequence so that each position is assigned a unique representation. There are many reasons why a single number, such as the index value, is not used to represent an item's position in transformer models. For long sequences, the indices can grow large in magnitude. If you normalize the index value to lie between 0 and 1, it can create problems for variable-length sequences, as they would be normalized differently.
Transformers use a smart positional encoding scheme, where each position/index is mapped to
a vector. Hence, the output of the positional encoding layer is a matrix, where each row of the
matrix represents an encoded object of the sequence summed with its positional information. An
example of the matrix that encodes only the positional information is shown in the figure below.
Positional encoding : its purpose is to inject positional information into the embeddings (information about the positions of the tokens).
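A common choice, the one used in the original Transformer paper, is the sinusoidal encoding. The sketch below builds such a position matrix; the sequence length and model dimension are chosen arbitrarily:

import numpy as np

def sinusoidal_positional_encoding(seq_len, d_model):
    """Returns a (seq_len, d_model) matrix: one positional vector per position."""
    positions = np.arange(seq_len)[:, None]                 # (seq_len, 1)
    dims = np.arange(d_model)[None, :]                      # (1, d_model)
    angle_rates = 1.0 / np.power(10000, (2 * (dims // 2)) / d_model)
    angles = positions * angle_rates
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles[:, 0::2])                   # even dimensions -> sine
    pe[:, 1::2] = np.cos(angles[:, 1::2])                   # odd dimensions  -> cosine
    return pe

pe = sinusoidal_positional_encoding(seq_len=50, d_model=128)
print(pe.shape)   # (50, 128); this matrix is added to the token embeddings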
The idea behind (Q, K, V) is similar to a search engine that maps the query against a set of keys (video title, description, etc.) associated with candidate videos in the database, then presents the best-matched videos (values).
— Query : what I am looking for
— Key : what I can offer
— Value : what I actually offer
Training the transformer encoder builds the weight parameter matrices W_Q and W_K.
The calculation goes as follows, where X is the sequence of position-encoded word embedding vectors representing an input sentence.
1. Q = X · W_Q^T
2. K = X · W_K^T
3. For each (q, k) pair, their relation strength is calculated using the dot product : q_to_k_similarity_scores = matmul(Q, K^T)
4. The weight matrices W_Q and W_K are trained via backpropagation during Transformer training.
A short NumPy sketch of these projections is given below.
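This is a minimal sketch of the query/key projections and the similarity scores; the embedding and head dimensions are arbitrary, and the weight matrices are randomly initialized here rather than trained:

import numpy as np

seq_len, d_model, d_k = 6, 16, 8        # arbitrary sizes for the illustration
X = np.random.randn(seq_len, d_model)   # position-encoded word embeddings of a sentence

W_Q = np.random.randn(d_k, d_model)     # learned during training (random here)
W_K = np.random.randn(d_k, d_model)

Q = X @ W_Q.T                           # (seq_len, d_k)
K = X @ W_K.T                           # (seq_len, d_k)

# relation strength of every (q, k) pair
q_to_k_similarity_scores = Q @ K.T      # (seq_len, seq_len)
print(q_to_k_similarity_scores.shape)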
The residual (skip) connection means that we sum the output of a layer with its input : F(x) + x. The idea was introduced with the ResNet model, and it is one of the solutions to the vanishing gradient problem.
The "Norm" step refers to layer normalization : it is another way of normalizing activations and one of the many computational tricks that make training the model easier, hence improving performance and training time. A small sketch of this "Add & Norm" step is given below.
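A minimal Keras sketch of the Add & Norm pattern around a sub-layer; here an arbitrary Dense layer stands in for the attention or feed-forward block, and the model dimension is an assumption:

import tensorflow as tf

d_model = 64                                    # assumption: model dimension
inputs = tf.keras.Input(shape=(None, d_model))  # (batch, seq_len, d_model)

sublayer_out = tf.keras.layers.Dense(d_model)(inputs)       # F(x): placeholder sub-layer
added = tf.keras.layers.Add()([inputs, sublayer_out])       # residual: F(x) + x
normed = tf.keras.layers.LayerNormalization()(added)        # layer normalization

block = tf.keras.Model(inputs, normed)
block.summary()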
We can take an unlabeled dataset and frame it as a supervised learning problem tasked with outputting x̂, a reconstruction of the original input x. This network can be trained by minimizing the reconstruction error, L(x, x̂), which measures the difference between our original input and the resulting reconstruction.
The bottleneck is a key attribute of our network design ; without the presence of an information bottleneck, our network could easily learn to simply memorize the input values by passing them along through the network (visualized below).
$$f_w : x \rightarrow z, \qquad g_w : z \rightarrow \hat{x}$$
— One latent variable gets associated with each data point in the training set.
— The latent vectors are smaller than the observations (Q < D), compression.
— Models are linear or non-linear, deterministic or stochastic, with/without encoder.
Note : a latent space is defined as an abstract multi-dimensional space that encodes a meaningful internal representation of externally observed events, i.e. a compressed understanding of the world given to a computer through a spatial representation.
Some generative models (e.g., normalizing flows) allow computing p(x) exactly. Others (e.g., VAEs) only approximate p(x), but allow drawing samples from it.
Generative latent variable models often consider a simple Bayesian model :
$$p(x) = \int_z p(z)\, p(x \mid z)\, dz = \mathbb{E}_{z \sim p(z)}\big[p(x \mid z)\big]$$
$$f_w : x \rightarrow z, \qquad g_w : z \rightarrow \hat{x}$$
$$h_i = g(x_i)$$
where $h_i \in \mathbb{R}^Q$ (the latent feature representation) is the output of the encoder block, and $\tilde{x}_i = f(h_i) \in \mathbb{R}^n$ is the reconstruction produced by the decoder. Training an autoencoder simply means finding the functions g(·) and f(·) that satisfy :
$$\arg\min_{f,g} \; \big\langle \Delta\big(x_i, f(g(x_i))\big) \big\rangle$$
where ∆ indicates a measure of how the input and the output of the autoencoder differ (basically
our loss function will penalize the difference between input and output) and < · > indicates the
average over all observations.
Here the θ_i denote the parameters of the functions f(·) and g(·) (when the functions are neural networks, these parameters are the weights).
Loss Function
$$\mathbb{E}\big[\Delta(x_i, \tilde{x}_i)\big]$$
7.1.6 AE Pytorch
AE Pytorch
import torch
from torchvision import datasets
from torchvision import transforms
import matplotlib.pyplot as plt
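# The AE class used below is defined earlier in the source notebook and is not shown here.
# The following is only a minimal sketch of what such a class might look like
# (the layer sizes are assumptions, not taken from the course):
class AE(torch.nn.Module):
    def __init__(self):
        super().__init__()
        # encoder: 784-dimensional image -> 9-dimensional latent code
        self.encoder = torch.nn.Sequential(
            torch.nn.Linear(28 * 28, 128), torch.nn.ReLU(),
            torch.nn.Linear(128, 9)
        )
        # decoder: latent code -> 784-dimensional reconstruction
        self.decoder = torch.nn.Sequential(
            torch.nn.Linear(9, 128), torch.nn.ReLU(),
            torch.nn.Linear(128, 28 * 28), torch.nn.Sigmoid()
        )

    def forward(self, x):
        return self.decoder(self.encoder(x))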
# Model Initialization
model = AE()
# display the original test images ('image' is the batch prepared earlier in the notebook)
with torch.no_grad():
    for i, item in enumerate(image):
        item = item.reshape(-1, 28, 28)
        plt.imshow(item[0])
        plt.show()

# display the corresponding reconstructions produced by the autoencoder
with torch.no_grad():
    for i, item in enumerate(reconstructed):
        item = item.reshape(-1, 28, 28)
        plt.imshow(item[0])
        plt.show()
7.1.7 AE Keras
AE Keras
import numpy as np
import matplotlib.pyplot as plt
from random import randint
from keras import backend as K
from keras.layers import Input, Dense, Conv2D, MaxPooling2D, UpSampling2D
from keras.models import Model
from keras.datasets import mnist
from keras.callbacks import TensorBoard
def load_data():
    # Loading the data and dividing it into training and testing sets
    (X_train, _), (X_test, _) = mnist.load_data()
    return X_train, X_test   # (assumed return value; the original body is elided)

# defining the input image size
input_image = Input(shape=(28, 28, 1))

# The 'return decoded_layer' and 'return autoencoder' lines in the source belong to helper
# functions (building the decoder and the full autoencoder) whose bodies are elided;
# a sketch of such a model is given after this listing.
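Since the encoder/decoder construction is elided in the source, here is a minimal convolutional autoencoder sketch using the layers imported above; the filter counts are assumptions:

# encoder: 28x28x1 image -> downsampled feature maps
x = Conv2D(16, (3, 3), activation='relu', padding='same')(input_image)
x = MaxPooling2D((2, 2), padding='same')(x)
x = Conv2D(8, (3, 3), activation='relu', padding='same')(x)
encoded = MaxPooling2D((2, 2), padding='same')(x)

# decoder: upsample back to 28x28x1
x = Conv2D(8, (3, 3), activation='relu', padding='same')(encoded)
x = UpSampling2D((2, 2))(x)
x = Conv2D(16, (3, 3), activation='relu', padding='same')(x)
x = UpSampling2D((2, 2))(x)
decoded_layer = Conv2D(1, (3, 3), activation='sigmoid', padding='same')(x)

autoencoder = Model(input_image, decoded_layer)
autoencoder.compile(optimizer='adam', loss='binary_crossentropy')
autoencoder.summary()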
The variational autoencoder differs from the autoencoder in that it describes the samples of the dataset in latent space in a statistical manner. Therefore, in a variational autoencoder, the encoder outputs a probability distribution in the bottleneck layer instead of a single output value.
In an autoencoder, the encoder outputs a single value for each latent attribute :
With this approach, we’ll now represent each latent attribute for a given input as a probability
distribution.
Suppose we have a distribution z and we want to generate the observation x from it. In other
words, we want to calculate
$$p(z \mid x) = \frac{p(x \mid z)\, p(z)}{p(x)}$$
But the calculation of p(x) can be quite difficult :
$$p(x) = \int p(x \mid z)\, p(z)\, dz$$
This usually makes it an intractable distribution. Hence, we need to approximate p(z|x) by a tractable distribution q(z|x). To make q(z|x) a good approximation of p(z|x), we minimize the KL divergence between the two distributions.
In the resulting loss (written below), the first term represents the reconstruction likelihood and the other term ensures that our learned distribution q is similar to the true prior distribution p.
Thus our total loss consists of two terms : one is the reconstruction error and the other is the KL-divergence loss :
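This is the standard VAE objective, written here in its usual form (it is not reproduced from the course figure):
$$\mathcal{L}(x) = -\,\mathbb{E}_{z \sim q(z \mid x)}\big[\log p(x \mid z)\big] \;+\; \mathrm{KL}\big(q(z \mid x)\,\|\,p(z)\big)$$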
Generative classifiers :
— Assume some functional form for P(Y), P(X|Y)
— Estimate parameters of P(X|Y), P(Y) directly from training data
— Use Bayes rule to calculate P(Y |X)
Discriminative Classifiers :
— Assume some functional form for P(Y|X)
— Estimate parameters of P(Y|X) directly from training data
— Generative models can generate new data instances.
— Discriminative models discriminate between different kinds of data instances.
A generative model could generate new photos of animals that look like real animals, while a
discriminative model could tell a dog from a cat. GANs are just one kind of generative model.
— Generative models capture the joint probability p(X, Y), or just p(X) if there are no labels.
— Discriminative models capture the conditional probability p(Y | X).
In GANs, there is a Generator and a Discriminator. The Generator generates fake samples of data (be it an image, audio, etc.) and tries to fool the Discriminator. The Discriminator, on
the other hand, tries to distinguish between the real and fake samples. The Generator and the
Discriminator are both Neural Networks and they both run in competition with each other in the
training phase. The steps are repeated several times and in this, the Generator and Discriminator
get better and better in their respective jobs after each repetition. The work can be visualized
by the diagram given below :
The generative model captures the distribution of the data and is trained in such a manner that it tries to maximize the probability of the Discriminator making a mistake. The Discriminator, on the other hand, estimates the probability that a given sample came from the real data rather than from the Generator.
$$\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{data}(x)}\big[\log D(x)\big] + \mathbb{E}_{z \sim p_z(z)}\big[\log\big(1 - D(G(z))\big)\big]$$
Where :
— G = Generator
— D = Discriminator
— p_data(x) = distribution of real data
— p_z(z) = input noise distribution of the generator
— x = sample from p_data(x)
— z = sample from p_z(z)
— D(x) = output of the Discriminator network for x (the probability that x is real)
— G(z) = output of the Generator network for noise z (a generated sample)
GANs Pytorch
import torch
import torch.nn as nn
import torch.optim as optim
import torchvision
from torchvision import datasets, transforms
import matplotlib.pyplot as plt
import numpy as np
# Set device
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
# 'transform' (image preprocessing/normalization) is defined earlier in the source
train_dataset = datasets.CIFAR10(root='./data',
                                 train=True, download=True, transform=transform)
dataloader = torch.utils.data.DataLoader(train_dataset,
                                         batch_size=32, shuffle=True)
# Generator network (inside the Generator class __init__, whose definition is elided in the source)
self.model = nn.Sequential(
    nn.Linear(latent_dim, 128 * 8 * 8),
    nn.ReLU(),
    nn.Unflatten(1, (128, 8, 8)),
    nn.Upsample(scale_factor=2),
    nn.Conv2d(128, 128, kernel_size=3, padding=1),
    nn.BatchNorm2d(128, momentum=0.78),
    nn.ReLU(),
    nn.Upsample(scale_factor=2),
    nn.Conv2d(128, 64, kernel_size=3, padding=1),
    nn.BatchNorm2d(64, momentum=0.78),
    nn.ReLU(),
    nn.Conv2d(64, 3, kernel_size=3, padding=1),
    nn.Tanh()
)
# Discriminator network (inside the Discriminator class __init__, also elided;
# the remaining layers and the final sigmoid output are cut off in the source)
self.model = nn.Sequential(
    nn.Conv2d(3, 32, kernel_size=3, stride=2, padding=1),
    nn.LeakyReLU(0.2),
    nn.Dropout(0.25),
    nn.Conv2d(32, 64, kernel_size=3, stride=2, padding=1),
    nn.ZeroPad2d((0, 1, 0, 1)),
    nn.BatchNorm2d(64, momentum=0.82),
    nn.LeakyReLU(0.25),
# Loss function
adversarial_loss = nn.BCELoss()
# Optimizers
optimizer_G = optim.Adam(generator.parameters()\
, lr=lr, betas=(beta1, beta2))
optimizer_D = optim.Adam(discriminator.parameters()\
, lr=lr, betas=(beta1, beta2))
# Training loop
for epoch in range(num_epochs):
    for i, batch in enumerate(dataloader):
        # Convert list to tensor
        real_images = batch[0].to(device)
        # Configure input
        real_images = real_images.to(device)

        # ---------------------
        #  Train Discriminator
        # ---------------------
        optimizer_D.zero_grad()
        # (the noise sampling, image generation and the discriminator's forward/backward
        #  pass are elided in the source; see the sketch after this listing)

        # -----------------
        #  Train Generator
        # -----------------
        optimizer_G.zero_grad()
        # Adversarial loss
        g_loss = adversarial_loss(discriminator(gen_images), valid)

        # ---------------------
        #  Progress Monitoring
        # ---------------------
        if (i + 1) % 100 == 0:
            print(
                f"Epoch [{epoch+1}/{num_epochs}] "
                f"Batch [{i+1}/{len(dataloader)}] G loss: {g_loss.item():.4f}"
                # the message is cut off in the source; the batch/loss part above
                # is an assumed completion so that the snippet is syntactically complete
            )