BackPropagation
There are some functions whose names start with the word "grader", e.g. grader_sigmoid(), grader_forwardprop(), grader_backprop(); you should not change those function definitions.
Every grader function has to return True.
Loading data
In [1]:
import pickle
import numpy as np
from tqdm import tqdm
import matplotlib.pyplot as plt
with open('data.pkl', 'rb') as f:
    data = pickle.load(f)
print(data.shape)
X = data[:, :5]
y = data[:, -1]
print(X.shape, y.shape)
(506, 6)
(506, 5) (506,)
Computational graph
If you observe the graph, it has input features [f1, f2, f3, f4, f5] and 9 weights [w1, w2, w3, w4, w5, w6, w7, w8, w9].
The final output of this graph is a value L which is computed as (Y-Y')^2.
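The graph image itself is not reproduced here; written out as an equation (consistent with the forward_propagation() implementation in Task 1 below, with $\sigma$ denoting the sigmoid), the prediction is
$$Y' = \tanh\!\left(e^{(w_1 f_1 + w_2 f_2)^2 + w_6} + w_7\right) + w_9 \cdot \sigma\!\left((w_4 f_4 + w_5 f_5)\cdot\sin(w_3 f_3) + w_8\right)$$
and the loss is $L = (Y - Y')^2$.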
Task 1: Implementing backpropagation and Gradient checking
Check this video for better understanding of the computational graphs and back propagation
In [2]:
from IPython.display import YouTubeVideo
YouTubeVideo('i94OvYb6noo',width="1000",height="500")
Out[2]:
Gradient checking
Check this blog link for more details on gradient checking
We know that the derivative of any function is
$$\lim_{\epsilon\to0}\frac{f(x+\epsilon)-f(x-\epsilon)}{2\epsilon}$$
The definition above can be used as a numerical approximation of the derivative. For a small enough epsilon, the calculated approximation has an error on the order of epsilon squared.
In other words, if epsilon is 0.001, the approximation will be off by about 0.000001.
Therefore, we can use this to approximate the gradient and, in turn, make sure that backpropagation is implemented properly. This forms the basis of gradient checking!
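For example, a few lines of Python show how the two-sided formula approximates a derivative (a minimal sketch; the cubic f below is just an arbitrary test function):
def numerical_derivative(f, x, eps=1e-4):
    # central (two-sided) difference approximation of f'(x)
    return (f(x + eps) - f(x - eps)) / (2 * eps)

f = lambda x: x**3                     # example function, f'(x) = 3*x^2
print(numerical_derivative(f, 2.0))    # ~12.00000001 (error on the order of eps^2)
print(3 * 2.0**2)                      # exact value: 12.0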
Gradient checking example
Let's understand the concept with a simple example: $f(w_1,w_2,x_1,x_2)=w_{1}^{2} \cdot x_{1} + w_{2} \cdot x_{2}$
For the above function, let's assume $w_{1}=1$, $w_{2}=2$, $x_{1}=3$, $x_{2}=4$. The gradient of $f$ w.r.t. $w_{1}$ is
\begin{array}{lcl} \frac{df}{dw_{1}} = dw_{1} & = & 2 \cdot w_{1} \cdot x_{1} \\ & = & 2 \cdot 1 \cdot 3 \\ & = & 6 \end{array}
Let's calculate the approximate gradient of $w_{1}$ as mentioned in the above formula, considering $\epsilon=0.0001$:
\begin{array}{lcl} dw_1^{approx} & = & \frac{f(w_1+\epsilon,w_2,x_1,x_2)-f(w_1-\epsilon,w_2,x_1,x_2)}{2\epsilon} \\ & = & \frac{((1+0.0001)^{2} \cdot 3 + 2 \cdot 4) - ((1-0.0001)^{2} \cdot 3 + 2 \cdot 4)}{2 \cdot 0.0001} \\ & = & \frac{(1.00020001 \cdot 3 + 2 \cdot 4) - (0.99980001 \cdot 3 + 2 \cdot 4)}{0.0002} \\ & = & \frac{11.00060003 - 10.99940003}{0.0002} \\ & = & 5.99999999999 \end{array}
Then, we apply the following formula for gradient check: gradient_check $= \frac{\left\Vert dW-dW^{approx}\right\Vert_2}{\left\Vert dW\right\Vert_2+\left\Vert dW^{approx}\right\Vert_2}$
The equation above is basically the Euclidean distance normalized by the sum of the norms of the vectors. We use normalization in case one of the vectors is very small. As a value for epsilon, we usually opt for 1e-7. Therefore, if the gradient check returns a value less than 1e-7, it means that backpropagation was implemented correctly. Otherwise, there is potentially a mistake in your implementation. If the value exceeds 1e-3, then you can be sure that the code is not correct.
In our example: gradient_check $= \frac{6 - 5.999999999994898}{6 + 5.999999999994898} = 4.2514140356330737 \times 10^{-13}$
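The same numbers can be reproduced in a few lines of Python (a small check of the worked example above; the variable names are only for illustration):
w1, w2, x1, x2 = 1, 2, 3, 4
eps = 0.0001
f = lambda w: w**2 * x1 + w2 * x2                      # f as a function of w1 only
dw1 = 2 * w1 * x1                                      # analytic gradient = 6
dw1_approx = (f(w1 + eps) - f(w1 - eps)) / (2 * eps)   # ~5.999999999994898
gradient_check = abs(dw1 - dw1_approx) / (abs(dw1) + abs(dw1_approx))
print(dw1_approx, gradient_check)                      # ~6.0  ~4.25e-13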
You can mathematically derive the same result:
\begin{array}{lcl} dw_1^{approx} & = & \frac{f(w_1+\epsilon,w_2,x_1,x_2)-f(w_1-\epsilon,w_2,x_1,x_2)}{2\epsilon} \\ & = & \frac{((w_{1}+\epsilon)^{2} \cdot x_{1} + w_{2} \cdot x_{2}) - ((w_{1}-\epsilon)^{2} \cdot x_{1} + w_{2} \cdot x_{2})}{2\epsilon} \\ & = & \frac{4 \epsilon w_{1} x_{1}}{2\epsilon} \\ & = & 2 w_{1} x_{1} \end{array}
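The same cancellation can be checked symbolically (a small sketch, assuming sympy is installed):
import sympy as sp

w1, w2, x1, x2, eps = sp.symbols('w1 w2 x1 x2 epsilon')
f = w1**2 * x1 + w2 * x2

exact = sp.diff(f, w1)                                                         # 2*w1*x1
approx = sp.simplify((f.subs(w1, w1 + eps) - f.subs(w1, w1 - eps)) / (2*eps))  # 2*w1*x1
print(exact, approx)   # the epsilon terms cancel exactly for this quadratic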
Implement Gradient checking
(Write your code in def gradient_checking())
Algorithm: the pseudocode for gradient_checking() is given under "Implement gradient checking" below.
Task 2 : Optimizers
As a part of this task, you will be implementing 3 type of optimizers(methods to update weight)
Use the same computational graph that was mentioned above to do this task
Initilze the 9 weights from normal distribution with mean=0 and std=0.01
Check below video and this blog
In [3]:
from IPython.display import YouTubeVideo
YouTubeVideo('gYpoJMlgyXA',width="1000",height="500")
Out[3]:
Algorithm
for each epoch(1-100):
    for each data point in your data:
        compute the gradients of the weights using the functions
        forward_propagation() and backward_propagation()
        update the weights with the help of the gradients, ex: w1 = w1 - learning_rate*dw1
Implement the below tasks; a sketch of the three update rules follows the task list.
Task 2.1: you will be implementing the above algorithm with Vanilla update of weights
Task 2.2: you will be implementing the above algorithm with Momentum update of weights
Task 2.3: you will be implementing the above algorithm with Adam update of weights
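A minimal sketch of the three update rules, written as standalone helper functions (the function names and hyperparameter values here are illustrative, not prescribed by the assignment; the Adam sketch omits bias correction, matching the simplified form used later in this notebook):
import numpy as np

def vanilla_update(w, dw, lr=0.01):
    # plain gradient descent step: w <- w - lr * dw
    return w - lr * dw

def momentum_update(w, dw, v, lr=0.01, mu=0.9):
    # keep a velocity vector v: v <- mu*v - lr*dw, then w <- w + v
    v = mu * v - lr * dw
    return w + v, v

def adam_update(w, dw, m, s, lr=0.01, beta1=0.9, beta2=0.999, eps=1e-8):
    # m and s are running estimates of the first and second moments of the gradient
    m = beta1 * m + (1 - beta1) * dw
    s = beta2 * s + (1 - beta2) * dw**2
    w = w - lr * m / (np.sqrt(s) + eps)
    return w, m, s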
Note: If you get any assertion error while running the grader functions, please print the variables in the grader functions and check which variable is returning False. Recheck your logic for that variable.
Write two functions
Forward propagation (Write your code in def forward_propagation())
For easy debugging, we will break the computational graph into 3 parts:
Part 1 (up to exp)
Part 2 (up to tanh)
Part 3 (up to sigmoid)
def forward_propagation(X, y, W):
    # X: input data point; note that in this assignment you have 5-d data points
    # y: output variable
    # W: weight array of length 9; W[0] corresponds to w1 in the graph, W[1] corresponds to w2, ..., W[8] corresponds to w9
    # you have to return the following variables
    # exp  = part1 (compute the forward propagation until exp and store the value in exp)
    # tanh = part2 (compute the forward propagation until tanh and store the value in tanh)
    # sig  = part3 (compute the forward propagation until sigmoid and store the value in sig)
    # now compute the remaining values from the computational graph and get y'
    # write code to compute the value of L = (y - y')^2
    # compute the derivative of L w.r.t. y' and store it in dl
    # create a dictionary to store all the intermediate values
    # store the L, exp, tanh, sig, dl variables
    return (dictionary, which you might need to use for back propagation)
Backward propagation (Write your code in def backward_propagation())
def backward_propagation(L, W, dictionary):
    # L: the loss we calculated for the current point
    # dictionary: the outputs of the forward_propagation() function
    # write code to compute the gradients of each weight [w1, w2, w3, ..., w9]
    # Hint: you can use a dict to store the required variables
    # return dW, a dictionary with the gradients of all the weights
    return dW
Task 1
Forward propagation
In [4]:
def sigmoid(z):
    '''In this function, we will compute the sigmoid(z)'''
    # we can use this function in forward and backward propagation
    return 1/(1 + np.exp(-z))

def forward_propagation(x, y, w):
    # part 1: exp branch
    t1 = w[0]*x[0]
    t2 = w[1]*x[1]
    t3 = t1 + t2
    t4 = t3*t3
    t5 = t4 + w[5]
    exp = np.exp(t5)
    # part 2: tanh
    t7 = exp + w[6]
    tanh = np.tanh(t7)
    # part 3: sigmoid branch
    t9 = w[2]*x[2]
    t10 = np.sin(t9)
    t11 = w[3]*x[3]
    t12 = w[4]*x[4]
    t13 = t11 + t12
    t14 = t13*t10
    t15 = t14 + w[7]
    sig = sigmoid(t15)
    # combine the two branches to get y' and the loss
    t16 = sig*w[8]
    y_hat = t16 + tanh
    L = np.square(y - y_hat)
    dl = -2*(y - y_hat)      # derivative of L w.r.t. y'
    dic = {
        'exp': exp,
        'sigmoid': sig,
        'tanh': tanh,
        'loss': L,
        'dy_pr': dl,
        'sin': t10,
        'cos': np.cos(t9)
    }
    return dic
Grader function - 1
In [5]:
def grader_sigmoid(z):
    val = sigmoid(z)
    assert(val == 0.8807970779778823)
    return True
grader_sigmoid(2)
Out[5]:
True
In [6]:
def grader_forwardprop(data):
    dl = (data['dy_pr'] == -1.9285278284819143)
    loss = (data['loss'] == 0.9298048963072919)
    part1 = (data['exp'] == 1.1272967040973583)
    part2 = (data['tanh'] == 0.8417934192562146)
    part3 = (data['sigmoid'] == 0.5279179387419721)
    assert(dl and loss and part1 and part2 and part3)
    return True
w = np.ones(9)*0.1
d1 = forward_propagation(X[0], y[0], w)
grader_forwardprop(d1)
Out[6]:
True
In [7]:
print(d1)
{'exp': 1.1272967040973583, 'sigmoid': 0.5279179387419721, 'tanh': 0.8417934192562146, 'loss': 0.9298048963072919, 'dy_pr': -1.9285278284819143, 'sin': -0.14538296400984968, 'cos': 0.9893754564247643}
Backward propagation
In [8]:
def backward_propagation(L, W, dic):
    '''In this function, we will compute the backward propagation '''
    # note: here L is the input data point (its 5 features), W is the weight vector,
    # and dic holds the intermediate values returned by forward_propagation()
    dw1 = dic['dy_pr']*(1-np.square(dic['tanh']))*dic['exp']*2*((W[0]*L[0]+W[1]*L[1])*L[0])
    dw2 = dic['dy_pr']*(1-np.square(dic['tanh']))*dic['exp']*2*((W[1]*L[1]+W[0]*L[0])*L[1])
    dw3 = dic['dy_pr']*W[8]*dic['sigmoid']*(1-dic['sigmoid'])*(L[3]*W[3]+L[4]*W[4])*L[2]*dic['cos']
    dw4 = dic['dy_pr']*W[8]*dic['sigmoid']*(1-dic['sigmoid'])*L[3]*dic['sin']
    dw5 = dic['dy_pr']*W[8]*dic['sigmoid']*(1-dic['sigmoid'])*L[4]*dic['sin']
    dw6 = dic['dy_pr']*(1-np.square(dic['tanh']))*dic['exp']
    dw7 = dic['dy_pr']*(1-np.square(dic['tanh']))
    dw8 = dic['dy_pr']*W[8]*dic['sigmoid']*(1-dic['sigmoid'])
    dw9 = dic['sigmoid']*dic['dy_pr']
    dW = {
        'dw1': dw1,
        'dw2': dw2,
        'dw3': dw3,
        'dw4': dw4,
        'dw5': dw5,
        'dw6': dw6,
        'dw7': dw7,
        'dw8': dw8,
        'dw9': dw9
    }
    return dW
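As a sanity check on these expressions, two of them can be derived directly with the chain rule (using the intermediate values stored by forward_propagation(), where $\hat{y} = \text{sig} \cdot w_9 + \tanh$ and $dl = \frac{\partial L}{\partial \hat{y}} = -2(y-\hat{y})$):
\begin{array}{lcl} \frac{\partial L}{\partial w_9} & = & dl \cdot \frac{\partial \hat{y}}{\partial w_9} \;=\; dl \cdot \text{sig} \\ \frac{\partial L}{\partial w_1} & = & dl \cdot (1-\tanh^2) \cdot \frac{\partial}{\partial w_1}\!\left(e^{(w_1 x_1 + w_2 x_2)^2 + w_6} + w_7\right) \;=\; dl \cdot (1-\tanh^2)\cdot \exp \cdot 2(w_1 x_1 + w_2 x_2)\, x_1 \end{array}
These match dw9 and dw1 in the code above; the remaining gradients follow the same pattern.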
In [9]:
def grader_backprop(data):
    dw1 = (data['dw1'] == -0.22973323498702003)
    dw2 = (data['dw2'] == -0.021407614717752925)
    dw3 = (data['dw3'] == -0.005625405580266319)
    dw4 = (data['dw4'] == -0.004657941222712423)
    dw5 = (data['dw5'] == -0.0010077228498574246)
    dw6 = (data['dw6'] == -0.6334751873437471)
    dw7 = (data['dw7'] == -0.561941842854033)
    dw8 = (data['dw8'] == -0.04806288407316516)
    dw9 = (data['dw9'] == -1.0181044360187037)
    assert(dw1 and dw2 and dw3 and dw4 and dw5 and dw6 and dw7 and dw8 and dw9)
    return True
w = np.ones(9)*0.1
d1 = forward_propagation(X[0], y[0], w)
d1 = backward_propagation(X[0], w, d1)
grader_backprop(d1)
Out[9]:
True
Implement gradient checking
W = initialize_randomly
def gradient_checking(data_point, W):
    # compute the L value using forward_propagation()
    # compute the gradients of W using backward_propagation()
    approx_gradients = []
    for each wi weight value in W:
        # add a small value to weight wi, and then find the value of L with the updated weights
        # subtract a small value from weight wi, and then find the value of L with the updated weights
        # compute the approximate gradient of weight wi
        approx_gradients.append(approximate gradient of weight wi)
    # compare the gradients of the weights from backward_propagation() with the approximate
    # gradients of the weights using the gradient_check formula
    return gradient_check
NOTE: you can do a sanity check by verifying that all the return values of gradient_checking() are close to zero (below the 1e-7 threshold described above). If not, you have a bug in your code.
In [20]:
W = np.random.rand(9)
eps = 0.0001
def gradient_checking(data_point, W):
    # compute the loss and the intermediate values using forward_propagation()
    d1 = forward_propagation(data_point, y[0], W)
    # compute the gradients of W using backward_propagation()
    grad = backward_propagation(data_point, W, d1)
    grad = list(grad.values())
    approx_gradients = []
    for i in range(len(W)):
        # add a small value to weight wi and compute L with the updated weights
        w_plus = W.copy()
        w_plus[i] += eps
        L_plus = forward_propagation(data_point, y[0], w_plus)['loss']
        # subtract a small value from weight wi and compute L with the updated weights
        w_minus = W.copy()
        w_minus[i] -= eps
        L_minus = forward_propagation(data_point, y[0], w_minus)['loss']
        # compute the approximate gradient of weight wi
        approx_gradients.append((L_plus - L_minus)/(2*eps))
    # compare the gradients from backward_propagation() with the approximate gradients
    # using the gradient_check formula
    gradient_check = []
    for i in range(len(W)):
        num = np.linalg.norm(grad[i] - approx_gradients[i])
        den = np.linalg.norm(grad[i]) + np.linalg.norm(approx_gradients[i])
        gradient_check.append(num / den)
    return gradient_check
g = gradient_checking(X[0], W)
In [21]:
g
Out[21]:
[1.446076278481714e-08,
1.2264881537525243e-10,
9.564345727570781e-11,
1.9892512093219975e-10,
7.8985457914592e-12,
1.2992058661307797e-09,
4.214611098858496e-09,
8.936748335692261e-10,
1.2188639210214776e-13]
Task 2: Optimizers
Algorithm with Vanilla update of weights
In [12]:
w=list(np.random.normal(0.0, 0.01, 9))
w
Out[12]:
[-0.001345358300390486,
0.013196112781944023,
0.0019041861049636545,
-0.011440720702484709,
0.017669818548239284,
0.006658934006215388,
0.0068233799771498585,
0.00656258256147008,
0.017630243873985256]
In [13]:
vw = list(w)        # start from a copy of the initial weights
Loss = []
for epoch in range(100):
    for i, j in zip(X, y):
        d1 = forward_propagation(i, j, vw)
        loss = d1['loss']
        dw = backward_propagation(i, vw, d1)
        dw = list(dw.values())
        dw = [k * 0.01 for k in dw]     # learning_rate = 0.01
        vw = np.subtract(vw, dw)        # vanilla update: w = w - learning_rate*dw
    Loss.append(loss)                   # record the loss once per epoch
Plot between epochs and loss
In [14]:
import matplotlib.pyplot as plt
epoch=list(range(1,101))
plt.plot(epoch,Loss)
plt.xlabel('Epoch')
plt.ylabel('Loss')
Out[14]:
Text(0, 0.5, 'Loss')
Algorithm with Momentum update of weights
In [15]:
mw = list(w)             # start from the same initial weights
v = list(np.zeros(9))    # velocity vector
Loss_momentum = []
for epoch in range(100):
    for i, j in zip(X, y):
        d1 = forward_propagation(i, j, mw)
        loss_momentum = d1['loss']
        dw = backward_propagation(i, mw, d1)
        dw = list(dw.values())
        for k in range(len(dw)):
            v[k] = 0.9*v[k] - 0.01*dw[k]    # momentum update: mu = 0.9, learning_rate = 0.01
        mw = np.add(mw, v)
    Loss_momentum.append(loss_momentum)
Plot between epochs and loss
In [16]:
epoch=list(range(1,101))
plt.plot(epoch,Loss_momentum)
plt.xlabel('Epoch')
plt.ylabel('Loss')
Out[16]:
Text(0, 0.5, 'Loss')
Algorithm with Adam update of weights
In [33]:
aw = list(w)             # copy so the in-place updates below do not modify w
av = list(np.zeros(9))   # second-moment estimates
am = list(np.zeros(9))   # first-moment estimates
Loss_adam = []
for epoch in range(100):
    for i, j in zip(X, y):
        d1 = forward_propagation(i, j, aw)
        loss_adam = d1['loss']
        dw = backward_propagation(i, aw, d1)
        dw = list(dw.values())
        for k in range(len(dw)):
            am[k] = 0.9*am[k] + (1-0.9)*dw[k]             # beta1 = 0.9
            av[k] = 0.999*av[k] + (1-0.999)*(dw[k]**2)    # beta2 = 0.999
            aw[k] += -0.1 * am[k] / (np.sqrt(av[k]) + 1e-8)   # learning_rate = 0.1
    Loss_adam.append(loss_adam)
Plot between epochs and loss
In [34]:
epoch=list(range(1,101))
plt.plot(epoch,Loss_adam)
plt.xlabel('Epoch')
plt.ylabel('Loss')
Out[34]:
Text(0, 0.5, 'Loss')
Comparison plot between epochs and loss with different optimizers
In [35]:
plt.plot(epoch, Loss, 'r--')
plt.plot(epoch,Loss_momentum, 'b--')
plt.plot(epoch,Loss_adam, 'g--')
plt.legend(['vanilla', 'momentum', 'adam'])
plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.show();
In [ ]: