AI - FOUNDATION AND APPLICATION
Instructor:
Assoc. Prof. Dr. Truong Ngoc Son
Chapter 1
Introduction to Neural Networks
How is a neuron modelled?
[Figure: model of a neuron. Inputs x_1, x_2, …, x_n are multiplied by synaptic weights w_1, w_2, …, w_n, summed together with a bias b, and passed through an activation function f to produce the output y.]
Activation functions
Sigmoid function: $\sigma(x) = \dfrac{1}{1 + e^{-x}}$
Tanh function: $f(x) = \dfrac{e^{x} - e^{-x}}{e^{x} + e^{-x}}$
ReLU function: $f(x) = \max(x, 0)$
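As a quick illustration (our own addition, not from the slides), all three activations can be written directly in NumPy; the function names here are ours:

import numpy as np

def sigmoid(x):
    # sigmoid: squashes any real input into (0, 1)
    return 1.0 / (1.0 + np.exp(-x))

def tanh(x):
    # tanh: squashes input into (-1, 1); np.tanh computes (e^x - e^-x)/(e^x + e^-x)
    return np.tanh(x)

def relu(x):
    # ReLU: keeps positive values, zeroes out negatives
    return np.maximum(0.0, x)

z = np.array([-2.0, 0.0, 2.0])
print(sigmoid(z), tanh(z), relu(z))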
Neural network
[Figure: a neural network with an input layer, a hidden layer of neurons h_1, h_2, h_3, and an output layer.]
Artificial neural network
[Figure: a fully connected network. Inputs x_1, …, x_n feed a hidden layer of m neurons through weights W_{j,i} and biases b_{1,j}; the hidden outputs h_1, …, h_m feed k output neurons through weights W_{k,m} and biases b_{2,k}.]

Hidden layer:
$o_1 = x_1 W_{1,1} + \dots + x_n W_{1,n} + b_{1,1}$,  $h_1 = f(o_1)$
$o_2 = x_1 W_{2,1} + \dots + x_n W_{2,n} + b_{1,2}$,  $h_2 = f(o_2)$

Output layer:
$a_1 = h_1 W_{1,1} + \dots + h_m W_{1,m} + b_{2,1}$,  $y_1 = f(a_1)$
$a_k = h_1 W_{k,1} + \dots + h_m W_{k,m} + b_{2,k}$,  $y_k = f(a_k)$
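A minimal NumPy sketch of this forward pass (our own illustration; the layer sizes n, m, k and the random weights are arbitrary choices):

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

n, m, k = 4, 3, 2                             # input, hidden, output sizes (arbitrary)
x = np.random.rand(n)                         # input vector x_1 ... x_n
W1 = np.random.uniform(-0.1, 0.1, (m, n))     # hidden-layer weights W_{j,i}
b1 = np.zeros(m)                              # hidden-layer biases b_{1,j}
W2 = np.random.uniform(-0.1, 0.1, (k, m))     # output-layer weights
b2 = np.zeros(k)                              # output-layer biases b_{2,j}

o = W1 @ x + b1    # o_j = x_1 W_{j,1} + ... + x_n W_{j,n} + b_{1,j}
h = sigmoid(o)     # h_j = f(o_j), using sigmoid as f
a = W2 @ h + b2    # a_j = h_1 W_{j,1} + ... + h_m W_{j,m} + b_{2,j}
y = sigmoid(a)     # y_j = f(a_j)
print(y)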
TRAINING NEURAL NETWORKS
Supervised learning vs. unsupervised learning
Training an artificial neural network
Supervised learning
[Figure: supervised learning loop. The input is fed to the network to produce an output; the output is compared with the desired output to compute an error, and the error is used to update the weights.]
Training an artificial neural network
Unsupervised learning
[Figure: unsupervised learning. Input data is fed to the network, which produces an output result without any labels or error feedback.]
Simple neural network: understanding how a neural network learns
[Figure: a single neuron with inputs x_1, x_2, …, x_n, synaptic weights w_1, w_2, …, w_n, bias b, activation f, and output y; example patterns are labelled (−1) and (+1).]
Quantifying the loss
The loss of a network measures the cost incurred by incorrect predictions.
Common loss functions: MSE (mean squared error) and cross-entropy loss.
[Figure: the network's output is compared with the desired output; the difference is the error that the loss quantifies.]
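For reference, the standard definitions (not spelled out on the slide) over $n$ samples with predictions $o$ and desired outputs $y$ are:

$$\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(y^{(i)} - o^{(i)}\right)^{2}, \qquad \mathrm{CE} = -\frac{1}{n}\sum_{i=1}^{n}\sum_{j} y_j^{(i)} \log o_j^{(i)}$$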
Training a network
Training a neural network is the process of using an optimization algorithm to find
a set of weights that best maps inputs to outputs.
In other words, training is the way we minimize the loss:

$$W^{*} = \underset{W}{\operatorname{argmin}} \; \frac{1}{n}\sum_{i=1}^{n} \mathcal{L}\!\left(f\!\left(x^{(i)}, W\right), y^{(i)}\right)$$

Looks hard? Don't worry, we will dive into the details later.
GRADIENT DESCENT
Training Neural Networks – Optimization of the loss
What is a gradient?
For a function $f(x, y, z)$, the gradient is the vector of its partial derivatives:

$$\nabla f = \left(\frac{\partial f}{\partial x}, \frac{\partial f}{\partial y}, \frac{\partial f}{\partial z}\right) = \frac{\partial f}{\partial x}\vec{i} + \frac{\partial f}{\partial y}\vec{j} + \frac{\partial f}{\partial z}\vec{k}$$

Moving the point $(x, y, z)$ against the gradient, to $\left(x - \frac{\partial f}{\partial x},\; y - \frac{\partial f}{\partial y},\; z - \frac{\partial f}{\partial z}\right)$, decreases $f$.
Training a network – Optimization of the loss
Gradient Descent
Gradient descent is an optimization algorithm used to find the values of the parameters
(coefficients) of a function f that minimize a cost function (the loss). It works by
iteratively moving in the direction of steepest descent, defined by the negative
of the gradient.
Starting from an initial point $x_0 = (x_0, y_0, z_0)$, iterate

$$x_{n+1} = x_n - \eta \nabla f(x_n)$$

that is,

$$(x, y, z) \leftarrow \left(x - \eta\frac{\partial f}{\partial x},\; y - \eta\frac{\partial f}{\partial y},\; z - \eta\frac{\partial f}{\partial z}\right)$$
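A small NumPy sketch of this update rule (our own example; the function f(x, y) = x² + y², the learning rate, and the step count are arbitrary):

import numpy as np

def grad_f(p):
    # gradient of f(x, y) = x^2 + y^2 is (2x, 2y)
    return 2.0 * p

eta = 0.1                      # learning rate
p = np.array([3.0, -4.0])      # initial point x_0
for step in range(50):
    p = p - eta * grad_f(p)    # x_{n+1} = x_n - eta * grad f(x_n)
print(p)                       # close to the minimum at (0, 0)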
Optimization of the loss with gradient descent
Example: Linear Regression
Model: $y = mx + b$

[Figure: a line fitted to data points; the error is the gap between the desired output and the predicted output at each $x_i$.]

$$\mathrm{loss} = \frac{1}{n}\sum_{i=1}^{n}\left(o_i - y_i\right)^2$$

Gradient descent updates the coefficients step by step to reduce the loss:

$$m \leftarrow m + \Delta m, \qquad b \leftarrow b + \Delta b$$

with $\Delta m$ and $\Delta b$ proportional to the negative partial derivatives of the loss.
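A compact NumPy implementation of this example (our own sketch; the synthetic data, learning rate, and iteration count are arbitrary choices):

import numpy as np

# synthetic data from a known line y = 2x + 1, plus a little noise
rng = np.random.default_rng(0)
x = rng.uniform(0, 1, 100)
y = 2.0 * x + 1.0 + 0.05 * rng.standard_normal(100)

m, b, eta = 0.0, 0.0, 0.5
for step in range(500):
    o = m * x + b                      # predicted output
    # gradients of loss = mean((o - y)^2) with respect to m and b
    dm = 2.0 * np.mean((o - y) * x)
    db = 2.0 * np.mean(o - y)
    m, b = m - eta * dm, b - eta * db  # move against the gradient
print(m, b)                            # close to 2 and 1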
Optimization of the loss with gradient descent
Assignment 01: Logistic Regression

[Figure: the same diagram as the linear-regression example (model $y = mx + b$, squared-error loss, updates $m \leftarrow m + \Delta m$, $b \leftarrow b + \Delta b$), to be adapted for logistic regression.]
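One possible starting point for the assignment (a sketch under our own assumptions: 1-D inputs, binary labels 0/1, a sigmoid applied to mx + b, and the same squared-error loss as the slides; the course may instead intend a cross-entropy loss):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# toy binary-classification data (assumed setup, not from the slides)
rng = np.random.default_rng(1)
x = np.concatenate([rng.normal(-1, 0.5, 50), rng.normal(1, 0.5, 50)])
y = np.concatenate([np.zeros(50), np.ones(50)])

m, b, eta = 0.0, 0.0, 0.1
for step in range(1000):
    o = sigmoid(m * x + b)           # predicted probability
    # gradient of the squared-error loss, using do/dz = o(1 - o)
    d = (o - y) * o * (1 - o)
    m -= eta * 2.0 * np.mean(d * x)
    b -= eta * 2.0 * np.mean(d)
print(m, b)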
Training Neural networks – Optimization method
[Figure: a single-layer network for MNIST. Inputs x_1, x_2, …, x_784 are fully connected through weights W_{1,1}, …, W_{10,784} to 10 output neurons with pre-activations a_1, …, a_10 and activated outputs o_1, …, o_10.]
LOSS OPTIMIZATION WITH GRADIENT DESCENT
Mathematical modeling of Training Process
Example
[Figure: the same 784-input, 10-output single-layer network.]
Mathematical modeling of Training Process
Desired outputs (labels)

[Figure: the network with a one-hot label vector as the desired output; for this training sample, one output neuron's target is 1 and all others are 0.]
Mathematical modeling of Training Process
Desired outputs (labels)

[Figure: another training sample with a different one-hot label; a different output neuron now has target 1.]
Mathematical modeling of Training Process
Randomly initialize the weights W.

[Figure: with random initial weights, the network's outputs (e.g., 0.5, 0.6, 0.9) do not match the one-hot desired outputs.]
Mathematical modeling of Training Process
$W^{*} = \operatorname{argmin}_{W}(\mathrm{Loss})$

[Figure: training searches for the weights that minimize the loss; the outputs (e.g., 0.9, 0.2, 0.1) move toward the desired one-hot outputs.]
Mathematical modeling of Training Process
Training process
[Figure: the training loop. For each input, the network produces predictive outputs o (e.g., 0.5, 0.6, 0.9), which are compared with the desired outputs y (1, 0, 0); the resulting error drives the weight updates.]
Mathematical modeling of Training Process
For simplicity, set b = 0.

[Figure: the 784-input, 10-output network with sigmoid outputs.]

Forward pass for the j-th output neuron:

$$a_j = \sum_{i=1}^{784} x_i w_{j,i}, \qquad o_j = \sigma(a_j) = \frac{1}{1 + e^{-a_j}}$$

Loss over the batch of 10 training samples (indexed by t), summed over the 10 outputs (indexed by j):

$$L = \frac{1}{10}\sum_{t=1}^{10}\sum_{j=1}^{10}\left(y_j^t - o_j^t\right)^2$$

For the j-th output alone:

$$L = \frac{1}{10}\sum_{t=1}^{10}\left(y_j^t - o_j^t\right)^2$$

Differentiating with respect to $w_{j,i}$ via the chain rule:

$$\frac{\partial L}{\partial w_{j,i}} = -\frac{2}{10}\sum_{t=1}^{10}\left(y_j^t - o_j^t\right)\frac{\partial o_j^t}{\partial w_{j,i}} = -\frac{2}{10}\sum_{t=1}^{10}\left(y_j^t - o_j^t\right)\frac{\partial o_j^t}{\partial a_j^t}\frac{\partial a_j^t}{\partial w_{j,i}}$$

Since $o_j^t = \sigma(a_j^t) = \frac{1}{1 + e^{-a_j^t}}$, its derivative is $\frac{\partial o_j^t}{\partial a_j^t} = o_j^t\left(1 - o_j^t\right)$, and $\frac{\partial a_j^t}{\partial w_{j,i}} = x_i^t$, so

$$\frac{\partial L}{\partial w_{j,i}} = -\frac{2}{10}\sum_{t=1}^{10}\left(y_j^t - o_j^t\right)o_j^t\left(1 - o_j^t\right)x_i^t$$

Gradient descent moves each weight against the gradient:

$$w_{j,i} \leftarrow w_{j,i} - \eta\frac{\partial L}{\partial w_{j,i}}$$

equivalently $w_{j,i} = w_{j,i} + \Delta w_{j,i}$ with

$$\Delta w_{j,i} = \eta\,\frac{2}{10}\sum_{t=1}^{10}\left(y_j^t - o_j^t\right)o_j^t\left(1 - o_j^t\right)x_i^t$$
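A quick numeric sanity check of the key identity σ′(a) = σ(a)(1 − σ(a)) used above (our own verification, not from the slides):

import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

a = np.linspace(-3, 3, 7)
eps = 1e-6
numeric = (sigmoid(a + eps) - sigmoid(a - eps)) / (2 * eps)  # central difference
analytic = sigmoid(a) * (1 - sigmoid(a))                     # o(1 - o)
print(np.max(np.abs(numeric - analytic)))  # prints a tiny value, confirming the identity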
PYTHON CODE
Translating mathematics into code
Translating mathematics into code
MNIST dataset: 60,000 training samples, 10,000 testing samples.

[Figure: the 784-input, 10-output single-layer network applied to flattened MNIST images.]

Gradient descent, as derived above:

$$\frac{\partial L}{\partial w_{j,i}} = -\frac{2}{10}\sum_{t=1}^{10}\left(y_j^t - o_j^t\right)o_j^t\left(1 - o_j^t\right)x_i^t$$

$$w_{j,i} = w_{j,i} + \Delta w_{j,i}, \qquad \Delta w_{j,i} = \eta\,\frac{2}{10}\sum_{t=1}^{10}\left(y_j^t - o_j^t\right)o_j^t\left(1 - o_j^t\right)x_i^t$$
Neuron’s output
[Figure: the weight matrix W has one row per neuron: row j holds $W_{j,1}, W_{j,2}, \dots, W_{j,784}$. The input X is a row vector $(x_1, x_2, \dots, x_{784})$, so multiplying X by $W^T$ produces all pre-activations $(a_1, a_2, \dots, a_{10})$ at once.]

Per neuron:

$$a_j = \sum_{i=1}^{784} x_i w_{j,i}, \qquad o_j = \sigma(a_j) = \frac{1}{1 + e^{-a_j}}$$

In matrix form:

$$a = X W^T, \qquad o = \sigma(a) = \frac{1}{1 + e^{-a}}$$
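A quick shape check of the matrix form (our own sketch with random values):

import numpy as np

X = np.random.rand(1, 784)                    # one input image as a row vector
W = np.random.uniform(-0.1, 0.1, (10, 784))   # one row of weights per neuron
a = X @ W.T                                   # (1, 784) @ (784, 10) -> (1, 10)
o = 1.0 / (1.0 + np.exp(-a))                  # sigmoid applied element-wise
print(a.shape, o.shape)                       # (1, 10) (1, 10)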
Neuron's output – batch of inputs
[Figure: a batch of 10 input images stacked as the rows of X: row t holds $x_1^{(t)}, x_2^{(t)}, \dots, x_{784}^{(t)}$. A single matrix product then computes the pre-activations for every sample and every neuron: row t of $a = X W^T$ is $(a_1^{(t)}, a_2^{(t)}, \dots, a_{10}^{(t)})$, and $o = \sigma(a)$ holds the corresponding outputs $o_j^{(t)}$.]

$$a_j = \sum_{i=1}^{784} x_i w_{j,i}, \qquad o_j = \sigma(a_j) = \frac{1}{1 + e^{-a_j}}$$

$$a = X W^T, \qquad o = \sigma(a) = \frac{1}{1 + e^{-a}}$$
Calculating the gradient

Per weight:

$$\frac{\partial L}{\partial w_{j,i}} = -\frac{2}{10}\sum_{t=1}^{10}\left(y_j^t - o_j^t\right)o_j^t\left(1 - o_j^t\right)x_i^t$$

$$w_{j,i} = w_{j,i} + \Delta w_{j,i}, \qquad \Delta w_{j,i} = \eta\,\frac{2}{10}\sum_{t=1}^{10}\left(y_j^t - o_j^t\right)o_j^t\left(1 - o_j^t\right)x_i^t$$

In matrix form, with $\odot$ denoting the element-wise product:

$$d = (y - o) \odot o \odot (1 - o)$$

[Figure: d is a 10×10 matrix with entries $d_j^{(t)}$, one row per sample and one column per neuron; multiplying $d^T$ by X sums the per-sample contributions into the 10×784 update matrix $\Delta W = [\Delta w_{j,i}]$.]

$$\Delta W = \eta\,\frac{2}{10}\,d^T X$$
Load data set, pick out 10 images
import numpy as np
import tensorflow as tf
import matplotlib.pyplot as plt

print('load data from MNIST')
mnist = tf.keras.datasets.mnist
(x_train, y_train), (x_test, y_test) = mnist.load_data()
dig = np.array([1,3,5,7,9,11,13,15,17,19])  # training indices whose labels are the digits 0-9, in order
x = x_train[dig,:,:]                        # one 28x28 image per digit
y = np.eye(10,10)                           # one-hot desired outputs, one row per digit
plt.subplot(121)
plt.imshow(x[0])
plt.subplot(122)
plt.imshow(x[1])
x = np.reshape(x,(-1,784))/255              # flatten images to 784-vectors and scale to [0,1]
Define parameters and functions
def sigmoid(x):
    return 1./(1.+np.exp(-x))

W = np.random.uniform(-0.1,0.1,(10,784))    # random initial weights, one row per output neuron
o = sigmoid(np.matmul(x,W.transpose()))     # matrix multiplication: o = sigmoid(x W^T)
print('output of first neuron with 10 digits ', o[:,0])
fig = plt.figure()
plt.bar(range(10), o[:,0])                  # first neuron's response to each of the 10 digits
plt.show()
Training
This is just a simple example to intuitively understand how to translate the math into Python code.

#training process
n = 0.05                     # learning rate; the constant factor 2/10 from the math is folded into it
num_epoch = 10
for epoch in range(num_epoch):
    o = sigmoid(np.matmul(x,W.transpose()))   # forward pass: o = sigmoid(x W^T)
    loss = np.power(o-y,2).mean()             # mean squared error
    # gradient for all weights: d = (y - o)*o*(1 - o), dW = d^T x
    dW = np.transpose((y-o)*o*(1-o))@x
    # update: W = W + n*dW
    W = W + n*dW
    print(loss)
o = sigmoid(np.matmul(x,W.transpose()))     # outputs after training
print('output of the first neuron with 10 input digits ', o[:,0])
fig = plt.figure()
plt.bar(range(10), o[:,0])                  # after training, the first neuron should fire mainly for its digit (0)
plt.show()