GeoStat DeepLearn NDesassis 15 06 22
nicolas.desassis@minesparis.psl.eu
January 2022
Realistic simulations of meandering systems: Flumy
Introduction
Deep learning is the part of machine learning dealing with neural networks.
A neural network (NN) is a function fW from Rn to (a subset of) Rp, parameterized by a large number of parameters (the weights, denoted W).
In general, NNs have a high capacity, or great expressive power, because the number of weights is large (see later).
Thanks to generic (gradient-based) algorithms, NNs can help solve problems expressed as the optimization of a functional.
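To make the definition above concrete, here is a toy sketch (illustrative only, not from the slides: the hidden width, the tanh activation and the random weights are arbitrary choices). It builds a function fW from R^3 to R^2 whose behaviour is entirely determined by its weights W = (W1, b1, W2, b2):

```python
import numpy as np

def f_W(x, W1, b1, W2, b2):
    """A toy neural network f_W: R^n -> R^p with one hidden layer.

    The weights W = (W1, b1, W2, b2) are the parameters to be learned."""
    hidden = np.tanh(W1 @ x + b1)   # hidden layer with tanh activation
    return W2 @ hidden + b2          # linear output layer

# Example: n = 3 inputs, 5 hidden units, p = 2 outputs
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(5, 3)), np.zeros(5)
W2, b2 = rng.normal(size=(2, 5)), np.zeros(2)
print(f_W(np.array([0.1, -0.2, 0.3]), W1, b1, W2, b2))
```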
Examples
Classification
Compression (or dimension reduction)
Find an encoder fe which reduces the dimension of the input (e.g. an image) and the associated decoder fd to transform it back. The composition fd ◦ fe has to minimize the reconstruction error.
Spatio-temporal prediction
Example: weather forecasting from satellite images
Partial Differential Equation resolution
Generation: find a random process to generate images with a distribution "close" to the one of the training images
Some applications
Computer vision: medical diagnoses, facial recognition, handwriting recognition, ...
Data sciences: predictive maintenance, click prediction, default payment prediction, ...
Natural language processing (NLP): chatbots, automatic translation, spam detection, ...
Automatons: chess and Go, ...
Generation of art (music, pictures, text, ...)
Some reasons for such a success
Big Data
New algorithms
Python
Google & Facebook → TensorFlow, PyTorch
New computational facilities → cloud, GPU, TPU, ...
Open-source codes
Blogs
MOOCs
Impressive successes
Outline
Classification and regression
N individuals (for instance N images)
For each individual i, we know
a set of features
$$x_i = \begin{pmatrix} x_{1,i} \\ \vdots \\ x_{n,i} \end{pmatrix}$$
a variable of interest $y_i \in \mathbb{R}^m$
If $y_i$ is a categorical variable, $y_i$ is called the label
Aim
Predict the value of y for a new individual, knowing its features x
Matrix representation
Matrix of the features for the N individuals
$$X = \begin{pmatrix} x_{1,1} & \dots & x_{1,N} \\ \vdots & \ddots & \vdots \\ x_{n,1} & \dots & x_{n,N} \end{pmatrix}$$
Matrix of labels
$$Y = \begin{pmatrix} y_{1,1} & \dots & y_{1,N} \\ \vdots & \ddots & \vdots \\ y_{m,1} & \dots & y_{m,N} \end{pmatrix}$$
For classification with K classes (one-hot encoding): $y_{k,i} = 1_{y_i = k}$ and $m = K$
Example: binary classification
[Diagram: input layer, hidden layer, output layer]
Find W such that all the fW(xi)'s of a training set are "close" to the associated labels yi's, where
xi is the vector of features of the i-th image (pixel values)
yi is the label (cat or not cat)
Artificial neuron
A first model: McCulloch and Pitts (1943)
$$s = \sum_{i=1}^{n} \omega_i x_i \qquad h(s) = \begin{cases} 1 & \text{if } s \ge \theta \\ 0 & \text{if } s < \theta \end{cases}$$
Inputs xi ∈ {0, 1}
Weights ωi
Step function h
Threshold θ
Output h(s) ∈ {0, 1}
Implementing logical gates
NOT (ω1 = −1, θ = −0.5)
$$s = -x_1 \qquad h(s) = \begin{cases} 1 & \text{if } s \ge -0.5 \\ 0 & \text{if } s < -0.5 \end{cases}$$
x1  h(s)
0   1
1   0
Implementing logical gates
AND (ω1 = ω2 = 1, θ = 1.2)
$$s = x_1 + x_2 \qquad h(s) = \begin{cases} 1 & \text{if } s \ge 1.2 \\ 0 & \text{if } s < 1.2 \end{cases}$$
x1  x2  h(s)
0   0   0
0   1   0
1   0   0
1   1   1
Implementing logical gates
OR (ω1 = ω2 = 1, θ = 0.8)
$$s = x_1 + x_2 \qquad h(s) = \begin{cases} 1 & \text{if } s \ge 0.8 \\ 0 & \text{if } s < 0.8 \end{cases}$$
x1  x2  h(s)
0   0   0
0   1   1
1   0   1
1   1   1
Implementing logical gates
XOR?
$$s = \omega_1 x_1 + \omega_2 x_2 \qquad h(s) = \begin{cases} 1 & \text{if } s \ge \theta \\ 0 & \text{if } s < \theta \end{cases}$$
x1  x2  h(s)
0   0   0
0   1   1
1   0   1
1   1   0
No single neuron can produce this truth table (XOR is not linearly separable), but
x1 ⊕ x2 = (x1 ∨ x2) ∧ ¬(x1 ∧ x2)
Idea: combine neurons
x1 ⊕ x2 = (x1 ∨ x2) ∧ ¬(x1 ∧ x2)
[Diagram: x1 and x2 feed an OR neuron and an AND neuron; a final neuron combines their outputs]
x1  x2  h(s)
0   0   0
0   1   1
1   0   1
1   1   0
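The decomposition can be checked directly by wiring together the threshold neurons from the previous slides. This is a minimal Python sketch (not code from the presentation); the thresholds 1.2, 0.8 and −0.5 are the ones used above for AND, OR and NOT:

```python
def neuron(weights, threshold, inputs):
    """McCulloch-Pitts neuron: output 1 if the weighted sum reaches the threshold."""
    s = sum(w * x for w, x in zip(weights, inputs))
    return 1 if s >= threshold else 0

def AND(x1, x2): return neuron([1, 1], 1.2, [x1, x2])
def OR(x1, x2):  return neuron([1, 1], 0.8, [x1, x2])
def NOT(x1):     return neuron([-1], -0.5, [x1])

def XOR(x1, x2):
    # x1 XOR x2 = (x1 OR x2) AND NOT(x1 AND x2)
    return AND(OR(x1, x2), NOT(AND(x1, x2)))

for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, XOR(x1, x2))   # prints the XOR truth table
```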
A first theorem on the capacity
For every integer n, one can build a graph which contains all the functions from {0, 1}^n to {0, 1}.
Unfortunately, the number of necessary neurons in the intermediate layer grows exponentially with n.
First steps: the perceptron
The perceptron
Stack a 1 on top of the input and add the bias ω0 to the weights:
$$\tilde{x} = (1, x_1, \dots, x_n)^T, \qquad \tilde{\omega} = (\omega_0, \omega_1, \dots, \omega_n)^T$$
$$s = \langle \tilde{\omega}, \tilde{x} \rangle = \omega_0 \cdot 1 + \sum_{i=1}^{n} \omega_i x_i \qquad h(s) = \begin{cases} 1 & \text{if } s \ge 0 \\ 0 & \text{if } s < 0 \end{cases}$$
Inputs xi ∈ R
Bias ω0
Weights ωi
Heaviside step function h
Output h(s) ∈ {0, 1}
Perceptron algorithm
Choose an initial weight vector ω̃
continue ← true
while continue do
    continue ← false
    for i from 1 to N do
        if (⟨ω̃, x̃i⟩ ≥ 0 and yi = 0) or (⟨ω̃, x̃i⟩ < 0 and yi = 1) then
            ω̃ ← ω̃ + (yi − 1_{⟨ω̃, x̃i⟩ ≥ 0}) x̃i
            continue ← true
        end if
    end for
end while
Return ω̃
This algorithm converges in a finite number of iterations... if a solution exists (i.e. if the two classes are linearly separable)!
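A minimal NumPy sketch of this algorithm (not code from the presentation); the max_epochs guard is an addition so that the loop also stops when no separating hyperplane exists:

```python
import numpy as np

def perceptron(X, y, max_epochs=1000):
    """Perceptron learning rule on augmented inputs (a 1 is stacked for the bias).

    X: (N, n) array of features, y: (N,) array of 0/1 labels.
    Returns the augmented weight vector w_tilde of length n + 1."""
    X_tilde = np.hstack([np.ones((X.shape[0], 1)), X])  # stack a 1 in front of each x_i
    w = np.zeros(X_tilde.shape[1])                       # initial weight vector
    for _ in range(max_epochs):
        updated = False
        for x_i, y_i in zip(X_tilde, y):
            pred = 1 if np.dot(w, x_i) >= 0 else 0
            if pred != y_i:                              # misclassified point
                w += (y_i - pred) * x_i                  # w <- w + (y_i - prediction) x_i
                updated = True
        if not updated:                                  # full pass without error: done
            return w
    return w

# Toy usage: learn the OR function
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 1, 1, 1])
print(perceptron(X, y))
```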
Implementation
Regression for binary outcomes
Probit model (Bliss and Fisher, 1935) and Logistic Regression (Berkson, 1944)
Probit model: h = G, the cumulative distribution function of the standard Gaussian
Logistic model:
$$h(x) = \frac{1}{1 + e^{-x}}$$
The same neuron as the perceptron, with the Heaviside step replaced by a smooth activation:
$$s = \langle \tilde{\omega}, \tilde{x} \rangle \qquad h(s) \in (0, 1)$$
Inputs xi ∈ R
Bias ω0
Weights ωi
Activation function h (e.g. the logistic function)
Output h(s) ∈ (0, 1)
Inference by maximum likelihood
Writing $p_i(\tilde\omega) = h(\langle \tilde\omega, \tilde{x}_i \rangle)$ for the predicted probability of observation i, the likelihood of the observations is
$$\mathrm{Likelihood}(\tilde\omega; y_1, \dots, y_N) = \prod_{i=1}^{N} p_i(\tilde\omega)^{y_i} \left(1 - p_i(\tilde\omega)\right)^{1 - y_i}$$
It is equivalent to maximize the log-likelihood
$$\sum_{i=1}^{N} \left[y_i \log p_i(\tilde\omega) + (1 - y_i) \log\left(1 - p_i(\tilde\omega)\right)\right]$$
or to minimize the binary cross-entropy
$$L(\tilde\omega) = -\sum_{i=1}^{N} \left[y_i \log p_i(\tilde\omega) + (1 - y_i) \log\left(1 - p_i(\tilde\omega)\right)\right]$$
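A small NumPy sketch of this loss and of its gradient for the logistic model (names and shapes are illustrative assumptions, not code from the presentation). With the sigmoid activation the gradient collapses to X̃ᵀ(p − y), one instance of the simplifications mentioned later for the loss functions:

```python
import numpy as np

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

def binary_cross_entropy(w_tilde, X_tilde, y):
    """Loss L(w) above: negative log-likelihood of the logistic model.

    X_tilde: (N, n+1) augmented features (column of ones for the bias), y: labels in {0, 1}."""
    p = sigmoid(X_tilde @ w_tilde)                  # p_i(w) for every observation
    return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

def gradient(w_tilde, X_tilde, y):
    """Gradient of L(w); with the sigmoid it simplifies to X_tilde^T (p - y)."""
    p = sigmoid(X_tilde @ w_tilde)
    return X_tilde.T @ (p - y)
```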
Neural network
Multi-Layer Perceptron
A neural network is a graph in which the outputs of some neurons (nodes) become the inputs of other neurons (directed graph).
One layer
INPUT: $v^{(l-1)} \in \mathbb{R}^{n_{l-1}}$    OUTPUT: $v^{(l)} \in \mathbb{R}^{n_l}$
2 attributes for layer l:
1) A weight matrix W̃^(l)
2) An activation function h^(l)
3 steps:
1) Stack a 1 on top of the input: $\tilde{v}^{(l-1)} = (1, v^{(l-1)})$
2) Compute $s^{(l)} = \tilde{W}^{(l)} \tilde{v}^{(l-1)}$
3) Return $v^{(l)} = h^{(l)}(s^{(l)})$
Component-wise:
$$s_i^{(l)} = \omega_{i,0}^{(l)} \cdot 1 + \sum_{j=1}^{n_{l-1}} \omega_{i,j}^{(l)} v_j^{(l-1)}$$
In matrix form:
$$\underbrace{\begin{pmatrix} s_1^{(l)} \\ s_2^{(l)} \\ \vdots \\ s_{n_l}^{(l)} \end{pmatrix}}_{s^{(l)}}
= \underbrace{\begin{pmatrix}
\omega_{1,0}^{(l)} & \omega_{1,1}^{(l)} & \cdots & \omega_{1,n_{l-1}}^{(l)} \\
\omega_{2,0}^{(l)} & \omega_{2,1}^{(l)} & \cdots & \omega_{2,n_{l-1}}^{(l)} \\
\vdots & \vdots & \ddots & \vdots \\
\omega_{n_l,0}^{(l)} & \omega_{n_l,1}^{(l)} & \cdots & \omega_{n_l,n_{l-1}}^{(l)}
\end{pmatrix}}_{\tilde{W}^{(l)}}
\underbrace{\begin{pmatrix} 1 \\ v_1^{(l-1)} \\ \vdots \\ v_{n_{l-1}}^{(l-1)} \end{pmatrix}}_{\tilde{v}^{(l-1)}}$$
and $v^{(l)} = h^{(l)}(s^{(l)})$, where $h^{(l)}$ is applied component-wise.
Summary
The l-th layer of perceptrons is a function $f^{(l)}_{\tilde{W}^{(l)}}$ from $\mathbb{R}^{n_{l-1}}$ to $\mathbb{R}^{n_l}$ defined by:
$$v^{(l)} = f^{(l)}_{\tilde{W}^{(l)}}(v^{(l-1)}) = h^{(l)}\left(\tilde{W}^{(l)} \tilde{v}^{(l-1)}\right) = h^{(l)}\left(b^{(l)} + W^{(l)} v^{(l-1)}\right)$$
Multi-Layer Perceptron
Set $v^{(0)} = x$
For l = 1, ..., L compute
$$v^{(l)} = f^{(l)}_{\tilde{W}^{(l)}}(v^{(l-1)})$$
Return $v^{(L)} = f_{\mathbf{W}}(x)$
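A minimal NumPy sketch of the three layer steps and of the forward pass (layer sizes and activations are illustrative assumptions, not the networks used in the presentation):

```python
import numpy as np

def layer_forward(v_prev, W_tilde, h):
    """One layer: stack a 1, multiply by the augmented weight matrix, apply the activation."""
    v_tilde = np.concatenate([[1.0], v_prev])   # step 1: stack a 1 on top of the input
    s = W_tilde @ v_tilde                        # step 2: s^(l) = W_tilde^(l) v_tilde^(l-1)
    return h(s)                                  # step 3: v^(l) = h^(l)(s^(l))

def mlp_forward(x, weights, activations):
    """Forward pass of the multi-layer perceptron: v^(0) = x, then chain the layers."""
    v = x
    for W_tilde, h in zip(weights, activations):
        v = layer_forward(v, W_tilde, h)
    return v                                     # v^(L) = f_W(x)

# Toy usage: 3 inputs -> 4 hidden units -> 1 output; augmented matrices are (n_l, n_{l-1}+1)
rng = np.random.default_rng(0)
weights = [rng.normal(size=(4, 4)), rng.normal(size=(1, 5))]
activations = [np.tanh, lambda s: 1 / (1 + np.exp(-s))]
print(mlp_forward(np.array([0.2, -0.1, 0.5]), weights, activations))
```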
Capacity of a neural network
Universal approximation theorem (Hornik, 1991)
Let f be a continuous function f : [0, 1]^n → R and ε > 0.
Let h^(1) be a non-constant, increasing, bounded real function.
Then, there exist
an integer q,
a matrix of weights W̃^(1) ∈ R^{q×n},
a vector of weights ω^(2) ∈ R^q,
a real b^(2),
such that for all x ∈ [0, 1]^n,
$$\left| f(x) - b^{(2)} - \left\langle \omega^{(2)}, h^{(1)}\!\left(\tilde{W}^{(1)} \tilde{x}\right) \right\rangle \right| < \varepsilon$$
Loss functions
$$L(\mathbf{W}) = \sum_{i=1}^{N} \delta\left(y_i, f_{\mathbf{W}}(x_i)\right)$$
Regression
$$\delta\left(y, f_{\mathbf{W}}(x)\right) = \sum_{k=1}^{m} \left(y_k - f_{\mathbf{W}}(x)_k\right)^2$$
Sum of squares
Likelihood for Gaussian noise
Simplifications occur when the activation function of the output layer is the identity
Binary classification
$$\delta\left(y, f_{\mathbf{W}}(x)\right) = -y \log\left(f_{\mathbf{W}}(x)\right) - (1 - y) \log\left(1 - f_{\mathbf{W}}(x)\right)$$
Binary cross-entropy
Likelihood for Bernoulli
Simplifications occur when the activation function of the output layer is the sigmoid (a.k.a. logistic)
Multiclass classification
$$\delta\left(y, f_{\mathbf{W}}(x)\right) = -\sum_{k=1}^{K} y_k \log\left(f_{\mathbf{W}}(x)_k\right)$$
Cross-entropy
Likelihood for the multinomial distribution
Simplifications occur when the activation function of the output layer is the softmax
Softmax activation function
$$v_i = \frac{e^{s_i}}{\sum_{j=1}^{n_L} e^{s_j}}$$
so that
$$\sum_{i=1}^{n_L} v_i = 1 \quad \text{and} \quad v_i > 0$$
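A small sketch of the softmax (illustrative; the max-subtraction is a standard numerical-stability trick, not something stated on the slide):

```python
import numpy as np

def softmax(s):
    """Softmax activation: outputs are positive and sum to 1.

    Subtracting the max does not change the result but avoids overflow in exp."""
    e = np.exp(s - np.max(s))
    return e / np.sum(e)

print(softmax(np.array([2.0, 1.0, 0.1])))  # approx. [0.659, 0.242, 0.099]
```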
Training neural networks: gradient descent
The gradient
$$\nabla L(\tilde\omega) = \begin{pmatrix} \frac{dL}{d\omega_0}(\tilde\omega) \\ \vdots \\ \frac{dL}{d\omega_n}(\tilde\omega) \end{pmatrix}$$
Initialisation: choose $\tilde\omega^{(0)} \in \mathbb{R}^{n+1}$
$$\tilde\omega^{(t+1)} = \tilde\omega^{(t)} - \frac{\eta}{N} \nabla L\left(\tilde\omega^{(t)}\right)$$
η > 0 is the learning rate, chosen by the user
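A minimal sketch of this update rule (illustrative only: the toy quadratic loss and learning rate are arbitrary, and the 1/N factor from the slide is assumed to be folded into grad_L or eta):

```python
import numpy as np

def gradient_descent(grad_L, w0, eta=0.1, n_steps=1000):
    """Plain gradient descent: w^(t+1) = w^(t) - eta * grad_L(w^(t))."""
    w = np.array(w0, dtype=float)
    for _ in range(n_steps):
        w = w - eta * grad_L(w)
    return w

# Toy usage: minimize L(w) = ||w - 3||^2, whose gradient is 2 (w - 3)
print(gradient_descent(lambda w: 2 * (w - 3.0), w0=[0.0, 0.0]))  # converges to [3, 3]
```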
Training a neural network: back-propagation
Chain rule:
$$(f \circ g \circ h)'(x) = f'\left(g(h(x))\right) \cdot g'\left(h(x)\right) \cdot h'(x)$$
Leibniz notation:
$$x \xrightarrow{h} y \xrightarrow{g} u \xrightarrow{f} z \qquad \frac{dz}{dx} = \frac{dz}{du}\,\frac{du}{dy}\,\frac{dy}{dx}$$
Back-propagation
$$\frac{dL}{d\omega^{(L)}_{i,j}} = \underbrace{\frac{dL}{dv^{(L)}}\,\frac{dv^{(L)}}{ds^{(L)}}}_{E^{(L)}}\;\frac{ds^{(L)}}{d\omega^{(L)}_{i,j}}$$
$$\frac{dL}{d\omega^{(L-1)}_{i,j}} = \underbrace{E^{(L)}\,\frac{ds^{(L)}}{dv^{(L-1)}}\,\frac{dv^{(L-1)}}{ds^{(L-1)}}}_{E^{(L-1)}}\;\frac{ds^{(L-1)}}{d\omega^{(L-1)}_{i,j}}$$
$$\frac{dL}{d\omega^{(L-2)}_{i,j}} = \underbrace{E^{(L-1)}\,\frac{ds^{(L-1)}}{dv^{(L-2)}}\,\frac{dv^{(L-2)}}{ds^{(L-2)}}}_{E^{(L-2)}}\;\frac{ds^{(L-2)}}{d\omega^{(L-2)}_{i,j}}$$
The error terms E^(l) are computed recursively from the output layer back towards the input; each weight gradient then only requires the local factor $ds^{(l)}/d\omega^{(l)}_{i,j}$.
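To make the recursion concrete, here is a minimal sketch of back-propagation for a two-layer network. The tanh hidden activation, sigmoid output and binary cross-entropy loss are choices made for the illustration (with them the output error term E^(L) reduces to p − y); it is not code from the presentation:

```python
import numpy as np

def sigmoid(s):
    return 1 / (1 + np.exp(-s))

def forward_backward(x, y, W1, b1, W2, b2):
    """Backprop for a 2-layer net (tanh hidden layer, sigmoid output, binary cross-entropy).

    Returns the loss and the gradients with respect to every weight."""
    # Forward pass (keep the intermediate values needed by the backward pass)
    s1 = W1 @ x + b1
    v1 = np.tanh(s1)
    s2 = W2 @ v1 + b2
    p = sigmoid(s2)                              # network output in (0, 1)
    loss = -(y * np.log(p) + (1 - y) * np.log(1 - p))

    # Backward pass: propagate the error terms E^(l) from the output layer backwards
    e2 = p - y                                   # dL/ds2 (sigmoid + cross-entropy simplification)
    dW2 = np.outer(e2, v1)
    db2 = e2
    e1 = (W2.T @ e2) * (1 - v1 ** 2)             # dL/ds1 (tanh' = 1 - tanh^2)
    dW1 = np.outer(e1, x)
    db1 = e1
    return loss, dW1, db1, dW2, db2

# Toy usage: one example with 3 inputs and 4 hidden units
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(4, 3)), np.zeros(4)
W2, b2 = rng.normal(size=(1, 4)), np.zeros(1)
print(forward_backward(np.array([0.5, -1.0, 2.0]), 1.0, W1, b1, W2, b2)[0])
```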
Stochastic gradient and variants
Reading group - Mike Pereira
The evaluation of the loss function gradient can be costly when the training set is large.
Instead of computing the gradient on the entire training set, it is computed on mini-batches of moderate size nb (e.g. nb = 20, but it can be 1).
$$\mathbf{W}^{(t+1)} = \mathbf{W}^{(t)} - \eta \nabla L_{n_b}\left(\mathbf{W}^{(t)}\right)$$
The number of epochs is the number of times the entire data set is visited.
At the beginning of each epoch, the data set is randomly shuffled.
The cost function and its gradient are divided by nb (to make the choice of η independent of nb).
Adaptive Moment Estimation (ADAM)
Kingma and Ba (2014)
Initialisation, then for t ≥ 1 the weights are updated using running estimates of the first and second moments of the gradient (see the sketch below).
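Since the slide's update equations did not survive extraction, the sketch below restates the published Adam algorithm of Kingma and Ba (2014); the default hyper-parameters are the ones proposed in that paper:

```python
import numpy as np

def adam(grad_L, w0, eta=0.001, beta1=0.9, beta2=0.999, eps=1e-8, n_steps=1000):
    """Adam update (Kingma and Ba, 2014).

    m and v are running estimates of the first and second moments of the gradient;
    m_hat and v_hat are the bias-corrected versions."""
    w = np.array(w0, dtype=float)
    m = np.zeros_like(w)
    v = np.zeros_like(w)
    for t in range(1, n_steps + 1):
        g = grad_L(w)
        m = beta1 * m + (1 - beta1) * g          # first moment estimate
        v = beta2 * v + (1 - beta2) * g ** 2     # second moment estimate
        m_hat = m / (1 - beta1 ** t)             # bias correction
        v_hat = v / (1 - beta2 ** t)
        w = w - eta * m_hat / (np.sqrt(v_hat) + eps)
    return w

# Toy usage: minimize L(w) = ||w - 3||^2
print(adam(lambda w: 2 * (w - 3.0), w0=[0.0, 0.0], eta=0.1, n_steps=500))
```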
Generative model: the problem
Aim
Given a (large?) set of training examples, generate new examples which "look like" the training examples in terms of
fidelity or realism
diversity or variability
In other words, given an empirical distribution PD over Rn, find a probability distribution (the model) PG, together with its simulation algorithm (the generator), which is close to PD (or to its theoretical version P).
First idea: use a Gaussian transform approach
Generate a latent vector
z ∼ Nn(0, In)
Compute
x = G(z)
where G = fW is a neural network with weights W to determine.
Mathematical justifications:
Most distributions on Rn can be written as a function of a Gaussian vector of Rn with independent components (inversion method + decomposition of multivariate densities as a product of conditional distributions)
By using a high-capacity neural network G, we can approximate the target function (universal approximation theorem)
Generator
[Diagram: NOISE → generator network → FAKE image]
Example of geostatistical deep generative model
Cholesky algorithm
Σ = L Lᵀ
Sample z ∼ Nn(0, In)
Return x = µ + L z
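A minimal NumPy sketch of the Cholesky algorithm (the exponential covariance on a 1D grid is only an illustrative choice, not a model from the presentation):

```python
import numpy as np

def cholesky_simulation(mu, sigma, rng=np.random.default_rng()):
    """Simulate a Gaussian vector N(mu, sigma) by the Cholesky algorithm above."""
    L = np.linalg.cholesky(sigma)        # sigma = L L^T
    z = rng.standard_normal(len(mu))     # z ~ N_n(0, I_n)
    return mu + L @ z                    # x = mu + L z

# Toy usage: exponential covariance on a regular 1D grid of 100 points
coords = np.arange(100)
sigma = np.exp(-np.abs(coords[:, None] - coords[None, :]) / 10.0)
x = cholesky_simulation(np.zeros(100), sigma)
print(x[:5])
```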
Problem: How to handle the large number of parameters?
[https://line.17qq.com/articles/cmmcohgcpv.html]
Example - Dimension reduction (or compression)
Second idea: Dimension reduction
[https://www.compthree.com/blog/autoencoder/]
Example: compression of Flumy images
Simulate in the latent space
Generator
[Diagram: NOISE → generator network → FAKE image]
Generative Adversarial Network (GAN)
A two-player game
[Diagram: the GENERATOR turns NOISE into a FAKE image; the DISCRIMINATOR receives REAL and FAKE images and must decide REAL or FAKE]
Generative Adversarial Networks (GAN)
Two adversarial neural networks
A generator X = Gθ(z), where z is a Gaussian noise N(0, Id) and θ are the weights
A discriminator Dϕ(X), which classifies images as "true" or "fake" (ϕ are the weights)
with the two networks trained on the adversarial objective (Goodfellow et al., 2014)
$$\min_\theta \max_\phi \; \mathbb{E}_{X \sim P_D}\left[\log D_\phi(X)\right] + \mathbb{E}_{z \sim N(0, I_d)}\left[\log\left(1 - D_\phi(G_\theta(z))\right)\right]$$
Training
Goodfellow et al. (2014)
for number of training iterations do
    for k steps do
        Sample a minibatch of mb noise samples {z1, . . . , zmb} ∼ N(0, Id)
        Sample a minibatch of mb examples {x1, . . . , xmb} from the training data
        Update the discriminator by ascending its stochastic gradient
        $$\nabla_\phi \frac{1}{m_b} \sum_{i=1}^{m_b} \left[\log D_\phi(x_i) + \log\left(1 - D_\phi(G_\theta(z_i))\right)\right]$$
    end for
    Sample a minibatch of mb noise samples {z1, . . . , zmb} ∼ N(0, Id)
    Update the generator by descending its stochastic gradient
    $$\nabla_\theta \frac{1}{m_b} \sum_{i=1}^{m_b} \log\left(1 - D_\phi(G_\theta(z_i))\right)$$
end for
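A compact PyTorch sketch of one training iteration (the architectures, hyper-parameters and fake "real" data are illustrative assumptions, and the inner "k steps" loop is omitted):

```python
import torch
import torch.nn as nn

latent_dim, data_dim, mb = 16, 64, 32
G = nn.Sequential(nn.Linear(latent_dim, 128), nn.ReLU(), nn.Linear(128, data_dim))
D = nn.Sequential(nn.Linear(data_dim, 128), nn.ReLU(), nn.Linear(128, 1), nn.Sigmoid())
opt_G = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_D = torch.optim.Adam(D.parameters(), lr=2e-4)

def training_step(real_batch):
    # Discriminator step: ascend log D(x) + log(1 - D(G(z))) (i.e. descend its opposite)
    z = torch.randn(mb, latent_dim)
    fake = G(z).detach()                                  # do not backpropagate through G here
    loss_D = -(torch.log(D(real_batch)) + torch.log(1 - D(fake))).mean()
    opt_D.zero_grad(); loss_D.backward(); opt_D.step()

    # Generator step: descend log(1 - D(G(z)))
    z = torch.randn(mb, latent_dim)
    loss_G = torch.log(1 - D(G(z))).mean()
    opt_G.zero_grad(); loss_G.backward(); opt_G.step()
    return loss_D.item(), loss_G.item()

# Toy call: "real" data is random noise here, only to make the sketch runnable
print(training_step(torch.randn(mb, data_dim)))
```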
Flumy generator

Results
Horizontal section

Example on Plurigaussian

Local adaptation

Conditioning
Zc → Gc(Zc) → X

Results
Reference image
9 conditional simulations

Results
Reference image
Conditional probabilities
References
Chapelle, O., J. Weston, L. Bottou, and V. Vapnik (2001). Vicinal risk minimization. Advances in neural
information processing systems, 416–422.
Glorot, X. and Y. Bengio (2010). Understanding the difficulty of training deep feedforward neural networks. In
Proceedings of the thirteenth international conference on artificial intelligence and statistics, pp. 249–256.
JMLR Workshop and Conference Proceedings.
Goodfellow, I., J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio (2014).
Generative adversarial nets. In Advances in neural information processing systems, pp. 2672–2680.
He, K., X. Zhang, S. Ren, and J. Sun (2016). Deep residual learning for image recognition. In Proceedings of the
IEEE conference on computer vision and pattern recognition, pp. 770–778.
Hornik, K. (1991). Approximation capabilities of multilayer feedforward networks. Neural networks 4 (2), 251–257.
Ioffe, S. and C. Szegedy (2015). Batch normalization: Accelerating deep network training by reducing internal
covariate shift. In International conference on machine learning, pp. 448–456. PMLR.
Kingma, D. P. and J. Ba (2014). Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 .
Krizhevsky, A., I. Sutskever, and G. E. Hinton (2012). Imagenet classification with deep convolutional neural
networks. Advances in neural information processing systems 25, 1097–1105.
LeCun, Y., L. Bottou, Y. Bengio, and P. Haffner (1998). Gradient-based learning applied to document recognition.
Proceedings of the IEEE 86 (11), 2278–2324.
Srivastava, N., G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov (2014). Dropout: A simple way to
prevent neural networks from overfitting. Journal of Machine Learning Research 15 (56), 1929–1958.
Szegedy, C., W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich
(2015). Going deeper with convolutions. In Proceedings of the IEEE conference on computer vision and
pattern recognition, pp. 1–9.