GeoStat DeepLearn NDesassis 15 06 22
nicolas.desassis@minesparis.psl.eu
January 2022
Realistic simulations of meandering systems: Flumy
Introduction
Deep learning is the part of machine learning dealing with neural networks.
A neural network (NN) is a function fW from Rn to (a subset of) Rp, parameterized by a large number of parameters (the weights, denoted W).
In general, NNs have a high capacity, or great expressive power, because the number of weights is large (see later).
Thanks to generic (gradient-based) algorithms, NNs can help solve problems expressed as the optimization of a functional.
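To make the definition above concrete, here is a toy sketch (illustrative only, not from the slides: the hidden width, the tanh activation and the random weights are arbitrary choices). It builds a function fW from R^3 to R^2 whose behaviour is entirely determined by its weights W = (W1, b1, W2, b2):

```python
import numpy as np

def f_W(x, W1, b1, W2, b2):
    """A toy neural network f_W: R^n -> R^p with one hidden layer.

    The weights W = (W1, b1, W2, b2) are the parameters to be learned."""
    hidden = np.tanh(W1 @ x + b1)   # hidden layer with tanh activation
    return W2 @ hidden + b2          # linear output layer

# Example: n = 3 inputs, 5 hidden units, p = 2 outputs
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(5, 3)), np.zeros(5)
W2, b2 = rng.normal(size=(2, 5)), np.zeros(2)
print(f_W(np.array([0.1, -0.2, 0.3]), W1, b1, W2, b2))
```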
Examples
Classification
Compression (or dimension reduction)
Find an encoder fe which reduces the dimension of the input (e.g. an image) and the associated decoder fd to transform it back. The composition fd ◦ fe has to minimize the reconstruction error.
Spatio-temporal prediction
Example: weather forecasting from satellite images
Partial Differential Equation resolution
Generation: find a random process to generate images with a distribution "close" to the one of the training images
Some applications
Computer vision: medical diagnoses, facial recognition, handwriting recognition, ...
Data sciences: predictive maintenance, click prediction, default payment prediction, ...
Natural language processing (NLP): chatbots, automatic translation, spam detection, ...
Automatons: chess and Go, ...
Generation of art (music, pictures, text, ...)
Some reasons for such a success
Big Data
New algorithms
Python
Google & Facebook → TensorFlow, PyTorch
New computational facilities → cloud, GPU, TPU, ...
Open-source codes
Blogs
MOOCs
Impressive successes
Outline
Classification and regression
N individuals (for instance N images)
For each individual i, we know
a set of features
$$x_i = \begin{pmatrix} x_{1,i} \\ \vdots \\ x_{n,i} \end{pmatrix}$$
a variable of interest $y_i \in \mathbb{R}^m$
If $y_i$ is a categorical variable, $y_i$ is called the label
Aim
Predict the value of y for a new individual, knowing its features x
Matrix representation
Matrix of the features for the N individuals
$$X = \begin{pmatrix} x_{1,1} & \dots & x_{1,N} \\ \vdots & \ddots & \vdots \\ x_{n,1} & \dots & x_{n,N} \end{pmatrix}$$
Matrix of labels
$$Y = \begin{pmatrix} y_{1,1} & \dots & y_{1,N} \\ \vdots & \ddots & \vdots \\ y_{m,1} & \dots & y_{m,N} \end{pmatrix}$$
For classification with K classes (one-hot encoding): $y_{k,i} = 1_{y_i = k}$ and $m = K$
Example: binary classification
[Diagram: input layer, hidden layer, output layer]
Find W such that all the fW(xi)'s of a training set are "close" to the associated labels yi's, where
xi is the vector of features of the i-th image (pixel values)
yi is the label (cat or not cat)
Artificial neuron
A first model: McCulloch and Pitts (1943)
$$s = \sum_{i=1}^{n} \omega_i x_i \qquad h(s) = \begin{cases} 1 & \text{if } s \ge \theta \\ 0 & \text{if } s < \theta \end{cases}$$
Inputs xi ∈ {0, 1}
Weights ωi
Step function h
Threshold θ
Output h(s) ∈ {0, 1}
Implementing logical gates
NOT (ω1 = −1, θ = −0.5)
$$s = -x_1 \qquad h(s) = \begin{cases} 1 & \text{if } s \ge -0.5 \\ 0 & \text{if } s < -0.5 \end{cases}$$
x1  h(s)
0   1
1   0
Implementing logical gates
AND (ω1 = ω2 = 1, θ = 1.2)
$$s = x_1 + x_2 \qquad h(s) = \begin{cases} 1 & \text{if } s \ge 1.2 \\ 0 & \text{if } s < 1.2 \end{cases}$$
x1  x2  h(s)
0   0   0
0   1   0
1   0   0
1   1   1
Implementing logical gates
OR (ω1 = ω2 = 1, θ = 0.8)
$$s = x_1 + x_2 \qquad h(s) = \begin{cases} 1 & \text{if } s \ge 0.8 \\ 0 & \text{if } s < 0.8 \end{cases}$$
x1  x2  h(s)
0   0   0
0   1   1
1   0   1
1   1   1
Implementing logical gates
XOR?
$$s = \omega_1 x_1 + \omega_2 x_2 \qquad h(s) = \begin{cases} 1 & \text{if } s \ge \theta \\ 0 & \text{if } s < \theta \end{cases}$$
x1  x2  h(s)
0   0   0
0   1   1
1   0   1
1   1   0
No single neuron can produce this truth table (XOR is not linearly separable), but
x1 ⊕ x2 = (x1 ∨ x2) ∧ ¬(x1 ∧ x2)
Idea: combine neurons
x1 ⊕ x2 = (x1 ∨ x2) ∧ ¬(x1 ∧ x2)
[Diagram: x1 and x2 feed an OR neuron and an AND neuron; a final neuron combines their outputs]
x1  x2  h(s)
0   0   0
0   1   1
1   0   1
1   1   0
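The decomposition can be checked directly by wiring together the threshold neurons from the previous slides. This is a minimal Python sketch (not code from the presentation); the thresholds 1.2, 0.8 and −0.5 are the ones used above for AND, OR and NOT:

```python
def neuron(weights, threshold, inputs):
    """McCulloch-Pitts neuron: output 1 if the weighted sum reaches the threshold."""
    s = sum(w * x for w, x in zip(weights, inputs))
    return 1 if s >= threshold else 0

def AND(x1, x2): return neuron([1, 1], 1.2, [x1, x2])
def OR(x1, x2):  return neuron([1, 1], 0.8, [x1, x2])
def NOT(x1):     return neuron([-1], -0.5, [x1])

def XOR(x1, x2):
    # x1 XOR x2 = (x1 OR x2) AND NOT(x1 AND x2)
    return AND(OR(x1, x2), NOT(AND(x1, x2)))

for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, XOR(x1, x2))   # prints the XOR truth table
```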
A first theorem on the capacity
For every integer n, one can build a graph which contains all the functions from {0, 1}^n to {0, 1}.
Unfortunately, the number of necessary neurons in the intermediate layer grows exponentially with n.
First steps: the perceptron
The perceptron
Stack a 1 on top of the input and add the bias ω0 to the weights:
$$\tilde{x} = (1, x_1, \dots, x_n)^T, \qquad \tilde{\omega} = (\omega_0, \omega_1, \dots, \omega_n)^T$$
$$s = \langle \tilde{\omega}, \tilde{x} \rangle = \omega_0 \cdot 1 + \sum_{i=1}^{n} \omega_i x_i \qquad h(s) = \begin{cases} 1 & \text{if } s \ge 0 \\ 0 & \text{if } s < 0 \end{cases}$$
Inputs xi ∈ R
Bias ω0
Weights ωi
Heaviside step function h
Output h(s) ∈ {0, 1}
Perceptron algorithm
Choose an initial weight vector ω̃
continue ← true
while continue do
    continue ← false
    for i from 1 to N do
        if (⟨ω̃, x̃i⟩ ≥ 0 and yi = 0) or (⟨ω̃, x̃i⟩ < 0 and yi = 1) then
            ω̃ ← ω̃ + (yi − 1_{⟨ω̃, x̃i⟩ ≥ 0}) x̃i
            continue ← true
        end if
    end for
end while
Return ω̃
This algorithm converges in a finite number of iterations... if a solution exists (i.e. if the two classes are linearly separable)!
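A minimal NumPy sketch of this algorithm (not code from the presentation); the max_epochs guard is an addition so that the loop also stops when no separating hyperplane exists:

```python
import numpy as np

def perceptron(X, y, max_epochs=1000):
    """Perceptron learning rule on augmented inputs (a 1 is stacked for the bias).

    X: (N, n) array of features, y: (N,) array of 0/1 labels.
    Returns the augmented weight vector w_tilde of length n + 1."""
    X_tilde = np.hstack([np.ones((X.shape[0], 1)), X])  # stack a 1 in front of each x_i
    w = np.zeros(X_tilde.shape[1])                       # initial weight vector
    for _ in range(max_epochs):
        updated = False
        for x_i, y_i in zip(X_tilde, y):
            pred = 1 if np.dot(w, x_i) >= 0 else 0
            if pred != y_i:                              # misclassified point
                w += (y_i - pred) * x_i                  # w <- w + (y_i - prediction) x_i
                updated = True
        if not updated:                                  # full pass without error: done
            return w
    return w

# Toy usage: learn the OR function
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 1, 1, 1])
print(perceptron(X, y))
```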
Implementation
Regression for binary outcomes
Probit model (Bliss and Fisher, 1935) and Logistic Regression (Berkson, 1944)
Probit model: h = G, the cumulative distribution function of the standard Gaussian
Logistic model:
$$h(x) = \frac{1}{1 + e^{-x}}$$
The same neuron as the perceptron, with the Heaviside step replaced by a smooth activation:
$$s = \langle \tilde{\omega}, \tilde{x} \rangle \qquad h(s) \in (0, 1)$$
Inputs xi ∈ R
Bias ω0
Weights ωi
Activation function h (e.g. the logistic function)
Output h(s) ∈ (0, 1)
Inference by maximum likelihood
Writing $p_i(\tilde\omega) = h(\langle \tilde\omega, \tilde{x}_i \rangle)$ for the predicted probability of observation i, the likelihood of the observations is
$$\mathrm{Likelihood}(\tilde\omega; y_1, \dots, y_N) = \prod_{i=1}^{N} p_i(\tilde\omega)^{y_i} \left(1 - p_i(\tilde\omega)\right)^{1 - y_i}$$
It is equivalent to maximize the log-likelihood
$$\sum_{i=1}^{N} \left[y_i \log p_i(\tilde\omega) + (1 - y_i) \log\left(1 - p_i(\tilde\omega)\right)\right]$$
or to minimize the binary cross-entropy
$$L(\tilde\omega) = -\sum_{i=1}^{N} \left[y_i \log p_i(\tilde\omega) + (1 - y_i) \log\left(1 - p_i(\tilde\omega)\right)\right]$$
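A small NumPy sketch of this loss and of its gradient for the logistic model (names and shapes are illustrative assumptions, not code from the presentation). With the sigmoid activation the gradient collapses to X̃ᵀ(p − y), one instance of the simplifications mentioned later for the loss functions:

```python
import numpy as np

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

def binary_cross_entropy(w_tilde, X_tilde, y):
    """Loss L(w) above: negative log-likelihood of the logistic model.

    X_tilde: (N, n+1) augmented features (column of ones for the bias), y: labels in {0, 1}."""
    p = sigmoid(X_tilde @ w_tilde)                  # p_i(w) for every observation
    return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

def gradient(w_tilde, X_tilde, y):
    """Gradient of L(w); with the sigmoid it simplifies to X_tilde^T (p - y)."""
    p = sigmoid(X_tilde @ w_tilde)
    return X_tilde.T @ (p - y)
```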
Neural network
Multi-Layer Perceptron
A neural network is a graph in which the outputs of some neurons (nodes) become the inputs of other neurons (directed graph).
One layer
INPUT: $v^{(l-1)} \in \mathbb{R}^{n_{l-1}}$    OUTPUT: $v^{(l)} \in \mathbb{R}^{n_l}$
2 attributes for layer l:
1) A weight matrix W̃^(l)
2) An activation function h^(l)
3 steps:
1) Stack a 1 on top of the input: $\tilde{v}^{(l-1)} = (1, v^{(l-1)})$
2) Compute $s^{(l)} = \tilde{W}^{(l)} \tilde{v}^{(l-1)}$
3) Return $v^{(l)} = h^{(l)}(s^{(l)})$
Component-wise:
$$s_i^{(l)} = \omega_{i,0}^{(l)} \cdot 1 + \sum_{j=1}^{n_{l-1}} \omega_{i,j}^{(l)} v_j^{(l-1)}$$
In matrix form:
$$\underbrace{\begin{pmatrix} s_1^{(l)} \\ s_2^{(l)} \\ \vdots \\ s_{n_l}^{(l)} \end{pmatrix}}_{s^{(l)}}
= \underbrace{\begin{pmatrix}
\omega_{1,0}^{(l)} & \omega_{1,1}^{(l)} & \cdots & \omega_{1,n_{l-1}}^{(l)} \\
\omega_{2,0}^{(l)} & \omega_{2,1}^{(l)} & \cdots & \omega_{2,n_{l-1}}^{(l)} \\
\vdots & \vdots & \ddots & \vdots \\
\omega_{n_l,0}^{(l)} & \omega_{n_l,1}^{(l)} & \cdots & \omega_{n_l,n_{l-1}}^{(l)}
\end{pmatrix}}_{\tilde{W}^{(l)}}
\underbrace{\begin{pmatrix} 1 \\ v_1^{(l-1)} \\ \vdots \\ v_{n_{l-1}}^{(l-1)} \end{pmatrix}}_{\tilde{v}^{(l-1)}}$$
and $v^{(l)} = h^{(l)}(s^{(l)})$, where $h^{(l)}$ is applied component-wise.
Summary
The l-th layer of perceptrons is a function $f^{(l)}_{\tilde{W}^{(l)}}$ from $\mathbb{R}^{n_{l-1}}$ to $\mathbb{R}^{n_l}$ defined by:
$$v^{(l)} = f^{(l)}_{\tilde{W}^{(l)}}(v^{(l-1)}) = h^{(l)}\left(\tilde{W}^{(l)} \tilde{v}^{(l-1)}\right) = h^{(l)}\left(b^{(l)} + W^{(l)} v^{(l-1)}\right)$$
Multi-Layer Perceptron
Set $v^{(0)} = x$
For l = 1, ..., L compute
$$v^{(l)} = f^{(l)}_{\tilde{W}^{(l)}}(v^{(l-1)})$$
Return $v^{(L)} = f_{\mathbf{W}}(x)$
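A minimal NumPy sketch of the three layer steps and of the forward pass (layer sizes and activations are illustrative assumptions, not the networks used in the presentation):

```python
import numpy as np

def layer_forward(v_prev, W_tilde, h):
    """One layer: stack a 1, multiply by the augmented weight matrix, apply the activation."""
    v_tilde = np.concatenate([[1.0], v_prev])   # step 1: stack a 1 on top of the input
    s = W_tilde @ v_tilde                        # step 2: s^(l) = W_tilde^(l) v_tilde^(l-1)
    return h(s)                                  # step 3: v^(l) = h^(l)(s^(l))

def mlp_forward(x, weights, activations):
    """Forward pass of the multi-layer perceptron: v^(0) = x, then chain the layers."""
    v = x
    for W_tilde, h in zip(weights, activations):
        v = layer_forward(v, W_tilde, h)
    return v                                     # v^(L) = f_W(x)

# Toy usage: 3 inputs -> 4 hidden units -> 1 output; augmented matrices are (n_l, n_{l-1}+1)
rng = np.random.default_rng(0)
weights = [rng.normal(size=(4, 4)), rng.normal(size=(1, 5))]
activations = [np.tanh, lambda s: 1 / (1 + np.exp(-s))]
print(mlp_forward(np.array([0.2, -0.1, 0.5]), weights, activations))
```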
Capacity of a neural network
Universal approximation theorem (Hornik, 1991)
Let f be a continuous function f : [0, 1]^n → R and ε > 0.
Let h^(1) be a non-constant, increasing, bounded real function.
Then, there exist
an integer q,
a matrix of weights W̃^(1) ∈ R^{q×n},
a vector of weights ω^(2) ∈ R^q,
a real b^(2),
such that for all x ∈ [0, 1]^n,
$$\left| f(x) - b^{(2)} - \left\langle \omega^{(2)}, h^{(1)}\!\left(\tilde{W}^{(1)} \tilde{x}\right) \right\rangle \right| < \varepsilon$$
Loss functions
$$L(\mathbf{W}) = \sum_{i=1}^{N} \delta\left(y_i, f_{\mathbf{W}}(x_i)\right)$$
Regression
$$\delta\left(y, f_{\mathbf{W}}(x)\right) = \sum_{k=1}^{m} \left(y_k - f_{\mathbf{W}}(x)_k\right)^2$$
Sum of squares
Likelihood for Gaussian noise
Simplifications occur when the activation function of the output layer is the identity
Binary classification
$$\delta\left(y, f_{\mathbf{W}}(x)\right) = -y \log\left(f_{\mathbf{W}}(x)\right) - (1 - y) \log\left(1 - f_{\mathbf{W}}(x)\right)$$
Binary cross-entropy
Likelihood for Bernoulli
Simplifications occur when the activation function of the output layer is the sigmoid (a.k.a. logistic)
Multiclass classification
$$\delta\left(y, f_{\mathbf{W}}(x)\right) = -\sum_{k=1}^{K} y_k \log\left(f_{\mathbf{W}}(x)_k\right)$$
Cross-entropy
Likelihood for the multinomial distribution
Simplifications occur when the activation function of the output layer is the softmax
Softmax activation function
$$v_i = \frac{e^{s_i}}{\sum_{j=1}^{n_L} e^{s_j}}$$
so that
$$\sum_{i=1}^{n_L} v_i = 1 \quad \text{and} \quad v_i > 0$$
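A small sketch of the softmax (illustrative; the max-subtraction is a standard numerical-stability trick, not something stated on the slide):

```python
import numpy as np

def softmax(s):
    """Softmax activation: outputs are positive and sum to 1.

    Subtracting the max does not change the result but avoids overflow in exp."""
    e = np.exp(s - np.max(s))
    return e / np.sum(e)

print(softmax(np.array([2.0, 1.0, 0.1])))  # approx. [0.659, 0.242, 0.099]
```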
Training neural networks: gradient descent
The gradient
$$\nabla L(\tilde\omega) = \begin{pmatrix} \frac{dL}{d\omega_0}(\tilde\omega) \\ \vdots \\ \frac{dL}{d\omega_n}(\tilde\omega) \end{pmatrix}$$
Initialisation: choose $\tilde\omega^{(0)} \in \mathbb{R}^{n+1}$
$$\tilde\omega^{(t+1)} = \tilde\omega^{(t)} - \frac{\eta}{N} \nabla L\left(\tilde\omega^{(t)}\right)$$
η > 0 is the learning rate, chosen by the user
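A minimal sketch of this update rule (illustrative only: the toy quadratic loss and learning rate are arbitrary, and the 1/N factor from the slide is assumed to be folded into grad_L or eta):

```python
import numpy as np

def gradient_descent(grad_L, w0, eta=0.1, n_steps=1000):
    """Plain gradient descent: w^(t+1) = w^(t) - eta * grad_L(w^(t))."""
    w = np.array(w0, dtype=float)
    for _ in range(n_steps):
        w = w - eta * grad_L(w)
    return w

# Toy usage: minimize L(w) = ||w - 3||^2, whose gradient is 2 (w - 3)
print(gradient_descent(lambda w: 2 * (w - 3.0), w0=[0.0, 0.0]))  # converges to [3, 3]
```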
Training a neural network: back-propagation
Chain rule:
$$(f \circ g \circ h)'(x) = f'\left(g(h(x))\right) \cdot g'\left(h(x)\right) \cdot h'(x)$$
Leibniz notation:
$$x \xrightarrow{h} y \xrightarrow{g} u \xrightarrow{f} z \qquad \frac{dz}{dx} = \frac{dz}{du}\,\frac{du}{dy}\,\frac{dy}{dx}$$
Back-propagation
$$\frac{dL}{d\omega^{(L)}_{i,j}} = \underbrace{\frac{dL}{dv^{(L)}}\,\frac{dv^{(L)}}{ds^{(L)}}}_{E^{(L)}}\;\frac{ds^{(L)}}{d\omega^{(L)}_{i,j}}$$
$$\frac{dL}{d\omega^{(L-1)}_{i,j}} = \underbrace{E^{(L)}\,\frac{ds^{(L)}}{dv^{(L-1)}}\,\frac{dv^{(L-1)}}{ds^{(L-1)}}}_{E^{(L-1)}}\;\frac{ds^{(L-1)}}{d\omega^{(L-1)}_{i,j}}$$
$$\frac{dL}{d\omega^{(L-2)}_{i,j}} = \underbrace{E^{(L-1)}\,\frac{ds^{(L-1)}}{dv^{(L-2)}}\,\frac{dv^{(L-2)}}{ds^{(L-2)}}}_{E^{(L-2)}}\;\frac{ds^{(L-2)}}{d\omega^{(L-2)}_{i,j}}$$
The error terms E^(l) are computed recursively from the output layer back towards the input; each weight gradient then only requires the local factor $ds^{(l)}/d\omega^{(l)}_{i,j}$.
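To make the recursion concrete, here is a minimal sketch of back-propagation for a two-layer network. The tanh hidden activation, sigmoid output and binary cross-entropy loss are choices made for the illustration (with them the output error term E^(L) reduces to p − y); it is not code from the presentation:

```python
import numpy as np

def sigmoid(s):
    return 1 / (1 + np.exp(-s))

def forward_backward(x, y, W1, b1, W2, b2):
    """Backprop for a 2-layer net (tanh hidden layer, sigmoid output, binary cross-entropy).

    Returns the loss and the gradients with respect to every weight."""
    # Forward pass (keep the intermediate values needed by the backward pass)
    s1 = W1 @ x + b1
    v1 = np.tanh(s1)
    s2 = W2 @ v1 + b2
    p = sigmoid(s2)                              # network output in (0, 1)
    loss = -(y * np.log(p) + (1 - y) * np.log(1 - p))

    # Backward pass: propagate the error terms E^(l) from the output layer backwards
    e2 = p - y                                   # dL/ds2 (sigmoid + cross-entropy simplification)
    dW2 = np.outer(e2, v1)
    db2 = e2
    e1 = (W2.T @ e2) * (1 - v1 ** 2)             # dL/ds1 (tanh' = 1 - tanh^2)
    dW1 = np.outer(e1, x)
    db1 = e1
    return loss, dW1, db1, dW2, db2

# Toy usage: one example with 3 inputs and 4 hidden units
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(4, 3)), np.zeros(4)
W2, b2 = rng.normal(size=(1, 4)), np.zeros(1)
print(forward_backward(np.array([0.5, -1.0, 2.0]), 1.0, W1, b1, W2, b2)[0])
```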
Stochastic gradient and variants
Reading group - Mike Pereira
The evaluation of the loss function gradient can be costly when the training set is large.
Instead of computing the gradient on the entire training set, it is computed on mini-batches of moderate size nb (e.g. nb = 20, but it can be 1).
$$\mathbf{W}^{(t+1)} = \mathbf{W}^{(t)} - \eta \nabla L_{n_b}\left(\mathbf{W}^{(t)}\right)$$
The number of epochs is the number of times the entire data set is visited.
At the beginning of each epoch, the data set is randomly shuffled.
The cost function and its gradient are divided by nb (to make the choice of η independent of nb).
Adaptive Moment Estimation (ADAM)
Kingma and Ba (2014)
Initialisation, then for t ≥ 1 the weights are updated using running estimates of the first and second moments of the gradient (see the sketch below).
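Since the slide's update equations did not survive extraction, the sketch below restates the published Adam algorithm of Kingma and Ba (2014); the default hyper-parameters are the ones proposed in that paper:

```python
import numpy as np

def adam(grad_L, w0, eta=0.001, beta1=0.9, beta2=0.999, eps=1e-8, n_steps=1000):
    """Adam update (Kingma and Ba, 2014).

    m and v are running estimates of the first and second moments of the gradient;
    m_hat and v_hat are the bias-corrected versions."""
    w = np.array(w0, dtype=float)
    m = np.zeros_like(w)
    v = np.zeros_like(w)
    for t in range(1, n_steps + 1):
        g = grad_L(w)
        m = beta1 * m + (1 - beta1) * g          # first moment estimate
        v = beta2 * v + (1 - beta2) * g ** 2     # second moment estimate
        m_hat = m / (1 - beta1 ** t)             # bias correction
        v_hat = v / (1 - beta2 ** t)
        w = w - eta * m_hat / (np.sqrt(v_hat) + eps)
    return w

# Toy usage: minimize L(w) = ||w - 3||^2
print(adam(lambda w: 2 * (w - 3.0), w0=[0.0, 0.0], eta=0.1, n_steps=500))
```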
Generative model: the problem
Aim
Given a (large?) set of training examples, generate new examples which "look like" the training examples in terms of
fidelity or realism
diversity or variability
In other words, given an empirical distribution PD over Rn, find a probability distribution (the model) PG, together with its simulation algorithm (the generator), which is close to PD (or to its theoretical version P).
First idea: use a Gaussian transform approach
Generate a latent vector
z ∼ Nn(0, In)
Compute
x = G(z)
where G = fW is a neural network with weights W to determine.
Mathematical justifications:
Most distributions on Rn can be written as a function of a Gaussian vector of Rn with independent components (inversion method + decomposition of multivariate densities as a product of conditional distributions)
By using a high-capacity neural network G, we can approximate the target function (universal approximation theorem)
Generator
[Diagram: NOISE → generator network → FAKE image]
Example of geostatistical deep generative model
Cholesky algorithm
Σ = L Lᵀ
Sample z ∼ Nn(0, In)
Return x = µ + L z
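A minimal NumPy sketch of the Cholesky algorithm (the exponential covariance on a 1D grid is only an illustrative choice, not a model from the presentation):

```python
import numpy as np

def cholesky_simulation(mu, sigma, rng=np.random.default_rng()):
    """Simulate a Gaussian vector N(mu, sigma) by the Cholesky algorithm above."""
    L = np.linalg.cholesky(sigma)        # sigma = L L^T
    z = rng.standard_normal(len(mu))     # z ~ N_n(0, I_n)
    return mu + L @ z                    # x = mu + L z

# Toy usage: exponential covariance on a regular 1D grid of 100 points
coords = np.arange(100)
sigma = np.exp(-np.abs(coords[:, None] - coords[None, :]) / 10.0)
x = cholesky_simulation(np.zeros(100), sigma)
print(x[:5])
```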
Problem: How to handle the large number of parameters?
[https://line.17qq.com/articles/cmmcohgcpv.html]
Example - Dimension reduction (or compression)
Second idea: Dimension reduction
[https://www.compthree.com/blog/autoencoder/]
Example: compression of Flumy images
Simulate in the latent space
Generator
[Diagram: NOISE → generator network → FAKE image]
Generative Adversarial Network (GAN)
A two-player game
[Diagram: the GENERATOR turns NOISE into a FAKE image; the DISCRIMINATOR receives REAL and FAKE images and must decide REAL or FAKE]
Generative Adversarial Networks (GAN)
Two adversarial neural networks
A generator X = Gθ(z), where z is a Gaussian noise N(0, Id) and θ are the weights
A discriminator Dϕ(X), which classifies images as "true" or "fake" (ϕ are the weights)
with the two networks trained on the adversarial objective (Goodfellow et al., 2014)
$$\min_\theta \max_\phi \; \mathbb{E}_{X \sim P_D}\left[\log D_\phi(X)\right] + \mathbb{E}_{z \sim N(0, I_d)}\left[\log\left(1 - D_\phi(G_\theta(z))\right)\right]$$
Training
Goodfellow et al. (2014)
for number of training iterations do
    for k steps do
        Sample a minibatch of mb noise samples {z1, . . . , zmb} ∼ N(0, Id)
        Sample a minibatch of mb examples {x1, . . . , xmb} from the training data
        Update the discriminator by ascending its stochastic gradient
        $$\nabla_\phi \frac{1}{m_b} \sum_{i=1}^{m_b} \left[\log D_\phi(x_i) + \log\left(1 - D_\phi(G_\theta(z_i))\right)\right]$$
    end for
    Sample a minibatch of mb noise samples {z1, . . . , zmb} ∼ N(0, Id)
    Update the generator by descending its stochastic gradient
    $$\nabla_\theta \frac{1}{m_b} \sum_{i=1}^{m_b} \log\left(1 - D_\phi(G_\theta(z_i))\right)$$
end for
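A compact PyTorch sketch of one training iteration (the architectures, hyper-parameters and fake "real" data are illustrative assumptions, and the inner "k steps" loop is omitted):

```python
import torch
import torch.nn as nn

latent_dim, data_dim, mb = 16, 64, 32
G = nn.Sequential(nn.Linear(latent_dim, 128), nn.ReLU(), nn.Linear(128, data_dim))
D = nn.Sequential(nn.Linear(data_dim, 128), nn.ReLU(), nn.Linear(128, 1), nn.Sigmoid())
opt_G = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_D = torch.optim.Adam(D.parameters(), lr=2e-4)

def training_step(real_batch):
    # Discriminator step: ascend log D(x) + log(1 - D(G(z))) (i.e. descend its opposite)
    z = torch.randn(mb, latent_dim)
    fake = G(z).detach()                                  # do not backpropagate through G here
    loss_D = -(torch.log(D(real_batch)) + torch.log(1 - D(fake))).mean()
    opt_D.zero_grad(); loss_D.backward(); opt_D.step()

    # Generator step: descend log(1 - D(G(z)))
    z = torch.randn(mb, latent_dim)
    loss_G = torch.log(1 - D(G(z))).mean()
    opt_G.zero_grad(); loss_G.backward(); opt_G.step()
    return loss_D.item(), loss_G.item()

# Toy call: "real" data is random noise here, only to make the sketch runnable
print(training_step(torch.randn(mb, data_dim)))
```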
Flumy generator

Results
Horizontal section

Example on Plurigaussian

Local adaptation

Conditioning
Zc → Gc(Zc) → X

Results
Reference image
9 conditional simulations

Results
Reference image
Conditional probabilities
References
Chapelle, O., J. Weston, L. Bottou, and V. Vapnik (2001). Vicinal risk minimization. Advances in neural
information processing systems, 416–422.
Glorot, X. and Y. Bengio (2010). Understanding the difficulty of training deep feedforward neural networks. In
Proceedings of the thirteenth international conference on artificial intelligence and statistics, pp. 249–256.
JMLR Workshop and Conference Proceedings.
Goodfellow, I., J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio (2014).
Generative adversarial nets. In Advances in neural information processing systems, pp. 2672–2680.
He, K., X. Zhang, S. Ren, and J. Sun (2016). Deep residual learning for image recognition. In Proceedings of the
IEEE conference on computer vision and pattern recognition, pp. 770–778.
Hornik, K. (1991). Approximation capabilities of multilayer feedforward networks. Neural networks 4 (2), 251–257.
Ioffe, S. and C. Szegedy (2015). Batch normalization: Accelerating deep network training by reducing internal
covariate shift. In International conference on machine learning, pp. 448–456. PMLR.
Kingma, D. P. and J. Ba (2014). Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 .
Krizhevsky, A., I. Sutskever, and G. E. Hinton (2012). Imagenet classification with deep convolutional neural
networks. Advances in neural information processing systems 25, 1097–1105.
LeCun, Y., L. Bottou, Y. Bengio, and P. Haffner (1998). Gradient-based learning applied to document recognition.
Proceedings of the IEEE 86 (11), 2278–2324.
Srivastava, N., G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov (2014). Dropout: A simple way to
prevent neural networks from overfitting. Journal of Machine Learning Research 15 (56), 1929–1958.
Szegedy, C., W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich
(2015). Going deeper with convolutions. In Proceedings of the IEEE conference on computer vision and
pattern recognition, pp. 1–9.