Mathematics of Deep Learning
Lecture Notes
Contents

2 Expressivity
  2.1 Viewpoint of approximation theory
3 Learning
  3.1 Training in overparameterized regimes
4 Generalization
  4.1 Empirical Risk Minimization
5 Interpretability
  5.1 Towards a more mathematical understanding
7 Partial differential equations
  7.1 Mathematics of PDEs
  7.2 Discretization
Index
List of Figures

1 An artificial neuron.
2 Different activation functions.
3 An artificial neural network.
4 Neural network architecture.
5 Under- and overfitting.
6 Three types of errors in learning problems.
7 Concatenation of neural networks.
8 Parallelization of neural networks.
9 A wavelet.
10 Tiling of the plane induced by wavelets.
11 The wavelet transform.
12 Anisotropic features govern natural images.
13 Intuitive explanation for why wavelets don't perform as well as shearlets.
14 The tiling of the frequency domain induced by cone-adapted shearlets.
15 All eight possible ways to shatter three non-collinear points in the plane.
16 A visualization of the $\phi_i$.
17 Visualization of the encoder and decoder.
18 The red line represents $\varepsilon^{-\gamma}$ and the blue one $\operatorname{Learn}(\varepsilon, \mathcal{C})$.
19 A neural network representing the hat function.
20 The functions $g_m$ for $m \in \{0, 1, 2\}$.
21 The functions $f_k$ for $k \in \{1, 2\}$ and $x^2$.
22 Neural network realizing multiplication.
23 TODO
24 The bump functions $f$ and $\phi$.
25 Numerical experiments with ReLU and backpropagation. [Source??]
26 Illustration of $\phi$ for $d = 2$.
27 Visual intuition for the theorem.
28 Some numerics.
29 Illustration of the concept of covering number.
30 Sometimes a "double descent curve" is observed.
31 The double descent curve.
1 Introduction to Neural Networks and Deep Learning
1.2 Deep Neural Networks

An artificial neuron with inputs $x \in \mathbb{R}^n$, weights $w \in \mathbb{R}^n$, bias $b \in \mathbb{R}$ and activation function $\varrho$ computes the map
$$\mathbb{R}^n \to \mathbb{R}, \quad x \mapsto \varrho(\langle x, w\rangle - b).$$
With the Heaviside activation, the output $y$ is $1$ if $\langle x, w\rangle - b > 0$ and $0$ if $\langle x, w\rangle - b \leq 0$.

Fig. 1: An artificial neuron. [1]
Definition 1.2.3 (Artificial Neural Network Types)
An artificial neural network (ANN) is a graph consisting of artificial neurons; it is called a feed-forward ANN if the graph is directed and acyclic, and a recurrent ANN otherwise.

Fig. 2: Different activation functions. [1]
The start of the Age of Data at the turn of the century saw a drastic improvement in computing power and vast amounts of data becoming available. This gave rise to deep neural networks (cf. Y. LeCun et al., ca. 2000).

Fig. 3: An artificial neural network.

The map
$$\mathbb{R}^d \to \mathbb{R}^{N_L}, \quad x \mapsto T_L\, \varrho(T_{L-1}\, \varrho(\dots \varrho(T_1(x))\dots))$$
describes a deep neural network with $L$ layers, where the $T_\ell(x) = A_\ell x + b_\ell$ are affine maps and $\varrho$ is applied componentwise.
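A minimal NumPy sketch of this realization map, assuming the ReLU activation and arbitrary randomly chosen affine maps $T_\ell(x) = A_\ell x + b_\ell$; all names and sizes below are illustrative choices, not taken from the lecture.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def realization(x, weights, biases):
    """Compute x -> T_L rho(T_{L-1} rho(... rho(T_1 x))) with T_l(z) = A_l z + b_l."""
    a = x
    for A, b in zip(weights[:-1], biases[:-1]):
        a = relu(A @ a + b)          # hidden layers: affine map followed by activation
    return weights[-1] @ a + biases[-1]   # last layer: affine only

# Example: d = 3 inputs, two hidden layers, N_L = 2 outputs.
rng = np.random.default_rng(0)
dims = [3, 5, 5, 2]
weights = [rng.standard_normal((dims[l + 1], dims[l])) for l in range(len(dims) - 1)]
biases  = [rng.standard_normal(dims[l + 1]) for l in range(len(dims) - 1)]
print(realization(rng.standard_normal(3), weights, biases))
```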
$$\min_{(A_\ell, b_\ell)_{\ell=1}^{L}} \sum_{i=1}^{m} \mathcal{L}\big(\Phi_{(A_\ell, b_\ell)}(x_i), f(x_i)\big) + \lambda\, \mathcal{R}\big((A_\ell, b_\ell)_{\ell=1}^{L}\big).$$
$$\mathcal{H} := \operatorname{span}\big((\phi_i)_{i=1}^{\ell}\big) \subset C(\mathbb{R}^n)$$
and the empirical error/risk by the mean squared error (MSE)
$$\mathcal{E}_z(f) := \frac{1}{m}\sum_{i=1}^{m}\big(f(x_i) - y_i^{\mathrm{train}}\big)^2.$$
Every $f \in \mathcal{H}$ can be expressed as $f = \sum_{i=1}^{\ell} w_i \phi_i$. For the sake of brevity we set $y := (y_i^{\mathrm{train}})_{i=1}^{m}$, $w := (w_i)_{i=1}^{\ell}$ and $A := (\phi_j(x_i))_{i,j} \in \mathbb{R}^{m \times \ell}$, so we can write $\mathcal{E}_z(f) = \|Aw - y\|^2$.
Letting $\hat{w} := \arg\min_{w \in \mathbb{R}^{\ell}} \mathcal{E}_z(f)$ be the minimiser, the sought estimate (= solution of the regression task) is given by
$$\hat{f} := \sum_{i=1}^{\ell} (\hat{w})_i\, \phi_i. \qquad \diamond$$
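A small numerical sketch of this regression step; the monomial basis $\phi_i(x) = x^{i-1}$ and the noisy sine data are illustrative assumptions, not prescribed by the notes.

```python
import numpy as np

rng = np.random.default_rng(1)
m, ell = 50, 4                                   # number of samples and basis functions
x_train = rng.uniform(-1, 1, size=m)
y_train = np.sin(np.pi * x_train) + 0.1 * rng.standard_normal(m)

# A_{ij} = phi_j(x_i) with the illustrative choice phi_j(x) = x^{j-1}
A = np.vander(x_train, N=ell, increasing=True)

# w_hat = argmin_w ||A w - y||^2 (least squares)
w_hat, *_ = np.linalg.lstsq(A, y_train, rcond=None)

f_hat = lambda x: np.vander(np.atleast_1d(x), N=ell, increasing=True) @ w_hat
print(w_hat, f_hat(0.5))
```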
We have $\sigma_X(S) = \sigma(\pi_X^{-1}(S))$, where $\pi_X : X \times Y \to X$, $(x, y) \mapsto x$. Then for every integrable function $\phi : Z \to \mathbb{R}$
$$\int_{X \times Y} \phi(x, y)\, d\sigma(x, y) = \int_X \left(\int_Y \phi(x, y)\, d\sigma(y \mid x)\right) d\sigma_X(x).$$
Definition 1.2.16 (Empirical error / target function)
The empirical error of $f$ with respect to loss $\ell$ and sample $S$ is
$$\mathcal{E}_S(f) := \frac{1}{N}\sum_{k=1}^{N} \ell(f(x_k), y_k).$$
An empirical target function $f_{\mathcal{H},S} \in \mathcal{H}$ is a minimiser of $\mathcal{E}_S$ over $\mathcal{H}$.

Fig. 5: Noisy (roughly linear) data is fitted to a linear function and a polynomial function. Although the polynomial function is a perfect fit, the linear function can be expected to generalize better: if the two functions were used to extrapolate beyond the fitted data, the linear function should make better predictions. [10]
Then $\Phi \in \mathcal{NN}_{8,3,7,8,2}$.
An artificial neuron is the realization of a NN $\Phi \in \mathcal{NN}_{d,d,d+1,1}$. ⋄
Fig. 7: Concatenation of neural networks. [2]
• $L\big(\Phi^{(1)} \circ \Phi^{(2)}\big) = L(\Phi^{(1)}) + L(\Phi^{(2)}) - 1$

hold, where the concatenation $\Phi^{(1)} \circ \Phi^{(2)}$ is defined as
$$\Big(\big(A_k^{(2)}, b_k^{(2)}\big)_{k=1}^{L_2-1},\ \big(A_1^{(1)} A_{L_2}^{(2)},\ A_1^{(1)} b_{L_2}^{(2)} + b_1^{(1)}\big),\ \big(A_k^{(1)}, b_k^{(1)}\big)_{k=2}^{L_1}\Big).$$
Lemma 1.2.21
Let $L \in \mathbb{N}$ and let $\Phi^{(1)}$ and $\Phi^{(2)}$ both have $L$ layers. For all activation functions $\varrho$ and all $x \in \mathbb{R}^d$,

• $R_\varrho\big(P(\Phi^{(1)}, \Phi^{(2)})\big)(x) = \big(R_\varrho(\Phi^{(1)})(x),\ R_\varrho(\Phi^{(2)})(x)\big)$,
• $L\big(P(\Phi^{(1)}, \Phi^{(2)})\big) = L(\Phi^{(1)})$,
• $M\big(P(\Phi^{(1)}, \Phi^{(2)})\big) = M(\Phi^{(1)}) + M(\Phi^{(2)})$

hold, where $P(\Phi^{(1)}, \Phi^{(2)}) := \big((\hat{A}_k, \hat{b}_k)\big)_{k=1}^{L}$ and
$$\hat{A}_1 := \begin{pmatrix} A_1^{(1)} \\ A_1^{(2)} \end{pmatrix}, \qquad \hat{A}_\ell := \begin{pmatrix} A_\ell^{(1)} & 0 \\ 0 & A_\ell^{(2)} \end{pmatrix}, \qquad \hat{b}_\ell := \begin{pmatrix} b_\ell^{(1)} \\ b_\ell^{(2)} \end{pmatrix} \quad \forall \ell \in [L].$$

Fig. 8: Parallelization of neural networks. [2]
1.3 Training of deep neural networks

Given samples $z := ((x_i, y_i))_{i=1}^{m} \subset \mathbb{R}^d \times \mathbb{R}^{N_L}$, we aim to find the empirical target function
$$f_z := \arg\min_{f \in \mathcal{H}} \sum_{i=1}^{m} \mathcal{L}(f, x_i, y_i),$$
where $\mathcal{L} : C(\mathbb{R}^d, \mathbb{R}^{N_L}) \times \mathbb{R}^d \times \mathbb{R}^{N_L} \to \mathbb{R}$ is a loss function, e.g. the MSE.
As optimization approach we choose gradient descent: for $F : \mathbb{R}^N \to \mathbb{R}$ one uses the iteration
$$u_{n+1} = u_n - \eta\, \nabla F(u_n), \qquad n \geq 0,$$
with step size $\eta > 0$.
For our problem we set $F := \sum_{i=1}^{m} \mathcal{L}(f, x_i, y_i)$ and $u := ((A_\ell, b_\ell))_{\ell=1}^{L}$. In order to find $\nabla_{((A_\ell, b_\ell))_{\ell=1}^{L}} F$ we need to compute
$$\frac{\partial \mathcal{L}(f, x, y)}{\partial (A_\ell)_{i,j}} \quad \text{and} \quad \frac{\partial \mathcal{L}(f, x, y)}{\partial (b_\ell)_i}$$
for all $i, j, \ell$.
This can be achieved by backpropagation: in a forward pass compute the pre-activations $z_\ell = A_\ell a_{\ell-1} + b_\ell$ and activations $a_\ell = \varrho(z_\ell)$, with $a_0 := x$, for $\ell \in \{0, \dots, L\}$. Now set $\delta_L := 2(f(x) - y)$. Then
$$\frac{\partial \mathcal{L}(f, x, y)}{\partial A_L} = \delta_L\, a_{L-1}^T \quad \text{and} \quad \frac{\partial \mathcal{L}(f, x, y)}{\partial b_L} = \delta_L.$$
For $\ell = L-1$ down to $1$ now set $\delta_\ell := \big(A_{\ell+1}^T \delta_{\ell+1}\big) \odot \varrho'(z_\ell)$. Then we have
$$\frac{\partial \mathcal{L}(f, x, y)}{\partial A_\ell} = \delta_\ell\, a_{\ell-1}^T \quad \text{and} \quad \frac{\partial \mathcal{L}(f, x, y)}{\partial b_\ell} = \delta_\ell,$$
and can return $\nabla_{(A_\ell, b_\ell)_{\ell=1}^{L}} \mathcal{L}(f, x, y)$.
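A compact numerical sketch of this backward pass for the squared loss, assuming a ReLU network whose last layer is purely affine; everything concrete here (shapes, random data) is an illustrative choice.

```python
import numpy as np

def relu(z):        return np.maximum(z, 0.0)
def relu_prime(z):  return (z > 0).astype(float)

def backprop(x, y, weights, biases):
    """Gradients of L = ||f(x) - y||^2 w.r.t. all (A_l, b_l), via the delta recursion."""
    # Forward pass: store pre-activations z_l and activations a_l (a_0 = x).
    a, zs, activations = x, [], [x]
    for A, b in zip(weights[:-1], biases[:-1]):
        z = A @ a + b
        zs.append(z)
        a = relu(z)
        activations.append(a)
    z_L = weights[-1] @ a + biases[-1]
    zs.append(z_L)
    activations.append(z_L)                              # last layer is affine only

    # Backward pass.
    grads_A = [None] * len(weights)
    grads_b = [None] * len(weights)
    delta = 2.0 * (activations[-1] - y)                  # delta_L = 2 (f(x) - y)
    grads_A[-1] = np.outer(delta, activations[-2])
    grads_b[-1] = delta
    for l in range(len(weights) - 2, -1, -1):            # layers L-1, ..., 1
        delta = (weights[l + 1].T @ delta) * relu_prime(zs[l])
        grads_A[l] = np.outer(delta, activations[l])
        grads_b[l] = delta
    return grads_A, grads_b

# Tiny usage example with random weights.
rng = np.random.default_rng(0)
dims = [3, 4, 2]
weights = [rng.standard_normal((dims[i + 1], dims[i])) for i in range(len(dims) - 1)]
biases  = [rng.standard_normal(dims[i + 1]) for i in range(len(dims) - 1)]
gA, gb = backprop(rng.standard_normal(3), rng.standard_normal(2), weights, biases)
print([g.shape for g in gA])
```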
Our goal therefore is to find a stationary point of $F := \sum_{i=1}^{m} F_i : \mathbb{R}^N \to \mathbb{R}$, where $F_i := \mathcal{L}(f, x_i, y_i)$.
An algorithm realizing this (stochastic gradient descent) takes as input the data, a neural network $f$ and a loss function $\mathcal{L}$. After initializing a starting value $u_0$ and $n = 0$, while the error is large, choose $i^* \in [m]$ uniformly at random and update $u_{n+1} \leftarrow u_n - \eta \nabla F_{i^*}(u_n)$ and $n \leftarrow n + 1$. Finally, return $u_n$.
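A minimal sketch of this stochastic gradient descent loop. As a stand-in for the network loss, each summand is the illustrative quadratic $F_i(u) = (\langle x_i, u\rangle - y_i)^2$, and the stopping rule for "while the error is large" is an arbitrary tolerance.

```python
import numpy as np

# SGD on F(u) = sum_i F_i(u) with F_i(u) = (<x_i, u> - y_i)^2.
rng = np.random.default_rng(0)
m, N = 100, 10
X = rng.standard_normal((m, N))
u_true = rng.standard_normal(N)
y = X @ u_true

def grad_F_i(u, i):
    return 2.0 * (X[i] @ u - y[i]) * X[i]               # gradient of the single summand F_{i*}

u, eta, n = np.zeros(N), 1e-2, 0
while np.mean((X @ u - y) ** 2) > 1e-8 and n < 100_000:  # "while the error is large"
    i_star = rng.integers(m)                             # choose i* uniformly at random
    u = u - eta * grad_F_i(u, i_star)                    # u_{n+1} <- u_n - eta * grad F_{i*}(u_n)
    n += 1
print(n, np.linalg.norm(u - u_true))
```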
2 Expressivity
Expressivity is concerned with the following questions.
• Which architecture to choose for a particular application?
• What is the expressive power of a given architecture?
• What effect does the depth of a neural network have in this respect?
• What is the complexity of the approximating neural network?
• What are suitable function spaces to consider?
• Can deep neural networks beat the curse of dimensionality?
Mathematically, we ask under which conditions on a neural network $\Phi$ and an activation function $\varrho$ every function from a prescribed function class $\mathcal{C}$ can be approximated arbitrarily well, i.e. $\|R_\varrho(\Phi) - f\| \leq \varepsilon$ for all $f \in \mathcal{C}$.
$$\sup_{f \in \mathcal{C}} \|f - f_M\|_2 = O(M^{-\gamma}) \quad \text{as } M \to \infty$$
2.1 Viewpoint of approximation theory
where
$$f \mapsto \big((\langle f, \phi_m\rangle)_m,\ (\langle f, \psi_{j,m,i}\rangle)_{j,m,i}\big). \tag{2}$$

Fig. 11: The wavelet transform. [2]

In order to find mathematical models for images we see that their key components are usually shapes separated by an edge, i.e. a singularity.

Definition 2.1.5 (Cartoon-like functions (Donoho))
The set of cartoon-like functions is
$$\mathcal{E}^2(\mathbb{R}^2) := \{f \in L^2(\mathbb{R}^2) : f = f_0 + f_1 \cdot 1_B\},$$
$$\{\psi(\cdot - m) : m \in \mathbb{Z}^2\},$$
$$\{2^{3j/4}\, \psi(S_k A_j x - m) : j \geq 0,\ |k| \leq \lceil 2^{j/2}\rceil,\ m \in \mathbb{Z}^2\},$$
$$\{2^{3j/4}\, \tilde{\psi}(\tilde{S}_k \tilde{A}_j x - m) : j \geq 0,\ |k| \leq \lceil 2^{j/2}\rceil,\ m \in \mathbb{Z}^2\},$$
2.3 Lower bounds for approximation
for $N, M, d \in \mathbb{N}$. Then
$$\operatorname{VCdim}(\mathcal{H}_{N,M,d,L}) \leq 2ML \log_2\!\left(\frac{4MLpN}{\ln(2)}\right) + 2ML^2 \log_2(\ell + 1) + 2L$$
holds.
Corollary 2.3.3
For fixed $p$ and $\ell$ and $N \leq M$ this implies ... holds.

$$M(\varepsilon) \in \Omega\big(\varepsilon^{-\frac{d}{n(1+\delta)}}\big)$$

$$g(x) = \frac{\eta^n}{2M^d} \sum_{i=1}^{m} d_i\, \phi_i(x).$$

$$m = M^{\eta} \in O(M \log(M)) \quad \forall \delta > 0,$$
which is a contradiction. □
where $\operatorname{Num}(c)$ is the real value associated with the bit sequence $c$.
The distortion of $(E^\ell, D^\ell)$ is bounded by
$$\max_{k=1,\dots,2^n}\big(|x_k - x_{k+1}|,\ |x_k - 1|\big).$$
Now choosing
$$n \sim \frac{\ell}{\log(\ell)} \quad \text{and} \quad |x_i - x_{i+1}| \sim \frac{1}{n}$$
yields a distortion of $\frac{\log(\ell)}{\ell}$. ⋄

Fig. 17: Visualization of the encoder and decoder.
Remark 2.3.8 This bound is in terms of the edges, hence the sparsity of the connectivity, not the neurons. However, the number of neurons is always bounded, up to a uniform constant, by the number of edges.

Fig. 18: The red line represents $\varepsilon^{-\gamma}$ and the blue one $\operatorname{Learn}(\varepsilon, \mathcal{C})$.

28.11.2019

Now set
2.4 Upper bounds and optimality
$$\|D^\varepsilon(E^\varepsilon(f)) - f\|_{L^2} = \big\|R_\varrho\big(T_\varepsilon^{-1}(T_\varepsilon(\operatorname{Learn}(\varepsilon, f)))\big) - f\big\|_{L^2}$$
Theorem 2.4.1

ReLU Calculus

Notice that the ReLU activation function is not much different from any other piecewise linear activation function with finitely many breakpoints: one can replace one network by an equivalent one having another such activation function while only increasing the number of units and weights by constant factors ([5], Proposition 1).

Fig. 19: A neural network representing the hat function. [2]
Lemma 2.4.3 (Square function [5])
For all $\varepsilon > 0$ there exists a neural network $\Phi$ with $O(\log(\varepsilon^{-1}))$ layers and neurons such that $\|R_\varrho(\Phi)(x) - x^2\|_{L^\infty([0,1])} \leq \varepsilon$.

Fig. 20: The functions $g_m$ for $m \in \{0, 1, 2\}$. [6], Subsection 2.2.

Proof. We have $x^2 = x - \sum_{m=1}^{\infty} 2^{-2m} g_m(x)$. Defining $f_k(x) := x - \sum_{m=1}^{k} 2^{-2m} g_m(x)$ yields $\|x^2 - f_k\|_{L^\infty([0,1])} = 2^{-2(k+1)}$, and $f_k$ can be realized by a neural network with $O(k)$ layers and nodes. Setting $\varepsilon = 2^{-2(k+1)}$ yields $k \in O\big(\log_2\big(\frac{1}{\varepsilon}\big)\big)$. □
Proof. Monomials can be approximated by iteratively approximating the square and multiplication with the identity (cf. Lemma 1.2.22). Therefore the result follows, as neural networks are closed with respect to linear combinations. □

Fig. 22: Neural network realizing multiplication. (really?) [5], Subsection 3.1.
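A small numerical check of the construction in Lemma 2.4.3, assuming the usual hat function $g$ on $[0,1]$ and its $m$-fold compositions $g_m = g \circ \dots \circ g$; purely illustrative.

```python
import numpy as np

def hat(x):
    """Hat function g on [0,1], realizable with ReLUs: 2 relu(x) - 4 relu(x - 1/2) + 2 relu(x - 1)."""
    r = lambda t: np.maximum(t, 0.0)
    return 2 * r(x) - 4 * r(x - 0.5) + 2 * r(x - 1.0)

def g_m(x, m):
    """m-fold composition g o ... o g (a sawtooth with 2^{m-1} teeth)."""
    for _ in range(m):
        x = hat(x)
    return x

def f_k(x, k):
    """Approximation of x^2: f_k(x) = x - sum_{m=1}^k 2^{-2m} g_m(x)."""
    return x - sum(2.0 ** (-2 * m) * g_m(x, m) for m in range(1, k + 1))

x = np.linspace(0.0, 1.0, 1001)
for k in (1, 2, 4, 8):
    err = np.max(np.abs(x ** 2 - f_k(x, k)))
    print(k, err, 2.0 ** (-2 * (k + 1)))   # observed error matches the 2^{-2(k+1)} bound
```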
2.5 Approximation by Wavelets, Shearlets and more
is an affine system.

Remark 2.5.2 The factor $\sqrt{|A_j|}$ ensures that all $\phi_i$, where $i$ is an enumeration of the indices in the above set, have equal norm.

then there exists a deep neural network $\Phi$ with $O(M)$ edges such that $\|f - R_\varrho(\Phi)\| \leq 2\varepsilon$.
Proof. We have $\|\psi_s - R_\varrho(\Phi_s)\| \leq \varepsilon$ with $M(\Phi_s) \leq C$. Since the map $x \mapsto A_j x - b$ is affine linear, we have $\|\psi_i - R_\varrho(\tilde{\Phi}_i)\| \leq \varepsilon$ and $M(\tilde{\Phi}_i) \leq C$.
Looking at the picture to the right, $M(\Phi) \in O(M)$. By the triangle inequality,
$$\|f - R_\varrho(\Phi)\| \leq \Big\|f - \sum_{i=1}^{M} d_i \Phi_i\Big\| + \Big\|\sum_{i=1}^{M} d_i \Phi_i - R_\varrho(\Phi)\Big\| \leq 2\varepsilon. \qquad \square$$

Fig. 23: TODO
$$\|f - R_\varrho(\Phi)\|_{L^2} \leq C_\varepsilon N^{-1+\varepsilon}.$$
Remark 2.5.5 This is the optimal rate; hence the first bound is sharp!
Some numerics
Typically weights are learnt by backpropagation. This raises the following
question: Does this lead to the optimal sparse connectivity?
In our setup we consider a fixed ReLU network architecture and learn specific functions with stochastic gradient descent. We then analyze the learnt subnetworks to study the connection between approximation error and number of edges.
Feel free to verify that we have answered all questions posed at the
beginning of subsection 2.3.
2.6 Impact of Depth
Remark 2.6.1 This result holds for virtually all known activation functions, including rectified linear units, sigmoids and thresholds (i.e. $1_{[0,\infty)}$). It shows that depth, even if increased only by 1, can be exponentially more valuable than width for standard feedforward neural networks.
Assumption 1: The activation function $\varrho$ is measurable and satisfies

Then (using distribution theory) it holds that $\operatorname{supp}(\hat{f}) = \bigcup_{i=1}^{k} \operatorname{span}(v_i)$. Then
$$\widehat{f\varphi} = \hat{f} * \hat{\varphi} = \hat{f} * 1_{B(0,1)}$$
and therefore
$$\operatorname{supp}(\widehat{f\varphi}) = \bigcup_{i=1}^{k} \big(\operatorname{span}(v_i) + B(0,1)\big) =: T,$$
$$\frac{\operatorname{Vol}(T \cap S^{d-1})}{\operatorname{Vol}(S^{d-1})} \leq k e^{-d}.$$
Goal: find $g$ such that $g\varphi$ is radial and $\widehat{g\varphi}$ has most of its energy away from zero.
Main technical part: Ansatz: $\varphi(x) \cdot \hat{g}(x) = \sum_i \varepsilon_i c_i g_i$, where the $g_i$ are carefully chosen with respect to the $\hat{g}_i$. □
We will now take a look at compositional functions and a theorem from [32], which roughly states that "Deep (hierarchical) networks can approximate the class of compositional functions with the same accuracy as shallow networks but with exponentially lower number of (training) parameters."

Fig. 27: Visual intuition for the theorem. [2]
hold.
3 Learning
10.12.19
3.1 Training in overparameterized regimes
In the overparameterized regime the number of neurons exceeds the number of samples, enabling zero training error and therefore a globally optimal solution and convergence of SGD with high probability over the initialization.
Consider the following two-layer network: let $\varrho$ be the ReLU activation function, $(w_r)_{r=1}^{n} \subset \mathbb{R}^d$ the weight vectors of the first layer, $W$ the associated matrix and $a \in \mathbb{R}^n$ the output weights. TODO: FIGURE
Consider the neural network $\Phi = (W, a)$ and its realization
$$R_\varrho(\Phi) : \mathbb{R}^d \to \mathbb{R}, \quad x \mapsto \frac{1}{\sqrt{n}} \sum_{r=1}^{n} a_r\, \varrho(\langle w_r, x\rangle)$$
$$W(k+1) = W(k) - \eta\, \frac{\partial F(W(k), a)}{\partial W(k)}, \tag{7}$$
Let $u_i(t)$ for $i \in [m]$ be the prediction on input $x_i$ at time $t$ and $u(t) := (u_i(t))_{i=1}^{m}$ be the prediction vector at time $t$. Further let
$$H^\infty_{i,j} := \mathbb{E}_{w \sim \mathcal{N}(0,1)}\big[\langle x_i, x_j\rangle\, 1_{\{\langle w, x_i\rangle \geq 0,\ \langle w, x_j\rangle \geq 0\}}\big]$$
where
$$H_{i,j}(t) = \frac{\langle x_i, x_j\rangle}{n} \sum_{r=1}^{n} 1_{\{\langle x_i, w_r(t)\rangle > 0,\ \langle x_j, w_r(t)\rangle > 0\}}.$$
Thus,
$$\frac{d}{dt} u(t) = H(t)\big(y - u(t)\big). \tag{8}$$
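A small sketch comparing $H(0)$ with a Monte Carlo estimate of $H^\infty$ under the random initialization described above; the data, the width and the sample counts are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
d, m, n = 5, 10, 5000                               # input dim, samples, hidden width
X = rng.standard_normal((m, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)       # normalized inputs
W = rng.standard_normal((n, d))                     # first-layer weights w_r ~ N(0, I)

# H(0)_{ij} = <x_i, x_j>/n * sum_r 1{<x_i,w_r> >= 0, <x_j,w_r> >= 0}
act = (X @ W.T >= 0).astype(float)                  # m x n matrix of activation patterns
H0 = (X @ X.T) * (act @ act.T) / n

# Monte Carlo estimate of H^infty (expectation over w ~ N(0, I))
W_mc = rng.standard_normal((200_000, d))
act_mc = (X @ W_mc.T >= 0).astype(float)
H_inf = (X @ X.T) * (act_mc @ act_mc.T) / W_mc.shape[0]

print(np.linalg.norm(H0 - H_inf, 2))                # shrinks as the width n grows
```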
2. We have
$$\|H(0) - H^\infty\|_2 = O\Big(\frac{1}{\sqrt{n}}\Big) \quad \text{and} \quad \|H(t) - H(0)\|_2 = O\Big(\frac{1}{\sqrt{n}}\Big) \quad \forall t \geq 0,$$
i.e. the higher the overparametrization (larger $n$), the more the dynamics of the predictions are governed by $H^\infty$.
3. If $n \in \Omega\big(\frac{m^2}{\lambda_0^2} \log\big(\frac{m}{\delta}\big)\big)$, then with probability at least $1 - \delta$ we have
$$\|H(0) - H^\infty\|_2 \leq \frac{\lambda_0}{4} \quad \text{and} \quad \lambda_{\min}(H(0)) \geq \frac{3}{4}\lambda_0.$$
This is proven with concentration bounds. TODO: WHAT'S THAT?
4. $H(t)$ is stable in terms of $W(t)$: if the $w_r$ are i.i.d. from $\mathcal{N}(0, 1)$, then with probability $1 - \delta$ we have: for all $(w_r(t))_{r=1}^{n} \subset \mathbb{R}^d$ with
$$\|w_r(0) - w_r(t)\|_2 \leq \frac{c\,\delta\lambda_0}{m^2} =: R \tag{9}$$
for some $c$ and all $r$, it holds that
$$\|H(t) - H(0)\|_2 \leq \frac{\lambda_0}{4} \quad \text{and} \quad \lambda_{\min}(H(t)) \geq \frac{\lambda_0}{2}.$$
Define
$$A_{i,r} := \big\{\exists w : \|w - w_r(0)\| \leq R,\ 1_{\{\langle x_i, w\rangle \geq 0\}} \neq 1_{\{\langle x_i, w_r(0)\rangle \geq 0\}}\big\}.$$
Then
$$P(A_{i,r}) = P_{z \sim \mathcal{N}(0,1)}(|z| < R) \leq \frac{2R}{\sqrt{2\pi}} \tag{10}$$
holds. Then,
$$\mathbb{E}\big[\big|H_{i,j}(0) - H_{i,j}(t)\big|\big] \leq \frac{1}{n}\sum_{r=1}^{n} \mathbb{E}\big[1_{A_{i,r} \cup A_{j,r}}\big] \overset{(10)}{\leq} \frac{4R}{\sqrt{2\pi}}.$$
Hence,
$$\lambda_{\min}(H(t)) \geq \lambda_{\min}(H(0)) - \frac{4\lambda_0}{\sqrt{2\pi}} \geq \frac{\lambda_0}{2}.$$
5. If $\lambda_{\min}(H(s)) \geq \frac{\lambda_0}{2}$ for all $s \in [0, t]$, then
$$\frac{d}{dt}\|y - u(t)\|_2^2 = -2\,(y - u(t))^T H(t)\,(y - u(t)) \leq -\lambda_0 \|y - u(t)\|_2^2,$$
where the inequality comes from $\lambda_{\min}(H(s)) \geq \frac{\lambda_0}{2}$. This implies
$$\frac{d}{dt}\Big(e^{\lambda_0 t}\|y - u(t)\|_2^2\Big) \leq 0,$$
so $e^{\lambda_0 t}\|y - u(t)\|_2^2$ is non-increasing in $t$, implying (11).
6. If $R' < R$, then
• $\lambda_{\min}(H(t)) \geq \frac{\lambda_0}{2}$ for all $t \geq 0$,
• $\|w_r(t) - w_r(0)\|_2 \leq R'$ for all $r$,
• (11) holds.
We have $R' < R$ if $n$ is large enough. □
Let the assumptions as above be fulfilled and let the number of hidden neurons be $n = \Omega\big(\frac{m^6 \log(m/\delta)}{\lambda_0^4 \delta^3}\big)$. Further, initialize $w_r \sim \mathcal{N}(0, 1)$ and $a_r \sim \operatorname{unif}(\{\pm 1\})$ for $r \in [n]$. Then with probability at
A similar result as before holds for deep neural networks. More precisely, the authors show that gradient descent achieves zero training loss in polynomial time for a deep over-parameterized residual neural network (ResNet) [27].
4 Generalization
12.12.19
4.1 Empirical Risk Minimization
Aiming to find an alternative definition of the complexity of a function
class we will define its covering number, describing how well it can be
approximated with simple objects such as ε-balls.
Lemma 4.1.3
We have
$$\mathcal{P}(X, 2\varepsilon) \leq \mathcal{N}(X, \varepsilon) \leq \mathcal{P}(X, \varepsilon) \leq \Big(1 + \frac{2}{\varepsilon}\Big)^d$$

$$P\big(\mathcal{E}(f_{\mathcal{H},S}) - \mathcal{E}(f_{\mathcal{H}}) \leq \varepsilon\big) \geq 1 - \mathcal{N}\Big(\mathcal{H}, \frac{\varepsilon}{8M}\Big) \cdot 2\exp\Big(-\frac{m\varepsilon^2}{8M^4}\Big)$$
holds.
WHAT ARE X and Y??? Notice that this bound makes sense: for many samples, i.e. large $m$, the term $e^{-Cm\varepsilon^2}$ decreases, yielding a better bound, while $\mathcal{N}\big(\mathcal{H}, \frac{\varepsilon}{8M}\big)$ increases with growing $M$, yielding a smaller and therefore worse lower bound.
Corollary 4.1.4 (Sufficient number)
A generalization error smaller than or equal to $\varepsilon \geq 0$ is achieved by
$$m \geq \varepsilon^{-2} \ln\big(\mathcal{N}(\mathcal{H}, c\varepsilon)\big)$$
In cross-validation you split the samples $S$ into $S^1$ and $S^2$ (of approximately the same size) and estimate the generalization error of $f$ by
$$\frac{2}{m}\Big(\sum_{(x,y) \in S^1} \ell(f(x), y) - \sum_{(x,y) \in S^2} \ell(f(x), y)\Big) = \frac{2}{m}\sum_{i=1}^{m} \sigma_i\, \ell(f(x_i), y_i) =: (\star).$$
We denote $\sigma := (\sigma_i)_{i=1}^{m}$, where
$$\sigma_i = \begin{cases} 1, & \text{if } (x_i, y_i) \in S^1, \\ -1, & \text{if } (x_i, y_i) \in S^2. \end{cases}$$
If we now choose $\ell$ to be the hinge loss $\ell(f(x), y) := \frac{1 - f(x)\cdot y}{2}$ and, without loss of generality, the labels to satisfy $y_i \in \{\pm 1\}$, one obtains
$$(\star) = \frac{1}{m}\sum_{i=1}^{m} \sigma_i\big(1 - f(x_i) y_i\big).$$
Now consider a random split, so $\sigma_i$ becomes a random variable. Hence
$$\sup_{f \in \mathcal{H}} \Big|\frac{1}{m}\sum_{i=1}^{m} \sigma_i\big(1 - f(x_i)y_i\big)\Big| \leq O\big(m^{-\frac{1}{2}}\big) + \underbrace{\mathbb{E}_\sigma\Big[\sup_{f \in \mathcal{H}} \frac{1}{m}\sum_{i=1}^{m} \tilde{\sigma}_i f(x_i)\Big]}_{=:(\dagger)},$$
where $\tilde{\sigma}_i := y_i \sigma_i$. The term $(\dagger)$ measures how well the hypothesis class is able to fit random labels to the samples.

(The $O(m^{-\frac{1}{2}})$ term follows directly from Van Zuijlen's bound [9],
$$P\Big(\frac{1}{\sqrt{m}}\sum_{i=1}^{m} \sigma_i \leq 1\Big) \geq \frac{1}{2}, \quad \text{where } P(\sigma_i = 1) = P(\sigma_i = -1) = \tfrac{1}{2}.)$$
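A quick Monte Carlo sketch of a term of this type for the linear class $\{x \mapsto \langle v, x\rangle : \|v\|_2 \leq b\}$ considered next; the data and the parameters are illustrative assumptions.

```python
import numpy as np

def rademacher_complexity_linear(X, b, n_trials=2000, seed=0):
    """Monte Carlo estimate of E_sigma[ sup_{||v||<=b} (1/m) sum_i sigma_i <v, x_i> ]
    = (b/m) E_sigma[ || sum_i sigma_i x_i ||_2 ]."""
    rng = np.random.default_rng(seed)
    m = X.shape[0]
    vals = []
    for _ in range(n_trials):
        sigma = rng.choice([-1.0, 1.0], size=m)      # Rademacher signs
        vals.append(b / m * np.linalg.norm(sigma @ X))
    return float(np.mean(vals))

rng = np.random.default_rng(1)
X = rng.standard_normal((200, 10))
a = np.max(np.linalg.norm(X, axis=1))                # bound on the sample norms
b = 1.0
print(rademacher_complexity_linear(X, b), b * a / np.sqrt(200))   # estimate vs. the ba/sqrt(m) bound
```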
One obtains
$$R_S(\mathcal{F}^{a,b}) = \mathbb{E}_\sigma\Big[\sup_{\|v\|_2 \leq b} \frac{1}{m}\Big|\sum_{i=1}^{m}\langle v, \sigma_i x_i\rangle\Big|\Big] \overset{\text{(CS)}}{\leq} \frac{b}{m}\, \mathbb{E}_\sigma\Big[\Big\|\sum_{i=1}^{m}\sigma_i x_i\Big\|_2\Big] \overset{\text{(L)}}{=} \frac{b}{m}\, \mathbb{E}_\sigma\Bigg[\sqrt{\sum_{i,j=1}^{m}\sigma_i\sigma_j \langle x_i, x_j\rangle}\Bigg]$$
$$\overset{\text{(J)}}{\leq} \frac{b}{m}\sqrt{\sum_{i,j=1}^{m}\mathbb{E}_\sigma[\sigma_i\sigma_j]\,\langle x_i, x_j\rangle} \leq \frac{b}{m}\sqrt{\mathbb{E}_\sigma\Big[\sum_{i=1}^{m}\|x_i\|_2^2\Big]} \leq \frac{ba}{\sqrt{m}},$$
where (CS) is the Cauchy–Schwarz inequality, (L) uses $\|\cdot\| = \sqrt{\langle\cdot,\cdot\rangle}$ and the linearity of the inner product on $\mathbb{R}^d$, (J) is Jensen's inequality $\mathbb{E}[f(X)] \leq f(\mathbb{E}[X])$ for concave $f$, and the last step uses $\mathbb{E}[\sigma_i\sigma_j] = 0$ for $i \neq j$. ⋄
$$R_S(\mathcal{F}^{L,B_1,\dots,B_L}) \leq m^{-\frac{1}{2}}\Big(\sqrt{2\log(2)d} + 1\Big)\prod_{i=1}^{L} B_i. \qquad \diamond$$
Notice that if we choose $f = \ell(h(x), y)$, the left term becomes $\mathbb{E}[f] - \mathcal{E}_S(f)$.
Fig. 32: The training and test accuracy of various models on the CIFAR10 dataset. Performance with and without data augmentation and weight decay are compared. The results of fitting random labels are also included. [13, 14]
5 Interpretability
07.01

Motivation. Given a trained neural network, we know neither what the training data was nor how it was trained. The goal of this section is to provide tools to open this black box.
Remark 5.0.1 Even if we know the training data and the training procedure, due to the lack of a comprehensive theory we still cannot open this black box.
Sensitivity Analysis
Let $C := (c_k)_{k=1}^{n}$ be a set of classes. Consider a DNN with realization
Remark 5.0.4 $M_c$ represents how much a small change in each pixel of $x$ changes the classification score $c_i$ for the classes from $C$.
The problems in the sensitivity map in Fig. 34 are evident:
• it looks more like noise because of its scattered nature,
• it is very far from a human explanation; correlations with the image regions a human would pick are rough at best.
This might be because the partial derivatives fluctuate greatly (Fig. 35). A solution is local smoothing [16]: instead of $M_c$ we now use
$$\widehat{M}_c(x) := \frac{1}{n}\sum_{i=1}^{n} M_c(x_i), \qquad x_i := x + \mathcal{N}(0, \sigma^2),$$
where the $x_i$ are drawn from a neighborhood of $x$.
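A small sketch of this smoothing step. It assumes the class score is available as a plain callable and uses finite differences for the gradient; the toy quadratic "classifier" standing in for $x \mapsto R_\varrho(\Phi)(x)_c$ is purely illustrative.

```python
import numpy as np

def sensitivity_map(score, x, eps=1e-4):
    """M_c(x): finite-difference gradient of the class score w.r.t. each pixel."""
    grad = np.zeros_like(x)
    for k in range(x.size):
        e = np.zeros_like(x)
        e.flat[k] = eps
        grad.flat[k] = (score(x + e) - score(x - e)) / (2 * eps)
    return grad

def smoothed_sensitivity_map(score, x, sigma=0.1, n=25, seed=0):
    """Average of M_c over noisy copies x_i = x + N(0, sigma^2)."""
    rng = np.random.default_rng(seed)
    maps = [sensitivity_map(score, x + sigma * rng.standard_normal(x.shape)) for _ in range(n)]
    return np.mean(maps, axis=0)

# Toy "classifier score": a fixed quadratic form on a 16-pixel input.
rng = np.random.default_rng(1)
A = rng.standard_normal((16, 16))
score = lambda x: float(x @ A @ x)
x = rng.standard_normal(16)
print(smoothed_sensitivity_map(score, x)[:4])
```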
where $w$ are the weights and the $a$ are constants tuned in the algorithm. In other words: a neuron (and therefore its relevance to the output) can be reconstructed from the neurons of the following layer and the weights.
5.1 Towards a more mathematical understanding
Deep Taylor

For a realization $R_\varrho(\Phi) : \mathbb{R}^d \to \mathbb{R}$ we want to determine the relevance of each pixel $x_k$ of $x = (x_k)_{k=1}^{d}$ for the output $R_\varrho(\Phi)(x)$.

Remark 5.0.6 In sensitivity analysis we have $M_p(x) = \frac{d}{dx} R_\varrho(\Phi)(x)$, which satisfies neither 1 nor 2. WHAT ABOUT LRP???

The following definition comes from [21]:
$$M_p(x) := \frac{\partial}{\partial x_p} R_\varrho(\Phi)(\hat{x})\,(x_p - \hat{x}_p)$$
Bxp
Remark 5.0.8 Consider the standard Taylor expansion Fig. 38: Numerical experiment for sensitiv-
ity vs. LRP. SOURCE???
R% pΦqpx̂q `∇R% pΦqpx̂qT px ´ x̂q ` Op}x
R% pΦqpxq “ loooomoooon ´ x̂}2 q,
loooooomoooooon
“0 can be ă0
Problem relaxation
distortion.
????
We have $\mathbb{E}[y], \operatorname{cov}[y] \xrightarrow{\ \Phi\ } \mathbb{E}[\Phi(y)], \operatorname{cov}[\Phi(y)]$.
The generic approach is to estimate these using the sample mean and sample covariance, which is possible for any classifier function $\Phi$ but might require a large number of samples.
The neural network approach uses the compositional structure of $\Phi$, propagating the distribution through the layers and projecting onto a simple family of distributions at each layer.
NOTE: low rank and diagonal are density filtering methods.
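A minimal sketch of the generic sampling-based approach, assuming the classifier $\Phi$ is given as a callable and $y$ is Gaussian; both are illustrative choices.

```python
import numpy as np

def propagate_moments_mc(Phi, mean, cov, n_samples=10_000, seed=0):
    """Estimate E[Phi(y)] and cov[Phi(y)] for y ~ N(mean, cov) by sample mean/covariance."""
    rng = np.random.default_rng(seed)
    ys = rng.multivariate_normal(mean, cov, size=n_samples)
    outs = np.array([Phi(y) for y in ys])
    return outs.mean(axis=0), np.cov(outs, rowvar=False)

# Toy classifier standing in for a network realization.
W = np.array([[1.0, -2.0], [0.5, 1.5]])
Phi = lambda y: np.maximum(W @ y, 0.0)
mu, Sigma = propagate_moments_mc(Phi, mean=np.zeros(2), cov=np.eye(2))
print(mu, Sigma)
```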
Hardness results
We consider the binary case. The following theorems [35] show that find
a minimizer of Rpεq and even the approximation of it is hard:
6 Inverse Problems

When solving an inverse problem, one tries to recover the original data from a transformed version!

Example 6.0.4 (Limited-Angle (Computed) Tomography)
A CT scanner (applied in dental CT, breast tomosynthesis and electron tomography) samples the Radon transform
$$\mathcal{R} f(\phi, s) := \int_{L(\phi, s)} f(x)\, dS(x), \qquad s \in \mathbb{R}.$$
This becomes a challenging inverse problem if $\mathcal{R} f(\cdot, s)$ is only sampled on $[-\phi, \phi] \subsetneq \big[-\tfrac{\pi}{2}, \tfrac{\pi}{2}\big]$. ⋄

Fig. 41: Visualization of inpainting, MRI and feature extraction. [37, 38]
2. $\mathcal{N}(T) \neq \{0\}$, meaning there exist infinitely many solutions.
3. $K^+$ (as defined below) is not continuous.
The solution for 1 is to try to minimize $\|Kx - y\|$, and the solution to 2 is to additionally minimize $\|x\|$:
Theorem 6.0.2
in $\mathcal{R}f$ at $(\phi_0, s_0)$.
• Singularities of $f$ TODO

Douglas–Rachford

Define $\gamma > 0$ and the proximal operator
$$\operatorname{prox}_f(v) := \arg\min_{z}\ f(z) + \frac{1}{2}\|z - v\|^2$$
in order to solve
$$\min_{x \in \mathbb{R}^d} \|Ax - y\|^2 + \alpha R(x).$$
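A compact sketch of Douglas–Rachford splitting applied to this problem with the illustrative choice $R = \|\cdot\|_1$; the step size $\gamma$, the value of $\alpha$ and the random data are assumptions, not taken from the lecture.

```python
import numpy as np

def prox_quadratic(v, A, y, gamma):
    """prox of f(x) = ||Ax - y||^2: solves (I + 2*gamma*A^T A) x = v + 2*gamma*A^T y."""
    d = A.shape[1]
    return np.linalg.solve(np.eye(d) + 2 * gamma * A.T @ A, v + 2 * gamma * A.T @ y)

def prox_l1(v, thresh):
    """prox of thresh * ||x||_1 (soft thresholding)."""
    return np.sign(v) * np.maximum(np.abs(v) - thresh, 0.0)

def douglas_rachford(A, y, alpha=0.1, gamma=1.0, n_iter=500):
    """min_x ||Ax - y||^2 + alpha * ||x||_1 via Douglas-Rachford splitting."""
    z = np.zeros(A.shape[1])
    for _ in range(n_iter):
        x = prox_quadratic(z, A, y, gamma)          # prox_{gamma f}
        w = prox_l1(2 * x - z, gamma * alpha)       # prox_{gamma g} at the reflected point
        z = z + w - x                               # update the auxiliary variable
    return x

rng = np.random.default_rng(0)
A = rng.standard_normal((30, 50))
x_true = np.zeros(50); x_true[:5] = 1.0
y = A @ x_true
print(np.round(douglas_rachford(A, y), 2)[:10])     # first entries of the (approximately sparse) solution
```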
Wavefront sets

A point $x \in \mathbb{R}^2$ is called a singular point of $f$ if there exists $\varphi \in C_0^\infty(U_x)$, where $U_x$ is a neighborhood of $x$ and $\varphi(x) \neq 0$, such that $\widehat{f\varphi} \notin C_0^\infty(\mathbb{R}^2)$.
A pair $(x, \theta)$ is called a regular directed point if there exist neighborhoods $U_x$ and $V_\theta$ of $x$ and $\theta$, respectively, and a $\gamma \in C_0^\infty(\mathbb{R}^2)$ with $\gamma|_{U_x} \equiv 1$ such that
$$\forall N\ \exists C_N > 0 : \quad |\widehat{(f\gamma)}(\xi)| \leq C_N (1 + |\xi|)^{-N} \quad \forall \xi \in \mathbb{R}^2 \text{ with } \frac{\xi_2}{\xi_1} \in V_\theta.$$
The wavefront set consists of all points which are not regular directed points. (Rapid decay of the Fourier transform in a direction means the function is smooth in that direction.)
Slides 15/16: Example: $K : X \to Y$ is inpainting; $(x_i, Kx_i)_i$ by fitting $(Kx_i, x_i)_i$.
Slide 18: $\eta$ is noise.
Slide 19: Backprojection should be $\approx$; also, no details were explained.

Fig. 45: TODO
7 Partial differential equations
06.02.2020

Partial differential equations arise in basically any branch of science and engineering. They are key for modeling numerous types of (natural) phenomena such as sound, heat, diffusion, electrostatics, fluid dynamics, elasticity and gravitation.
$$\mathcal{L}\Big(x, u(x), \frac{\partial u(x)}{\partial x_1}, \dots, \frac{\partial u(x)}{\partial x_d}, \frac{\partial^2 u(x)}{\partial x_1^2}, \dots\Big) = 0 \tag{12}$$
$$\frac{\partial u(x,t)}{\partial t} - \alpha \Delta_x u(x,t) = 0$$
$$\frac{\partial^2 u(x,t)}{\partial t^2} - \Delta_x u(x,t) = 0$$
7.1 Mathematics of PDEs
for $\varphi \in C_0^\infty(\Omega)$.
The advantage of the boxed weak formulation is that $\Delta u(x) = f(x)$ doesn't have to hold everywhere and $u \in C^2(\Omega)$ is not required anymore. ⋄
The spaces $H^m(\Omega)$ and $H_0^m(\Omega)$ are Hilbert spaces via
$$\langle u, v\rangle := \sum_{|\alpha| \leq m} \int_\Omega D^\alpha u(x)\, D^\alpha v(x)\, dx.$$
Let $b$ be symmetric, bounded and coercive, i.e. $b(v, v) \geq c\|v\|^2$ for some $c > 0$. Then there exists a unique solution $u \in H^m(\Omega)$ to (13). ($b$ is bounded if $|b(u, v)| \leq C\|u\|\|v\|$ for some $C > 0$, and symmetric if $b(u, v) = b(v, u)$.)

Corollary 7.1.9
For suitable $f : \Omega \to \mathbb{R}$, Poisson's equation has a unique solution.
7.2 Discretization
Usually finding the exact solution is infeasible, thus we discretize the
equation and the underlying space.
Commonly used methods include finite element, finite difference and finite volume methods as well as frame-based approaches.
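As a tiny illustration of the finite-difference approach, an explicit scheme for the 1D heat equation from the introduction; grid sizes, time step and boundary conditions are illustrative choices.

```python
import numpy as np

# Explicit finite-difference scheme for u_t - alpha * u_xx = 0 on [0,1] with u = 0 on the boundary.
alpha, nx, nt = 1.0, 51, 2000
dx = 1.0 / (nx - 1)
dt = 0.4 * dx ** 2 / alpha          # respects the stability condition alpha*dt/dx^2 <= 1/2

x = np.linspace(0.0, 1.0, nx)
u = np.sin(np.pi * x)               # initial condition

for _ in range(nt):
    lap = (np.roll(u, -1) - 2 * u + np.roll(u, 1)) / dx ** 2   # discrete second derivative
    u = u + dt * alpha * lap
    u[0] = u[-1] = 0.0              # Dirichlet boundary values

# For this model problem the exact solution decays like exp(-pi^2 * alpha * t).
print(np.max(np.abs(u - np.exp(-np.pi ** 2 * alpha * nt * dt) * np.sin(np.pi * x))))
```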
High-Fidelity Approximation

The Galerkin method solves
$$b(u^h, v) = \int_\Omega f(x)\, v(x)\, dx \quad \forall v \in U^h.$$
Let $(\varphi_i)_{i=1}^{D}$ be a basis for $U^h$. Then we have
$$b\Big(\sum_{i=1}^{D} u_i^h \varphi_i,\ \sum_{j=1}^{D} v_j \varphi_j\Big) = \int_\Omega f(x) \sum_{j=1}^{D} v_j \varphi_j(x)\, dx,$$
Therefore,
$$u^h = \sum_{i=1}^{D} u_i^h \varphi_i \quad \text{with} \quad (u_i^h)_{i=1}^{D} = \big[(b(\varphi_j, \varphi_i))_{i,j=1}^{D}\big]^{-1} \Big(\int_\Omega f(x)\,\varphi_i(x)\, dx\Big)_{i=1}^{D} \in \mathbb{R}^{D}.$$
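A minimal sketch of this Galerkin step for the 1D Poisson problem $-u'' = f$ on $(0,1)$ with a hat-function basis; all concrete choices (basis, quadrature, data) are illustrative.

```python
import numpy as np

# Galerkin / P1 finite elements for -u'' = f on (0,1), u(0) = u(1) = 0.
D = 49                                   # number of interior hat functions
h = 1.0 / (D + 1)
x = np.linspace(h, 1.0 - h, D)           # interior nodes

f = lambda t: np.pi ** 2 * np.sin(np.pi * t)     # then the exact solution is sin(pi t)

# Stiffness matrix B_ij = b(phi_j, phi_i) = int phi_i' phi_j' dx for hat functions.
B = (np.diag(2.0 * np.ones(D)) - np.diag(np.ones(D - 1), 1) - np.diag(np.ones(D - 1), -1)) / h
# Load vector F_i = int f(x) phi_i(x) dx, approximated by h * f(x_i).
F = h * f(x)

u_h = np.linalg.solve(B, F)              # coefficients of u^h in the hat-function basis
print(np.max(np.abs(u_h - np.sin(np.pi * x))))   # small discretization error
```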
7.3 Parametric PDEs

Multi-query situation

Many applications require solving the parametric PDE multiple times for different parameters:
$$\mathbb{R}^p \supset \mathcal{Y} \ni y = (y_1, \dots, y_p) \mapsto u_y \in \mathcal{H}.$$
We assume that
• For all $\varepsilon > 0$ there exist $d(\varepsilon) \ll D$ and $V \in \mathbb{R}^{D \times d(\varepsilon)}$ such that
for all $y \in \mathcal{Y}$ there exists $B_y^{\mathrm{rb}} \in \mathbb{R}^{d(\varepsilon) \times d(\varepsilon)}$ with

by a neural network. □

Let $(\psi_i)_{i=1}^{d(\varepsilon)}$ denote the reduced basis and assume that there exist ReLU neural networks $(\Phi_i)_{i=1}^{d(\varepsilon)}$ of size $O(\operatorname{polylog}(\varepsilon))$ such that $\|\Phi_i - \psi_i\|_{\mathcal{H}} \leq \varepsilon$ for all $i \in [d(\varepsilon)]$.
Then there exists a ReLU neural network $\Phi$ of size
Remark 7.3.2 These hypotheses are fulfilled for diffusion and linear elasticity equations.

Slide 44

$$x'(t) = \varrho\big(A(t)x(t) - b(t)\big), \qquad x(0) = x_0$$
References
[1] Selfmade in GeoGebra6.
[2] Lecture Slides by Prof. Gitta Kutyniok.
[3] Tatiana A Bubba, Gitta Kutyniok, Matti Lassas, Maximilian März,
Wojciech Samek, Samuli Siltanen and Vignesh Srinivasan.
Learning The Invisible: A Hybrid Deep Learning-Shearlet Framework
for Limited Angle Computed Tomography
Inverse Problems, Volume 35, Number 6, 31.05.2019.
[4] Matthew Hutson.
AI researchers allege that machine learning is alchemy.
Science Magazine, 03.05.2018.
[5] Dmitry Yarotsky.
Error bounds for approximations with deep ReLU networks.
Neural Networks, Volume 94, Pages 103-114, 2016.
[6] Matus Telgarsky.
Representation Benefits of Deep Feedforward Networks.
arXiv, 2016.
[7] George Cybenko.
Approximation by superpositions of a sigmoidal function.
Mathematics of Control, Signals and Systems, Volume 2, Pages 303-
314, 1989.
[8] Screenshot of What is VC Dimension | Example vc dimension of Line
Hypothesis class on YouTube at 7:01.
[9] Martien C.A. van Zuijlen.
On a conjecture concerning the sum of independent Rademacher
random variables.
arXiv, 2011.
[10] Wikipedia page for overfitting.
[11] Noah Golowich, Alexander Rakhlin, Ohad Shamir.
Size-Independent Sample Complexity of Neural Networks.
Proceedings of Machine Learning Research, Volume 75, Pages 1–3,
2018.
[12] Chapter 26 in Shalev-Shwartz, Shai; Ben-David, Shai (2014). Under-
standing Machine Learning – from Theory to Algorithms.
[13] Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, Oriol
Vinyals.
Understanding deep learning requires rethinking generalization.
arXiv, 2019.
[14] Mikhail Belkin, Daniel Hsu, Siyuan Ma, Soumik Mandal.
Reconciling modern machine learning practice and the bias-variance
trade-off.
arXiv, 2019.
Index
A
activation function, 2
approximation method, 18
artificial neural network, 2
artificial neuron, 2

B
backpropagation, 9
binary en(de)coders, 16

C
cartoon-like functions, 11
classification, 4
compositional functions, 24
cross-validation, 4, 31

D
distortion, 16

E
encoder-decoder pair, 16
Expressivity, 3

G
Generalization, 3
gradient descent, 8
gradient flow, 26

H
hinge loss, 31
hypothesis space, 6

I
inpainting, 39
Interpretability, 3
inverse problem, 39

L
Learning, 3
loss function, 8

M
mean squared error, 4
minimal code length, 16

O
optimal exponent, 16

R
reduced basis method, 46
regression, 4
Regularization, 40
relevance map (Taylor decomposition), 36
residual neural network, 29

S
shearlet, 11
supervised learning, 4

U
unsupervised learning, 4

V
VC dimension, 14

W
wavefront set, 42
weak derivative, 44