
Technical University Berlin

Lecture Notes

Mathematics of Deep Learning


read by Prof. Dr. Gitta Kutyniok in the winter term 2019/2020
Contents

List of Figures

1 Introduction to Neural Networks and Deep Learning
  1.1 Introduction to this lecture
  1.2 Deep Neural Networks
  1.3 Training of deep neural networks

2 Expressivity
  2.1 Viewpoint of approximation theory
  2.2 Universality results
  2.3 Lower bounds for approximation
  2.4 Upper bounds and optimality
  2.5 Approximation by Wavelets, Shearlets and more
  2.6 Impact of Depth

3 Learning
  3.1 Training in overparameterized regimes

4 Generalization
  4.1 Empirical Risk Minimization

5 Interpretability
  5.1 Towards a more mathematical understanding

6 Inverse Problems

7 Partial differential equations
  7.1 Mathematics of PDEs
  7.2 Discretization
  7.3 Parametric PDEs

Index

Lecture Notes by Viktor Glombik.

Edited on August 22, 2020.
List of Figures

1  An artificial neuron.
2  Different activation functions.
3  An artificial neural network.
4  Neural network architecture.
5  Under- and overfitting.
6  Three types of errors in learning problems.
7  Concatenation of neural networks.
8  Parallelization of neural networks.
9  A wavelet.
10 Tiling of the plane induced by wavelets.
11 The wavelet transform.
12 Anisotropic features govern natural images.
13 Intuitive explanation for why wavelets don't perform as well as shearlets.
14 The tiling of the frequency domain induced by cone-adapted shearlets.
15 All eight possible ways to shatter three non-collinear points in the plane.
16 A visualization of the $\varphi_i$.
17 Visualization of the encoder and decoder.
18 The red line represents $\varepsilon^{-\gamma}$ and the blue one $\mathrm{Learn}(\varepsilon, \mathcal{C})$.
19 A neural network representing the hat function.
20 The functions $g_m$ for $m \in \{0, 1, 2\}$.
21 The functions $f_k$ for $k \in \{1, 2\}$ and $x^2$.
22 Neural network realizing multiplication.
23 TODO
24 The bump functions $f$ and $\varphi$.
25 Numerical experiments with ReLU and backpropagation. [Source??]
26 Illustration of $\varphi$ for $d = 2$.
27 Visual intuition for the theorem.
28 Some numerics.
29 Illustration of the concept of covering number.
30 Sometimes a "double descent curve" is observed.
31 The double descent curve.
32 The training and test accuracy of various models on the CIFAR10 dataset.
33 Relevance map for handwritten digits.
34 An image and its sensitivity map $M_c$.
35 The partial derivative of $M_c$ with respect to the RGB values of a single pixel.
36 Effect of sample size on the estimated gradient for Inception.
37 Visualization of backpropagation.
38 Numerical experiment for sensitivity vs. LRP.
39 TODO. [2]
40 Recovering meat from minced meat is a difficult inverse problem.
41 Visualization of inpainting, MRI and feature extraction.
42 Geometric setup of the Radon transform.
43 The point on the green line closest to the origin differs when considering the $\ell^1$- or the $\ell^2$-norm.
44 The shrinkage operator for $\lambda = \frac{3}{2}$.
45 TODO
1 Introduction to Neural Networks and Deep Learning

1.1 Introduction to this lecture

After briefly touching on the basics of statistical learning theory we
will cover the four main aspects of the mathematical theory of deep
learning: expressivity, optimization, generalization and interpretability.
Furthermore, we will see deep learning at work in mathematical areas such as
inverse problems and the numerical analysis of partial differential equations.
The goal of this course is to learn the possibilities and limitations
of deep learning in order to lay the foundation for delving further into
this topic, both theoretically and practically.

Deep neural networks have recently shown impressive results in a variety
of real-world applications such as self-driving cars, surveillance, legal
issues, health care, speech recognition and many more.
They have already outperformed various mathematics-based approaches
which were fundamental in inverse problems and imaging science (denoising,
edge detection, inpainting, classification, superresolution and
limited-angle computed tomography), in the numerical analysis of the
Black-Scholes, Allen-Cahn and parametric PDEs, and in modelling. Therefore,
neural networks and standard mathematical techniques have been combined
(cf. [3]) by using model-based methods as far as possible and
data-driven methods where necessary.
Unfortunately, very few theoretical results explain the success of deep neural
networks (cf. [15]): "Ali Rahimi, a researcher in artificial intelligence
at Google [...], took a swipe at his field last December and received
a 40-second ovation for it. [...] at an AI conference, Rahimi charged
that machine learning algorithms [...] have become a form of «alchemy».
Researchers, he said, do not know why some algorithms work and others
don't, nor do they have rigorous criteria for choosing one AI architecture
over another". [4]

For brevity's sake we will write $[n] := \{1, \dots, n\}$ for $n \in \mathbb{N}$.

1.2 Deep Neural Networks


Artificial Neurons & Neural Networks
The idea behind artificial neural networks is to mimic the functionality
of the human brain. It can be traced back to McCulloch and Pitts
(1943), who were interested in developing an algorithmic approach to
learning.

definition 1.2.1 (Artificial neuron)

An artificial neuron with weights $w \in \mathbb{R}^d$, bias $b \in \mathbb{R}$ and activation
function $\rho: \mathbb{R} \to \mathbb{R}$ is a function

$$\mathbb{R}^d \to \mathbb{R}, \quad x \mapsto \rho(\langle x, w \rangle - b).$$

Fig. 1: An artificial neuron: inputs $x_i$ with weights $w_i$ and bias $b$ are summed
and passed through an activation, here a threshold outputting $1$ if
$\langle x, w \rangle - b > 0$ and $0$ otherwise.

Example 1.2.2 (Activation functions)

• Heaviside function $\mathbb{1}_{[0,\infty)}$,
• Sigmoid function $x \mapsto (1 + e^{-x})^{-1}$,
• Rectified Linear Unit (ReLU) $x \mapsto \max(0, x)$,
• Softplus function $x \mapsto \ln(1 + e^x)$. ◇
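To make Definition 1.2.1 and the activation functions above concrete, here is a
minimal NumPy sketch (not part of the original notes; all names are ad hoc):

```python
import numpy as np

# The activation functions from Example 1.2.2.
heaviside = lambda x: (x > 0).astype(float)       # 1_[0,inf)
sigmoid   = lambda x: 1.0 / (1.0 + np.exp(-x))
relu      = lambda x: np.maximum(0.0, x)
softplus  = lambda x: np.log1p(np.exp(x))

def neuron(x, w, b, rho=relu):
    """Artificial neuron x -> rho(<x, w> - b) from Definition 1.2.1."""
    return rho(np.dot(x, w) - b)

x = np.array([1.0, -2.0, 0.5])
w = np.array([0.3, 0.1, -0.7])
print(neuron(x, w, b=0.2, rho=heaviside))  # fires iff <x, w> - b > 0
```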

Fig. 2: Different activation functions. [1]

definition 1.2.3 (Artificial Neural Network Types)

An artificial neural network (ANN) is a graph consisting of artificial
neurons. It is called a feed-forward ANN if the graph is directed and acyclic
and a recurrent ANN otherwise.

Deep Neural Networks

The start of the Age of Data at the turn of the century saw a drastic
improvement in computing power and vast amounts of data becoming available.
This gave rise to deep neural networks (cf. Y. LeCun et al., ca. 2000).

Fig. 3: An artificial neural network with input, hidden and output layer.

definition 1.2.4 ((Deep) Neural Network (DNN))

The map

$$\mathbb{R}^d \to \mathbb{R}^{N_L}, \quad x \mapsto T_L \rho(T_{L-1} \rho(\dots \rho(T_1(x))))$$

is called a DNN, where $d := N_0 \in \mathbb{N}$ is the dimension of the input layer,
$L$ the number of layers, $N_\ell$ the number of neurons in layer $\ell \in [L]$,
$\rho: \mathbb{R} \to \mathbb{R}$ the activation function and $(T_\ell: \mathbb{R}^{N_{\ell-1}} \to \mathbb{R}^{N_\ell})_{\ell=1}^L$ are
affine-linear maps.

Example 1.2.5 (DNN Architecture: Affine-linear maps)

The affine-linear maps $T_\ell$ are defined by a matrix $A_\ell \in \mathbb{R}^{N_\ell \times N_{\ell-1}}$ and a
vector $b_\ell \in \mathbb{R}^{N_\ell}$ via $T_\ell(x) := A_\ell x + b_\ell$. ◇

In the example on the right (cf. Fig. 4) the first layer is given by

$$A_1 := \begin{pmatrix} a_{11} & a_{12} & 0 \\ a_{21} & a_{22} & 0 \\ 0 & 0 & a_{13} \\ 0 & 0 & a_{14} \end{pmatrix},$$

and $A_2$ contains the weights of the edges into the output layer, with zero
entries for edges that are not present.

Example 1.2.6 (Training of neural networks)

Let $\mathcal{M}$ be a manifold and $(x_i, f(x_i))_{i=1}^m$ samples of a function $f: \mathcal{M} \to [K]$,
aiming to classify an input $x \in \mathcal{M}$ into one of $K$ classes.
We now select $d$, $L$, $(N_\ell)_{\ell=1}^L$ and $\rho$, collectively forming the architecture
of a DNN. Then we learn the affine-linear maps $(T_\ell)_{\ell=1}^L$ by solving

$$\min_{(A_\ell, b_\ell)_{\ell=1}^L} \sum_{i=1}^m \mathcal{L}\big(\Phi_{(A_\ell, b_\ell)}(x_i), f(x_i)\big) + \lambda \mathcal{R}\big((A_\ell, b_\ell)_{\ell=1}^L\big),$$

where $\mathcal{L}$ is a loss function, $\mathcal{R}$ a regularization term and $\lambda > 0$ (cf. 1.2.9),
yielding the network

$$\Phi_{(A_\ell, b_\ell)_\ell}: \mathbb{R}^d \to \mathbb{R}^{N_L}, \quad x \mapsto T_L \rho(T_{L-1} \rho(\dots \rho(T_1(x)))).$$

Fig. 4: Neural network architecture. [2]

This is often solved by stochastic gradient descent (cf. subsection 1.3),
in pursuit of $\Phi_{(A_\ell, b_\ell)_\ell} \approx f$. ◇
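The forward pass $x \mapsto T_L \rho(T_{L-1} \rho(\dots \rho(T_1 x)))$ can be written in a few
lines; the following is a minimal sketch (plain NumPy, ad-hoc names, random
weights), not part of the original notes:

```python
import numpy as np

def realize(layers, rho, x):
    """Evaluate T_L rho(T_{L-1} rho(... rho(T_1 x))) for layers = [(A_1, b_1), ..., (A_L, b_L)]."""
    a = x
    for A, b in layers[:-1]:
        a = rho(A @ a + b)          # hidden layers: affine map followed by activation
    A_L, b_L = layers[-1]
    return A_L @ a + b_L            # no activation after the last affine map

rho = lambda z: np.maximum(0.0, z)  # ReLU
rng = np.random.default_rng(0)
layers = [(rng.normal(size=(8, 3)), rng.normal(size=8)),
          (rng.normal(size=(1, 8)), rng.normal(size=1))]
print(realize(layers, rho, np.array([0.1, -0.4, 0.7])))
```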

Four Fundamental Questions concerning DNNs

• Expressivity (Applied Harmonic Analysis, Approximation Theory)
  – How powerful is the network architecture?
  – Can it represent the correct functions?
• Learning (Differential Geometry, Optimal Control, Optimization)
  – Why does the algorithm yield reasonable results?
  – What are sensible starting values?
• Generalization (Learning Theory, Optimization, Statistics)
  – Why do DNNs perform well on unseen data?
  – What impact does the depth of the network have?
• Interpretability (Information Theory, Uncertainty Quantification)
  – Why did a trained DNN reach a certain decision?
  – Which of the input's components contributes most?

Mathematical Learning Theory

definition 1.2.7 (Learning (T. Mitchell, 1997))

A computer program learns from experience E with respect to some
class of tasks T and performance measure P, if its performance at
tasks in T, as measured by P, improves with experience E.

While this definition lacks mathematical precision, we can give some
examples for T, E and P:

Example 1.2.8 (Task, experience, performance measure)

A task could be
• classification: compute $f: \mathbb{R}^n \to [k]$, mapping a data point to its
  class (e.g. hand-written digit recognition),
• regression: compute $f: \mathbb{R}^n \to \mathbb{R}$, predicting a numerical value (e.g.
  future prices),
• density estimation: learn a probability density $p$ on the data space,
  e.g. to find corrupted data or detect anomalies.
The experience is typically given by the data set. We mainly distinguish
between two types of learning:
• supervised: labelled data (classification),
• unsupervised: the data is unlabelled and the algorithm is tasked with
  finding structure in the data (clustering). Think of a classification
  task in which you do not know the classes the data points in the
  (test) data set belong to.
The usual performance measure is accuracy: the proportion of data points
for which the model (function) outputs the correct value.
This can be evaluated by cross-validation: the data is split into subsets,
called cross-validation folds, and the model is then trained on one and
tested on the other, as sketched below. ◇
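A minimal cross-validation sketch (toy data and a nearest-class-mean classifier,
both made up for illustration; not part of the notes):

```python
import numpy as np

def cross_val_accuracy(X, y, fit, predict, n_folds=5, seed=0):
    """Split (X, y) into folds, train on the rest, test on the held-out fold."""
    idx = np.random.default_rng(seed).permutation(len(y))
    folds = np.array_split(idx, n_folds)
    accs = []
    for k in range(n_folds):
        test, train = folds[k], np.concatenate(folds[:k] + folds[k + 1:])
        model = fit(X[train], y[train])
        accs.append(np.mean(predict(model, X[test]) == y[test]))
    return np.mean(accs)

# nearest-class-mean classifier as a toy model
fit = lambda X, y: {c: X[y == c].mean(axis=0) for c in np.unique(y)}
predict = lambda m, X: np.array([min(m, key=lambda c: np.linalg.norm(x - m[c])) for x in X])

X = np.vstack([np.random.randn(50, 2), np.random.randn(50, 2) + 3])
y = np.array([0] * 50 + [1] * 50)
print(cross_val_accuracy(X, y, fit, predict))   # average accuracy over the folds
```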

Example 1.2.9 (Linear Regression)

We want to predict the function $f: \mathbb{R}^n \to \mathbb{R}$ (task) and split our data
set (experience) into a training set $z := ((x_i^{\mathrm{train}}, y_i^{\mathrm{train}}))_{i=1}^m \subset \mathbb{R}^n \times \mathbb{R}$ and
a test set $((x_i^{\mathrm{test}}, y_i^{\mathrm{test}}))_{i=1}^{\tilde m} \subset \mathbb{R}^n \times \mathbb{R}$.

For the learning algorithm we define the hypothesis space

$$\mathcal{H} := \mathrm{span}\big((\varphi_i)_{i=1}^\ell\big) \subset C(\mathbb{R}^n)$$

and the empirical error/risk by the mean squared error (MSE)

$$\mathcal{E}_z(\hat f) := \frac{1}{m} \sum_{i=1}^m \big(\hat f(x_i^{\mathrm{train}}) - y_i^{\mathrm{train}}\big)^2,$$

and then find the empirical target function

$$f_{\mathcal{H},z} := \operatorname*{arg\,min}_{f \in \mathcal{H}} \mathcal{E}_z(f). \qquad (1)$$

Every $f \in \mathcal{H}$ can be expressed as $f = \sum_{i=1}^\ell w_i \varphi_i$. For brevity's sake we
set $y := (y_i^{\mathrm{train}})_{i=1}^m$, $w := (w_i)_{i=1}^\ell$ and $A := (\varphi_j(x_i^{\mathrm{train}}))_{i,j} \in \mathbb{R}^{m \times \ell}$, so we
can write $m\,\mathcal{E}_z(f) = \|Aw - y\|^2$.
Letting $\hat w := \operatorname*{arg\,min}_{w \in \mathbb{R}^\ell} \|Aw - y\|^2$ be the minimiser, the sought estimate
(= solution of the regression task) is given by

$$\hat f := \sum_{i=1}^\ell (\hat w)_i \varphi_i. \qquad ◇$$
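A numerical sketch of this example (assumed setup: a monomial dictionary
$\varphi_j(x) = x^{j}$ on $\mathbb{R}$ and synthetic data; not from the notes):

```python
import numpy as np

rng = np.random.default_rng(1)
x_train = rng.uniform(0, 1, size=30)
y_train = np.sin(2 * np.pi * x_train) + 0.1 * rng.normal(size=30)

ell = 6                                        # dimension of the hypothesis space H
A = np.vander(x_train, ell, increasing=True)   # A_{ij} = phi_j(x_i) = x_i^j
w_hat, *_ = np.linalg.lstsq(A, y_train, rcond=None)   # minimises ||A w - y||_2

f_hat = lambda x: np.vander(np.atleast_1d(x), ell, increasing=True) @ w_hat
print("empirical risk:", np.mean((A @ w_hat - y_train) ** 2))
```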

The general statistical learning problem

We can now state a mathematically precise definition of learning.

definition 1.2.10 (Statistical Learning Problem)

Let $(\Omega, \Sigma, \sigma)$ be a probability space, $X: \Omega \to \mathbb{R}^n$ and $Y: \Omega \to \mathbb{R}^k$
be random vectors and $\mathcal{X}$ and $\mathcal{Y}$ be the images of $X$ and $Y$,
where we assume that $\mathcal{X} \subset \mathbb{R}^n$ is compact. For a loss function
$\ell: \mathbb{R}^k \times \mathbb{R}^k \to [0, \infty]$ and a measurable $f: \mathcal{X} \to \mathcal{Y}$ define the
error

$$\mathcal{E}(f) := \mathbb{E}\big[\ell(f(X), Y)\big] = \int_\Omega \ell\big(f(X(\omega)), Y(\omega)\big)\,\mathrm{d}\sigma(\omega).$$

The learning problem is to find $f_{\mathcal{H}} := \operatorname*{arg\,min}_{f: \mathcal{X} \to \mathcal{Y}} \mathcal{E}(f)$.

Remark 1.2.11 In regression we have $k = 1$. A typical loss function is the
squared error $\ell(x, y) := (x - y)^2$.

definition 1.2.12 (Notation: Probability measures)

For $x \in \mathcal{X}$ let $\sigma(y \mid x)$ be the conditional probability measure on
$\mathcal{Y}$ (with respect to $x$) and $\sigma_{\mathcal{X}}$ the marginal probability measure
on $\mathcal{X}$.

We have $\sigma_{\mathcal{X}}(S) = \sigma(\pi_{\mathcal{X}}^{-1}(S))$, where $\pi_{\mathcal{X}}: \mathcal{X} \times \mathcal{Y} \to \mathcal{X}$, $(x, y) \mapsto x$. Then,
writing $Z := \mathcal{X} \times \mathcal{Y}$ for the sample space, for every integrable function
$\varphi: Z \to \mathbb{R}$

$$\int_{\mathcal{X} \times \mathcal{Y}} \varphi(x, y)\,\mathrm{d}\sigma(x, y) = \int_{\mathcal{X}} \Big(\int_{\mathcal{Y}} \varphi(x, y)\,\mathrm{d}\sigma(y \mid x)\Big)\,\mathrm{d}\sigma_{\mathcal{X}}(x)$$

holds.

definition 1.2.13 (Regression function)

We define the regression function by

$$f_\sigma: \mathcal{X} \to \mathcal{Y}, \quad x \mapsto \int_{\mathcal{Y}} y\,\mathrm{d}\sigma(y \mid x).$$

Theorem 1.2.1: Regression

The regression function $f_\sigma$ is a minimiser of $\mathcal{E}$ over $L^2(\mathcal{X}, \sigma_{\mathcal{X}})$:

$$\mathcal{E}(f) = \|f - f_\sigma\|_{L^2(\mathcal{X}, \sigma_{\mathcal{X}})}^2 + \mathcal{E}(f_\sigma) \quad \forall f \in L^2(\mathcal{X}, \sigma_{\mathcal{X}}).$$

As the distribution $\sigma$ is unknown, the target function cannot be computed.

definition 1.2.14 (Hypothesis space, target function)

A subspace $\mathcal{H} \subset C(\mathcal{X}, \mathcal{Y})$ is called hypothesis space. A minimiser $f_{\mathcal{H}}$
of $\mathcal{E}$ over $\mathcal{H}$ is called target function.

Example 1.2.15 Spaces of homogeneous polynomials or reproducing kernel
Hilbert spaces are commonly used hypothesis spaces. ◇

In reality we only have access to the observed data. We thus assume a
given sample $S := ((x_k, y_k))_{k=1}^N$ drawn i.i.d. according to $\sigma$ and try to
estimate $f_{\mathcal{H}}$ from $S$.

definition 1.2.16 (Empirical error / target function)

The empirical error of $f$ with respect to the loss $\ell$ and the sample $S$ is

$$\mathcal{E}_S(f) := \frac{1}{N} \sum_{k=1}^N \ell(f(x_k), y_k).$$

An empirical target function $f_{\mathcal{H},S} \in \mathcal{H}$ is a minimiser of $\mathcal{E}_S$ over $\mathcal{H}$.

Fig. 5: Noisy (roughly linear) data is fitted by a linear function and by a
polynomial function. Although the polynomial is a perfect fit, the linear
function can be expected to generalize better: if the two functions were used
to extrapolate beyond the fitted data, the linear function should make better
predictions. [10]

The error of the empirical target function can be decomposed as

$$\mathcal{E}(f_{\mathcal{H},S}) = \underbrace{\mathcal{E}(f_{\mathcal{H},S}) - \mathcal{E}(f_{\mathcal{H}})}_{\text{generalization / sample error}} + \underbrace{\mathcal{E}(f_{\mathcal{H}})}_{\text{approximation error}} \qquad \text{(bias-variance decomposition)}$$

Remark 1.2.17 The approximation error $\mathcal{E}(f_{\mathcal{H}})$ depends only on $\mathcal{H}$,
while the generalization error also depends on the size of the sample $S$.
Typically we observe the following behavior for a fixed sample size:

  $\mathcal{H}$        generalization error    approximation error
  larger     increases               decreases
  smaller    decreases               increases
Fig. 6: Three types of errors in learning problems. We aim for a universally
best method. [2]
Deep Neural Networks Enter the Stage

definition 1.2.18 (Deep neural network II)

In the setting of definition 1.2.4 the map

$$R_\rho(\Phi): \mathbb{R}^d \to \mathbb{R}^{N_L}, \quad x \mapsto \Phi_{(A_\ell, b_\ell)_{\ell=1}^L}(x)$$

is the realization of $\Phi$ with activation function $\rho$, and $\Phi :=
((A_k, b_k))_{k=1}^L$ the neural network architecture.
We call $N(\Phi) := d + \sum_{\ell=1}^L N_\ell$ the number of neurons, $L(\Phi) := L$ the
number of layers, and $M(\Phi) := \sum_{\ell=1}^L \|A_\ell\|_0 + \|b_\ell\|_0$ the number of
weights, where $\|\cdot\|_0$ denotes the number of non-zero entries.
For $d \in \mathbb{N}$, $M, L, N \in \mathbb{N} \cup \{\infty\}$ and $R \in \mathbb{R}$ we denote by $\mathcal{NN}^R_{d,M,N,L}$
the set of neural networks $\Phi$ with input dimension $d$, $N_L = 1$,
$M(\Phi) \le M$ and $N(\Phi) \le N$, where all weights are bounded by $R$.

We call $\Phi$ sparsely connected if $M(\Phi)$ is small, shallow if $L$ is small and
deep if $L$ is large.

Example 1.2.19 (Notation $\mathcal{NN}^R_{d,M,N,L}$) Let $\Phi$ be given by

$$A_1 := \begin{pmatrix} 1 & 2 & 0 \\ 0 & 0 & 5 \\ 0 & 0 & 4 \end{pmatrix}, \quad b_1 := \begin{pmatrix} 2 \\ 3 \\ 4 \end{pmatrix}, \quad A_2 := \begin{pmatrix} 2 & 6 & 0 \\ 0 & 0 & 7 \end{pmatrix}, \quad b_2 := \begin{pmatrix} 8 \\ 8 \end{pmatrix}.$$

Then $\Phi \in \mathcal{NN}^8_{3,7,8,2}$.
An artificial neuron is the realization of a NN $\Phi \in \mathcal{NN}_{d,d,d+1,1}$. ◇

We will now cover the two basic operations on neural networks.

Lemma 1.2.20 (Concatenation)
Let $L_1, L_2 \in \mathbb{N}$ and $\Phi^{(i)} := \big((A_k^{(i)}, b_k^{(i)})\big)_{k=1}^{L_i}$ for $i \in \{1, 2\}$ be neural
networks such that the input layer of $\Phi^{(1)}$ and the output layer of $\Phi^{(2)}$
have the same dimension. For any activation function $\rho$,

• $R_\rho\big(\Phi^{(1)} \circ \Phi^{(2)}\big) = R_\rho(\Phi^{(1)}) \circ R_\rho(\Phi^{(2)})$,
• $L\big(\Phi^{(1)} \circ \Phi^{(2)}\big) = L(\Phi^{(1)}) + L(\Phi^{(2)}) - 1$

hold, where $\Phi^{(1)} \circ \Phi^{(2)}$ is defined as

$$\Big(\big(A_k^{(2)}, b_k^{(2)}\big)_{k=1}^{L_2 - 1},\; \big(A_1^{(1)} A_{L_2}^{(2)},\; A_1^{(1)} b_{L_2}^{(2)} + b_1^{(1)}\big),\; \big(A_k^{(1)}, b_k^{(1)}\big)_{k=2}^{L_1}\Big).$$

Fig. 7: Concatenation of neural networks. [2]

Lemma 1.2.21 (Parallelization)
Let $L \in \mathbb{N}$ and let $\Phi^{(1)}, \Phi^{(2)}$ both have $L$ layers. For all activation functions $\rho$
and all $x \in \mathbb{R}^d$,

• $R_\rho\big(P(\Phi^{(1)}, \Phi^{(2)})\big)(x) = \big(R_\rho(\Phi^{(1)})(x), R_\rho(\Phi^{(2)})(x)\big)$,
• $L\big(P(\Phi^{(1)}, \Phi^{(2)})\big) = L(\Phi^{(1)})$,
• $M\big(P(\Phi^{(1)}, \Phi^{(2)})\big) = M(\Phi^{(1)}) + M(\Phi^{(2)})$

hold, where $P(\Phi^{(1)}, \Phi^{(2)}) := \big((\hat A_k, \hat b_k)\big)_{k=1}^L$ with

$$\hat A_1 := \begin{pmatrix} A_1^{(1)} \\ A_1^{(2)} \end{pmatrix}, \quad \hat A_\ell := \begin{pmatrix} A_\ell^{(1)} & 0 \\ 0 & A_\ell^{(2)} \end{pmatrix} \;(\ell \ge 2), \quad \hat b_\ell := \begin{pmatrix} b_\ell^{(1)} \\ b_\ell^{(2)} \end{pmatrix} \quad \forall \ell \in [L].$$

Fig. 8: Parallelization of neural networks. [2]

Lemma 1.2.22 (The identity with neural networks)

Define $\Phi^{(\mathrm{id})} := \big((A_1, 0), (A_2, 0)\big)$ with $A_1 := (\mathrm{id}_{\mathbb{R}^d}, -\mathrm{id}_{\mathbb{R}^d})^T$ and $A_2 :=
A_1^T$. Then $R_\rho(\Phi^{(\mathrm{id})}) \equiv \mathrm{id}_{\mathbb{R}^d}$ for $\rho := \mathrm{ReLU}$.

Remark 1.2.23 Therefore different architectures can lead to the same
realization; $R_\rho(\Phi) = R_\rho(\Phi \circ \Phi^{(\mathrm{id})})$. A quick numerical check is given below.
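A minimal check of Lemma 1.2.22 (sketch, not from the notes): with the ReLU,
the two-layer network with stacked matrices reproduces the identity, since
$x = \mathrm{ReLU}(x) - \mathrm{ReLU}(-x)$.

```python
import numpy as np

d = 4
I = np.eye(d)
A1 = np.vstack([I, -I])      # (2d x d), first layer:  z = A1 x
A2 = np.hstack([I, -I])      # (d x 2d), second layer: A2 ReLU(z)
relu = lambda z: np.maximum(0.0, z)

def phi_id(x):
    return A2 @ relu(A1 @ x)  # = ReLU(x) - ReLU(-x) = x

x = np.random.default_rng(2).normal(size=d)
print(np.allclose(phi_id(x), x))   # True: the realization is the identity
```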

1.3 Training of deep neural networks

Let $d := N_0$, $(N_k)_{k=1}^L \subset \mathbb{N}$ and $\rho$ an activation function. Define the
hypothesis space

$$\mathcal{H} := \Big\{ R_\rho(\Phi) : \Phi = \big((A_k, b_k)\big)_{k=1}^L,\; A_\ell \in \mathbb{R}^{N_\ell \times N_{\ell-1}},\; b_\ell \in \mathbb{R}^{N_\ell} \Big\}.$$

Given samples $z := ((x_i, y_i))_{i=1}^m \subset \mathbb{R}^d \times \mathbb{R}^{N_L}$, we aim to find the empirical
target function

$$f_z := \operatorname*{arg\,min}_{f \in \mathcal{H}} \sum_{i=1}^m \mathcal{L}(f, x_i, y_i),$$

where $\mathcal{L}: C(\mathbb{R}^d, \mathbb{R}^{N_L}) \times \mathbb{R}^d \times \mathbb{R}^{N_L} \to \mathbb{R}$ is a loss function, e.g. the MSE.
As optimization approach we choose gradient descent: for $F: \mathbb{R}^N \to \mathbb{R}$
one uses the iteration

$$u_{n+1} \leftarrow u_n - \eta \nabla F(u_n), \quad n \in \mathbb{N},$$

where $\eta \in \mathbb{R}$ is the step size, a hyperparameter, and the gradient $\nabla$ is
taken componentwise.

For our problem we set $F := \sum_{i=1}^m \mathcal{L}(f, x_i, y_i)$ and $u := ((A_\ell, b_\ell))_{\ell=1}^L$. In
order to find $\nabla_{((A_\ell, b_\ell))_{\ell=1}^L} F$ we need to compute

$$\frac{\partial \mathcal{L}(f, x, y)}{\partial (A_\ell)_{i,j}} \quad \text{and} \quad \frac{\partial \mathcal{L}(f, x, y)}{\partial (b_\ell)_i}$$

for all $i, j, \ell$.
This can be achieved by backpropagation: compute the pre-activations and
activations $z_\ell := A_\ell a_{\ell-1} + b_\ell$ and $a_\ell := \rho(z_\ell)$ for $\ell \in \{1, \dots, L\}$,
starting from $a_0 := x$ (no activation is applied in the last layer). Now set
$\delta_L := 2(f(x) - y)$ (for the squared loss). Then

$$\frac{\partial \mathcal{L}(f, x, y)}{\partial A_L} = \delta_L a_{L-1}^T \quad \text{and} \quad \frac{\partial \mathcal{L}(f, x, y)}{\partial b_L} = \delta_L.$$

For $\ell = L - 1$ down to $1$ now set

$$\delta_\ell := \mathrm{diag}(\rho'(z_\ell))\, A_{\ell+1}^T \delta_{\ell+1}.$$

Then we have

$$\frac{\partial \mathcal{L}(f, x, y)}{\partial A_\ell} = \delta_\ell a_{\ell-1}^T \quad \text{and} \quad \frac{\partial \mathcal{L}(f, x, y)}{\partial b_\ell} = \delta_\ell$$

and can return $\nabla_{(A_\ell, b_\ell)_{\ell=1}^L} \mathcal{L}(f, x, y)$.
Our goal therefore is to find a stationary point of $F := \sum_{i=1}^m F_i: \mathbb{R}^N \to \mathbb{R}$,
where $F_i := \mathcal{L}(f, x_i, y_i)$.
An algorithm realizing this (stochastic gradient descent) takes as input the
data, a neural network $f$ and a loss function $\mathcal{L}$. After initializing a starting
value $u_0$ and $n = 0$: while the error is large, choose $i^* \in [m]$ uniformly at
random, update $u_{n+1} \leftarrow u_n - \eta \nabla F_{i^*}(u_n)$ and set $n \leftarrow n + 1$. Finally,
return $u_n$. A minimal implementation sketch follows below.
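The following sketch (plain NumPy, one hidden layer, squared loss; all choices
are assumptions for illustration, not the notes' specification) combines the
forward pass, the backward recursion for $\delta_\ell$ and the SGD update:

```python
import numpy as np

rng = np.random.default_rng(0)
relu, drelu = (lambda z: np.maximum(0.0, z)), (lambda z: (z > 0).astype(float))

# one-hidden-layer network; loss L(f, x, y) = (f(x) - y)^2
A1, b1 = rng.normal(0, 1, (16, 1)), np.zeros((16, 1))
A2, b2 = rng.normal(0, 1, (1, 16)), np.zeros((1, 1))

x_data = np.linspace(0, 1, 50).reshape(1, -1)
y_data = np.sin(2 * np.pi * x_data)

eta = 1e-2
for step in range(5000):
    i = rng.integers(x_data.shape[1])              # choose i* uniformly at random
    a0, y = x_data[:, [i]], y_data[:, [i]]
    z1 = A1 @ a0 + b1                              # forward pass
    a1 = relu(z1)
    a2 = A2 @ a1 + b2                              # linear output layer
    delta2 = 2 * (a2 - y)                          # delta_L = 2 (f(x) - y)
    delta1 = drelu(z1) * (A2.T @ delta2)           # backward recursion
    A2 -= eta * delta2 @ a1.T; b2 -= eta * delta2  # gradient step, layer 2
    A1 -= eta * delta1 @ a0.T; b1 -= eta * delta1  # gradient step, layer 1
```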

2 Expressivity
Expressivity is concerned with the following questions.
• Which architecture should be chosen for a particular application?
• What is the expressive power of a given architecture?
• What effect does the depth of a neural network have in this respect?
• What is the complexity of the approximating neural network?
• What are suitable function spaces to consider?
• Can deep neural networks beat the curse of dimensionality?
Mathematically, we ask under which conditions on a neural network $\Phi$
and an activation function $\rho$ every function from a prescribed function
class $\mathcal{C}$ can be arbitrarily well approximated, i.e. $\|R_\rho(\Phi) - f\| \le \varepsilon$ for all
$f \in \mathcal{C}$.

2.1 Viewpoint of approximation theory

Let a function class $\mathcal{C} \subset L^2(\mathbb{R}^d)$ and a function system $(\varphi_i)_{i \in I} \subset L^2(\mathbb{R}^d)$
be given. In order to measure the suitability of the function system for
uniform approximation of $\mathcal{C}$ we introduce the

definition 2.1.1 ($M$-term best approximation)

The error of best $M$-term approximation of $f \in \mathcal{C}$, denoted $\|f - f_M\|_2$, is

$$\inf \Big\{ \Big\| f - \sum_{i \in I_M} c_i \varphi_i \Big\|_2 : I_M \subset I,\; |I_M| = M,\; (c_i)_{i \in I_M} \subset \mathbb{R} \Big\}.$$

definition 2.1.2 (Optimal approximation rate)

The largest $\gamma > 0$ such that

$$\sup_{f \in \mathcal{C}} \|f - f_M\|_2 = O(M^{-\gamma}) \quad \text{as } M \to \infty$$

is the optimal (sparse) approximation rate of $\mathcal{C}$ by $(\varphi_i)_{i \in I}$.

We now introduce wavelets and their associated transform, which is
applied in the compression standard JPEG2000.

definition 2.1.3 ($\mathbb{R}$-Wavelet)

Let $\varphi \in L^2(\mathbb{R})$ be a scaling function and $\psi \in L^2(\mathbb{R})$ a wavelet. Then
the associated wavelet system is

$$\{\varphi(x - m) : m \in \mathbb{Z}\} \cup \big\{2^{j/2}\,\psi(2^j x - m) : j \ge 0,\; m \in \mathbb{Z}\big\}.$$

Fig. 9: A wavelet. [2]

definition 2.1.4 ($\mathbb{R}^2$-Wavelet)

The associated two-dimensional wavelet system is

$$\{\varphi^{(1)}(x - m)\}_{m \in \mathbb{Z}^2} \cup \big\{2^{j/2}\,\psi^{(i)}(2^j x - m) : j \ge 0,\; m \in \mathbb{Z}^2,\; i \in \{1, 2, 3\}\big\},$$

where

$$\varphi^{(1)}(x) := \varphi(x_1)\varphi(x_2), \quad \psi^{(1)}(x) := \varphi(x_1)\psi(x_2), \quad \psi^{(2)}(x) := \psi(x_1)\varphi(x_2), \quad \psi^{(3)}(x) := \psi(x_1)\psi(x_2).$$

Fig. 10: Tiling of the plane induced by wavelets. [2]

The wavelet transform is given by

$$f \mapsto \Big( \big(\langle f, \varphi_m \rangle\big)_m,\; \big(\langle f, \psi_{j,m,i} \rangle\big)_{j,m,i} \Big). \qquad (2)$$

Fig. 11: The wavelet transform. [2]

In order to find mathematical models for images we observe that their key
components are usually smooth regions separated by an edge, i.e. a singularity.

definition 2.1.5 (Cartoon-like functions (Donoho))
The set of cartoon-like functions is

$$\mathcal{E}^2(\mathbb{R}^2) := \{f \in L^2(\mathbb{R}^2) : f = f_0 + f_1 \cdot \mathbb{1}_B\},$$

where $\emptyset \ne B \subset [0,1]^2$ is simply connected with $C^2$-boundary and
bounded curvature and the $f_i \in C^2(\mathbb{R}^2)$ are supported in $[0,1]^2$ with
$\|f_i\|_{C^2} \le 1$, $i \in \{0, 1\}$.

Fig. 12: Anisotropic (edge-dominated) features govern natural images.

Theorem 2.1.1: Donoho (2001)

Let $(\psi_\lambda)_\lambda \subset L^2(\mathbb{R}^2)$. Allowing only polynomial depth search, we
have the following optimal behavior for $f \in \mathcal{E}^2(\mathbb{R}^2)$:

$$\|f - f_N\|_2 \asymp N^{-1} \quad \text{as } N \to \infty.$$

Unfortunately, Donoho (TODO: citation) also proved that wavelets, as one
instance of an isotropic representation system, only achieve the suboptimal
rate $\|f - f_N\|_2 \asymp N^{-1/2}$, so wavelets cannot represent cartoon-like
functions optimally. This is due to the isotropic structure of wavelets: they
are scaled by $2^{j/2}$ in both directions equally.

Fig. 13: Intuitive explanation for why wavelets don't perform as well as shearlets.

definition 2.1.6 (Shearlets (Kutyniok, Labate, 2006))
For $j \in \mathbb{Z}$ let $A_j := \mathrm{diag}\big(2^j, 2^{j/2}\big)$, let $S_k := \begin{pmatrix} 1 & k \\ 0 & 1 \end{pmatrix}$ and let $\psi$ be a
wavelet. A shearlet is the map $\psi_{j,k,m}(x) := 2^{3j/4}\,\psi(S_k A_j x - m)$, built from
the affine-linear map $x \mapsto S_k A_j x - m$.

definition 2.1.7 (Cone-adapted shearlet system)

The cone-adapted shearlet system $\mathcal{SH}(\varphi, \psi, \tilde\psi)$ generated by
$\varphi, \psi, \tilde\psi \in L^2(\mathbb{R}^2)$ is the union of

$$\{\varphi(\cdot - m) : m \in \mathbb{Z}^2\},$$
$$\{2^{3j/4}\,\psi(S_k A_j \cdot - m) : j \ge 0,\; |k| \le \lceil 2^{j/2} \rceil,\; m \in \mathbb{Z}^2\},$$
$$\{2^{3j/4}\,\tilde\psi(\tilde S_k \tilde A_j \cdot - m) : j \ge 0,\; |k| \le \lceil 2^{j/2} \rceil,\; m \in \mathbb{Z}^2\},$$

where $\tilde A_j := \mathrm{diag}(2^{j/2}, 2^j)$ and $\tilde S_k := S_k^T$.

Fig. 14: The tiling of the frequency domain induced by cone-adapted shearlets.

Theorem 2.1.2: The shearlet frame is a Parseval frame

For classical shearlets $\psi, \tilde\psi$ and all $f \in L^2(\mathbb{R}^2)$ we have ([28])

$$\|f\|_2^2 = \sum_{\sigma \in \mathcal{SH}(\varphi, \psi, \tilde\psi)} |\langle f, \sigma \rangle|^2.$$

Theorem 2.1.3: Compactly supported shearlets are optimally sparse

Let $\varphi, \psi, \tilde\psi \in L^2(\mathbb{R}^2)$ be compactly supported and let $\hat\psi, \hat{\tilde\psi}$ satisfy
certain decay conditions. Then $\mathcal{SH}(\varphi, \psi, \tilde\psi)$ provides an optimally
sparse approximation of $f \in \mathcal{E}^2(\mathbb{R}^2)$, i.e.

$$\|f - f_N\|_2 \lesssim N^{-1} (\log N)^{3/2} \quad \text{as } N \to \infty.$$

The associated algorithms (2D and 3D (parallelized) fast shearlet transforms)
have been implemented mostly at TU Berlin, e.g. in Matlab (Kutyniok, Lim,
Reisenhofer; 2013), Julia (Loarca; 2017), Python (Look; 2018) and
Tensorflow (Kutyniok, Loarca; 2019), see also www.shearlab.org.

Let us get back to the expressivity of DNNs. In general, we ask which
complexity $M(\Phi)$ (cf. definition 1.2.18) a neural network $\Phi$, which
approximates a function $f$ up to $\varepsilon > 0$, needs to have.

2.2 Universality results

We will now see the universality of shallow neural networks: every   (5.11.19?)
continuous function on a compact set can be arbitrarily well approximated
by a neural network with one single hidden layer. This is a theorem by
Cybenko (1989) and Hornik (1991).

Remark 2.2.1 Assume $\rho$ is a polynomial of degree $q \in \mathbb{N}$. Then $\rho(Ax + b)$
is also a polynomial of degree $q$ in $x$, hence $R_\rho(\Phi)$ is a polynomial of
bounded degree. In this case, $C(\mathbb{R}^d)$ cannot be approximated well.

Theorem 2.2.1: Universal Approximation Theorem

Let $\rho: \mathbb{R} \to \mathbb{R}$ be continuous but not a polynomial and $K \subset \mathbb{R}^d$
a compact set. Then for any continuous function $f: \mathbb{R}^d \to \mathbb{R}^{N_L}$
and every $\varepsilon > 0$ there exist $M, N \in \mathbb{N}$ and $\Phi \in \mathcal{NN}_{d,M,N,2}$ with

$$\sup_{x \in K} |R_\rho(\Phi)(x) - f(x)| \le \varepsilon. \qquad (3)$$

Proof. We prove that for $d \ge 1$ and $\rho \in C(\mathbb{R}, \mathbb{R})$ the following are
equivalent:
(1) $\rho$ is not a polynomial.
(2) $\overline{S} = C(K)$, where $S := \mathrm{span}\{\rho(\langle w, x \rangle - b)|_K : w \in \mathbb{R}^d,\; b \in \mathbb{R}\}$
and the closure is taken in $C(K)$.

"(1) $\Longrightarrow$ (2)":
(a) First consider $d = 1$ and smooth $\rho$. By (1) there exists an $x_0 \in \mathbb{R}$
such that $\rho^{(k)}(-x_0) \ne 0$ for all $k \in \mathbb{N}$. By the Stone-Weierstrass theorem
it suffices to show that $\overline{S}$ contains the polynomials on $K$, as they are
dense in $C(K)$.
We have $\rho(hx - x_0) \to \rho(-x_0)$ as $h \to 0$, thus the constant functions are
well approximated. Furthermore,

$$\frac{1}{h}\big(\rho((\lambda + h)x - x_0) - \rho(\lambda x - x_0)\big) \xrightarrow{h \to 0} x\,\rho'(\lambda x - x_0) \xrightarrow{\lambda \to 0} x\,\rho'(-x_0),$$

thus the linear functions are well approximated. Iterating this argument with
$k$-th order difference quotients in $\lambda$ yields $x^k \rho^{(k)}(-x_0) \ne 0$, so all
monomials and, by linearity, all polynomials are well approximated.
(b) Now consider arbitrary $d \ge 1$ and smooth $\rho$. The space

$$\mathrm{span}\{g(\langle w, x \rangle - b) : w \in \mathbb{R}^d,\; b \in \mathbb{R},\; g \in C(\mathbb{R})\} \subset C(K)$$

is dense in $C(K)$ (ridge functions). Thus, every $f \in C(K)$ can be
approximated by

$$\sum_{i=1}^N a_i\, g_i(\langle w_i, x \rangle - b_i), \quad a_i, b_i \in \mathbb{R},\; w_i \in \mathbb{R}^d,\; g_i \in C(\mathbb{R}).$$

By (a), each univariate function $t \mapsto g_i(t - b_i)$ can in turn be well
approximated on the relevant compact range by elements of $S$.
(c) For non-smooth $\rho$ use a mollification $(g_\varepsilon)_{\varepsilon > 0}$ such that $\rho * g_\varepsilon \to \rho$
as $\varepsilon \searrow 0$ and apply the above to the smooth functions $\rho * g_\varepsilon$.
Another version of this proof can be found in the original paper, [7]. □
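To see the theorem at work numerically, here is a sketch (not from the notes;
it fixes a random first layer and fits only the output weights by least squares,
which already suffices for a good uniform fit on $[0,1]$):

```python
import numpy as np

rng = np.random.default_rng(0)
f = lambda x: np.sin(3 * np.pi * x) + 0.5 * np.cos(7 * x)     # continuous target on K = [0, 1]

N = 200                                                        # hidden width
w, b = rng.normal(size=N), rng.uniform(-1, 2, size=N)          # random first layer
x = np.linspace(0, 1, 1000)
features = np.maximum(0.0, np.outer(x, w) - b)                 # rho(<w, x> - b), rho = ReLU
coef, *_ = np.linalg.lstsq(features, f(x), rcond=None)         # fit the output layer only

print("sup error on K:", np.max(np.abs(features @ coef - f(x))))
```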

Generally, neural networks have huge approximation power: The following


theorem by Maiorov and Pinkus (1999) doesn’t restrict the size of the
weights!


Theorem 2.2.2: Universal network theorem

There exists an activation function $\rho: \mathbb{R} \to \mathbb{R}$ such that for any
$d \in \mathbb{N}$, compact $K \subset \mathbb{R}^d$, continuous $f: K \to \mathbb{R}$ and $\varepsilon > 0$ there
exist $M, N \in \mathbb{N}$ (only dependent on $d$) and $\Phi \in \mathcal{NN}_{d,M,N,3}$ such
that (3) holds.

2.3 Lower bounds for approximation

The questions we will try to answer in this section are:   (26.11.19)

• How well can functions be approximated by neural networks with
  few non-zero weights (sparse connectivity)?
  – Can we derive a lower bound on the necessary number of weights?
  – Can we construct neural networks which attain this bound?
• Do neural networks approximate as well as wavelets and shearlets?

definition 2.3.1 (Vapnik-Chervonenkis dimension)

Let $X$ be a set, $S \subset X$ and $\mathcal{H} \subset \{h: X \to \{0, 1\}\}$. Define the
restriction of the function class $\mathcal{H}$ to $S$ as $\mathcal{H}|_S := \{h|_S : h \in \mathcal{H}\}$. The
VC dimension of $\mathcal{H}$ is

$$\mathrm{VCdim}(\mathcal{H}) := \sup \Big\{ m \in \mathbb{N} : \sup_{|S| \le m} |\mathcal{H}|_S| = 2^m \Big\}.$$

The VC dimension is a tool for understanding the classification capabilities
of a function class, as $2^m = |\{0,1\}^m|$ is the number of possible
combinations of binary labels on a set of $m$ points. The VC dimension of
$\mathcal{H}$ is the largest integer $m$ such that there exists a set $S \subset X$ of
$m$ points for which $\mathcal{H}|_S$ has the maximum possible cardinality $2^m$.

Example 2.3.2 (VC-Dimension) Let $X := \mathbb{R}^2$, $h := \mathbb{1}_{\mathbb{R}_+}$ and

$$\mathcal{H} := \Big\{ h\Big( \Big\langle \begin{pmatrix} \cos\theta \\ \sin\theta \end{pmatrix}, \cdot \Big\rangle - t \Big) : \theta \in [0, \pi],\; t \in \mathbb{R} \Big\}.$$

Then $\mathcal{H}$ has VC-dimension 3, as at most 3 (non-collinear) points can be
shattered by a linear classifier from $\mathcal{H}$ (see the brute-force check below).
A hypothesis class with infinite VC-dimension is $(t \mapsto \alpha \sin(\omega t))_{\alpha, \omega \in \mathbb{R}}$. ◇

Fig. 15: All eight possible ways to shatter three non-collinear points in the plane. [8]
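A brute-force check of the shattering claim (a sketch, not from the notes; the
classifiers here use an arbitrary unit normal and offset, a family containing
the class $\mathcal{H}$ above):

```python
import numpy as np
from itertools import product

pts = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])   # three non-collinear points

def realizable(labels, trials=20000, seed=0):
    """Search for a halfplane 1_{<v, x> - t > 0} producing the given labeling."""
    rng = np.random.default_rng(seed)
    for _ in range(trials):
        theta, t = rng.uniform(0, 2 * np.pi), rng.uniform(-2, 2)
        v = np.array([np.cos(theta), np.sin(theta)])
        if np.array_equal((pts @ v - t > 0).astype(int), labels):
            return True
    return False

print(all(realizable(np.array(lbl)) for lbl in product([0, 1], repeat=3)))  # True: shattered
```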

Theorem 2.3.1: VC dimension of DNNs [29]

Let $\rho$ be a piecewise polynomial with $p$ pieces of degree at most $\ell$,
$h := \chi_{\mathbb{R}_+}$ and

$$\mathcal{H}_{N,M,d,L} := \{h \circ R_\rho(\Phi) : \Phi \in \mathcal{NN}_{d,M,N,L}\}$$

for $N, M, d \in \mathbb{N}$. Then

$$\mathrm{VCdim}(\mathcal{H}_{N,M,d,L}) \le 2ML \log_2\Big(\frac{4MLpN}{\ln 2}\Big) + 2ML^2 \log_2(\ell + 1) + 2L$$

holds.

Corollary 2.3.3
For fixed $p$ and $\ell$ and $N \le M$ this implies

$$\mathrm{VCdim}(\mathcal{H}_{N,M,d,L}) \in O\big(ML \log_2(M) + ML^2\big).$$
Theorem 2.3.2: Smooth function approximation


[5]

Let L P N and H :“ th P C n pr0, 1sd q : }h}C n ď 1u and % the ReLU.


If

@ε ą 0 @h P H DΦ P N N d,M pεq,M pεq,L : }h ´ R% pΦq}8 ă ε,

holds,
` d ˘
M pεq P Ω ε´ np1`δq

follows for all δ ą 0.

Proof. For convinience let η :“ 1 ` δ. Towards contradiction assume


that for all h P H there exists a Φh P N N d,M,M,` with ` ď L such that
1 ´ ηn
}h ´ R% pΦh q}8 ď M d (4)
8
for some δ ą 0.
Let m :“ M η and pxk qm d
k“1 Ă r0, 1s such that |xi ´ xj | ě M for all
i, j P rms. Around the xi construct ϕi P H with disjoint support and
η
diameter M ´ d . Additionally require
1 η
ϕi pxi q “ ¨ M ´ d @i P rms (5)
2
For every g : tx1 , . . . , xm u Ñ R there exist pdk qm
k“1 Ă t0, 1u
m
such that Fig. 16: A visualization of the ϕi . [5]

m
ηn ÿ
gpxq “ 2M d di ϕi pxq.
i“1

By (4) there exists a neural network Φ P N N d,M,M,` such that


› ›
m
›ÿ › 1 ηn
› di ϕi pxq ´ R% pΦq› ď M ´ d (6)
› ›
›i“1 › 8
8

WOHER BLEIBT DER VORFAKTOR VON g??? By (5) and


(6) we have
$
ηn 1 &ą 0, if gpxi q “ 1,
2M d R% pΦqpxi q ´
2 %ď 0, if gpxi q “ 0

for all i P rms, hence


ˆ ˙
η 1
χR` 2M d R% pΦq ´ ”g on tx1 , . . . , xm u
2

15
2.3 Lower bounds for approximation

holds. By corollary 2.3.3 we have

m “ M η P OpM logpM qq @δ ą 0,

which is a contradiction. l

definition 2.3.4 (Rate distortion theory)

Let $d \in \mathbb{N}$, $\Omega \subset \mathbb{R}^d$ and $\mathcal{C} \subset L^2(\Omega)$.
1. For $\ell \in \mathbb{N}$ the sets

$$\mathfrak{E}^\ell := \big\{E: \mathcal{C} \to \{0,1\}^\ell\big\}, \qquad \mathfrak{D}^\ell := \big\{D: \{0,1\}^\ell \to L^2(\Omega)\big\}$$

are the sets of binary encoders (decoders) of length $\ell$.
2. An encoder-decoder pair $(E, D) \in \mathfrak{E}^\ell \times \mathfrak{D}^\ell$ achieves distortion
$\varepsilon > 0$ over $\mathcal{C}$ if $\sup_{f \in \mathcal{C}} \|D(E(f)) - f\|_{L^2} \le \varepsilon$ holds.
3. For $\varepsilon > 0$ we define the minimal code length

$$L(\varepsilon, \mathcal{C}) := \min\big\{ \ell \in \mathbb{N} : \exists (E, D) \in \mathfrak{E}^\ell \times \mathfrak{D}^\ell \text{ achieving distortion } \varepsilon \big\}$$

and the optimal exponent

$$\gamma^*(\mathcal{C}) := \inf\big\{ \gamma \in \mathbb{R} : L(\varepsilon, \mathcal{C}) \in O(\varepsilon^{-\gamma}) \big\}.$$

Example 2.3.5 (Optimal exponent)

Theorem 2.1.1 states that $\gamma^*(\mathcal{E}^2(\mathbb{R}^2)) \ge 1$. If $\mathcal{C} \subset B^s_{p,q}(\mathbb{R}^d)$ is bounded,
we have $\gamma^*(\mathcal{C}) = \frac{d}{s}$. ◇

Example 2.3.6 (Mapping functions to bit sequences)

Let

$$\mathcal{C} := \{f \in C([0,1], [0,1]) : |f(x) - f(y)| \le |x - y|\}.$$

For $x_1 < \dots < x_n$ in $[0,1]$ define

$$E^\ell: \mathcal{C} \to \{0,1\}^\ell, \quad f \mapsto \big( \mathrm{bin}_{\ell/n}(f(x_1)), \dots, \mathrm{bin}_{\ell/n}(f(x_n)) \big),$$

where $\mathrm{bin}_k(c)$ is the binary approximation (representation) of $c$ of length
$k$. Define $D^\ell: \{0,1\}^\ell \to L^2([0,1])$ by

$$D^\ell(c) := \sum_{i=1}^n \mathrm{Num}\Big( c_{\frac{\ell(i-1)}{n} + 1}, \dots, c_{\frac{\ell i}{n}} \Big)\, \mathbb{1}_{[x_{i-1}, x_i]},$$

where $\mathrm{Num}(c)$ is the real value associated with the bit sequence $c$.
The distortion of $(E^\ell, D^\ell)$ is bounded by

$$2^{-\ell/n} + \max_{k} \big( |x_k - x_{k+1}|, |x_n - 1| \big).$$

Fig. 17: Visualization of the encoder and decoder.

Now choosing

$$n \sim \frac{\ell}{\log(\ell)} \quad \text{and} \quad |x_i - x_{i+1}| \sim \frac{1}{n}$$

yields a distortion of order $\frac{\log(\ell)}{\ell}$. ◇
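A numerical sketch of this encoder-decoder pair (assumptions: uniform grid
points, $\ell/n$ bits per value, sup-norm error as a proxy for the distortion; not
from the notes):

```python
import numpy as np

f = lambda x: np.abs(x - 0.4)            # 1-Lipschitz function [0,1] -> [0,1]
n, bits = 16, 8                          # n sample points, bits per value (ell = n * bits)
xs = np.linspace(0, 1, n)

def encode(f):                           # E^ell: quantize f(x_i) to `bits` binary digits
    return [int(f(x) * (2 ** bits)) for x in xs]

def decode(code):                        # D^ell: piecewise constant reconstruction
    vals = np.array(code) / (2 ** bits)
    return lambda t: vals[np.minimum(np.searchsorted(xs, t, side="right") - 1, n - 1)]

g = decode(encode(f))
t = np.linspace(0, 1, 5000)
print("distortion:", np.max(np.abs(f(t) - g(t))))   # roughly 2^{-bits} + max grid spacing
```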


Theorem 2.3.3: $\gamma^*$ and $N$-term approximation

For $\mathcal{C} \subset L^2(\mathbb{R}^d)$ the optimal $N$-term approximation rate is $\frac{1}{\gamma^*(\mathcal{C})}$.

Theorem 2.3.4: A fundamental lower bound [30]

Let $d \in \mathbb{N}$, $\rho: \mathbb{R} \to \mathbb{R}$, $c > 0$, $\mathcal{C} \subset L^2(\mathbb{R}^d)$ and

$$\mathrm{Learn}: \Big(0, \frac{1}{2}\Big) \times \mathcal{C} \to \mathcal{NN}_{d,\infty,\infty,\infty}$$

be an abstract learning procedure such that all weights of
$\mathrm{Learn}(\varepsilon, f)$ can be encoded with $-c \log_2(\varepsilon)$ bits. If
$\sup_{f \in \mathcal{C}} \|R_\rho(\mathrm{Learn}(\varepsilon, f)) - f\|_{L^2} \le \varepsilon$ holds for all $\varepsilon$, then

$$\sup_{\varepsilon \in (0, \frac{1}{2})} \varepsilon^\gamma \sup_{f \in \mathcal{C}} M(\mathrm{Learn}(\varepsilon, f)) = \infty$$

holds for all $\gamma < \gamma^*(\mathcal{C})$.

Remark 2.3.7 Equivalently, for every $\gamma < \gamma^*(\mathcal{C})$ there is no constant $C > 0$ with

$$\sup_{f \in \mathcal{C}} M(\mathrm{Learn}(\varepsilon, f)) \le C \varepsilon^{-\gamma} \quad \forall \varepsilon > 0.$$

Therefore, the number of edges is bounded below (along a sequence of $\varepsilon$) by $\varepsilon^{-\gamma}$.

Remark 2.3.8 This bound is in terms of the edges, hence the sparsity
of the connectivity, not the neurons. However, the number of neurons is
always bounded, up to a uniform constant, by the number of edges.

Fig. 18: The red line represents $\varepsilon^{-\gamma}$ and the blue one $\mathrm{Learn}(\varepsilon, \mathcal{C})$.

(28.11.2019)

Proof. Let $f \in \mathcal{C}$ and $\varepsilon > 0$. Towards a contradiction assume there exists
a $C > 0$ with

$$M := M(\mathrm{Learn}(\varepsilon, f)) \le C \varepsilon^{-\gamma}, \quad \gamma < \gamma^*(\mathcal{C}).$$

Each weight is encoded by $-c \log_2(\varepsilon)$ bits and $N(\mathrm{Learn}(\varepsilon, f)) \le 2M$.
Then $\mathrm{Learn}(\varepsilon, f)$ can be encoded by

$$\underbrace{2M \log_2(2M)}_{\text{position encoding of the weights}} + \underbrace{(-c \log_2(\varepsilon))\,M}_{\text{weights}} \;\le\; \tilde C \varepsilon^{-\gamma} \log(\varepsilon^{-1}) =: C(\varepsilon, \gamma)$$

bits without losing information. Thus there exists an invertible encoder

$$T_\varepsilon: \mathcal{NN}_{d, \lfloor \hat c\,\varepsilon^{-\gamma} \rfloor, \lfloor \bar c\,\varepsilon^{-\gamma} \rfloor, L} \to \{0,1\}^{C(\varepsilon, \gamma)}.$$

Now set

$$E^\varepsilon: \mathcal{C} \to \{0,1\}^{C(\varepsilon, \gamma)}, \quad f \mapsto T_\varepsilon(\mathrm{Learn}(\varepsilon, f)),$$
$$D^\varepsilon: \{0,1\}^{C(\varepsilon, \gamma)} \to L^2(\mathbb{R}^d), \quad c \mapsto R_\rho(T_\varepsilon^{-1}(c)).$$

By construction, for all $f \in \mathcal{C}$,

$$\|D^\varepsilon(E^\varepsilon(f)) - f\|_{L^2} = \|R_\rho(T_\varepsilon^{-1}(T_\varepsilon(\mathrm{Learn}(\varepsilon, f)))) - f\|_{L^2} = \|R_\rho(\mathrm{Learn}(\varepsilon, f)) - f\|_{L^2} < \varepsilon$$

holds. We have therefore constructed an encoder-decoder pair of distortion
$\varepsilon$ and bit length of order $\varepsilon^{-\gamma} \log(\varepsilon^{-1})$, which is $O(\varepsilon^{-\gamma'})$ for any
$\gamma < \gamma' < \gamma^*(\mathcal{C})$. But $\gamma^*(\mathcal{C})$ is the smallest exponent for which such a
pair of length $O(\varepsilon^{-\gamma'})$ can exist. This is a contradiction. □

But what happens in the border case $\gamma = \gamma^*(\mathcal{C})$?

2.4 Upper bounds and optimality

We investigate whether we can construct neural networks which attain the
bounds proven above and how wavelets and shearlets compare to neural
networks in this regard.

definition 2.4.1 (Approximation method)

Let $K \subset \mathbb{R}^d$ be a compact set. A family of continuous coefficient
mappings $A_N: L^2(K) \to \mathbb{R}^N$, $N \in \mathbb{N}$, and reconstruction
mappings $R_N: \mathbb{R}^N \to L^2(K)$ is called an approximation method.

Example 2.4.2 $A_N(g)$ could be the first $N$ shearlet coefficients of $g$. ◇

Our larger goal is to prove the following bold statement:

Theorem 2.4.1

Let $(A_N, R_N)_{N \in \mathbb{N}}$ be an approximation method and $\rho \in C(\mathbb{R}, \mathbb{R})$
not a polynomial. Then for every $N \in \mathbb{N}$ and every $g \in L^2(K)$
there exists a neural network $\Phi$ with $M(\Phi) \le N\,\mathrm{polylog}(N)$ and

$$\|g - R_\rho(\Phi)\|_{L^2(K)} \le \|g - (R_N \circ A_N)(g)\|_{L^2(K)}.$$

Therefore, a ReLU neural network performs at least as well as any such
approximation method, at the cost of only polylogarithmically more weights.

ReLU Calculus
Notice that the ReLU activation function is not much different from
using any other piecewise linear activation function with finitely many
breakpoints: one can replace one network by an equivalent one having
another activation function while only increasing the number of units
and weights by constant factors ([5], Proposition 1).

Fig. 19: A neural network representing the hat function. [2]

Define the hat function

$$g: \mathbb{R} \to [0,1], \quad x \mapsto (1 - |2x - 1|)\, \mathbb{1}_{[0,1]}(x) = \begin{cases} 2x, & \text{if } x \in \big[0, \tfrac{1}{2}\big], \\ 2(1 - x), & \text{if } x \in \big(\tfrac{1}{2}, 1\big], \\ 0, & \text{else}, \end{cases}$$

which can also be written as

$$g(x) = \mathrm{ReLU}\Big( 2\,\mathrm{ReLU}(x) - 4\,\mathrm{ReLU}\Big(x - \frac{1}{2}\Big) \Big) = 2 \min(x, 1 - x)\, \mathbb{1}_{[0,1]}(x),$$

and the sawtooth function $g_m := \underbrace{g \circ \dots \circ g}_{m \text{ times}}$ [6].

Fig. 20: The functions $g_m$ for $m \in \{0, 1, 2\}$. [6], Subsection 2.2.

Lemma 2.4.3 (Square function [5])
For all $\varepsilon > 0$ there exists a neural network $\Phi$ with $O(\log(\varepsilon^{-1}))$ layers
and nodes such that $\sup_{x \in [0,1]} |x^2 - R_{\mathrm{ReLU}}(\Phi)(x)| \le \varepsilon$.

Proof. We have $x^2 = x - \sum_{m=1}^\infty 2^{-2m} g_m(x)$. Defining
$f_k(x) := x - \sum_{m=1}^k 2^{-2m} g_m(x)$ yields $\|x^2 - f_k(x)\|_\infty = 2^{-2(k+1)}$
(see the footnote below), and $f_k$ can be realized by a neural network with
$O(k)$ layers and nodes. Setting $\varepsilon = 2^{-2(k+1)}$ yields
$k \in O\big(\log_2\big(\frac{1}{\varepsilon}\big)\big)$. □
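A short numerical check of this construction (a sketch, not from the notes):

```python
import numpy as np

g = lambda x: np.clip(1 - np.abs(2 * x - 1), 0, None) * (x >= 0) * (x <= 1)  # hat function

def f_k(x, k):
    """Sawtooth approximation of x^2: f_k(x) = x - sum_{m<=k} 2^{-2m} g_m(x)."""
    out, gm = x.copy(), x.copy()
    for m in range(1, k + 1):
        gm = g(gm)                        # g_m = g o ... o g (m times)
        out -= gm / 4 ** m                # subtract 2^{-2m} g_m
    return out

x = np.linspace(0, 1, 100001)
for k in range(1, 6):
    err = np.max(np.abs(x ** 2 - f_k(x, k)))
    print(k, err, 4.0 ** (-(k + 1)))      # columns agree (up to grid resolution): err = 2^{-2(k+1)}
```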

Lemma 2.4.4 (Multiplication [5])

For $\varepsilon > 0$ there exists a neural network $\Phi$ with $O(\log(\varepsilon^{-1}))$ layers
and nodes such that $\sup_{(x_1, x_2) \in [0,1]^2} |x_1 x_2 - R_{\mathrm{ReLU}}(\Phi)(x_1, x_2)| \le \varepsilon$.

Fig. 21: The functions $f_k$ for $k \in \{1, 2\}$ and $x^2$. [1]

Proof. This follows directly from $x_1 x_2 = \frac{1}{2}\big[(x_1 + x_2)^2 - (x_1^2 + x_2^2)\big]$ and
the previous lemma. □

Lemma 2.4.5 (Polynomials [5])

For $\varepsilon > 0$ there exists a neural network $\Phi$ with $O(\mathrm{polylog}(\varepsilon^{-1}))$ layers
and nodes such that $\sup_{x \in [0,1]} \big| \sum_{i=0}^\ell a_i x^i - R_{\mathrm{ReLU}}(\Phi)(x) \big| \le \varepsilon$.

Fig. 22: Neural network realizing multiplication. (really?) [5], Subsection 3.1.

Proof. Monomials can be approximated by iteratively approximating the
square and the multiplication with the identity (cf. Lemma 1.2.22). The
result then follows, as neural networks are closed under linear combinations. □

Lemma 2.4.6 (Smooth functions I [5])

Suppose that $f$ is Gevrey regular, i.e. there exist $C, R, \sigma > 0$ such that
$\sup_{x \in [0,1]} |f^{(k)}(x)| \le C R^k (k!)^\sigma$ for all $k \in \mathbb{N}$. Then for all $\varepsilon > 0$ there
exists a neural network $\Phi$ with $O(\mathrm{polylog}(\varepsilon^{-1}))$ layers and nodes such
that $\sup_{x \in [0,1]} |f(x) - R_{\mathrm{ReLU}}(\Phi)(x)| \le \varepsilon$.

Footnote 1: It suffices to consider the last interval $[1 - 2^{-k}, 1]$, as
$\frac{\mathrm{d}^2}{\mathrm{d}x^2}(x^2) > 0$. The linear interpolation of $x^2$ on this interval is
$\bar f_k(x) := (2 - 2^{-k})x + 2^{-k} - 1$. By differentiating (maximum at
$x = 1 - 2^{-(k+1)}$) one obtains

$$\|x^2 - \bar f_k(x)\|_\infty = \sup_{x \in [1 - 2^{-k}, 1]} (2 - 2^{-k})x + 2^{-k} - 1 - x^2 = 2^{-2k-2}.$$

Proof. As $f$ can be well approximated by local polynomials, this follows
from Lemma 2.4.5. □

Theorem 2.4.2: Smooth functions II [5]

Let $f$ satisfy $\|f\|_{C^n([0,1])} \le 1$. Then for all $\varepsilon > 0$ there exists a
neural network $\Phi$ with $O(\log(\varepsilon^{-1}))$ layers and $O(\varepsilon^{-\frac{1}{n}})$ nodes such
that $\sup_{x \in [0,1]} |f(x) - R_{\mathrm{ReLU}}(\Phi)(x)| \le \varepsilon$.

Proof (sketch). As $f$ can be approximated by local polynomials, we
can use Lemma 2.4.5 and argue the optimality via the VC dimension. □

Remark 2.4.7 This bound is optimal in view of Theorem 2.3.2.

2.5 Approximation by Wavelets, Shearlets and more

definition 2.5.1 (Affine System)
Let $d \in \mathbb{N}$, $(A_j)_{j \in \mathbb{N}} \subset \mathrm{GL}(\mathbb{R}^d)$ and let $(\psi_s)_{s=1}^S \subset L^2(\mathbb{R}^d)$ be
compactly supported. Then

$$\Big\{ |\det A_j|^{\frac{1}{2}}\, \psi_s(A_j x - b) : s \in [S],\; b \in \mathbb{Z}^d,\; j \in \mathbb{N} \Big\}$$

is an affine system.

Remark 2.5.2 The factor $|\det A_j|^{1/2}$ ensures that all $\varphi_i$, where $i$ is an
enumeration of the indices in the above set, have equal norm.

Example 2.5.3 Wavelet and shearlet systems are affine. ◇

Theorem 2.5.1: Memory-optimal NNs [30]

Let $\Omega \subset \mathbb{R}^d$ be bounded, $(\varphi_i)_i \subset L^2(\Omega)$ an affine system with
$(\psi_s)_{s=1}^S$ defined as before and $\rho: \mathbb{R} \to \mathbb{R}$ an activation function.
Assume there exists $\varepsilon > 0$ such that for all $D > 0$ and all $s$

$$\exists \Phi^{D,\varepsilon} \in \mathcal{NN}_{d,C,2C,L}: \|\psi_s - R_\rho(\Phi^{D,\varepsilon})\|_{L^2([-D,D]^d)} \le \varepsilon.$$

Let $\mathcal{C} \subset L^2(\mathbb{R}^d)$. If there exist $\varepsilon > 0$, $M \in \mathbb{N}$ and $f \in L^2(\Omega) \cap \mathcal{C}$
such that there exist coefficients $(d_i)_{i=1}^M$ with

$$\Big\| f - \sum_{i=1}^M d_i \varphi_i \Big\| \le \varepsilon,$$

then there exists a deep neural network $\Phi$ with $O(M)$ edges such
that $\|f - R_\rho(\Phi)\| \le 2\varepsilon$.

This produces memory-optimal DNNs.

Proof. We have $\|\psi_s - R_\rho(\Phi_s)\| \le \varepsilon$ with $M(\Phi_s) \le C$. Since the map
$x \mapsto A_j x - b$ is affine-linear, we also have $\|\varphi_i - R_\rho(\tilde\Phi_i)\| \le \varepsilon$ and
$M(\tilde\Phi_i) \le C$. Combining the $\tilde\Phi_i$ in parallel with a final weighted-sum
layer (see the picture on the right) gives $M(\Phi) \in O(M)$, and by the triangle
inequality

$$\|f - R_\rho(\Phi)\| \le \Big\| f - \sum_{i=1}^M d_i \varphi_i \Big\| + \Big\| \sum_{i=1}^M d_i \varphi_i - R_\rho(\Phi) \Big\| \le 2\varepsilon. \qquad □$$

Fig. 23: TODO

Corollary 2.5.4 (Memory-optimal NNs and affine systems)

If an affine system $(\varphi_i)_i \subset L^2(\mathbb{R}^d)$ satisfies that
• for each $i$ there exists a neural network $\Phi_i$ with at most $C > 0$
  edges such that $\varphi_i = R_\rho(\Phi_i)$,
• there exists a $\tilde C > 0$ such that for all $f \in \mathcal{C} \subset L^2(\mathbb{R}^d)$ we have
  $\big\| f - \sum_{i=1}^M f_i \varphi_i \big\| \le \tilde C M^{-\frac{1}{\gamma^*(\mathcal{C})}}$,

then every $f \in \mathcal{C}$ can be approximated up to an error of $\varepsilon$ by a neural
network with only $O(\varepsilon^{-\gamma^*(\mathcal{C})})$ edges.

Recall that if a neural network stems from a fixed learning procedure
$\mathrm{Learn}$, then for all $\gamma < \gamma^*(\mathcal{C})$ there does not exist a $C > 0$ such that

$$\sup_{f \in \mathcal{C}} M(\mathrm{Learn}(\varepsilon, f)) \le C \varepsilon^{-\gamma} \quad \forall \varepsilon > 0.$$

Proof. By the previous theorem there exists a neural network $\Phi$ with
$R_\rho(\Phi) = \sum_{i=1}^M f_i \varphi_i$ and $M(\Phi) \in O(M)$. We have

$$\Big\| f - \sum_{i=1}^M f_i \varphi_i \Big\| \le \tilde C M^{-\frac{1}{\gamma^*(\mathcal{C})}} =: \varepsilon \iff M \in O\big(\varepsilon^{-\gamma^*(\mathcal{C})}\big). \qquad □$$

In summary, we showed that memory-optimal neural networks can be
constructed by mimicking $N$-term approximation.
We have now completed nearly every step of our road map for this section:
1. Find a model function class $\mathcal{C} \subset L^2(\mathbb{R}^2)$: cartoon-like functions.
2. Find an associated representation system (shearlets) with the following
   properties:
   • it provides optimally sparse approximations for $\mathcal{C}$ (has been proven),
   • its elements can be realized by a neural network with a controlled
     number of edges. This is still missing!

In order to approximate the $\psi_s$ we construct wavelet generators (LeCun,
1987), (Shaham, Cloninger, Coifman; 2017): Let $\rho$ be the ReLU and
define $t(x) := \rho(x) - \rho(x-1) - \rho(x-2) + \rho(x-3)$, which can be constructed
with a two-layer network. Now $\varphi(x, y) := \rho(t(x) + t(y) - 1)$ yields a 2D
bump function. Summing up shifted versions of $\varphi$, which can be realized
by just one additional layer, yields a function $\psi_s$ with vanishing moments.

Fig. 24: The functions $\varphi(x, y)$ and $t(x)$.

Those $\psi$ will not be differentiable. A smoothed version of the ReLU leads
to appropriate shearlet generators.
We will now see that function classes which are optimally representable
by shearlets, etc., are also optimally approximated by memory-efficient
neural networks with a parallel architecture:

Theorem 2.5.2: Optimal approximation [30]

Let $\rho$ be an admissible smooth activation function and $\varepsilon > 0$.
Then there exists $C_\varepsilon > 0$ such that for all $f \in \mathcal{E}^2(\mathbb{R}^2)$ and $N \in \mathbb{N}$
there exists a $\Phi \in \mathcal{NN}_{2, O(N), O(N), 3}$ satisfying

$$\|f - R_\rho(\Phi)\|_{L^2} \le C_\varepsilon N^{-1+\varepsilon}.$$

Remark 2.5.5 This is the optimal rate; hence the first bound is sharp!

Some numerics
Typically, weights are learnt by backpropagation. This raises the following
question: does this lead to the optimal sparse connectivity?
In our setup we consider a fixed ReLU network architecture and learn
specific functions with stochastic gradient descent. We then analyze the
learnt subnetworks to study the connection between the approximation
error and the number of edges.

Fig. 25: Numerical experiments with ReLU and backpropagation. [Source??]

Feel free to verify that we have answered all the questions posed at the
beginning of subsection 2.3.

2.6 Impact of Depth

Our next theorem, in rough terms, states that   (03.12.19)


"There exists a simple (approximately radial) function on Rd , expressible


by a 3-layer neural network of width polynomial in the dimension d, which
cannot be arbitrarily well approximated by 2-layer networks, unless their
width is exponential in d."

Remark 2.6.1 This result holds for virtually all known activation func-
tions, including rectified linear units, sigmoids and thresholds (i.e 1r0,8q ).
It shows that depth - even if increased by 1 - can be exponentially more
valuable than width for standard feedforward neural networks.
Assumption 1: The activation function % is measurable and satisfies

|%pxq| ď Cp1 ` |x|α q @x P R, C, α ą 0.

Assumption 2: There exists a constant c% ě 1 such that for any Lipschitz


function f : R Ñ R with Lipschitz constant L which is constant outside a
bounded interval r´R, Rs and for any δ ą 0 there exist scalars a, αi , βi , γi ,
i P rws with w ď c% RL
δ and
ˇ ˇ
ˇ w
ÿ ˇ
sup ˇf pxq ´ a ´ αk %pβk x ´ γk qˇ ď δ.
ˇ ˇ
xPRd ˇ k“1
ˇ

The second assumption guarantees that f can be approximated by a


neural network with one hidden layer.

Theorem 2.6.1: From two to three layers [31]

If $\rho$ satisfies the two assumptions above, then there exist universal
constants $c, C > 0$ such that the following holds: for every
dimension $d > C$ there exist a probability measure $\mu$ on $\mathbb{R}^d$ and
a function $g: \mathbb{R}^d \to \mathbb{R}$ with the following properties:
1. $g$ is bounded in $[-2, 2]$, supported on $\{x \in \mathbb{R}^d : \|x\| \le C\sqrt{d}\}$
   and expressible by a 3-layer network of width $C c_\rho\, d^{19/4}$.
2. Every function $f$ expressible by a 2-layer network of width
   at most $c e^{cd}$ satisfies $\mathbb{E}_{x \sim \mu}\big[(f(x) - g(x))^2\big] \ge c$.

Proof (sketch). 3-layer case: Consider radial functions $g(x) = h(\|x\|^2)$, where
$h: \mathbb{R} \to \mathbb{R}$. By Assumption 2, $x \mapsto x^2$ can be approximated by a linear
combination of neurons; then $\|x\|_2^2 = \sum_{i=1}^d x_i^2$ can be approximated on
any bounded domain. The next layer approximates $h$: three layers in total.
2-layer case: $\mu$ corresponds to a probability density function $\varphi^2$. Property
2 is equivalent to

$$\int (f(x) - g(x))^2 \varphi^2(x)\,\mathrm{d}x = \|f\varphi - g\varphi\|_2^2 \overset{\text{(P)}}{=} \|\widehat{f\varphi} - \widehat{g\varphi}\|_2^2 \ge c,$$

where (P) is the Plancherel theorem.
Choose $\varphi := \widehat{\mathbb{1}_{B(0,R_d)}} = \big(\tfrac{R_d}{\|x\|}\big)^{d/2} J_{d/2}(2\pi R_d \|x\|)$ for a suitable
radius $R_d$, where $J_k$ is a Bessel function; then $\varphi^2$ is a density function. If
$f$ is expressible by a 2-layer network, $\hat f$ has a special form: for this, consider

$$f: \mathbb{R}^d \to \mathbb{R}, \quad x \mapsto \sum_{i=1}^k f_i(\langle v_i, x \rangle).$$

Fig. 26: Illustration of $\varphi$ for $d = 2$. [2]

Then (using distribution theory) it holds that $\mathrm{supp}(\hat f) \subset \bigcup_{i=1}^k \mathrm{span}(v_i)$.
Then

$$\widehat{f\varphi} = \hat f * \hat\varphi = \hat f * \mathbb{1}_{B(0,R_d)}$$

and therefore

$$\mathrm{supp}(\widehat{f\varphi}) \subset \bigcup_{i=1}^k \big( \mathrm{span}(v_i) + B(0, R_d) \big) =: T,$$

where $T$ is a union of tubes of fixed width through the origin. Impact of the
dimension: unless $k$ is exponentially large in $d$, $T$ is very sparse at large
distances from the origin, i.e.

$$\frac{\mathrm{Vol}(T \cap S^{d-1})}{\mathrm{Vol}(S^{d-1})} \le k e^{-d}.$$

Goal: find $g$ such that $g\varphi$ is radial and $\widehat{g\varphi}$ has most of its energy away
from zero.
Main technical part (ansatz): $\varphi(x) \cdot \hat g(x) = \sum_i \varepsilon_i c_i g_i$, where the $g_i$ are
carefully chosen with regard to their transforms $\hat g_i$. □

We will now take a look at compositional functions and a theorem from [32],
which roughly states that "Deep (hierarchical) networks can approximate
the class of compositional functions with the same accuracy as shallow
networks but with exponentially lower number of (training) parameters."

Fig. 27: Visual intuition for the theorem. [2]

definition 2.6.2 (Compositional function)

A function $f: \mathbb{R}^d \to \mathbb{R}$ following the structure of a binary tree,

$$f(x_1, \dots, x_d) = h_1\big(h_2\big(h_3(x_1, x_2), h_4(x_3, x_4)\big), \dots\big)$$

for appropriate functions $h_i: \mathbb{R}^2 \to \mathbb{R}$, is called compositional (a small
example is sketched below).
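A minimal sketch (not from the notes) of a compositional function on $\mathbb{R}^4$,
with a two-level binary tree and constituent functions $h_i: \mathbb{R}^2 \to \mathbb{R}$ chosen
arbitrarily for illustration:

```python
import numpy as np

# constituent functions h_i : R^2 -> R of the binary tree (d = 4 leaves)
h2 = lambda x1, x2: np.tanh(x1 + 2.0 * x2)
h3 = lambda x3, x4: np.sin(x3) * x4
h1 = lambda u, v: np.exp(-(u - v) ** 2)       # root node

def f(x):
    """Compositional function f(x1,..,x4) = h1(h2(x1, x2), h3(x3, x4))."""
    return h1(h2(x[0], x[1]), h3(x[2], x[3]))

print(f(np.array([0.1, -0.3, 0.7, 1.2])))
```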

definition 2.6.3 ($W^{r,d}$, $W^{H,r,2}$, $S_n$ and $D_n$)

Let $W^{r,d}$ be the set of all functions with continuous partial derivatives
up to order $r$ such that

$$\sum_{|k| \le r} \|D^k f\| \le 1,$$

and let $W^{H,r,2}$ be the class of compositional functions in $W^{r,d}$.
Let $S_n := \mathcal{NN}_{d,\infty,d+n+1,2}$ and let $D_n$ be the class of DNNs from
$\mathcal{NN}_{d,\infty,\infty,\infty}$ whose realizations follow the structure of a compositional
function, with the analogue of the functions $h_i$ being in $S_n$.

Theorem 2.6.2: Compositional function approximation [32]

Let $\rho \in C^\infty(\mathbb{R})$ and not a polynomial on any subinterval of $\mathbb{R}$. For
$f \in W^{r,d}$

$$\inf_{\Phi \in S_n} \|f - R_\rho(\Phi)\| = O(n^{-r/d})$$

and for $f \in W^{H,r,2}$

$$\inf_{\Phi \in D_n} \|f - R_\rho(\Phi)\| = O(n^{-r/2})$$

hold.

Remark 2.6.4 (Corollary?) If we only know $r$ and want to guarantee
an accuracy of $\varepsilon$, we need a shallow network with $O(\varepsilon^{-d/r})$ trainable
parameters.
If we assume a hierarchical structure on the target function, then to
guarantee an accuracy of $\varepsilon$, a corresponding deep network requires only
$O(\varepsilon^{-2/r})$ trainable parameters.

Because of time constraints we only briefly covered convolutional neural
networks, so this topic will not be included here (for now). See slides
49 - 59 from the lecture.
3 Learning
10.12.19
3.1 Training in overparameterized regimes
In the overparameterized regime the number of neurons exceeds the
number of samples. This enables zero training error, and therefore a globally
optimal solution, and convergence of (stochastic) gradient descent with high
probability over the initialization.
Consider the following two-layer network: let $\rho$ be the ReLU activation
function, $(w_r)_{r=1}^n \subset \mathbb{R}^d$ the weight vectors of the first layer, $W$ the
associated matrix and $a \in \mathbb{R}^n$ the output weights. (TODO: figure.)
Consider the neural network $\Phi = (W, a)$ and its realization

$$R_\rho(\Phi): \mathbb{R}^d \to \mathbb{R}, \quad x \mapsto \frac{1}{\sqrt{n}} \sum_{r=1}^n a_r \rho(\langle w_r, x \rangle).$$

(The weighting factor $\frac{1}{\sqrt{n}}$ is the normalization used in [26]; it keeps the
output and the Gram matrix $H^\infty$ below of order one under the random
initialization.) For samples $((x_k, y_k))_{k=1}^m \subset \mathbb{R}^d \times \mathbb{R}$ we want to minimize
the empirical loss function (a code sketch follows below)

$$F(W, a) := \sum_{i=1}^m \frac{1}{2} \big( R_\rho(W, a)(x_i) - y_i \big)^2.$$
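A sketch of this two-layer network and its empirical loss (NumPy, ad-hoc
names and toy data; not from the notes):

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, m = 5, 1000, 20                              # overparameterized: n >> m

W = rng.normal(size=(n, d))                        # w_r ~ N(0, I)
a = rng.choice([-1.0, 1.0], size=n)                # a_r ~ unif{+-1}
X = rng.normal(size=(m, d)); X /= np.linalg.norm(X, axis=1, keepdims=True)  # ||x_i|| = 1
y = rng.normal(size=m)

def predict(W, a, X):
    """u_i = R_rho(W, a)(x_i) = (1/sqrt(n)) sum_r a_r ReLU(<w_r, x_i>)."""
    return np.maximum(0.0, X @ W.T) @ a / np.sqrt(W.shape[0])

def loss(W, a, X, y):
    """F(W, a) = sum_i (1/2) (u_i - y_i)^2."""
    return 0.5 * np.sum((predict(W, a, X) - y) ** 2)

print(loss(W, a, X, y))
```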
First, consider

$$W(k+1) = W(k) - \eta \frac{\partial F(W(k), a)}{\partial W(k)}, \qquad (7)$$

where $\eta > 0$ is the step size. Here we have

$$\frac{\partial F(W, a)}{\partial w_r} = \frac{1}{\sqrt{n}} \sum_{i=1}^m \big( R_\rho(W, a)(x_i) - y_i \big)\, a_r\, x_i\, \mathbb{1}_{\{\langle w_r, x_i \rangle \ge 0\}},$$

which is neither smooth nor convex.

Instead of gradient descent we use gradient flow, which is gradient descent
with infinitesimal step size: instead of (7) consider the differential equation

$$\frac{\mathrm{d}w_r(t)}{\mathrm{d}t} = -\frac{\partial F(W(t), a)}{\partial w_r(t)}.$$

For brevity's sake let

$$u_i(t) := R_\rho(W(t), a)(x_i)$$

for $i \in [m]$ be the prediction on input $x_i$ at time $t$ and $u(t) := (u_i(t))_{i=1}^m$
the prediction vector at time $t$. Further let

$$H^\infty_{ij} := \mathbb{E}_{w \sim \mathcal{N}(0, I)}\big[ \langle x_i, x_j \rangle\, \mathbb{1}_{\{\langle w, x_i \rangle \ge 0,\, \langle w, x_j \rangle \ge 0\}} \big]$$

be the Gram matrix induced by $\rho$ and the random initialization.

We assume that $\lambda_0 := \lambda_{\min}(H^\infty) > 0$, which is fulfilled if $x_i \nparallel x_j$ for
all $i \ne j$ ([26], Theorem 3.1). Note that for most real-world datasets no two
inputs are parallel, so our assumption holds in general.

Theorem 3.1.1: Convergence Rate of Gradient Flow ([26], Theorem 3.2)

Suppose the assumption above holds and that $\|x_i\|_2 = 1$ and $|y_i| \le C$
for a constant $C$ and all $i \in [m]$. Choose the number of hidden
neurons as

$$n = \Omega\Big( \frac{m^6}{\lambda_0^4 \delta^3} \Big)$$

and initialize i.i.d. $w_r \sim \mathcal{N}(0, I)$ and $a_r \sim \mathrm{unif}\{\pm 1\}$ for $r \in [n]$.
Then with probability at least $1 - \delta$ over the initialization we have

$$\|u(t) - y\|_2^2 \le \exp(-\lambda_0 t)\, \|u(0) - y\|_2^2.$$

Proof.
1. Dynamics of the predictions. We have

$$\frac{\mathrm{d}}{\mathrm{d}t} u_i(t) = \sum_{r=1}^n \Big\langle \frac{\partial R_\rho(W(t), a)(x_i)}{\partial w_r(t)}, \frac{\mathrm{d}w_r(t)}{\mathrm{d}t} \Big\rangle = \dots = \sum_{j=1}^m (y_j - u_j(t))\, H_{ij}(t),$$

where

$$H_{ij}(t) = \frac{\langle x_i, x_j \rangle}{n} \sum_{r=1}^n \mathbb{1}_{\{\langle x_i, w_r(t) \rangle \ge 0,\, \langle x_j, w_r(t) \rangle \ge 0\}}.$$

Thus

$$\frac{\mathrm{d}}{\mathrm{d}t} u(t) = H(t)\,(y - u(t)). \qquad (8)$$

2. We have $\|H(0) - H^\infty\|_2 = O\big(\frac{1}{\sqrt{n}}\big)$ and $\|H(t) - H(0)\|_2 = O\big(\frac{1}{\sqrt{n}}\big)$
for all $t \ge 0$, i.e. the larger the overparameterization (the larger $n$), the
more the dynamics of the predictions are governed by $H^\infty$.

3. If $n \in \Omega\big( \frac{m^2}{\lambda_0^2} \log\big(\frac{m}{\delta}\big) \big)$, then with probability at least
$1 - \delta$ we have $\|H(0) - H^\infty\|_2 \le \frac{\lambda_0}{4}$ and $\lambda_{\min}(H(0)) \ge \frac{3}{4}\lambda_0$.
This is proven with concentration bounds, i.e. inequalities (such as Hoeffding's)
that control the deviation of a sum of independent random variables from its
expectation, applied entrywise to $H(0)$.

4. $H(t)$ is stable in terms of $W(t)$: if the $w_r$ are initialized i.i.d. from
$\mathcal{N}(0, I)$, then with probability at least $1 - \delta$ the following holds: if for all $r$

$$\|w_r(0) - w_r(t)\|_2 \le \frac{c\,\delta \lambda_0}{m^2} =: R \qquad (9)$$

for a suitable constant $c$, then

$$\|H(t) - H(0)\|_2 \le \frac{\lambda_0}{4} \quad \text{and} \quad \lambda_{\min}(H(t)) \ge \frac{\lambda_0}{2}.$$

This can be seen by considering the event that the activation pattern of $x_i$ at
neuron $r$ can flip within distance $R$ of the initialization,

$$A_{i,r} := \big\{ \exists w : \|w - w_r(0)\| \le R,\; \mathbb{1}_{\{\langle x_i, w \rangle \ge 0\}} \ne \mathbb{1}_{\{\langle x_i, w_r(0) \rangle \ge 0\}} \big\}.$$

Then

$$\mathbb{P}(A_{i,r}) \le \mathbb{P}_{z \sim \mathcal{N}(0,1)}(|z| < R) \le \frac{2R}{\sqrt{2\pi}} \qquad (10)$$

holds. Then,

$$\mathbb{E}\big[ |H_{ij}(0) - H_{ij}(t)| \big] \le \frac{1}{n} \sum_{r=1}^n \mathbb{E}\big[ \mathbb{1}_{A_{i,r} \cup A_{j,r}} \big] \overset{(10)}{\le} \frac{4R}{\sqrt{2\pi}}.$$

Summing over $i, j \in [m]$ and applying Markov's inequality we get that with
probability at least $1 - \delta$

$$\sum_{i,j=1}^m \big| H_{ij}(t) - H_{ij}(0) \big| \le \frac{4 m^2 R}{\sqrt{2\pi}\,\delta}.$$

We then use matrix perturbation theory to get the result:

$$\|H(t) - H(0)\|_2 \le \|H(t) - H(0)\|_F \le \sum_{i,j=1}^m |H_{ij}(t) - H_{ij}(0)| \le \frac{4 m^2 R}{\sqrt{2\pi}\,\delta} \overset{(9)}{\le} \frac{\lambda_0}{4}$$

for $c$ small enough. Hence

$$\lambda_{\min}(H(t)) \ge \lambda_{\min}(H(0)) - \frac{\lambda_0}{4} \ge \frac{\lambda_0}{2}.$$

5. If $\lambda_{\min}(H(s)) \ge \frac{\lambda_0}{2}$ for all $s \in [0, t]$, then

$$\|y - u(t)\|_2^2 \le e^{-\lambda_0 t}\, \|y - u(0)\|_2^2 \qquad (11)$$

and $\|w_r(t) - w_r(0)\| \le \frac{\sqrt{m}\,\|y - u(0)\|_2}{\sqrt{n}\,\lambda_0} =: R'$ (cf. [26], Lemma 3.3).
The proof of this fact relies on (8). It gives

$$\frac{\mathrm{d}}{\mathrm{d}t} \|y - u(t)\|_2^2 = -2\,(y - u(t))^T H(t)\,(y - u(t)) \le -\lambda_0 \|y - u(t)\|_2^2,$$

where the inequality comes from $\lambda_{\min}(H(s)) \ge \frac{\lambda_0}{2}$. This implies

$$\frac{\mathrm{d}}{\mathrm{d}t} \big( e^{\lambda_0 t} \|y - u(t)\|_2^2 \big) \le 0,$$

so $e^{\lambda_0 t} \|y - u(t)\|_2^2$ is non-increasing in $t$, implying (11).

6. If $R' < R$ then
• $\lambda_{\min}(H(t)) \ge \frac{\lambda_0}{2}$ for all $t \ge 0$,
• $\|w_r(t) - w_r(0)\|_2 \le R'$ for all $r$,
• (11) holds.
We have $R' < R$ if $n$ is large enough. □

We now consider the differential equations

$$\frac{\mathrm{d}w_r(t)}{\mathrm{d}t} = -\frac{\partial F(W(t), a(t))}{\partial w_r(t)} \quad \text{and} \quad \frac{\mathrm{d}a_r(t)}{\mathrm{d}t} = -\frac{\partial F(W(t), a(t))}{\partial a_r(t)}.$$

Theorem 3.1.2: Jointly training both layers [26]

Let the previous assumptions be fulfilled and the number of hidden
neurons be $n = \Omega\Big( \frac{m^6 \log(m/\delta)}{\lambda_0^4 \delta^3} \Big)$. Further initialize
$w_r \sim \mathcal{N}(0, I)$ and $a_r \sim \mathrm{unif}\{\pm 1\}$ for $r \in [n]$. Then with probability
at least $1 - \delta$ over the initialization we have

$$\|u(t) - y\|_2^2 \le e^{-\lambda_0 t}\, \|u(0) - y\|_2^2.$$

The proof uses similar arguments as before.

Theorem 3.1.3: Discrete Time Analysis [26]

Let the previous assumptions be fulfilled and the number of hidden
neurons be $n = \Omega\Big( \frac{m^6}{\lambda_0^4 \delta^3} \Big)$. Further initialize as above and set
the step size $\eta := O\big( \frac{\lambda_0}{m^2} \big)$. Then with probability at least $1 - \delta$
over the initialization we have

$$\|u(k) - y\|_2^2 \le \Big( 1 - \frac{\eta \lambda_0}{2} \Big)^k \|u(0) - y\|_2^2 \quad \text{for } k \in \mathbb{N}.$$

Fig. 28: Some numerics. [26]

A similar result as before holds for deep neural networks. More precisely,
the authors show that gradient descent achieves zero training loss in
polynomial time for a deep over-parameterized residual neural network
(ResNet) [27].
4 Generalization
12.12.19
4.1 Empirical Risk Minimization
Aiming to find an alternative definition of the complexity of a function
class, we will define its covering number, describing how well it can be
approximated with simple objects such as $\varepsilon$-balls.

definition 4.1.1 (Covering Number)

Let $X$ be a metric space and $\varepsilon > 0$. We define

$$\mathcal{N}(X, \varepsilon) := \min\Big\{ \ell \in \mathbb{N} : \exists (x_k)_{k=1}^\ell \subset X : X \subset \bigcup_{k=1}^\ell B(x_k, \varepsilon) \Big\}.$$

Fig. 29: Illustration of the concept of covering number with $\varepsilon$-balls with
respect to the uniform norm. [2]

definition 4.1.2 (Packing number)

The packing number is

$$\mathcal{P}(X, \varepsilon) := \max\big\{ k \in \mathbb{N} : \exists (x_\ell)_{\ell=1}^k \subset X : \|x_i - x_j\| > \varepsilon \;\; \forall i \ne j \big\}.$$

Lemma 4.1.3
We have

$$\mathcal{P}(X, 2\varepsilon) \le \mathcal{N}(X, \varepsilon) \le \mathcal{P}(X, \varepsilon) \le \Big( 1 + \frac{2}{\varepsilon} \Big)^d.$$

Theorem 4.1.1: Generalization error bound [SOURCE???]

Let $\mathcal{H}$ be a hypothesis class and assume that for all $f \in \mathcal{H}$ we
have $|f(X) - Y| \le M$ almost everywhere, where $X$ and $Y$ are the data
random variables of the statistical learning problem (definition 1.2.10).
Then for all $\varepsilon > 0$

$$\mathbb{P}\big( \mathcal{E}(f_{\mathcal{H},S}) - \mathcal{E}(f_{\mathcal{H}}) \le \varepsilon \big) \ge 1 - \mathcal{N}\Big( \mathcal{H}, \frac{\varepsilon}{8M} \Big) \cdot 2 \exp\Big( -\frac{m \varepsilon^2}{8 M^4} \Big)$$

holds.

Notice that this bound behaves sensibly: for many samples, i.e. large $m$, the
term $e^{-C m \varepsilon^2}$ decreases, yielding a better bound, while
$\mathcal{N}\big( \mathcal{H}, \frac{\varepsilon}{8M} \big)$ increases with growing $M$, yielding a smaller and
therefore worse lower bound.

Corollary 4.1.4 (Sufficient number of samples)
A generalization error smaller than or equal to $\varepsilon \ge 0$ is achieved by

$$m \ge \varepsilon^{-2} \ln\big( \mathcal{N}(\mathcal{H}, c\varepsilon) \big)$$

with high probability for some $c > 0$.

Unfortunately, the complexity measure given above is independent of the
data distribution and of the samples and hence very pessimistic.


Motivation of Rademacher complexity

For generalization we need to bound
$$\sup_{f\in\mathcal H}\big|\,\mathbb E[\ell(f(x),y)] - \mathcal E_S(f)\,\big|.$$
In cross-validation one splits the samples $S$ into $S^1$ and $S^2$ (of approximately the same size) and estimates the generalization error of $f$ by
$$\frac{2}{m}\sum_{(x,y)\in S^1}\ell(f(x),y) - \frac{2}{m}\sum_{(x,y)\in S^2}\ell(f(x),y) = \frac{2}{m}\sum_{i=1}^m\sigma_i\,\ell(f(x_i),y_i) =: (\star).$$
We denote $\sigma := (\sigma_i)_{i=1}^m$, where
$$\sigma_i = \begin{cases} 1, & (x_i,y_i)\in S^1,\\ -1, & (x_i,y_i)\in S^2.\end{cases}$$
If we now choose $\ell$ to be the hinge loss $\ell(f(x),y) := \frac{1-f(x)\cdot y}{2}$ and, without loss of generality, the labels to be $y_i\in\{\pm1\}$, one obtains
$$(\star) = \frac{1}{m}\sum_{i=1}^m\sigma_i\,(1-f(x_i)y_i).$$
Now consider a random split, so that the $\sigma_i$ become random variables with $\mathbb P(\sigma_i=1) = \mathbb P(\sigma_i=-1) = \frac12$. Hence
$$\mathbb E_\sigma\Big[\sup_{f\in\mathcal H}\Big|\frac{1}{m}\sum_{i=1}^m\sigma_i(1-f(x_i)y_i)\Big|\Big] \le \mathcal O\big(m^{-1/2}\big) + \underbrace{\mathbb E_\sigma\Big[\sup_{f\in\mathcal H}\frac{1}{m}\sum_{i=1}^m\tilde\sigma_i f(x_i)\Big]}_{=:(\dagger)},$$
where $\tilde\sigma_i := y_i\sigma_i$; the $\mathcal O(m^{-1/2})$ term follows from Van Zuijlen's bound [9], $\mathbb P\big(\frac{1}{\sqrt m}\sum_{i=1}^m\sigma_i\le 1\big)\ge\frac12$. The term $(\dagger)$ measures how well the hypothesis class is able to fit random labels to the samples.

definition 4.1.5 (Rademacher complexity)
The Rademacher complexity of a function class $\mathcal F$ with respect to the sample $S := (x_i)_{i=1}^m$ is
$$\mathcal R_S(\mathcal F) := \mathbb E_\sigma\Big[\sup_{f\in\mathcal F}\frac{1}{m}\Big|\sum_{i=1}^m\sigma_i f(x_i)\Big|\Big],$$
where the expectation is taken over the Rademacher random variables $\sigma := (\sigma_i)_{i=1}^m\in\{\pm1\}^m$ chosen uniformly at random.

The Rademacher complexity can be used to derive data-dependent upper bounds on the learnability of function classes. Intuitively, a function class with smaller Rademacher complexity is easier to learn. Statistical learning theory tells us to choose $\mathcal F := \mathcal H$.

Example 4.1.6 (Linear classifiers' Rademacher complexity)
Consider the hypothesis space of unbiased linear classifiers
$$\mathcal F^{a,b} := \big\{\{y\in\mathbb R^d : \|y\|_2\le a\}\ni x\mapsto\langle v,x\rangle \;:\; \|v\|_2\le b\big\}.$$
One obtains
$$\mathcal R_S(\mathcal F^{a,b}) = \mathbb E_\sigma\Big[\sup_{\|v\|_2\le b}\frac{1}{m}\Big|\Big\langle v,\sum_{i=1}^m\sigma_ix_i\Big\rangle\Big|\Big] \overset{\text{(CS)}}{\le} \frac{b}{m}\,\mathbb E_\sigma\Big[\Big\|\sum_{i=1}^m\sigma_ix_i\Big\|_2\Big] \overset{\text{(L)}}{=} \frac{b}{m}\,\mathbb E_\sigma\Big[\Big(\sum_{i,j=1}^m\sigma_i\sigma_j\langle x_i,x_j\rangle\Big)^{1/2}\Big]$$
$$\overset{\text{(J)}}{\le} \frac{b}{m}\Big(\sum_{i,j=1}^m\mathbb E_\sigma[\sigma_i\sigma_j]\,\langle x_i,x_j\rangle\Big)^{1/2} = \frac{b}{m}\Big(\sum_{i=1}^m\|x_i\|_2^2\Big)^{1/2} \le \frac{ba}{\sqrt m},$$
using $\|\cdot\|_2^2 = \langle\cdot,\cdot\rangle$, $\mathbb E_\sigma[\sigma_i\sigma_j] = 0$ for $i\ne j$ and $\|x_i\|_2\le a$, where (CS) is the Cauchy-Schwarz inequality, (L) uses the (bi)linearity of the inner product on $\mathbb R^d$ and (J) is Jensen's inequality $\mathbb E[f(X)]\le f(\mathbb E[X])$ for concave $f$. ˛
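This computation can be checked numerically: for a fixed σ the supremum over $\|v\|_2\le b$ is attained in closed form, $\sup_{\|v\|_2\le b}|\langle v,\sum_i\sigma_ix_i\rangle| = b\,\|\sum_i\sigma_ix_i\|_2$, so the empirical Rademacher complexity can be estimated by Monte Carlo over σ. A minimal sketch (all sizes and radii are arbitrary choices):

import numpy as np

rng = np.random.default_rng(2)
m, d, a, b = 200, 10, 1.0, 2.0

X = rng.normal(size=(m, d))
X = a * X / np.linalg.norm(X, axis=1, keepdims=True)   # sample points with ||x_i||_2 = a

trials = 2000
vals = np.empty(trials)
for t in range(trials):
    sigma = rng.choice([-1.0, 1.0], size=m)            # Rademacher vector
    vals[t] = b * np.linalg.norm(X.T @ sigma) / m      # closed-form supremum over ||v||_2 <= b

print("Monte Carlo estimate of R_S:", vals.mean())
print("bound b*a/sqrt(m)          :", b * a / np.sqrt(m))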

Example 4.1.7 Consider the hypothesis space of unbiased neural networks
$$\mathcal F^{L,B_1,\dots,B_L} := \big\{\{y : \|y\|_2\le1\}\ni x\mapsto A_L\,\varrho(A_{L-1}\cdots\varrho(A_1x))\big\}$$
with $\|A_i\|_F\le B_i$ for all $i\in[L]$. A theorem of [11] states
$$\mathcal R_S(\mathcal F^{L,B_1,\dots,B_L}) \le m^{-\frac12}\Big(\sqrt{2\log(2)\,L}+1\Big)\prod_{i=1}^L B_i. \quad ˛$$

Theorem 4.1.2: Generalization Bound [12]
Let $\mathcal F\subset\{f:\mathcal X\to[0,1]\}$ be a function class, $X$ a random variable on $\mathcal X$, and let $S := (x_i)_{i=1}^m$ be drawn i.i.d. according to $X$. Then with probability at least $1-\delta$, for all $f\in\mathcal F$ we have
$$\sup_{f\in\mathcal F}\Big|\mathbb E[f] - \frac{1}{m}\sum_{i=1}^m f(x_i)\Big| \le 2\,\mathcal R_S(\mathcal F) + \mathcal O\Big(\sqrt{\frac{\ln(\delta^{-1})}{m}}\Big).$$

Notice that if we choose $f = \ell(h(x),y)$, the left-hand side becomes $\mathbb E[f] - \mathcal E_S(f)$.

Understanding deep learning requires rethinking generalization: the table below, from [13], shows that state-of-the-art networks can fit random labels. Therefore, their (empirical) Rademacher complexity is 1, and the classical theory (left diagram in Fig. 30) does not explain generalization!
Fig. 30: Sometimes a “double descent curve”
is observed. [14]


Fig. 32: The training and test accuracy of various models on the CIFAR10 dataset. Performance with and without data augmentation and weight decay is compared. The results of fitting random labels are also included. [13, 14]

5 Interpretability
Motivation. Given a trained neural network, we don’t know what the 07.01
training data was nor how it was trained. The goal of this section is to
provide tools to open this black box.

Remark 5.0.1 Even if we know the training data and the training procedure, the lack of a comprehensive theory means that we still cannot explain how the network arrives at its decisions.

Example 5.0.2 Often, job applications are prescreened by neural net-


works. If one is rejected, one is interested in the reason for this decision.
In hospitals, neural networks are used to suggest treatments. Certainly
patients and doctors alike are interested in the reasoning behind them.˛

Therefore, our goal is to derive human-like explanations from a neural


network, even though defining "human-like" is tricky.

Interpretability of a Classification Decision


The approach that probably comes to mind first is to investigate which features are most relevant for the decision. This can be achieved by treating every pixel of an image separately (the current standard). It would obviously be desirable to also consider combinations of pixels in order to incorporate additional knowledge about the data. One is also interested in the certainty of the decision.
As seen on the right, one can construct relevance maps, but these do not necessarily provide sufficient knowledge about a network decision.

Fig. 33: Red regions indicate pixels relevant for deciding this image to be the number displayed above, while blue pixels indicate regions relevant for deciding against this number.

Previous relevance mapping methods include gradient-based methods such as Sensitivity Analysis [17] and SmoothGrad [16], backward-propagation-based methods such as Guided Backprop [18], Layer-wise Relevance Propagation [19] and Deep Taylor [20], surrogate-model-based methods such as LIME (Local Interpretable Model-agnostic Explanations) [22], and game-theoretic methods such as Shapley values [23, 24] and SHAP (Shapley Additive Explanations) [34].

Sensitivity Analysis
Let $C := (c_k)_{k=1}^n$ be a set of classes. Consider a DNN with realization
$$R_\varrho(\Phi) : \mathbb R^d\to\mathbb R^n,\quad x\mapsto (x_{c_k})_{k=1}^n,$$
where $x_{c_i}\in\mathbb R$ describes to which extent $x$ belongs to class $c_i$. Define $(R_\varrho(\Phi))_{c_i} := x_{c_i}$ and assume this map is piecewise differentiable.

definition 5.0.3 (Sensitivity map)
With the above notation, let
$$M_C(x) := \frac{\partial R_\varrho(\Phi)_{c_i}(x)}{\partial x}.$$

Fig. 34: An image and its sensitivity map $M_c$. Here $|C| = 1$. The light regions indicate where the algorithm identifies a lot of gazelle ($= c_1$) and the dark regions where there is little of it: lighter pixels indicate partial derivatives with higher absolute values. [16]

Remark 5.0.4 $M_C$ represents how much difference a small change in each pixel of $x$ makes to the classification score $x_{c_i}$ for the classes from $C$.
The problems with the sensitivity map in Fig. 34 are evident:
• it looks more like noise because of its scattered nature,
• it is very far from a human explanation; the correlation with regions of the image picked by a human is rough at best.
This might be because the partial derivatives fluctuate greatly (Fig. 35). A solution is local smoothing [16]: instead of $M_c$ we now use
$$\widehat M_c(x) := \frac{1}{n}\sum_{i=1}^n M_c(x_i),$$
where the $x_i := x + \mathcal N(0,\sigma^2)$ are noisy samples from a neighborhood of $x$.
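A minimal sketch of the sensitivity map and its smoothed variant, assuming a toy differentiable classifier written directly in NumPy so that the gradient is available in closed form; for a trained network one would obtain $M_c$ by automatic differentiation instead:

import numpy as np

rng = np.random.default_rng(3)
d = 28 * 28
w = rng.normal(size=d)                       # toy "network": a logistic regression

def R(x):                                    # classification score for class c
    return 1.0 / (1.0 + np.exp(-w @ x))

def M_c(x):                                  # sensitivity map: gradient of the score
    s = R(x)
    return s * (1.0 - s) * w

def smoothgrad(x, n=50, sigma=0.1):          # average gradients at noisy copies of x
    return np.mean([M_c(x + sigma * rng.normal(size=d)) for _ in range(n)], axis=0)

x = rng.random(d)
print(np.abs(M_c(x)).max(), np.abs(smoothgrad(x)).max())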

Fig. 35: The partial derivative of $M_c$ with respect to the RGB values of a single pixel, as a fraction of the maximum entry in the gradient vector $\max_i \frac{\partial M_c}{\partial x_i}(t)$, as one moves away from a baseline image $x$ (left) to a fixed location $x+\varepsilon$ (right), where $\varepsilon$ is one random sample from $\mathcal N(0, 0.01^2)$. The final image is indistinguishable to a human from the original image. [16]

Fig. 36: Effect of sample size on the estimated gradient for Inception; 10% noise was applied to each image. [16]

Layer-wise Relevance Propagation (LRP)
A realization of a neural network takes in an image $x$ and returns $R_\varrho(\Phi)(x)$.
Denote the neurons of the $\ell$-th layer by $(R_{\ell,j})_{j=1}^{N_\ell}$. Then
$$R_{\ell,j} = \sum_{k=1}^{N_{\ell+1}} \frac{a_{\ell,j}\,(w_{\ell,j})_k}{\sum_n a_{\ell,n}\,(w_{\ell,n})_k}\, R_{\ell+1,k},$$
where the $w$ are the weights and the $a$ are constants tuned in the algorithm. In other words: a neuron (and therefore its relevance to the output) can be reconstructed from the neurons of the following layer and the weights.

Fig. 37: Visualization of backpropagation. [21]
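A minimal sketch of the propagation rule above for a small two-layer ReLU network (random weights and input, no biases; the small ε in the denominator is an added stabilization, not part of the rule as stated):

import numpy as np

rng = np.random.default_rng(4)
d, h = 8, 6
W1, W2 = rng.normal(size=(h, d)), rng.normal(size=(1, h))

x = rng.random(d)
a1 = np.maximum(W1 @ x, 0.0)          # hidden activations
out = W2 @ a1                         # network output

def lrp_step(a, W, R_next, eps=1e-9):
    # redistribute the relevance R_next of the next layer onto the activations a
    z = W * a[None, :]                          # contributions a_j * w_{kj}
    denom = z.sum(axis=1, keepdims=True)
    denom = denom + eps * np.sign(denom)        # stabilization
    return (z / denom * R_next[:, None]).sum(axis=0)

R2 = out                      # relevance of the output neuron
R1 = lrp_step(a1, W2, R2)     # relevance of the hidden neurons
R0 = lrp_step(x,  W1, R1)     # relevance of the input pixels

print(out.sum(), R1.sum(), R0.sum())

The three printed values agree (up to the stabilization), illustrating that the relevance is conserved from layer to layer.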


Deep Taylor
For a realization $R_\varrho(\Phi) : \mathbb R^d\to\mathbb R$ we want to determine the relevance of each pixel $x_k$ of $x = (x_k)_{k=1}^d$ for the output $R_\varrho(\Phi)(x)$.

definition 5.0.5 (Relevance maps (RM))
A collection of functions $M_p : \mathbb R^d\to\mathbb R$, $p\in[d]$, is called relevance maps of $(x_k)_{k=1}^d$ if
1 $(M_p)_p$ is non-negative, i.e. $M_p\ge0$,
2 $(M_p)_p$ is conservative, i.e. $\sum_p M_p(x) = R_\varrho(\Phi)(x)$.

Remark 5.0.6 In sensitivity analysis we have $M_p(x) = \frac{\partial}{\partial x_p}R_\varrho(\Phi)(x)$, which satisfies neither 1 nor 2. LRP, in contrast, is conservative by construction, while non-negativity depends on the chosen propagation rule.
The following definition comes from [21]:

definition 5.0.7 (Taylor decomposition RM)
Assume $R_\varrho(\Phi)\in C^1$ and $R_\varrho(\Phi)(\hat x) = 0$ for some $\hat x$. Then
$$M_p(x) := \frac{\partial}{\partial x_p}R_\varrho(\Phi)(\hat x)\,(x_p-\hat x_p)$$
is the Taylor decomposition relevance map.

Remark 5.0.8 Consider the standard Taylor expansion
$$R_\varrho(\Phi)(x) = \underbrace{R_\varrho(\Phi)(\hat x)}_{=0} + \underbrace{\nabla R_\varrho(\Phi)(\hat x)^T(x-\hat x)}_{\text{can be } <0} + \mathcal O(\|x-\hat x\|^2),$$
which shows that the Taylor decomposition relevance map is almost conservative (but not necessarily non-negative).

Fig. 38: Numerical experiment for sensitivity vs. LRP. SOURCE???


The Taylor decomposition relevance map can be efficiently computer
by backpropagation.
Fehlt: Bild: numerical experiments I, II (Folie 11/12) Note to Slide 11:
the black lines are artificial and aren’t created by the algorithm.

5.1 Towards a more mathematical understanding
Our main goal is to understand decisions of "black-box" predictors. We face the following challenges:
• What exactly is relevance in a mathematical sense?
– Rigorous definition of relevance by information theory.
• What is a good relevance map?
– Formulation of interpretability as an optimization problem.
– Theoretical analysis of its complexity.
• How to compare different relevance maps?
– A canonical framework for comparison.


The relevance mapping problem


Let $\Phi : [0,1]^d\to[0,1]$ be a classification function, where high values indicate high relevance, and let $x\in[0,1]^d$ be an input signal. We want to determine the most relevant components of $x$ for the prediction $\Phi(x)$. We choose the relevant components $S\subset[d]$, which should be small, and consider $S^c$ to be non-relevant. Hence the only two relevance values are 1 (relevant) and 0 (not relevant).

Fig. 39: TODO. [2]

Rate-Distortion Explanation (RDE)
Let us say Alice has a message $x$ and a neural network $\Phi$ with realization $R_\varrho(\Phi)$. She transmits only the sparse relevant part $S$ of $x$ to Bob, who also possesses $\Phi$. In order to apply $\Phi$, the "missing" parts have to be filled in as well, which is done with random noise $\eta$. We call Bob's completed message $y$ and aim for $R_\varrho(\Phi)(y)\approx R_\varrho(\Phi)(x)$.

definition 5.1.1 (Rate-distortion function)
The distortion of $S\subset[d]$ is
$$D(S) := \mathbb E\Big[\tfrac12\big(R_\varrho(\Phi)(y) - R_\varrho(\Phi)(x)\big)^2\Big]$$
and the rate-distortion function is
$$R(\varepsilon) := \min_{S\subset[d]}\big\{|S| : D(S)\le\varepsilon\big\},\qquad\text{where } |S| \text{ is the rate.}$$
We call the part $S$ of $x$ $\varepsilon$-relevant.

Problem relaxation

setting          discrete                          continuous
relevant set     $S\subset[d]$                     $s\in[0,1]^d$
variable         $y = x_S + \eta_{S^c}$            $y = s\odot x + (1-s)\odot\eta$
distortion       $D(S)$                            $D(s)$
rate             $|S|$                             $\|s\|_{\ell^1}$

Here, $\odot$ denotes entrywise multiplication.

Relevance problem: $\min_{s\in[0,1]^d} D(s) + \lambda\|s\|_{\ell^1}$, where $\lambda>0$ is a regularization parameter.
Finding a minimizer of $R(\varepsilon)$, or even approximating it, is very hard.


definition 5.1.2 (Distortion, obfuscation)
We call
$$D(s) := \mathbb E\Big[\tfrac12\big(\Phi(x)-\Phi(y)\big)^2\Big] = \tfrac12\big(\Phi(x)-\mathbb E[\Phi(y)]\big)^2 + \tfrac12\,\mathrm{cov}[\Phi(y)]$$
the distortion. For the obfuscation $y = s\odot x + (1-s)\odot\eta$ we have
$$\mathbb E[y] = s\odot x + (1-s)\odot\mathbb E[\eta],\qquad \mathrm{cov}[y] = \mathrm{diag}(1-s)\,\mathrm{cov}[\eta]\,\mathrm{diag}(1-s).$$

To evaluate $D(s)$ we need to map $\mathbb E[y]$, $\mathrm{cov}[y]$ to $\mathbb E[\Phi(y)]$, $\mathrm{cov}[\Phi(y)]$ under $\Phi$. The generic approach is to estimate these by the sample mean and sample covariance, which is possible for any classifier function $\Phi$ but might require a large number of samples. The neural network approach uses the compositional structure of $\Phi$, propagating the distribution through the layers and projecting onto a simple family of distributions at each layer (low-rank and diagonal projections are density filtering methods).
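A minimal sketch of the relaxed problem $\min_{s\in[0,1]^d} D(s)+\lambda\|s\|_{\ell^1}$ for a toy classifier, with the distortion estimated by the generic sampling approach and $s$ updated by projected gradient descent; the classifier, the noise model and all constants are arbitrary placeholders:

import numpy as np

rng = np.random.default_rng(5)
d, lam, lr, N = 10, 0.05, 0.5, 200

w = rng.normal(size=d)
x = rng.random(d)

def phi(z):                               # toy classifier Phi(z) = sigmoid(<w, z>)
    return 1.0 / (1.0 + np.exp(-w @ z))

s = np.full(d, 0.5)                       # continuous relevance scores in [0,1]^d
for it in range(300):
    eta = 0.5 * rng.normal(size=(N, d))              # noise filling the non-relevant part
    y = s * x + (1.0 - s) * eta                      # obfuscations, shape (N, d)
    py = 1.0 / (1.0 + np.exp(-y @ w))
    diff = py - phi(x)
    # gradient of the Monte Carlo distortion (1/N) sum 0.5*(phi(x)-phi(y))^2 w.r.t. s
    grad_D = ((diff * py * (1.0 - py))[:, None] * w[None, :] * (x[None, :] - eta)).mean(axis=0)
    s = np.clip(s - lr * (grad_D + lam), 0.0, 1.0)   # projected gradient step; lam is the gradient of lam*||s||_1

print("relevance scores s:", np.round(s, 2))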

Hardness results
We consider the binary case. The following theorems [35] show that finding a minimizer of $R(\varepsilon)$, and even approximating it, is hard:

Theorem 5.1.1: Hardness result
Let $\Phi$ and $x$ be given. For $k\in\{1,\dots,d\}$ and $\varepsilon<\frac14$, deciding whether $R(\varepsilon)\le k$ is NP$^{\mathrm{PP}}$-complete.
For $\alpha\in(0,1)$, approximating $R(\varepsilon)$ to within a factor of $d^{1-\alpha}$ is NP-hard.

Many important problems in artificial intelligence belong to the class NP$^{\mathrm{PP}}$, e.g. planning under uncertainty, finding maximum a-posteriori configurations in graphical models or maximizing utility functions in Bayesian networks. Whereas the class NP can be thought of as the set of problems solvable by guessing the answer and checking it in polynomial time, the class NP$^{\mathrm{PP}}$ can be thought of as the set of problems solvable by guessing the answer and checking it using a probabilistic polynomial-time (PP) computation. [36]
TODO: numerics

6 Inverse Problems
When solving an inverse problem, one tries to recover the original data
from a transformed version!

Example 6.0.1 (Inverse problems) In inpainting one tries to recover an image from incomplete data. In medicine, magnetic resonance imaging recovers an image from point samples of its Fourier transform. In feature extraction, one separates the image into different features. ˛
Fig. 40: Recovering meat from minced meat
is a difficult inverse problem.
definition 6.0.2 (Inverse problem)
Let $X, Y$ be Banach spaces and $K : X\to Y$ be linear. Given $y\in Y$, seeking $x\in X$ fulfilling $Kx = y$ is called an inverse problem.

Example 6.0.3 (Inverse problems II)
In inpainting we have
$$K : L^2([0,1]^2)\to L^2([0,1]^2),\quad f\mapsto f\cdot 1_\Omega$$
for $\Omega\subset[0,1]^2$. In magnetic resonance imaging we have
$$K : L^2([0,1]^2)\cap L^1([0,1]^2)\to L^2([0,1]^2),\quad f\mapsto (\hat f(\lambda))_{\lambda\in\Lambda},$$
where $\Lambda\subset[0,1]^2$ is discrete. For feature extraction we have
$$K : L^2([0,1]^2)\times L^2([0,1]^2)\to L^2([0,1]^2),\quad (f,g)\mapsto f+g. \quad ˛$$

Example 6.0.4 (Limited-Angle (Computed) Tomography)
A CT scanner (applied in dental CT, breast tomosynthesis and electron tomography) samples the Radon transform
$$\mathcal R f(\varphi, s) := \int_{L(\varphi,s)} f(x)\, dS(x),$$
where $L(\varphi,s) := \{x\in\mathbb R^2 : x_1\cos(\varphi) + x_2\sin(\varphi) = s\}$, $\varphi\in\big[-\frac\pi2,\frac\pi2\big]$ and $s\in\mathbb R$.
This becomes a challenging inverse problem if $\mathcal Rf(\cdot,s)$ is only sampled on $[-\varphi,\varphi]\subsetneq\big[-\frac\pi2,\frac\pi2\big]$. ˛

Fig. 41: Visualization of inpainting, MRI and feature extraction. [37, 38]

The well-posedness conditions by Hadamard are:
1 Existence: For all $y\in Y$ there exists an $x\in X$ such that $Kx = y$.
2 Uniqueness: The above $x$ is unique.
3 Stability: $x_n\to x$ follows from $Kx_n\to Kx$.

Fig. 42: Geometric setup of the Radon transform. [3]

Most problems are not well-posed; they are called ill-posed. They can have the following problems:
1 $y\notin\mathcal R(K)$, meaning there does not exist a solution.
2 $\mathcal N(K)\ne\{0\}$, meaning there exist infinitely many solutions.
3 $K^+$ (as defined below) is not continuous.
The remedy for 1 is to minimize $\|Kx-y\|$, and the remedy for 2 is to additionally minimize $\|x\|$:

definition 6.0.5 (Types of solutions)
We call $x\in X$ a least-squares solution if $\|Kx-y\| = \min_{z\in X}\|Kz-y\|$ holds, and a minimum-norm solution if additionally $\|x\| = \min\{\|z\| : z \text{ is a least-squares solution}\}$ holds.

Theorem 6.0.1: Moore-Penrose inverse
There exists a $K^+ : \mathcal D(K^+) := \mathcal R(K)\oplus\mathcal R(K)^\perp\subset Y\to X$ such that for $y\in\mathcal D(K^+)$ the equation $Kx = y$ has the minimum-norm solution $x^+ = K^+y$.

We can deal with problem 3 using the following concept.

definition 6.0.6 (Regularization)
A family $(J_\alpha : Y\to X)_{\alpha>0}$ of linear continuous operators is called a regularization of $K^+$ if $J_\alpha\to K^+$ pointwise on $\mathcal D(K^+)$ as $\alpha\searrow0$.

definition 6.0.7 (Tikhonov regularization)
The Tikhonov regularization is defined by
$$J_\alpha(y) := \operatorname*{arg\,min}_{x\in X}\ \|Kx-y\|^2 + \alpha R(x)$$
and is called standard if $R = \|\cdot\|^2$.

This allows us to find an approximate solution $x(\alpha)\in X$, $\alpha>0$, of an ill-posed inverse problem $Kx = y$ by minimizing the Tikhonov functional. The regularization term $R$ ensures continuous dependence on the data and incorporates prior properties of the solution.
But how do we obtain prior information and how do we incorporate it into the solver? There are several types of approaches: taking an off-the-shelf model, handcrafting, or learning.

Example 6.0.8 (Regularization term)
For vectors: the $\ell^2$ norm or the $\ell^1$ (sparsity) norm. For functions there is the TV-norm of $f:[0,1]\to\mathbb R$,
$$\|f\|_{TV} := \sup\sum_{i=0}^{n_k-1}|f(x_{i+1}) - f(x_i)|,$$
where the supremum is taken over all partitions $0\le x_0<\dots<x_{n_k}\le1$, or the Sobolev norm of $f:\Omega\to\mathbb R$,
$$\|f\|_{W^{k,p}(\Omega)} := \Big(\sum_{|\alpha|\le k}\|D^\alpha f\|_{L^p(\Omega)}^p\Big)^{1/p}. \quad ˛$$

Fig. 43: The point on the green line closest to the origin differs when considering the $\ell^1$- or the $\ell^2$-norm, whose balls are drawn. One can see that $\ell^1$-minimization "pushes points towards the axes".
Example 6.0.9 (Sparse regularization) Recall the wavelet transform (2). For each class of data, there exists a sparsifying system, i.e. a system in which the data can be represented with only few significant coefficients. From Fig. 43 we see that to attain sparsity we want to use the $\ell^1$ norm: we solve the ill-posed inverse problem $Kf = g$ by
$$f^\alpha := \operatorname*{arg\,min}_f\ \|Kf-g\|^2 + \alpha\,\|(\langle f,\psi_{j,m}\rangle)_{j,m}\|_1.$$
We can find such a system by using an existing one from applied harmonic analysis (wavelets, ridgelets, curvelets, shearlets, etc.), by constructing a novel system based on information about the data, or by learning a system with a dictionary learning algorithm (cf. K-SVD [39]). ˛

Solving inverse problems with deep learning


A first approach from 2004 [40] trained a feedforward neural network $\mathcal{NN}_\theta:\mathbb R^m\to\mathbb R^n$ such that $\mathcal{NN}_\theta(y)\approx x$. This was done on CPUs without CNNs, so only small toy examples could be used; it was seen more as a proof of concept. (TODO: figures.)
Later, another approach [41] based on ISTA (see below) unrolled/unfolded the iterations and cast them as a "CNN-like" neural network.
In 2012 the first CNN, AlexNet, won a classification challenge [42]. Since then all winners have been CNN-based, by a huge margin. In the field of denoising (which is an inverse problem) there have been similar improvements (Jain et al., 2009; Zhang et al., 2017; Elad et al., 2016).
Newer approaches work like this: given training samples $(f_i, g_i)_{i=1}^N$ following the forward model $g_i = Kf_i + \eta$, determine a reconstruction operator $T_\theta$ such that $g = Kf+\eta$ implies $T_\theta(g)\approx f$. Here $T_\theta$ is parametrized by $\theta\in\mathbb R^p$ and learned from the training data. We can then evaluate the quality of $T_\theta$ by testing on test data $(f_i,g_i)_{i=N+1}^\ell$ following the same forward model.
For CT, we can use reconstruction by filtered backprojection (FBP):
$$f(x) \approx \int_0^\pi \big(\mathcal Rf(\varphi,\cdot) * h\big)(\langle x, n_\varphi\rangle)\, d\varphi,$$
where $\hat h(k) = |k|$.


A typical deep learning approach to inverse problems is denoising a direct inversion (Ye et al., 2016; Unser et al., 2017). Here, one performs the direct inversion with FBP and then trains a CNN to remove noise, the intuition being that the CNN learns the structured noise/artifacts. Without taking the FBP, the CNN would need to learn the physics of CT, which is more complicated. (TODO: figure.)

Theorem 6.0.2
Let $L_0 := L(\varphi_0, s_0)$ be a line in the plane and let $(x_0,\xi_0)\in WF(f)$ be such that $x_0\in L_0$ and $\xi_0$ is a normal vector to $L_0$.
• The singularity of $f$ at $(x_0,\xi_0)$ causes a unique singularity in $\mathcal Rf$ at $(\varphi_0, s_0)$.
• Singularities of $f$ ... (TODO)

ISTA (Iterative Soft-Thresholding Algorithm) [25]
21.01.2019
Solve
$$\min_{x\in\mathbb R^n}\ \|Ax-y\|^2 + \alpha\|x\|_{\ell^1},$$
where $A:\mathbb R^n\to\mathbb R^m$ is the discrete version of $K$, by the iteration
$$x_{k+1} := S_{\alpha t}\big(x_k - 2tA^T(Ax_k - y)\big),$$
where $t\in\mathbb R$ is the step size and
$$S_\lambda : \mathbb R^n\to\mathbb R^n,\quad (S_\lambda(x))_i := (|x_i|-\lambda)_+\cdot\operatorname{sgn}(x_i)$$
is the shrinkage (soft-thresholding) operator.

Fig. 44: The shrinkage operator for $\lambda = \frac32$.
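A minimal NumPy sketch of the iteration; the random matrix A, the sparse ground truth and the step size $t = 1/(2\|A\|^2)$ are arbitrary choices:

import numpy as np

rng = np.random.default_rng(6)
m, n, alpha = 50, 200, 0.1

A = rng.normal(size=(m, n)) / np.sqrt(m)
x_true = np.zeros(n)
x_true[rng.choice(n, 5, replace=False)] = rng.normal(size=5)   # sparse ground truth
y = A @ x_true

def soft_threshold(z, lam):                      # shrinkage operator S_lambda
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

t = 1.0 / (2.0 * np.linalg.norm(A, 2) ** 2)      # step size
x = np.zeros(n)
for k in range(1000):
    x = soft_threshold(x - 2.0 * t * A.T @ (A @ x - y), alpha * t)

print("relative error:", np.linalg.norm(x - x_true) / np.linalg.norm(x_true))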

Douglas-Rachford
Define $\gamma>0$ and the proximal operator
$$\operatorname{prox}_f(v) := \operatorname*{arg\,min}_z\ f(z) + \tfrac12\|z-v\|^2$$
in order to solve
$$\min_{x\in\mathbb R^d}\ \|Ax-y\|^2 + \alpha R(x)$$
by the "splitting" iteration
$$x_{k+1} := \operatorname{prox}_{\gamma\alpha R}(v_k),\qquad v_{k+1} := v_k + \operatorname{prox}_{\gamma\|A\cdot-y\|^2}(2x_{k+1}-v_k) - x_{k+1}.$$
If $R := \|\cdot\|_{\ell^1}$, the $x$-update is soft-thresholding.
TODO
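A minimal sketch of the splitting iteration for $R = \|\cdot\|_{\ell^1}$; the proximal operator of the data term reduces to a linear solve, and all problem sizes are arbitrary placeholders:

import numpy as np

rng = np.random.default_rng(7)
m, n, alpha, gamma = 50, 200, 0.1, 1.0
A = rng.normal(size=(m, n)) / np.sqrt(m)
x_true = np.zeros(n)
x_true[rng.choice(n, 5, replace=False)] = 1.0
y = A @ x_true

def soft_threshold(z, lam):
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

# prox of gamma*||A(.)-y||^2: argmin_z gamma*||Az-y||^2 + 0.5*||z-v||^2, a linear solve
M = np.linalg.inv(2.0 * gamma * A.T @ A + np.eye(n))
def prox_data(v):
    return M @ (2.0 * gamma * A.T @ y + v)

v = np.zeros(n)
for k in range(300):
    x = soft_threshold(v, gamma * alpha)          # prox of gamma*alpha*||.||_1 = soft-thresholding
    v = v + prox_data(2.0 * x - v) - x

print("relative residual:", np.linalg.norm(A @ x - y) / np.linalg.norm(y))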

Wavefront sets
A point $x\in\mathbb R^2$ is called a singular point of $f$ if there exists $\varphi\in C_0^\infty(U_x)$, where $U_x$ is a neighborhood of $x$ and $\varphi(x)\ne0$, such that $\widehat{f\varphi}\notin C_0^\infty(\mathbb R^2)$.
A pair $(x,\theta)$ is called a regular directed point if there exist neighborhoods $U_x$ and $V_\theta$ of $x$ and $\theta$, respectively, and a $\gamma\in C_0^\infty(\mathbb R^2)$ with $\gamma|_{U_x}\equiv1$ such that
$$\forall N\ \exists C_N>0:\quad |\widehat{f\gamma}(\xi)| \le C_N(1+|\xi|)^{-N}\quad\forall\xi\in\mathbb R^2 \text{ with } \tfrac{\xi_2}{\xi_1}\in V_\theta.$$
The wavefront set consists of all points which are not regular directed points. (Rapid decay of the Fourier transform in a direction means the function is smooth in that direction.)
Notes on the slides: Slides 15/16: Example: $K:X\to Y$ is inpainting; the pairs $(x_i, Kx_i)_i$ are learned by fitting $(Kx_i, x_i)_i$. Slide 18: $\eta$ is noise. Slide 19: the backprojection formula should use $\approx$; no details were explained.

Fig. 45: TODO

7 Partial differential equations
Partial Differential Equations arise in basically any branch of science 06.02.2020
and engineering. They are key for modeling numerous types of (natural)
phenomena such as sound, heat, diffusion, electrostatics, fluid dynamics,
elasticity and gravitation.

7.1 Mathematics of PDEs

definition 7.1.1 ((Partial) differential equation)
Let $x\in\Omega\subset\mathbb R^d$. Then
$$L\Big(x, u(x), \frac{\partial u(x)}{\partial x_1},\dots,\frac{\partial u(x)}{\partial x_d},\frac{\partial^2u(x)}{\partial x_1^2},\dots\Big) = 0 \tag{12}$$
is a partial differential equation. Given a function $L$ we want to find a solution $u:\Omega\to\mathbb R$.

Example 7.1.2 (Ordinary differential equation) For $d = 1$, (12) is called ordinary and could be given by
$$0 = L(u(x), u'(x)) := u(x) - u'(x).$$
For $\Omega := \mathbb R$ the solutions are given by $u(x) = ce^x$ for $c\in\mathbb R$. ˛

Example 7.1.3 (PDEs: Poisson, heat, wave) The partial differential equation
$$\Delta u(x) = \sum_{i=1}^d \frac{\partial^2 u(x)}{\partial x_i^2} = f(x)$$
is called the Poisson equation,
$$\frac{\partial u(x,t)}{\partial t} - \alpha\,\Delta_x u(x,t) = 0$$
is called the heat equation, and
$$\frac{\partial^2 u(x,t)}{\partial t^2} - \Delta_x u(x,t) = 0$$
the wave equation, where $t\in[0,T]$ and $\alpha\in\mathbb R$.
They represent the three different types of PDEs (in that order):
• elliptic PDEs usually model time-independent (stationary) problems and (their solutions) describe a state of minimal energy,
• parabolic PDEs are non-stationary elliptic equations,
• hyperbolic PDEs typically describe waves and their propagation. ˛


Remark 7.1.4 (Motivation) Does there exist a solution? Consider $L(x, u(x), u'(x)) = f$. If we aim for $u\in C^2(\mathbb R)$, then we need $f\in C(\mathbb R)$. In most applications such regularity assumptions are unrealistic. In order to relax them we introduce the concept of weak solutions.
From now on let $\Omega\subset\mathbb R^d$ be open and bounded.

Example 7.1.5 (Poisson's equation) For $f:\Omega\to\mathbb R$ consider
$$(P)\qquad \Delta u = f \ \text{ on } \Omega,\qquad u = 0 \ \text{ on } \partial\Omega.$$
Let $d = 1$ and assume there exists a solution $u\in C^2(\Omega)$. Then by partial integration we obtain
$$\int_\Omega u''(x)\varphi(x)\,dx = -\int_\Omega u'(x)\varphi'(x)\,dx = \int_\Omega f(x)\varphi(x)\,dx$$
for $\varphi\in C_0^\infty(\Omega)$.
The advantage of the weak formulation $-\int_\Omega u'\varphi'\,dx = \int_\Omega f\varphi\,dx$ is that $\Delta u(x) = f(x)$ does not have to hold everywhere and $u\in C^2(\Omega)$ is not required anymore. ˛

definition 7.1.6 (Weak derivative)
Let $\alpha\in\mathbb N_0^d$. Then $v = D^\alpha u : \Omega\to\mathbb R$ is called the weak derivative of $u$ if for all $\varphi\in C_0^\infty(\Omega)$ it holds that
$$(-1)^{|\alpha|}\int_\Omega u(x)\,D^\alpha\varphi(x)\,dx = \int_\Omega v(x)\varphi(x)\,dx.$$

In the following we assume that the functions involved are sufficiently integrable, which in most cases will mean that they are in $L^2(\Omega)$.
which in most cases will mean that they are in L2 pΩq.

definition 7.1.7 ($L^2$-Sobolev spaces)
For $m\in\mathbb N$ let
$$H^m(\Omega) := \{u:\Omega\to\mathbb R : D^\alpha u \text{ exists for all } |\alpha|\le m\},$$
$$H_0^m(\Omega) := \{u\in H^m(\Omega) : u|_{\partial\Omega} = 0\}.$$

Theorem 7.1.1: Scalar product on $H^m$ and $H_0^m$
The spaces $H^m(\Omega)$ and $H_0^m(\Omega)$ are Hilbert spaces via
$$\langle u,v\rangle := \sum_{|\alpha|\le m}\int_\Omega D^\alpha u(x)\,D^\alpha v(x)\,dx.$$


definition 7.1.8 (Weak solution to Poisson's equation)
A function $u\in H_0^1(\Omega)$ is called a weak solution of (P) if
$$-\int_\Omega \nabla u(x)\cdot\nabla v(x)\,dx = \int_\Omega f(x)v(x)\,dx$$
holds for all $v\in H_0^1(\Omega)$.

More generally, we can consider a bilinear form $b : H^m(\Omega)\times H^m(\Omega)\to\mathbb R$ and search for $u\in H^m(\Omega)$ fulfilling
$$b(u,v) = \int_\Omega f(x)v(x)\,dx\quad\forall v\in H^m(\Omega). \tag{13}$$

Theorem 7.1.2: Lax-Milgram (1954)
Let $b$ be symmetric, bounded and coercive, i.e. $b(v,v)\ge c\|v\|^2$ for some $c>0$. ($b$ is bounded if $|b(u,v)|\le C\|u\|\|v\|$ for some $C>0$, and symmetric if $b(u,v) = b(v,u)$.) Then there exists a unique solution $u\in H^m(\Omega)$ to (13).

Corollary 7.1.9
For suitable $f:\Omega\to\mathbb R$, Poisson's equation has a unique weak solution.

7.2 Discretization
Usually finding the exact solution is infeasible, thus we discretize the
equation and the underlying space.
Common methods include finite element methods, finite differences, finite volume methods and frame-based approaches.

High-Fidelity Approximation
The Galerkin method solves, instead of (13),
$$b(u^h, v) = \int_\Omega f(x)v(x)\,dx\quad\forall v\in U^h,$$
where $U^h\subset H^m(\Omega)$ is a $D$-dimensional subspace called the high-fidelity discretization.

Lemma 7.2.1 (Céa (1964))
Up to a constant, $u^h$ is the best approximation of $u$ by elements of $U^h$.

Let $(\varphi_i)_{i=1}^D$ be a basis of $U^h$. Then we have
$$b\Big(\sum_{i=1}^D u_i^h\varphi_i,\ \sum_{j=1}^D v_j\varphi_j\Big) = \int_\Omega f(x)\sum_{j=1}^D v_j\varphi_j(x)\,dx,$$
which, using the bilinearity of $b$, we can rewrite as
$$\sum_{i=1}^D u_i^h\,b(\varphi_i,\varphi_j) - \int_\Omega f(x)\varphi_j(x)\,dx = 0\quad\forall j\in[D].$$


Therefore,
$$u^h = \sum_{i=1}^D u_i^h\varphi_i\quad\text{with}\quad (u_i^h)_{i=1}^D = \big[(b(\varphi_j,\varphi_i))_{i,j=1}^D\big]^{-1}\Big(\int_\Omega f(x)\varphi_i(x)\,dx\Big)_{i=1}^D\in\mathbb R^D.$$
We have reduced the high-dimensional problem to a problem from linear algebra: matrix inversion.
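A minimal sketch for Poisson's equation in one dimension, $u'' = f$ on $(0,1)$ with $u(0) = u(1) = 0$, using piecewise linear hat functions $\varphi_i$ so that $b(\varphi_i,\varphi_j) = -\int\varphi_i'\varphi_j'\,dx$ is tridiagonal; the right-hand side is chosen so that the exact solution is $\sin(\pi x)$, and the load integrals are approximated by nodal quadrature:

import numpy as np

D = 99                                            # number of interior hat functions
h = 1.0 / (D + 1)
x = np.linspace(h, 1.0 - h, D)

f = lambda t: -np.pi**2 * np.sin(np.pi * t)       # then the exact solution is u(t) = sin(pi*t)

# matrix b(phi_i, phi_j) = -int phi_i' phi_j' dt  (tridiagonal)
B = (-1.0 / h) * (2.0 * np.eye(D) - np.eye(D, k=1) - np.eye(D, k=-1))
F = h * f(x)                                      # int f phi_j dt, approximated by h*f(x_j)

u_h = np.linalg.solve(B, F)                       # Galerkin coefficients u_i^h
print("max error:", np.abs(u_h - np.sin(np.pi * x)).max())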

7.3 Parametric PDEs


Parameter dependent families of PDEs arise in basically any branch
of science and engineering such as complex design problems, inverse
problems, optimization tasks and uncertainty quantification.
The number of parameters can be finite (i.e. physical properties such as
domain geometry) or infinite (i.e. when modelling of random stochastic
diffusion field).
Given a differential equation Lpuy , yq “ fy for parameters y, the para-
metric map is Y Ñ H, y ÞÑ uy .
In our setting we will consider parameter-dependent equations of the
form ż
by puy , vq “ fy pxqvpxq dx @y P Y, v P H,

where Y Ă Rp is the compact parameter set (p large), H is a Sobolev
space, by : H ˆ H Ñ R is symmetric, uniformly coercive (uniformly
in y???) and uniformly continuous bilinear form, fy : Ω Ñ R is the We assume the solution manifold
uniformly bounded, parameter-dependent right-hand side and uy P H is pYq :“ tuy : y P Yu Ă H to be
the solution. compact.

Example 7.3.1 (Parametric Poisson's equation)
For $f:\Omega\to\mathbb R$ consider
$$\nabla\cdot(a\,\nabla u) = f \ \text{ in } \Omega,\qquad u = 0 \ \text{ on } \partial\Omega,\qquad\text{where } a\in\mathcal A\subset\{g:\Omega\to\mathbb R \text{ bounded}\}.$$
If $\mathcal A$ is compact, there exist functions $(g_i)_{i=1}^\infty$ such that for every $a\in\mathcal A$ there exists $(y_i)_{i=1}^\infty\subset\mathbb R$ with $a = \sum_{i=1}^\infty y_ig_i$. Since we are solving numerically anyway, we restrict ourselves to the case $a = \sum_{i=1}^p y_ig_i$ for some potentially very large $p\in\mathbb N$. ˛

Multi-query situation
Many applications require solving the parametric PDE multiple times for different parameters:
$$\mathbb R^p\supset\mathcal Y\ni y = (y_1,\dots,y_p)\mapsto u_y\in\mathcal H.$$
Applications include design optimization, optimal control, routine analysis, uncertainty quantification and inverse problems. As the computational cost is often much too high, we use the reduced basis method, exploiting that the solution manifold is compact and approximately low-dimensional. (TODO)


Reduced Basis Method. We have $U^{rb}\subset U^h\subset\mathcal H$ with dimensions $d(\varepsilon)$, $D$ and possibly $\infty$, respectively. The reduced basis is $(\psi_i)$, the high-fidelity basis $(\varphi_i)$, and $V$ is the transformation matrix between them. The reduced solution is
$$u_y^{rb} = (B_y^{rb})^{-1} f_y^{rb}.$$

Theorem 7.3.1: Discrete Version [33]

We assume that
• for all $\varepsilon>0$ there exist $d(\varepsilon)\ll D$ and $V\in\mathbb R^{D\times d(\varepsilon)}$ such that for all $y\in\mathcal Y$ there exists $B_y^{rb}\in\mathbb R^{d(\varepsilon)\times d(\varepsilon)}$ with
$$\big\|V (B_y^{rb})^{-1} V^T f_y^h - u_y^h\big\| \le \varepsilon;$$
• there exist ReLU neural networks $\Phi^B$ and $\Phi^f$ of size $\mathcal O\big(\mathrm{poly}(p)\,(d(\varepsilon))^2\,\mathrm{polylog}(\varepsilon)\big)$ such that for all $y\in\mathcal Y$ we have
$$\|\Phi^B(y) - B_y^{rb}\| \le\varepsilon,\qquad \|\Phi^f(y) - f_y^{rb}\|\le\varepsilon.$$
Then there exists a ReLU neural network $\Phi$ of size $\mathcal O\big((d(\varepsilon))^3\,\mathrm{polylog}(\varepsilon) + D + \mathrm{poly}(p)\,(d(\varepsilon))^2\,\mathrm{polylog}(\varepsilon)\big)$ such that
$$\|\Phi(y) - u_y^h\| \le\varepsilon\quad\forall y\in\mathcal Y. \tag{14}$$

Proof. 1 Scalar multiplication. By [5], scalar multiplication on $[-1,1]$ can be $\varepsilon$-approximated by a neural network of size $\mathcal O(\log(\varepsilon^{-1}))$.
2 Matrix multiplication. A multiplication of two $d\times d$ matrices can be performed with $\mathcal O(d^3)$ scalar multiplications. Therefore, matrix multiplication can be $\varepsilon$-approximated by a neural network of size $\mathcal O((d(\varepsilon))^3\cdot\log(\varepsilon^{-1}))$.
3 Matrix inversion. Neural networks can approximate matrix polynomials and therefore the inversion operator, via the Neumann series
$$\sum_{k=0}^n A^k \xrightarrow{\ n\to\infty\ } (I-A)^{-1}.$$
Therefore, matrix inversion can be approximated by a neural network of size $\mathcal O((d(\varepsilon))^3\cdot\log^q(\varepsilon^{-1}))$ for some $q>0$.
4 Approximating $u_y^h$. We can use both assumptions to approximate $V(B_y^{rb})^{-1}V^Tf_y^h$ (with $f_y^{rb} = V^Tf_y^h$) by a neural network, which by the first assumption is $\varepsilon$-close to $u_y^h$. $\square$
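Step 3 rests on the Neumann series; a minimal sketch (a random matrix rescaled so that $\|A\|_2<1$ is an arbitrary choice) shows the partial sums, which a network can emulate using the approximate multiplications of step 2, converging to $(I-A)^{-1}$:

import numpy as np

rng = np.random.default_rng(8)
d = 10
A = rng.normal(size=(d, d))
A *= 0.9 / np.linalg.norm(A, 2)          # rescale so that ||A||_2 < 1

target = np.linalg.inv(np.eye(d) - A)
S, P = np.eye(d), np.eye(d)              # partial sum and current power A^k
for k in range(1, 61):
    P = P @ A
    S = S + P
    if k % 20 == 0:
        print(k, np.linalg.norm(S - target, 2))   # error of the truncated Neumann series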

Theorem 7.3.2: Continuous version [33]

Let $(\psi_i)_{i=1}^{d(\varepsilon)}$ denote the reduced basis and assume that there exist ReLU neural networks $(\Phi_i)_{i=1}^{d(\varepsilon)}$ of size $\mathcal O(\mathrm{polylog}(\varepsilon))$ such that $\|\Phi_i - \psi_i\|_{\mathcal H}\le\varepsilon$ for all $i\in[d(\varepsilon)]$.
Then there exists a ReLU neural network $\Phi$ of size $\mathcal O\big((d(\varepsilon))^3\,\mathrm{polylog}(\varepsilon) + \mathrm{poly}(p)\,(d(\varepsilon))^2\,\mathrm{polylog}(\varepsilon)\big)$ such that (14) holds.

Remark 7.3.2 These hypotheses are fulfilled for diffusion and linear elasticity equations.
Slide 44: $x'(t) = \varrho(A(t)x(t) - b(t))$, $x(0) = x_0$.

References

[1] Selfmade in GeoGebra6.
[2] Lecture Slides by Prof. Gitta Kutyniok.
[3] Tatiana A Bubba, Gitta Kutyniok, Matti Lassas, Maximilian März,
Wojciech Samek, Samuli Siltanen and Vignesh Srinivasan.
Learning The Invisible: A Hybrid Deep Learning-Shearlet Framework
for Limited Angle Computed Tomography
Inverse Problems, Volume 35, Number 6, 31.05.2019.
[4] Matthew Hutson.
AI researchers allege that machine learning is alchemy.
Science Magazine, 03.05.2018.
[5] Dmitry Yarotsky.
Error bounds for approximations with deep ReLU networks. Neural
networks Volume 94, Pages 103-114, 2016.
[6] Matus Telgarsky.
Representation Benefits of Deep Feedforward Networks.
ArXiv, 2016.
[7] George Cybenko.
Approximation by superpositions of a sigmoidal function.
Mathematics of Control, Signals and Systems, Volume 2, Pages 303-
314, 1989.
[8] Screenshot of What is VC Dimension | Example vc dimension of Line
Hypothesis class on YouTube at 7:01.
[9] Martien C.A. van Zuijlen.
On a conjecture concerning the sum of independent Rademacher
random variables.
ArXiv, 2011.
[10] Wikipedia page for overfitting.
[11] Noah Golowich, Alexander Rakhlin, Ohad Shamir.
Size-Independent Sample Complexity of Neural Networks.
Proceedings of Machine Learning Research, Volume 75, Pages 1–3,
2018.
[12] Chapter 26 in Shalev-Shwartz, Shai; Ben-David, Shai (2014). Under-
standing Machine Learning – from Theory to Algorithms.
[13] Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, Oriol
Vinyals.
Understanding deep learning requires rethinking generalization.
ArXiv, 2019
[14] Mikhail Belkin, Daniel Hsu, Siyuan Ma, Soumik Mandal.
Reconciling modern machine learning practice and the bias-variance
trade-off .
ArXiv, 2019


[15] Michael Elad (CS, Technion).


Deep, Deep Trouble: Deep Learning's Impact on Image Processing, Mathematics, and Humanity.
SIAM News.
[16] Daniel Smilkov, Nikhil Thorat, Been Kim, Fernanda B. Viégas,
Martin Wattenberg.
SmoothGrad: Removing Noise by Adding Noise.
2017.
[17] Baehrens, Schroeter, Harmeling, Kawanabe, Hansen, Müller, 2010
[18] Springenberg, Dosovitskiy, Brox, Riedmiller, 2015
[19] Bach, Binder, Montavon, Klauschen, Müller, Samek, 2015
[20] G. Montavon, W. Samek, K. Müller.
Methods for interpreting and understanding deep neural networks.
Digital Signal Processing 73, 1–15, 2018
[21] Grégoire Montavon, Sebastian Lapuschkin, Alexander Binder, Woj-
ciech Samek, Klaus-Robert Müller
Explaining nonlinear classification decisions with deep Taylor decom-
position.
Pattern Recognition, Volume 65, May 2017, Pages 211-222
[22] Ribeiro, Singh, Guestrin, 2016
[23] (Shapley, 1953)
[24] Kononenko, Štrumbelj, 2010
[25] Ingrid Daubechies, M. Defrise and C. De Mol.
An iterative thresholding algorithm for linear inverse problems with a
sparsity constraint.
Communications on Pure and Applied Mathematics, 57(11), 1413-
1457, November 2004.
[26] Simon S. Du, Xiyu Zhai, Barnabas Poczos, Aarti Singh.
Gradient Descent Provably Optimizes Over-parameterized Neural Net-
works.
Published as a conference paper at International Conference on Learn-
ing Representations 2019.
[27] Simon S. Du, Jason D. Lee, Haochuan Li, Liwei Wang, Xiyu Zhai,
Gradient Descent Finds Global Minima of Deep Neural Networks,
ICML 2019, (see also arXiv:1811.03804).
[28] D. Labate, G. Kutyniok and G. Weiss, W.-Q. Lim.
Sparse multidimensional representation using shearlets.
Proc. of SPIE conference on Wavelet Applications in Signal and Image
Processing XI, San Diego, USA, (2005).
[29] M.Anthony and P. Bartlett.
Neural Network Learning: Theoretical Foundations.
Cambridge University Press, Cambridge, UK, (2009).
[30] Bölcskei, Grohs, Kutyniok and Petersen.


Optimal Approximation with Sparsely Connected Deep Neural Net-


works.
SIAM Journal on Mathematics of Data Science, 2019, Volume 1,
Number 1, pages 8–45.
[31] R. Eldan and O. Shamir.
The Power of Depth for Feedforward Neural Networks.
JMLR: Workshop and Conference Proceedings 49 (2016), 1–34.
[32] T. Poggio, H. Mhaskar, L. Rosasco, B. Miranda, and Q. Liao.
Why and when can deep-but not shallow-networks avoid the curse of
dimensionality: A review.
International Journal of Automation and Computing 14(5) (2017),
503–519.
[33] Gitta Kutyniok, Philipp Petersen, Mones Raslan, Reinhold Schnei-
der.
A Theoretical Analysis of Deep Neural Networks and Parametric
PDEs.
ArXiv
[34] Scott Lundberg, Su-In Lee.
A Unified Approach to Interpreting Model Predictions. ArXiv, 2017.
[35] Stephan Wäldchen, Jan Macdonald, Sascha Hauch, and Gitta Ku-
tyniok.
The Computational Complexity of Understanding Network Decisions.
ArXiv, 2019.
[36] M. Littman, J. Goldsmith, M. Mundhenk.
The computational complexity of probabilistic planning.
Journal of artificial intelligence research 9 (1998), pages 1-36.
[37] Michael Lustig, David L. Donoho, Juan M. Santos, and John M.
Pauly.
Compressed Sensing MRI: A look at how CS can improve on current
imaging techniques.
IEEE SIGNAL PROCESSING MAGAZINE, MARCH 2008.
[38] Gitta Kutyniok and Wang-Q Lim.
Image Separation Using Wavelets and Shearlets. Curves and surfaces.
7th international conference, Avignon, France, June 24–30, 2010.
Revised selected papers
[39] Michal Aharon, Michael Elad, and Alfred Bruckstein.
K-SVD: An Algorithm for Designing Overcomplete Dictionaries for
Sparse Representation.
IEEE TRANSACTIONS ON SIGNAL PROCESSING, VOL. 54, NO.
11, NOVEMBER 2006.
[40] Paschalis P et al.
Tomographic image reconstruction using artificial neural networks.
Nucl. Instrum. Methods 527 211–5,2004.
[41] K. Gregor and Y. LeCun.
Learning fast approximations of sparse coding Proceedings of the 27th


International Conference on Machine Learning (ICML-10), 2010, pp.


399–406.
[42] G. E. Hinton, N. Srivastava, A. Krizhevsky, I. Sutskever and R. R.
Salakhutdinov.
Improving neural networks by preventing co-adaptation of feature
detectors.
ArXiv.
Possible further references for the bibliography:
On Pixel-Wise Explanations for Non-Linear Classifier Decisions by Layer-Wise Relevance Propagation, 2015.
Explaining nonlinear classification decisions with deep Taylor decomposition, 2017.
Layer-Wise Relevance Propagation for Deep Neural Network Architectures, 2016.

Index

A
activation function . . . 2
approximation method . . . 18
artificial neural network . . . 2
artificial neuron . . . 2

B
backpropagation . . . 9
binary en(de)coders . . . 16

C
cartoon-like functions . . . 11
classification . . . 4
compositional functions . . . 24
cross-validation . . . 4, 31

D
distortion . . . 16

E
encoder-decoder pair . . . 16
Expressivity . . . 3

G
Generalization . . . 3
gradient descent . . . 8
gradient flow . . . 26

H
hinge loss . . . 31
hypothesis space . . . 6

I
inpainting . . . 39
Interpretability . . . 3
inverse problem . . . 39

L
Learning . . . 3
loss function . . . 8

M
mean squared error . . . 4
minimal code length . . . 16

O
optimal exponent . . . 16

R
reduced basis method . . . 46
regression . . . 4
Regularization . . . 40
relevance map, Taylor decomposition . . . 36
residual neural network . . . 29

S
shearlet . . . 11
supervised learning . . . 4

U
unsupervised learning . . . 4

V
VC dimension . . . 14

W
wavefront set . . . 42
weak derivative . . . 44
