
ECE 18-898G: Special Topics in Signal Processing:

Sparsity, Structure, and Inference


Neural Networks: A brief touch

Yuejie Chi
Department of Electrical and Computer Engineering

Spring 2018

Outline

• introduction to deep learning

• perceptron model (a single neuron)

• 1-hidden-layer (2-layer) neural network

The rise of deep neural networks

ImageNet Large Scale Visual Recognition Challenge (ILSVRC): led by
Prof. Fei-Fei Li (Stanford).

Total number of non-empty synsets (categories): 21,841;
Total number of images: 14,197,122.
The rise of deep neural networks

The deeper, the better?

AlexNet

• Won the 2012 LSVRC competition by a large margin: top-1 and
  top-5 error rates of 37.5% and 17.0%.
AlexNet

• 60 million parameters and 650,000 neurons.

• takes 5-6 days to train on two GTX 580 3GB GPUs.

• architecture: 5 convolutional layers followed by 3 fully connected layers and a softmax output.
Faster training with ReLU

• Rectified linear units (ReLU):

y = max(0, x)

• compared to tanh and sigmoid, training is much faster.

[Plot: ReLU vs. tanh and sigmoid activation functions over x ∈ [−3, 3].]

ReLU doesn’t saturate.
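A small NumPy sketch (not from the slides) comparing the gradients of these activations illustrates the saturation point: for large |x| the sigmoid and tanh derivatives vanish, while the ReLU derivative stays at 1 for x > 0.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def grad_relu(x):
    return np.where(x > 0, 1.0, 0.0)   # derivative of max(0, x)

def grad_sigmoid(x):
    s = sigmoid(x)
    return s * (1.0 - s)               # saturates for large |x|

def grad_tanh(x):
    return 1.0 - np.tanh(x) ** 2       # saturates for large |x|

xs = np.array([0.5, 3.0, 10.0])
print("x       :", xs)
print("relu'   :", grad_relu(xs))
print("sigmoid':", np.round(grad_sigmoid(xs), 4))
print("tanh'   :", np.round(grad_tanh(xs), 4))
```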


Reduce overfitting

Important to reduce overfitting since:

    number of training data ≪ number of parameters

• Data augmentation: apply label-invariant transforms

• Dropout

• Other ways of “implicit regularization”:
  ◦ early stopping
  ◦ weight decay (ridge regression)
  ◦ ...
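As a concrete illustration of dropout, here is a minimal inverted-dropout sketch in NumPy (the helper `dropout` and its signature are introduced here for illustration, not taken from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout(h, p_drop=0.5, train=True):
    """Inverted dropout: zero each activation with probability p_drop during
    training and rescale the survivors so the expected value is unchanged."""
    if not train:
        return h                       # no dropout at test time
    mask = rng.random(h.shape) >= p_drop
    return h * mask / (1.0 - p_drop)

h = np.ones((2, 8))                    # a small batch of hidden activations
print(dropout(h, p_drop=0.5))          # roughly half the entries are 0, the rest are 2
```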
Learned hierarchical representations

Learned representations using CNN trained on ImageNet:

Figure credit: Y. LeCun’s slide, with research credit to Zeiler and
Fergus, 2013.

single-layer networks (perceptron)
Perceptron

Input x = [x_1, . . . , x_d] ∈ R^d, weight w = [w_1, . . . , w_d] ∈ R^d, output y ∈ R:

    y = σ(w^T x) = σ( ∑_{i=1}^d w_i x_i )

where σ(·) is a nonlinear activation function, e.g. σ(z) = sign(z)
(hard thresholding) or σ(z) = sigmoid(z) = 1/(1 + e^{−z}) (soft thresholding).

[Diagram: a single neuron taking inputs x_1, x_2, x_3 through weights w_1, w_2, w_3 to produce output y.]

Decision making at test stage: given a test sample x, calculate y.
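As a minimal illustration of this forward pass (a sketch assuming NumPy; `perceptron_predict` is a name introduced here, not from the slides):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def perceptron_predict(w, x, activation=sigmoid):
    """Single-neuron output y = activation(w^T x)."""
    return activation(w @ x)

rng = np.random.default_rng(1)
d = 5
w, x = rng.normal(size=d), rng.normal(size=d)
print(perceptron_predict(w, x))        # soft (sigmoid) output in (0, 1)
print(np.sign(w @ x))                  # hard-thresholding version, sign(w^T x)
```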
Nonlinear activation

Nonlinear activation is critical for forming complex decision boundaries.
Training of a perceptron

Empirical risk minimization: given training data {x_i, y_i}_{i=1}^n, find the
weight vector w:

    ŵ = argmin_{w ∈ R^d} (1/n) ∑_{i=1}^n ℓ(w; x_i, y_i)

• find the weight parameter w that best fits the data;

• popular choices for the loss function: quadratic, cross entropy, hinge,
  etc. For the quadratic loss,

    ℓ(w; x_i, y_i) = ( y_i − σ(w^T x_i) )²

We’ll use the quadratic loss and sigmoid activation as a running example.
Training via (stochastic) gradient descent

    ŵ = argmin_{w ∈ R^d} (1/2n) ∑_{i=1}^n ( y_i − σ(w^T x_i) )² := argmin_{w ∈ R^d} ℓ_n(w)

• Gradient descent:

    w_{t+1} = w_t − η_t ∇ℓ_n(w_t)

  where η_t is the step size or learning rate.

• The gradient can be calculated via the chain rule.
  ◦ Let ŷ_i = ŷ_i(w) = σ(w^T x_i); then

        d/dw [ (1/2)(y_i − ŷ_i)² ] = (ŷ_i − y_i) dŷ_i(w)/dw = (ŷ_i − y_i) σ'(w^T x_i) x_i
                                   = (ŷ_i − y_i) ŷ_i (1 − ŷ_i) x_i,

    where the factor multiplying x_i is a scalar and we used
    σ'(z) = σ(z)(1 − σ(z)). This is called the “delta rule”.
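A minimal sketch of this per-sample gradient in NumPy (the function name `grad_sample` is introduced here for illustration):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def grad_sample(w, x, y):
    """Gradient of the per-sample loss (1/2)(y - sigmoid(w^T x))^2 w.r.t. w,
    i.e. the delta rule: (y_hat - y) * y_hat * (1 - y_hat) * x."""
    y_hat = sigmoid(w @ x)
    return (y_hat - y) * y_hat * (1.0 - y_hat) * x
```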
Stochastic gradient descent

Stochastic gradient descent uses only a mini-batch of data in each
iteration. At every iteration t,

  1. Draw a mini-batch of data indexed by S_t ⊂ {1, · · · , n};
  2. Update

        w_{t+1} = w_t − η_t ∑_{i∈S_t} ∇ℓ(w_t; x_i, y_i)
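Putting the two pieces together, a minimal mini-batch SGD loop might look as follows (a sketch reusing `grad_sample` from the previous block; the hyperparameters are arbitrary):

```python
import numpy as np

def sgd(X, y, w0, step_size=0.5, batch_size=32, iters=1000, seed=0):
    """Mini-batch SGD for the sigmoid perceptron with quadratic loss
    (uses grad_sample from the previous sketch)."""
    rng = np.random.default_rng(seed)
    w = w0.copy()
    n = len(y)
    for _ in range(iters):
        batch = rng.choice(n, size=min(batch_size, n), replace=False)
        g = sum(grad_sample(w, X[i], y[i]) for i in batch)   # sum over S_t
        w -= step_size * g
    return w
```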
Detour: Backpropagation

• Backpropagation is the basic algorithm used to train neural networks,
  rediscovered several times in the literature in the 1970s-80s, but
  popularized by the 1986 paper by Rumelhart, Hinton, and Williams.

• Assuming node operations take unit time, backpropagation takes
  linear time, specifically O(Network Size) = O(V + E), to compute the
  gradient, where V is the number of vertices and E is the number of
  edges in the neural network.

Main idea: the chain rule from calculus,

    d/dx f(g(x)) = f'(g(x)) g'(x).

Let’s illustrate the process with a single-output, 2-layer NN.
Derivations of backpropagation
[Diagram: input layer x, hidden layer h (weights w_{m,j}), output layer ŷ (weights v_m).]

Network output:

    ŷ = σ( ∑_m v_m h_m ) = σ( ∑_m v_m σ( ∑_j w_{m,j} x_j ) )

Loss function: f = (1/2)(y − ŷ)².
Backpropagation I

Optimize the weights for each layer, starting with the layer closest to the
outputs and working back to the layer closest to the inputs.

  1. To update the v_m's:

        df/dv_m = (df/dŷ) (dŷ/dv_m)
                = (ŷ − y) dŷ/dv_m
                = (ŷ − y) σ'( ∑_m v_m h_m ) h_m
                = (ŷ − y) ŷ (1 − ŷ) h_m = δ h_m,   where δ := (ŷ − y) ŷ (1 − ŷ).

This is the same as updating the perceptron.
Backpropagation II

  2. To update the w_{m,j}'s:

        df/dw_{m,j} = (df/dŷ) (dŷ/dh_m) (dh_m/dw_{m,j})
                    = (ŷ − y) ŷ (1 − ŷ) v_m h_m (1 − h_m) x_j
                    = δ v_m h_m (1 − h_m) x_j
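To make the two update rules concrete, here is a minimal NumPy sketch of one forward/backward pass for this single-output 2-layer sigmoid network (names such as `forward_backward` are introduced here for illustration, not from the lecture):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward_backward(W, v, x, y):
    """One forward/backward pass for the single-output 2-layer sigmoid network.
    W has shape (m, d) (hidden weights w_{m,j}), v has shape (m,)."""
    # forward pass
    h = sigmoid(W @ x)                    # hidden activations h_m
    y_hat = sigmoid(v @ h)                # network output
    loss = 0.5 * (y - y_hat) ** 2

    # backward pass, layer by layer (the delta rule)
    delta = (y_hat - y) * y_hat * (1.0 - y_hat)       # the scalar delta above
    grad_v = delta * h                                # df/dv_m = delta * h_m
    grad_W = np.outer(delta * v * h * (1.0 - h), x)   # df/dw_{m,j} = delta * v_m * h_m (1 - h_m) * x_j
    return loss, grad_W, grad_v

# optional finite-difference check on one hidden weight
rng = np.random.default_rng(0)
d, m = 4, 3
W, v, x, y = rng.normal(size=(m, d)), rng.normal(size=m), rng.normal(size=d), 1.0
loss, gW, gv = forward_backward(W, v, x, y)
eps = 1e-6
W2 = W.copy(); W2[0, 0] += eps
loss2, _, _ = forward_backward(W2, v, x, y)
print(gW[0, 0], (loss2 - loss) / eps)     # the two numbers should nearly match
```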
Questions we may ask (in general)

• Representation: how well can a given network (with a fixed activation)
  approximate / explain the training data?

• Generalization: how well does the learned ŵ behave in prediction
  during testing?

• Optimization: how does the output w_t of (S)GD relate to ŵ?
  (We should really plug w_t into the previous two questions!)
Nonconvex landscape of perceptron can be very bad

SGD converges to local minimizers. Are they global?

Theorem 12.1 (Auer et al., 1995)
Let σ(·) be the sigmoid and ℓ(·) be the quadratic loss function. There
exists a sequence of training samples {x_i, y_i}_{i=1}^n such that ℓ_n(w) has
⌊n/d⌋^d distinct local minima.

Consequence: there may exist exponentially many bad local minima
for worst-case data! (curse of dimensionality)
Why?

• saturation of the sigmoid

[Plots of the single-sample losses ℓ(w; 10, 0.55) and ℓ(w; 0.7, 0.25) for w ∈ [−10, 10].]

  ◦ each sample produces a local min + flat surfaces away from the
    minimizer
  ◦ if the local min of sample A falls into the flat region of sample B
    (and vice versa), the sum of the sample losses preserves both minima.
Why?

• We get one local minimum per sample in 1D.

[Plot: a 1D empirical loss over w ∈ [−10, 10] with the local minima marked.]

• Curse of dimensionality: we construct the samples to get ⌊n/d⌋^d
  distinct local minima in d dimensions.
Statistical models come to rescue

Data/measurements follow certain statistical models and hence are
not worst-case instances.

    minimize_w  ℓ_n(w) = (1/n) ∑_{i=1}^n ℓ(w; x_i, y_i)  →  E[ℓ(w; x, y)] := ℓ(w)   as n → ∞

                empirical risk ≈ population risk

[Contour plots comparing the population risk, minimized at θ_0 = [1, 0], and the
empirical risk, minimized at θ̂_n = [0.816, −0.268].]
Figure credit: Mei, Bai and Montanari


Statistical models of training data

Assume the training data {x_i, y_i}_{i=1}^n is drawn i.i.d. from some
distribution:
    (x, y) ∼ p(x, y)
We are using neural networks to fit p(x, y).

• A planted-truth model: let x ∼ N(0, I) and draw the label y as
  ◦ regression model:
        y_i = σ(w_⋆^T x_i)
  ◦ classification model: y_i ∈ {0, 1}, where
        P(y_i = 1) = σ(w_⋆^T x_i)

• Parameter recovery: can we recover w_⋆ using {x_i, y_i}_{i=1}^n?
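A minimal sketch of drawing data from this planted-truth model (assuming NumPy; `planted_truth_data` is an illustrative name, not from the slides):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def planted_truth_data(n, d, w_star, model="regression", seed=0):
    """Draw {x_i, y_i} from the planted-truth model: x ~ N(0, I) and
    y = sigma(w_star^T x) (regression) or y ~ Bernoulli(sigma(w_star^T x))."""
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(n, d))
    p = sigmoid(X @ w_star)
    if model == "regression":
        y = p
    else:
        y = rng.binomial(1, p).astype(float)
    return X, y

d = 10
w_star = np.ones(d) / np.sqrt(d)
X, y = planted_truth_data(n=2000, d=d, w_star=w_star, model="classification")
```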
Roadmap

  1. Step 1: verify the landscape properties of the population loss;
  2. Step 2: translate properties of the population loss to the empirical loss;
  3. Step 3: argue that ŵ (minimizer of the empirical loss) is close to w_⋆
     (minimizer of the population loss).
Step 1: population risk

[Contour plot of the population risk over (θ_1, θ_2), with the minimizer θ_0 marked.]

• w_⋆ is the unique local minimizer that is also global. No bad local
  minima!

• strongly convex near the global optimum; large gradient elsewhere
Nonconvex landscape: from population to empirical

Figure credit: Mei, Bai and Montanari

Step 2: uniform convergence of gradients & Hessian

Theorem 12.2 (Bai, Song, Montanari, 2017)


Under suitable assumptions, for any δ > 0, there exists a positive
constant C depending on (R, δ) but independent of n and d, such
that as long as n ≥ Cd log d, we have
  1. preservation of gradient:

     P( sup_{‖w‖≤R} ‖∇ℓ_n(w) − ∇ℓ(w)‖_2 ≤ √(C d log n / n) ) ≥ 1 − δ;

  2. preservation of Hessian:

     P( sup_{‖w‖≤R} ‖∇²ℓ_n(w) − ∇²ℓ(w)‖ ≤ √(C d log n / n) ) ≥ 1 − δ.
Step 3: establish rate of convergence for ERM

By the mean-value theorem, there exists some w′ between ŵ and w_⋆
such that

    ℓ_n(ŵ) = ℓ_n(w_⋆) + ⟨∇ℓ_n(w_⋆), ŵ − w_⋆⟩ + (1/2)(ŵ − w_⋆)^T ∇²ℓ_n(w′)(ŵ − w_⋆)
           ≤ ℓ_n(w_⋆),

where the last line follows from the optimality of ŵ. Then

    (1/2) λ_min(∇²ℓ_n(w′)) ‖ŵ − w_⋆‖_2² ≤ (1/2)(ŵ − w_⋆)^T ∇²ℓ_n(w′)(ŵ − w_⋆)
                                        ≤ |⟨∇ℓ_n(w_⋆), ŵ − w_⋆⟩|
                                        ≤ ‖∇ℓ_n(w_⋆)‖ · ‖ŵ − w_⋆‖,

    so that  ‖ŵ − w_⋆‖_2 ≤ 2‖∇ℓ_n(w_⋆)‖ / λ_min(∇²ℓ_n(w′)) ≲ √(C d log n / n).
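As an informal numerical counterpart to Steps 1-3, the sketch below fits the sigmoid perceptron on noiseless planted-truth regression data by gradient descent and checks that the recovered ŵ is close to w_⋆ (names and hyperparameters are illustrative, not from the lecture):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_perceptron(X, y, step_size=2.0, iters=3000):
    """Full-batch gradient descent on ell_n(w) = (1/2n) sum_i (y_i - sigma(w^T x_i))^2."""
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(iters):
        y_hat = sigmoid(X @ w)
        grad = X.T @ ((y_hat - y) * y_hat * (1.0 - y_hat)) / n
        w -= step_size * grad
    return w

rng = np.random.default_rng(0)
d, n = 10, 5000
w_star = rng.normal(size=d)
w_star /= np.linalg.norm(w_star)
X = rng.normal(size=(n, d))
y = sigmoid(X @ w_star)                  # noiseless planted regression labels
w_hat = fit_perceptron(X, y)
print(np.linalg.norm(w_hat - w_star))    # should be small
```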
two-layer networks

Two-layer linear network
Given arbitrary data {x_i, y_i}_{i=1}^n, x_i, y_i ∈ R^d, fit the two-layer linear
network with the quadratic loss:

    f(A, B) = ∑_{i=1}^n ‖y_i − A B x_i‖_2²

where B ∈ R^{p×d}, A ∈ R^{d×p}, and p ≤ d.

[Diagram: input x mapped to a p-dimensional hidden layer by B, then to the output by A.]

• special case: auto-association (auto-encoding, identity mapping),
  where y_i = x_i, e.g. for image compression.
Landscape of 2-layer linear network

• bears some similarity with the nonconvex matrix factorization problem;

• lacks identifiability: for any invertible C, AB = (AC)(C^{−1}B);

• define X = [x_1, x_2, · · · , x_n] and Y = [y_1, y_2, · · · , y_n]; then

    f(A, B) = ‖Y − A B X‖_F²

• let Σ_XX = ∑_{i=1}^n x_i x_i^T = X X^T, Σ_YY = Y Y^T,
  Σ_XY = X Y^T, and Σ_YX = Y X^T;

• when X = Y, any optimum satisfies A B = U U^T, where U contains the
  top p eigenvectors of Σ_XX.
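A quick numerical sanity check of the auto-encoding claim above, i.e. that for X = Y the optimal product AB is the projector onto the top-p eigenvectors of Σ_XX (a sketch; variable names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
d, p, n = 8, 3, 500
X = rng.normal(size=(d, n))              # here Y = X (auto-encoding)
Sigma_XX = X @ X.T

# top-p eigenvectors of Sigma_XX (np.linalg.eigh sorts eigenvalues ascending)
eigvals, eigvecs = np.linalg.eigh(Sigma_XX)
U = eigvecs[:, -p:]
AB_opt = U @ U.T                         # the claimed optimal product A B

def loss(AB):
    return np.linalg.norm(X - AB @ X, "fro") ** 2

A, B = rng.normal(size=(d, p)), rng.normal(size=(p, d))
print(loss(AB_opt), loss(A @ B))         # the PCA projector should have smaller loss
```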
Landscape of 2-layer linear network

Theorem 12.3 (Baldi and Hornik, 1989)


Suppose X is full rank and hence Σ_XX is invertible. Further assume

    Σ := Σ_YX Σ_XX^{−1} Σ_XY

is full rank with d distinct eigenvalues λ_1 > · · · > λ_d > 0. Then
f(A, B) has no spurious local minima, except for equivalent versions
of the global minimum due to invertible transformations.

• no bad local minima!

• generalizable to multi-layer linear networks.
Critical points of two-layer linear networks

Lemma 12.4 (critical points)


Any critical point satisfies

AB = PA ΣY X Σ−1
XX ,

where A satisfies

PA Σ = PA ΣPA = ΣPA ,

where PA is the ortho-projector projecting onto the column span of


the sub-indexed matrix.

37/41
Critical points of two-layer linear networks

Let the EVD of Σ = Σ_YX Σ_XX^{−1} Σ_XY be Σ = U Λ U^T.

Lemma 12.5 (critical points)

At any critical point, A can be written in the form

    A = [U_J, 0_{d×(p−r)}] C

where rank(A) = r ≤ p, J ⊂ {1, . . . , d} with |J| = r, and C is invertible.
Correspondingly,

    B = C^{−1} [ U_J^T Σ_YX Σ_XX^{−1} ;  last p − r rows of C L ]   (blocks stacked vertically),

where L is any d × d matrix.

One can verify that (A, B) is a global optimum if and only if J = {1, . . . , p}.
Two-layer nonlinear network

Local strong convexity under the Gaussian model [Fu et al., 2018].

[Diagram: a one-hidden-layer nonlinear network mapping input x through hidden weights W to output y.]
Reference I

[1] “ImageNet,” www.image-net.org/.
[2] “Deep learning,” LeCun, Bengio, and Hinton, Nature, 2015.
[3] “ImageNet classification with deep convolutional neural networks,”
    Krizhevsky et al., NIPS, 2012.
[4] “Learning representations by back-propagating errors,” Rumelhart,
    Hinton, and Williams, Nature, 1986.
[5] “Dropout: A simple way to prevent neural networks from overfitting,”
    Srivastava et al., JMLR, 2014.
[6] “Dropout training as adaptive regularization,” Wager et al., NIPS, 2013.
[7] “On the importance of initialization and momentum in deep learning,”
    Sutskever et al., ICML, 2013.
[8] “Exponentially many local minima for single neurons,” P. Auer,
    M. Herbster and K. Warmuth, NIPS, 1995.
Reference II

[9] “The landscape of empirical risk for non-convex losses,” S. Mei, Y. Bai,
    and A. Montanari, arXiv preprint arXiv:1607.06534, 2016.
[10] “Neural networks and principal component analysis: Learning from
     examples without local minima,” P. Baldi and K. Hornik, Neural
     Networks, 1989.
[11] “Local Geometry of One-Hidden-Layer Neural Networks for Logistic
     Regression,” Fu, Chi, Liang, arXiv preprint arXiv:1802.06463, 2018.
