
ECE 18-898G: Special Topics in Signal Processing:

Sparsity, Structure, and Inference


Neural Networks: A brief touch

Yuejie Chi
Department of Electrical and Computer Engineering

Spring 2018

Outline

• introduction to deep learning

• perceptron model (a single neuron)

• 1-hidden-layer (2-layer) neural network

The rise of deep neural networks

ImageNet Large Scale Visual Recognition Challenge (ILSVRC): led by
Prof. Fei-Fei Li (Stanford).

Total number of non-empty synsets (categories): 21,841;
Total number of images: 14,197,122.
The rise of deep neural networks

The deeper, the better?

AlexNet

• Won the 2012 LSVRC competition by a large margin: top-1 and
  top-5 error rates of 37.5% and 17.0%.
AlexNet

• 60 million parameters and 650,000 neurons.

• takes 5-6 days to train on two GTX 580 3GB GPUs.

• architecture: 5 convolutional layers followed by 3 fully connected layers and a softmax output.
Faster training with ReLU

• Rectified linear units (ReLU):

y = max(0, x)

• compared to tanh and sigmoid, training is much faster.

[Plot: ReLU vs. tanh and sigmoid activation functions over x ∈ [−3, 3].]

ReLU doesn’t saturate.
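A small NumPy sketch (not from the slides) comparing the gradients of these activations illustrates the saturation point: for large |x| the sigmoid and tanh derivatives vanish, while the ReLU derivative stays at 1 for x > 0.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def grad_relu(x):
    return np.where(x > 0, 1.0, 0.0)   # derivative of max(0, x)

def grad_sigmoid(x):
    s = sigmoid(x)
    return s * (1.0 - s)               # saturates for large |x|

def grad_tanh(x):
    return 1.0 - np.tanh(x) ** 2       # saturates for large |x|

xs = np.array([0.5, 3.0, 10.0])
print("x       :", xs)
print("relu'   :", grad_relu(xs))
print("sigmoid':", np.round(grad_sigmoid(xs), 4))
print("tanh'   :", np.round(grad_tanh(xs), 4))
```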


Reduce overfitting

Important to reduce overfitting since:

    number of training data ≪ number of parameters

• Data augmentation: apply label-invariant transforms

• Dropout

• Other ways of “implicit regularization”:
  ◦ early stopping
  ◦ weight decay (ridge regression)
  ◦ ...
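As a concrete illustration of dropout, here is a minimal inverted-dropout sketch in NumPy (the helper `dropout` and its signature are introduced here for illustration, not taken from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout(h, p_drop=0.5, train=True):
    """Inverted dropout: zero each activation with probability p_drop during
    training and rescale the survivors so the expected value is unchanged."""
    if not train:
        return h                       # no dropout at test time
    mask = rng.random(h.shape) >= p_drop
    return h * mask / (1.0 - p_drop)

h = np.ones((2, 8))                    # a small batch of hidden activations
print(dropout(h, p_drop=0.5))          # roughly half the entries are 0, the rest are 2
```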
Learned hierarchical representations

Learned representations using CNN trained on ImageNet:

Figure credit: Y. LeCun’s slide, with research credit to Zeiler and
Fergus, 2013.

single-layer networks (perceptron)
Perceptron

Input x = [x_1, . . . , x_d] ∈ R^d, weight w = [w_1, . . . , w_d] ∈ R^d, output y ∈ R:

    y = σ(w^T x) = σ( ∑_{i=1}^d w_i x_i )

where σ(·) is a nonlinear activation function, e.g. σ(z) = sign(z)
(hard thresholding) or σ(z) = sigmoid(z) = 1/(1 + e^{−z}) (soft thresholding).

[Diagram: a single neuron taking inputs x_1, x_2, x_3 through weights w_1, w_2, w_3 to produce output y.]

Decision making at test stage: given a test sample x, calculate y.
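As a minimal illustration of this forward pass (a sketch assuming NumPy; `perceptron_predict` is a name introduced here, not from the slides):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def perceptron_predict(w, x, activation=sigmoid):
    """Single-neuron output y = activation(w^T x)."""
    return activation(w @ x)

rng = np.random.default_rng(1)
d = 5
w, x = rng.normal(size=d), rng.normal(size=d)
print(perceptron_predict(w, x))        # soft (sigmoid) output in (0, 1)
print(np.sign(w @ x))                  # hard-thresholding version, sign(w^T x)
```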
Nonlinear activation

Nonlinear activation is critical for forming complex decision boundaries.
Training of a perceptron

Empirical risk minimization: given training data {x_i, y_i}_{i=1}^n, find the
weight vector w:

    ŵ = argmin_{w ∈ R^d} (1/n) ∑_{i=1}^n ℓ(w; x_i, y_i)

• find the weight parameter w that best fits the data;

• popular choices for the loss function: quadratic, cross entropy, hinge,
  etc. For the quadratic loss,

    ℓ(w; x_i, y_i) = ( y_i − σ(w^T x_i) )²

We’ll use the quadratic loss and sigmoid activation as a running example.
Training via (stochastic) gradient descent

    ŵ = argmin_{w ∈ R^d} (1/2n) ∑_{i=1}^n ( y_i − σ(w^T x_i) )² := argmin_{w ∈ R^d} ℓ_n(w)

• Gradient descent:

    w_{t+1} = w_t − η_t ∇ℓ_n(w_t)

  where η_t is the step size or learning rate.

• The gradient can be calculated via the chain rule.
  ◦ Let ŷ_i = ŷ_i(w) = σ(w^T x_i); then

        d/dw [ (1/2)(y_i − ŷ_i)² ] = (ŷ_i − y_i) dŷ_i(w)/dw = (ŷ_i − y_i) σ'(w^T x_i) x_i
                                   = (ŷ_i − y_i) ŷ_i (1 − ŷ_i) x_i,

    where the factor multiplying x_i is a scalar and we used
    σ'(z) = σ(z)(1 − σ(z)). This is called the “delta rule”.
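A minimal sketch of this per-sample gradient in NumPy (the function name `grad_sample` is introduced here for illustration):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def grad_sample(w, x, y):
    """Gradient of the per-sample loss (1/2)(y - sigmoid(w^T x))^2 w.r.t. w,
    i.e. the delta rule: (y_hat - y) * y_hat * (1 - y_hat) * x."""
    y_hat = sigmoid(w @ x)
    return (y_hat - y) * y_hat * (1.0 - y_hat) * x
```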
Stochastic gradient descent

Stochastic gradient descent uses only a mini-batch of data in each
iteration. At every iteration t,

  1. Draw a mini-batch of data indexed by S_t ⊂ {1, · · · , n};
  2. Update

        w_{t+1} = w_t − η_t ∑_{i∈S_t} ∇ℓ(w_t; x_i, y_i)
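Putting the two pieces together, a minimal mini-batch SGD loop might look as follows (a sketch reusing `grad_sample` from the previous block; the hyperparameters are arbitrary):

```python
import numpy as np

def sgd(X, y, w0, step_size=0.5, batch_size=32, iters=1000, seed=0):
    """Mini-batch SGD for the sigmoid perceptron with quadratic loss
    (uses grad_sample from the previous sketch)."""
    rng = np.random.default_rng(seed)
    w = w0.copy()
    n = len(y)
    for _ in range(iters):
        batch = rng.choice(n, size=min(batch_size, n), replace=False)
        g = sum(grad_sample(w, X[i], y[i]) for i in batch)   # sum over S_t
        w -= step_size * g
    return w
```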
Detour: Backpropagation

• Backpropagation is the basic algorithm used to train neural networks,
  rediscovered several times in the literature in the 1970s-80s, but
  popularized by the 1986 paper by Rumelhart, Hinton, and Williams.

• Assuming node operations take unit time, backpropagation takes
  linear time, specifically O(Network Size) = O(V + E), to compute the
  gradient, where V is the number of vertices and E is the number of
  edges in the neural network.

Main idea: the chain rule from calculus,

    d/dx f(g(x)) = f'(g(x)) g'(x).

Let’s illustrate the process with a single-output, 2-layer NN.
Derivations of backpropagation
[Diagram: input layer x, hidden layer h (weights w_{m,j}), output layer ŷ (weights v_m).]

Network output:

    ŷ = σ( ∑_m v_m h_m ) = σ( ∑_m v_m σ( ∑_j w_{m,j} x_j ) )

Loss function: f = (1/2)(y − ŷ)².
Backpropagation I

Optimize the weights for each layer, starting with the layer closest to the
outputs and working back to the layer closest to the inputs.

  1. To update the v_m's:

        df/dv_m = (df/dŷ) (dŷ/dv_m)
                = (ŷ − y) dŷ/dv_m
                = (ŷ − y) σ'( ∑_m v_m h_m ) h_m
                = (ŷ − y) ŷ (1 − ŷ) h_m = δ h_m,   where δ := (ŷ − y) ŷ (1 − ŷ).

This is the same as updating the perceptron.
Backpropagation II

  2. To update the w_{m,j}'s:

        df/dw_{m,j} = (df/dŷ) (dŷ/dh_m) (dh_m/dw_{m,j})
                    = (ŷ − y) ŷ (1 − ŷ) v_m h_m (1 − h_m) x_j
                    = δ v_m h_m (1 − h_m) x_j
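To make the two update rules concrete, here is a minimal NumPy sketch of one forward/backward pass for this single-output 2-layer sigmoid network (names such as `forward_backward` are introduced here for illustration, not from the lecture):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward_backward(W, v, x, y):
    """One forward/backward pass for the single-output 2-layer sigmoid network.
    W has shape (m, d) (hidden weights w_{m,j}), v has shape (m,)."""
    # forward pass
    h = sigmoid(W @ x)                    # hidden activations h_m
    y_hat = sigmoid(v @ h)                # network output
    loss = 0.5 * (y - y_hat) ** 2

    # backward pass, layer by layer (the delta rule)
    delta = (y_hat - y) * y_hat * (1.0 - y_hat)       # the scalar delta above
    grad_v = delta * h                                # df/dv_m = delta * h_m
    grad_W = np.outer(delta * v * h * (1.0 - h), x)   # df/dw_{m,j} = delta * v_m * h_m (1 - h_m) * x_j
    return loss, grad_W, grad_v

# optional finite-difference check on one hidden weight
rng = np.random.default_rng(0)
d, m = 4, 3
W, v, x, y = rng.normal(size=(m, d)), rng.normal(size=m), rng.normal(size=d), 1.0
loss, gW, gv = forward_backward(W, v, x, y)
eps = 1e-6
W2 = W.copy(); W2[0, 0] += eps
loss2, _, _ = forward_backward(W2, v, x, y)
print(gW[0, 0], (loss2 - loss) / eps)     # the two numbers should nearly match
```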
Questions we may ask (in general)

• Representation: how well can a given network (with a fixed activation)
  approximate / explain the training data?

• Generalization: how well does the learned ŵ behave in prediction
  during testing?

• Optimization: how does the output w_t of (S)GD relate to ŵ?
  (We should really plug w_t into the previous two questions!)
Nonconvex landscape of perceptron can be very bad

SGD converges to local minimizers. Are they global?

Theorem 12.1 (Auer et al., 1995)
Let σ(·) be the sigmoid and ℓ(·) be the quadratic loss function. There
exists a sequence of training samples {x_i, y_i}_{i=1}^n such that ℓ_n(w) has
⌊n/d⌋^d distinct local minima.

Consequence: there may exist exponentially many bad local minima
for worst-case data! (curse of dimensionality)
Why?

• saturation of the sigmoid

[Plots of the single-sample losses ℓ(w; 10, 0.55) and ℓ(w; 0.7, 0.25) for w ∈ [−10, 10].]

  ◦ each sample produces a local min + flat surfaces away from the
    minimizer
  ◦ if the local min of sample A falls into the flat region of sample B
    (and vice versa), the sum of the sample losses preserves both minima.
Why?

• We get one local minimum per sample in 1D.

[Plot: a 1D empirical loss over w ∈ [−10, 10] with the local minima marked.]

• Curse of dimensionality: we construct the samples to get ⌊n/d⌋^d
  distinct local minima in d dimensions.
Statistical models come to rescue

Data/measurements follow certain statistical models and hence are
not worst-case instances.

    minimize_w  ℓ_n(w) = (1/n) ∑_{i=1}^n ℓ(w; x_i, y_i)  →  E[ℓ(w; x, y)] := ℓ(w)   as n → ∞

                empirical risk ≈ population risk

[Contour plots comparing the population risk, minimized at θ_0 = [1, 0], and the
empirical risk, minimized at θ̂_n = [0.816, −0.268].]
Figure credit: Mei, Bai and Montanari


Statistical models of training data

Assume the training data {x_i, y_i}_{i=1}^n is drawn i.i.d. from some
distribution:
    (x, y) ∼ p(x, y)
We are using neural networks to fit p(x, y).

• A planted-truth model: let x ∼ N(0, I) and draw the label y as
  ◦ regression model:
        y_i = σ(w_⋆^T x_i)
  ◦ classification model: y_i ∈ {0, 1}, where
        P(y_i = 1) = σ(w_⋆^T x_i)

• Parameter recovery: can we recover w_⋆ using {x_i, y_i}_{i=1}^n?
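A minimal sketch of drawing data from this planted-truth model (assuming NumPy; `planted_truth_data` is an illustrative name, not from the slides):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def planted_truth_data(n, d, w_star, model="regression", seed=0):
    """Draw {x_i, y_i} from the planted-truth model: x ~ N(0, I) and
    y = sigma(w_star^T x) (regression) or y ~ Bernoulli(sigma(w_star^T x))."""
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(n, d))
    p = sigmoid(X @ w_star)
    if model == "regression":
        y = p
    else:
        y = rng.binomial(1, p).astype(float)
    return X, y

d = 10
w_star = np.ones(d) / np.sqrt(d)
X, y = planted_truth_data(n=2000, d=d, w_star=w_star, model="classification")
```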
Roadmap

  1. Step 1: verify the landscape properties of the population loss;
  2. Step 2: translate properties of the population loss to the empirical loss;
  3. Step 3: argue that ŵ (minimizer of the empirical loss) is close to w_⋆
     (minimizer of the population loss).
Step 1: population risk

[Contour plot of the population risk over (θ_1, θ_2), with the minimizer θ_0 marked.]

• w_⋆ is the unique local minimizer that is also global. No bad local
  minima!

• strongly convex near the global optimum; large gradient elsewhere
Nonconvex landscape: from population to empirical

Figure credit: Mei, Bai and Montanari

Step 2: uniform convergence of gradients & Hessian

Theorem 12.2 (Bai, Song, Montanari, 2017)


Under suitable assumptions, for any δ > 0, there exists a positive
constant C depending on (R, δ) but independent of n and d, such
that as long as n ≥ Cd log d, we have
  1. preservation of gradient:

     P( sup_{‖w‖≤R} ‖∇ℓ_n(w) − ∇ℓ(w)‖_2 ≤ √(C d log n / n) ) ≥ 1 − δ;

  2. preservation of Hessian:

     P( sup_{‖w‖≤R} ‖∇²ℓ_n(w) − ∇²ℓ(w)‖ ≤ √(C d log n / n) ) ≥ 1 − δ.
Step 3: establish rate of convergence for ERM

By the mean-value theorem, there exists some w′ between ŵ and w_⋆
such that

    ℓ_n(ŵ) = ℓ_n(w_⋆) + ⟨∇ℓ_n(w_⋆), ŵ − w_⋆⟩ + (1/2)(ŵ − w_⋆)^T ∇²ℓ_n(w′)(ŵ − w_⋆)
           ≤ ℓ_n(w_⋆),

where the last line follows from the optimality of ŵ. Then

    (1/2) λ_min(∇²ℓ_n(w′)) ‖ŵ − w_⋆‖_2² ≤ (1/2)(ŵ − w_⋆)^T ∇²ℓ_n(w′)(ŵ − w_⋆)
                                        ≤ |⟨∇ℓ_n(w_⋆), ŵ − w_⋆⟩|
                                        ≤ ‖∇ℓ_n(w_⋆)‖ · ‖ŵ − w_⋆‖,

    so that  ‖ŵ − w_⋆‖_2 ≤ 2‖∇ℓ_n(w_⋆)‖ / λ_min(∇²ℓ_n(w′)) ≲ √(C d log n / n).
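As an informal numerical counterpart to Steps 1-3, the sketch below fits the sigmoid perceptron on noiseless planted-truth regression data by gradient descent and checks that the recovered ŵ is close to w_⋆ (names and hyperparameters are illustrative, not from the lecture):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_perceptron(X, y, step_size=2.0, iters=3000):
    """Full-batch gradient descent on ell_n(w) = (1/2n) sum_i (y_i - sigma(w^T x_i))^2."""
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(iters):
        y_hat = sigmoid(X @ w)
        grad = X.T @ ((y_hat - y) * y_hat * (1.0 - y_hat)) / n
        w -= step_size * grad
    return w

rng = np.random.default_rng(0)
d, n = 10, 5000
w_star = rng.normal(size=d)
w_star /= np.linalg.norm(w_star)
X = rng.normal(size=(n, d))
y = sigmoid(X @ w_star)                  # noiseless planted regression labels
w_hat = fit_perceptron(X, y)
print(np.linalg.norm(w_hat - w_star))    # should be small
```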
two-layer networks

Two-layer linear network
Given arbitrary data {x_i, y_i}_{i=1}^n, x_i, y_i ∈ R^d, fit the two-layer linear
network with the quadratic loss:

    f(A, B) = ∑_{i=1}^n ‖y_i − A B x_i‖_2²

where B ∈ R^{p×d}, A ∈ R^{d×p}, and p ≤ d.

[Diagram: input x mapped to a p-dimensional hidden layer by B, then to the output by A.]

• special case: auto-association (auto-encoding, identity mapping),
  where y_i = x_i, e.g. for image compression.
Landscape of 2-layer linear network

• bears some similarity with the nonconvex matrix factorization problem;

• lacks identifiability: for any invertible C, AB = (AC)(C^{−1}B);

• define X = [x_1, x_2, · · · , x_n] and Y = [y_1, y_2, · · · , y_n]; then

    f(A, B) = ‖Y − A B X‖_F²

• let Σ_XX = ∑_{i=1}^n x_i x_i^T = X X^T, Σ_YY = Y Y^T,
  Σ_XY = X Y^T, and Σ_YX = Y X^T;

• when X = Y, any optimum satisfies A B = U U^T, where U contains the
  top p eigenvectors of Σ_XX.
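A quick numerical sanity check of the auto-encoding claim above, i.e. that for X = Y the optimal product AB is the projector onto the top-p eigenvectors of Σ_XX (a sketch; variable names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
d, p, n = 8, 3, 500
X = rng.normal(size=(d, n))              # here Y = X (auto-encoding)
Sigma_XX = X @ X.T

# top-p eigenvectors of Sigma_XX (np.linalg.eigh sorts eigenvalues ascending)
eigvals, eigvecs = np.linalg.eigh(Sigma_XX)
U = eigvecs[:, -p:]
AB_opt = U @ U.T                         # the claimed optimal product A B

def loss(AB):
    return np.linalg.norm(X - AB @ X, "fro") ** 2

A, B = rng.normal(size=(d, p)), rng.normal(size=(p, d))
print(loss(AB_opt), loss(A @ B))         # the PCA projector should have smaller loss
```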
Landscape of 2-layer linear network

Theorem 12.3 (Baldi and Hornik, 1989)


Suppose X is full rank and hence Σ_XX is invertible. Further assume

    Σ := Σ_YX Σ_XX^{−1} Σ_XY

is full rank with d distinct eigenvalues λ_1 > · · · > λ_d > 0. Then
f(A, B) has no spurious local minima, except for equivalent versions
of the global minimum due to invertible transformations.

• no bad local minima!

• generalizable to multi-layer linear networks.
Critical points of two-layer linear networks

Lemma 12.4 (critical points)


Any critical point satisfies

AB = PA ΣY X Σ−1
XX ,

where A satisfies

PA Σ = PA ΣPA = ΣPA ,

where PA is the ortho-projector projecting onto the column span of


the sub-indexed matrix.

37/41
Critical points of two-layer linear networks

Let the EVD of Σ = Σ_YX Σ_XX^{−1} Σ_XY be Σ = U Λ U^T.

Lemma 12.5 (critical points)

At any critical point, A can be written in the form

    A = [U_J, 0_{d×(p−r)}] C

where rank(A) = r ≤ p, J ⊂ {1, . . . , d} with |J| = r, and C is invertible.
Correspondingly,

    B = C^{−1} [ U_J^T Σ_YX Σ_XX^{−1} ;  last p − r rows of C L ]   (blocks stacked vertically),

where L is any d × d matrix.

One can verify that (A, B) is a global optimum if and only if J = {1, . . . , p}.
Two-layer nonlinear network

Local strong convexity under the Gaussian model [Fu et al., 2018].

[Diagram: a one-hidden-layer nonlinear network mapping input x through hidden weights W to output y.]
Reference I

[1] “ImageNet,” www.image-net.org/.
[2] “Deep learning,” LeCun, Bengio, and Hinton, Nature, 2015.
[3] “ImageNet classification with deep convolutional neural networks,”
    Krizhevsky et al., NIPS, 2012.
[4] “Learning representations by back-propagating errors,” Rumelhart,
    Hinton, and Williams, Nature, 1986.
[5] “Dropout: A simple way to prevent neural networks from overfitting,”
    Srivastava et al., JMLR, 2014.
[6] “Dropout training as adaptive regularization,” Wager et al., NIPS, 2013.
[7] “On the importance of initialization and momentum in deep learning,”
    Sutskever et al., ICML, 2013.
[8] “Exponentially many local minima for single neurons,” P. Auer,
    M. Herbster and K. Warmuth, NIPS, 1995.
Reference II

[9] “The landscape of empirical risk for non-convex losses,” S. Mei, Y. Bai,
    and A. Montanari, arXiv preprint arXiv:1607.06534, 2016.
[10] “Neural networks and principal component analysis: Learning from
     examples without local minima,” P. Baldi and K. Hornik, Neural
     Networks, 1989.
[11] “Local Geometry of One-Hidden-Layer Neural Networks for Logistic
     Regression,” Fu, Chi, Liang, arXiv preprint arXiv:1802.06463, 2018.
