ECE 18-898G Neural Networks
Yuejie Chi
Department of Electrical and Computer Engineering
Spring 2018
Outline
The rise of deep neural networks
AlexNet
[Figure: AlexNet architecture, with 5 convolutional layers followed by 3 fully connected layers and a softmax output.]
Faster training with ReLU
ReLU: $y = \max(0, x)$
[Figure: the ReLU activation compared with tanh and sigmoid.]
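As a quick illustration (not from the slides), a minimal numpy sketch comparing the three activations; unlike tanh and sigmoid, ReLU does not saturate for large positive inputs, which is the reason cited for faster training:

```python
import numpy as np

def relu(x):
    # ReLU passes positive inputs through and clips negatives to zero.
    return np.maximum(0.0, x)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

x = np.linspace(-3, 3, 13)
print(relu(x))
print(np.tanh(x))
print(sigmoid(x))
```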
Reduce overfitting
• Dropout
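As an illustration of the idea (a minimal sketch, not the slides' implementation), inverted dropout applied to a vector of activations:

```python
import numpy as np

def dropout(h, p=0.5, rng=np.random.default_rng(0)):
    # Inverted dropout: zero each activation with probability p and
    # rescale the survivors so the expected activation is unchanged.
    mask = rng.random(h.shape) >= p
    return h * mask / (1.0 - p)

print(dropout(np.ones(8)))
```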
Learned hierarchical representations
[Figure: visualizations of hierarchical features learned by a deep network.]
Figure credit: Y. LeCun's slides, with research credit to Zeiler and Fergus, 2013.
single-layer networks (perceptron)
Perceptron
[Figure: a perceptron mapping inputs $x_1, x_2, x_3$ through weights $w_1, w_2, w_3$ to a single output $y$.]
Training of a perceptron
Empirical risk minimization: given training data $\{x_i, y_i\}_{i=1}^n$, find the weight vector $w$:
$$\hat{w} = \arg\min_{w \in \mathbb{R}^d} \frac{1}{n} \sum_{i=1}^n \ell(w; x_i, y_i)$$
Training via (stochastic) gradient descent
$$\hat{w} = \arg\min_{w \in \mathbb{R}^d} \frac{1}{2n} \sum_{i=1}^n \big( y_i - \sigma(w^\top x_i) \big)^2 := \arg\min_{w \in \mathbb{R}^d} \ell_n(w)$$
• Gradient descent:
$$w_{t+1} = w_t - \eta_t \nabla \ell_n(w_t)$$
where $\eta_t$ is the step size or learning rate.
• The gradient can be calculated via the chain rule.
  ◦ Call $\hat{y}_i = \hat{y}_i(w) = \sigma(w^\top x_i)$; then
$$\frac{d}{dw}\, \frac{1}{2}\big(y_i - \hat{y}_i\big)^2 = (\hat{y}_i - y_i)\,\frac{d\hat{y}_i(w)}{dw} = (\hat{y}_i - y_i)\,\sigma'(w^\top x_i)\, x_i = \underbrace{(\hat{y}_i - y_i)\,\hat{y}_i(1 - \hat{y}_i)}_{\text{scalar}}\, x_i,$$
where we used $\sigma'(z) = \sigma(z)(1 - \sigma(z))$. This is called the "delta rule".
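A minimal numpy sketch of this loss, its delta-rule gradient, and the gradient descent loop (illustrative only; the function names and step size are my own choices):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def grad_ln(w, X, y):
    # Gradient of l_n(w) = (1/2n) sum_i (y_i - sigmoid(w^T x_i))^2
    # via the delta rule: (yhat_i - y_i) * yhat_i * (1 - yhat_i) * x_i.
    yhat = sigmoid(X @ w)                   # shape (n,)
    delta = (yhat - y) * yhat * (1 - yhat)  # shape (n,)
    return X.T @ delta / len(y)             # shape (d,)

def gradient_descent(X, y, eta=0.5, iters=1000):
    w = np.zeros(X.shape[1])
    for _ in range(iters):
        w -= eta * grad_ln(w, X, y)         # w_{t+1} = w_t - eta_t * grad l_n(w_t)
    return w
```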
Stochastic gradient descent
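The content of this slide is a figure; as an illustrative companion to the batch version above, here is a minimal sketch of the stochastic variant, which uses the gradient of a single randomly drawn sample per update (same model and notation assumed):

```python
import numpy as np

def sgd(X, y, eta=0.1, epochs=20, rng=np.random.default_rng(0)):
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(epochs):
        for i in rng.permutation(n):        # one sample per update
            yhat = 1.0 / (1.0 + np.exp(-X[i] @ w))
            w -= eta * (yhat - y[i]) * yhat * (1 - yhat) * X[i]
    return w
```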
Detour: Backpropagation
Network output:
$$\hat{y} = \sigma\Big(\sum_m v_m h_m\Big) = \sigma\bigg(\sum_m v_m\, \sigma\Big(\sum_j w_{m,j}\, x_j\Big)\bigg)$$
Loss function: $f = \frac{1}{2}(y - \hat{y})^2$.
Backpropagation I
Optimize the weights for each layer, starting with the layer closest to the outputs and working back to the layer closest to the inputs.
1. To update the $v_m$'s, realize
$$\frac{df}{dv_m} = \frac{df}{d\hat{y}}\,\frac{d\hat{y}}{dv_m} = (\hat{y} - y)\,\frac{d\hat{y}}{dv_m} = (\hat{y} - y)\,\sigma'\Big(\sum_m v_m h_m\Big)\, h_m = \underbrace{(\hat{y} - y)\,\hat{y}(1 - \hat{y})}_{\delta}\, h_m.$$
Backpropagation II
2. To update the $w_{m,j}$'s:
$$\frac{df}{dw_{m,j}} = \frac{df}{d\hat{y}}\,\frac{d\hat{y}}{dh_m}\,\frac{dh_m}{dw_{m,j}} = (\hat{y} - y)\,\hat{y}(1 - \hat{y})\, v_m\, h_m(1 - h_m)\, x_j = \delta\, v_m\, h_m(1 - h_m)\, x_j$$
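A minimal numpy sketch of one backpropagation step for this one-hidden-layer network, combining the two updates above (illustrative; the names and step size are my own choices):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def backprop_step(x, y, W, v, eta=0.1):
    # Forward pass: h_m = sigmoid(sum_j w_{m,j} x_j), yhat = sigmoid(sum_m v_m h_m)
    h = sigmoid(W @ x)
    yhat = sigmoid(v @ h)
    # Output layer: df/dv_m = delta * h_m with delta = (yhat - y) * yhat * (1 - yhat)
    delta = (yhat - y) * yhat * (1 - yhat)
    grad_v = delta * h
    # Hidden layer: df/dw_{m,j} = delta * v_m * h_m * (1 - h_m) * x_j
    grad_W = np.outer(delta * v * h * (1 - h), x)
    return W - eta * grad_W, v - eta * grad_v
```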
Questions we may ask (in general)
Nonconvex landscape of perceptron can be very bad
SGD converges to local minimizers. Are they global?

Theorem 12.1 (Auer et al., 1995)
Let $\sigma(\cdot)$ be the sigmoid and $\ell(\cdot)$ be the quadratic loss function. There exists a sequence of training samples $\{x_i, y_i\}_{i=1}^n$ such that $\ell_n(w)$ has $\lfloor n/d \rfloor^d$ distinct local minima.
Why?
[Figure: individual loss terms plotted over $w \in [-10, 10]$, and their combination, which exhibits multiple local minima.]
Statistical models come to the rescue
Data/measurements follow certain statistical models and hence are not worst-case instances.
$$\text{minimize}_{w}\;\; \ell_n(w) = \frac{1}{n}\sum_{i=1}^n \ell(w; x_i, y_i) \;\overset{n\to\infty}{\Longrightarrow}\; \mathbb{E}[\ell(w; x, y)] := \ell(w)$$
[Figure: two contour plots of the risk over $(\theta_1, \theta_2)$.]
Assume the training data $\{x_i, y_i\}_{i=1}^n$ is drawn i.i.d. from some distribution:
$$(x, y) \sim p(x, y)$$
We are using neural networks to fit $p(x, y)$.
• A planted-truth model (see the sketch below): let $x \sim \mathcal{N}(0, I)$ and draw the label $y$ as
  ◦ regression model: $y_i = \sigma(w_\star^\top x_i)$
  ◦ classification model: $y_i \in \{0, 1\}$, where $\mathbb{P}(y_i = 1) = \sigma(w_\star^\top x_i)$
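A minimal sketch of sampling from this planted-truth model (illustrative; the dimensions and the choice of $w_\star$ are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 1000, 5
w_star = rng.standard_normal(d)           # planted ground-truth weights

X = rng.standard_normal((n, d))           # x_i ~ N(0, I)
p = 1.0 / (1.0 + np.exp(-X @ w_star))     # sigma(w_star^T x_i)

y_reg = p                                 # regression: y_i = sigma(w_star^T x_i)
y_cls = rng.binomial(1, p)                # classification: P(y_i = 1) = sigma(w_star^T x_i)
```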
Roadmap
Step 1: population risk
[Figure: contour plot of the population risk.]
Step 2: uniform convergence of gradients & Hessian
2. Preservation of the Hessian:
$$\mathbb{P}\left( \sup_{\|w\| \le R} \left\| \nabla^2 \ell_n(w) - \nabla^2 \ell(w) \right\| \le \sqrt{\frac{C d \log n}{n}} \right) \ge 1 - \delta.$$
Step 3: establish rate of convergence for ERM
$$\frac{1}{2}\lambda_{\min}\big(\nabla^2 \ell_n(w_0)\big)\, \|\hat{w} - w_\star\|_2^2 \le \frac{1}{2}(\hat{w} - w_\star)^\top \nabla^2 \ell_n(w_0)(\hat{w} - w_\star) \le \big|\langle \nabla \ell_n(w_\star), \hat{w} - w_\star\rangle\big| \le \|\nabla \ell_n(w_\star)\| \cdot \|\hat{w} - w_\star\|$$
$$\Longrightarrow \quad \|\hat{w} - w_\star\|_2 \le \frac{2\|\nabla \ell_n(w_\star)\|}{\lambda_{\min}\big(\nabla^2 \ell_n(w_0)\big)} \lesssim \sqrt{\frac{C d \log n}{n}},$$
where $w_0$ lies between $\hat{w}$ and $w_\star$ (Taylor expansion of $\ell_n$ around $w_\star$), and the second inequality uses $\ell_n(\hat{w}) \le \ell_n(w_\star)$.
two-layer networks
Two-layer linear network
Given arbitrary data $\{x_i, y_i\}_{i=1}^n$, $x_i, y_i \in \mathbb{R}^d$, fit the two-layer linear network with the quadratic loss:
$$f(A, B) = \sum_{i=1}^n \| y_i - A B x_i \|_2^2$$
[Figure: two-layer linear network with input layer $x$, hidden layer $Bx$, and output layer $ABx$.]
In matrix form, with $X = [x_1, \ldots, x_n]$ and $Y = [y_1, \ldots, y_n]$:
$$f(A, B) = \| Y - ABX \|_F^2$$
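A minimal sketch of minimizing $f(A, B) = \|Y - ABX\|_F^2$ by gradient descent (illustrative; the dimensions, initialization, and step size are my own choices):

```python
import numpy as np

rng = np.random.default_rng(0)
d, p, n = 8, 3, 200                     # hidden width p < d
X = rng.standard_normal((d, n))
Y = rng.standard_normal((d, n))

A = 0.1 * rng.standard_normal((d, p))   # second-layer weights
B = 0.1 * rng.standard_normal((p, d))   # first-layer weights

eta = 1e-3
for _ in range(1000):
    R = Y - A @ B @ X                   # residual Y - ABX
    grad_A = -2 * R @ X.T @ B.T         # d f / d A
    grad_B = -2 * A.T @ R @ X.T         # d f / d B
    A -= eta * grad_A
    B -= eta * grad_B

print("final loss:", np.linalg.norm(Y - A @ B @ X) ** 2)
```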
Landscape of 2-layer linear network
$$\Sigma := \Sigma_{YX}\, \Sigma_{XX}^{-1}\, \Sigma_{XY}$$
Critical points of two-layer linear networks
$$AB = P_A\, \Sigma_{YX}\, \Sigma_{XX}^{-1},$$
where $A$ satisfies
$$P_A \Sigma = P_A \Sigma P_A = \Sigma P_A,$$
with $P_A$ the orthogonal projector onto the column space of $A$.
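A numerical sanity check of this characterization, as a sketch under the assumptions $\Sigma_{XX} = XX^\top$, $\Sigma_{YX} = YX^\top$, and $P_A$ as above: take the span of any $p$ eigenvectors of $\Sigma$ for the columns of $A$, set $B$ so that $AB = P_A \Sigma_{YX}\Sigma_{XX}^{-1}$, and check that both gradients of $f(A, B) = \|Y - ABX\|_F^2$ vanish.

```python
import numpy as np

rng = np.random.default_rng(1)
d, p, n = 6, 2, 500
X = rng.standard_normal((d, n))
Y = rng.standard_normal((d, n))

Sxx = X @ X.T
Syx = Y @ X.T
Sigma = Syx @ np.linalg.inv(Sxx) @ Syx.T   # Sigma = S_YX S_XX^{-1} S_XY

_, U = np.linalg.eigh(Sigma)
U_J = U[:, -p:]                            # span of p eigenvectors of Sigma

A = U_J                                    # P_A = U_J U_J^T
B = U_J.T @ Syx @ np.linalg.inv(Sxx)       # so that AB = P_A S_YX S_XX^{-1}

R = Y - A @ B @ X
print(np.linalg.norm(2 * R @ X.T @ B.T))   # ||grad_A f|| ~ 0
print(np.linalg.norm(2 * A.T @ R @ X.T))   # ||grad_B f|| ~ 0
```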
Critical points of two-layer linear networks
$$A = [U_J, \; 0_{d \times (p - r)}]\, C$$

Local strong convexity under the Gaussian model [Fu et al., 2018]
[Figure: one-hidden-layer neural network with input layer $x$, hidden-layer weights $W$, and output $y$.]
Reference I
[9] "The landscape of empirical risk for non-convex losses," S. Mei, Y. Bai, and A. Montanari, arXiv preprint arXiv:1607.06534, 2016.
[10] "Neural networks and principal component analysis: Learning from examples without local minima," P. Baldi and K. Hornik, Neural Networks, 1989.
[11] "Local Geometry of One-Hidden-Layer Neural Networks for Logistic Regression," H. Fu, Y. Chi, and Y. Liang, arXiv preprint arXiv:1802.06463, 2018.