
Deep Learning I

Supervised Learning

Russ Salakhutdinov

Machine Learning Department


Carnegie Mellon University
Canadian Institute for Advanced Research
Mining for Structure
Massive increase in both computational power and the amount of
data available from the web, video cameras, and laboratory measurements.

[Figure: example domains — Images & Video, Text & Language, Speech & Audio,
Gene Expression, Relational Data / Social Networks, Product Recommendation,
fMRI, Tumor regions]

• Develop statistical models that can discover underlying structure, cause, or
statistical correlation from data in an unsupervised or semi-supervised way.
• Multiple application domains.
Mining for Structure
Massive increase in both computational power and the amount of
data available from the web, video cameras, and laboratory measurements.

Deep Learning: models that support inferences and discover
structure at multiple levels.

• Develop statistical models that can discover underlying structure, cause, or
statistical correlation from data in an unsupervised or semi-supervised way.
• Multiple application domains.
Impact of Deep Learning
• Speech Recognition

• Computer Vision

• Recommender Systems

• Language Understanding

• Drug Discovery and Medical Image Analysis
Deep Generative Model
Model P(document). Reuters dataset: 804,414 newswire stories, modeled from
bag-of-words input in an unsupervised way.

[Figure: 2-D embedding of documents; clusters correspond to topics such as
European Community Monetary/Economic, Interbank Markets, Energy Markets,
Disasters and Accidents, Legal/Judicial, Leading Economic Indicators,
Government Borrowings, Accounts/Earnings]

(Hinton & Salakhutdinov, Science 2006)


Example: Understanding Images
TAGS:
strangers, coworkers, conventioneers, attendants, patrons

Nearest Neighbor Sentence:
people taking pictures of a crazy person

Model Samples
• a group of people in a crowded area.
• a group of people are walking and talking.
• a group of people, standing around and talking.

Caption Generation
Talk Roadmap
Part 1: Supervised Learning: Deep Networks
• Definition of Neural Networks
• Training Neural Networks
• Recent Optimization / Regularization Techniques

Part 2: Unsupervised Learning: Learning Deep Generative Models

Part 3: Open Research Questions


Learning Feature Representations
[Figure: raw pixels (pixel 1, pixel 2) fed directly into a learning algorithm;
Segway vs. Non-Segway examples are hard to separate in the input space.]

Learning Feature Representations
[Figure: a feature representation (Handle, Wheel) is extracted first and fed to
the learning algorithm; in the feature space, Segway vs. Non-Segway examples
become separable.]
Traditional Approaches
[Diagram: Data → Feature extraction → Learning algorithm.
Image → vision features → object detection / recognition.
Audio → audio features → audio classification / speaker identification.]
Computer Vision Features
[Examples of hand-designed features: SIFT, HoG, Textons, RIFT, GIST]

Audio Features
[Examples of hand-designed features: Spectrogram, MFCC, Flux, ZCR, Rolloff]

Representation Learning:
Can we automatically learn these representations?
Neural Networks Online Course
• Disclaimer: Some of the material and slides for this lecture were
borrowed from Hugo Larochelle’s class on Neural Networks:
https://sites.google.com/site/deeplearningsummerschool2016/

• Hugo’s class covers many other topics: convolutional networks,
neural language model, Boltzmann machines, autoencoders, sparse
coding, etc.

• We will use his material for some of the other lectures.
Feedforward Neural Networks
‣ Definition of Neural Networks
- Forward propagation
- Types of units
- Capacity of neural networks

‣ How to train neural nets:
- Loss function
- Backpropagation with gradient descent

‣ More recent techniques:
- Dropout
- Batch normalization
- Unsupervised Pre-training
Artificial Neuron
• Neuron pre-activation (or input activation):

  a(x) = b + Σ_i w_i x_i = b + w^T x

• Neuron (output) activation:

  h(x) = g(a(x)) = g(b + Σ_i w_i x_i)

• w are the connection weights
• b is the neuron bias
• g(·) is called the activation function
Artificial Neuron
• Output activation of the neuron:

  h(x) = g(a(x)) = g(b + Σ_i w_i x_i)

• The range of h(x) is determined by g(·)
• The bias b only changes the position (shift) of the activation

(from Pascal Vincent’s slides)
Activation Function
• Sigmoid activation function:

  g(a) = sigm(a) = 1 / (1 + exp(-a))

Ø Squashes the neuron’s output between 0 and 1
Ø Always positive
Ø Bounded
Ø Strictly increasing

• Hyperbolic tangent (tanh) activation function:

  g(a) = tanh(a) = (exp(a) - exp(-a)) / (exp(a) + exp(-a)) = (exp(2a) - 1) / (exp(2a) + 1)
Activation Function
• Rectified linear (ReLU) activation function:

  g(a) = reclin(a) = max(0, a)

Ø Bounded below by 0 (always non-negative)
Ø Not upper bounded
Ø Strictly increasing
Ø Tends to produce units with sparse activities
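The following NumPy sketch (not part of the original slides; the function names are my own) shows what these activation functions compute:

import numpy as np

def sigmoid(a):
    # g(a) = 1 / (1 + exp(-a)): squashes values into (0, 1)
    return 1.0 / (1.0 + np.exp(-a))

def tanh_act(a):
    # g(a) = (exp(a) - exp(-a)) / (exp(a) + exp(-a)): squashes values into (-1, 1)
    return np.tanh(a)

def relu(a):
    # g(a) = max(0, a): non-negative, unbounded above, gives sparse activations
    return np.maximum(0.0, a)

a = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(sigmoid(a), tanh_act(a), relu(a))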
Single Hidden Layer Neural Net
• Hidden layer pre-activation:

  a(x) = b^(1) + W^(1) x        ( a(x)_i = b^(1)_i + Σ_j W^(1)_{i,j} x_j )

• Hidden layer activation:

  h(x) = g(a(x))

• Output layer activation:

  f(x) = o( b^(2) + (w^(2))^T h^(1)(x) )

  where o(·) is the output activation function.
Multilayer Neural Net
• For multi-class classification, the output layer uses the softmax activation
function, which gives an estimate of p(y = c | x):

  o(a) = softmax(a) = [ exp(a_1) / Σ_c exp(a_c), ..., exp(a_C) / Σ_c exp(a_c) ]^T

• Consider a network with L hidden layers.

- layer pre-activation for k > 0 (with h^(0)(x) = x):

  a^(k)(x) = b^(k) + W^(k) h^(k-1)(x)

- hidden layer activation (k from 1 to L):

  h^(k)(x) = g(a^(k)(x))

- output layer activation (k = L+1):

  h^(L+1)(x) = o(a^(L+1)(x)) = f(x)
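A minimal NumPy sketch of this forward-propagation recursion with a softmax output layer; the layer sizes, helper names, and random initialization below are illustrative assumptions, not taken from the slides:

import numpy as np

def softmax(a):
    # o(a)_c = exp(a_c) / sum_c' exp(a_c'); subtract the max for numerical stability
    e = np.exp(a - np.max(a))
    return e / e.sum()

def forward(x, weights, biases, g=np.tanh):
    # weights/biases hold W^(1..L+1), b^(1..L+1); h^(0)(x) = x
    h = x
    for W, b in zip(weights[:-1], biases[:-1]):
        a = b + W @ h            # pre-activation a^(k)(x)
        h = g(a)                 # hidden activation h^(k)(x)
    return softmax(biases[-1] + weights[-1] @ h)   # f(x) = o(a^(L+1)(x))

rng = np.random.default_rng(0)
sizes = [4, 5, 5, 3]             # input, two hidden layers, 3 output classes
weights = [rng.normal(0, 0.1, (m, n)) for n, m in zip(sizes[:-1], sizes[1:])]
biases = [np.zeros(m) for m in sizes[1:]]
print(forward(rng.normal(size=4), weights, biases))   # a probability vector (sums to 1)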
Capacity of Neural Nets
• Consider a single layer neural network

[Figure (from Pascal Vincent’s slides): a small network with inputs x1, x2, two
hidden units y1, y2 (plus bias units), and one output z, together with the
decision surface it computes; the output is z = +1 in region R1 and z = -1 in
region R2.]
Capacity of Neural Nets
• Consider a single layer neural network

(from Pascal Vincent’s slides)


Universal Approximation
• Universal Approximation Theorem (Hornik, 1991):

- “a single hidden layer neural network with a linear output
unit can approximate any continuous function arbitrarily well,
given enough hidden units’’

• This applies for sigmoid, tanh, and many other activation functions.

• However, this does not mean that there is a learning algorithm that
can find the necessary parameter values.
Feedforward Neural Networks
‣ How neural networks predict f(x) given an input x:
- Forward propagation
- Types of units
- Capacity of neural networks

‣ How to train neural nets:
- Loss function
- Backpropagation with gradient descent

‣ More recent techniques:
- Dropout
- Batch normalization
- Unsupervised Pre-training
Training
• Training set: D^train = {(x^(t), y^(t))}
• Model: f(x; θ); validation set D^valid, test set D^test

• Empirical Risk Minimization:

  arg min_θ  (1/T) Σ_t l(f(x^(t); θ), y^(t)) + λ Ω(θ)

  where l(·,·) is the loss function and Ω(θ) is the regularizer.

• Learning is cast as optimization.

Ø For classification problems, we would like to minimize classification error.

Ø The loss function can sometimes be viewed as a surrogate for what we actually
want to optimize (e.g. an upper bound).
Stochastic Gradient Descent
• Perform updates after seeing each example:

- Initialize: θ ≡ {W^(1), b^(1), ..., W^(L+1), b^(L+1)}
- For t = 1 : T
  - for each training example (x^(t), y^(t)):

    ∆ = -∇_θ l(f(x^(t); θ), y^(t)) - λ ∇_θ Ω(θ)
    θ ← θ + α ∆

• Training epoch = iteration over all examples.

• To train a neural net, we need:

Ø A loss function: l(f(x^(t); θ), y^(t))
  - for classification, with f(x)_c = p(y = c | x), the cross-entropy loss:
    l(f(x), y) = -Σ_c 1_(y=c) log f(x)_c = -log f(x)_y

Ø A procedure to compute gradients: ∇_θ l(f(x^(t); θ), y^(t))

Ø The regularizer and its gradient: Ω(θ), ∇_θ Ω(θ)
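To make the recipe concrete, here is a small NumPy sketch (my own illustration, not from the slides) of SGD with backpropagation for a single-hidden-layer network with a tanh hidden layer, softmax output, cross-entropy loss, and an L2 regularizer; all names and hyper-parameter values are arbitrary choices:

import numpy as np

def softmax(a):
    e = np.exp(a - np.max(a))
    return e / e.sum()

def sgd_train(X, y, n_hidden=10, n_classes=2, alpha=0.1, lam=1e-4, n_epochs=20, seed=0):
    # Single-hidden-layer net: tanh hidden units, softmax output,
    # cross-entropy loss l = -log f(x)_y, L2 regularizer on the weights.
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    W1 = rng.normal(0, 0.1, (n_hidden, d)); b1 = np.zeros(n_hidden)
    W2 = rng.normal(0, 0.1, (n_classes, n_hidden)); b2 = np.zeros(n_classes)
    for epoch in range(n_epochs):              # one epoch = one pass over the data
        for x, t in zip(X, y):
            # forward propagation
            h1 = np.tanh(b1 + W1 @ x)
            f = softmax(b2 + W2 @ h1)          # f(x)_c = p(y = c | x)
            # backpropagation of the loss -log f(x)_t
            grad_a2 = f.copy(); grad_a2[t] -= 1.0          # gradient w.r.t. output pre-activation
            grad_W2 = np.outer(grad_a2, h1) + lam * W2
            grad_a1 = (W2.T @ grad_a2) * (1.0 - h1 ** 2)   # tanh'(a) = 1 - tanh(a)^2
            grad_W1 = np.outer(grad_a1, x) + lam * W1
            # SGD update: theta <- theta - alpha * gradient
            W2 -= alpha * grad_W2; b2 -= alpha * grad_a2
            W1 -= alpha * grad_W1; b1 -= alpha * grad_a1
    return W1, b1, W2, b2

# usage on a toy dataset (XOR-like labels)
X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
y = np.array([0, 1, 1, 0])
W1, b1, W2, b2 = sgd_train(X, y, n_hidden=8, alpha=0.5, n_epochs=2000)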
Computational Flow Graph
• Forward propagation can be represented as an acyclic flow graph

• Forward propagation can be implemented in a modular way:

Ø Each box can be an object with an fprop method, that computes
the value of the box given its children

Ø Calling the fprop method of each box in the right order yields
forward propagation

Computational Flow Graph
• Each object also has a bprop method

- it computes the gradient of the loss with respect to each child box.

• By calling bprop in the reverse order, we obtain backpropagation
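A minimal sketch of this modular idea (the class names Linear and Sigmoid are hypothetical, not from the slides): each box caches its input during fprop and returns the gradient with respect to its child in bprop:

import numpy as np

class Linear:
    # box computing a = W x + b
    def __init__(self, W, b):
        self.W, self.b = W, b
    def fprop(self, x):
        self.x = x                        # cache the child value for bprop
        return self.W @ x + self.b
    def bprop(self, grad_out):
        # gradients w.r.t. the parameters and w.r.t. the child x
        self.grad_W = np.outer(grad_out, self.x)
        self.grad_b = grad_out
        return self.W.T @ grad_out

class Sigmoid:
    def fprop(self, a):
        self.h = 1.0 / (1.0 + np.exp(-a))
        return self.h
    def bprop(self, grad_out):
        return grad_out * self.h * (1.0 - self.h)   # sigm'(a) = sigm(a)(1 - sigm(a))

# forward: call fprop in order; backward: call bprop in reverse order
rng = np.random.default_rng(0)
boxes = [Linear(rng.normal(size=(3, 4)), np.zeros(3)), Sigmoid()]
out = np.array([1.0, 2.0, 3.0, 4.0])
for box in boxes:
    out = box.fprop(out)
grad = np.ones_like(out)                  # pretend the loss gradient w.r.t. the output is 1
for box in reversed(boxes):
    grad = box.bprop(grad)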
Model Selection
• Training Protocol:

- Train your model on the Training Set D^train = {(x^(t), y^(t))}

- For model selection, use the Validation Set D^valid

Ø Hyper-parameter search: hidden layer size, learning rate,
number of iterations/epochs, etc.

- Estimate generalization performance using the Test Set D^test

• Generalization is the behavior of the model on unseen examples.
Early Stopping
• To select the number of epochs, stop training when validation set
error increases (with some look ahead).
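A sketch of this rule using a "patience" (look-ahead) counter; train_one_epoch and valid_error stand for user-supplied routines and are assumptions, not part of the slides:

def train_with_early_stopping(train_one_epoch, valid_error, max_epochs=200, patience=10):
    # Stop once validation error has not improved for `patience` epochs,
    # and keep the parameters from the best epoch seen so far.
    best_err, best_params, epochs_since_best = float("inf"), None, 0
    for epoch in range(max_epochs):
        params = train_one_epoch()             # one pass of SGD over the training set
        err = valid_error(params)              # error on the validation set
        if err < best_err:
            best_err, best_params, epochs_since_best = err, params, 0
        else:
            epochs_since_best += 1
            if epochs_since_best >= patience:  # validation error kept increasing
                break
    return best_params, best_err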


Mini-batch, Momentum
• Make updates based on a mini-batch of examples (instead of a single example):

Ø the gradient is the average regularized loss for that mini-batch
Ø can give a more accurate estimate of the gradient
Ø can leverage matrix/matrix operations, which are more efficient

• Use a decaying learning rate α_t (e.g. α_t = α / (1 + δt)); convergence
requires Σ_t α_t = ∞ and Σ_t α_t² < ∞.

• Momentum: can use an exponential average of previous gradients:

  ∇̄_θ^(t) = ∇_θ l(f(x^(t)), y^(t)) + β ∇̄_θ^(t-1)

Ø can get past plateaus more quickly, by ‘‘gaining momentum’’
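A sketch of the mini-batch update with momentum; grad_fn is an assumed user-supplied function that returns the average regularized gradient over a mini-batch, and the hyper-parameter values are illustrative:

import numpy as np

def minibatch_sgd_momentum(theta, grad_fn, data, alpha=0.01, beta=0.9,
                           batch_size=32, n_epochs=10, seed=0):
    rng = np.random.default_rng(seed)
    velocity = np.zeros_like(theta)             # exponential average of past gradients
    n = len(data)
    for epoch in range(n_epochs):
        order = rng.permutation(n)              # shuffle examples each epoch
        for start in range(0, n, batch_size):
            batch = [data[i] for i in order[start:start + batch_size]]
            grad = grad_fn(theta, batch)        # average regularized gradient on the mini-batch
            velocity = grad + beta * velocity   # momentum accumulates previous gradients
            theta = theta - alpha * velocity
    return theta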
Feedforward Neural Networks
‣ How neural networks predict f(x) given an input x:
- Forward propagation
- Types of units
- Capacity of neural networks

‣ How to train neural nets:
- Loss function
- Backpropagation with gradient descent

‣ More recent techniques:
- Dropout
- Batch normalization
- Unsupervised Pre-training
Learning Distributed Representations
• Deep learning is research on learning models with multilayer representations

Ø multilayer (feed-forward) neural networks
Ø multilayer graphical models (deep belief network, deep Boltzmann machine)

• Each layer learns a ‘‘distributed representation’’

Ø Units in a layer are not mutually exclusive
• each unit is a separate feature of the input
• two units can be ‘‘active’’ at the same time
Ø Units do not correspond to a partitioning (clustering) of the inputs
• in clustering, an input can only belong to a single cluster
Local vs. Distributed Representations
• Local methods: Clustering, Nearest Neighbors, RBF SVM, local density estimators

Ø Parameters for each local region.
Ø # of regions is linear with # of parameters.

• Distributed methods: RBMs, Factor models, PCA, Sparse Coding, Deep models

Ø Each parameter affects many regions, not just local.
Ø # of regions grows (roughly) exponentially in # of parameters.

[Figure: learned prototypes split the input space into local regions C1, C2, C3;
a distributed representation instead uses binary features C1, C2, C3 whose
combinations (e.g. C1=1, C2=1, C3=0) define exponentially many regions.]

(Bengio, 2009, Foundations and Trends in Machine Learning)
Inspiration from Visual Cortex
Why Training is Hard
• First hypothesis: Hard optimization problem (underfitting)

Ø vanishing gradient problem
Ø saturated units block gradient propagation

• This is a well-known problem in recurrent neural networks
Why Training is Hard
• Second hypothesis: Overfitting

Ø we are exploring a space of complex functions


Ø deep nets usually have lots of parameters

• Might be in a high variance / low bias situation


Why Training is Hard
• First hypothesis (underfitting): better optimize

Ø Use better optimization tools (e.g. batch-normalization, second-order
methods such as KFAC)
Ø Use GPUs, distributed computing.

• Second hypothesis (overfitting): use better regularization

Ø Unsupervised pre-training
Ø Stochastic drop-out training

• For many large-scale practical problems, you will need to use both:
better optimization and better regularization!
Unsupervised Pre-training
• Initialize hidden layers using unsupervised learning

Ø Force network to represent latent structure of input distribution

Ø Encourage hidden layers to encode that structure


Unsupervised Pre-training
• Initialize hidden layers using unsupervised learning

Ø This is a harder task than supervised learning (classification)

Ø Hence we expect less overfitting


Autoencoders: Preview
• Feed-forward neural network trained to reproduce its input at the output layer

• Encoder:

  h(x) = g(a(x)) = sigm(b + W x)

• Decoder:

  x̂ = o(â(x)) = sigm(c + W* h(x))

Autoencoders: Preview
• Loss function for binary inputs:

  l(f(x)) = - Σ_k ( x_k log(x̂_k) + (1 - x_k) log(1 - x̂_k) )

Ø cross-entropy error function

• Loss function for real-valued inputs:

  l(f(x)) = (1/2) Σ_k (x̂_k - x_k)²

Ø sum of squared differences
Ø we use a linear activation function at the output
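A minimal NumPy sketch of such an autoencoder for binary inputs, with a sigmoid encoder and decoder and the cross-entropy loss; tying the decoder weights to the encoder (W* = W^T) is an extra assumption made here for brevity:

import numpy as np

def sigm(a):
    return 1.0 / (1.0 + np.exp(-a))

def autoencoder_step(x, W, b, c, alpha=0.1):
    # Encoder: h(x) = sigm(b + W x); decoder with tied weights: x_hat = sigm(c + W.T h)
    h = sigm(b + W @ x)
    x_hat = sigm(c + W.T @ h)
    loss = -np.sum(x * np.log(x_hat) + (1 - x) * np.log(1 - x_hat))
    # Gradients of the cross-entropy reconstruction loss
    grad_a_hat = x_hat - x                          # w.r.t. decoder pre-activation
    grad_a = (W @ grad_a_hat) * h * (1 - h)         # w.r.t. encoder pre-activation
    grad_W = np.outer(grad_a, x) + np.outer(h, grad_a_hat)   # two paths through the tied W
    W -= alpha * grad_W                             # in-place SGD update
    b -= alpha * grad_a
    c -= alpha * grad_a_hat
    return loss

# usage: x is a binary vector; W has shape (n_hidden, n_visible)
rng = np.random.default_rng(0)
n_vis, n_hid = 8, 4
W = rng.normal(0, 0.1, (n_hid, n_vis))
b, c = np.zeros(n_hid), np.zeros(n_vis)
x = (rng.random(n_vis) < 0.5).astype(float)
for _ in range(100):
    loss = autoencoder_step(x, W, b, c)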
Pre-training
• We will use a greedy, layer-wise procedure

Ø Train one layer at a time with unsupervised criterion


Ø Fix the parameters of previous hidden layers
Ø Previous layers can be viewed as feature extraction
Fine-tuning
• Once all layers are pre-trained
Ø add output layer
Ø train the whole network using
supervised learning

• We call this last phase fine-tuning


Ø all parameters are ‘‘tuned’’ for the
supervised task at hand
Ø representation is adjusted to be more
discriminative
Why Training is Hard
• First hypothesis (underfitting): better optimize

Ø Use better optimization tools (e.g. batch-normalization, second-order
methods such as KFAC)
Ø Use GPUs, distributed computing.

• Second hypothesis (overfitting): use better regularization

Ø Unsupervised pre-training
Ø Stochastic drop-out training

• For many large-scale practical problems, you will need to use both:
better optimization and better regularization!
Dropout
• Key idea: Cripple neural network by removing hidden units stochastically

Ø each hidden unit is set to 0 with probability 0.5

Ø hidden units cannot co-adapt to other units

Ø hidden units must be more generally useful

• Could use a different dropout probability, but 0.5 usually works well
Dropout
• Use random binary masks m^(k)

Ø hidden layer pre-activation for k > 0 (with h^(0)(x) = x):

  a^(k)(x) = b^(k) + W^(k) h^(k-1)(x)

Ø hidden layer activation (k from 1 to L) is multiplied by the mask:

  h^(k)(x) = g(a^(k)(x)) ⊙ m^(k)

Ø output layer activation (k = L+1):

  h^(L+1)(x) = o(a^(L+1)(x)) = f(x)


Dropout at Test Time
• At test time, we replace the masks by their expectation

Ø This is simply the constant vector 0.5 if the dropout probability is 0.5

Ø For a single hidden layer: equivalent to taking the geometric average of all
neural networks, with all possible binary masks

• Can be combined with unsupervised pre-training

• Beats regular backpropagation on many datasets

• Ensemble: Can be viewed as a geometric average of an exponential number
of networks.
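A sketch of dropout applied to one hidden layer, at training time (random binary mask) and at test time (mask replaced by its expectation); this follows the expectation-scaling scheme described above rather than the "inverted dropout" variant:

import numpy as np

def dropout_hidden(h, p_keep=0.5, train=True, rng=None):
    # h: hidden layer activations h^(k)(x)
    rng = np.random.default_rng() if rng is None else rng
    if train:
        m = (rng.random(h.shape) < p_keep).astype(h.dtype)  # random binary mask m^(k)
        return h * m                 # each unit is zeroed with probability 1 - p_keep
    else:
        return h * p_keep            # test time: multiply by the expectation of the mask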
Why Training is Hard
• First hypothesis (underfitting): better optimize

Ø Use better optimization tools (e.g. batch-normalization, second-order
methods such as KFAC)
Ø Use GPUs, distributed computing.

• Second hypothesis (overfitting): use better regularization

Ø Unsupervised pre-training
Ø Stochastic drop-out training

• For many large-scale practical problems, you will need to use both:
better optimization and better regularization!
Batch Normalization
• Normalizing the inputs will speed up training (LeCun et al., 1998)

Ø could normalization be useful at the level of the hidden layers?

• Batch normalization is an attempt to do that (Ioffe and Szegedy, 2015)

Ø each unit’s pre-activation is normalized (mean subtraction, stddev division)
Ø during training, the mean and stddev are computed for each minibatch
Ø backpropagation takes the normalization into account
Ø at test time, the global mean / stddev is used
Batch Normalization
• Each pre-activation is normalized using mini-batch statistics, followed by a
learned linear transformation (γ and β are trained) that lets the network adapt
to the non-linear activation function:

  μ_B ← (1/m) Σ_{i=1..m} x_i                // mini-batch mean
  σ_B² ← (1/m) Σ_{i=1..m} (x_i - μ_B)²      // mini-batch variance
  x̂_i ← (x_i - μ_B) / sqrt(σ_B² + ε)        // normalize
  y_i ← γ x̂_i + β ≡ BN_{γ,β}(x_i)           // scale and shift

(Algorithm 1 of Ioffe & Szegedy: Batch Normalizing Transform, applied to
activation x over a mini-batch.)

• Why normalize the pre-activation?

Ø can help keep the pre-activation in a non-saturating regime
(though the learned linear transform could cancel this effect)

• Use the global mean and stddev at test time.

Ø removes the stochasticity of the mean and stddev
Ø requires a final phase where, from the first to the last hidden layer:
  • propagate all training data to that layer
  • compute and store the global mean and stddev of each unit
Ø for early stopping, could use a running average
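A NumPy sketch of the batch-normalizing transform for a mini-batch of pre-activations (rows are examples, columns are units); the running-average handling of the global statistics is one common choice, written out here as an illustration rather than the paper's exact procedure:

import numpy as np

def batch_norm(A, gamma, beta, running_mean, running_var,
               train=True, eps=1e-5, momentum=0.9):
    # A: (batch_size, n_units) matrix of pre-activations for one layer
    if train:
        mu = A.mean(axis=0)                      # mini-batch mean
        var = A.var(axis=0)                      # mini-batch variance
        # keep running estimates of the global statistics for test time
        running_mean[:] = momentum * running_mean + (1 - momentum) * mu
        running_var[:] = momentum * running_var + (1 - momentum) * var
    else:
        mu, var = running_mean, running_var      # global mean/stddev at test time
    A_hat = (A - mu) / np.sqrt(var + eps)        # normalize
    return gamma * A_hat + beta                  # learned scale and shift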


Optimization Tricks
• SGD with momentum, batch-normalization, and dropout usually works very well

• Pick the learning rate by running on a subset of the data

Ø Start with a large learning rate & divide by 2 until the loss does not diverge
Ø Decay the learning rate by a factor of ~100 or more by the end of training

• Use the ReLU nonlinearity

• Initialize parameters so that each feature across layers has similar
variance. Avoid units in saturation.

[From Marc'Aurelio Ranzato, CVPR 2014 tutorial]
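One common way to follow the last point is to scale each layer's initial weights by its fan-in and fan-out (Glorot & Bengio-style initialization); the sketch below illustrates that idea, and is not a rule prescribed by this slide:

import numpy as np

def init_layers(sizes, seed=0):
    # sizes = [n_input, n_hidden_1, ..., n_output]
    rng = np.random.default_rng(seed)
    weights, biases = [], []
    for n_in, n_out in zip(sizes[:-1], sizes[1:]):
        bound = np.sqrt(6.0 / (n_in + n_out))          # keeps activation variance roughly constant
        weights.append(rng.uniform(-bound, bound, size=(n_out, n_in)))
        biases.append(np.zeros(n_out))                 # zero biases: start away from saturation
    return weights, biases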


Visualization
• Check gradients numerically by finite differences:

  ∂f(x)/∂x ≈ (f(x + ε) - f(x - ε)) / (2ε)

• Visualize features (features need to be uncorrelated) and have high variance

• Good training: hidden units are sparse across samples

[From Marc'Aurelio Ranzato, CVPR 2014 tutorial]
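A sketch of the finite-difference check: compare the analytic gradient with (l(θ + ε e_i) - l(θ - ε e_i)) / (2ε) for each parameter; loss_fn and grad_fn are assumed user-supplied functions:

import numpy as np

def check_gradient(loss_fn, grad_fn, theta, eps=1e-5):
    # Compare analytic gradients with central finite differences.
    analytic = grad_fn(theta)
    numeric = np.zeros_like(theta)
    for i in range(theta.size):
        e = np.zeros_like(theta); e[i] = eps
        numeric[i] = (loss_fn(theta + e) - loss_fn(theta - e)) / (2 * eps)
    rel_err = np.abs(analytic - numeric) / np.maximum(1e-8, np.abs(analytic) + np.abs(numeric))
    return rel_err.max()        # should be small (e.g. well below 1e-4 in float64)

# example on a simple quadratic loss
theta0 = np.array([1.0, -2.0, 0.5])
print(check_gradient(lambda t: 0.5 * np.sum(t ** 2), lambda t: t, theta0))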


Visualization
• Check gradients numerically by finite differences

• Visualize features (features need to be uncorrelated) and have high variance

• Visualize parameters: learned features should exhibit structure and should
be uncorrelated
Visualization
• Check gradients numerically by finite differences

• Visualize features (features need to be uncorrelated) and have high variance

• Bad training: many hidden units ignore the input and/or exhibit strong
correlations
Computer Vision
• Design algorithms that can process visual data to accomplish a given task:

Ø For example, object recognition: Given an input image, identify which object
it contains
Deep Convolutional Nets
[Figure: a very deep network maps the input, through a high-level feature
space, to a prediction.]

• Convolution
• Pooling
• Normalization
• Densely connected

Deep Convolutional Nets
[Figure: a convolution stage followed by a pooling stage.]
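A minimal, loop-based NumPy sketch of the two basic operations ("valid" 2-D cross-correlation and non-overlapping max pooling), purely to show what they compute; real implementations are heavily optimized:

import numpy as np

def conv2d(image, kernel):
    # "valid" cross-correlation of a 2-D image with a 2-D kernel
    H, W = image.shape
    kh, kw = kernel.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

def max_pool(feature_map, size=2):
    # non-overlapping max pooling over size x size windows
    H, W = feature_map.shape
    H2, W2 = H // size, W // size
    pooled = feature_map[:H2 * size, :W2 * size].reshape(H2, size, W2, size)
    return pooled.max(axis=(1, 3))

img = np.arange(36, dtype=float).reshape(6, 6)
edge = np.array([[-1.0, 1.0]])               # a simple horizontal edge filter
print(max_pool(np.maximum(0, conv2d(img, edge))))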
ConvNets: Examples
• Optical Character Recognition, House Number and Traffic Sign
classification
ConvNets: Examples
• Pedestrian detection

(Sermanet et al., Pedestrian detection with unsupervised multi-stage, CVPR 2013)


ConvNets: Examples
• Object Detection

Sermanet et al., OverFeat: Integrated recognition, localization, 2013


Girshick et al., Rich feature hierarchies for accurate object detection, 2013
Szegedy et al., DNN for object detection, NIPS 2013
ImageNet Dataset
• 1.2 million images, 1000 classes

Examples of Hammer

(Deng et al., Imagenet: a large scale hierarchical image database, CVPR 2009)
Important Breakthrough
• Deep Convolutional Nets for Vision (Supervised)

Krizhevsky, A., Sutskever, I. and Hinton, G. E., ImageNet Classification with Deep
Convolutional Neural Networks, NIPS, 2012.

1.2 million training images
1000 classes
Architecture
• How can we select the right architecture:

Ø Manual tuning of features is now replaced with the manual tuning of
architectures

• Depth
• Width
• Parameter count
How to Choose Architecture
• Many hyper-parameters:
Ø Number of layers, number of feature maps

• Cross Validation

• Grid Search (need lots of GPUs)

• Smarter Strategies

Ø Random search
Ø Bayesian Optimization
AlexNet
• 8 layers total

• Trained on the ImageNet dataset [Deng et al. CVPR’09]

• 18.2% top-5 error

[Architecture, bottom to top: Input Image → Layer 1: Conv + Pool → Layer 2:
Conv + Pool → Layer 3: Conv → Layer 4: Conv → Layer 5: Conv + Pool →
Layer 6: Full → Layer 7: Full → Softmax Output]

[From Rob Fergus’ CIFAR 2016 tutorial]


AlexNet
• Remove the top fully connected layer 7

• Drop ~16 million parameters

• Only 1.1% drop in performance!

[Remaining architecture: Input Image → Layer 1: Conv + Pool → Layer 2: Conv +
Pool → Layer 3: Conv → Layer 4: Conv → Layer 5: Conv + Pool → Layer 6: Full →
Softmax Output]

[From Rob Fergus’ CIFAR 2016 tutorial]


AlexNet
• Let us remove the upper feature extractor layers and fully connected layers:

Ø Layers 3, 4, 6 and 7

• Drop ~50 million parameters

• 33.5% drop in performance!

• Depth of the network is the key.

[Figure: the reduced architecture, keeping only the lower Conv + Pool layers
plus the softmax output]

[From Rob Fergus’ CIFAR 2016 tutorial]


GoogLeNet
• 24-layer model that uses the so-called inception module.

[Figure: the full network; layers are colored by type — Convolution, Pooling,
Softmax, Other.]

(Szegedy et al., Going Deeper with Convolutions, 2014)


GoogLeNet
• GoogLeNet inception module:

Ø Multiple filter scales at each layer

Ø Dimensionality reduction to keep computational requirements down

[Figure 2 of Szegedy et al.: the inception module, naïve version (a) and with
dimension reductions (b); parallel 1x1, 3x3, and 5x5 convolutions (plus pooling)
whose outputs are concatenated.]

(Szegedy et al., Going Deeper with Convolutions, 2014)

GoogLeNet
• Width of inception modules ranges from 256 filters (in early modules) to
1024 in top inception modules.
• Can remove the fully connected layers on top completely
• Number of parameters is reduced to 5 million
• 6.7% top-5 validation error on ImageNet

(Szegedy et al., Going Deeper with Convolutions, 2014)


Residual Networks
(He, Zhang, Ren, Sun, Deep Residual Learning for Image Recognition, CVPR 2016)

• Really, really deep convnets do not train well.

Ø E.g. on CIFAR-10, a 56-layer “plain” network has higher training error (and
thus higher test error) than a 20-layer one.

• Key idea: introduce a “pass-through” (identity shortcut) into each building
block, so that only the residual F(x) needs to be learned; the block outputs
F(x) + x.

[Figures: training/test error curves of 20-layer vs. 56-layer plain networks on
CIFAR-10; a residual building block (two 3x3 conv layers plus an identity
shortcut) and the “bottleneck” block (1x1, 3x3, 1x1 convolutions) used for
ResNet-50/101/152; example ImageNet architectures: VGG-19, a 34-layer plain
network, and a 34-layer residual network.]

• Parameter-free identity shortcuts help with training, and deeper residual
networks keep improving:

  model        top-1 err.   top-5 err.
  VGG-16       28.07        9.33
  GoogLeNet    -            9.15
  PReLU-net    24.27        7.38
  plain-34     28.54        10.02
  ResNet-34 B  24.52        7.46
  ResNet-50    22.85        6.71
  ResNet-101   21.75        6.05
  ResNet-152   21.43        5.71
  (Error rates (%), 10-crop testing, ImageNet validation)

• With ensembling: 3.57% top-5 error on the ImageNet test set, winning 1st
place in the ILSVRC 2015 classification competition.
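To make the pass-through idea concrete, here is a NumPy sketch of a residual block in its fully-connected form (the paper uses convolutional layers); the sizes and initialization are illustrative:

import numpy as np

def relu(a):
    return np.maximum(0.0, a)

def residual_block(x, W1, b1, W2, b2):
    # F(x): two weight layers with a ReLU in between (fully-connected analogue
    # of the paper's two 3x3 conv layers)
    Fx = W2 @ relu(W1 @ x + b1) + b2
    return relu(Fx + x)            # identity shortcut: only the residual has to be learned

rng = np.random.default_rng(0)
d = 8
W1, W2 = rng.normal(0, 0.1, (d, d)), rng.normal(0, 0.1, (d, d))
b1, b2 = np.zeros(d), np.zeros(d)
h = rng.normal(size=d)
for _ in range(20):                # stack many residual blocks
    h = residual_block(h, W1, b1, W2, b2)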
Choosing the Architecture
• Task dependent

• Cross-validation

• [Convolution → pooling]* + fully connected layer

• The more data: the more layers and the more kernels
Ø Look at the number of parameters at each layer
Ø Look at the number of flops at each layer

• Computational resources

[From Marc'Aurelio Ranzato, CVPR 2014 tutorial]


End of Part 1
