Overview of Machine Learning and Pattern Recognition
Neural Networks
A NN is a machine learning approach inspired by the way in which
the brain performs a particular learning task
Network of simple computational units (neurons) connected
by links (synapses)
Knowledge about the learning task is given in the form of training examples (training vectors).
Inter-neuron connection strengths (weights) are used to store the information acquired from the
training examples.
During the learning process the weights are modified in order to
model the particular learning task correctly on the training examples.
NN - Learning
Supervised Learning
Pattern recognition, regression, etc.
Labeled training examples (input + desired output)
Neural Network models: perceptron, feed-forward, etc.
Unsupervised Learning
Clustering
Unlabeled training examples (different realizations of the input alone)
Neural Network models: self organizing maps (SOMs), Hopfield networks, etc.
NN - architectures
Three different classes of network architectures
– single-layer feed-forward
– multi-layer feed-forward
– recurrent
In the two feed-forward classes, neurons are organized in acyclic layers (links have only one direction).
A standard architecture consists of
Input units
represent the input as a fixed-length vector of numbers (user-defined)
Hidden units (optional)
thresholded weighted sums of the inputs
represent intermediate calculations that the network learns
Output units
represent the output as a fixed-length vector of numbers
The architecture of a neural network is linked with the learning algorithm used to train it.
Feed-forward single layer (perceptron)
[Figure: the input enters an input layer of source nodes, which feeds an output layer of neurons that produces the output.]
Feed-forward multi layer
[Figure: 3-4-2 network in which the input passes through an input layer, a hidden layer, and an output layer to produce the output.]
The neuron
The neuron is the basic information processing unit of a NN. It consists of:
A set of synapses or connecting links, each link characterized by a numeric
weight: $w_1, w_2, \ldots, w_m$
An adder function (linear combiner) which computes the weighted sum of the inputs:
$u = \sum_{j=1}^{m} w_j x_j$
Activation function (squashing function) for limiting the amplitude of the
output of the neuron.
$y = \varphi(u + b)$
The neuron
Bias is an external parameter of the neuron. It can be modeled as an extra input.
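As a minimal sketch of the neuron model (the names `neuron_output` and `sigmoid` and all numeric values are illustrative assumptions, not taken from the slides), the weighted sum, bias, and activation can be written as follows. The second call models the bias as an extra input fixed at 1 whose weight equals b.

```python
import math

def sigmoid(v):
    """Logistic squashing function, used here as one possible choice of phi."""
    return 1.0 / (1.0 + math.exp(-v))

def neuron_output(weights, inputs, bias, phi=sigmoid):
    """Compute y = phi(u + b), where u is the weighted sum of the inputs."""
    u = sum(w * x for w, x in zip(weights, inputs))  # adder (linear combiner)
    return phi(u + bias)

# Illustrative values (assumed for this example only).
w = [0.5, -0.25, 0.1]
x = [1.0, 2.0, 3.0]
b = 0.2

y_explicit_bias = neuron_output(w, x, b)
# Bias modeled as an extra input x0 = 1 with weight w0 = b.
y_bias_as_input = neuron_output([b] + w, [1.0] + x, 0.0)
```

Both calls give the same output up to floating-point rounding, which is why the bias can be treated as just another weight during learning.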
The neuron
Typical activation functions:
Step function
Sign function
Sigmoid function
Emulates the typical response of a biological neuron: the neuron “fires”
only if the input signal is above the threshold potential
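The three activation functions listed above can be sketched as below. The behaviour exactly at the threshold (here, v = 0 counts as "firing") and the `threshold` and `slope` parameters are assumptions of this example; conventions vary between textbooks.

```python
import math

def step(v, threshold=0.0):
    """Step function: 1 if v reaches the threshold, else 0."""
    return 1.0 if v >= threshold else 0.0

def sign(v):
    """Sign function: +1 for non-negative v, -1 otherwise (one common convention)."""
    return 1.0 if v >= 0 else -1.0

def sigmoid(v, slope=1.0):
    """Sigmoid (logistic) function: a smooth, differentiable version of the step."""
    return 1.0 / (1.0 + math.exp(-slope * v))

# Print each function's value below, at, and above the threshold.
for v in (-2.0, 0.0, 2.0):
    print(v, step(v), sign(v), round(sigmoid(v), 3))
```

The sigmoid is the usual choice when the learning rule needs the derivative of the activation function, since the step and sign functions have zero derivative almost everywhere.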
Designing a Neural Network
Various types of neurons
Various network architectures
Various learning algorithms
Various applications
Feed-forward single layer NN
• The simplest architecture is the perceptron network
– No hidden units: synonym for a single-layer, feed-forward network
– It can be used for multi-class problems: each neuron learns to recognize one
particular class (i.e., output 1 if the input is in that class, and 0 otherwise)
– It is a linear classifier: it can only classify linearly separable instances
[Figure: perceptron network]
Perceptron: learning rule
The teacher (or target) specifies the desired output for a given input
Network calculates the output based on its current weights (random
initialization). Then it iteratively changes its weights in proportion to the error
between the desired output & calculated output:
For a neuron j:
$w_{j,i}(t+1) = w_{j,i}(t) + \Delta w_{j,i}$
$\Delta w_{j,i} = \eta \, (t_j - y_j) \, \varphi'(net_j) \, x_i$ (delta rule), where:
– $\eta$ is the learning rate;
– $t_j - y_j$ is the error term;
– $net_j$ is the weighted sum of the inputs;
– $\varphi$ is the activation function, with derivative $\varphi'$;
– $x_i$ is the i-th input component.
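The learning rule can be tried on a small example. The sketch below trains a single sigmoid neuron on the logical AND function; the function names, the learning rate, the number of epochs, and the zero initial weights (chosen for reproducibility instead of random initialization) are all assumptions of this example.

```python
import math

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def train_delta_rule(samples, eta=0.5, epochs=5000):
    """Train one sigmoid neuron with delta_w_i = eta * (t - y) * phi'(net) * x_i."""
    n_inputs = len(samples[0][0])
    w = [0.0] * n_inputs   # zero start for reproducibility
    b = 0.0                # bias, i.e. the weight of a constant input 1
    for _ in range(epochs):
        for x, t in samples:
            net = sum(wi * xi for wi, xi in zip(w, x)) + b
            y = sigmoid(net)
            grad = (t - y) * y * (1.0 - y)   # error term times phi'(net)
            w = [wi + eta * grad * xi for wi, xi in zip(w, x)]
            b += eta * grad
    return w, b

def predict(w, b, x):
    """Threshold the neuron output at 0.5 to obtain a class label."""
    net = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 if sigmoid(net) >= 0.5 else 0

# Logical AND is linearly separable, so a single neuron can learn it.
and_data = [([0, 0], 0), ([0, 1], 0), ([1, 0], 0), ([1, 1], 1)]
w, b = train_delta_rule(and_data)
predictions = [predict(w, b, x) for x, _ in and_data]
```

For the sigmoid, the derivative has the convenient closed form y(1 - y), which is why no separate derivative function is needed.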
Gradient descent
The delta rule is derived from gradient descent optimization,
which attempts to minimize the total error E in the output.
• The aim is to adjust the weights in order to minimize E.
The fastest way to reduce E is to move against its gradient, computed for each neuron:
$\nabla E = [\partial E/\partial w_1, \partial E/\partial w_2, \ldots, \partial E/\partial w_n]$
• Change the i-th weight by
$\Delta w_i = -\eta \, \partial E/\partial w_i$
[Figure: error as a function of the weights in multidimensional space]
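To make the link between gradient descent and the delta rule explicit, the short derivation below assumes that the total error is the sum of squared output errors (an assumed, commonly used definition of E), with $y_j = \varphi(net_j)$ and $net_j = \sum_i w_{j,i} x_i$.

```latex
E = \frac{1}{2} \sum_j (t_j - y_j)^2, \qquad
y_j = \varphi(net_j), \qquad
net_j = \sum_i w_{j,i} x_i

\frac{\partial E}{\partial w_{j,i}}
  = -(t_j - y_j) \, \frac{\partial y_j}{\partial net_j} \, \frac{\partial net_j}{\partial w_{j,i}}
  = -(t_j - y_j) \, \varphi'(net_j) \, x_i

\Delta w_{j,i} = -\eta \, \frac{\partial E}{\partial w_{j,i}}
  = \eta \, (t_j - y_j) \, \varphi'(net_j) \, x_i
```

Under this assumption the gradient descent step is exactly the delta rule: learning rate times error term times derivative of the activation function times input.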
Perceptron convergence theorem
• Provided that the classes are linearly separable, the final
weights of a perceptron network are reached in a
finite number of steps, regardless of the initialization.
• However, a single-layer perceptron can only learn linearly
separable classes
Feed-forward multi layer NN
[Figure: 3-4-2 network with input layer, hidden layer, and output layer]
• Layer 0 is input nodes, Layers 1 to N-1 are hidden nodes, Layer N is output nodes
• All nodes at any layer k are connected to all nodes at layer k+1
• There are no cycles
• Can compute functions with convex regions
• Each hidden node acts like a perceptron, learning a separating line
• Combination of multiple linear classifiers
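The layered, fully connected structure described above can be sketched as a forward pass through a 3-4-2 network. The data layout (a list of layers, each a list of neurons with weights `w` and bias `b`), the sigmoid activation, the random weight range, and the input values are assumptions of this example.

```python
import math
import random

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def make_layer(n_in, n_out, rng):
    """One fully connected layer: n_out neurons, each with n_in weights and a bias."""
    return [{"w": [rng.uniform(-1, 1) for _ in range(n_in)],
             "b": rng.uniform(-1, 1)} for _ in range(n_out)]

def forward(layers, x):
    """Propagate x through the layers in order; there are no cycles."""
    activations = x
    for layer in layers:
        activations = [
            sigmoid(sum(w * a for w, a in zip(neuron["w"], activations)) + neuron["b"])
            for neuron in layer
        ]
    return activations

rng = random.Random(0)                 # fixed seed, only for reproducibility
network = [make_layer(3, 4, rng),      # layer 1: 4 hidden neurons, 3 inputs each
           make_layer(4, 2, rng)]      # layer 2: 2 output neurons, 4 inputs each
output = forward(network, [0.2, -0.4, 0.9])

# Count trainable parameters: every weight plus one bias per neuron.
n_weights = sum(len(n["w"]) + 1 for layer in network for n in layer)
```

Counting one bias per neuron, this network has 4 x (3 + 1) + 2 x (4 + 1) = 26 trainable parameters.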
Backpropagation
• Can’t use the Delta Learning Rule
– Target values for hidden units are not available! How to compute the error term?
• Use the backpropagation algorithm:
1. Compute the error term for the output units, as with the perceptron
2. From the output layer, repeat
- propagating the error term back to the previous layer and
- updating the weights between the two layers
until the earliest hidden layer is reached
• Backpropagation is derived from gradient descent (the delta rule is a special case).
Backpropagation algorithm
• Initialize weights (typically random!)
• Keep doing epochs:
– For each example e in training set do
• forward pass to compute
– O = neural-net-output
– miss = (T-O) at each output unit
• backward pass to calculate deltas to weights
• update all weights
– end
• Until error stops improving or max number of epochs reached
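The loop above can be sketched in plain Python for a network with one hidden layer. Everything specific here is an assumption of the example: the XOR data set, sigmoid units, three hidden neurons, the learning rate, the fixed seed, and the use of a fixed number of epochs in place of the "error stops improving" test.

```python
import math
import random

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def forward(W1, W2, x):
    """Forward pass; the last weight in every row is the bias (extra input 1)."""
    xb = x + [1.0]
    h = [sigmoid(sum(w * v for w, v in zip(row, xb))) for row in W1]
    hb = h + [1.0]
    o = [sigmoid(sum(w * v for w, v in zip(row, hb))) for row in W2]
    return xb, hb, o

def total_error(W1, W2, data):
    """Sum of squared output errors over the whole training set."""
    return sum((t - o) ** 2
               for x, target in data
               for t, o in zip(target, forward(W1, W2, x)[2]))

def train(data, n_hidden=3, eta=0.5, max_epochs=5000, seed=0):
    rng = random.Random(seed)
    n_in, n_out = len(data[0][0]), len(data[0][1])
    # Initialize weights (small random values).
    W1 = [[rng.uniform(-1, 1) for _ in range(n_in + 1)] for _ in range(n_hidden)]
    W2 = [[rng.uniform(-1, 1) for _ in range(n_hidden + 1)] for _ in range(n_out)]
    initial_error = total_error(W1, W2, data)
    for _ in range(max_epochs):                      # keep doing epochs
        for x, target in data:                       # for each example e
            xb, hb, o = forward(W1, W2, x)           # forward pass: O
            # miss = (T - O) at each output unit, scaled by the sigmoid derivative
            d_out = [(t - ok) * ok * (1 - ok) for t, ok in zip(target, o)]
            # backward pass: propagate the error terms to the hidden layer
            d_hid = [hb[j] * (1 - hb[j]) * sum(d_out[k] * W2[k][j] for k in range(n_out))
                     for j in range(n_hidden)]
            # update all weights
            for k in range(n_out):
                W2[k] = [w + eta * d_out[k] * v for w, v in zip(W2[k], hb)]
            for j in range(n_hidden):
                W1[j] = [w + eta * d_hid[j] * v for w, v in zip(W1[j], xb)]
    return W1, W2, initial_error, total_error(W1, W2, data)

xor_data = [([0.0, 0.0], [0.0]), ([0.0, 1.0], [1.0]),
            ([1.0, 0.0], [1.0]), ([1.0, 1.0], [0.0])]
W1, W2, error_before, error_after = train(xor_data)
```

Note that the hidden-layer error terms are computed from the output-layer weights before those weights are updated, so both layers use the same forward-pass state.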
Backpropagation algorithm
The figures illustrate how the signal propagates through the network. Symbols w(xm)n represent
the weights of the connections between network input xm and neuron n in the input layer. Symbols yn
represent the output signal of neuron n.
Propagation of signals through the hidden layer. Symbols wmn represent weights of
connections between output of neuron m and input of neuron n in the next layer.
Propagation of signals through the output layer.
In the next step of the algorithm, the output signal of the network y is compared
with the desired output value (the target), which is found in the training data
set. The difference is called the error signal δ of the output-layer neuron.
The idea is to propagate the error signal δ back to all neurons whose
output signals were inputs to the neuron under discussion.
The weight coefficients wmn used to propagate errors back are the same as those used
when computing the output value. Only the direction of data flow is changed (signals
are propagated from the output to the inputs, one layer after the other). This technique is used for
all network layers.
Once the error signal for each neuron has been computed, the weight coefficients of each
neuron's input connections can be modified. In the update formulas, df(e)/de represents the derivative
of the activation function of the neuron whose weights are being modified.
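The steps described above can be summarized in equations. The following is a sketch of one standard formulation for a squared-error objective, written in the notation of the text. The symbols z (target value) and η (learning rate), the indexing of the sums, and the choice to include the derivative df(e)/de inside each error term δ are assumptions of this sketch.

```latex
% Error term of an output neuron n: (target - output) times the derivative
% of its activation function f_n at its weighted input sum e_n
\delta_n = (z_n - y_n) \, \frac{d f_n(e_n)}{d e_n}

% Error term of a hidden neuron m: error terms of the neurons n that receive
% its output, propagated back through the same weights w_{mn}
\delta_m = \frac{d f_m(e_m)}{d e_m} \sum_{n} w_{mn} \, \delta_n

% Weight update for the connection from neuron m (output y_m) to neuron n
w'_{mn} = w_{mn} + \eta \, \delta_n \, y_m
```

For a connection that starts at a network input x_m, the output y_m in the update is replaced by x_m.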
The decision boundary perspective
[Figure sequence: the decision boundary with initial random weights; then, repeatedly, present a training instance and adjust the weights; eventually ….]
What learning rate?
1. Tuning set, or
2. Cross validation, or
3. Small for slow, conservative learning
How many layers?
Structure     Types of decision regions
Single-layer  Half plane bounded by a hyperplane
Two-layer     Convex open or closed regions
Three-layer   Arbitrary (complexity limited by number of nodes)
[The remaining columns (Exclusive-OR problem, classes with meshed regions, most general region shapes) are figures showing regions labeled A and B.]
Neural Networks – An Introduction Dr. Andrew Hunter
How big a training set?
• Mostly, empirical rules…
• Determine your target error rate, e
(success rate is 1 - e)
• Typical training set size is approx. n/e, where n is the number of
weights in the net
• Example:
– e = 0.1, n = 80 weights
– training set size 800; a network trained until 95% of the training set is
classified correctly should typically produce 90% correct classification on the testing
set
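The rule of thumb is simple arithmetic; a tiny helper (the name `training_set_size` is hypothetical) reproduces the example.

```python
def training_set_size(n_weights, target_error):
    """Rule-of-thumb training set size: about n / e."""
    if not 0 < target_error < 1:
        raise ValueError("target_error must be between 0 and 1")
    return round(n_weights / target_error)

size = training_set_size(80, 0.1)   # the example above: n = 80, e = 0.1
```

Halving the target error rate doubles the suggested training set size, which is why low error targets become expensive for large networks.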
Limitations of Neural Networks
Randomly initialized, densely connected networks lead to:
• High training cost
– Each neuron in the neural network can be considered as a regression algorithm.
Training the entire neural network means training all the interconnected regressions.
• Difficulty in training as the number of hidden layers increases
– In backpropagation, the gradient becomes progressively more diluted. That is, below
the top layers, the correction signal is minimal
• Getting stuck in local optima
– Random initialization does not guarantee starting in the
proximity of the global optimum.
Solution:
– Deep Learning / learning multiple levels of representation
Deep learning
First conceived for image classification tasks, as a
replication of the mammalian visual cortex
Cascade of many hidden layers of locally connected units, for feature
extraction and transformation
The hidden layers of a deep network learn multiple levels of
representations that correspond to different levels of abstraction;
the levels form a hierarchy of concepts.
Deep learning: the current big thing?
In “Nature” 27 January 2016:
• “DeepMind’s program AlphaGo beat Fan Hui, the European Go
champion, five times out of five...”
• “AlphaGo was not preprogrammed to play Go: it used a general-purpose
algorithm to interpret the game’s patterns.”
• “…AlphaGo program applied deep learning to neural networks
(convolutional NN).”
Deep Learning Today
• Computer Vision and Image Processing
– Feature engineering is the bread-and-butter of a large portion of the CV community,
which creates some resistance to feature learning
– But the record holders on ImageNet and Semantic Segmentation are convolutional
nets
• Speech recognition
– A few long-standing performance records were broken with deep learning methods
– Microsoft and Google have both deployed DL-based speech recognition systems in
their products
• Advancement in Natural Language Processing
– Fine-grained sentiment analysis, syntactic parsing
– Language model, machine translation, question answering
• … potentially any field, including bioinformatics
Motivations for Deep Architectures
• Insufficient depth can hurt
– With a shallow architecture (SVM, NB, KNN, etc.), the required number of nodes in the
graph (i.e. computations, and also number of parameters, when we try to learn the
function) may grow very large.
– Many functions that can be represented efficiently with a deep architecture cannot be
represented efficiently with a shallow one.
• The brain has a deep architecture
– The visual cortex shows a sequence of areas each of which contains a
representation of the input, and signals flow from one to the next.
– Note that representations in the brain are in between dense distributed and purely local:
they are sparse: about 1% of neurons are active simultaneously in the brain.
• Cognitive processes seem deep
– Humans organize their ideas and concepts hierarchically.
– Humans first learn simpler concepts and then compose them to represent more abstract
ones.
– Engineers break up solutions into multiple levels of abstraction and processing
Deep Learning vs Traditional ML
Traditional pattern recognition models use hand-crafted features
and a relatively simple trainable classifier.
[Diagram: hand-crafted feature extractor → “simple” trainable classifier → output]
This approach has the following limitations:
It is very tedious and costly to develop hand-crafted features
The hand-crafted features are usually highly dependent on one application, and
cannot be transferred easily to other applications
Deep Learning vs Traditional ML
Deep learning (a.k.a. representation learning) seeks to learn rich
hierarchical representations (i.e. features) automatically through
multiple stages of a feature learning process.
[Diagram: low-level features → mid-level features → high-level features → trainable classifier → output]
Learning Hierarchical Representations
[Diagram: low-level features → mid-level features → high-level features → trainable classifier → output, with increasing level of abstraction]
• Hierarchy of representations with increasing level of abstraction.
Each stage is a kind of trainable nonlinear feature transform
• Image recognition:
– Pixel → edge → texton → motif → part → object
• Text:
– Character → word → word group → clause → sentence → story
• Virtually any kind of application requiring PR (e.g. DNA sequence classification)
Convolutional Neural Network (CNN)
• Convolutional Neural Networks are inspired by the mammalian visual cortex.
– The visual cortex contains a complex arrangement of cells, which are sensitive to
small sub-regions of the visual field, called receptive fields. These cells act as local
filters over the input space and are well-suited to exploit the strong spatially local
correlation present in natural images.
– Two basic cell types:
• Simple cells respond maximally to specific edge-like patterns within their
receptive field.
• Complex cells have larger receptive fields and are locally invariant to the exact
position of the pattern.
Convolutional Neural Networks (CNN)
• Inspired by the neurophysiological experiments conducted by [Hubel & Wiesel 1962],
CNNs are a special type of neural network whose hidden units are only connected to
a local receptive field. The number of parameters needed by CNNs is therefore much smaller.
• Input can have very high dimension. Using a fully-connected neural network would
need a large number of parameters.
Example: 200x200 image
a) fully connected: 40,000 hidden units
=> 1.6 billion parameters
b) CNN: 5x5 kernel, 100 feature maps
=> 2,500 parameters
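The two parameter counts in the example can be checked with a few lines of arithmetic. The function names are hypothetical, and biases are ignored, as they appear to be in the slide's figures.

```python
def fully_connected_params(height, width, hidden_units):
    """Weights when every hidden unit connects to every pixel (biases ignored)."""
    return height * width * hidden_units

def conv_params(kernel_h, kernel_w, feature_maps):
    """Weights when each feature map shares a single kernel (biases ignored)."""
    return kernel_h * kernel_w * feature_maps

fc = fully_connected_params(200, 200, 40_000)   # 200 * 200 * 40,000
cnn = conv_params(5, 5, 100)                    # 5 * 5 * 100
ratio = fc // cnn                               # how many times fewer weights
```

The convolutional count does not depend on the image size at all, because the same 5x5 kernel is reused at every position.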
CNN Architecture
• Intuition: Neural network with specialized connectivity structure
– Stacking multiple layers of feature extractors
– Low-level layers extract local features.
– High-level layers learn global patterns.
• A CNN is a list of layers that transform the input data into an output class/prediction.
• There are a few distinct types of layers:
– Convolutional layer
– Non-linear layer
– Pooling layer
Building-blocks for CNN
Feature maps of a larger region are combined.
Feature maps are trained with neurons.
Shared weights
Each sub-region yields a feature map, representing its feature.
Images are segmented into sub-regions.
Convolution operation
Input: an image (2-D array)
Convolution kernel/operator (2-D array of learnable parameters): w
Feature map (2-D array of processed data): s
Convolution operation in 2-D domains:
[Equation: 2-D convolution of the input image with the kernel w, producing the feature map s]
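A minimal sketch of the operation in plain Python follows. It assumes no padding and stride 1, and calls the input `x` (a name assumed here). The first function computes cross-correlation (no kernel flip), which is what most deep learning libraries implement under the name "convolution"; the second flips the kernel to obtain convolution in the strict mathematical sense. The image and kernel values are illustrative.

```python
def correlate2d_valid(x, w):
    """Slide kernel w over image x (no padding, stride 1) and return feature map s.

    s[i][j] = sum over (m, n) of x[i + m][j + n] * w[m][n]
    """
    kh, kw = len(w), len(w[0])
    out_h = len(x) - kh + 1
    out_w = len(x[0]) - kw + 1
    return [[sum(x[i + m][j + n] * w[m][n]
                 for m in range(kh) for n in range(kw))
             for j in range(out_w)]
            for i in range(out_h)]

def convolve2d_valid(x, w):
    """True convolution: the same operation with the kernel flipped in both axes."""
    flipped = [row[::-1] for row in w[::-1]]
    return correlate2d_valid(x, flipped)

image = [[1, 2, 3, 4],
         [5, 6, 7, 8],
         [9, 10, 11, 12],
         [13, 14, 15, 16]]
kernel = [[1, 0],
          [0, 2]]          # illustrative values; in a CNN these are learned

s = correlate2d_valid(image, kernel)
s_flipped = convolve2d_valid(image, kernel)
```

Because the kernel weights are learned, the distinction between the two variants does not affect what a CNN can represent: a network trained with one simply learns the flipped kernel of the other.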
Multiple Convolutions
Usually there are multiple feature maps,
one for each convolution operator.
CNN Architecture: Convolutional Layer
• The core layer of CNNs
• The convolutional layer consists of a set of filters.
– Each filter covers a spatially small portion of the input data.
• Each filter is convolved across the dimensions of the input data, producing a
multidimensional feature map.
– As we convolve the filter, we are computing the dot product between the
parameters of the filter and the input.
• Intuition: the network will learn filters that activate when they see some
specific type of feature at some spatial position in the input.
• The key architectural characteristics of the convolutional layer are
– local connectivity
– and shared weights.
CNN Convolutional Layer: Local Connectivity
• Neurons in layer m are only connected to 3
adjacent neurons in the m-1 layer.
• Neurons in layer m+1 have a similar
connectivity with the layer below.
• Each neuron is unresponsive to variations
outside of its receptive field with respect to the
input.
– Receptive field: small neuron collections which
process portions of the input data
• The architecture thus ensures that the learnt
feature extractors produce the strongest
response to a spatially local input pattern.
CNN Convolutional Layer: Shared Weights
We show 3 hidden neurons belonging to the same feature map (the
layer right above the input layer).
Weights of the same color are shared—constrained to be identical.
Gradient descent can still be used to learn such shared parameters,
with only a small change to the original algorithm.
Replicating neurons in this way allows for features to be detected
regardless of their position in the input (spatial invariance).
Additionally, weight sharing increases learning efficiency by
greatly reducing the number of free parameters being learnt.
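Local connectivity and weight sharing can be illustrated in one dimension. In this sketch each hidden neuron sees 3 adjacent inputs and all neurons reuse the same 3 weights; the function name, the weight values, and the input signals are assumptions chosen for illustration.

```python
def feature_map_1d(inputs, shared_weights, bias=0.0):
    """Each hidden neuron sees only len(shared_weights) adjacent inputs,
    and all neurons reuse the same weights."""
    k = len(shared_weights)
    return [sum(w * inputs[i + j] for j, w in enumerate(shared_weights)) + bias
            for i in range(len(inputs) - k + 1)]

weights = [-1.0, 2.0, -1.0]            # 3 shared weights (illustrative values)
signal = [0, 0, 1, 0, 0, 0, 0]         # a "feature" at position 2
shifted = [0, 0, 0, 0, 1, 0, 0]        # the same feature at position 4

fm_signal = feature_map_1d(signal, weights)
fm_shifted = feature_map_1d(shifted, weights)
```

Shifting the input by two positions shifts the response by two positions without changing its strength, so the same feature is detected wherever it occurs. The layer uses 3 free weights, whereas connecting each of the 5 hidden neurons to all 7 inputs would need 35.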
CNN Architecture: Non-linear Layer
Intuition: Increase the nonlinearity of the entire architecture without affecting the
receptive fields of the convolution layer
A layer of neurons that applies a non-linear activation function, such as:
• Tanh: $\tanh x = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}$
• ReLU: $f(x) = \max(0, x)$
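A non-linear layer simply applies the chosen function to every element of a feature map. The sketch below uses hypothetical names (`nonlinear_layer`, `relu`) and an illustrative 2x2 feature map.

```python
import math

def tanh(x):
    """tanh(x) = (e^x - e^-x) / (e^x + e^-x); output lies in (-1, 1)."""
    return (math.exp(x) - math.exp(-x)) / (math.exp(x) + math.exp(-x))

def relu(x):
    """ReLU: f(x) = max(0, x)."""
    return max(0.0, x)

def nonlinear_layer(feature_map, activation=relu):
    """Apply the activation element-wise; the shape of the feature map is unchanged."""
    return [[activation(v) for v in row] for row in feature_map]

activated = nonlinear_layer([[-1.5, 0.0], [2.0, -0.5]])
```

Because the function is applied element by element, the layer has no parameters of its own and leaves the receptive fields untouched.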
CNN Architecture: Pooling Layer
Intuition: progressively reduce the spatial size of the representation
to reduce the number of parameters and the amount of computation in the network, and
hence also to control overfitting
Max pooling partitions the input image into a set of non-overlapping
rectangles and, for each such sub-region, outputs the maximum value
of the features in that region.
Pooling
• Common pooling operations:
– Max pooling: reports the
maximum output within a
rectangular neighborhood.
– Average pooling: reports the
average output of a rectangular
neighborhood (possibly
weighted by the distance from
the central pixel).
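Both pooling operations can be sketched with one helper. This example assumes square, non-overlapping windows (stride equal to the window size) and an unweighted average; the distance-weighted variant mentioned above is not implemented. The function name and the feature map values are illustrative.

```python
def pool2d(feature_map, size=2, mode="max"):
    """Pool over non-overlapping size x size windows (stride equal to size)."""
    reduce_fn = max if mode == "max" else (lambda vals: sum(vals) / len(vals))
    pooled = []
    for i in range(0, len(feature_map) - size + 1, size):
        row = []
        for j in range(0, len(feature_map[0]) - size + 1, size):
            window = [feature_map[i + m][j + n]
                      for m in range(size) for n in range(size)]
            row.append(reduce_fn(window))
        pooled.append(row)
    return pooled

fmap = [[1, 3, 2, 0],
        [4, 6, 1, 1],
        [0, 2, 9, 5],
        [1, 1, 7, 3]]
max_pooled = pool2d(fmap, mode="max")
avg_pooled = pool2d(fmap, mode="average")
```

With 2x2 windows the 4x4 feature map shrinks to 2x2, so each pooling layer cuts the number of values passed to the next layer by a factor of four.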
Full CNN
SVM: hand-crafted kernel function, then apply a simple classifier
Deep learning: learnable kernel, then a simple classifier
[Diagram: network mapping the inputs x1, x2, …, xN through its layers to the outputs y1, …, yM]
Deep CNN: ImageNet 2012 winner
Multiple feature maps per convolutional layer.
Multiple convolutional layers for extracting features at different levels.
Higher-level layers take the feature maps in lower-level layers as input.
Tools for Deep Learning
Deep Learning
… moving beyond shallow machine learning since 2006!
http://deeplearning.net/software_links
Caffe (Python, Matlab)
Theano (Python)
Torch (Lua)
TensorFlow (Python, C++)
… many others