Pattern Recognition,
Artificial Neural Networks,
and Machine Learning
Yuan-Fang Wang
Department of Computer Science
University of California
Santa Barbara, CA 93106, USA
“Pattern Recognition”
What is a Pattern?
PR , ANN, & ML 2
DNA patterns (sequences over A, G, C, T, e.g., AGCTCGAT); protein patterns (sequences over the 20 amino acids)
Faces
Fingerprints
Other Patterns
Insurance, credit card applications
applicants are characterized by a pattern
# of accidents, make of car, model year,
income, # of dependents, credit worthiness,
mortgage amount
Dating services
Age, hobbies, income, etc. establish your
“desirability”
Other Patterns
Web documents
Keyword-based description (e.g., documents
containing war, Baghdad, Hussein are different
from those containing football, NFL, AFL,
draft, quarterbacks)
Intrusion detection
Usage and connection patterns
Cancer detection
Image features for tumors, patient age,
treatment option, etc.
Other Patterns
Housing market
Location, size, year, school district
University ranking
Student population, student-faculty ratio,
scholarship opportunities, location, faculty research
grants, etc.
Too many to list; see, e.g.,
http://www.ics.uci.edu/~mlearn/MLSummary.html
What is a pattern?
A pattern is a set of objects, processes or
events which consist of both deterministic
and stochastic components
A pattern is a record of certain dynamic
processes influenced both by deterministic
and stochastic factors
What is a Pattern? (cont.)
Patterns lie on a spectrum between the completely regular and deterministic (e.g., crystal structure) and the completely random (e.g., white noise); constellation patterns, texture patterns, EKG patterns, etc., fall in between.
What is Pattern Recognition?
Classifies “patterns” into “classes”
Patterns (x)
have “measurements”, “traits”, or “features”
Classes (ω_i)
likelihood (a priori probability P(ω_i))
class-conditional density p(x|ω_i)
Classifier (f(x) → ω_i)
An example
four coin classes: penny, nickel, dime, and quarter
measurements: weight, color, size, etc.
Assign a coin to a class based on its size, weight, etc.
We use P to denote a probability mass function (discrete) and
p to denote a probability density function (continuous)
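The coin example can be sketched as a minimal classifier f(x) → ω_i; the per-class prototype measurements (weight in grams, diameter in mm) below are hypothetical, not real US Mint specifications:

```python
# Minimal nearest-prototype coin classifier (all numbers hypothetical).
# Each class omega_i is represented by a prototype (weight, diameter).
prototypes = {
    "penny":   (2.5, 19.0),
    "nickel":  (5.0, 21.2),
    "dime":    (2.3, 17.9),
    "quarter": (5.7, 24.3),
}

def classify(x):
    """f(x) -> omega_i: pick the class whose prototype is closest to x."""
    return min(prototypes,
               key=lambda c: sum((a - b) ** 2 for a, b in zip(x, prototypes[c])))

print(classify((5.6, 24.0)))  # close to the quarter prototype -> "quarter"
```

A real system would replace the hand-picked prototypes with means learned from training samples.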
An Example
Many visual inspection systems are like this:
Circuit board, fruit, OCR, etc.
Another Example
Features
The intrinsic traits or characteristics that tell
one pattern (object) apart from another
Feature extraction and representation allow
Focus on relevant, distinguishing parts of a pattern
Data reduction and abstraction
Detection vs. Description
Detection: something happened
Heard noise
Saw something interesting
Non-flat signals
Description: what has happened?
Gun shot, talking, laughing, crying, etc.
Lines, corners, textures
Mouse, cat, dog, bike, etc.
Feature Selection
More an art than a science
Effectiveness criteria:
(Class histograms of population vs. size:)
Size alone is not effective
(Class histograms over perimeter and over compactness:)
Perimeter is not effective
Discrimination is accomplished by compactness alone
(Scatter plot of elongatedness vs. compactness:)
The two feature values are correlated; only one of them
is needed
(Decision boundaries: too simple (underfitting) vs. too complicated (overfitting))
Optimal tradeoff between performance and generalization
Importance of Features
Cannot be overstated
We usually don’t know which to select,
what they represent, and how to tune them
(face, gait recognition, tumor detection, etc.)
Classification and regression schemes are
mostly trying to make the best of whatever
features are available
Features
One is usually not descriptive (no silver
bullet)
Many (shotgun approach) can actually hurt
Many problems:
Relevance
Dimensionality
Co-dependency
Time- and space-varying characteristics
Accuracy
Uncertainty and error
Missing values
Feature Selection (cont.)
Q: How to decide if a feature is effective?
A: Through a training phase
Training on typical samples and typical features
to discover
Whether features are effective
Whether there is any redundancy
The typical cluster shape (e.g., Gaussian)
Decision boundaries between samples
Cluster centers of particular samples
Etc.
Classifiers
Decide ω_i if g_i(x) > g_j(x) for all j ≠ i
g_i(x) = P(ω_i) if no measurements are made
g_i(x) = P(ω_i|x) to minimize the misclassification rate
g_i(x) = −R(α_i|x) to minimize the associated risk
(Architecture: x feeds discriminant functions g_1(x), g_2(x), …, g_n(x); a max unit makes the decision)
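A sketch of the max-discriminant architecture, using g_i(x) = p(x|ω_i)P(ω_i) with assumed Gaussian class-conditionals; the priors and (μ_i, σ_i) values are hypothetical:

```python
import math

def gaussian(x, mu, sigma):
    """1-D Gaussian density p(x | omega_i)."""
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

priors = [0.7, 0.3]                  # P(omega_1), P(omega_2) -- hypothetical
params = [(0.0, 1.0), (3.0, 1.0)]    # (mu_i, sigma_i) per class -- hypothetical

def decide(x):
    """Decide omega_i with the largest discriminant g_i(x) = p(x|omega_i) P(omega_i)."""
    g = [gaussian(x, mu, s) * P for (mu, s), P in zip(params, priors)]
    return max(range(len(g)), key=lambda i: g[i])

print(decide(0.5), decide(2.8))  # 0 1
```

Swapping in a different g_i (posterior, negative risk) changes the decision criterion without changing the max-unit architecture.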
Traditional Pattern Recognition
Parametric methods
Based on class samples exhibiting a certain
parametric distribution (e.g., Gaussian)
Learn the parameters through training
Density methods
Does not enforce a parametric form
Learn the density function directly
Decision boundary methods
Learn the separation in the feature space
Parametric Methods
(Four panels I–IV: class histograms of population vs. size, each fit with a parametric density)
p(x) = 1 / ((2π)^(n/2) σ^n) · exp(−|x − μ|² / (2σ²))
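For the n = 1 case of the density above, the parameters μ and σ² can be learned from training samples with the usual maximum-likelihood estimates; the sample values here are hypothetical:

```python
import math

samples = [4.8, 5.1, 5.0, 4.9, 5.2]   # hypothetical training measurements for one class

# Maximum-likelihood estimates of the Gaussian parameters
mu = sum(samples) / len(samples)
sigma2 = sum((s - mu) ** 2 for s in samples) / len(samples)

def density(x):
    """p(x) = exp(-|x - mu|^2 / (2 sigma^2)) / ((2 pi)^(1/2) sigma), with n = 1."""
    return math.exp(-(x - mu) ** 2 / (2 * sigma2)) / math.sqrt(2 * math.pi * sigma2)

print(round(mu, 2))  # 5.0
```

The fitted density then serves as the class-conditional p(x|ω_i) in the classifiers that follow.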
Density Methods
Feature space
d-dimensional (d = the number of features)
populated with features from training samples
(axes f_1, f_2, …, f_d)
Decision Boundary Methods
A new sample “?” in the feature space (f_1, f_2, …, f_d) is classified by
• Decision surfaces: which side of the learned surface it falls on
• Cluster centers: which learned center it lies closest to
“Modern” vs “Traditional”
Pattern Recognition
Traditional: hand-crafted features; simple and low-level (a concatenation of numbers or traits); syntactic; feature detection and description are separate tasks from classifier design
Modern: automatically learned features; hierarchical and complex; semantic; feature detection and description are jointly optimized with the classifier
Traditional Features
Modern Features
Mathematical Foundation
No matter what methods or techniques you
use, the underlying mathematical principle
is quite simple:
Bayesian theory is the foundation
Review: Bayes Rule
Forward (synthesis) route:
From class to sample in a class
Grammar rules to sentences
Markov chain (or HMM) to pronunciation
Texture rules (primitive + repetition) to textures
Backward (analysis) route:
From sample to class ID
A sentence parsed by a grammar
An utterance is “congratulations” (not “constitution”)
Brick wall vs. plaid shirt
Review: Bayes Rule
Backward is always harder
Because the interpretation is not unique
Presence of x has multiple possible interpretations (ω_1, ω_2, ω_3, …, ω_n)
The simplest example
Two classes: pennies and dimes
No measurements
Classification:
based on the a prior probabilities
Error rate:
Decide ω_1 if P(ω_1) > P(ω_2)
Decide ω_2 if P(ω_1) < P(ω_2)
Decide ω_1 or ω_2 otherwise
Error rate: min(P(ω_1), P(ω_2))
A slightly more complicated example
Two classes: pennies and dimes
A measurement x is made (e.g. weight)
Classification
based on the a posterior probabilities with
Bayes rule
Decide ω_1 if P(ω_1|x) > P(ω_2|x)
Decide ω_2 if P(ω_1|x) < P(ω_2|x)
Decide ω_1 or ω_2 otherwise
(The measurement x falls in either the ω_1 or the ω_2 decision region)
P(ω_i|x) = p(x, ω_i) / p(x) = p(x|ω_i) P(ω_i) / p(x)
(Three plots against weight: the class-conditional densities p(x|ω_1) and p(x|ω_2); the densities scaled by the priors, p(x|ω_1)P(ω_1) and p(x|ω_2)P(ω_2), whose sum is the evidence p(x); and the resulting posteriors P(ω_1|x) and P(ω_2|x))
Why Both?
Why both p(x|ω_i) and P(ω_i)?
In the daytime, when some animal runs in front of
you on the bike path, you know exactly what it
is (p(x|ω) is sufficient)
At night, when some animal runs in front of
you on the bike path, you can hardly distinguish
the shape (p(x|ω) is low for all cases), but you
know it is probably a squirrel, not a lion,
because of P(ω)
Essence
Turn a backward (analysis) problem into
several forward (synthesis) problems
Or analysis-by-synthesis
Whichever model has the highest likelihood of
synthesizing the outcome wins
The formula is not mathematically provable
Error rate
Determined by
The likelihood of a class
The likelihood of measuring x in a class
error rate = min(P(ω_1|x), P(ω_2|x)) = (1/p(x)) · min(p(x|ω_1)P(ω_1), p(x|ω_2)P(ω_2))
Error Rate (cont.)
Bayes Decision Rule minimizes the average
error rate:
error = ∫ p(error|x) p(x) dx
p(error|x) = Σ_{i ≠ i*(x)} P(ω_i|x) = 1 − P(ω_{i*(x)}|x)
where i*(x) = argmax_i P(ω_i|x)
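The average error integral can be illustrated on a discretized x, summing p(error|x) p(x) with p(error|x) = 1 − max_i P(ω_i|x); all numbers are hypothetical:

```python
# Each entry: (p(x), [P(omega_1 | x), P(omega_2 | x)]) at a discretized x value.
table = [
    (0.5, [0.9, 0.1]),
    (0.3, [0.6, 0.4]),
    (0.2, [0.2, 0.8]),
]

# error = sum_x p(error | x) p(x), with p(error | x) = 1 - max_i P(omega_i | x)
error = sum(p_x * (1 - max(post)) for p_x, post in table)
print(round(error, 2))  # 0.5*0.1 + 0.3*0.4 + 0.2*0.2 = 0.21
```

No other decision rule can do better at any x, which is why the Bayes rule minimizes the average error.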
Various types of errors
Precision vs. Recall
A very common measure used in the PR and
IR communities
As one goes up, the other HAS to go down
A range of options (Receiver operating
characteristic curves)
Area under the curve
as a goodness measure
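Precision and recall computed from raw decision counts; the confusion counts below are hypothetical:

```python
# Hypothetical confusion counts from a detector's output.
tp, fp, fn = 40, 10, 20   # true positives, false positives, false negatives

precision = tp / (tp + fp)   # of everything we flagged, how much was right
recall    = tp / (tp + fn)   # of everything real, how much did we find

print(precision)  # 0.8
```

Raising the detection threshold typically trades false positives for false negatives, moving precision up and recall down along the operating curve.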
Various ways to measure error rates
Training error
Test error
Empirical error
Some under your control (training and test)
Some not (empirical error)
How: n-fold validation
Why: Overfitting and underfitting problems
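A minimal sketch of n-fold validation (the splitting scheme is one common choice, not the only one): partition the data into n folds, train on n−1 of them, and test on the held-out fold, rotating so every fold is tested once:

```python
def n_fold_splits(data, n):
    """Yield (train, test) pairs; each fold serves as the test set exactly once."""
    folds = [data[i::n] for i in range(n)]          # round-robin split into n folds
    for i in range(n):
        test = folds[i]
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, test

data = list(range(10))
for train, test in n_fold_splits(data, 5):
    assert len(train) == 8 and len(test) == 2       # 5 folds of 2 samples each
```

Averaging the test error over the n rotations gives a less optimistic estimate than the training error alone.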
An even more complicated example
Two classes: pennies or dimes
A measurement x is made
Risk associated with making a wrong decision
Based on the a posterior probabilities with
Bayesian risk
R(α_1|x) = λ_11 P(ω_1|x) + λ_12 P(ω_2|x)
R(α_2|x) = λ_21 P(ω_1|x) + λ_22 P(ω_2|x)
λ_ij: the loss of taking action α_i in state ω_j
R(α_i|x): the conditional risk of action α_i given x
(Diagram: states ω_1, ω_2 give rise to an observation, from which decision 1 or decision 2 is made. Mis-classification is a math problem; mis-interpretation is a human-factors problem)
PR , ANN, & ML 60
(Diagram: every state/decision pairing is possible)
Incorrect decisions incur domain-specific cost
An even more complicated example
R(used as pennies | x) =
r(pennies used as pennies) · P(pennies | x) +
r(dimes used as pennies) · P(dimes | x)
R(used as dimes | x) =
r(pennies used as dimes) · P(pennies | x) +
r(dimes used as dimes) · P(dimes | x)
where P(pennies | x) ∝ p(x|pennies)P(pennies) and P(dimes | x) ∝ p(x|dimes)P(dimes)
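The minimum-risk decision for the penny/dime example, as a sketch; the loss matrix and posteriors are hypothetical:

```python
# loss[i][j] = lambda_ij: loss of action alpha_i (use as class i) in true state omega_j.
loss = [[0.0, 1.0],    # using a dime as a penny loses little  -- hypothetical
        [5.0, 0.0]]    # using a penny as a dime costs more    -- hypothetical

posterior = [0.7, 0.3]  # P(penny | x), P(dime | x) -- hypothetical

# R(alpha_i | x) = sum_j lambda_ij * P(omega_j | x); take the minimum-risk action.
risks = [sum(l * p for l, p in zip(row, posterior)) for row in loss]
action = min(range(2), key=lambda i: risks[i])
print(action)  # 0: treat the coin as a penny (risk 0.3 vs. 3.5)
```

With a zero-one loss (λ_ij = 0 if i = j, else 1), this rule reduces to picking the class with the largest posterior.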
A more credible example
R(call FD|smoke) =
r(call,fire)*P(fire|smoke) +
r(call, no fire)*P(no fire|smoke)
False positive
R(no call FD|smoke)=
r(no call, no fire)*P(no fire|smoke) +
r(no call, fire)*P(fire|smoke)
False negative
The risk associated with false negative is
much higher than that of false positive
A more credible example
R(attack|battlefield intelligence) =
r(attack,<50%)*P(<50%|intelligence) +
r(attack,>50%)*P(>50%|intelligence)
False positive
R(no attack|battlefield intelligence)=
r(no attack, >50%)*P(>50%|intelligence) +
r(no attack, <50%)*P(<50%|intelligence)
False negative
Bayesian Risk
Determined by
likelihood of a class
likelihood of measuring x in a class
the risk of making a wrong action
Classification
Bayesian risk should be minimized:
min(R(α_1|x), R(α_2|x)), or
min(λ_11 P(ω_1|x) + λ_12 P(ω_2|x), λ_21 P(ω_1|x) + λ_22 P(ω_2|x)), or
decide ω_1 if R(α_1|x) < R(α_2|x), i.e., if
(λ_21 − λ_11) P(ω_1|x) > (λ_12 − λ_22) P(ω_2|x)
Bayesian Risk (cont.)
Again, decisions depend on
likelihood of a class
likelihood of observation of x in a class
Modified by some positive risk factors
Why?
Because in the real world, it might not be the
misclassification rate that matters but the
action you take:
(λ_21 − λ_11) P(ω_1|x) vs. (λ_12 − λ_22) P(ω_2|x)
Other generalizations
Multiple classes
n classes, with Σ_{i=1}^{n} P(ω_i) = 1
Multiple measurements
x is a vector instead of a scalar
Non-numeric measurements
Actions vs. decisions
Correlated vs. independent events
speech signals and images
Training allowed or not
Time-varying behaviors
Difficulties
What features to use
How many features (the curse of
dimensionality)
The a priori probability P(ω_i)
The class-conditional density p(x|ω_i)
The a posteriori probability P(ω_i|x)
Typical Approaches
Supervised (with tagged samples x):
parameters of a probability function (e.g., Gaussian: p(x|ω_i) = N(μ_i, Σ_i))
density functions (w/o assuming any parametric forms)
decision boundaries (assuming classes are indeed separable)
Unsupervised (w/o tagged samples x):
minimum distance
hierarchical clustering
Reinforced (with hints)
Told right or wrong, but not the correct answer
Learning with a critic (not a teacher as in supervised)
Pattern Recognition
Uncorrelated events:
Supervised learning: parametric, density, decision boundary
Unsupervised clustering: minimum distance, hierarchical clustering
Correlated events:
Hidden Markov models
Applications
DNA sequence
Lie detectors
Handwritten digit recognition
Classification based on smell
Web document classification and search engine
Defect detection
Texture classification
Image database retrieval
Face recognition
etc.
Other formulations
We talked about 1/3 of the scenarios – that
of classification (discrete)
Regression – continuous
Extrapolation and interpolation
Clustering
Similarity
Abnormality detection
Concept drift (discovery), etc.
Classification vs. Regression
Classification:
Large vs. small hints on category
Absolute values do not matter as much (can actually hurt)
Normalization is often necessary
Fitting order stays low
Regression:
Large means large, small means small
Absolute values matter
Fitting orders matter
Recent Development
Data can be “massaged”: surprisingly,
massaging the data and using simple
classifiers works better than massaging the
classifiers and using simple data (for simple
problems & small data sets)
A hard-to-visualize concept:
Transforming data into a higher-dimensional space
(e.g., an infinite-dimensional one) has a tendency to
separate the data and increase the error margin
The concept behind SVMs and, later, kernel methods
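A toy illustration of the lifting idea (the data and the map φ are made up for illustration): two classes that no single threshold separates on the line become linearly separable after mapping to a higher-dimensional space:

```python
# Class A sits between the two halves of class B on the real line:
class_a = [-1.0, 0.0, 1.0]
class_b = [-3.0, -2.5, 2.5, 3.0]

# No single threshold on x separates them, but phi(x) = (x, x^2) does:
phi = lambda x: (x, x * x)

# In the lifted space, the horizontal line x2 = 2 separates the two classes.
assert all(phi(x)[1] < 2 for x in class_a)
assert all(phi(x)[1] > 2 for x in class_b)
print("separable after lifting")
```

Kernel methods exploit this without ever computing φ(x) explicitly, evaluating only inner products in the lifted space.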
More Recent Development
Think about fitting linear data with a model
Linear, quadratic, cubic, etc.
The higher the order, the better the fit
n data points can be perfectly fit by an (n−1)th-order
polynomial
However
Overfitting is likely
No ability to extrapolate
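The overfitting point can be demonstrated directly (hypothetical near-linear data): a degree-4 polynomial fits 5 points exactly but extrapolates far worse than a simple line:

```python
import numpy as np

x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 2 * x + np.array([0.1, -0.1, 0.05, -0.05, 0.1])   # linear trend + small noise

exact = np.polyfit(x, y, deg=4)   # 5 points, degree 4: interpolates the data exactly
line  = np.polyfit(x, y, deg=1)   # degree 1: small residual, sane extrapolation

print(np.allclose(np.polyval(exact, x), y))  # True: perfect fit at the samples
# At x = 6 the true trend gives y = 12; compare extrapolation errors:
print(abs(np.polyval(exact, 6.0) - 12), abs(np.polyval(line, 6.0) - 12))
```

The high-order fit chases the noise, so its extrapolation error at x = 6 is orders of magnitude larger than the linear fit's.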
“Massage” the classifiers (using deep networks)
Feature detection and description, and
classification, are jointly optimized