Vietnam National University of HCMC
International University
School of Computer Science and Engineering
INTRODUCTION TO ARTIFICIAL INTELLIGENCE
(IT097IU)
LECTURE 06: SUPERVISED LEARNING –
NEURAL NETWORKS
Instructor: Nguyen Trung Ky
Machine Learning
Up until now: model-based classification with Naive Bayes
Machine learning: how to acquire a model from data / experience
Learning parameters (e.g. probabilities)
Learning structure (e.g. neural networks)
Learning hidden concepts (e.g. clustering)
Perceptrons
Error-Driven Classification
Errors, and What to Do
Examples of errors
Dear GlobalSCAPE Customer,
GlobalSCAPE has partnered with ScanSoft to offer you the
latest version of OmniPage Pro, for just $99.99* - the regular
list price is $499! The most common question we've received
about this offer is - Is this genuine? We would like to assure
you that this offer is authorized by ScanSoft, is genuine and
valid. You can get the . . .
. . . To receive your $30 Amazon.com promotional certificate,
click through to
http://www.amazon.com/apparel
and see the prominent link for the $30 offer. All details are
there. We hope you enjoyed receiving this message. However, if
you'd rather not receive future e-mails announcing new store
launches, please click . . .
What to Do About Errors
Problem: there’s still spam in your inbox
Need more features – words aren’t enough!
Have you emailed the sender before?
Have 1M other people just gotten the same email?
Is the sending information consistent?
Is the email in ALL CAPS?
Do inline URLs point where they say they point?
Does the email address you by (your) name?
Naïve Bayes models can incorporate a variety of features, but tend to do
best in homogeneous cases (e.g. all features are word occurrences)
Linear Classifiers
Feature Vectors
Example 1 (spam filtering). The email text
  "Hello, Do you want free printr cartriges? Why pay more when you can get them ABSOLUTELY FREE! Just ..."
maps to the feature vector
  # free : 2, YOUR_NAME : 0, MISSPELLED : 2, FROM_FRIEND : 0, ...
with label SPAM (+).
Example 2 (digit recognition). An image of a handwritten "2" maps to the feature vector
  PIXEL-7,12 : 1, PIXEL-7,13 : 0, ..., NUM_LOOPS : 1, ...
with label "2".
Some (Simplified) Biology
Very loose inspiration: human neurons
Linear Classifiers
Inputs are feature values
Each feature has a weight
Sum is the activation: activation_w(x) = Σ_i w_i · f_i(x) = w · f(x)
If the activation is:
  Positive, output +1
  Negative, output -1
[Diagram: inputs f1, f2, f3 scaled by weights w1, w2, w3, summed, then tested against zero (>0?)]
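A minimal sketch of this decision rule in Python, with feature vectors and weights as dictionaries from feature names to values (an illustrative encoding, not one prescribed by the lecture):

```python
# Binary linear classifier: the activation is the dot product of the
# weight vector with the feature vector; its sign gives the class.

def activation(weights, features):
    """Compute w . f(x) over the features present in the example."""
    return sum(weights.get(name, 0.0) * value for name, value in features.items())

def classify(weights, features):
    """Output +1 if the activation is positive, otherwise -1."""
    return +1 if activation(weights, features) > 0 else -1
```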
Weights
Binary case: compare features to a weight vector
Learning: figure out the weight vector from examples
Weight vector w:
  # free : 4, YOUR_NAME : -1, MISSPELLED : 1, FROM_FRIEND : -3, ...
Example feature vector f(x1):
  # free : 2, YOUR_NAME : 0, MISSPELLED : 2, FROM_FRIEND : 0, ...
Example feature vector f(x2):
  # free : 0, YOUR_NAME : 1, MISSPELLED : 1, FROM_FRIEND : 1, ...
Dot product positive means the positive class
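Working out the two dot products above (using only the four weights shown; the remaining features are ignored):

```latex
\[
\begin{aligned}
w \cdot f(x_1) &= 4\cdot 2 + (-1)\cdot 0 + 1\cdot 2 + (-3)\cdot 0 = 10 > 0 &&\Rightarrow \text{ positive class (spam)}\\
w \cdot f(x_2) &= 4\cdot 0 + (-1)\cdot 1 + 1\cdot 1 + (-3)\cdot 1 = -3 < 0 &&\Rightarrow \text{ negative class (ham)}
\end{aligned}
\]
```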
Decision Rules
Binary Decision Rule
In the space of feature vectors
Examples are points
Any weight vector defines a hyper-plane
One side corresponds to Y=+1
Other corresponds to Y=-1
Example weight vector w:
  BIAS : -3, free : 4, money : 2, ...
[Plot: feature space with axes "free" and "money"; the weight vector defines a line separating the +1 = SPAM region from the -1 = HAM region]
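With the weights above (and a constant BIAS feature equal to 1, an assumption consistent with the plot), the decision boundary is the set of points where the activation is zero:

```latex
\[
w \cdot f(x) = -3 + 4 \cdot \mathrm{free} + 2 \cdot \mathrm{money} = 0,
\qquad
\text{predict } +1 \text{ (SPAM) if } w \cdot f(x) > 0, \text{ else } -1 \text{ (HAM)}.
\]
```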
Weight Updates
Learning: Binary Perceptron
Start with weights = 0
For each training instance:
Classify with current weights
If correct (i.e., y=y*), no change!
If wrong: adjust the weight vector by adding or subtracting the feature vector; subtract if y* is -1 (equivalently, w ← w + y* · f(x); see the sketch below)
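A runnable sketch of this training loop, reusing the dictionary-based feature encoding from earlier (the number of passes over the data is an illustrative choice, not specified in the lecture):

```python
# Binary perceptron training: start at zero weights and, on every mistake,
# add y* . f(x) to the weight vector (add if y* = +1, subtract if y* = -1).
from collections import defaultdict

def train_binary_perceptron(data, passes=10):
    """data: list of (features, label) pairs, with label in {+1, -1}."""
    w = defaultdict(float)                            # start with weights = 0
    for _ in range(passes):
        for features, y_star in data:
            activation = sum(w[name] * value for name, value in features.items())
            y = +1 if activation > 0 else -1          # classify with current weights
            if y != y_star:                           # if wrong, adjust the weights
                for name, value in features.items():
                    w[name] += y_star * value
    return w
```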
Examples: Perceptron
Separable Case
Multiclass Decision Rule
If we have multiple classes:
A weight vector w_y for each class y
Score (activation) of a class y: the dot product w_y · f(x)
Prediction: the class with the highest score wins (formulas below)
Binary = multiclass where the negative class has weight zero
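In symbols, the multiclass scoring and prediction rule described above:

```latex
\[
\mathrm{score}(x, y) = w_y \cdot f(x),
\qquad
\hat{y} = \arg\max_{y} \; w_y \cdot f(x).
\]
```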
Learning: Multiclass Perceptron
Start with all weights = 0
Pick up training examples one by one
Predict with current weights
If correct, no change!
If wrong: lower the score of the wrong answer and raise the score of the right answer:
  w_y ← w_y − f(x) for the predicted (wrong) class, w_y* ← w_y* + f(x) for the true class; see the sketch below
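A sketch of the multiclass perceptron in the same dictionary style (class labels and the number of passes are illustrative assumptions):

```python
# Multiclass perceptron: one weight vector per class; predict the class with
# the highest score, and on a mistake shift weight from the predicted (wrong)
# class to the true class.
from collections import defaultdict

def score(class_weights, features):
    return sum(class_weights[name] * value for name, value in features.items())

def train_multiclass_perceptron(data, classes, passes=10):
    """data: list of (features, true_class) pairs; classes: list of class labels."""
    w = {y: defaultdict(float) for y in classes}      # all weights start at 0
    for _ in range(passes):
        for features, y_star in data:
            y_hat = max(classes, key=lambda y: score(w[y], features))
            if y_hat != y_star:                       # wrong: adjust two classes
                for name, value in features.items():
                    w[y_hat][name] -= value           # lower score of wrong answer
                    w[y_star][name] += value          # raise score of right answer
    return w
```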
Example: Multiclass Perceptron
“win the vote”
“win the election”
“win the game”
Initial weight vectors, one per class (columns of the original slide's table):
  Class 1: BIAS : 1, win : 0, game : 0, vote : 0, the : 0, ...
  Class 2: BIAS : 0, win : 0, game : 0, vote : 0, the : 0, ...
  Class 3: BIAS : 0, win : 0, game : 0, vote : 0, the : 0, ...
Properties of Perceptrons
Separability: true if some parameters get the training set perfectly correct
Convergence: if the training set is separable, the perceptron will eventually converge (binary case)
Mistake Bound: the maximum number of mistakes (binary case) is related to the margin, or degree of separability
[Figures: a linearly separable dataset and a non-separable dataset in feature space]
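A standard form of this bound (the lecture does not spell out the exact statement; this is the classical result for data separated with margin δ by some unit weight vector, with feature vectors of norm at most R):

```latex
\[
\#\text{mistakes} \;\le\; \left(\frac{R}{\delta}\right)^{2},
\qquad \|f(x)\| \le R \text{ for all training examples.}
\]
```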
Examples: Perceptron
Non-Separable Case
Improving the Perceptron
Problems with the Perceptron
Noise: if the data isn’t separable, weights might thrash
  Averaging weight vectors over time can help (averaged perceptron; a sketch follows below)
Mediocre generalization: finds a “barely” separating solution
Over-training: test / validation accuracy usually rises, then falls
  Over-training is a kind of over-fitting
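A sketch of the averaged perceptron mentioned above: train as usual but keep a running sum of the weight vector after every example, and return the average (this uniform-averaging variant is one common choice, not necessarily the exact one intended in the lecture):

```python
# Averaged perceptron (binary case): averaging the weight vectors seen over
# training smooths out the thrashing caused by non-separable (noisy) data.
from collections import defaultdict

def train_averaged_perceptron(data, passes=10):
    """data: list of (features, label) pairs, with label in {+1, -1}."""
    w = defaultdict(float)         # current weight vector
    w_sum = defaultdict(float)     # running sum of weight vectors
    steps = 0
    for _ in range(passes):
        for features, y_star in data:
            activation = sum(w[name] * value for name, value in features.items())
            if (+1 if activation > 0 else -1) != y_star:
                for name, value in features.items():
                    w[name] += y_star * value
            for name, value in w.items():             # accumulate after every example
                w_sum[name] += value
            steps += 1
    return {name: total / steps for name, total in w_sum.items()}
```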
Fixing the Perceptron
Idea: adjust the weight update to mitigate these effects
MIRA*: choose an update size τ that fixes the current mistake…
… but minimizes the change to w:
  minimize (1/2) Σ_y ||w_y − w'_y||²  subject to  w_y* · f(x) ≥ w_y · f(x) + 1
  (w' = weights before the update, y = wrongly predicted class, y* = true class)
The +1 in the constraint helps to generalize
* Margin Infused Relaxed Algorithm
Minimum Correcting Update
Only w_y and w_y* change: w_y* gains τ f(x), w_y loses τ f(x)
The minimizing τ is not 0 (or we would not have made an error), so the minimum is where the constraint holds with equality (worked out below)
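Setting the constraint to equality and solving for τ gives the closed-form update (a reconstruction of the formula the slide refers to, in the standard MIRA form):

```latex
\[
w_{y^*} = w'_{y^*} + \tau\, f(x), \qquad
w_{y} = w'_{y} - \tau\, f(x), \qquad
\tau = \frac{\bigl(w'_{y} - w'_{y^*}\bigr) \cdot f(x) + 1}{2\, f(x) \cdot f(x)}.
\]
```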
Maximum Step Size
In practice, it’s also bad to make updates that are too large
Example may be labeled incorrectly
You may not have enough features
Solution: cap the maximum possible value of τ with some constant C, i.e. use min(τ, C); a code sketch follows below
Corresponds to an optimization that assumes non-separable data
Usually converges faster than perceptron
Usually better, especially on noisy data
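A sketch of the capped MIRA update in the same multiclass setting (the dictionary representation and the default value of C are illustrative assumptions):

```python
# MIRA update with capped step size: on a mistake, move the true class's
# weights toward f(x) and the predicted class's weights away from it,
# with step size tau = ((w_y - w_y*) . f + 1) / (2 f . f), capped at C.
from collections import defaultdict

def dot(u, v):
    return sum(u.get(name, 0.0) * value for name, value in v.items())

def mira_update(w, features, y_hat, y_star, C=0.01):
    """w: dict mapping class -> defaultdict(float); updated in place on a mistake."""
    if y_hat == y_star:
        return                                         # correct: no change
    tau = (dot(w[y_hat], features) - dot(w[y_star], features) + 1.0) \
          / (2.0 * dot(features, features))
    tau = min(tau, C)                                  # cap the step size at C
    for name, value in features.items():
        w[y_star][name] += tau * value                 # raise score of right answer
        w[y_hat][name] -= tau * value                  # lower score of wrong answer
```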
Linear Separators
Which of these linear separators is optimal?