Lecture IV: MLSC - Dr. Sethu Vijayakumar (2004)
Overview
Internal models and function approximation
Cost Function & Optimization
Generalization, Overfitting & Bias-Variance Dilemma
Figure: learning an internal model, in which the mapping $y = f(x_1, x_2, x_3)$ from inputs $x_1, x_2, x_3$ to outputs is learned from sensory feedback.
Figure: data pairs $\{x_1, y_1\}, \{x_2, y_2\}, \dots, \{x_n, y_n\}$ are fed to a regression / function approximation module, which learns a mapping $f : \mathbb{R}^N \to \mathbb{R}^M$ from inputs to outputs.
1. Data: $D = \{X, \mathbf{t}\} = \{x_i, t_i\}_{i=1}^{N}$
2. Model: $y_i = g(x_i \mid \theta)$, where $\theta$ denotes the parameters
3. Cost/loss function: $L(\theta \mid D)$
4. Optimization procedure: $\theta^* = \arg\min_\theta L(\theta \mid D)$
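The four ingredients above can be put together in a few lines. The sketch below is my own toy illustration, not from the lecture: a synthetic dataset, a linear model $g(x \mid \theta)$, a squared-error loss, and a generic off-the-shelf optimizer (scipy.optimize.minimize) to obtain $\theta^*$.

```python
import numpy as np
from scipy.optimize import minimize

# 1. Data: D = {(x_i, t_i)}, here a noisy linear relationship (toy choice)
rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=50)
t = 2.0 * X - 0.5 + 0.1 * rng.standard_normal(50)

# 2. Model: g(x | theta), here a line with theta = (w0, w1)
def g(x, theta):
    w0, w1 = theta
    return w0 + w1 * x

# 3. Cost/loss function: L(theta | D), here the squared error on the data
def L(theta):
    return np.sum((t - g(X, theta)) ** 2)

# 4. Optimization procedure: theta* = arg min_theta L(theta | D)
result = minimize(L, x0=np.zeros(2))
print("theta* =", result.x)  # should land close to (-0.5, 2.0)
```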
Cost/Loss Functions (I)

For classification with $y = \pm 1$, prediction $f(x)$, and class prediction $\operatorname{sgn}(f(x))$:

Misclassification:   $I(\operatorname{sgn}(f(x)) \neq y)$
Exponential:         $\exp(-y f(x))$
Binomial deviance:   $\log(1 + \exp(-2 y f(x)))$
Squared error:       $(y - f(x))^2$
Support vector:      $(1 - y f(x)) \, I(y f(x) < 1)$

Here, $I(x) = 1$ if $x$ is TRUE, and $0$ otherwise.
The exponential loss concentrates much more heavily on points with large negative margins, while the binomial deviance spreads its influence over all of the data. Hence, binomial deviance is more robust in noise-prone situations.
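To see the claim about margins concretely, a small sketch (my own, with assumed function names) evaluates each loss as a function of the margin $m = y f(x)$:

```python
import numpy as np

# Classification losses as a function of the margin m = y * f(x)
def misclassification(m):
    return (m <= 0).astype(float)       # I(sgn(f) != y)

def exponential(m):
    return np.exp(-m)

def binomial_deviance(m):
    return np.log(1.0 + np.exp(-2.0 * m))

def squared_error(m):
    return (1.0 - m) ** 2               # (y - f(x))^2 = (1 - yf)^2 when y = +/-1

def support_vector(m):
    return np.maximum(0.0, 1.0 - m)     # hinge loss, zero once yf >= 1

# At a large negative margin the exponential loss dwarfs the binomial deviance,
# which is why the latter is less sensitive to badly misclassified (noisy) points.
m = -4.0
print(exponential(m), binomial_deviance(m))  # ~54.6 vs ~8.0
```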
Cost/Loss Functions (II)

For regression:

Squared error loss:   $[y - f(x)]^2$
Absolute error loss:  $|y - f(x)|$
Huber loss:           $[y - f(x)]^2$ for $|y - f(x)| \le \delta$, and $2\delta \big( |y - f(x)| - \delta/2 \big)$ otherwise
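A minimal numpy sketch of the Huber loss (the function name and the default value of the threshold parameter delta are my own choices):

```python
import numpy as np

def huber_loss(y, f, delta=1.0):
    """Huber loss: quadratic for small residuals, linear for large ones."""
    r = np.abs(y - f)
    quadratic = r ** 2                        # used when |y - f(x)| <= delta
    linear = 2.0 * delta * (r - delta / 2.0)  # used otherwise
    return np.where(r <= delta, quadratic, linear)

# A single large outlier is penalized far less than under the squared error loss:
print(huber_loss(10.0, 0.0), (10.0 - 0.0) ** 2)  # 19.0 vs 100.0
```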
Cost Functions

True (expected) cost:  $L(\theta) = \frac{1}{2} \int \big( f(x) - f(x, \theta) \big)^2 \, dx$

Empirical cost:        $L_{\mathrm{emp}}(\theta) = \frac{1}{2N} \sum_{i=1}^{N} \big( y_i - f(x_i, \theta) \big)^2$
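The difference between the two costs can be made tangible with a toy sketch (mine, not from the slides): the empirical cost is computed on a handful of samples, while the true cost is approximated by averaging over a dense grid. The target function, model, and parameter value below are arbitrary choices.

```python
import numpy as np

def f_true(x):                 # the "unknown" target function (toy choice)
    return np.sin(2.0 * np.pi * x)

def f_model(x, theta):         # a crude one-parameter model f(x, theta)
    return theta * x

rng = np.random.default_rng(1)
x_train = rng.uniform(0.0, 1.0, size=10)
y_train = f_true(x_train)
theta = 0.5

# Empirical cost: L_emp(theta) = 1/(2N) * sum_i (y_i - f(x_i, theta))^2
N = x_train.size
L_emp = np.sum((y_train - f_model(x_train, theta)) ** 2) / (2.0 * N)

# True cost: L(theta) = 1/2 * integral of (f(x) - f(x, theta))^2 dx over [0, 1],
# approximated here by an average over a dense grid
x_grid = np.linspace(0.0, 1.0, 10_000)
L_true = 0.5 * np.mean((f_true(x_grid) - f_model(x_grid, theta)) ** 2)

print(L_emp, L_true)  # close, but not equal: the sample only approximates the integral
```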
Overfitting
The tendency of a learning system (typically one with too many open parameters) to concentrate on the idiosyncrasies of the training data and noise rather than capture the essential features of the data-generating mechanism.
$y = f(x, \mathbf{w}) = w_0 + w_1 x + w_2 x^2 + \dots + w_n x^n = \mathbf{w}^T \mathbf{x}$, where $\mathbf{x} = [1, x, x^2, \dots, x^n]^T$.

How many inputs (i.e., what degree of polynomial) should be used to fit the data?
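In this formulation the fit reduces to linear least squares on a design matrix whose rows are $[1, x, x^2, \dots, x^n]$. A sketch (my own; the data range, noise level, and chosen degree are arbitrary assumptions, and the data are a noisy sample of the function used later in these slides):

```python
import numpy as np

def design_matrix(x, degree):
    # rows are [1, x, x^2, ..., x^degree]
    return np.vander(x, N=degree + 1, increasing=True)

def fit_polynomial(x, y, degree):
    X = design_matrix(x, degree)
    w, *_ = np.linalg.lstsq(X, y, rcond=None)  # least-squares estimate of w
    return w

rng = np.random.default_rng(2)
x = rng.uniform(-2.0, 2.0, size=30)
y = x + 2.0 * np.exp(-16.0 * x ** 2) + 0.1 * rng.standard_normal(30)

w = fit_polynomial(x, y, degree=5)
y_hat = design_matrix(x, 5) @ w            # predictions w^T x
```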
Overfitting (contd)
A popular error criterion is the mean squared error (MSE),

$J = \frac{1}{N} \sum_{i=1}^{N} (y_i - \hat{y}_i)^2,$

or the normalized mean squared error (nMSE),

$J = \frac{1}{N \sigma_y^2} \sum_{i=1}^{N} (y_i - \hat{y}_i)^2.$
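As a sketch, both criteria are one-liners in numpy (the function names are mine):

```python
import numpy as np

def mse(y, y_hat):
    return np.mean((y - y_hat) ** 2)

def nmse(y, y_hat):
    # Normalizing by the variance of the targets makes the criterion scale-free:
    # nmse = 1 means the model is no better than predicting mean(y).
    return mse(y, y_hat) / np.var(y)
```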
Figure: training data generated from the target function $y = x + 2 \exp(-16 x^2)$.
Figure: polynomial fits of increasing degree to the data (left) and the resulting MSE as a function of the degree of the polynomial (right).
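The kind of experiment behind these plots can be sketched as follows (my reconstruction; the noise level, sample size, and test grid are assumptions): fit polynomials of increasing degree to noisy samples of $y = x + 2 \exp(-16 x^2)$ and track the test MSE.

```python
import numpy as np

def target(x):
    return x + 2.0 * np.exp(-16.0 * x ** 2)

rng = np.random.default_rng(3)
x_train = rng.uniform(-2.0, 2.0, size=30)
y_train = target(x_train) + 0.2 * rng.standard_normal(x_train.size)
x_test = np.linspace(-2.0, 2.0, 200)     # clean grid for evaluating the fit
y_test = target(x_test)

for degree in range(1, 16):
    X_train = np.vander(x_train, degree + 1, increasing=True)
    X_test = np.vander(x_test, degree + 1, increasing=True)
    w, *_ = np.linalg.lstsq(X_train, y_train, rcond=None)
    test_mse = np.mean((y_test - X_test @ w) ** 2)
    print(degree, round(test_mse, 4))
# The test MSE typically drops at first and then rises again as high-degree
# polynomials begin to fit the noise rather than the underlying function.
```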
Bias-Variance Dilemma

Too few features are bad, and too many features are bad; thus, there should be an optimum.

A closer look at the MSE criterion, $J = \frac{1}{N} \sum_{i=1}^{N} (t_i - y_i)^2$, where $y_i$ is the model's prediction at $x_i$. The goal is to achieve $\min \big( E\{J\} \big)$.
$E\{J\} = E\Big\{ \frac{1}{N} \sum_{i=1}^{N} (t_i - y_i)^2 \Big\} = \frac{1}{N} \sum_{i=1}^{N} E\big\{ (t_i - y_i)^2 \big\} = \frac{1}{N} \sum_{i=1}^{N} E\{J_i\}$

$E\{J_i\} = \sigma^2 + \big( E\{y_i\} - f(x_i) \big)^2 + E\big\{ (y_i - E\{y_i\})^2 \big\} = \mathrm{var}(\text{noise}) + \mathrm{bias}^2 + \mathrm{var}(\text{estimate})$
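A Monte Carlo sketch of this decomposition (my construction; the true function, noise level, and estimator are arbitrary choices): repeatedly draw noisy training sets, refit the estimator, and compare the average squared error at one query point with $\sigma^2 + \mathrm{bias}^2 + \mathrm{var}(\text{estimate})$.

```python
import numpy as np

def f(x):                         # true underlying function
    return np.sin(x)

rng = np.random.default_rng(4)
sigma = 0.3                       # standard deviation of the target noise
x_query = 1.0                     # input at which E{J_i} is examined
degree, n_datasets, n_samples = 3, 2000, 20

preds, sq_errors = [], []
for _ in range(n_datasets):
    x = rng.uniform(0.0, 2.0, size=n_samples)
    t = f(x) + sigma * rng.standard_normal(n_samples)
    X = np.vander(x, degree + 1, increasing=True)
    w, *_ = np.linalg.lstsq(X, t, rcond=None)             # polynomial estimator
    y_q = (np.vander(np.array([x_query]), degree + 1, increasing=True) @ w)[0]
    t_q = f(x_query) + sigma * rng.standard_normal()       # fresh noisy target
    preds.append(y_q)
    sq_errors.append((t_q - y_q) ** 2)

preds = np.array(preds)
bias_sq = (preds.mean() - f(x_query)) ** 2
var_estimate = preds.var()
# Left and right sides of E{J_i} = sigma^2 + bias^2 + var(estimate) should agree.
print(np.mean(sq_errors), sigma ** 2 + bias_sq + var_estimate)
```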
Usually, if we try to reduce the bias of a model, its variance increases, and vice versa, which creates the dilemma of making an optimal choice.