
Lecture IV: Learning for Control

- Inference with Data

Overview
Internal models and function approximation
Cost Function & Optimization
Generalization, Overfitting & Bias-Variance Dilemma


Function Approximation for control

    q̈ = f(q, q̇, u)

(the forward dynamics: accelerations as a function of the state and the motor command u)

[Figure: control loop with sensory feedback]

Learning Internal Models or Control Policies is essentially performing function approximation. It can be approached via:
- Supervised Learning
- Adaptive Control
- Dynamic Programming
- Reinforcement Learning
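As a concrete (purely illustrative) example of such a mapping, consider the forward dynamics of a single damped pendulum, written as a short Python/NumPy sketch. The physical constants, the damping term and the function name are my own assumptions, not taken from the lecture:

```python
import numpy as np

def pendulum_forward_dynamics(q, q_dot, u, m=1.0, l=1.0, b=0.1, g=9.81):
    """Forward dynamics q_ddot = f(q, q_dot, u) of a damped pendulum.

    q     : joint angle [rad]
    q_dot : joint velocity [rad/s]
    u     : applied motor torque [Nm]
    """
    # m*l^2 * q_ddot = u - b*q_dot - m*g*l*sin(q)
    return (u - b * q_dot - m * g * l * np.sin(q)) / (m * l ** 2)

# A learned internal model approximates exactly this kind of mapping
# (q, q_dot, u) -> q_ddot from observed samples, without knowing the physics.
print(pendulum_forward_dynamics(q=0.3, q_dot=0.0, u=0.5))
```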

Types of internal models


Learn these models from data or observations of input-output pairs.

[Figure reproduced from Wolpert & Ghahramani, Nature Neuroscience (2000)]

Learning as a function approx. problem


[Figure: observed inputs (x1, x2, x3) are mapped by learning/regression to outputs (y1, y2, y3)]

Data (input-output pairs):
    {x_1, y_1}
    {x_2, y_2}
    ...
    {x_n, y_n}

Function Approximation / Regression: learn a mapping from Input to Output

    f : R^N → R^M
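A minimal sketch (my own illustration, not part of the lecture) of learning as regression: given observed pairs {x_i, y_i}, fit a mapping from R^2 to R^1, here a linear model solved by least squares. The target function, noise level and sample size are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(0)

# Unknown target mapping f : R^2 -> R^1 (used here only to generate data)
def f_true(X):
    return np.sin(X[:, :1]) + 0.5 * X[:, 1:2]

N = 100
X = rng.uniform(-2, 2, size=(N, 2))               # inputs  x_i in R^2
Y = f_true(X) + 0.05 * rng.normal(size=(N, 1))    # noisy outputs y_i in R^1

# Function approximation: fit a linear map y ~ W^T [x; 1] by least squares
X_aug = np.hstack([X, np.ones((N, 1))])           # append a bias column
W, *_ = np.linalg.lstsq(X_aug, Y, rcond=None)

# Predict the output at a novel input (last entry 1.0 is the bias term)
x_new = np.array([[0.5, -1.0, 1.0]])
print(x_new @ W)
```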

Data and Inference


Training Data:

    D = {X, t} = {x_i, t_i}_{i=1}^N

The outputs t_i (targets) can be:
- true/false (Concept Learning),
- class labels (Classification), or
- real numbers (Regression).

Data Generating Process:

    y_i = f(x_i, z_i)

where z_i is a hidden variable, i.e., a variable that cannot be directly measured.

Observed data are contaminated by noise:

    t_i = y_i + ε, where ε is the noise.

Modeling the data:

    ŷ_i = g(x_i | θ), where θ = model parameters.
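To make the notation concrete, here is a small (purely illustrative) simulation of the data-generating process: a hidden variable z_i influences the true output y_i, only noisy targets t_i are observed, and the model g(x | θ) can only depend on the measured input x_i. The particular functions and noise levels below are arbitrary assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)
N = 200

x = rng.uniform(0, 1, N)             # measured inputs x_i
z = rng.normal(0, 0.2, N)            # hidden variables z_i (cannot be measured)
y = np.exp(-3 * x) + z * x           # data generating process y_i = f(x_i, z_i)
t = y + rng.normal(0, 0.05, N)       # observed targets t_i = y_i + noise

# Model of the data: g(x | theta), here a straight line with theta = (theta_0, theta_1);
# it can only use the measured input x, not the hidden variable z.
def g(x, theta):
    return theta[0] + theta[1] * x
```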

Machine Learning Design Issues


1. Choosing the Training Data (Active Learning)
   - Decide on the type of data (reward, class or real values)
   - Sampling: to get a representative distribution and to select informative data

2. Model Selection or Target Representation
   - How to choose the right model g(x | θ)?

3. Measure of Distance (Error/Loss function) d(.)

       L(θ | D) = Σ_i d(y_i, ŷ_i) = Σ_i d(y_i, g(x_i | θ))

4. Optimization Procedure (a code sketch follows this list)

       θ* = argmin_θ L(θ | D)
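A sketch of design steps 3 and 4 for a toy linear model: a squared-error loss L(θ | D) and a plain gradient-descent optimization of θ. The data, the model g(x | θ), the learning rate and the iteration count are illustrative assumptions, not from the slides:

```python
import numpy as np

rng = np.random.default_rng(2)

# Training data D = {x_i, t_i} from an arbitrary linear process plus noise
x = rng.uniform(-1, 1, 100)
t = 2.0 * x - 0.5 + rng.normal(0, 0.1, 100)

def g(x, theta):                     # step 2: the chosen model g(x | theta)
    return theta[0] + theta[1] * x

def loss(theta):                     # step 3: L(theta | D) = sum_i d(t_i, g(x_i | theta))
    return np.sum((t - g(x, theta)) ** 2)

def grad(theta):                     # gradient of the squared-error loss
    r = t - g(x, theta)
    return np.array([-2.0 * np.sum(r), -2.0 * np.sum(r * x)])

# step 4: theta* = argmin_theta L(theta | D), here by plain gradient descent
theta = np.zeros(2)
for _ in range(500):
    theta -= 1e-3 * grad(theta)

print(theta, loss(theta))
```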

Cost/Loss Functions (I)
For Classification: y ∈ {-1, +1}, prediction = f(x), class prediction = sgn(f(x))

    Misclassification:    I(sgn(f) ≠ y)
    Exponential:          exp(-y f)
    Binomial Deviance:    log(1 + exp(-2 y f))
    Squared Error:        (y - f(x))^2
    Support Vector:       (1 - y f) I(y f < 1)    [hinge loss (1 - y f)_+]

Here, I(x) = 1 if x is TRUE, and 0 otherwise.

Exponential error loss concentrates much more on points with large negative margins,
while Binomial Deviance spreads the influence over all data. Hence, Binomial Deviance
is more robust in noise-prone situations.
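The classification losses above depend on the data only through the margin y·f(x). A small sketch (my own, not from the slides) evaluates them on a grid of margins, with the support-vector loss written in its usual hinge form (1 − yf)_+:

```python
import numpy as np

def misclassification(margin):       # I(sgn(f) != y)
    return (margin <= 0).astype(float)

def exponential(margin):             # exp(-y f)
    return np.exp(-margin)

def binomial_deviance(margin):       # log(1 + exp(-2 y f))
    return np.log1p(np.exp(-2 * margin))

def squared_error(margin):           # (y - f)^2 = (1 - y f)^2 for y in {-1, +1}
    return (1 - margin) ** 2

def hinge(margin):                   # support vector (hinge) loss (1 - y f)_+
    return np.maximum(0.0, 1 - margin)

margins = np.linspace(-2, 2, 9)      # margin y*f(x); negative means misclassified
for name, fn in [("misclassification", misclassification), ("exponential", exponential),
                 ("binomial deviance", binomial_deviance), ("squared error", squared_error),
                 ("hinge", hinge)]:
    print(f"{name:18s}", np.round(fn(margins), 2))
```

Running this shows the exponential loss blowing up for large negative margins, which is the numerical counterpart of the robustness remark above.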

Cost/Loss Functions (II)
For Regression:

    Squared Error Loss:   [y - f(x)]^2
    Absolute Error Loss:  |y - f(x)|
    Huber Loss:           [y - f(x)]^2                    for |y - f(x)| ≤ δ
                          2δ(|y - f(x)| - δ/2)            otherwise

[Figure: the three cost functions plotted against the residual y - f(x)]

Huber Loss combines the good properties of squared error loss near zero and absolute
error loss when the error is large.
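A sketch of the three regression losses as functions of the residual y − f(x); the Huber threshold δ = 1.0 is an arbitrary choice for illustration:

```python
import numpy as np

def squared_loss(r):                 # [y - f(x)]^2
    return r ** 2

def absolute_loss(r):                # |y - f(x)|
    return np.abs(r)

def huber_loss(r, delta=1.0):
    # quadratic for |r| <= delta, linear 2*delta*(|r| - delta/2) beyond
    return np.where(np.abs(r) <= delta,
                    r ** 2,
                    2 * delta * (np.abs(r) - delta / 2))

residuals = np.array([-3.0, -1.0, -0.2, 0.0, 0.5, 2.0])
print(squared_loss(residuals))
print(absolute_loss(residuals))
print(huber_loss(residuals))
```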

Empirical Error vs. Generalization Error


Generalization
The ability of a learning system to not only memorize the training data but also to
predict reasonably well for novel inputs, based on the training examples.

True Error / Generalization Error:

    L(θ) = (1/2) ∫ ( f(x) - f̂(x, θ) )^2 dx

Empirical (Training) Error:

    L_emp(θ) = (1/2N) Σ_{i=1}^N ( f(x_i) - f̂(x_i, θ) )^2 = (1/2N) Σ_{i=1}^N ( y_i - ŷ_i )^2

Usually, we only have access to the empirical (training) error, since we do not know
the true generating function. However, performing optimization based only on the
empirical error does not ensure good generalization: we will see an example of
overfitting soon!
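A sketch (illustrative; the target function, sample size and polynomial degree are my own choices) of the gap between the two errors: the empirical error on the training sample versus a Monte Carlo estimate of the generalization error over a dense grid of novel inputs:

```python
import numpy as np

rng = np.random.default_rng(3)

def f(x):                                   # true generating function (unknown in practice)
    return np.sin(2 * np.pi * x)

N = 20
x_train = rng.uniform(0, 1, N)
y_train = f(x_train) + 0.1 * rng.normal(size=N)

w = np.polyfit(x_train, y_train, deg=9)     # fit a flexible (degree-9) polynomial

def f_hat(x):
    return np.polyval(w, x)

# Empirical (training) error, with the 1/2 convention used above
L_emp = np.mean((y_train - f_hat(x_train)) ** 2) / 2

# Monte Carlo estimate of the generalization error over the whole input range
x_dense = np.linspace(0, 1, 2000)
L_gen = np.mean((f(x_dense) - f_hat(x_dense)) ** 2) / 2

print(f"empirical error {L_emp:.4f}   generalization error {L_gen:.4f}")
```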

Overfitting
Overfitting
The tendency of the learning system (typically one with too many open parameters) to
concentrate on the idiosyncrasies of the training data and noise, rather than capturing
the essential features of the data-generating mechanism.

Example from regression: Polynomial fitting

    y = f(x, w) = w_0 + w_1 x + w_2 x^2 + ... + w_n x^n = w^T x,
    where x = [1, x, x^2, ..., x^n]^T

How many inputs (i.e., what degree of polynomial) should be used to fit the data?
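A sketch of polynomial fitting in the form y = w^T x with x = [1, x, ..., x^n]^T, using the slide's example target y = x + 2 exp(−16 x²); the noise level, sample size and the degrees tried are my own assumptions:

```python
import numpy as np

rng = np.random.default_rng(4)

def target(x):                       # example target function from the slide
    return x + 2 * np.exp(-16 * x ** 2)

N = 30
x = rng.uniform(-2, 2, N)
y = target(x) + 0.1 * rng.normal(size=N)

def design_matrix(x, degree):
    # rows are the expanded inputs [1, x, x^2, ..., x^degree]
    return np.vander(x, degree + 1, increasing=True)

for degree in (1, 3, 9, 15):
    X = design_matrix(x, degree)
    w, *_ = np.linalg.lstsq(X, y, rcond=None)    # least-squares fit of w
    train_mse = np.mean((y - X @ w) ** 2)
    print(f"degree {degree:2d}   training MSE {train_mse:.4f}")
```

The training error keeps shrinking as the degree grows, which is exactly the trap discussed on the next slide.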

Overfitting (contd)
A popular error criterion is the Mean Squared Error (MSE) criterion

    J = (1/N) Σ_{i=1}^N ( y_i - ŷ_i )^2

or the Normalized Mean Squared Error (NMSE)

    J = (1/(N σ_y^2)) Σ_{i=1}^N ( y_i - ŷ_i )^2

The normalized MSE is a measure of how much of the variance in the output data was
explained: an NMSE close to 0 means almost all of the variance is explained, while an
NMSE of 1 means the model does no better than predicting the mean output.

Example target function:

    y = x + 2 exp(-16 x^2)

[Figure: the example target function plotted over x in [-2, 2]]

Overfitting with Polynomials


[Figure: (left) polynomial fits of increasing degree to the training data; (right) MSE on
the training set and on the test set as a function of the degree of the polynomial]

Observation: just concentrating on reducing the training error results in worse
generalization on novel data, due to overfitting.
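A sketch reproducing the qualitative behaviour in the figure: the training error keeps falling as the polynomial degree grows, while the test error eventually rises again. Errors are normalized by the output variance as in the NMSE criterion; the data settings are illustrative:

```python
import numpy as np

rng = np.random.default_rng(5)

def target(x):
    return x + 2 * np.exp(-16 * x ** 2)

def make_data(n):
    x = rng.uniform(-2, 2, n)
    return x, target(x) + 0.2 * rng.normal(size=n)

x_tr, y_tr = make_data(30)           # small training set
x_te, y_te = make_data(500)          # large independent test set

def nmse(y, y_hat):
    # normalized MSE: mean squared error divided by the output variance
    return np.mean((y - y_hat) ** 2) / np.var(y)

for degree in range(1, 16):
    w = np.polyfit(x_tr, y_tr, degree)
    tr = nmse(y_tr, np.polyval(w, x_tr))
    te = nmse(y_te, np.polyval(w, x_te))
    print(f"degree {degree:2d}   train NMSE {tr:.3f}   test NMSE {te:.3f}")
```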

Bias-Variance Dilemma
Too few features are bad, too many are bad; thus, there should be an optimum.

A closer look at the MSE criterion:

    J = (1/N) Σ_{i=1}^N ( t_i - ŷ_i )^2,   where ŷ_i = f̂(x_i)

What we actually want to minimize is the generalization (true) error, but we do not
have the original function, and hence all we have access to is the training (empirical)
error! The next best thing to do: minimize J in expectation, i.e., over infinitely many
data sets:

    min E{J}

Bias-Variance Dilemma (II)


    E{J} = E{ (1/N) Σ_{i=1}^N ( t_i - ŷ_i )^2 }
         = (1/N) Σ_{i=1}^N E{ ( t_i - ŷ_i )^2 }
         = (1/N) Σ_{i=1}^N E{ J_i }

Bias-Variance Decomposition of the Expected Error:

    E{J_i} = σ^2 + ( E{ŷ_i} - f(x_i) )^2 + E{ ( ŷ_i - E{ŷ_i} )^2 }
           = var(noise) + bias^2 + var(estimate)

Note: For the derivation of the decomposition, refer to the class handout.

Usually, if we try to reduce the bias of a model, its variance increases, and vice
versa, resulting in the dilemma of the optimal choice.
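A sketch estimating the decomposition empirically: the same polynomial model is refit on many independently drawn data sets, and the squared bias and the variance of the resulting estimates ŷ_i are measured against the known target. The target function, noise level and model degree are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(6)

def target(x):
    return x + 2 * np.exp(-16 * x ** 2)

sigma = 0.2                          # noise standard deviation
x = np.linspace(-2, 2, 30)           # fixed inputs x_i
degree = 9
n_datasets = 500

preds = np.empty((n_datasets, x.size))
for k in range(n_datasets):
    t = target(x) + sigma * rng.normal(size=x.size)   # a fresh data set each time
    w = np.polyfit(x, t, degree)
    preds[k] = np.polyval(w, x)                        # estimates y_hat_i for this data set

bias_sq = np.mean((preds.mean(axis=0) - target(x)) ** 2)   # (E{y_hat_i} - f(x_i))^2, averaged over i
variance = np.mean(preds.var(axis=0))                      # E{(y_hat_i - E{y_hat_i})^2}, averaged over i
noise = sigma ** 2                                         # var(noise)

print(f"noise {noise:.4f}   bias^2 {bias_sq:.4f}   variance {variance:.4f}")
print(f"sum = {noise + bias_sq + variance:.4f}  (approximates the expected error E{{J}})")
```

Increasing the polynomial degree in this sketch lowers the squared bias but inflates the variance, which is the dilemma stated above.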
