
Chapter 4: Linear Models for Classification

Grit Hein & Susanne Leiberg


Goal

• Our goal is to “classify” input vectors x into one of K classes. Similar to regression, but the output variable is discrete.

• the input space is divided into decision regions whose boundaries are called decision boundaries or decision surfaces

• linear models for classification: the decision boundaries are linear functions of the input vector x
(Figure: two classes plotted against the features color and diameter, with the weight vector, the discriminant dimension, and a linear decision boundary between the decision regions.)

A classifier seeks an ‘optimal’ separation of classes (e.g., apples and oranges) by finding a set of weights for combining features (e.g., color and diameter).
Classifier

• Discriminant function (no computation of posterior probabilities, i.e. the probability of a certain class given the data)
  • directly maps each x onto a class label
  • Tools: Least Squares Classification, Fisher’s Linear Discriminant

• Probabilistic Generative Models (computation of posterior probabilities)
  • model class priors p(Ck) and class-conditional densities p(x⎮Ck)
  • use them to compute posterior probabilities p(Ck⎮x)
  • Tools: Bayes’ theorem

• Probabilistic Discriminative Models (computation of posterior probabilities)
  • model posterior probabilities p(Ck⎮x) directly
  • Tools: Logistic Regression
Pros and Cons of the three approaches

Discriminant Functions are the simplest and most intuitive approach to classifying data, but they do not allow one to

• compensate for class priors (e.g., class 1 is a very rare disease)

• minimize risk (e.g., classifying a sick person as healthy is more costly than classifying a healthy person as sick)

• implement a reject option (e.g., a person cannot be classified as sick or healthy with sufficiently high probability)

Probabilistic Generative and Discriminative Models can do all of that


Pros and Cons of the three approaches

• Generative models provide a probabilistic model of all variables, which allows one to synthesize new data

• but generating all this information is computationally expensive and complex, and it is not needed for a simple classification decision

• Discriminative models provide a probabilistic model of the target variable (the classes) conditional on the observed variables

• this is usually sufficient for making a well-informed classification decision, without the disadvantages of the simple Discriminant Functions
Classifier: Discriminant function (no computation of posterior probabilities, i.e. the probability of a certain class given the data)

• directly maps each x onto a class label

• Tools: Least Squares Classification, Fisher’s Linear Discriminant
Discriminant functions

• are functions that are optimized to assign input x to one of K classes

y(x) = wᵀx + w0

• w determines the orientation of the decision boundary

• w0 determines the location of the decision boundary

(Figure: two decision regions in a two-feature space, separated by a linear decision boundary.)
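To make the notation concrete, here is a minimal sketch (not from the original slides; the weight values are arbitrary) of evaluating y(x) = wᵀx + w0 and assigning a class by the sign of y(x):

import numpy as np

w = np.array([1.5, -0.8])   # weight vector: sets the orientation of the boundary
w0 = 0.3                    # bias: sets the location of the boundary

def y(x):
    """Linear discriminant y(x) = w^T x + w0."""
    return w @ x + w0

def classify(x):
    """Assign x to C1 if y(x) >= 0, otherwise to C2."""
    return "C1" if y(x) >= 0 else "C2"

print(classify(np.array([0.2, 1.0])))   # class label depends on the chosen w, w0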
Discriminant functions - How to determine parameters?

1. Least Squares for Classification

• General Principle: Minimize the squared distance (residual) between the observed data points and their predictions by a model function
Discriminant functions - How to determine parameters?

• In the context of classification: find the parameters that minimize the squared distance (residual) between the data points and the decision boundary (see the sketch below)
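A hedged sketch of least-squares classification (assumed details: NumPy, synthetic 2-D data, 1-of-K target vectors; not taken from the slides): fit one linear function per class by least squares, then assign each input to the class with the largest output.

import numpy as np

rng = np.random.default_rng(0)

# two synthetic 2-D classes, 50 points each
X = np.vstack([rng.normal([0.0, 0.0], 1.0, (50, 2)),
               rng.normal([3.0, 3.0], 1.0, (50, 2))])
labels = np.repeat([0, 1], 50)

T = np.eye(2)[labels]                          # 1-of-K (one-hot) target matrix
X_tilde = np.hstack([np.ones((100, 1)), X])    # prepend a bias feature x0 = 1

# closed-form least-squares solution for the weight matrix
W, *_ = np.linalg.lstsq(X_tilde, T, rcond=None)

def predict(x):
    """Evaluate y_k(x) for every class and return the class with the largest value."""
    y = W.T @ np.concatenate([[1.0], x])
    return int(np.argmax(y))

print(predict(np.array([0.5, 0.2])), predict(np.array([2.8, 3.1])))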
Discriminant functions - How to determine parameters?

• Problem: least squares is sensitive to outliers; the distance between the outliers and the discriminant function is also minimized --> this can shift the function in a way that leads to misclassifications

(Figure: the same data classified with least squares and with logistic regression; the outliers pull the least-squares boundary away and cause misclassifications, while the logistic-regression boundary is much less affected.)
Discriminant functions - How to determine parameters?

2. Fisher’s Linear Discriminant

• General Principle: Maximize the distance between the means of different classes while minimizing the variance within each class

(Figure: left panel, a projection that only maximizes the between-class variance; right panel, a projection that maximizes the between-class variance and minimizes the within-class variance, giving better class separation.)
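A minimal sketch of Fisher's criterion (assumptions: NumPy and synthetic two-class data, not from the slides): the projection direction w is proportional to S_W⁻¹(m2 - m1), where S_W is the within-class scatter matrix and m1, m2 are the class means.

import numpy as np

rng = np.random.default_rng(1)
X1 = rng.normal([0.0, 0.0], [1.0, 2.0], (100, 2))   # class 1 samples
X2 = rng.normal([3.0, 2.0], [1.0, 2.0], (100, 2))   # class 2 samples

m1, m2 = X1.mean(axis=0), X2.mean(axis=0)           # class means

# within-class scatter matrix S_W
S_W = (X1 - m1).T @ (X1 - m1) + (X2 - m2).T @ (X2 - m2)

# Fisher direction: w proportional to S_W^{-1} (m2 - m1)
w = np.linalg.solve(S_W, m2 - m1)
w /= np.linalg.norm(w)

# projecting the data onto w gives well-separated 1-D class distributions
print("projected means:", (X1 @ w).mean(), (X2 @ w).mean())
print("projected stds: ", (X1 @ w).std(), (X2 @ w).std())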
Probabilistic Generative Models

• model class-conditional densities (p(x⎮Ck)) and class priors (p(Ck))

• use them to compute posterior class probabilities (p(Ck⎮x)) according to Bayes’ theorem

• the posterior probability can be written as a logistic sigmoid function of the log odds

• the inverse of the sigmoid function is the logit function, which is the log of the ratio of the posterior probabilities for the two classes:

ln[p(C1⎮x)/p(C2⎮x)] --> log odds
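A hedged sketch of these steps (the Gaussian class-conditional densities and the prior values are assumptions chosen for illustration, using scipy.stats): Bayes' theorem gives the posterior, which equals a logistic sigmoid of the log odds a = ln[p(x⎮C1)p(C1) / (p(x⎮C2)p(C2))].

import numpy as np
from scipy.stats import multivariate_normal

priors = {1: 0.3, 2: 0.7}                                  # class priors p(Ck)
cond = {1: multivariate_normal([0.0, 0.0], np.eye(2)),     # p(x|C1)
        2: multivariate_normal([2.0, 2.0], np.eye(2))}     # p(x|C2)

def posterior_c1(x):
    """p(C1|x) via Bayes' theorem, written as sigma(a) with a the log odds."""
    a = (np.log(cond[1].pdf(x) * priors[1])
         - np.log(cond[2].pdf(x) * priors[2]))
    return 1.0 / (1.0 + np.exp(-a))                        # logistic sigmoid

print(posterior_c1(np.array([1.0, 1.0])))                  # p(C1|x) for one input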


Probabilistic Discriminative Models - Logistic Regression

• you model the posterior probabilities directly, assuming that they follow a sigmoid-shaped function (without modeling class priors and class-conditional densities)

• the sigmoid-shaped function (σ) is the model function of logistic regression

• first, the inputs are non-linearly transformed using a vector of basis functions ϕ(x) → suitable choices of basis functions can make modeling the posterior probabilities easier

p(C1⎮ϕ) = y(ϕ) = σ(wᵀϕ)

p(C2⎮ϕ) = 1 - p(C1⎮ϕ)
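A small sketch of this model function (the basis functions and weight values are assumptions chosen for illustration):

import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def phi(x):
    """Basis-function vector: here just a bias term plus the raw inputs."""
    return np.concatenate([[1.0], x])

w = np.array([-0.5, 1.2, 0.7])        # weights (illustrative values)

x = np.array([0.4, -0.3])
p_c1 = sigmoid(w @ phi(x))            # p(C1|phi) = sigma(w^T phi)
p_c2 = 1.0 - p_c1                     # p(C2|phi)
print(p_c1, p_c2)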
Probabilistic Discriminative Models - Logistic Regression

• the parameters of the logistic regression model are determined by maximum likelihood estimation

• the maximum likelihood estimates are computed using iterative reweighted least squares (IRLS) → an iterative procedure that minimizes the error function using the Newton-Raphson optimization scheme

• that means that, starting from some initial values, the weights are updated until the likelihood is maximized (sketched below)
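A hedged sketch of the IRLS procedure (assuming NumPy, a design matrix Phi of basis-function values, and a fixed number of Newton-Raphson steps; a real implementation would also check convergence):

import numpy as np

def irls(Phi, t, n_iter=20):
    """Newton-Raphson / IRLS updates for logistic regression weights."""
    w = np.zeros(Phi.shape[1])                  # initial weight values
    for _ in range(n_iter):
        y = 1.0 / (1.0 + np.exp(-(Phi @ w)))    # current predictions sigma(w^T phi)
        R = np.diag(y * (1.0 - y))              # reweighting matrix
        grad = Phi.T @ (y - t)                  # gradient of the error function
        H = Phi.T @ R @ Phi                     # Hessian
        w = w - np.linalg.solve(H, grad)        # Newton-Raphson step
    return w

# tiny synthetic example: noisy 1-D inputs with a bias column
rng = np.random.default_rng(2)
x = rng.normal(size=100)
t = (x + 0.3 * rng.normal(size=100) > 0).astype(float)
Phi = np.column_stack([np.ones_like(x), x])
print(irls(Phi, t))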
Normalizing posterior probabilities

• To compare models and to use posterior probabilities in Bayesian Logistic Regression, it is useful to have the posterior distribution in Gaussian form

• LAPLACE APPROXIMATION is the tool to find a Gaussian approximation to a probability density defined over a set of continuous variables; here it is used to find a Gaussian approximation of the posterior distribution

p(z) = (1/Z) f(z), where Z is the unknown normalization constant

• the goal is to find a Gaussian approximation q(z) centered on the mode of p(z)
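A minimal 1-D sketch of the Laplace approximation (the unnormalized density f(z) below is an arbitrary example, and the curvature is estimated numerically): find the mode z0 of ln f(z), take the negative second derivative A at the mode, and use q(z) = N(z ⎮ z0, 1/A); Z is then approximated by f(z0)·sqrt(2π/A).

import numpy as np
from scipy.optimize import minimize_scalar

def log_f(z):
    """Unnormalized log density (illustrative example)."""
    return -0.5 * z**2 + np.log(1.0 / (1.0 + np.exp(-4.0 * z)))

# 1. find the mode z0 of f(z)
z0 = minimize_scalar(lambda z: -log_f(z)).x

# 2. curvature at the mode: A = -(d^2/dz^2) ln f(z) at z0 (finite differences)
eps = 1e-4
A = -(log_f(z0 + eps) - 2.0 * log_f(z0) + log_f(z0 - eps)) / eps**2

# 3. Gaussian approximation q(z) centered on the mode, and approximate Z
def q(z):
    return np.sqrt(A / (2.0 * np.pi)) * np.exp(-0.5 * A * (z - z0) ** 2)

Z_approx = np.exp(log_f(z0)) * np.sqrt(2.0 * np.pi / A)
print(z0, A, Z_approx)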


How to find the best model? - Bayes Information Criterion (BIC)

• the approximation of the normalization constant Z can be used to obtain an approximation of the model evidence

• Consider a data set D and models {Mi} having parameters {θi}

• For each model, define the likelihood p(D|θi,Mi)

• Introduce a prior over the parameters p(θi|Mi)

• We need the model evidence p(D|Mi) for the various models

• Z is an approximation of the model evidence p(D|Mi)
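A hedged sketch of comparing models with the BIC (the log-likelihood values and parameter counts below are illustrative placeholders, not real results); the BIC approximates the log model evidence ln p(D|Mi) by ln p(D|θ_MAP, Mi) - (M/2) ln N, with M parameters and N data points.

import numpy as np

def bic(log_likelihood_at_map, n_params, n_data):
    """Approximate log model evidence: ln p(D|theta_MAP) - (M/2) ln N."""
    return log_likelihood_at_map - 0.5 * n_params * np.log(n_data)

N = 500                                            # number of data points
models = {"M1": (-310.0, 3),                       # (max log-likelihood, #parameters)
          "M2": (-295.0, 8)}

for name, (loglik, n_params) in models.items():
    print(name, bic(loglik, n_params, N))          # larger value -> preferred model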


Making predictions

• having obtained a Gaussian approximation of the posterior distribution (using the Laplace approximation), you can make predictions for new data using BAYESIAN LOGISTIC REGRESSION

• you use the normalized posterior distribution to arrive at a predictive distribution for the classes given new data

• you obtain this predictive distribution by marginalizing over the parameters with respect to the normalized posterior distribution
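A sketch of this marginalization (all numbers are illustrative assumptions: a posterior mode w_map and covariance S_N as they would come out of the Laplace approximation), using simple Monte Carlo averaging of σ(wᵀϕ) over the approximate posterior:

import numpy as np

rng = np.random.default_rng(3)

w_map = np.array([0.2, 1.1])             # posterior mode (illustrative)
S_N = np.array([[0.30, 0.05],
                [0.05, 0.20]])           # posterior covariance (illustrative)

def predictive_c1(phi, n_samples=10_000):
    """p(C1|phi, D) ~ average of sigma(w^T phi) over w ~ N(w_map, S_N)."""
    w_samples = rng.multivariate_normal(w_map, S_N, size=n_samples)
    return float(np.mean(1.0 / (1.0 + np.exp(-(w_samples @ phi)))))

phi_new = np.array([1.0, 0.5])           # basis-function vector of a new input
print(predictive_c1(phi_new))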


Terminology

• Two classes

• single target variable with binary representation

• t ∈ {0,1}; t = 1 → class C1, t = 0 → class C2

• K > 2 classes

• 1-of-K coding scheme; t is a vector of length K

• e.g., for class C2 with K = 5: t = (0,1,0,0,0)T
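A tiny illustration of the two coding schemes (example values only):

import numpy as np

# two classes: binary target, t = 1 -> class C1, t = 0 -> class C2
t_binary = 1

# K > 2 classes: 1-of-K coding, e.g. class C2 out of K = 5 classes
K, k = 5, 2
t_one_hot = np.eye(K, dtype=int)[k - 1]
print(t_binary, t_one_hot)               # -> 1 [0 1 0 0 0]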
