CS-E4715 Supervised Machine Learning
Lecture 5: Linear classification
Course topics
• Part I: Theory
• Introduction
• Generalization error analysis & PAC learning
• Rademacher Complexity & VC dimension
• Model selection
• Part II: Algorithms and models
• Linear models: perceptron, logistic regession
• Support vector machines
• Kernel methods
• Boosting
• Neural networks (MLPs)
• Part III: Additional topics
• Feature learning, selection and sparsity
• Multi-class classification
• Preference learning, ranking
1
Linear classification
Linear classification
• Input space X ⊂ Rd , each x ∈ X is a d-dimensional real-valued
vector, output space: Y = {−1, +1}
• Training sample S = {(x1 , y1 ), . . . , (xm , ym )} drawn from an
unknown distribution D
• Hypothesis class
Pd d
H = {x 7→ sgn j=1 wj xj + w0 |w ∈ R , w0 ∈ R} consists of
P
d
functions h(x) = sgn j=1 wj xj + w0 that map each example in
one of the two classes
(
+1, a ≥ 0
• sgn (a) = is the sign function
−1 a < 0
2
Linear classifiers
Linear classifiers
d
X
wj xj + w0 = sgn wT x + w0
h(x) = sgn
j=1
have several attractive properties
• They are fast to evaluate and takes small space to store (O(d) time
and space)
• Easy to understand: |wj | shows the importance of variable xj and its
sign tells if the effect is positive or negative
• Linear models have relatively low complexity (e.g. VCdim = d + 1)
so they can be reliably estimated from limited data
Good practise is to try a linear model before something more complicated
3
The geometry of the linear classifier
• The points
{x ∈ X |g (x) = wT x + w0 = 0} define
a hyperplane in Rd , where d is the
number of variables in x
• The hyperplane g (x) = wT x + w0 = 0
splits the input space into two
half-spaces. The linear classifier
predicts +1 for points in the halfspace
{x ∈ X |g (x) = wT x + w0 ≥ 0} and
−1 for points in
{x ∈ X |g (x) = wT x + w0 < 0}
4
The geometry of the linear classifier
• w is the normal vector of the
hyperplane g (x) = wT x + w0 = 0
• The distance of the hyperplane from
the origin
qP is |w0 |/ kwk, where
kwk = 2
j wj denotes the
Euclidean norm
• If w0 < 0 the hyperplane lies in the
direction of w from origin, otherwise
it lies in the direction of −w
5
The geometry of the linear classifier
• The value g (x0 ) tells where x0 lies in
relation to the hyperplane:
• g (x0 ) > 0: x0 lies in the halfspace
that is in the direction of w from
the hyperplane
• g (x0 ) = 0: x0 lies on the hyperplane
• g (x0 ) < 0: x0 lies in the direction of
−w from the hyperplane
• The distance of a point x0 from the
hyperplane g (x) = 0 is |g (x0 )|/ kwk
6
Learning linear classifiers
Change of representation
• Consider the parameters (w, w0 ) of the linear function
g (x) = wT x + w0
• For presentation is is convenient to subsume term w0 into the weight
vector " #
w
w⇐
w0
and augment all inputs with a constant 1:
" #
x
x⇐
1
• The models give the same value for x:
" #T " #
w x
= w T x + w0
w0 1
7
Geometric interpretation
• Geometrically, the hyperplane in the
changed representation goes now
through origin
• The positive points have an acute
angle with w: wT x ≥ 0
• The negative points have an obtuse
angle with w: wT x < 0
8
Checking for prediction errors
• When the labels are Y = {−1, +1} for a training example (x, y ) we
have for g (x) = wT x,
(
y if x is correctly classified
sgn (g (x)) =
−y if x is incorrectly classified
• Alternative we can just multiply with the correct label to check for
misclassification:
(
≥ 0 if x is correctly classified
yg (x) =
< 0 if x is incorrectly classified
9
Margin
• The geometric margin of a labeled
example (x, y ) is given by
γ(x) = yg (x)/ kwk
• It takes into account both the
distance |wT x|/ kwk from the
hyperplane, and whether x is on the
correct side of the hyperplane
• The unnormalized version of the
margin is sometimes called the
functional margin γ(x) = yg (x)
• Often the term margin is used for
both variants, assuming the context
makes clear which one is meant
10
Perceptron
Perceptron
• Perceptron algorithm by Frank
Rosenblatt (1956) is perhaps the first
machine learning algorithm
• Its purpose was to learn a linear
function separating two classes
• It was built in hardware and shown to
be capable of performing rudimentary
pattern recognition tasks
• New York Times in 1958: ”the
embryo of an electronic computer that
[the Navy] expects will be able to
walk, talk, see, write, reproduce itself
Mark I perceptron ca. 1958 (Picture: Wikipedia)
and be conscious of its existence.”
(Source: Wikipedia)
11
The perceptron algorithm
• The perceptron algorithm a learns a hyperplane separating two
classes
g (x) = wT x
• It processes incrementally a set of training examples
• At each step, it finds a training example xi that is incorrectly
classified by the current model
• It updates the model by adding the example to the current weight
vector together with the label: w(t+1) ← w(t) + yi xi
• This process is continued until incorrectly predicted training
examples are not found
12
The perceptron algorithm
Input: Training set S = {(xi , yi )}m d
i=1 , x ∈ R , y ∈ {−1, +1}
(1)
Initialize w ← (0, . . . , 0), t ← 1, stop ← FALSE
repeat
T
if exists i, s.t. yi w(t) xi ≤ 0 then
w(t+1) ← w(t) + yi xi
else
stop ← TRUE
end if
t ←t +1
until stop
13
Understanding the update rule
• Let us examine the update rule
w(t+1) ← w(t) + yi xi
• We can see that the margin of the example (xi , yi ) increases after
the update
T
yi g (t+1) (xi ) = yi w(t+1) xi = yi (w(t) + yi xi )T xi
T 2
= yi w(t) xi + yi2 xT
i xi = yi g
(t)
(xi ) + kxi k
≥ yi g (t) (xi )
• Note that this does not guarantee that yi g (t+1) (xi ) > 0 after the
update, further updates may be required to achieve that
14
Perceptron animation
• Assume w(t) has been found by running the algorithm for t steps
• We notice two misclassified examples
15
Perceptron animation
• Select the misclassified example (φ(xi ), −1)
• Note: φ(xi ) is here some transformation of xi e.g. with some basis
functions but it could be identity φ(x) = x
(τ) T φ
+ w >0
+ φ(x i)
(τ) _
w
+
_
+ _
(τ)T
w φ <0
15
Perceptron animation
• Update the weight vector: w(t+1) = w(t) + yi φ(xi )
+
+ φ(x i)
(τ) _
w
+
(τ) _ φ(x
w i)
_
_
+ _
15
Perceptron animation
• The update tilts the hyperplane to make the example ”more
correct”, i.e. more negative
• We repeat the process by finding the next misclassified example
φ(xi+1 ) and update: w(t+2) = w(t+1) + yi+1 φ(xi+1 )
+
+
(τ+1) _ φ(x )
w i+1 _
+
(τ+1)
w
_
+ φ(x i+1) _
15
Perceptron animation
• Next iteration
+
+
(τ+2)
w _
+
_
+ _
15
Perceptron animation
• Next iteration
+
+
_
+
_
+ _
15
Perceptron animation
• Finally we have found a hyperplane that correctly classify the
training points
• We can stop the iteration and output the final weight vector
+
+
_
+
_
+ _
15
Convergence of the perceptron algorithm
• The perceptron algorithm can be shown to eventually converge to a
consistent hyperplane if the two classes are linearly separable, that
is, if there exists a hyperplane that separates the two classes
• Theorem (Novikoff):
• Let S = {(xi , yi )}m
i=1 be a linearly separable training set.
• Let R = maxxi ∈S kxi k.
• Let there exist a vector w∗ that satisfies kw∗ k = 1 and
yi w∗T xi + bopt ≥ γ for i = 1 . . . , m.
• Then the perceptron algorithm will stop after at most t ≤ ( 2R γ
)2
(t) (t)
iterations and output a weight vector w for which yi w xi ≥ 0 for
all i = 1 . . . , m
16
Convergence of the perceptron algorithm
The number of iterations in the bound t ≤ ( 2R 2
γ ) depend on:
• γ: The largest achievable
geometric margin so that all
training examples have at
least that margin
• R: The smallest radius of the γ
d-dimensional ball that
encloses the training data R
• Intuitively: how large the
||w|| = 1
margin in is relative to the w
distances of the training
points
However, Perceptron algorithm does not stop on a non-separable training
set, since there will always be a misclassified example that causes an
update
17
The loss function of the Perceptron algorithm
It can be shown that the
Perceptron algorithm is using the
following loss:
LPerceptron (y , wT x) = max(0, −y wT x)
• y wT x is the margin
• if y wT x < 0, a loss of
−y wT x is incurred, otherwise
no loss is incurred
18
Convexity of Perceptron loss
A function f : Rn 7→ R is convex if for all x, y , and 0 ≤ θ ≤ 1, we have
f (θx + (1 − θ)y ) ≤ θf (x) + (1 − θ)f (y ).
• Geometrical interpretation:
the graph of a convex
function lies below the line
segment from (x, f (x)) to
(y , f (y ))
• It is easy to see that
Perceptron loss is convex but
zero-one loss is not convex
19
Convexity of Perceptron loss
• The convexity of the Perceptron loss has an important consequence:
every local minimum is also the global minimum
• In principle we can minimize it with incremental updates that
gradually decrease the loss
• In contrast, finding a hyperplane that minimizes the zero-one loss is
computationally hard (NP-hard to minimize training error)
• However, we need better algorithms than the Perceptron, which
terminate when we are close to the optimum
20
Logistic regression
Logistic regression
Logistic regression is a classification technique (despite the name)
• it gets its name from the logistic
function
1 exp(z)
φlogistic (z) = =
1 + exp(−z) 1 + exp(z)
that maps a real valued input z onto
the interval 0 < φlogistic (z) < 1
• The function is an example of
sigmoid (”S” shaped) functions
21
Logistic function: a probabilistic interpretation
• The logistic function φlogistic (z) is the inverse of logit function
• The logit function is the logarithm of odds ratio of probability p of
and event happening vs. the probability of the event not happening,
1 − p;
p
z = logit(p) = log = log p − log(1 − p)
1−p
• Thus the logistic function
1
φlogistic (z) = logit −1 (z) =
1 + exp(−z)
answer the question ”what is the probability p that gives the log
odds ratio of z”
22
Logistic regression
• Logistic regression model assumes a underlying conditional
probability:
exp(+ 12 y wT x)
Pr (y |x) =
exp(+ 12 y wT x) + exp(− 12 y wT x)
where the denominator normalizes the right-hand side to be between
zero and one.
• Dividing the numerator and denominator by exp(+ 12 y wT x) reveals
1
the logistic function Pr (y |x) = φlogistic (y wT x) = 1+exp(−y wT x)
• The margin z = y wT x is thus interpreted as the log odds ratio of
(y |x)
label y vs. label −y given input x: y wT x = log PrPr(−y |x)
• Note: these equations assume the labels Y = {−1, +1}. With labels
Y = {0, 1} the equations will be slightly different.
23
Logistic loss
• Consider the maximization of the likelihood of the observed
input-output in the training data:
m m
Y Y 1
w∗ = argmaxw P(yi |xi ) = argmaxw
1 + exp(−y wT x)
i=1 i=1
• Since the logarithm is monotonically increasing function, we can
take the logarithm to obtain an equivalent objective:
m
X m
X
log Pr (yi |xi ) = − log(1 + exp(−yi wT xi ))
i=1 i=1
• The right-hand side is the logistic loss:
Llogistic (y , wT x) = log(1 + exp(−y wT x))
• Minimizing the logistic loss correspond maximizing the likelihood of
the training data
24
Geometric interpretation of Logistic loss
Llogistic (y , wT x) = log(1 + exp(−y wT x))
• Logistic loss is convex and
differentiable
• It is a monotonically decreasing
function of the margin y wT x
• The loss changes fast when the
margin is highly negative =⇒
penalization of examples far in the
incorrect halfspace
• It changes slowly for highly positive
margins =⇒ does not give extra
bonus for being very far in the correct
halfspace
25
Logistic regression optimization problem
• To train a logistic regression model, we need to find the w that
Pm
minimizes the average logistic loss J(w) = m1 i=1 Llogistic (yi , wT xi )
over the training set:
m
1 X
min J(w) = log(1 + exp(−yi wT xi )
m
i=1
w .r .t parameters w ∈ Rd
• The function to be minimized is continuous and differentiable
• However, it is a non-linear function so it is not easy to find the
optimum directly (e.g. unlike in linear regression)
• We will use stochastic gradient descent to incrementally step
towards the direction where the objective decreases fastest, the
negative gradient
26
Gradient
• The gradient is the vector of partial derivatives of the objective
function J(w) with respect to all parameters wj
m m iT
1 X 1 Xh ∂ ∂
∇J(w) = ∇Ji (w) = ∂w1 Ji (w), . . . , ∂wd J i (w)
m m
i=1 i=1
• Compute the gradient by using the regular rules for differentiation.
For the logistic loss we have
∂ ∂ exp(−yi wT xi )
Ji (w) = log(1 + exp(−yi wT xi )) = · (−yi xij )
∂wj ∂wj 1 + exp(−yi wT xi )
1
=− yi xij = −φlogistic (−yi wT xi )yi xij
1 + exp(yi wT xi )
27
Stochastic gradient descent
• We collect the partial derivatives with respect to a single training
example into a vector:
−(φlogistic (−yi wT xi )yi ) · xi1
..
.
∇Ji (w) = −(φlogistic (−yi wT xi )yi ) · xij = −φlogistic (−yi wT xi )yi · xi
..
.
T
−(φlogistic (−yi w xi )yi ) · xid
• The vector −∇Ji (w) gives the update direction that fastest
decreases the loss on training example (xi , yi )
28
Stochastic gradient descent
• Evaluating the full gradient
m m
1 X 1 X
∇J(w) = ∇Ji (w) = − φlogistic (−yi wT xi )yi · xi
m m
i=1 i=1
is costly since we need to process all training examples
• Stochastic gradient descent instead uses a series of smaller updates
that depend on single randomly drawn training example (xi , yi ) at a
time
• The update direction is taken as −∇Ji (w)
• Its expectation is the full negative gradient:
−Ei=1...,m [ ∇Ji (w) ] = −∇J(w)
• Thus on average, the updates match that of using the full gradient
29
Stochastic gradient descent algorithm
Initialize w = 0; t = 1;
repeat
Draw a training example (x, y ) uniformly at random;
Compute the update direction corresponding to the training example:
∆w = −∇Jt (w);
Determine a stepsize ηt ;
Update w = w − ηt ∇Jt (w);
t = t + 1;
until stopping criterion statisfied
Output w;
30
Stepsize selection
Consider the SGD update: w = w − ηt OJt (w)
• The stepsize parameter ηt , also called the learning rate is a critical
one for convergence to the optimum value
• One uses small constant stepsize, the initial convergence may be
unnecessarily slow
• Too large stepsize may cause the method to continually overshoot
the optimum.
Source: https://dunglai.github.io/2017/12/21/gradient-descent/ 31
Diminishing stepsize
• We can use a diminishing stepsize by starting with an initial larger
stepsize, controlled by hyperparameter η0 > 0
• In each iteration, the stepsize is divided by the iteration counter
t > 0:
η0
ηt =
t
• Caution: In practice, finding a good value for hyperparameter η0
requires experimenting with several values
Source: https://dunglai.github.io/2017/12/21/gradient-descent/ 32
Stopping criterion
When should we stop the algorithm? Some possible choices:
1. Set a maximum number of iterations, after which the algorithm
terminates
• This needs to be separately calibrated for each dataset to avoid
premature termination
2. Gradient of the objective: If we are at a optimum point w∗ of J(w),
the gradient vanishes ∇J(w∗ ) = 0, so we can stop kJ(w)k < γ
where γ is some user-defined parameter
3. It is usually sufficient to train until the zero-one error on training
data does not change anymore
• This usually happens before the logistic loss converges
33
Summary
• Linear classification model are and important class of machine
learning models, they are used as standalone models and appear as
building blocks of more complicated, non-liner models
• Perceptron is a simple algorithm to train linear classifiers on linearly
separable data
• Logistic regression is a classification method that can be interpreted
as maximizing odds ratios of conditional class probabilities
• Stochastic gradient descent is an efficient optimization method for
large data that is nowadays very widely used
34