SML Lecture5

The document covers the theory and algorithms of supervised machine learning, focusing on linear classification, including concepts like generalization error, model selection, and specific algorithms such as perceptron and logistic regression. It explains the geometric interpretation of linear classifiers, their properties, and the learning process involved in training these models. Additionally, it discusses the convergence of the perceptron algorithm and the loss function associated with it.

Uploaded by

mohamnaf.b

CS-E4715 Supervised Machine Learning

Lecture 5: Linear classification

Course topics

• Part I: Theory
• Introduction
• Generalization error analysis & PAC learning
• Rademacher Complexity & VC dimension
• Model selection
• Part II: Algorithms and models
• Linear models: perceptron, logistic regression
• Support vector machines
• Kernel methods
• Boosting
• Neural networks (MLPs)
• Part III: Additional topics
• Feature learning, selection and sparsity
• Multi-class classification
• Preference learning, ranking

Linear classification

• Input space $X \subset \mathbb{R}^d$: each $x \in X$ is a d-dimensional real-valued vector; output space $Y = \{-1, +1\}$
• Training sample $S = \{(x_1, y_1), \dots, (x_m, y_m)\}$ drawn from an unknown distribution $D$
• Hypothesis class $H = \{x \mapsto \mathrm{sgn}(\sum_{j=1}^{d} w_j x_j + w_0) \mid w \in \mathbb{R}^d, w_0 \in \mathbb{R}\}$ consists of functions $h(x) = \mathrm{sgn}(\sum_{j=1}^{d} w_j x_j + w_0)$ that map each example to one of the two classes
• $\mathrm{sgn}(a) = +1$ if $a \ge 0$ and $-1$ if $a < 0$ is the sign function

Linear classifiers

Linear classifiers

$$h(x) = \mathrm{sgn}\left(\sum_{j=1}^{d} w_j x_j + w_0\right) = \mathrm{sgn}\left(w^T x + w_0\right)$$

have several attractive properties

• They are fast to evaluate and take little space to store (O(d) time and space)
• They are easy to understand: $|w_j|$ shows the importance of variable $x_j$ and its sign tells whether the effect is positive or negative
• Linear models have relatively low complexity (e.g. VCdim = d + 1), so they can be reliably estimated from limited data

Good practice is to try a linear model before something more complicated

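As a minimal sketch of the prediction rule above, the following plain-Python snippet evaluates $h(x) = \mathrm{sgn}(w^T x + w_0)$ in a single pass over the d features. The function names and the example weights are illustrative assumptions, not part of the lecture.

```python
def sgn(a):
    """Sign function as defined in the lecture: +1 for a >= 0, otherwise -1."""
    return 1 if a >= 0 else -1


def predict(w, w0, x):
    """Evaluate h(x) = sgn(w^T x + w0); one pass over the d features."""
    g = sum(wj * xj for wj, xj in zip(w, x)) + w0
    return sgn(g)


# Hypothetical 2-dimensional model, chosen only for illustration
w, w0 = [2.0, -1.0], -0.5
label_a = predict(w, w0, [1.0, 0.5])  # g = 2.0 - 0.5 - 0.5 = 1.0
label_b = predict(w, w0, [0.0, 1.0])  # g = -1.0 - 0.5 = -1.5
```

Storing the model requires only the d weights and the bias, which is where the O(d) space figure comes from.
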
The geometry of the linear classifier

• The points $\{x \in X \mid g(x) = w^T x + w_0 = 0\}$ define a hyperplane in $\mathbb{R}^d$, where d is the number of variables in x
• The hyperplane $g(x) = w^T x + w_0 = 0$ splits the input space into two half-spaces. The linear classifier predicts +1 for points in the half-space $\{x \in X \mid g(x) = w^T x + w_0 \ge 0\}$ and −1 for points in $\{x \in X \mid g(x) = w^T x + w_0 < 0\}$

• w is the normal vector of the hyperplane $g(x) = w^T x + w_0 = 0$
• The distance of the hyperplane from the origin is $|w_0| / \|w\|$, where $\|w\| = \sqrt{\sum_j w_j^2}$ denotes the Euclidean norm
• If $w_0 < 0$ the hyperplane lies in the direction of w from the origin; if $w_0 > 0$ it lies in the direction of −w (for $w_0 = 0$ it passes through the origin)

• The value $g(x_0)$ tells where $x_0$ lies in relation to the hyperplane:
  • $g(x_0) > 0$: $x_0$ lies in the half-space that is in the direction of w from the hyperplane
  • $g(x_0) = 0$: $x_0$ lies on the hyperplane
  • $g(x_0) < 0$: $x_0$ lies in the direction of −w from the hyperplane
• The distance of a point $x_0$ from the hyperplane $g(x) = 0$ is $|g(x_0)| / \|w\|$

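The geometric facts above can be checked numerically. This is a small sketch under assumed example values: `signed_distance` is a hypothetical helper that returns $g(x_0)/\|w\|$, whose sign gives the side of the hyperplane and whose absolute value gives the distance.

```python
import math


def signed_distance(w, w0, x0):
    """Return g(x0) / ||w||: positive on the side w points to, negative on the other."""
    g = sum(wj * xj for wj, xj in zip(w, x0)) + w0
    return g / math.sqrt(sum(wj * wj for wj in w))


# Hypothetical hyperplane 3*x1 + 4*x2 - 10 = 0, so ||w|| = 5 and w0 < 0
w, w0 = [3.0, 4.0], -10.0

d_origin = signed_distance(w, w0, [0.0, 0.0])   # -10 / 5: origin is on the -w side
d_far = signed_distance(w, w0, [6.0, 8.0])      # (18 + 32 - 10) / 5
d_on = signed_distance(w, w0, [2.0, 1.0])       # (6 + 4 - 10) / 5: on the hyperplane
```

Because $w_0 < 0$ here, the origin has a negative signed distance, which agrees with the statement that the hyperplane lies in the direction of w from the origin.
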
Learning linear classifiers

Change of representation

• Consider the parameters $(w, w_0)$ of the linear function $g(x) = w^T x + w_0$
• For presentation it is convenient to subsume the term $w_0$ into the weight vector

$$w \Leftarrow \begin{bmatrix} w \\ w_0 \end{bmatrix}$$

and augment all inputs with a constant 1:

$$x \Leftarrow \begin{bmatrix} x \\ 1 \end{bmatrix}$$

• The models give the same value for x:

$$\begin{bmatrix} w \\ w_0 \end{bmatrix}^T \begin{bmatrix} x \\ 1 \end{bmatrix} = w^T x + w_0$$

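The change of representation is easy to verify in code. The sketch below uses hypothetical helper names (`augment_weights`, `augment_input`) and arbitrary example numbers; it only illustrates that the augmented inner product reproduces $w^T x + w_0$.

```python
def dot(a, b):
    """Inner product of two equal-length sequences."""
    return sum(ai * bi for ai, bi in zip(a, b))


def augment_weights(w, w0):
    """Append the bias w0 as the last weight."""
    return list(w) + [w0]


def augment_input(x):
    """Append a constant 1 as the last feature."""
    return list(x) + [1.0]


# Arbitrary example values
w, w0, x = [2.0, -1.0], -0.5, [1.0, 0.5]

original = dot(w, x) + w0
augmented = dot(augment_weights(w, w0), augment_input(x))
```
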
Geometric interpretation

• Geometrically, the hyperplane in the changed representation now passes through the origin
• The points predicted as positive have an acute (or right) angle with w: $w^T x \ge 0$
• The points predicted as negative have an obtuse angle with w: $w^T x < 0$

Checking for prediction errors

• When the labels are $Y = \{-1, +1\}$, for a training example $(x, y)$ and $g(x) = w^T x$ we have
  $\mathrm{sgn}(g(x)) = y$ if x is correctly classified, and $\mathrm{sgn}(g(x)) = -y$ if x is incorrectly classified
• Alternatively, we can just multiply by the correct label to check for misclassification:
  $y\,g(x) \ge 0$ if x is correctly classified, and $y\,g(x) < 0$ if x is incorrectly classified

Margin

• The geometric margin of a labeled example $(x, y)$ is given by $\gamma(x) = y\,g(x) / \|w\|$
• It takes into account both the distance $|w^T x| / \|w\|$ from the hyperplane, and whether x is on the correct side of the hyperplane
• The unnormalized version of the margin is sometimes called the functional margin $\gamma(x) = y\,g(x)$
• Often the term margin is used for both variants, assuming the context makes clear which one is meant

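A short sketch of both margin variants, which also illustrates the misclassification check $y\,g(x) < 0$. The function names and numbers are assumptions made for illustration; inputs are taken to be in the augmented representation, so $g(x) = w^T x$.

```python
import math


def functional_margin(w, x, y):
    """Unnormalized margin y * g(x) with g(x) = w^T x."""
    return y * sum(wj * xj for wj, xj in zip(w, x))


def geometric_margin(w, x, y):
    """Functional margin divided by the Euclidean norm of w."""
    return functional_margin(w, x, y) / math.sqrt(sum(wj * wj for wj in w))


w = [3.0, 4.0]          # ||w|| = 5
x = [1.0, 1.0]          # w^T x = 7

m_correct = geometric_margin(w, x, +1)   # label agrees with the sign of g(x)
m_wrong = geometric_margin(w, x, -1)     # label disagrees: negative margin
is_misclassified = functional_margin(w, x, -1) < 0
```

Scaling w by a positive constant changes the functional margin but leaves the geometric margin unchanged, which is why the normalized variant is the one with a direct geometric meaning.
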
Perceptron

• The perceptron algorithm by Frank Rosenblatt (1957) is perhaps the first machine learning algorithm
• Its purpose was to learn a linear function separating two classes
• It was built in hardware and shown to be capable of performing rudimentary pattern recognition tasks
• New York Times in 1958: "the embryo of an electronic computer that [the Navy] expects will be able to walk, talk, see, write, reproduce itself and be conscious of its existence." (Source: Wikipedia)

[Figure: Mark I perceptron ca. 1958 (Picture: Wikipedia)]

The perceptron algorithm

• The perceptron algorithm learns a hyperplane $g(x) = w^T x$ separating two classes
• It processes a set of training examples incrementally
• At each step, it finds a training example $x_i$ that is incorrectly classified by the current model
• It updates the model by adding the example, multiplied by its label, to the current weight vector: $w^{(t+1)} \leftarrow w^{(t)} + y_i x_i$
• This process is continued until no incorrectly predicted training examples remain

Input: Training set $S = \{(x_i, y_i)\}_{i=1}^{m}$, $x_i \in \mathbb{R}^d$, $y_i \in \{-1, +1\}$
Initialize $w^{(1)} \leftarrow (0, \dots, 0)$, $t \leftarrow 1$, stop ← FALSE
repeat
    if exists i, s.t. $y_i w^{(t)T} x_i \le 0$ then
        $w^{(t+1)} \leftarrow w^{(t)} + y_i x_i$
    else
        stop ← TRUE
    end if
    $t \leftarrow t + 1$
until stop

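The pseudocode can be turned into a short runnable sketch. The version below is one possible reading, not the course's reference implementation: it picks the first misclassified example in index order (the pseudocode leaves the choice open), assumes the inputs are already augmented with a constant 1, and adds a `max_updates` safeguard that is not in the lecture, since the loop would not terminate on non-separable data. The toy data set is made up for illustration.

```python
def perceptron(X, y, max_updates=10_000):
    """Run the perceptron on augmented inputs; return (w, number_of_updates).

    max_updates is an added safeguard (an assumption, not in the pseudocode).
    """
    w = [0.0] * len(X[0])
    updates = 0
    while updates < max_updates:
        # Find the first example with y_i * w^T x_i <= 0, if any
        mistake = next(
            ((xi, yi) for xi, yi in zip(X, y)
             if yi * sum(wj * xj for wj, xj in zip(w, xi)) <= 0),
            None,
        )
        if mistake is None:
            return w, updates          # no misclassified example left
        xi, yi = mistake
        w = [wj + yi * xj for wj, xj in zip(w, xi)]   # w <- w + y_i x_i
        updates += 1
    raise RuntimeError("no separating hyperplane found within max_updates")


# Made-up linearly separable data; the last feature is the constant 1
X = [[2.0, 1.0, 1.0], [1.0, 3.0, 1.0], [-1.0, -2.0, 1.0], [-2.0, -1.0, 1.0]]
y = [1, 1, -1, -1]

w, n_updates = perceptron(X, y)
margins = [yi * sum(wj * xj for wj, xj in zip(w, xi)) for xi, yi in zip(X, y)]
```

On termination every functional margin in `margins` is strictly positive, because the loop only exits when no example satisfies $y_i w^T x_i \le 0$.
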
Understanding the update rule

• Let us examine the update rule

$$w^{(t+1)} \leftarrow w^{(t)} + y_i x_i$$

• We can see that the margin of the example $(x_i, y_i)$ increases after the update (using $y_i^2 = 1$):

$$y_i g^{(t+1)}(x_i) = y_i w^{(t+1)T} x_i = y_i (w^{(t)} + y_i x_i)^T x_i = y_i w^{(t)T} x_i + y_i^2 x_i^T x_i = y_i g^{(t)}(x_i) + \|x_i\|^2 \ge y_i g^{(t)}(x_i)$$

• Note that this does not guarantee that $y_i g^{(t+1)}(x_i) > 0$ after the update; further updates may be required to achieve that

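A quick numerical illustration of the identity above, with values chosen (as an assumption) so that a single update is not enough: the margin grows by exactly $\|x_i\|^2$ but stays negative.

```python
def dot(a, b):
    """Inner product of two equal-length sequences."""
    return sum(ai * bi for ai, bi in zip(a, b))


# Arbitrary example: the current weights misclassify (x_i, y_i)
w = [1.0, -5.0]
x_i, y_i = [1.0, 3.0], +1

margin_before = y_i * dot(w, x_i)                  # 1 - 15 = -14
w_new = [wj + y_i * xj for wj, xj in zip(w, x_i)]  # perceptron update
margin_after = y_i * dot(w_new, x_i)               # 2 - 6 = -4
increase = margin_after - margin_before            # equals ||x_i||^2 = 10
```
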
Perceptron animation

• Assume $w^{(t)}$ has been found by running the algorithm for t steps
• We notice two misclassified examples
• Select the misclassified example $(\phi(x_i), -1)$
• Note: $\phi(x_i)$ is here some transformation of $x_i$, e.g. with some basis functions, but it could be the identity $\phi(x) = x$

[Figure: positive and negative examples, the weight vector $w^{(\tau)}$, the regions $w^{(\tau)T}\phi > 0$ and $w^{(\tau)T}\phi < 0$, and the selected example $\phi(x_i)$]

• Update the weight vector: $w^{(t+1)} = w^{(t)} + y_i \phi(x_i)$

[Figure: the weight vector $w^{(\tau)}$ and the update term $-\phi(x_i)$]

• The update tilts the hyperplane to make the example "more correct", i.e. more negative
• We repeat the process by finding the next misclassified example $\phi(x_{i+1})$ and update: $w^{(t+2)} = w^{(t+1)} + y_{i+1} \phi(x_{i+1})$

[Figure: the weight vector $w^{(\tau+1)}$ and the update term $-\phi(x_{i+1})$]

• Next iteration

[Figure: the weight vector $w^{(\tau+2)}$ and the resulting hyperplane]

• Finally we have found a hyperplane that correctly classifies the training points
• We can stop the iteration and output the final weight vector

[Figure: final hyperplane separating the positive and negative examples]

Convergence of the perceptron algorithm

• The perceptron algorithm can be shown to eventually converge to a consistent hyperplane if the two classes are linearly separable, that is, if there exists a hyperplane that separates the two classes
• Theorem (Novikoff):
  • Let $S = \{(x_i, y_i)\}_{i=1}^{m}$ be a linearly separable training set.
  • Let $R = \max_{x_i \in S} \|x_i\|$.
  • Let there exist a vector $w^*$ that satisfies $\|w^*\| = 1$ and $y_i (w^{*T} x_i + b_{opt}) \ge \gamma$ for $i = 1, \dots, m$, with $\gamma > 0$.
  • Then the perceptron algorithm will stop after at most $t \le (2R/\gamma)^2$ iterations and output a weight vector $w^{(t)}$ for which $y_i w^{(t)T} x_i \ge 0$ for all $i = 1, \dots, m$

The number of iterations in the bound $t \le (2R/\gamma)^2$ depends on:

• γ: the largest achievable geometric margin such that all training examples have at least that margin
• R: the smallest radius of a d-dimensional ball, centered at the origin, that encloses the training data
• Intuitively: how large the margin is relative to the norms of the training points

[Figure: training data enclosed in a ball of radius R, with margin γ and unit normal vector w, $\|w\| = 1$]

However, the perceptron algorithm does not stop on a non-separable training set, since there will always be a misclassified example that causes an update

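To make the bound concrete, the sketch below computes R, the margin γ achieved by a given separating direction, and the resulting value $(2R/\gamma)^2$. The helper `novikoff_bound` and the data are illustrative assumptions. Note that it uses the margin of the supplied `w_star`, which need not be the largest achievable one; a smaller γ still gives a valid, only looser, bound.

```python
import math


def novikoff_bound(X, y, w_star, b=0.0):
    """Return (R, gamma, (2R/gamma)^2) for a separating direction w_star with bias b."""
    norm = math.sqrt(sum(v * v for v in w_star))
    # Smallest geometric margin over the training set
    gamma = min(
        yi * (sum(wj * xj for wj, xj in zip(w_star, xi)) + b) / norm
        for xi, yi in zip(X, y)
    )
    if gamma <= 0:
        raise ValueError("w_star does not separate the data")
    # Largest norm of a training input
    R = max(math.sqrt(sum(v * v for v in xi)) for xi in X)
    return R, gamma, (2 * R / gamma) ** 2


# Made-up separable data on a line, split by the first coordinate
X = [[2.0, 0.0], [3.0, 0.0], [-2.0, 0.0], [-4.0, 0.0]]
y = [1, 1, -1, -1]

R, gamma, bound = novikoff_bound(X, y, w_star=[1.0, 0.0])
```

For this data R = 4 and γ = 2, so the bound evaluates to 16 updates, independent of the number of examples m.
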
The loss function of the Perceptron algorithm

It can be shown that the Perceptron algorithm is using the following loss:

$$L_{Perceptron}(y, w^T x) = \max(0, -y\,w^T x)$$

• $y\,w^T x$ is the margin
• if $y\,w^T x < 0$, a loss of $-y\,w^T x$ is incurred, otherwise no loss is incurred

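One way to see the connection between this loss and the algorithm: on a misclassified example, a subgradient of $\max(0, -y\,w^T x)$ with respect to w is $-y\,x$, so a step of size 1 against it is exactly the perceptron update $w + y\,x$. The sketch below illustrates this with hypothetical function names and arbitrary numbers; at margin exactly 0 it returns $-y\,x$, which is one valid subgradient and matches the algorithm's $\le 0$ update condition.

```python
def dot(a, b):
    """Inner product of two equal-length sequences."""
    return sum(ai * bi for ai, bi in zip(a, b))


def perceptron_loss(w, x, y):
    """max(0, -y * w^T x)."""
    return max(0.0, -y * dot(w, x))


def perceptron_subgradient(w, x, y):
    """A subgradient w.r.t. w: -y*x if y * w^T x <= 0, else the zero vector."""
    if y * dot(w, x) <= 0:
        return [-y * xj for xj in x]
    return [0.0] * len(x)


w, x, y = [1.0, -5.0], [1.0, 3.0], +1      # margin is -14, so the loss is 14
loss = perceptron_loss(w, x, y)
g = perceptron_subgradient(w, x, y)
w_step = [wj - 1.0 * gj for wj, gj in zip(w, g)]   # unit step against the subgradient
w_perceptron = [wj + y * xj for wj, xj in zip(w, x)]
```
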
Convexity of Perceptron loss

A function $f: \mathbb{R}^n \to \mathbb{R}$ is convex if for all $x, y$ and $0 \le \theta \le 1$ we have

$$f(\theta x + (1 - \theta) y) \le \theta f(x) + (1 - \theta) f(y).$$

• Geometrical interpretation: the graph of a convex function lies on or below the line segment from $(x, f(x))$ to $(y, f(y))$
• It is easy to see that the Perceptron loss is convex but the zero-one loss is not convex

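The difference can be illustrated with a single counterexample. In this sketch both losses are written as functions of the margin $z = y\,w^T x$ (an assumption made to keep the example one-dimensional), and `satisfies_convexity` is a hypothetical helper that tests the defining inequality at one triple of values. One passing triple does not prove convexity, but one failing triple does disprove it.

```python
def zero_one_loss(z):
    """1 if the margin is negative (misclassified), else 0."""
    return 1.0 if z < 0 else 0.0


def perceptron_loss(z):
    """max(0, -z) as a function of the margin z."""
    return max(0.0, -z)


def satisfies_convexity(f, a, b, theta):
    """Check f(theta*a + (1-theta)*b) <= theta*f(a) + (1-theta)*f(b) at one triple."""
    return f(theta * a + (1 - theta) * b) <= theta * f(a) + (1 - theta) * f(b) + 1e-12


# Midpoint of margins -3 and 1 is -1
zero_one_ok = satisfies_convexity(zero_one_loss, -3.0, 1.0, 0.5)      # 1 <= 0.5 fails
perceptron_ok = satisfies_convexity(perceptron_loss, -3.0, 1.0, 0.5)  # 1 <= 1.5 holds
```
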
• The convexity of the Perceptron loss has an important consequence: every local minimum is also a global minimum
• In principle we can minimize it with incremental updates that gradually decrease the loss
• In contrast, finding a hyperplane that minimizes the zero-one loss is computationally hard (it is NP-hard to minimize the training error)
• However, we need better algorithms than the Perceptron, ones that terminate when we are close to the optimum

Logistic regression

Logistic regression is a classification technique (despite the name)

• It gets its name from the logistic function

$$\phi_{logistic}(z) = \frac{1}{1 + \exp(-z)} = \frac{\exp(z)}{1 + \exp(z)}$$

that maps a real-valued input z onto the interval $0 < \phi_{logistic}(z) < 1$
• The function is an example of sigmoid ("S"-shaped) functions

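The two equivalent forms of the logistic function are useful in practice: choosing between them by the sign of z avoids overflow in `exp`. This is a common implementation trick rather than something stated in the lecture, and the function name `logistic` is an assumption.

```python
import math


def logistic(z):
    """Logistic function, using whichever form keeps exp() from overflowing."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))   # 1 / (1 + exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)                    # exp(z) / (1 + exp(z))


mid = logistic(0.0)          # the curve crosses 1/2 at z = 0
tail_low = logistic(-1000.0) # no OverflowError; the result rounds toward 0
tail_high = logistic(1000.0) # no OverflowError; the result rounds toward 1
```

In exact arithmetic the output is strictly between 0 and 1; in floating point the extreme tails can round to the endpoints.
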
Logistic function: a probabilistic interpretation

• The logistic function $\phi_{logistic}(z)$ is the inverse of the logit function
• The logit function is the logarithm of the odds ratio of the probability p of an event happening vs. the probability of the event not happening, 1 − p:

$$z = \mathrm{logit}(p) = \log \frac{p}{1 - p} = \log p - \log(1 - p)$$

• Thus the logistic function

$$\phi_{logistic}(z) = \mathrm{logit}^{-1}(z) = \frac{1}{1 + \exp(-z)}$$

answers the question "what is the probability p that gives the log odds ratio of z"

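The inverse relationship can be checked directly. A minimal sketch, with function names chosen for illustration and probabilities restricted to the open interval (0, 1) where the logit is defined:

```python
import math


def logit(p):
    """Log odds log(p / (1 - p)) for 0 < p < 1."""
    return math.log(p) - math.log(1.0 - p)


def logistic(z):
    """Inverse of logit: 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + math.exp(-z))


# Round trip p -> z -> p for a few probabilities
round_trip = [(p, logistic(logit(p))) for p in (0.1, 0.5, 0.9)]
even_odds = logit(0.5)   # equal odds correspond to z = 0
```
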
Logistic regression

• The logistic regression model assumes an underlying conditional probability:

$$\Pr(y \mid x) = \frac{\exp(+\tfrac{1}{2} y\,w^T x)}{\exp(+\tfrac{1}{2} y\,w^T x) + \exp(-\tfrac{1}{2} y\,w^T x)}$$

where the denominator normalizes the right-hand side to be between zero and one.
• Dividing the numerator and denominator by $\exp(+\tfrac{1}{2} y\,w^T x)$ reveals the logistic function: $\Pr(y \mid x) = \phi_{logistic}(y\,w^T x) = \frac{1}{1 + \exp(-y\,w^T x)}$
• The margin $z = y\,w^T x$ is thus interpreted as the log odds ratio of label y vs. label −y given input x: $y\,w^T x = \log \frac{\Pr(y \mid x)}{\Pr(-y \mid x)}$
• Note: these equations assume the labels $Y = \{-1, +1\}$. With labels $Y = \{0, 1\}$ the equations will be slightly different.

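For readers who want the intermediate steps, the two claims above can be worked out as follows. This derivation only expands the algebra already stated on the slide; it uses the fact that the denominator is the same for y and −y.

```latex
\begin{align*}
\Pr(y \mid x)
  &= \frac{\exp(+\tfrac{1}{2} y\,w^T x)}
          {\exp(+\tfrac{1}{2} y\,w^T x) + \exp(-\tfrac{1}{2} y\,w^T x)} \\
  &= \frac{1}{1 + \exp(-\tfrac{1}{2} y\,w^T x - \tfrac{1}{2} y\,w^T x)}
   = \frac{1}{1 + \exp(-y\,w^T x)}, \\[4pt]
\frac{\Pr(y \mid x)}{\Pr(-y \mid x)}
  &= \frac{\exp(+\tfrac{1}{2} y\,w^T x)}{\exp(-\tfrac{1}{2} y\,w^T x)}
   = \exp(y\,w^T x)
  \quad\Longrightarrow\quad
  \log \frac{\Pr(y \mid x)}{\Pr(-y \mid x)} = y\,w^T x .
\end{align*}
```

The first line also shows that $\Pr(y \mid x) + \Pr(-y \mid x) = 1$, since the two numerators add up to the shared denominator.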
Logistic loss

• Consider the maximization of the likelihood of the observed input-output pairs in the training data:

$$w^* = \mathrm{argmax}_w \prod_{i=1}^{m} P(y_i \mid x_i) = \mathrm{argmax}_w \prod_{i=1}^{m} \frac{1}{1 + \exp(-y_i w^T x_i)}$$

• Since the logarithm is a monotonically increasing function, we can take the logarithm to obtain an equivalent objective:

$$\sum_{i=1}^{m} \log \Pr(y_i \mid x_i) = -\sum_{i=1}^{m} \log(1 + \exp(-y_i w^T x_i))$$

• Each term of the sum on the right-hand side, without the minus sign, is the logistic loss:

$$L_{logistic}(y, w^T x) = \log(1 + \exp(-y\,w^T x))$$

• Minimizing the logistic loss corresponds to maximizing the likelihood of the training data

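The equivalence between the summed logistic loss and the negative log-likelihood can be checked numerically. In this sketch the function names, the weights, and the two data points are illustrative assumptions; the loss is written in an overflow-safe form using the identity $\log(1 + e^{-z}) = -z + \log(1 + e^{z})$ for negative margins.

```python
import math


def logistic_loss(margin):
    """log(1 + exp(-margin)), arranged to avoid overflow for very negative margins."""
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


def dot(a, b):
    """Inner product of two equal-length sequences."""
    return sum(ai * bi for ai, bi in zip(a, b))


# Arbitrary model and data; the margins are 1 and 2
w = [1.0, -1.0]
X = [[1.0, 0.0], [0.0, 2.0]]
y = [1, -1]

total_loss = sum(logistic_loss(yi * dot(w, xi)) for xi, yi in zip(X, y))

# Negative log-likelihood computed directly from Pr(y|x) = 1 / (1 + exp(-y w^T x))
neg_log_lik = -sum(
    math.log(1.0 / (1.0 + math.exp(-yi * dot(w, xi)))) for xi, yi in zip(X, y)
)
```
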
Geometric interpretation of Logistic loss

$$L_{logistic}(y, w^T x) = \log(1 + \exp(-y\,w^T x))$$

• Logistic loss is convex and differentiable
• It is a monotonically decreasing function of the margin $y\,w^T x$
• The loss changes fast when the margin is highly negative $\Rightarrow$ penalization of examples far in the incorrect half-space
• It changes slowly for highly positive margins $\Rightarrow$ does not give extra bonus for being very far in the correct half-space

Logistic regression optimization problem

• To train a logistic regression model, we need to find the w that minimizes the average logistic loss $J(w) = \frac{1}{m} \sum_{i=1}^{m} L_{logistic}(y_i, w^T x_i)$ over the training set:

$$\min_{w \in \mathbb{R}^d} J(w) = \frac{1}{m} \sum_{i=1}^{m} \log(1 + \exp(-y_i w^T x_i))$$

• The function to be minimized is continuous and differentiable
• However, it is a non-linear function, so it is not easy to find the optimum directly (unlike e.g. in linear regression)
• We will use stochastic gradient descent to step incrementally in the direction where the objective decreases fastest, the negative gradient

Gradient

• The gradient is the vector of partial derivatives of the objective function J(w) with respect to all parameters $w_j$:

$$\nabla J(w) = \frac{1}{m} \sum_{i=1}^{m} \nabla J_i(w) = \frac{1}{m} \sum_{i=1}^{m} \left[ \frac{\partial}{\partial w_1} J_i(w), \dots, \frac{\partial}{\partial w_d} J_i(w) \right]^T$$

• Compute the gradient by using the regular rules for differentiation. For the logistic loss we have

$$\frac{\partial}{\partial w_j} J_i(w) = \frac{\partial}{\partial w_j} \log(1 + \exp(-y_i w^T x_i)) = \frac{\exp(-y_i w^T x_i)}{1 + \exp(-y_i w^T x_i)} \cdot (-y_i x_{ij}) = -\frac{1}{1 + \exp(y_i w^T x_i)}\, y_i x_{ij} = -\phi_{logistic}(-y_i w^T x_i)\, y_i x_{ij}$$

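A standard sanity check for a hand-derived gradient is to compare it with central finite differences. The sketch below does this for the per-example logistic loss; the function names, the test point, and the step `eps` are assumptions chosen for illustration.

```python
import math


def dot(a, b):
    """Inner product of two equal-length sequences."""
    return sum(ai * bi for ai, bi in zip(a, b))


def logistic(z):
    """1 / (1 + exp(-z)); adequate for the moderate values used here."""
    return 1.0 / (1.0 + math.exp(-z))


def loss_i(w, x, y):
    """Per-example logistic loss J_i(w) = log(1 + exp(-y w^T x))."""
    return math.log1p(math.exp(-y * dot(w, x)))


def grad_i(w, x, y):
    """Analytic gradient: -phi(-y w^T x) * y * x."""
    coeff = -logistic(-y * dot(w, x)) * y
    return [coeff * xj for xj in x]


def numerical_grad(w, x, y, eps=1e-6):
    """Central finite-difference approximation of the gradient."""
    grad = []
    for j in range(len(w)):
        w_plus, w_minus = list(w), list(w)
        w_plus[j] += eps
        w_minus[j] -= eps
        grad.append((loss_i(w_plus, x, y) - loss_i(w_minus, x, y)) / (2 * eps))
    return grad


w, x, y = [0.5, -1.0, 0.2], [1.0, 2.0, 1.0], -1
max_diff = max(abs(a - n) for a, n in zip(grad_i(w, x, y), numerical_grad(w, x, y)))
```
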
Stochastic gradient descent

• We collect the partial derivatives with respect to a single training example into a vector:

$$\nabla J_i(w) = \begin{bmatrix} -\phi_{logistic}(-y_i w^T x_i)\, y_i \cdot x_{i1} \\ \vdots \\ -\phi_{logistic}(-y_i w^T x_i)\, y_i \cdot x_{ij} \\ \vdots \\ -\phi_{logistic}(-y_i w^T x_i)\, y_i \cdot x_{id} \end{bmatrix} = -\phi_{logistic}(-y_i w^T x_i)\, y_i \cdot x_i$$

• The vector $-\nabla J_i(w)$ gives the update direction that decreases the loss on training example $(x_i, y_i)$ fastest

• Evaluating the full gradient

$$\nabla J(w) = \frac{1}{m} \sum_{i=1}^{m} \nabla J_i(w) = -\frac{1}{m} \sum_{i=1}^{m} \phi_{logistic}(-y_i w^T x_i)\, y_i \cdot x_i$$

is costly since we need to process all training examples
• Stochastic gradient descent instead uses a series of smaller updates, each of which depends on a single randomly drawn training example $(x_i, y_i)$
• The update direction is taken as $-\nabla J_i(w)$
• Its expectation, with i drawn uniformly from $1, \dots, m$, is the full negative gradient:

$$-\mathbb{E}_{i = 1, \dots, m}[\nabla J_i(w)] = -\nabla J(w)$$

• Thus on average, the updates match those obtained using the full gradient

Stochastic gradient descent algorithm

Initialize w = 0; t = 1;
repeat
    Draw a training example (x, y) uniformly at random;
    Compute the update direction corresponding to the training example: $\Delta w = -\nabla J_t(w)$;
    Determine a stepsize $\eta_t$;
    Update $w = w - \eta_t \nabla J_t(w)$;
    t = t + 1;
until stopping criterion satisfied
Output w;

Here $J_t$ denotes the loss on the example drawn at iteration t.

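The pseudocode can be sketched in plain Python for the logistic loss. This is one possible instantiation under stated assumptions, not the course's reference code: it uses a stepsize of the form $\eta_t = \eta_0 / t$, stops after a fixed number of iterations, expects inputs already augmented with a constant 1, and takes a `seed` argument only to make the random draws reproducible. The toy data set is made up.

```python
import math
import random


def logistic(z):
    """Overflow-safe logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def sgd_logistic(X, y, eta0=1.0, max_iters=2000, seed=0):
    """SGD for logistic regression with stepsize eta0 / t (assumed schedule)."""
    rng = random.Random(seed)
    w = [0.0] * len(X[0])
    for t in range(1, max_iters + 1):
        i = rng.randrange(len(X))                 # draw an example uniformly at random
        xi, yi = X[i], y[i]
        margin = yi * sum(wj * xj for wj, xj in zip(w, xi))
        coeff = -logistic(-margin) * yi           # gradient of J_i is coeff * x_i
        eta = eta0 / t                            # diminishing stepsize
        w = [wj - eta * coeff * xj for wj, xj in zip(w, xi)]
    return w


# Made-up separable data; the last feature is the constant 1
X = [[2.0, 1.0, 1.0], [1.0, 3.0, 1.0], [-1.0, -2.0, 1.0], [-2.0, -1.0, 1.0]]
y = [1, 1, -1, -1]

w = sgd_logistic(X, y)
margins = [yi * sum(wj * xj for wj, xj in zip(w, xi)) for xi, yi in zip(X, y)]
```

On this particular data set every update moves w along some $y_i x_i$, and all of these vectors have positive inner products with each other, so the margins of all four examples end up positive regardless of the draw order.
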
Stepsize selection

Consider the SGD update: $w = w - \eta_t \nabla J_t(w)$

• The stepsize parameter $\eta_t$, also called the learning rate, is critical for convergence to the optimum value
• If one uses a small constant stepsize, the initial convergence may be unnecessarily slow
• Too large a stepsize may cause the method to continually overshoot the optimum.

Source: https://dunglai.github.io/2017/12/21/gradient-descent/

Diminishing stepsize

• We can use a diminishing stepsize by starting with a larger initial stepsize, controlled by the hyperparameter $\eta_0 > 0$
• In each iteration, the initial stepsize is divided by the iteration counter $t > 0$:

$$\eta_t = \frac{\eta_0}{t}$$

• Caution: In practice, finding a good value for the hyperparameter $\eta_0$ requires experimenting with several values

Source: https://dunglai.github.io/2017/12/21/gradient-descent/

Stopping criterion

When should we stop the algorithm? Some possible choices:

1. Set a maximum number of iterations, after which the algorithm terminates
   • This needs to be separately calibrated for each dataset to avoid premature termination
2. Gradient of the objective: if we are at an optimum point $w^*$ of $J(w)$, the gradient vanishes, $\nabla J(w^*) = 0$, so we can stop when $\|\nabla J(w)\| < \gamma$, where γ is some user-defined parameter
3. It is usually sufficient to train until the zero-one error on the training data does not change anymore
   • This usually happens before the logistic loss converges

Summary

• Linear classification models are an important class of machine learning models: they are used as standalone models and appear as building blocks of more complicated, non-linear models
• Perceptron is a simple algorithm to train linear classifiers on linearly separable data
• Logistic regression is a classification method that can be interpreted as maximizing odds ratios of conditional class probabilities
• Stochastic gradient descent is an efficient optimization method for large data that is nowadays very widely used
