Lecture 13: Classification: Multivariate Approaches
In the previous lecture, we learned that we can construct an optimal classifier if we know the true underlying probability model. Even if we do not know the true model, we can estimate the model nonparametrically and then train a classifier using the estimated density function or regression function.
This approach is mathematically appealing but practically problematic, because in reality the dimension of the data is often large (i.e., the feature/covariate X has many components). Recall that the MSE of a nonparametric density/regression estimator often converges at the rate O(n^{-4/(4+d)}). This rate is extremely slow when d is large, so converting a nonparametric estimator into a classifier may not work in practice.
Now we introduce a couple of other classifiers that do not require an explicit estimate of the density function or the regression function.
The k-NN approach can be applied to classification as well. The idea is very simple – for a given point x0, we find its k nearest data points, compare the labels of these k points, and assign x0 the majority label among them.
Take the data in the following picture as an example. There are two classes: black dots and red crosses. We are interested in the class labels at the two blue boxes (x1 and x2).
[Figure: scatterplot of the two classes, with the two query points x1 and x2 marked by blue boxes.]
Assume we use a 3-NN classifier ĉ_{3-NN}. At point x1, its 3 nearest neighbors contain two black dots and one red cross, so ĉ_{3-NN}(x1) = black dot. At point x2, its 3 nearest neighbors contain one black dot and two red crosses, so ĉ_{3-NN}(x2) = red cross. Note that if there is a tie, we randomly assign one of the tied class labels.
The k-NN approach is simple and easy to operate, and it generalizes easily to multiple classes – the idea is the same: we assign the class label according to the majority in the neighborhood.
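The majority-vote rule above can be sketched in a few lines of NumPy. This is a minimal illustration, not the lecture's own code; the function name `knn_classify`, the Euclidean distance, and the random tie-breaking (which follows the text's convention) are our choices.

```python
import numpy as np

def knn_classify(x0, X, y, k=3):
    """Assign x0 the majority label among its k nearest training points.

    X: (n, d) array of features, y: (n,) array of class labels.
    Ties are broken by a random draw among the tied labels, as in the text.
    """
    dists = np.linalg.norm(X - x0, axis=1)       # Euclidean distance to each X_i
    nearest = np.argsort(dists)[:k]              # indices of the k nearest points
    labels, counts = np.unique(y[nearest], return_counts=True)
    winners = labels[counts == counts.max()]     # labels attaining the majority
    return np.random.choice(winners)             # random tie-breaking

# toy data: two "black dot" points near the origin, two "red cross" points far away
X = np.array([[0.0, 0.0], [0.1, 0.0], [3.0, 3.0], [3.1, 3.0]])
y = np.array([0, 0, 1, 1])
print(knn_classify(np.array([0.05, 0.05]), X, y, k=3))   # 0: two of the 3-NN have label 0
```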
How do we choose k? The choice of k is often done by a technique called cross-validation. The basic principle is: we split the data into two parts, use one part to train our classifier, and evaluate the performance on the other part. Repeating this procedure multiple times for each candidate k, we obtain an estimate of the performance of each k. We then choose the k with the best performance. We will talk about this topic later.
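The split-train-evaluate recipe can be sketched as follows. This is a toy illustration under our own simplifications: the half/half split, the number of repeats, and the helper `knn_predict` (with deterministic tie-breaking) are not from the lecture.

```python
import numpy as np

def knn_predict(x0, X, y, k):
    # majority label among the k nearest training points (ties -> smallest label)
    nearest = np.argsort(np.linalg.norm(X - x0, axis=1))[:k]
    labels, counts = np.unique(y[nearest], return_counts=True)
    return labels[np.argmax(counts)]

def cv_choose_k(X, y, candidate_ks, n_repeats=10, seed=0):
    """Estimate the misclassification rate of k-NN for each candidate k by
    repeatedly splitting the data in half: train on one half, test on the other."""
    rng = np.random.default_rng(seed)
    n = len(y)
    avg_err = {}
    for k in candidate_ks:
        errs = []
        for _ in range(n_repeats):
            perm = rng.permutation(n)
            tr, te = perm[n // 2:], perm[:n // 2]
            preds = np.array([knn_predict(X[i], X[tr], y[tr], k) for i in te])
            errs.append(np.mean(preds != y[te]))
        avg_err[k] = np.mean(errs)
    return min(candidate_ks, key=avg_err.get)    # k with the best estimated performance

# toy data: two well-separated clusters
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.5, (20, 2)), rng.normal(5, 0.5, (20, 2))])
y = np.array([0] * 20 + [1] * 20)
print(cv_choose_k(X, y, candidate_ks=[1, 3, 5, 7]))
```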
The decision tree is another common approach to classification. A decision tree is like a regression tree – we partition the covariate space into rectangular regions and then assign each region a class label. If the tree is given, the class label of each region is determined by a majority vote (using the majority of the labels in that region).
The following picture provides an example of a decision tree using the same data as the one in the previous
section.
[Figure: decision-tree partition of the data; the tree first splits on x2 at 5, then splits on x1 (at 5.4 and at 5), with a further split on x2 at 6.6.]
In the left panel, we display the scatterplot and regions separated by a decision tree. The background color
denotes the estimated label. The right panel displays the tree structure.
In the case of binary classification (the class label Y = 0 or 1), the decision tree can be written as follows. Let (X1, Y1), · · · , (Xn, Yn) denote the data and R1, · · · , Rk be a rectangular partition of the covariate space. Let R(x) denote the rectangular region that x falls within. The decision tree is

\hat{c}_{DT}(x) = I\left( \frac{\sum_{i=1}^{n} Y_i \, I(X_i \in R(x))}{\sum_{i=1}^{n} I(X_i \in R(x))} > \frac{1}{2} \right).
Here, you can see that the decision tree is essentially a classifier converted from a regression tree.
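For fixed regions, this majority vote can be sketched directly. The function `dt_classify` and the half-open box convention for regions are illustrative assumptions, not part of the lecture.

```python
import numpy as np

def dt_classify(x, X, Y, regions):
    """Decision tree with fixed rectangular regions R_1, ..., R_k.

    regions: list of (lower, upper) corner pairs; each region is the
    half-open box [lower, upper). Returns 1 if the average of the Y_i
    falling in the same region as x exceeds 1/2 (majority vote), else 0.
    """
    for lo, hi in regions:
        if np.all(x >= lo) and np.all(x < hi):
            in_region = np.all((X >= lo) & (X < hi), axis=1)
            return int(Y[in_region].mean() > 0.5)
    raise ValueError("x falls outside every region")

# two 1-d regions: [0, 5) and [5, 10)
regions = [(np.array([0.0]), np.array([5.0])), (np.array([5.0]), np.array([10.0]))]
X = np.array([[1.0], [2.0], [6.0], [7.0]])
Y = np.array([1, 1, 0, 0])
print(dt_classify(np.array([1.5]), X, Y, regions))   # 1: region [0, 5) contains labels 1, 1
```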
Logistic regression is a regression model that is commonly applied to classification problems as well. As with the method of converting a regression estimator into a classifier, logistic regression uses a regression function as an intermediate step and then forms a classifier. We first discuss some motivating examples.
Example. In graduate school admission, we wonder how a student's GPA affects the chance that the applicant receives admission. In this case, each observation is a student, and the response variable Y indicates whether the student received admission (Y = 1) or not (Y = 0). The GPA is the covariate X. Thus, we can model the probability q(x) = P(Y = 1 | X = x) as a function of the GPA.
Example. In medical research, people often wonder whether the heritability of type-2 diabetes is related to some mutation of a gene. Researchers record whether a subject has type-2 diabetes (the response) and measure the mutation signature of the genes (the covariate X). Thus, the response variable Y = 1 if the subject has type-2 diabetes. A statistical model associating the covariate X with the response Y is again through the probability q(x) = P(Y = 1 | X = x).
Thus, the function q(x) now plays a key role in determining how the response Y and the covariate X are
associated. The logistic regression provides a simple and elegant way to characterize the function q(x) in a
‘linear’ way. Because q(x) represents a probability, it ranges within [0, 1] so naively using a linear regression
will not work. However, consider the following quantity:
O(x) = \frac{q(x)}{1 - q(x)} = \frac{P(Y = 1|X = x)}{P(Y = 0|X = x)} \in [0, \infty).
The quantity O(x) is called the odds; it measures the contrast between the event Y = 1 and the event Y = 0. When the odds is greater than 1, we have a higher chance of getting Y = 1 than Y = 0. The odds has an interesting asymmetric form: if P(Y = 1|X = x) = 2P(Y = 0|X = x), then O(x) = 2, but if P(Y = 0|X = x) = 2P(Y = 1|X = x), then O(x) = 1/2. To symmetrize the odds, a straightforward approach is to take the (natural) logarithm of it:
\log O(x) = \log \frac{q(x)}{1 - q(x)}.

This quantity is called the log odds. The log odds has several beautiful properties: for instance, when the two probabilities are equal (P(Y = 1|X = x) = P(Y = 0|X = x)), log O(x) = 0, and swapping the two probabilities simply flips the sign of the log odds (log 2 versus −log 2 in the example above).
The logistic regression imposes a linear model on the log odds. Namely, the logistic regression models

\log O(x) = \log \frac{q(x)}{1 - q(x)} = \beta_0 + \beta^T x,

which leads to

P(Y = 1|X = x) = q(x) = \frac{e^{\beta_0 + \beta^T x}}{1 + e^{\beta_0 + \beta^T x}}.
Thus, the quantity q(x) = q(x; β0, β) depends on the two parameters β0 and β. Here β0 behaves like the intercept and β behaves like the slope vector (they are the intercept and slope in terms of the log odds).
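To make the relation concrete, here is a tiny numeric check that taking the log odds of q(x) recovers the linear form β0 + β^T x. The helper names `q` and `log_odds` are ours, and we restrict to scalar x for simplicity.

```python
import math

def q(x, beta0, beta):
    """q(x) = e^(beta0 + beta*x) / (1 + e^(beta0 + beta*x)) for scalar x."""
    t = beta0 + beta * x
    return math.exp(t) / (1.0 + math.exp(t))

def log_odds(p):
    """log O = log(p / (1 - p))."""
    return math.log(p / (1.0 - p))

# taking the log odds of q(x) recovers beta0 + beta*x
print(log_odds(q(1.0, beta0=-2.0, beta=3.0)))   # ≈ 1.0  (= -2 + 3*1)
```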
When we observe data, how can we estimate these two parameters? In general, we use the maximum likelihood approach. You can view the negative log-likelihood function as the loss function of the classification problem, and the goal is to find the parameters that minimize this loss.
Recall that we observe an IID random sample

(X_1, Y_1), \cdots, (X_n, Y_n).
Let pX(x) denote the probability density of X; note that we will not use it in estimating β0, β. For a given pair (Xi, Yi), recall that the random variable Yi given Xi is just a Bernoulli random variable with parameter q(Xi; β0, β), so the log-likelihood is a function of β0 and β. The MLE β̂0, β̂ does not have a closed-form solution in general, so we cannot write down a simple expression for the estimator. Despite this disadvantage, the log-likelihood function can be optimized by an iterative ascent method such as Newton-Raphson¹.
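A sketch of the Newton-Raphson iteration for the logistic log-likelihood follows. The function name `logistic_mle`, the zero initialization, and the fixed iteration count are our own choices; a production implementation would add a convergence check.

```python
import numpy as np

def logistic_mle(X, y, n_iter=25):
    """Newton-Raphson iteration maximizing the logistic log-likelihood.

    X: (n, d) covariates, y: (n,) labels in {0, 1}.
    Returns the MLE (beta0_hat, beta_hat).
    """
    n, d = X.shape
    Z = np.hstack([np.ones((n, 1)), X])            # prepend 1s so beta0 is the first coefficient
    beta = np.zeros(d + 1)                         # start from beta0 = 0, beta = 0
    for _ in range(n_iter):
        q = 1.0 / (1.0 + np.exp(-Z @ beta))        # q(X_i; beta0, beta)
        grad = Z.T @ (y - q)                       # score: gradient of the log-likelihood
        W = q * (1.0 - q)                          # Var(Y_i | X_i) for a Bernoulli
        hess = -(Z * W[:, None]).T @ Z             # Hessian of the log-likelihood
        beta = beta - np.linalg.solve(hess, grad)  # Newton step
    return beta[0], beta[1:]

# toy 1-d example; the classes overlap, so the MLE is finite
X = np.array([[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]])
y = np.array([0, 0, 1, 0, 1, 1])
print(logistic_mle(X, y))
```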
Even without any probabilistic model, the classification problem can be viewed as an optimization problem. The key element is to replace the risk function R(c) = E(L(c(X), Y)) by the empirical risk

\hat{R}_n(c) = \frac{1}{n} \sum_{i=1}^{n} L(c(X_i), Y_i).
1 Some references can be found at https://www.cs.princeton.edu/~bee/courses/lec/lec_jan24.pdf
Consider a collection of classifiers C. Then the goal is to find the best classifier c* ∈ C such that the empirical risk R̂n(c) is minimized. Namely,

c^* = \operatorname{argmin}_{c \in C} \hat{R}_n(c).
Linear classifier. A linear classifier takes the form

c_{\beta_0, \beta}(x) = I(\beta_0 + \beta^T x > 0),

where β0 is a number like the intercept and β is a vector like the slopes of the covariates/features. Namely, how we assign the class label depends purely on the value of β0 + β^T x: if this value is positive, we assign the label 1; if it is negative, we assign the label 0. Then the set

C_{lin} = \{c_{\beta_0, \beta} : \beta_0 \in \mathbb{R}, \beta \in \mathbb{R}^d\}

is the collection of all linear classifiers. The idea of empirical risk minimization is to find the classifier in C_lin that minimizes the empirical risk R̂n(·). Because every classifier is indexed by the two parameters β0, β, finding the classifier that minimizes the empirical risk is equivalent to finding the best β0, β minimizing R̂n(·).
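A brute-force sketch of this minimization in one dimension is below. The grid search over (β0, β1) is purely illustrative; minimizing the 0-1 loss has no closed form, and the helper names are our own.

```python
import numpy as np

def empirical_risk(c, X, y):
    """R̂n(c): average 0-1 loss of classifier c over the sample."""
    return np.mean([c(x) != yi for x, yi in zip(X, y)])

def erm_linear_1d(X, y, grid):
    """Brute-force ERM over 1-d linear classifiers c(x) = I(beta0 + beta1*x > 0).

    Scans (beta0, beta1) over a small grid of candidate values.
    """
    best = None
    for b0 in grid:
        for b1 in grid:
            c = lambda x, b0=b0, b1=b1: int(b0 + b1 * x > 0)
            r = empirical_risk(c, X, y)
            if best is None or r < best[0]:
                best = (r, b0, b1)
    return best   # (minimal empirical risk, best beta0, best beta1)

X = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([0, 0, 1, 1])
print(erm_linear_1d(X, y, np.linspace(-3, 3, 13))[0])   # 0.0: a separating rule exists in the grid
```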
Logistic regression. Similar to the linear classifier, the logistic regression can be viewed as a classifier of the form

\tilde{c}_{\beta_0, \beta}(x) = I\left( \frac{e^{\beta_0 + \beta^T x}}{1 + e^{\beta_0 + \beta^T x}} > \frac{1}{2} \right).

Then we can define the set

C_{logistic} = \{\tilde{c}_{\beta_0, \beta} : \beta_0 \in \mathbb{R}, \beta \in \mathbb{R}^d\}

as the collection of all classifiers from a logistic regression model. The MLE approach becomes an empirical risk minimization method using a particular loss function – the negative log-likelihood.
Decision tree (fixed k, fixed leaves). The decision tree classifier ĉ_DT can be viewed as a classifier obtained from empirical risk minimization as well. For simplicity, we assume that the regions/leaves of the decision tree R1, · · · , Rk are fixed. Then any decision tree classifier can be written as

c_{DT}(x) = \sum_{j=1}^{k} \alpha_j I(x \in R_j),

where α1, · · · , αk ∈ {0, 1} are quantities/parameters that determine how we predict the class label of regions R1, · · · , Rk, respectively. The collection of all possible classifiers is then

C_{DT} = \left\{ \sum_{j=1}^{k} \alpha_j I(x \in R_j) : \alpha_1, \cdots, \alpha_k \in \{0, 1\} \right\}.

Note that there are only 2^k classifiers in the set C_DT. The estimator ĉ_DT is just the one that minimizes the empirical risk R̂n(·) with the 0-1 loss.
Decision tree (fixed k, non-fixed leaves). When the regions/leaves are not fixed, the collection of all possible decision tree classifiers is much more complex. Here is an abstract way of describing such a collection. Recall that at each split of the tree, we pick one feature and a threshold (e.g., x1 > 10 versus x1 ≤ 10), so every split can be represented by two indices: the feature index (which feature the split occurs on) and the threshold index (the split level). Thus, a split is characterized by a pair (m, λ), where m ∈ {1, 2, · · · , d} and λ ∈ R. If a tree has k leaves, there are k − 1 splits, so any decision tree classifier is indexed by

\alpha_1, \cdots, \alpha_k, (m_1, \lambda_1), \cdots, (m_{k-1}, \lambda_{k-1}).

Namely, given a set of these values, we can construct a unique decision tree classifier. Thus, the collection of all decision trees with k leaves can be written as

C_{DT}(k) = \{c_{DT}(x) = c_{\alpha, m, \lambda}(x) : \alpha_j \in \{0, 1\}, (m_\ell, \lambda_\ell) \in \{1, \cdots, d\} \times \mathbb{R}, \ j = 1, \cdots, k, \ \ell = 1, \cdots, k-1\}.
In reality, when we train a decision tree classifier with a fixed number k, the regions are also computed from the data. Thus, we are actually finding ĉ_DT such that

\hat{c}_{DT} = \operatorname{argmin}_{c \in C_{DT}(k)} \hat{R}_n(c).
Decision tree (both k and leaves non-fixed). If we train the classifier with k unspecified, then we are finding ĉ_DT from ∪_{k∈N} C_DT(k) that minimizes the empirical risk. However, if we really consider all possible k, such an optimal decision tree is not unique and may be problematic. When k > n, we can make each leaf contain at most one observation, so that the predicted label is just the label of that observation. Such a classifier has 0 empirical risk, but it may have very poor performance in future prediction because we are overfitting the data. Overfitting the data implies that R̂n(c) and R(c) are very different, even though R̂n(c) is an unbiased estimator of R(c)!
Why does this happen? R̂n(c) is just the sample-average version of R(c), right? Does this contradict the law of large numbers, which says that R̂n(c) converges to R(c)?

It is true that R̂n(c) is an unbiased estimator of R(c), and indeed the law of large numbers is applicable in this case. BUT a key requirement for applying the law of large numbers is that the classifier c is fixed. Namely, if the classifier c is fixed, then the law of large numbers guarantees that the empirical risk R̂n(c) converges to the true risk R(c).
However, when we are finding the best classifier, we are considering many, many possible classifiers c. Although the law of large numbers works for each given classifier c, it may not work when we consider many classifiers simultaneously. The empirical risk minimization works if

\sup_{c \in C} \left| \hat{R}_n(c) - R(c) \right| \overset{P}{\to} 0.

Namely, the convergence is uniform over all classifiers in the collection that we are considering. In the next few lectures we will talk about how this uniform convergence can be established.
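The failure of pointwise convergence under selection can be seen in a tiny simulation. The random-labeling "classifiers" below are a deliberately extreme toy of our own: the labels are fair coin flips independent of everything, so every classifier has true risk exactly 1/2, yet the empirical-risk minimizer looks far better.

```python
import numpy as np

rng = np.random.default_rng(0)
n, n_classifiers = 50, 10000

# labels are fair coin flips: every classifier has true risk R(c) = 1/2
y = rng.integers(0, 2, size=n)

# an extreme toy collection C: each "classifier" is a random labeling
# of the n sample points
preds = rng.integers(0, 2, size=(n_classifiers, n))
emp_risks = (preds != y).mean(axis=1)

# the average empirical risk is near 1/2 (LLN for each fixed c) ...
print(emp_risks.mean())
# ... but the minimum is far below 1/2: the sup-deviation over C is large
print(emp_risks.min())
```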
Remark.
• Regression problem. The empirical risk minimization method can also be applied to regression problems. We just replace the classifier by the regression function, and the loss function can be chosen as the L2 loss (squared distance). In this formulation, we obtain

R(m) = E\|Y - m(X)\|^2

and

\hat{R}_n(m) = \frac{1}{n} \sum_{i=1}^{n} \|Y_i - m(X_i)\|^2.

Estimating a regression function can thus be written as an empirical risk minimization – we minimize R̂n(m) over m ∈ M to obtain our regression estimator. The set M is a collection of regression functions. As in the classification problem, we need uniform convergence of the empirical risk to make sure we have a good regression estimator.
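When M is the class of linear functions, this empirical risk minimization reduces to ordinary least squares; a minimal sketch, assuming that linear class (the function name is ours):

```python
import numpy as np

def erm_linear_regression(X, y):
    """Minimize R̂n(m) = (1/n) Σ ||Y_i − m(X_i)||² over linear m(x) = a + bᵀx.

    With the L2 loss and a linear class M, the ERM is exactly least squares.
    """
    Z = np.hstack([np.ones((len(X), 1)), X])      # design matrix with intercept column
    coef, *_ = np.linalg.lstsq(Z, y, rcond=None)  # least-squares solution
    return coef                                   # (intercept, slopes)

# noiseless toy data y = 2 + 3x: the ERM recovers the coefficients exactly
X = np.arange(10, dtype=float).reshape(-1, 1)
y = 2.0 + 3.0 * X[:, 0]
print(erm_linear_regression(X, y))   # ≈ [2. 3.]
```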
• Penalty function. Another approach to handling the difference sup_{c∈C} |R̂n(c) − R(c)| is to add an extra quantity to R̂n(c) so that this uniform difference is controlled. Instead of minimizing R̂n(c), we minimize

\hat{R}_n(c) + P_\lambda(c),

where P_λ is a penalty function. This is just the penalized regression approach, now applied to a classification problem. When the penalty function is chosen well, we can ensure that the optimal classifier/regression estimator from the (penalized) empirical risk minimization indeed has a very small risk.
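As a toy numeric illustration, consider penalizing a decision tree by its number of leaves, Pλ(c) = λ·k (a common cost-complexity choice, though not the only one); the empirical-risk numbers below are hypothetical:

```python
def penalized_risk(emp_risk, n_leaves, lam):
    """R̂n(c) + Pλ(c) with the cost-complexity penalty Pλ(c) = λ · (number of leaves)."""
    return emp_risk + lam * n_leaves

# hypothetical empirical risks of trees with k leaves: bigger trees fit
# the sample better, but the penalty pushes back against complexity
emp = {1: 0.40, 3: 0.20, 8: 0.05, 50: 0.00}
best_k = min(emp, key=lambda k: penalized_risk(emp[k], k, lam=0.01))
print(best_k)   # 8: the k=50 tree's perfect fit is outweighed by its penalty
```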