Lecture 13: Classification: Multivariate Approaches
In the previous lecture, we learned that we can construct an optimal classifier if we know the true underlying probability model. Even if we do not know the true model, we can estimate the model nonparametrically and then train a classifier using the estimated density function or regression function.
This approach is mathematically appealing but practically problematic, because in reality the dimension of the data is often large (i.e., the feature/covariate X has many components). Recall that the MSE of a nonparametric density/regression estimator often converges at the rate O(n^{-4/(4+d)}). This rate is extremely slow when d is large, so converting a nonparametric estimator into a classifier may not work in practice.
Now we introduce a couple of other classifiers that do not require an explicit estimate of the density function or the regression function.
The k-NN approach can be applied to classification as well. The idea is very simple – for a given point x0, we find its k nearest data points, compare the labels of these k points, and assign x0 the majority label among them.
Take the data in the following picture as an example. There are two classes: black dots and red crosses. We are interested in the class labels at the two blue boxes (x1 and x2).
[Figure: scatterplot of the two classes, with the two query points x1 and x2 marked by blue boxes.]
Assume we use a 3-NN classifier ĉ_{3-NN}. At point x1, its 3 nearest neighbors contain two black dots and one red cross, so ĉ_{3-NN}(x1) = black dot. At point x2, its 3 nearest neighbors contain one black dot and two red crosses, so ĉ_{3-NN}(x2) = red cross. Note that if there is a tie, we randomly assign one of the tied class labels.
The k-NN approach is simple and easy to operate, and it generalizes easily to multiple classes – the idea is the same: we assign the class label according to the majority in the neighborhood.
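The majority-vote rule above can be sketched in a few lines of NumPy. This is a minimal illustration, not the lecture's own code; the function name `knn_classify`, the Euclidean distance, and the random tie-breaking (which follows the text's convention) are our choices.

```python
import numpy as np

def knn_classify(x0, X, y, k=3):
    """Assign x0 the majority label among its k nearest training points.

    X: (n, d) array of features, y: (n,) array of class labels.
    Ties are broken by a random draw among the tied labels, as in the text.
    """
    dists = np.linalg.norm(X - x0, axis=1)       # Euclidean distance to each X_i
    nearest = np.argsort(dists)[:k]              # indices of the k nearest points
    labels, counts = np.unique(y[nearest], return_counts=True)
    winners = labels[counts == counts.max()]     # labels attaining the majority
    return np.random.choice(winners)             # random tie-breaking

# toy data: two "black dot" points near the origin, two "red cross" points far away
X = np.array([[0.0, 0.0], [0.1, 0.0], [3.0, 3.0], [3.1, 3.0]])
y = np.array([0, 0, 1, 1])
print(knn_classify(np.array([0.05, 0.05]), X, y, k=3))   # 0: two of the 3-NN have label 0
```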
How do we choose k? The choice of k is often done by a technique called cross-validation. The basic principle is: we split the data into two parts, use one part to train our classifier, and evaluate the performance on the other part. Repeating this procedure multiple times for each candidate k, we obtain an estimate of the performance of each k. We then choose the k with the best performance. We will talk about this topic later.
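The split-train-evaluate recipe can be sketched as follows. This is a toy illustration under our own simplifications: the half/half split, the number of repeats, and the helper `knn_predict` (with deterministic tie-breaking) are not from the lecture.

```python
import numpy as np

def knn_predict(x0, X, y, k):
    # majority label among the k nearest training points (ties -> smallest label)
    nearest = np.argsort(np.linalg.norm(X - x0, axis=1))[:k]
    labels, counts = np.unique(y[nearest], return_counts=True)
    return labels[np.argmax(counts)]

def cv_choose_k(X, y, candidate_ks, n_repeats=10, seed=0):
    """Estimate the misclassification rate of k-NN for each candidate k by
    repeatedly splitting the data in half: train on one half, test on the other."""
    rng = np.random.default_rng(seed)
    n = len(y)
    avg_err = {}
    for k in candidate_ks:
        errs = []
        for _ in range(n_repeats):
            perm = rng.permutation(n)
            tr, te = perm[n // 2:], perm[:n // 2]
            preds = np.array([knn_predict(X[i], X[tr], y[tr], k) for i in te])
            errs.append(np.mean(preds != y[te]))
        avg_err[k] = np.mean(errs)
    return min(candidate_ks, key=avg_err.get)    # k with the best estimated performance

# toy data: two well-separated clusters
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.5, (20, 2)), rng.normal(5, 0.5, (20, 2))])
y = np.array([0] * 20 + [1] * 20)
print(cv_choose_k(X, y, candidate_ks=[1, 3, 5, 7]))
```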
The decision tree is another common approach to classification. A decision tree is like a regression tree – we partition the covariate space into rectangular regions and then assign each region a class label. If the tree is given, the class label of each region is determined by a majority vote (using the majority of the labels in that region).
The following picture provides an example of a decision tree using the same data as the one in the previous
section.
[Figure: decision-tree partition of the data; the tree first splits on x2 at 5, then splits on x1 (at 5.4 and at 5), with a further split on x2 at 6.6.]
In the left panel, we display the scatterplot and regions separated by a decision tree. The background color
denotes the estimated label. The right panel displays the tree structure.
In the case of binary classification (the class label Y = 0 or 1), the decision tree can be written as follows. Let (X1, Y1), · · · , (Xn, Yn) denote the data and R1, · · · , Rk be a rectangular partition of the covariate space. Let R(x) denote the rectangular region that x falls within. The decision tree is

\hat{c}_{DT}(x) = I\left( \frac{\sum_{i=1}^{n} Y_i \, I(X_i \in R(x))}{\sum_{i=1}^{n} I(X_i \in R(x))} > \frac{1}{2} \right).
Here, you can see that the decision tree is essentially a classifier converted from a regression tree.
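For fixed regions, this majority vote can be sketched directly. The function `dt_classify` and the half-open box convention for regions are illustrative assumptions, not part of the lecture.

```python
import numpy as np

def dt_classify(x, X, Y, regions):
    """Decision tree with fixed rectangular regions R_1, ..., R_k.

    regions: list of (lower, upper) corner pairs; each region is the
    half-open box [lower, upper). Returns 1 if the average of the Y_i
    falling in the same region as x exceeds 1/2 (majority vote), else 0.
    """
    for lo, hi in regions:
        if np.all(x >= lo) and np.all(x < hi):
            in_region = np.all((X >= lo) & (X < hi), axis=1)
            return int(Y[in_region].mean() > 0.5)
    raise ValueError("x falls outside every region")

# two 1-d regions: [0, 5) and [5, 10)
regions = [(np.array([0.0]), np.array([5.0])), (np.array([5.0]), np.array([10.0]))]
X = np.array([[1.0], [2.0], [6.0], [7.0]])
Y = np.array([1, 1, 0, 0])
print(dt_classify(np.array([1.5]), X, Y, regions))   # 1: region [0, 5) contains labels 1, 1
```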
Logistic regression is a regression model that is commonly applied to classification problems as well. As with the method of converting a regression estimator into a classifier, logistic regression uses a regression function as an intermediate step and then forms a classifier. We first discuss some motivating examples.
Example. In graduate school admission, we wonder how a student's GPA affects the chance that the applicant receives admission. In this case, each observation is a student, and the response variable Y indicates whether the student received admission (Y = 1) or not (Y = 0). The GPA is the covariate X. Thus, we can model the probability q(x) = P(Y = 1 | X = x) as a function of the GPA.
Example. In medical research, people often wonder whether the heritability of type-2 diabetes is related to some mutation of a gene. Researchers record whether a subject has type-2 diabetes (the response) and measure the mutation signature of the genes (the covariate X). Thus, the response variable Y = 1 if the subject has type-2 diabetes. A statistical model associating the covariate X with the response Y is again through the probability q(x) = P(Y = 1 | X = x).
Thus, the function q(x) now plays a key role in determining how the response Y and the covariate X are
associated. The logistic regression provides a simple and elegant way to characterize the function q(x) in a
‘linear’ way. Because q(x) represents a probability, it ranges within [0, 1] so naively using a linear regression
will not work. However, consider the following quantity:
O(x) = \frac{q(x)}{1 - q(x)} = \frac{P(Y = 1|X = x)}{P(Y = 0|X = x)} \in [0, \infty).
The quantity O(x) is called the odds; it measures the contrast between the event Y = 1 and the event Y = 0. When the odds is greater than 1, we have a higher chance of getting Y = 1 than Y = 0. The odds has an interesting asymmetric form: if P(Y = 1|X = x) = 2P(Y = 0|X = x), then O(x) = 2, but if P(Y = 0|X = x) = 2P(Y = 1|X = x), then O(x) = 1/2. To symmetrize the odds, a straightforward approach is to take the (natural) logarithm of it:
\log O(x) = \log \frac{q(x)}{1 - q(x)}.

This quantity is called the log odds. The log odds has several beautiful properties: for instance, when the two probabilities are equal (P(Y = 1|X = x) = P(Y = 0|X = x)), log O(x) = 0, and swapping the two probabilities simply flips the sign of the log odds (log 2 versus −log 2 in the example above).
The logistic regression imposes a linear model on the log odds. Namely, the logistic regression models

\log O(x) = \log \frac{q(x)}{1 - q(x)} = \beta_0 + \beta^T x,

which leads to

P(Y = 1|X = x) = q(x) = \frac{e^{\beta_0 + \beta^T x}}{1 + e^{\beta_0 + \beta^T x}}.
Thus, the quantity q(x) = q(x; β0, β) depends on the two parameters β0 and β. Here β0 behaves like the intercept and β behaves like the slope vector (they are the intercept and slope in terms of the log odds).
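To make the relation concrete, here is a tiny numeric check that taking the log odds of q(x) recovers the linear form β0 + β^T x. The helper names `q` and `log_odds` are ours, and we restrict to scalar x for simplicity.

```python
import math

def q(x, beta0, beta):
    """q(x) = e^(beta0 + beta*x) / (1 + e^(beta0 + beta*x)) for scalar x."""
    t = beta0 + beta * x
    return math.exp(t) / (1.0 + math.exp(t))

def log_odds(p):
    """log O = log(p / (1 - p))."""
    return math.log(p / (1.0 - p))

# taking the log odds of q(x) recovers beta0 + beta*x
print(log_odds(q(1.0, beta0=-2.0, beta=3.0)))   # ≈ 1.0  (= -2 + 3*1)
```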
When we observe data, how can we estimate these two parameters? In general, we use the maximum likelihood approach. You can view the negative log-likelihood function as the loss function of the classification problem, and the goal is to find the parameters that minimize this loss.
Recall that we observe an IID random sample

(X_1, Y_1), \cdots, (X_n, Y_n).
Let pX(x) denote the probability density of X; note that we will not use it in estimating β0, β. For a given pair (Xi, Yi), recall that the random variable Yi given Xi is just a Bernoulli random variable with parameter q(Xi; β0, β), so the log-likelihood is a function of β0 and β. The MLE β̂0, β̂ does not have a closed-form solution in general, so we cannot write down a simple expression for the estimator. Despite this disadvantage, the log-likelihood function can be optimized by an iterative ascent method such as Newton-Raphson¹.
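A sketch of the Newton-Raphson iteration for the logistic log-likelihood follows. The function name `logistic_mle`, the zero initialization, and the fixed iteration count are our own choices; a production implementation would add a convergence check.

```python
import numpy as np

def logistic_mle(X, y, n_iter=25):
    """Newton-Raphson iteration maximizing the logistic log-likelihood.

    X: (n, d) covariates, y: (n,) labels in {0, 1}.
    Returns the MLE (beta0_hat, beta_hat).
    """
    n, d = X.shape
    Z = np.hstack([np.ones((n, 1)), X])            # prepend 1s so beta0 is the first coefficient
    beta = np.zeros(d + 1)                         # start from beta0 = 0, beta = 0
    for _ in range(n_iter):
        q = 1.0 / (1.0 + np.exp(-Z @ beta))        # q(X_i; beta0, beta)
        grad = Z.T @ (y - q)                       # score: gradient of the log-likelihood
        W = q * (1.0 - q)                          # Var(Y_i | X_i) for a Bernoulli
        hess = -(Z * W[:, None]).T @ Z             # Hessian of the log-likelihood
        beta = beta - np.linalg.solve(hess, grad)  # Newton step
    return beta[0], beta[1:]

# toy 1-d example; the classes overlap, so the MLE is finite
X = np.array([[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]])
y = np.array([0, 0, 1, 0, 1, 1])
print(logistic_mle(X, y))
```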
Even without any probabilistic model, the classification problem can be viewed as an optimization problem. The key element is to replace the risk function R(c) = E(L(c(X), Y)) by the empirical risk

\hat{R}_n(c) = \frac{1}{n} \sum_{i=1}^{n} L(c(X_i), Y_i).
1 Some references can be found at https://www.cs.princeton.edu/~bee/courses/lec/lec_jan24.pdf
Consider a collection of classifiers C. Then the goal is to find the best classifier c* ∈ C such that the empirical risk R̂n(c) is minimized. Namely,

c^* = \operatorname{argmin}_{c \in C} \hat{R}_n(c).
Linear classifier. A linear classifier takes the form

c_{\beta_0, \beta}(x) = I(\beta_0 + \beta^T x > 0),

where β0 is a number like the intercept and β is a vector like the slopes of the covariates/features. Namely, how we assign the class label depends purely on the value of β0 + β^T x: if this value is positive, we assign the label 1; if it is negative, we assign the label 0. Then the set

C_{lin} = \{c_{\beta_0, \beta} : \beta_0 \in \mathbb{R}, \beta \in \mathbb{R}^d\}

is the collection of all linear classifiers. The idea of empirical risk minimization is to find the classifier in C_lin that minimizes the empirical risk R̂n(·). Because every classifier is indexed by the two parameters β0, β, finding the classifier that minimizes the empirical risk is equivalent to finding the best β0, β minimizing R̂n(·).
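A brute-force sketch of this minimization in one dimension is below. The grid search over (β0, β1) is purely illustrative; minimizing the 0-1 loss has no closed form, and the helper names are our own.

```python
import numpy as np

def empirical_risk(c, X, y):
    """R̂n(c): average 0-1 loss of classifier c over the sample."""
    return np.mean([c(x) != yi for x, yi in zip(X, y)])

def erm_linear_1d(X, y, grid):
    """Brute-force ERM over 1-d linear classifiers c(x) = I(beta0 + beta1*x > 0).

    Scans (beta0, beta1) over a small grid of candidate values.
    """
    best = None
    for b0 in grid:
        for b1 in grid:
            c = lambda x, b0=b0, b1=b1: int(b0 + b1 * x > 0)
            r = empirical_risk(c, X, y)
            if best is None or r < best[0]:
                best = (r, b0, b1)
    return best   # (minimal empirical risk, best beta0, best beta1)

X = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([0, 0, 1, 1])
print(erm_linear_1d(X, y, np.linspace(-3, 3, 13))[0])   # 0.0: a separating rule exists in the grid
```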
Logistic regression. Similar to the linear classifier, the logistic regression can be viewed as a classifier of the form

\tilde{c}_{\beta_0, \beta}(x) = I\left( \frac{e^{\beta_0 + \beta^T x}}{1 + e^{\beta_0 + \beta^T x}} > \frac{1}{2} \right).

Then we can define the set

C_{logistic} = \{\tilde{c}_{\beta_0, \beta} : \beta_0 \in \mathbb{R}, \beta \in \mathbb{R}^d\}

as the collection of all classifiers from a logistic regression model. The MLE approach becomes an empirical risk minimization method using a particular loss function – the negative log-likelihood.
Decision tree (fixed k, fixed leaves). The decision tree classifier ĉ_DT can be viewed as a classifier obtained from empirical risk minimization as well. For simplicity, we assume that the regions/leaves of the decision tree R1, · · · , Rk are fixed. Then any decision tree classifier can be written as

c_{DT}(x) = \sum_{j=1}^{k} \alpha_j I(x \in R_j),

where α1, · · · , αk ∈ {0, 1} are quantities/parameters that determine how we predict the class label of regions R1, · · · , Rk, respectively. The collection of all possible classifiers is then

C_{DT} = \left\{ \sum_{j=1}^{k} \alpha_j I(x \in R_j) : \alpha_1, \cdots, \alpha_k \in \{0, 1\} \right\}.

Note that there are only 2^k classifiers in the set C_DT. The estimator ĉ_DT is just the one that minimizes the empirical risk R̂n(·) with the 0-1 loss.
Decision tree (fixed k, non-fixed leaves). When the regions/leaves are not fixed, the collection of all possible decision tree classifiers is much more complex. Here is an abstract way of describing such a collection. Recall that at each split of the tree, we pick one feature and a threshold (e.g., x1 > 10 versus x1 ≤ 10), so every split can be represented by two indices: the feature index (which feature the split occurs on) and the threshold index (the split level). Thus, a split is characterized by a pair (m, λ), where m ∈ {1, 2, · · · , d} and λ ∈ R. If a tree has k leaves, there are k − 1 splits, so any decision tree classifier is indexed by

\alpha_1, \cdots, \alpha_k, (m_1, \lambda_1), \cdots, (m_{k-1}, \lambda_{k-1}).

Namely, given a set of these values, we can construct a unique decision tree classifier. Thus, the collection of all decision trees with k leaves can be written as

C_{DT}(k) = \{c_{DT}(x) = c_{\alpha, m, \lambda}(x) : \alpha_j \in \{0, 1\}, (m_\ell, \lambda_\ell) \in \{1, \cdots, d\} \times \mathbb{R}, \ j = 1, \cdots, k, \ \ell = 1, \cdots, k-1\}.
In reality, when we train a decision tree classifier with a fixed number k, the regions are also computed from the data. Thus, we are actually finding ĉ_DT such that

\hat{c}_{DT} = \operatorname{argmin}_{c \in C_{DT}(k)} \hat{R}_n(c).
Decision tree (both k and leaves non-fixed). If we train the classifier with k unspecified, then we are finding ĉ_DT from ∪_{k∈N} C_DT(k) that minimizes the empirical risk. However, if we really consider all possible k, such an optimal decision tree is not unique and may be problematic. When k > n, we can make each leaf contain at most one observation, so that the predicted label is just the label of that observation. Such a classifier has 0 empirical risk, but it may have very poor performance in future prediction because we are overfitting the data. Overfitting the data implies that R̂n(c) and R(c) are very different, even though R̂n(c) is an unbiased estimator of R(c)!
Why does this happen? R̂n(c) is just the sample-average version of R(c), right? Does this contradict the law of large numbers, which says that R̂n(c) converges to R(c)?

It is true that R̂n(c) is an unbiased estimator of R(c), and indeed the law of large numbers is applicable in this case. BUT a key requirement for applying the law of large numbers is that the classifier c is fixed. Namely, if the classifier c is fixed, then the law of large numbers guarantees that the empirical risk R̂n(c) converges to the true risk R(c).
However, when we are finding the best classifier, we are considering many, many possible classifiers c. Although the law of large numbers works for each given classifier c, it may not work when we consider many classifiers simultaneously. The empirical risk minimization works if

\sup_{c \in C} \left| \hat{R}_n(c) - R(c) \right| \overset{P}{\to} 0.

Namely, the convergence is uniform over all classifiers in the collection that we are considering. In the next few lectures we will talk about how this uniform convergence can be established.
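The failure of pointwise convergence under selection can be seen in a tiny simulation. The random-labeling "classifiers" below are a deliberately extreme toy of our own: the labels are fair coin flips independent of everything, so every classifier has true risk exactly 1/2, yet the empirical-risk minimizer looks far better.

```python
import numpy as np

rng = np.random.default_rng(0)
n, n_classifiers = 50, 10000

# labels are fair coin flips: every classifier has true risk R(c) = 1/2
y = rng.integers(0, 2, size=n)

# an extreme toy collection C: each "classifier" is a random labeling
# of the n sample points
preds = rng.integers(0, 2, size=(n_classifiers, n))
emp_risks = (preds != y).mean(axis=1)

# the average empirical risk is near 1/2 (LLN for each fixed c) ...
print(emp_risks.mean())
# ... but the minimum is far below 1/2: the sup-deviation over C is large
print(emp_risks.min())
```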
Remark.
• Regression problem. The empirical risk minimization method can also be applied to regression problems. We just replace the classifier by the regression function, and the loss function can be chosen as the L2 loss (squared distance). In this formulation, we obtain

R(m) = E\|Y - m(X)\|^2

and

\hat{R}_n(m) = \frac{1}{n} \sum_{i=1}^{n} \|Y_i - m(X_i)\|^2.

Estimating a regression function can thus be written as an empirical risk minimization – we minimize R̂n(m) over m ∈ M to obtain our regression estimator. The set M is a collection of regression functions. As in the classification problem, we need uniform convergence of the empirical risk to make sure we have a good regression estimator.
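When M is the class of linear functions, this empirical risk minimization reduces to ordinary least squares; a minimal sketch, assuming that linear class (the function name is ours):

```python
import numpy as np

def erm_linear_regression(X, y):
    """Minimize R̂n(m) = (1/n) Σ ||Y_i − m(X_i)||² over linear m(x) = a + bᵀx.

    With the L2 loss and a linear class M, the ERM is exactly least squares.
    """
    Z = np.hstack([np.ones((len(X), 1)), X])      # design matrix with intercept column
    coef, *_ = np.linalg.lstsq(Z, y, rcond=None)  # least-squares solution
    return coef                                   # (intercept, slopes)

# noiseless toy data y = 2 + 3x: the ERM recovers the coefficients exactly
X = np.arange(10, dtype=float).reshape(-1, 1)
y = 2.0 + 3.0 * X[:, 0]
print(erm_linear_regression(X, y))   # ≈ [2. 3.]
```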
• Penalty function. Another approach to handling the difference sup_{c∈C} |R̂n(c) − R(c)| is to add an extra quantity to R̂n(c) so that this uniform difference is controlled. Instead of minimizing R̂n(c), we minimize

\hat{R}_n(c) + P_\lambda(c),

where P_λ is a penalty function. This is just the penalized regression approach, now applied to a classification problem. When the penalty function is chosen well, we can ensure that the optimal classifier/regression estimator from the (penalized) empirical risk minimization indeed has a very small risk.
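As a toy numeric illustration, consider penalizing a decision tree by its number of leaves, Pλ(c) = λ·k (a common cost-complexity choice, though not the only one); the empirical-risk numbers below are hypothetical:

```python
def penalized_risk(emp_risk, n_leaves, lam):
    """R̂n(c) + Pλ(c) with the cost-complexity penalty Pλ(c) = λ · (number of leaves)."""
    return emp_risk + lam * n_leaves

# hypothetical empirical risks of trees with k leaves: bigger trees fit
# the sample better, but the penalty pushes back against complexity
emp = {1: 0.40, 3: 0.20, 8: 0.05, 50: 0.00}
best_k = min(emp, key=lambda k: penalized_risk(emp[k], k, lam=0.01))
print(best_k)   # 8: the k=50 tree's perfect fit is outweighed by its penalty
```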