1.12.2024-BSC-301-CSBS-class Note - 2024-25
December 1, 2024
Abstract
Disclaimer: This Statistics course is an application-oriented course designed for students of Computer Science and Business Systems. It deals with the basics of multivariate statistical analysis. This course is taught in various universities following different pedagogical points of view. Keeping engineering students in mind, I personally feel that the applications of methodologies specifically designed for dealing with multivariate data have to be taught. This note is a summary of what has been taught in the class and in no way can be considered complete. The students are encouraged to study a good and convenient book which tallies with the syllabus of BSC-301.
This course is intended to explain the fundamentals of multivariate statistical techniques and to apply the concepts in multivariate data analysis.
Class 1:
Pre-requisites 1
Multivariate data arise whenever an investigator, seeking to understand a social or physical phenomenon, selects a number p ≥ 1 of variables or characters to record. Here x_jk denotes the measurement
of the k-th variable on the j-th item. Therefore n measurements on p variables can be displayed as follows:
X =
x11 x12 · · · x1k · · · x1p
x21 x22 · · · x2k · · · x2p
· · · · · · · · · · · · · · · · · ·
xj1 xj2 · · · xjk · · · xjp
· · · · · · · · · · · · · · · · · ·
xn1 xn2 · · · xnk · · · xnp
Before proceeding we agree to represent the variables column-wise and the measurements row-wise (i.e. the first row contains the measurements of variables 1, 2, · · · , k, · · · , p on the first item, and so on).
Just as in the univariate case we calculate, for a data set, a measure of central tendency (such as the mean), a measure of dispersion (such as the variance) and measures of linear association (covariance and correlation), in the multivariate case also we will learn how these measures are computed. With respect to the above matrix X, these measures are defined as follows:
x̄_k = (1/n) ∑_{j=1}^{n} x_{jk} , k = 1, 2, · · · , p. (1)
The sample variance and the sample covariance are correspondingly
s_kk = (1/n) ∑_{j=1}^{n} (x_{jk} − x̄_k)² and s_ik = (1/n) ∑_{j=1}^{n} (x_{ji} − x̄_i)(x_{jk} − x̄_k), i, k = 1, 2, · · · , p.
Covariance reduces to sample variance when i = k. If large values for one variable are observed in
conjunction with large values for the other variable and the small values also occur together, then
sik will be positive. If large values from one variable occur with small values for the other variable,
then sik will be negative. If there is no particular association between the pair of variables, then
sik = 0. Also remember sik = ski , ∀i, k.
Another measure of linear association between a pair of variables, i.e. correlation is independent
of any unit. Sample correlation coefficient (or Pearson’s product moment correlation coefficient) is
given as
r_ik = s_ik / ( √s_ii √s_kk ) , i, k = 1, 2, · · · , p. (5)
Note that rik = rki , ∀i, k.
These measures of mean, variance, covariance and correlation are known as descriptive statistics.
Thus we can consider the following matrices with reference to the matrix X (respectively known
as average matrix, variance-covariance matrix and correlation matrix).
X̄ = (x̄1 , x̄2 , · · · , x̄p )^T ,
S̄n =
s11 s12 · · · s1p
s21 s22 · · · s2p
· · · · · · · · · · · ·
sp1 sp2 · · · spp
and R̄ is the p × p matrix whose (i, k)-th entry is r_ik (so that every diagonal entry equals 1).
Exercise 1: (a) Represent this multivariate data in matrix form, (b) construct X̄, S̄n and R̄.¹ [Hints: R̄ = ( 1  −0.36 ; −0.36  1 ).]
Class 2:
Exercise 2: Consider the following seven pairs of measurements (x1 , x2 ):
x1 : 3   4   2   6   8   2   5
x2 : 5   5.5   4   7   10   5   7.5
Calculate the sample means x̄1 , x̄2 , the sample variances s11 , s22 and the sample covariance s12 . [Hints: X̄ = (4.28, 6.28)^T , S̄n = ( 4.20  3.70 ; 3.70  3.56 ).]
1
Lab teachers may show how to implement this problem using Python.
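Following the footnote's suggestion, here is a minimal Python sketch for Exercise 2; it is only an illustration and assumes NumPy together with the divide-by-n convention used in the text for S̄n (NumPy's bias=True option).

import numpy as np

x1 = np.array([3, 4, 2, 6, 8, 2, 5], dtype=float)
x2 = np.array([5, 5.5, 4, 7, 10, 5, 7.5])

X = np.column_stack([x1, x2])              # n x p data matrix (n = 7, p = 2)
x_bar = X.mean(axis=0)                     # sample mean vector  -> approx [4.28, 6.28]
S_n = np.cov(X, rowvar=False, bias=True)   # divisor n, as in the text -> approx [[4.20, 3.70], [3.70, 3.56]]
R = np.corrcoef(X, rowvar=False)           # sample correlation matrix

print(x_bar, S_n, R, sep="\n")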
Exercise 3: A morning newspaper lists the following used car prices for a foreign compact with
age x1 measured in years and selling prices x2 measured in thousand dollars.
x1 1 2 3 3 4 5 6 8 9 11
x2 18.95 19.00 17.95 15.54 14.00 12.95 8.94 7.49 6.00 3.99
(a) Construct a scatter plot of the data and marginal dot diagrams as well.
(b) Infer the sign of the sample covariance s12 from the scatter plot.
(c) Compute the sample means x̄1 , x̄2 and the sample variances s11 , s22 . Compute the sample
covariance s12 and the sample correlation coefficient r12 . Interpret these quantities.
If P = (x1 , x2 , · · · , xp ) is a p-dimensional point, then its Euclidean distance from the origin O = (0, 0, · · · , 0) is calculated as
d(O, P ) = √( x1² + x2² + · · · + xp² ). (8)
Straight line or Euclidean distance is unsatisfactory for most statistical purposes. This is because
each coordinate contributes equally to the calculation of Euclidean distance which is not the case
in statistical analysis. When the coordinates represent measurements that are subject to random
fluctuations of different magnitudes, it is often desirable to weight coordinates subject to a great
deal of variability less heavily than those that are not highly variable. One way to proceed is to
divide each coordinate by the sample standard deviation. Therefore upon division by the standard
deviations, we have standardized coordinates. Thus a statistical distance of the point P =
(x1 , x2 , · · · , xp ) from the origin O = (0, 0, · · · , 0) can be computed from its standardized coordinates
as
ds (O, P ) = √( x1²/s11 + x2²/s22 + · · · + xp²/spp ). (10)
Similarly the statistical distance from an arbitrary point P = (x1 , x2 ) to any fixed point Q = (y1 , y2 ) is given by
ds (P, Q) = √( (x1 − y1)²/s11 + (x2 − y2)²/s22 ). (11)
2
All the above problems can be implemented in laboratory class.
In general the statistical distance from an arbitrary point P = (x1 , x2 , · · · , xp ) to any fixed point Q = (y1 , y2 , · · · , yp ) is given as
ds (P, Q) = √( (x1 − y1)²/s11 + (x2 − y2)²/s22 + · · · + (xp − yp)²/spp ), (12)
where s11 , s22 , · · · , spp are the sample variances of the p components x1 , x2 , · · · , xp .
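As an illustration of eqs.(8) and (12), the short Python sketch below compares the Euclidean and statistical distances of a point P from a fixed point Q; the variances used are taken from Exercise 2 purely for illustration.

import numpy as np

def euclidean(p, q):
    p, q = np.asarray(p, float), np.asarray(q, float)
    return np.sqrt(np.sum((p - q) ** 2))

def statistical_distance(p, q, s):
    """s is the vector of sample variances (s_11, ..., s_pp) of the components."""
    p, q, s = np.asarray(p, float), np.asarray(q, float), np.asarray(s, float)
    return np.sqrt(np.sum((p - q) ** 2 / s))

P, Q = (3.0, 5.0), (1.0, 2.0)
s = (4.20, 3.56)                 # illustrative variances from Exercise 2
print(euclidean(P, Q), statistical_distance(P, Q, s))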
Example 1: Consider Exercise 2 above. Let the data be represented by a random variable X which takes the values of the variables x1 , x2 pairwise (or we may consider each pair as a point P ). Let the origin be denoted by O. The Euclidean distance and the statistical distance of the point P from the fixed origin are then calculated as shown above; here 4.20 and 3.56 are respectively the variances of the x1 and x2 components of the point P .
Exercise 4: Consider the following data. Considering Q to be a fixed point, find the Euclidean distance and the statistical distance of the point P from Q (for each block, apply the required distance formula).
P (x1 , x2 ): (3, 5) (4, 5.5) (2, 4) (6, 7) (8, 10) (2, 5) (5, 7.5)
Q(y1 , y2 ): (1, 2) (2.5, 3) (4, 3) (2, 1) (1, 2) (1, 0) (3, 4)
Now if in eq.(12) we set y1 = y2 = · · · = yp = 0, then the formula reduces to the form of eq.(10). If moreover s11 = s22 = · · · = spp , the Euclidean distance formula of eq.(8) is appropriate.
The above formulae are respectively known as Minkowski Metric, Canberra Metric and Czekanowski
metric.
It should, however, be remembered that eq.(12) is still not a completely general formula, as it does not cover some significant situations which may arise in statistical analysis. Let us consider the following plot. The scatter plot in fig.1 below depicts a two-dimensional situation in which the x1 measurements do not vary independently of the x2 measurements. In fact the coordinates of the pairs (x1 , x2 ) exhibit a tendency to be large or small together, and the sample correlation coefficient is positive. In such a situation a more meaningful distance measure should be used³.
Figure 1: The figure represents the movement of two data sets one designated by blue dots and
the other designated by green dots. The figure also shows that the two data sets are positively
correlated.
3
Beyond the scope of our requirement.
Class 3
Multivariate Normal Distribution:
The multivariate Gaussian (or Normal) distribution is a generalization to two or more dimensions
of the univariate Gaussian distribution, which is often characterized by its resemblance to the shape
of a bell. In fact, in either of its univariate or multivariate incarnations, it is popularly referred to
as the Bell curve. It has many applications in theoretical and applied statistical research. Although
it is well-known that real data rarely obey the dictates of the Gaussian distribution, this idealization does provide us with a useful approximation to reality.
It is known that if a real-valued univariate random variable X has the Gaussian distribution with mean µ and variance σ² (written as X ∼ N (µ, σ²)), then its density function is given by the curve
f (x|µ, σ) = (2πσ²)^{−1/2} e^{−(x−µ)²/(2σ²)} , x ∈ R, (16)
where −∞ < µ < ∞ and σ > 0. The constant multiplier c = (2πσ²)^{−1/2} ensures that the density integrates to unity over the whole real line.
Now if we have r random variables X1 , X2 , · · · Xr , each defined on the real line, we can write
them as the r− dimensional column vector, X = (X1 , X2 , · · · , Xr )T , which we henceforth call as
a “random r-vector”⁴. Also, if A is a j × j matrix and x is a j-vector, then the quadratic form is
x^T A x = ∑_{k=1}^{j} ∑_{l=1}^{j} A_{kl} x_k x_l .
A j × j matrix A is positive definite if for any j-vector x ≠ 0 the quadratic form x^T A x > 0. Equivalently, a symmetric matrix is positive definite exactly when every eigenvalue is positive.
Problem: Verify whether the given matrix A is positive definite, where A = ( 1  2 ; 2  1 ).
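A quick numerical check of this problem is sketched below, using NumPy's symmetric eigenvalue routine: a symmetric matrix is positive definite exactly when all its eigenvalues are positive.

import numpy as np

A = np.array([[1.0, 2.0],
              [2.0, 1.0]])
eigvals = np.linalg.eigvalsh(A)        # eigenvalues of the symmetric matrix A
print(eigvals, "positive definite" if np.all(eigvals > 0) else "not positive definite")
# The eigenvalues are 3 and -1, so A is not positive definite:
# e.g. x = (1, -1)^T gives x^T A x = -2 < 0.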
The random r-vector X is said to have the r-variate Gaussian (or Normal) distribution with mean r-vector µ and positive definite, symmetric r × r covariance matrix Σ if its density function is given by the curve
f (x|µ, Σ) = (2π)^{−r/2} |Σ|^{−1/2} e^{−(1/2)(x−µ)^T Σ^{−1} (x−µ)} , x ∈ R^r . (17)
We write X ∼ N_r (µ, Σ) when we mean that X has the above r-variate Gaussian (or Normal) distribution. If Σ is singular then, almost surely, X lives on some reduced-dimensionality hyperplane, so that its density function does not exist; in that case we say that X has a singular Gaussian or singular Normal distribution.
4
AT is the transpose of the matrix A
A result due to Cramér and Wold states that the distribution of a random r-vector X is completely determined by its one-dimensional linear projections α^T X, for all r-vectors α. This result allows us to make a more useful definition of the multivariate Gaussian distribution: the random r-vector X has the multivariate Gaussian distribution if and only if every linear function of X has the univariate Gaussian distribution.
When Σ = σ²I_r , the contours of the density are given by (x − µ)^T (x − µ) = a², and the density is termed a spherical Gaussian density because this is the equation of an r-dimensional sphere centred at µ. In general, the equation (x − µ)^T Σ^{−1} (x − µ) = a² is an ellipsoid centred at µ, with Σ determining its orientation and shape, and the multivariate Gaussian density function is constant along these ellipsoids.
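The following short sketch evaluates the density of eq.(17) directly for an assumed mean vector and covariance matrix and cross-checks the value against SciPy's multivariate_normal; both the parameters and the test point are illustrative assumptions.

import numpy as np
from scipy.stats import multivariate_normal

def mvn_density(x, mu, Sigma):
    r = len(mu)
    diff = x - mu
    quad = diff @ np.linalg.inv(Sigma) @ diff          # (x - mu)^T Sigma^{-1} (x - mu)
    norm_const = (2 * np.pi) ** (-r / 2) * np.linalg.det(Sigma) ** (-0.5)
    return norm_const * np.exp(-0.5 * quad)

mu = np.array([0.0, 1.0])                              # assumed mean vector
Sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])                         # assumed covariance matrix
x = np.array([0.5, 0.5])
print(mvn_density(x, mu, Sigma), multivariate_normal(mu, Sigma).pdf(x))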
When r = 2, the multivariate Gaussian density can be written out explicitly. Suppose
X = (X1 , X2 )^T ∼ N2 (µ, Σ), (21)
where
µ = (µ1 , µ2 )^T , Σ = ( σ11²  σ12 ; σ21  σ22² ) = ( σ11²  ρσ11σ22 ; ρσ11σ22  σ22² ), (22)
σ11² is the variance of X1 , σ22² is the variance of X2 , and
ρ = cov(X1 , X2 ) / ( √V (X1) √V (X2) ) = σ12 / (σ11 σ22) (23)
is the correlation between X1 and X2 .⁵ It follows from eq.(22) that the determinant of Σ is |Σ| = σ11² σ22² (1 − ρ²), so that the bivariate Gaussian density takes the form
f (x1 , x2 ) = ( 2π σ11 σ22 √(1 − ρ²) )^{−1} e^{−Q/2} , (26)
where
Q = 1/(1 − ρ12²) [ ((x1 − µ1)/σ11)² − 2ρ12 ((x1 − µ1)/σ11)((x2 − µ2)/σ22) + ((x2 − µ2)/σ22)² ] , (27)
with ρ12 ≡ ρ.
5
V (X) is the variance of X and cov(X1 , X2 ) is the covariance between X1 and X2 .
If X1 and X2 are uncorrelated, ρ = 0, and the middle term in the expression (27) vanishes. In that
case, the bivariate Gaussian density function reduces to the product of two univariate Gaussian
densities,
f (x1 , x2 |µ1 , µ2 , σ11² , σ22²) = (2π σ11 σ22)^{−1} e^{−(x1−µ1)²/(2σ11²)} e^{−(x2−µ2)²/(2σ22²)}
= f (x1 |µ1 , σ11²) f (x2 |µ2 , σ22²). (28)
Figure 2: The figure represents the movement of two data sets one designated by blue dots and
the other designated by green dots. The figure also shows that the two data sets are positively
correlated.
Problem: If the exponent of e in a bivariate normal density is −(1/54)(x² + 4y² + 2xy + 2x + 8y + 4), find σ1 , σ2 and ρ, given that µ1 = 0 and µ2 = −1.
Sol:
−(1/54)(x² + 4y² + 2xy + 2x + 8y + 4) = −1/(2(1 − ρ²)) [ x²/σ1² − 2ρ x(y + 1)/(σ1 σ2) + (y + 1)²/σ2² ]
= −1/(2(1 − ρ²)σ1²) [ x² − 2ρ (σ1/σ2) x(y + 1) + (σ1²/σ2²)(y + 1)² ]. (29)
Comparing coefficients,
2(1 − ρ²)σ1² = 54,
2ρ (σ1/σ2) = −2,
σ1²/σ2² = 4. (30)
Solving the set of eqs.(30) we get σ2 = 3, σ1 = 6 and ρ = −0.5.⁶
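The worked solution can be verified symbolically; the sketch below (using SymPy) substitutes µ1 = 0, µ2 = −1, σ1 = 6, σ2 = 3 and ρ = −1/2 into the bivariate normal exponent and confirms that it reduces to the given expression.

import sympy as sp

x, y = sp.symbols('x y')
mu1, mu2, s1, s2, rho = 0, -1, 6, 3, sp.Rational(-1, 2)

# Standard bivariate-normal quadratic form with the fitted parameters
Q = (((x - mu1) / s1) ** 2
     - 2 * rho * ((x - mu1) / s1) * ((y - mu2) / s2)
     + ((y - mu2) / s2) ** 2)
exponent = -Q / (2 * (1 - rho ** 2))

target = -(x**2 + 4*y**2 + 2*x*y + 2*x + 8*y + 4) / 54
print(sp.simplify(exponent - target))   # prints 0, confirming the match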
Definition: If X and Y are continuous random variables and f (x, y) is the value of their joint probability density at (x, y), the function given by
g(x) = ∫_{−∞}^{∞} f (x, y) dy, −∞ < x < ∞, (31)
is called the marginal density of X at x.⁷
Now we know that a pair of random variables X and Y have a bivariate normal distribution if their joint density function is of the form
f (x, y) = exp[ −1/(2(1 − ρ12²)) { ((x − µ1)/√σ11)² − 2ρ12 ((x − µ1)/√σ11)((y − µ2)/√σ22) + ((y − µ2)/√σ22)² } ] / ( 2π √σ11 √σ22 √(1 − ρ12²) ), (33)
for −∞ < x < ∞, −∞ < y < ∞, σ1 , σ2 > 0 and −1 < ρ < 1. Eq.(33) is similar to that of eq.(26). For simplicity we denote ρ12 by ρ and √σ11 , √σ22 simply by σ1 and σ2 respectively⁸. Then the marginal density of X is given by
g(x) = [ e^{−((x−µ1)/σ1)² / (2(1−ρ²))} / ( 2π σ1 σ2 √(1 − ρ²) ) ] ∫_{−∞}^{∞} e^{−[ ((y−µ2)/σ2)² − 2ρ ((x−µ1)/σ1)((y−µ2)/σ2) ] / (2(1−ρ²))} dy. (34)
6
One can take reciprocal of σ2 as common factor also. If we don’t take reciprocal of σ1 or σ2 as common factor,
then ρ adjustment can’t be done in this problem.
7
For discrete case however, integration is replaced by summation.
8
We generally denote standard deviation of a variable by σ. In this note however by σ11 we mean variance of the
variable from mean.
Substituting u = (x − µ1)/σ1 and v = (y − µ2)/σ2 , so that σ1 du = dx and σ2 dv = dy, we get from eq.(34)
g(x) = [ e^{−u²/(2(1−ρ²))} / ( 2π σ1 σ2 √(1 − ρ²) ) ] ∫_{−∞}^{∞} e^{−[v² − 2ρuv]/(2(1−ρ²))} σ2 dv
= [ e^{−u²/(2(1−ρ²))} / ( 2π σ1 √(1 − ρ²) ) ] ∫_{−∞}^{∞} e^{−[(v−ρu)² − (ρu)²]/(2(1−ρ²))} dv
= [ e^{−u²/(2(1−ρ²))} / ( 2π σ1 √(1 − ρ²) ) ] ∫_{−∞}^{∞} e^{−(v−ρu)²/(2(1−ρ²))} e^{(ρu)²/(2(1−ρ²))} dv
= [ 1 / ( 2π σ1 √(1 − ρ²) ) ] ∫_{−∞}^{∞} e^{−(1/2)((v−ρu)/√(1−ρ²))²} e^{−u²(1−ρ²)/(2(1−ρ²))} dv
= [ e^{−u²/2} / ( √(2π) σ1 ) ] [ 1 / ( √(2π) √(1 − ρ²) ) ] ∫_{−∞}^{∞} e^{−(1/2)((v−ρu)/√(1−ρ²))²} dv
= [ e^{−u²/2} / ( √(2π) σ1 ) ] [ ( 1 / √(2π(1 − ρ²)) ) ∫_{−∞}^{∞} e^{−(1/2)((v−ρu)/√(1−ρ²))²} dv ]. (35)
Identifying the quantity in brackets in eq.(35) as the integral of a normal density over the whole real line, and hence equating it to 1, we get
g(x) = e^{−u²/2} / ( σ1 √(2π) ) = e^{−(1/2)((x−µ1)/σ1)²} / ( σ1 √(2π) ) , −∞ < x < ∞. (36)
Hence marginal density of X is normal distribution with mean µ1 and standard deviation σ1 . By
symmetry one can easily show that the marginal density of Y is normal distribution with mean µ2
and standard deviation σ2 .
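A quick Monte Carlo illustration of this result is sketched below: sampling from a bivariate normal with assumed (purely illustrative) parameters, the first coordinate behaves like a N (µ1 , σ1²) sample.

import numpy as np

rng = np.random.default_rng(0)
mu1, mu2, s1, s2, rho = 1.0, -2.0, 2.0, 1.5, 0.7     # illustrative assumptions
mean = [mu1, mu2]
cov = [[s1**2, rho * s1 * s2],
       [rho * s1 * s2, s2**2]]

xy = rng.multivariate_normal(mean, cov, size=200_000)
x = xy[:, 0]
print(x.mean(), x.std())     # close to mu1 = 1.0 and sigma1 = 2.0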
Definition: If f (x, y) is the value of the joint density of the continuous random variables X and Y at (x, y), and h(y) is the value of the marginal density of Y at y, the function given by
f (x|y) = f (x, y)/h(y) , h(y) ≠ 0, (37)
for −∞ < x < ∞, is called the conditional density of X given Y = y. Correspondingly, if g(x) is the value of the marginal density of X at x, the function given by
w(y|x) = f (x, y)/g(x) , g(x) ≠ 0, (38)
for −∞ < y < ∞, is called the conditional density of Y given X = x.
Theorem: If X and Y have a bivariate normal distribution, the conditional density of Y given X = x is a normal distribution with mean µ_{Y|x} = µ2 + ρ (σ2/σ1)(x − µ1) and variance σ²_{Y|x} = σ2²(1 − ρ²), and the conditional density of X given Y = y is a normal distribution with mean µ_{X|y} = µ1 + ρ (σ1/σ2)(y − µ2) and variance σ²_{X|y} = σ1²(1 − ρ²).
Proof: We have, with u = (x − µ1)/σ1 and v = (y − µ2)/σ2 ,
w(y|x) = f (x, y)/g(x)
= [ (1/(2π σ1 σ2 √(1 − ρ²))) e^{−[u² − 2ρuv + v²]/(2(1−ρ²))} ] / [ e^{−u²/2} / ( √(2π) σ1 ) ]
= ( 1/(√(2π) σ2 √(1 − ρ²)) ) e^{−[u² − 2ρuv + v²]/(2(1−ρ²))} e^{u²/2} . (39)
Now
e^{−[u² − 2ρuv + v²]/(2(1−ρ²))} e^{u²/2} = e^{−(v−ρu)²/(2(1−ρ²))} e^{(ρu)²/(2(1−ρ²))} e^{−u²/(2(1−ρ²))} e^{u²/2} . (40)
The right-hand side of eq.(40) is obtained by writing u² − 2ρuv + v² as (v − ρu)² − (ρu)² + u². After a little simplification of eq.(40) we can write
e^{−[u² − 2ρuv + v²]/(2(1−ρ²))} e^{u²/2} = e^{−(v−ρu)²/(2(1−ρ²))} = e^{−(1/2)[ (v−ρu)/√(1−ρ²) ]²} . (41)
Hence w(y|x) = ( 1/(√(2π) σ2 √(1 − ρ²)) ) e^{−(1/2)[ (v−ρu)/√(1−ρ²) ]²}, which, written in terms of y, is the density of a normal distribution with mean µ2 + ρ(σ2/σ1)(x − µ1) and variance σ2²(1 − ρ²). The corresponding result for the conditional density of X given Y = y follows by symmetry.
The r × r covariance matrix of X is given by
Σ_XX = cov(X, X) = E{(X − µ_X)(X − µ_X)^T }
= E{(X1 − µ1 , · · · , Xr − µr )^T (X1 − µ1 , · · · , Xr − µr )}
=
σ11² σ12 · · · σ1r
σ21 σ22² · · · σ2r
· · · · · · · · · · · ·
σr1 σr2 · · · σrr² , (48)
where σii² = V (Xi) and σij = cov(Xi , Xj ) for i ≠ j.
Suppose we have two random vectors, X and Y, where X has r components and Y has s compo-
nents. Let Z be the random r + s vector,
Z = ( X
      Y ) . (51)
Its mean vector and covariance matrix may be partitioned accordingly, where
Σ_XY = cov(X, Y) = E{(X − µ_X)(Y − µ_Y)^T } = Σ^T_YX (54)
is an r × s matrix.
If Y is linearly related to X in the sense that Y = AX + b, where A is a fixed s × r matrix
and b is a fixed s− vector, then the mean vector and covariance matrix of Y are given by
µ_Y = A µ_X + b, (55)
Σ_YY = A Σ_XX A^T , (56)
respectively.
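A small numerical sketch of eqs.(55)-(56) is given below; the matrices A, b, µ_X and Σ_XX are illustrative assumptions.

import numpy as np

mu_X = np.array([1.0, 2.0, 0.0])                     # assumed mean of X
Sigma_XX = np.array([[2.0, 0.3, 0.0],
                     [0.3, 1.0, 0.2],
                     [0.0, 0.2, 1.5]])               # assumed covariance of X
A = np.array([[1.0, -1.0, 0.0],
              [0.0,  2.0, 1.0]])                     # fixed s x r matrix (s = 2, r = 3)
b = np.array([0.5, -1.0])

mu_Y = A @ mu_X + b                                  # eq.(55)
Sigma_YY = A @ Sigma_XX @ A.T                        # eq.(56)
print(mu_Y, Sigma_YY, sep="\n")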
Problem: Show that the moment generating function of the normal distribution is M_X (t) = e^{µt + σ²t²/2} .
Problem: Show that if X and Y are standard normal variates with correlation coefficient ρ be-
tween them, then the correlation coefficient between X 2 and Y 2 is ρ2 .
Sol: Since X and Y are standard normal variates we have E(X) = E(Y ) = 0 and V (X) = V (Y ) = 1. Now V (X) = E(X²) − (E(X))² = 1 and similarly V (Y ) = E(Y ²) − (E(Y ))² = 1, so that E(X²) = 1 and E(Y ²) = 1. It is also given that ρ is the correlation coefficient between X and Y (whose computational formula is given in eq.(23)), i.e.
ρ_XY = Cov(X, Y ) / ( √V (X) √V (Y ) ), (61)
where Cov(X, Y ) = E[XY ] − E[X]E[Y ] and V (X), V (Y ) are defined as usual. Then
ρ_{X²Y²} = Cov(X², Y ²) / ( √V (X²) √V (Y ²) )
= ( E[X²Y ²] − E[X²]E[Y ²] ) / ( √( E((X²)²) − (E(X²))² ) √( E((Y ²)²) − (E(Y ²))² ) )
= ( E[X²Y ²] − E(X²)E(Y ²) ) / ( √( E(X⁴) − 1 ) √( E(Y ⁴) − 1 ) ). (62)
Now, as the variables X and Y have means 0 and variances 1, their joint moment generating function is
M_{X,Y} (t1 , t2 ) = exp[ (1/2)(t1² + t2² + 2ρ t1 t2) ]
= 1 + (1/2)(t1² + t2² + 2ρ t1 t2) + (1/2!)[ (1/2)(t1² + t2² + 2ρ t1 t2) ]² + (1/3!)[ (1/2)(t1² + t2² + 2ρ t1 t2) ]³ + · · · . (63)
From eq.(63), E(X²Y ²) is the coefficient of (t1²/2!)(t2²/2!), which is 2ρ² + 1; the coefficient of t1⁴/4! is E(X⁴), whose value is 3, while the coefficient of t2⁴/4! is E(Y ⁴), whose value is also 3. Substituting these values in eq.(62) we get
ρ_{X²Y²} = (2ρ² + 1 − 1) / ( √(3 − 1) √(3 − 1) ) = ρ². Q.E.D.
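The result just proved can also be checked by simulation; the sketch below uses an assumed value ρ = 0.6 and compares the sample correlation of X² and Y² with ρ².

import numpy as np

rng = np.random.default_rng(1)
rho = 0.6                                            # illustrative assumption
cov = [[1.0, rho], [rho, 1.0]]
xy = rng.multivariate_normal([0.0, 0.0], cov, size=500_000)
x2, y2 = xy[:, 0] ** 2, xy[:, 1] ** 2
print(np.corrcoef(x2, y2)[0, 1], rho ** 2)           # both approximately 0.36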
Problem: The variables X and Y with zero means and standard deviations σ1 and σ2 are normally correlated with correlation coefficient ρ. Show that U and V defined as U = X/σ1 + Y/σ2 and V = X/σ1 − Y/σ2 are independent normal variates with variances 2(1 + ρ) and 2(1 − ρ) respectively.
Sol: From eqs.(26) and (27) we see that the probability function (for the bivariate normal distribution) is
dF (x, y) = ( 1/(2π σ1 σ2 √(1 − ρ²)) ) exp[ −1/(2(1 − ρ²)) { (x/σ1)² + (y/σ2)² − 2ρ (x/σ1)(y/σ2) } ] dx dy, (64)
where −∞ < x, y < ∞ (the means being zero here). Now put u = x/σ1 + y/σ2 and v = x/σ1 − y/σ2 . Solving these two equations we get
x = (σ1/2)(u + v),
y = (σ2/2)(u − v). (66)
The Jacobian is therefore calculated as
J = ∂(x, y)/∂(u, v) = | ∂x/∂u  ∂x/∂v ; ∂y/∂u  ∂y/∂v | = | σ1/2  σ1/2 ; σ2/2  −σ2/2 | = −σ1σ2/2. (67)
Also
x²/σ1² + y²/σ2² = (u² + v²)/2, (68)
and
xy = (σ1σ2/4)(u² − v²). (69)
Therefore, by the transformation of random variables (using |J| = σ1σ2/2),
dF (u, v) = ( 1/(2π σ1 σ2 √(1 − ρ²)) ) exp[ −1/(2(1 − ρ²)) { (u² + v²)/2 − (2ρ/(σ1σ2)) (σ1σ2/4)(u² − v²) } ] (σ1σ2/2) du dv. (70)
Simplifying eq.(70),
dF (u, v) = [ ( 1/(√(2π) √(2(1 + ρ))) ) exp( −u²/(2 · 2(1 + ρ)) ) du ] [ ( 1/(√(2π) √(2(1 − ρ))) ) exp( −v²/(2 · 2(1 − ρ)) ) dv ]
= f1 (u) du f2 (v) dv. (71)
Hence U and V are independently distributed, U as N (0, 2(1 + ρ)) and V as N (0, 2(1 − ρ)). Q.E.D.
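A simulation sketch of this result is given below (the values of σ1, σ2 and ρ are illustrative assumptions): U and V come out with variances close to 2(1 + ρ) and 2(1 − ρ) and essentially zero correlation.

import numpy as np

rng = np.random.default_rng(2)
s1, s2, rho = 2.0, 3.0, 0.4                          # illustrative assumptions
cov = [[s1**2, rho * s1 * s2], [rho * s1 * s2, s2**2]]
xy = rng.multivariate_normal([0.0, 0.0], cov, size=500_000)
u = xy[:, 0] / s1 + xy[:, 1] / s2
v = xy[:, 0] / s1 - xy[:, 1] / s2
print(u.var(), 2 * (1 + rho))     # ~2.8
print(v.var(), 2 * (1 - rho))     # ~1.2
print(np.corrcoef(u, v)[0, 1])    # ~0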
Sol. From the expression 3X1 − 2X2 + X3 , it is clear that a⃗′ = [3, −2, 1], so that 3X1 − 2X2 + X3 = a⃗′ X⃗ , where X⃗ = [X1 , X2 , X3 ]^T . Hence it is easy to see that a⃗′ µ⃗ = 11. Also it is straightforward to see that a⃗′ Σ⃗ a⃗ = 7. Hence the distribution of 3X1 − 2X2 + X3 is N (11, 7).
Results:
(i) If X⃗1 (of order q1 × 1) and X⃗2 (of order q2 × 1) are independent, then Cov(X⃗1 , X⃗2 ) = 0⃗ (a q1 × q2 matrix of zeroes).
(ii) If ( X⃗1 ; X⃗2 ) is N_{q1+q2} ( ( µ⃗1 ; µ⃗2 ), ( Σ11  Σ12 ; Σ21  Σ22 ) ), then X⃗1 and X⃗2 are independent if and only if Σ12 = 0 = Σ21 .
(iii) If X⃗1 and X⃗2 are independent and are distributed as N_{q1} (µ⃗1 , Σ11 ) and N_{q2} (µ⃗2 , Σ22 ) respectively, then ( X⃗1 ; X⃗2 ) has the multivariate normal distribution N_{q1+q2} ( ( µ⃗1 ; µ⃗2 ), ( Σ11  0 ; 0  Σ22 ) ).
Problem: Let X⃗ be N3 (µ⃗, Σ⃗) with µ⃗′ = [−3, 1, 4] and
Σ⃗ = ( 1  −2  0 ; −2  5  0 ; 0  0  2 ).
(a) Are X1 and X2 independent? (b) Are X2 and X3 independent? (c) Are (X1 , X2 ) and X3 independent? (d) Are (X1 + X2)/2 and X3 independent? (e) Are X2 and X2 − (5/2)X1 − X3 independent?
Sol: (a) From the given variance-covariance matrix, σ12 = −2 = σ21 ≠ 0, hence X1 and X2 are not independent. (b) Also σ23 = 0 = σ32 = Cov(X2 , X3 ), so X2 and X3 are independent. (c) (Hints:) Represent the variance-covariance matrix as a block matrix and the result will follow. (d) and (e) require the following theorem.
Theorem: Let X⃗1 , X⃗2 , · · · , X⃗n be mutually independent with X⃗j distributed as Np (µ⃗j , Σ⃗) [each X⃗j has the same covariance matrix Σ⃗]. Then V1 = c1 X⃗1 + c2 X⃗2 + · · · + cn X⃗n is distributed as Np ( ∑_{j=1}^{n} cj µ⃗j , ( ∑_{j=1}^{n} cj² ) Σ⃗ ). Moreover, V1 and V2 = b1 X⃗1 + b2 X⃗2 + · · · + bn X⃗n are jointly multivariate normal with covariance matrix
( ( ∑_{j=1}^{n} cj² ) Σ⃗   (b⃗′ c⃗) Σ⃗ ; (b⃗′ c⃗) Σ⃗   ( ∑_{j=1}^{n} bj² ) Σ⃗ ).
Consequently V1 and V2 are independent if b⃗′ c⃗ = ∑_{j=1}^{n} cj bj = 0.
Now by the above theorem we have two linear combinations, viz. V1 = (1/2)X1 + (1/2)X2 + 0 · X3 and V2 = 0 · X1 + 0 · X2 + 1 · X3 . Now b⃗′ c⃗ = [1/2, 1/2, 0] [0, 0, 1]^T = 0, so V1 and V2 are independent.
Again take U1 = 0 · X1 + 1 · X2 + 0 · X3 and U2 = −(5/2)X1 + 1 · X2 + (−1)X3 . As before, b⃗′ c⃗ = [0, 1, 0] [−5/2, 1, −1]^T = 1 ≠ 0. Hence X2 and X2 − (5/2)X1 − X3 are not independent.
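Parts (d) and (e) can also be checked directly with the standard rule Cov(a⃗′X⃗, b⃗′X⃗) = a⃗′ Σ⃗ b⃗, a route different from the theorem quoted above but leading to the same conclusions; a short sketch follows.

import numpy as np

Sigma = np.array([[ 1.0, -2.0, 0.0],
                  [-2.0,  5.0, 0.0],
                  [ 0.0,  0.0, 2.0]])

a = np.array([0.5, 0.5, 0.0])          # coefficients of (X1 + X2)/2
b = np.array([0.0, 0.0, 1.0])          # coefficients of X3
print(a @ Sigma @ b)                   # 0  -> independent (jointly normal)

c = np.array([0.0, 1.0, 0.0])          # coefficients of X2
d = np.array([-2.5, 1.0, -1.0])        # coefficients of X2 - (5/2) X1 - X3
print(c @ Sigma @ d)                   # 10 != 0 -> not independent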
0.2.2 What is Simple Linear Regression?
Regression analysis is used to model and explore empirical relationships between variables. The
relationships are non-deterministic in manner. Two types of variables are considered in regression
analysis, one is (a) predictor or regressor (we often call it independent variable) and another is (b)
dependent or response variable. Suppose in a locality you have considered various flats. You want
to find whether there is any relationship between square feet areas of these flats and the electricity
consumptions in each of these flats by the respective occupants. If there is any such relationship,
then it is obviously a non-deterministic one. A particular flat with a large square feet area may have less electricity consumption than a flat with a small square feet area, and vice-versa.
In simple linear regression analysis there is a single regressor variable which we denote by x and one
response variable, denoted by Y . Suppose that a true relationship between x and Y is a straight
line and that the observation Y at each level of x is a random variable. This supposition is based
upon the scatter diagram of the bi-variate distribution. Let us consider the following case. In a
chemical distillation process one is interested in finding the non-deterministic relationship between the hydrocarbon level (%) and the purity (%). The chart is shown below. Let us draw the scatter diagram taking the values of the regressor and the response from Table 1; it is shown in figure 1. From the figure it is justified that the possible non-deterministic relationship between regressor and response is linear in nature. We now assume that the expected value of Y for each value of x is described by the model
Y = β0 + β1 x + ϵ, (72)
where ϵ is a random error with mean zero and (unknown) variance σ², and β0 and β1 are the regression coefficients. The estimation of these regression coefficients is done by the method of least squares. For n pairs of observations (x1 , y1 ), (x2 , y2 ), · · · , (xn , yn ), eq.(72) is re-written as
yi = β0 + β1 xi + ϵi , i = 1, 2, · · · , n. (73)
The sum of the squares of the deviations of the observations from the true regression line is
L = ∑_{i=1}^{n} ϵi² = ∑_{i=1}^{n} (yi − β0 − β1 xi )². (74)
Let the estimators of β0 and β1 be denoted by β̂0 and β̂1 . For these two estimators to be the least squares estimators the following conditions must hold:
∂L/∂β0 |_{β̂0 ,β̂1} = −2 ∑_{i=1}^{n} (yi − β̂0 − β̂1 xi ) = 0,
∂L/∂β1 |_{β̂0 ,β̂1} = −2 ∑_{i=1}^{n} (yi − β̂0 − β̂1 xi ) xi = 0. (75)
Simplifying, we obtain the pair of equations
n β̂0 + β̂1 ∑_{i=1}^{n} xi = ∑_{i=1}^{n} yi , β̂0 ∑_{i=1}^{n} xi + β̂1 ∑_{i=1}^{n} xi² = ∑_{i=1}^{n} xi yi , (76)
which are known as the least squares normal equations. The solution of the normal equations gives the least squares estimators β̂0 and β̂1 . Thus the least squares estimates of the intercept β0 and the slope β1 in the simple linear regression model are
β̂0 = ȳ − β̂1 x̄, (77)
β̂1 = [ ∑_{i=1}^{n} yi xi − ( ∑_{i=1}^{n} yi )( ∑_{i=1}^{n} xi )/n ] / [ ∑_{i=1}^{n} xi² − ( ∑_{i=1}^{n} xi )²/n ], (78)
where ȳ = ( ∑_{i=1}^{n} yi )/n and x̄ = ( ∑_{i=1}^{n} xi )/n. The numerator of β̂1 is often denoted by Sxy while the denominator is denoted by Sxx , so that β̂1 = Sxy /Sxx .
Corresponding to the data given in Table 1, and looking at the requirements of eqs.(77)-(78), the following table should be constructed. Using Table 2 the values of the estimates are β̂0 = 74.28341 and β̂1 = 14.9474, and the fitted regression line is ŷ = 74.28341 + 14.9474 x.⁹ The fitted regression line is shown below in figure 2. One can easily see from these plots that some of the scattered points (which are the observed values) fall on the fitted straight line while others fall above or below it.
⁹ See Appendix for R codes.
Table 2: Hydrocarbon level vs Purity in a chemical distillation process
Observation Number   Hydrocarbon Level xi (%)   Purity yi (%)   xi yi   xi²
1 0.99 90.01 89.1099 0.9801
2 1.02 89.05 90.831 1.0404
3 1.15 91.43 105.1445 1.3225
4 1.29 93.74 120.9246 1.6641
5 1.46 96.73 141.2258 2.1316
6 1.36 94.45 128.452 1.8496
7 0.87 87.59 76.2033 0.7569
8 1.23 91.77 112.8771 1.5129
9 1.55 99.42 154.101 2.4025
10 1.40 93.65 131.11 1.96
11 1.19 93.54 111.3126 1.4161
12 1.15 92.52 106.398 1.3225
13 0.98 90.56 88.7488 0.9604
14 1.01 89.54 90.4354 1.0201
15 1.11 89.85 99.7335 1.2321
16 1.20 90.39 108.468 1.44
17 1.26 93.25 117.495 1.5876
18 1.32 93.41 123.3012 1.7424
19 1.43 94.98 135.8214 2.0449
20 0.95 87.33 82.9635 0.9025
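The estimates quoted above can be reproduced from Table 2 with a few lines of Python applying eqs.(77)-(80); this is only a sketch of the hand computation, not the R code referred to in the appendix.

import numpy as np

x = np.array([0.99, 1.02, 1.15, 1.29, 1.46, 1.36, 0.87, 1.23, 1.55, 1.40,
              1.19, 1.15, 0.98, 1.01, 1.11, 1.20, 1.26, 1.32, 1.43, 0.95])
y = np.array([90.01, 89.05, 91.43, 93.74, 96.73, 94.45, 87.59, 91.77, 99.42, 93.65,
              93.54, 92.52, 90.56, 89.54, 89.85, 90.39, 93.25, 93.41, 94.98, 87.33])

n = len(x)
Sxy = np.sum(x * y) - np.sum(x) * np.sum(y) / n      # numerator of eq.(78)
Sxx = np.sum(x**2) - np.sum(x)**2 / n                # denominator of eq.(78)
beta1 = Sxy / Sxx
beta0 = y.mean() - beta1 * x.mean()                  # eq.(77)

resid = y - (beta0 + beta1 * x)
sigma2 = np.sum(resid**2) / (n - 2)                  # SSE/(n-2), eq.(80)
print(beta0, beta1, sigma2)                          # ~74.283, ~14.947, ~1.18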
We next calculate the deviations by considering the error sum of squares, which is
SSE = ∑_{i=1}^{n} ei² = ∑_{i=1}^{n} (yi − ŷi )². (79)
Dividing SSE by n − 2 gives the best estimate of σ². Hence the unbiased estimator of σ² is
σ̂² = SSE /(n − 2). (80)
Another table is now constructed below which calculates the right hand side of eq.(79). From Table 3 we get the value SSE = ∑_{i=1}^{n} (yi − ŷi )² = 21.24982. Therefore σ̂² = SSE /(20 − 2) = 1.180545. The residual standard error at n − 2 degrees of freedom is the square root of σ̂², which in this case is approximately 1.0865.
The following figure shows the plot of all the scattered points and how they deviate from the regression line.
It is a known fact that population parameters cannot be determined exactly, and so we estimate them by sample statistics. A sample statistic θ̂ is an unbiased estimator of a population parameter θ if E(θ̂) = θ. In this case the estimators β̂0 and β̂1 of β0 and β1 satisfy the following relations.
Table 3: Error estimation Hydrocarbon level vs Purity in a chemical distillation process
Observation Number   ŷi   (yi − ŷi )²
1 89.08134 0.862417
2 89.52976 0.230168
3 91.47292 0.001842
4 93.56556 0.030431
5 96.10661 0.38861
6 94.61187 0.026203
7 87.28765 0.091417
8 92.66871 0.807683
9 97.45188 3.873496
10 95.20977 2.432882
11 92.07082 2.158502
12 91.47292 1.096377
13 88.93186 2.650833
14 89.38028 0.025509
15 90.87502 1.050674
16 92.22029 3.349961
17 93.11713 0.017653
18 94.01398 0.364789
19 95.65819 0.459944
20 88.48344 1.330424
E(β̂1 ) = β1 ,
E(β̂0 ) = β0 ,
V (β̂1 ) = σ²/Sxx ,
V (β̂0 ) = σ² [ 1/n + x̄²/Sxx ]. (81)
It can also be shown that the covariance of β̂0 and β̂1 is given by
Cov(β̂0 , β̂1 ) = −σ² x̄ /Sxx . (82)
Problem: With reference to the hydrocarbon-purity problem, determine Cov(β̂0 , β̂1 ). [Ans: −2.07]
Problem: The number of pounds of steam used per month by a chemical plant is thought to be related to the average ambient temperature (in degrees Fahrenheit) for that month. The past year's usage and temperature are shown in the following table: (a) Assuming that a simple linear regression model is appropriate, fit the regression model relating steam usage (Y ) to the average temperature (x). What is the estimate of σ²? (b) What is the estimate of the expected steam usage when the average temperature is 55°F? (c) Calculate all the fitted values and residuals. (d) Form a scatter plot and show how the observed values deviate from the fitted line (using R).
Table 4: Usage vs Temperature Table
Months Temp. Usage/1000 Months Temp. Usage/1000
Jan. 21 185.79 Jul. 68 621.55
Feb. 24 214.47 Aug. 74 675.06
Mar. 32 288.03 Sep. 62 562.03
Apr. 47 424.84 Oct. 50 452.93
May 50 454.58 Nov. 41 369.95
Jun. 59 539.03 Dec. 30 273.98
(e) Suppose the monthly average temperature is 47°F. Calculate the fitted value of y and the corresponding residual.
To test the significance of the slope we consider the hypotheses H0 : β1 = β1,0 , H1 : β1 ≠ β1,0 , and in particular
H0 : β1 = 0, H1 : β1 ≠ 0. (84)
To test the hypothesis one uses a t-test. The test statistic is
T0 = (β̂1 − β1,0 ) / √( σ̂²/Sxx ), (85)
which, for the hypotheses in eq.(84) (i.e. β1,0 = 0), takes the observed value
t0 = β̂1 / √( σ̂²/Sxx ). (86)
With respect to the example considered above, we find β̂1 = 14.9474, n = 20, σ̂² = 1.180545 and Sxx = ∑_{i=1}^{n} xi² − ( ∑_{i=1}^{n} xi )²/n = 0.68088, so that using eq.(86) the test statistic value is 11.35166777 (approx.). Similarly, to test the hypothesis
H0 : β0 = 0, H1 : β0 ≠ 0, (89)
the test statistic is given by
t0 = β̂0 / √( σ̂² [ 1/n + x̄²/Sxx ] ). (90)
Again considering the above example with Table 2 and eq.(90), we get the value of the test statistic as 46.61729725 (approx.). Using statistical software such as R the P-value for the test can easily be calculated. In this problem the P-value is 1.23 × 10⁻⁹. The t-test statistics given in eqs.(86) and (90) follow the t-distribution with n − 2 degrees of freedom. We would reject the null hypotheses given in eqs.(84) and (89) if |t0 | > t_{α/2, n−2} (here α is the level of significance). In the above problem, if we take α = 0.01, i.e. the 1% level of significance, from the table we find that, for 20 − 2 = 18 degrees of freedom, t_{0.005,18} = 2.88. Hence for both β0 and β1 the calculated t is greater than the tabulated value. The final decision is to reject both null hypotheses.
In addition to the point estimates of the slope and intercept, it is possible to obtain confidence interval estimates of these parameters. The width of these confidence intervals is a measure of the overall quality of the regression line. Under the assumption that the observations are normally and independently distributed, a 100(1 − α)% confidence interval on the slope β1 in simple linear regression is
β̂1 − t_{α/2, n−2} √( σ̂²/Sxx ) ≤ β1 ≤ β̂1 + t_{α/2, n−2} √( σ̂²/Sxx ), (91)
and that on the intercept β0 is
β̂0 − t_{α/2, n−2} √( σ̂² [ 1/n + x̄²/Sxx ] ) ≤ β0 ≤ β̂0 + t_{α/2, n−2} √( σ̂² [ 1/n + x̄²/Sxx ] ). (92)
Problem: Consider the hydrocarbon-purity problem considered above. Find a 95% confidence
interval on the slope of the regression line. Also find the 95% confidence interval on the inter-
cept β0 . Again find the 99% confidence interval on both intercept β0 and slope β1 . [Answer:
(12.1809, 17.7139), (11.5995, 18.2953), (11.1577, 18.7370), (10.7266, 19.5334).]
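For instance, the 95% interval on the slope can be reproduced from the summary quantities quoted in the text; the sketch below uses SciPy only to obtain the t-quantile.

from scipy import stats

beta1_hat, sigma2_hat, Sxx, n = 14.9474, 1.180545, 0.68088, 20   # values from the text
se_beta1 = (sigma2_hat / Sxx) ** 0.5
t = stats.t.ppf(1 - 0.025, df=n - 2)                             # t_{0.025, 18} ~ 2.101
print(beta1_hat - t * se_beta1, beta1_hat + t * se_beta1)        # ~ (12.18, 17.71)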
It is to be remembered that the limits shown in eqs.(91) and (92) are only calculated when β1 ≠ 0 and β0 ≠ 0 (as justified by the hypothesis testing), and that β0 and β1 here are the intercept and slope, i.e. the parameters of the linear regression model that may be fitted to the population from which the sample has been selected.
The population correlation coefficient between two jointly distributed random variables X and Y is
ρ_XY = Cov(X, Y ) / ( √V (X) √V (Y ) ), (93)
where Cov(X, Y ) is the covariance between X and Y and V (X), V (Y ) are respectively the variances of X and Y . Now if we take eq.(93) to represent the population correlation coefficient, then it is best estimated by the sample correlation coefficient, which is defined as
R_XY = ∑_{i=1}^{n} Yi (Xi − X̄) / [ ∑_{i=1}^{n} (Xi − X̄)² ∑_{i=1}^{n} (Yi − Ȳ )² ]^{1/2} . (94)
We already know from the above sections that β̂1 = Sxy /Sxx (which in terms of the random variables X and Y is β̂1 = S_XY /S_XX ). Now, as S_XX = ∑_{i=1}^{n} (Xi − X̄)², S_XY = ∑_{i=1}^{n} Yi (Xi − X̄) and SST = ∑_{i=1}^{n} (Yi − Ȳ )², eq.(94) can be re-written as
R = S_XY / ( S_XX SST )^{1/2} . (95)
Problem: Consider the problem based on hydrocarbon and purity (explained above). Defining
these two as random variables X and Y , find the correlation coefficient between them. [Answer:
0.9367154]
0.3.1 Least square estimation of the parameters:
Consider the following data. Suppose that there are n > k observations available. Let xij denote the i-th observation or level of variable xj . The observations are (xi1 , xi2 , · · · , xik , yi ), i = 1, 2, · · · , n. Each observation (xi1 , xi2 , · · · , xik , yi ) satisfies the model in eq.(98), i.e.
yi = β0 + β1 xi1 + β2 xi2 + · · · + βk xik + ϵi , i = 1, 2, · · · , n,
and the least squares function is L = ∑_{i=1}^{n} ϵi² = ∑_{i=1}^{n} ( yi − β0 − ∑_{j=1}^{k} βj xij )².
To minimize L with respect to β0 , β1 , · · · , βk , the least squares estimates of these quantities must satisfy the following conditions:
∂L/∂β0 |_{β̂0 ,β̂1 ,··· ,β̂k} = −2 ∑_{i=1}^{n} ( yi − β̂0 − ∑_{j=1}^{k} β̂j xij ) = 0,
∂L/∂βj |_{β̂0 ,β̂1 ,··· ,β̂k} = −2 ∑_{i=1}^{n} ( yi − β̂0 − ∑_{j=1}^{k} β̂j xij ) xij = 0, j = 1, 2, · · · , k. (102)
Simplifying, the least squares normal equations are
n β̂0 + β̂1 ∑_{i=1}^{n} xi1 + β̂2 ∑_{i=1}^{n} xi2 + · · · + β̂k ∑_{i=1}^{n} xik = ∑_{i=1}^{n} yi ,
β̂0 ∑_{i=1}^{n} xi1 + β̂1 ∑_{i=1}^{n} xi1² + β̂2 ∑_{i=1}^{n} xi1 xi2 + · · · + β̂k ∑_{i=1}^{n} xi1 xik = ∑_{i=1}^{n} xi1 yi ,
· · · · · · · · ·
β̂0 ∑_{i=1}^{n} xik + β̂1 ∑_{i=1}^{n} xik xi1 + β̂2 ∑_{i=1}^{n} xik xi2 + · · · + β̂k ∑_{i=1}^{n} xik² = ∑_{i=1}^{n} xik yi . (103)
These k + 1 equations are solved for the least squares estimators β̂0 , β̂1 , · · · , β̂k (as before).
The following table shows the data on pull strength of a wire bond in a semiconductor manu-
facturing process, wire length and die height to illustrate building an empirical model. We try to
show the dependency of pull strength on wire length and die height by fitting a multiple linear
regression model which is assumed to be
Y = β0 + β1 x1 + β2 x2 + ϵ, (104)
where Y = pull strength, x1 = wire length and x2 = die height. The three-dimensional scatter plot of the above data is shown in figure 3, where we consider the pull strength to be dependent on the wire length and the die height.
Solving eq.(105) for the three variables βˆ0 , βˆ1 and βˆ2 , one gets
where
c1 = ∑_{i=1}^{n} xi1² − x̄1 ∑_{i=1}^{n} xi1 ,
c2 = ∑_{i=1}^{n} xi1 xi2 − x̄2 ∑_{i=1}^{n} xi1 ,
d1 = ∑_{i=1}^{n} xi1 xi2 − x̄1 ∑_{i=1}^{n} xi2 ,
d2 = ∑_{i=1}^{n} xi2² − x̄2 ∑_{i=1}^{n} xi2 ,
c = ∑_{i=1}^{n} xi1 yi − ȳ ∑_{i=1}^{n} xi1 ,
d = ∑_{i=1}^{n} xi2 yi − ȳ ∑_{i=1}^{n} xi2 , (107)
and ȳ = ( ∑_{i=1}^{n} yi )/n, x̄1 = ( ∑_{i=1}^{n} xi1 )/n and x̄2 = ( ∑_{i=1}^{n} xi2 )/n.
Using eq.(106) we have n = total number of observations = 25, ∑ xi1 = sum of the x1 column = 206, ∑ xi2 = sum of the x2 column = 8294, ∑ xi1² = sum of the x1² column = 2396, ∑ xi2² = sum of the x2² column = 3531848, ∑ xi1 xi2 = sum of the products of the x1 and x2 columns = 77177, ∑ yi = sum of the y column = 725.82, ∑ xi1 yi = sum of the products of the x1 and y columns = 8008.47 and ∑ xi2 yi = sum of the products of the x2 and y columns = 274816.71.
Solving the set of equations given in (108) one gets β̂0 = 2.26379, β̂1 = 2.74427 and β̂2 = 0.01253, so that the regression equation is given by ŷ = 2.26379 + 2.74427 x1 + 0.01253 x2 .
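The estimates can be checked by solving the normal equations built from the column sums quoted above; the sketch below assumes the standard two-regressor normal equations and should reproduce (up to rounding) the values given in the text.

import numpy as np

n = 25
Sx1, Sx2 = 206.0, 8294.0
Sx1x1, Sx2x2, Sx1x2 = 2396.0, 3531848.0, 77177.0
Sy, Sx1y, Sx2y = 725.82, 8008.47, 274816.71

# Normal equations for y = b0 + b1*x1 + b2*x2 (standard form, an assumption about eq.(105))
A = np.array([[n,    Sx1,   Sx2],
              [Sx1,  Sx1x1, Sx1x2],
              [Sx2,  Sx1x2, Sx2x2]])
b = np.array([Sy, Sx1y, Sx2y])
beta = np.linalg.solve(A, b)
print(beta)                     # approximately [2.26379, 2.74427, 0.01253]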
In this way, if we consider the set of eqs.(103), then the fitted regression line is
ŷi = β̂0 + ∑_{j=1}^{k} β̂j xij , i = 1, 2, · · · , n. (109)
When there is more than one input variable the analysis is called multiple linear regression analysis. For validation of the model we make five key assumptions. They are:
Linear relationship
No or little multicollinearity
Multivariate normality
No auto-correlation
Homoscedasticity
1) To check whether the model satisfies linearity we first consider the bi-variate (two-variable)
distribution and draw a scatter plot of the data. By seeing the movement of the response variable
in accordance with the movement of the input variable/s it is decided whether the linearity as-
sumption is justified. If we look at the figure 3 we find that the distribution of points is no doubt
linear. At the same time, the experimenter also needs to be aware of the fact that no outlier should creep into the data. An outlier is an observation that does not appear to fit with the rest of the data. We must always get rid of any such outliers if they exist. Hence the identification of outliers is also important before we proceed with our analysis.
2) In multiple regression problems we expect to find dependencies between the response variable Y
and the regressor/s xj . In most regression problems, however it is found that there are also some
sort of dependencies among the regressor variables xj . In situations where these dependencies are
strong, we then say that multicollinearity exists. A very useful measure of multicollinearity is
variance inflation factor. The larger the variance inflation factor is, the more severe the multi-
collinearity in the system. In one of the sections next, we will discuss about how variance inflation
factor is calculated.
The total variability in the response can be partitioned as
∑_{i=1}^{n} (yi − ȳ)² = ∑_{i=1}^{n} (ŷi − ȳ)² + ∑_{i=1}^{n} (yi − ŷi )². (111)
The term on the left hand side of eq.(111) is called the total sum of squares, denoted by SST ; the first and second terms on the right hand side of eq.(111) are respectively known as the regression sum of squares (denoted by SSreg ) and the residual sum of squares or error sum of squares (denoted by SSE ). Thus eq.(111) can be re-written as
SST = SSreg + SSE . (112)
Eq.(112) is the fundamental analysis of variance identity for a regression model. Just like the partitioning of variabilities made in eq.(112), the degrees of freedom are also partitioned, as n − 1 = 1 + (n − 2). One can now use the usual ANOVA F test to test the hypothesis H0 : β1 = 0 against H1 : β1 ≠ 0. The F-statistic is given by
F = MSreg /MSE = ( SSreg /1 ) / ( SSE /(n − 2) ), (114)
provided MSreg >> MSE .¹⁶ The statistic in eq.(114) follows the F-distribution with (1, n − 2) degrees of freedom. The design of the ANOVA table is shown below. If we consider the problem of hydrocarbon level versus purity from Table 1 and manually calculate the terms SST , SSreg and SSE , we subsequently get the following ANOVA table. Now at (1, 18) degrees of freedom the tabulated F-statistic value is 4.41 at the 5% level of significance. The calculated F is much larger than the tabulated F and hence we reject the null hypothesis, and consequently conclude that the slope β1 ≠ 0. Another justification can be put in the following way. We know that the P-value is the lowest level of significance at which we reject our null hypothesis. The P-value here comes out to be 0.00000000122, which is much less than our chosen level of significance of 0.05. So we reject our null hypothesis.¹⁷
¹⁶ Remember that if MSE ≥ MSreg , then F = MSE /MSreg .
17
P − value approach is mostly applied.
0.5.1 R code for performing ANOVA (for regression model testing):
(1) Feed the .csv data to R. (2) Find the regression model by lm function and associate an object
name (say “regline”). (3) Simply use the command anova(regline).
R² = 1 − SSE /SST . (116)
Since SST is a measure of the variability in y without considering the effect of the regressor variable
x and SSE is a measure of the variability in y remaining after x has been considered, R2 is often
called the proportion of variation explained by the regressor x. Values of R2 that are close to 1
imply that most of the variability in y is explained by the regression model18 . It is also easy to show
that 0 ≤ R2 ≤ 1. In case where X and Y are jointly distributed random variable, R2 is the square
of the correlation coefficient between X and Y 19 . Now for the hydrocarbon level versus purity
problem (discussed in the table (1)), it is found that coefficient of determination R2 = 0.8774.
For the multiple linear regression model with p regressors, the test statistic F = MSreg /MSE follows the F-distribution with (p, n − p − 1) degrees of
freedom. The null hypothesis states that β1 = β2 = · · · = βp = 0 against the alternative hypothesis
simply states that at least one of the parameters βj ̸= 0, j = 1, 2, · · · , p. Large values of the test
statistic provide evidence against the null hypothesis. The F test does not indicate which of the
18
The statistic R2 should be used with caution, since it is always possible to make R2 large by adding enough
terms to the model.
19
Remember X(Y ) is the random variable, x(y) are values of the corresponding random variable.
parameters βj ≠ 0, only that at least one of them is linearly related to the response variable. The ratio SSreg /SST is known as the squared multiple correlation coefficient. This value is the proportion of the variation in the response variable that is explained by the regressor variables. The square root of R² is called the multiple correlation coefficient, the correlation between the observations yi and the fitted values ŷi .
1 Multivariate Regression
2 Discriminant Analysis:
2.1 Introduction (Linear Discriminant Analysis):
One of the most important problems in the areas of machine learning, information retrieval etc.
is the dimensionality reduction technique. The main objective of the dimensionality reduction
techniques is to reduce the dimensions by removing the redundant and dependent features by
transforming the features from a higher dimensional space to a space with lower dimensions20 . Two
techniques are mainly used for this purpose, viz. (a) unsupervised techniques and (b) supervised
techniques. Among supervised techniques the most famous is the approach of linear discriminant
analysis (abbreviated as LDA). This category of dimensionality reduction techniques is used in biometrics, bio-informatics and chemistry. There are two types of LDA techniques to deal with
classes: class independent and class dependent techniques. The difference between these two types
is that, in class independent LDA, each class is considered as a separate class against the other
classes. In this type, there is just one lower dimensional space for all classes to project their data
on it. On the other hand, in the class dependent LDA, one separate lower dimensional space is
calculated for each class to project its data on it. Although these two techniques are mostly used,
they have their own drawbacks too. In the class independent LDA approach, if different classes are
non-linearly separable, the LDA cannot discriminate between these classes. The problem is called
linearity problem. On the other hand, the class dependent LDA fails to find the lower dimensional
space if the dimensions are much higher than the number of samples in the data matrix. Thus the within class matrix becomes singular. This problem is known as the small sample size (SSS) problem.
However, the problems have their respective solutions too. While the linearity problem is tackled
by kernel function method, the SSS problem is resolved (i) by removing the null space of within
class matrix or (ii) by converting the within class matrix to a full rank matrix.
Two natural questions arise: (a) Does there exist an ordered basis β for V such that the matrix of T defined on V with respect to that basis (which we denote by [T ]β ) is a diagonal matrix? (b) If such an ordered basis exists, how can one find it? The answers to these questions lead us to the concept of eigenvalues and eigenvectors. Let us first technically define the term diagonalizability from the linear algebraic point of view.
Definition: A linear operator T defined on a finite dimensional vector space V is called diagonalizable if there exists an ordered basis β for V such that [T ]β is a diagonal matrix. A square matrix A is called diagonalizable if the left-multiplication operator LA (i.e. LA (x) = Ax) is diagonalizable.
Let us suppose that β = {v1 , v2 , · · · , vn } is the ordered basis for the vector space V and T is
the linear operator defined on V such that [T ]β is a diagonal matrix. Then for each vector vj ∈ V ,
we have
T (vj ) = ∑_{i=1}^{n} Dij vi = Djj vj = λj vj , λj = Djj . (117)
Now, for the choice of the basis β = {v1 , v2 , · · · , vn } and the linear operator as before, let T (vj ) = λj vj for some scalars λj , where j = 1, 2, · · · , n. Then by the definition of the matrix representation of the linear transformation we clearly have
[T ]β =
λ1 0 · · · 0
0 λ2 · · · 0
· · · · · · · · · · · ·
0 0 · · · λn . (118)
It is clear from the above discussion that each vector v in the basis β satisfies the condition T (v) = λv for some scalar λ. Moreover, as v lies in a basis, v is always non-zero, because a basis is a set of linearly independent vectors and a set containing the zero vector is linearly dependent. This vector v is basically the eigenvector. Let us now put forward the formal definition below.
Definition: Let T be the linear operator defined on a vector space V . A non-zero vector v ∈ V
is called an eigenvector of T if there exists a scalar λ such that T (v) = λv. The scalar λ is called
the eigenvalue corresponding to the eigenvector v. Another perspective is as follows. Let A be in Mn×n (F ). A non-zero vector v ∈ F^n is called an eigenvector of A if v is an eigenvector of LA , that is, Av = λv for some scalar λ. The scalar λ is called the eigenvalue of A corresponding to the eigenvector v.
Theorem: A linear operator T on a finite dimensional vector space V is diagonalizable if and only if there exists an ordered basis β for V consisting of eigenvectors of T . Furthermore, if T is diagonalizable, β = {v1 , v2 , · · · , vn } is an ordered basis of eigenvectors of T , and D = [T ]β , then D is a diagonal matrix and Djj is the eigenvalue corresponding to vj , for 1 ≤ j ≤ n.
It is clear from the above theorem that to diagonalize a matrix or a linear operator we need
to find a basis of eigenvectors and its corresponding eigenvalues. Below we jot down a few the-
orems/definitions and the working principle to find the eigenvalues and eigenvectors for a given
matrix or a linear operator.
Definition (eigen,a): Let A ∈ Mn×n (F ). The polynomial f (t) = det(A − tIn ) is called the
characteristic polynomial of A.
A corresponding theorem states that the eigenvalues of a matrix are precisely the zeros of its characteristic polynomial. Hence, when determining the eigenvalues of a matrix or a linear operator, we normally compute its characteristic polynomial.
Problem: Let T be a linear operator defined on P2 (R) defined by T (f (x)) = f (x) + (x + 1)f ′ (x).
Determine the eigenvalues of T .
Theorem (eigen,c): Let T be a linear operator on a vector space V , and let λ be an eigenvalue of T . A vector v ∈ V is an eigenvector of T corresponding to λ if and only if v ≠ 0 and v ∈ N (T − λI), i.e. to find the eigenvectors one solves (T − λI)v = 0. The eigenvalues themselves are found by solving the characteristic equation det(A − tIn ) = 0, whose roots are the eigenvalues.
Let us now define a linear operator T ²⁹ on the finite dimensional vector space R² by T (x, y) = (y, x). We know that the standard ordered basis β of R² is {(1, 0), (0, 1)}. We will first check how these basis vectors change under the given linear transformation. Below we write the transformation of the basis vectors (1, 0) and (0, 1) and represent the outcome in terms of (1, 0) and (0, 1) once again, since the range has the same basis:
T (1, 0) = (0, 1) = 0 · (1, 0) + 1 · (0, 1), T (0, 1) = (1, 0) = 1 · (1, 0) + 0 · (0, 1). (119)
Thus we have
[T ]β = ( 0  1 ; 1  0 ). (120)
The eigenvalues of [T ]β are obtained from the characteristic equation det([T ]β − λI2 ) = λ² − 1 = 0, giving λ1 = −1 and λ2 = 1. For λ1 = −1, an eigenvector v = (v1 , v2 )^T satisfies ([T ]β − λ1 I2 )v = 0 (eq.(121)); the reduced row echelon form of the coefficient matrix of eq.(121) gives³⁰
( 1  1 ; 0  0 ) ( v1 ; v2 ) = ( 0 ; 0 ), (122)
solving which we get v1 = −v2 . The set of eigenvectors is thus given by Eλ1 = {a(−1, 1)^T : a ∈ R, a ≠ 0}. Each vector of this set Eλ1 is an eigenvector corresponding to the eigenvalue λ1 = −1, which means that corresponding to an eigenvalue there exist infinitely many eigenvectors. Proceeding in a similar manner, we take an eigenvector v′ = (v1′ , v2′ )^T corresponding to the eigenvalue λ2 = 1, so that the eigenvalue-eigenvector equation becomes
([T ]β − λ2 I2 )v′ = ( −1  1 ; 1  −1 ) ( v1′ ; v2′ ) = ( 0 ; 0 ), (123)
solving which we get v1′ = v2′ . As before, the set of eigenvectors is Eλ2 = {b(1, 1)^T : b ∈ R, b ≠ 0}. Each vector of the set Eλ2 is an eigenvector corresponding to the eigenvalue λ2 = 1, which again implies that there are infinitely many choices.³¹
Now as we have found the eigenvalues and eigenvectors, let us go back to the discussion of di-
agonalizability with which we started this section.
²⁹ T is a function from R² to R². This operator is called the flip operator.
³⁰ I2 = ( 1  0 ; 0  1 ).
³¹ For λ1 = −1 the solution space is Sλ1 = {a(−1, 1)^T : a ∈ R}, while for λ2 = 1 the solution space is Sλ2 = {a(1, 1)^T : a ∈ R}.
Theorem: Let T be a linear operator defined on a vector space V , and let λ1 , λ2 , · · · , λk be
distinct eigenvalues of T . If v1 , v2 , · · · , vk are eigenvectors of T such that λi corresponds to vi
(1 ≤ i ≤ k), then {v1 , v2 , · · · , vk } is linearly independent.
If we consider the above problem, in which the linear transformation is represented by eq.(119), and if we construct the set β1 = {(−1, 1)^T , (1, 1)^T }, it is easy to verify that the set β1 is linearly independent. By the corollary to the above theorem we can immediately say that the linear operator T is diagonalizable, as dim(R²) = 2 and β1 contains two linearly independent vectors which are eigenvectors.
Let us now try to understand geometrically what is happening with respect to the linear oper-
ator T defined in eq.(119). At first the operator takes the point (1, 0) to (0, 1) and vice-versa.
These vectors are orthogonal to each other (i.e. they are perpendicular). Let us now consider the
following figure. If we carefully look at the figures 1 and 2, we observe the following geometrical
facts. These points (1, 0), (0, 1), (−1, 1), (1, 1) are all lying on the two dimensional plane R2 . But
there is a difference among these points with respect to the linear transformation T of eq.(119) act-
ing on them. We see that the vectors (1, 0), (0, 1) are being transformed to the vectors (0, 1), (1, 0)
respectively, i.e. the original and transformed points are non-collinear, while the vector (−1, 1) is
transformed to (1, −1) and the vector (1, 1) does not change at all, all under the linear operator T
(here the original and transformed points are collinear with one another). The eigenvalues tell us by how much the lengths of the vectors (which remain collinear with their respective original counterparts) are increased, decreased, or left unchanged.
Case III: If 0 < λ < 1, then T moves vectors in V closer to origin by a factor of λ.
Case V: If λ < 0, then T reverses the orientation of vectors in V , i.e. T moves vectors in V
from one side of the origin to the other33 .
In the example that we have considered above, case V happened with vector (−1, 1) and case II
happened with vector (1, 1).
Problem: Consider the transformation T (x, y) = (x, −y) defined on R2 . Find eigenvalues and
eigenvector of T . Is T diagonalizable? Geometrically interpret your result. This transformation is
also known as reflection about the x−axis.
Problem: Consider the transformation T (x, y) = i(−y, x) defined on R2 . Find eigenvalues and
eigenvector of T . Is T diagonalizable? Geometrically interpret your result.(Remember that i is the
complex number)
Note: The matrix representations of these transformations discussed above as an example and
as set in the problem are very important. These matrices are of order 2 and they are Pauli spin
matrices which play a fundamental role in designing quantum circuits in quantum computation and
quantum information34 .
Problem: Consider the transformation T (x, y) = (−x, y) defined on R2 . Find eigenvalues and
eigenvector of T . Is T diagonalizable? This transformation is reflection about y−axis.
Problem: Consider the transformation T (x, y) = (x, 0) defined on R2 . Find eigenvalues and
eigenvector of T . Is T diagonalizable? This is called projection on the x−axis.
Problem: Consider the transformation T (x, y) = (0, y) defined on R2 . Find eigenvalues and
eigenvector of T . Is T diagonalizable? This is called projection on the y−axis.
Let T be the linear operator on R² that rotates each vector in the plane through an angle of π/2. It is clear geometrically that for any non-zero vector v, the vectors v and T (v) are not collinear.
Hence T (v) is not a multiple of v. Therefore T has no eigenvectors and, consequently, no eigenval-
ues. Thus there exist operators (and matrices) with no eigenvalues or eigenvectors. Of course such
operators and matrices are not diagonalizable.
Now that we know how to identify the diagonalizability of an operator, the next question is how to find the diagonal form of the given operator/matrix. For this we need the following theorem and its corresponding corollary.
Theorem: Let T be a linear operator on a finite dimensional vector space V , and let β and
β ′ be ordered bases for V . Suppose that Q is the change of coordinate matrix that changes
β ′ −coordinates into β−coordinates. Then we have
[T ]β ′ = Q−1 [T ]β Q. (125)
We have already seen that the operator T (x, y) = (y, x) is diagonalizable and the matrix rep-
resentation of this operator with respect to the standard ordered basis of R2 is given in eq.(120).
With the eigenvectors {(−1, 1)T , (1, 1)T } of this operator we construct the matrix Q as
Q = ( −1  1 ; 1  1 ), (126)
where the first column of the matrix contains the eigenvector associated with eigenvalue −1 and
the second column of the matrix Q contains the eigenvector associated with the eigenvalue 1. By
applying the above corollary we thus find the diagonalized form of the operator of eq.(120) as
D = ( −1  0 ; 0  1 ). (127)
It is to be noted that in the principal diagonal of the matrix in eq.(127) the eigenvalues of the
operator that is −1 and 1 are appearing.
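A short numerical confirmation of this diagonalization is sketched below: with Q built from the eigenvectors of [T ]β, the product Q⁻¹ [T ]β Q is the diagonal matrix of eigenvalues.

import numpy as np

T = np.array([[0.0, 1.0],
              [1.0, 0.0]])       # matrix of the flip operator in the standard basis
Q = np.array([[-1.0, 1.0],
              [ 1.0, 1.0]])      # columns: eigenvectors for -1 and +1
D = np.linalg.inv(Q) @ T @ Q
print(np.round(D, 10))           # diag(-1, 1)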
Problem: Find the diagonalized forms of all the operators of the problems discussed above.
Definition: A polynomial f (t) in P (F )³⁵ splits over F if there are scalars c, a1 , a2 , · · · , an (not necessarily distinct) in F such that
f (t) = c(a1 − t)(a2 − t) · · · (an − t). (128)
If we consider the linear operator defined in eq.(120), the characteristic polynomial is f (t) = t² − 1 = (t − 1)(t + 1). Comparing this with eq.(128) we find that c = 1 and a1 = 1, a2 = −1.
Theorem: The characteristic polynomial of a diagonalizable linear operator splits.
Proof: Let T be a diagonalizable linear operator defined on the n-dimensional vector space V , and let β be an ordered basis for V such that [T ]β = D is a diagonal matrix. Suppose that D = diag(λ1 , λ2 , · · · , λn ) and let f (t) be the characteristic polynomial of T . Then
f (t) = det(D − tI) = det( diag(λ1 − t, λ2 − t, · · · , λn − t) ) = (λ1 − t)(λ2 − t) · · · (λn − t) = (−1)^n (t − λ1 )(t − λ2 ) · · · (t − λn ). Q.E.D.
The algebraic multiplicity of both the eigenvalues of the linear operator defined in eq.(120) is
1.
35
The set of all polynomials with coefficients from field F is a vector space, which we denote by P (F ).
Problem: Find the algebraic multiplicity of each eigenvalue of the linear operators discussed
in the preceding paragraphs.
Definition: For an eigenvalue λ of T , define the eigenspace Eλ = {v ∈ V : T (v) = λv} = N (T − λI)³⁶. Clearly, Eλ is a subspace of V consisting of the zero vector and the eigenvectors of T corresponding to the eigenvalue λ. The maximum number of linearly independent eigenvectors of T corresponding to the eigenvalue λ is therefore the dimension of Eλ .
Theorem: Let T be a linear operator on a finite dimensional vector space V , and let λ be an
eigenvalue of T having multiplicity m. Then 1 ≤ dim(Eλ ) ≤ m37 .
Theorem: Let T be a linear operator on a finite dimensional vector space V such that the
characteristic polynomial of T splits. Let λ1 , λ2 , · · · , λk be the distinct eigenvalues of T . Then
(a) T is diagonalizable if and only if the multiplicity of λi is equal to dim(Eλi ) for all i. (b) If T is
diagonalizable and βi is an ordered basis for Eλi for each i, then β = β1 ∪ β2 ∪ · · · ∪ βk is an ordered
basis for V consisting of eigenvectors of T .
For the linear operator defined in eq.(120) we have two eigenvalues, λ1 = −1 and λ2 = 1. By definition Eλ1 = {k(−1, 1)^T : k ∈ R} (which contains the zero vector for k = 0). The maximum number of linearly independent eigenvectors of T corresponding to the eigenvalue −1 is one, e.g. (−1, 1)^T . Thus dim(Eλ1 ) = 1, which is equal to the multiplicity of −1. It can also be shown that the multiplicity of the other eigenvalue 1 is equal to dim(Eλ2 ). Thus, by the above theorem we can deduce that T is diagonalizable.
Problem: Prove that ( 1  1 ; 0  1 ) is not diagonalizable. [Hints: Show that the above theorem doesn't hold in this case.]
36
Identity operator defined on a vector space V .
37
dim stands for dimension.
2.3 LDA methodology:
The goal of the LDA technique is to project the original data matrix onto a lower dimensional
space. Three steps are needed to be performed for this purpose.
To calculate the separability between different classes (i.e. the distance between the means
of different classes). This is called between class variance or between class matrix.
To calculate the distance between the mean and the samples of each class. This is known as
within class variance or within class matrix.
To construct the lower dimensional space which maximizes the between class variance and
minimizes the within class variance.
SB is of order M × M .
Compute the within class variance (or within class matrix) as
SW = ∑_{j=1}^{k} ∑_{i=1}^{nj} (xij − µj )^T (xij − µj ), (133)
To understand the above algorithm we consider below one numerical example. Suppose the data matrix is given as
X =
1 2
2 3
3 3
4 5
5 5
4 2
5 0
5 2
3 2
5 3
6 3   (an 11 × 2 matrix), (135)
where we divide the matrix into two classes ω1 and ω2. Here ω1 is the matrix formed by the first 5 rows and ω2 is the matrix formed by the last 6 rows, so the total number of rows is N = 11. Each class has samples of dimension 2. Thus we have
$$\omega_1 = \begin{pmatrix} 1 & 2 \\ 2 & 3 \\ 3 & 3 \\ 4 & 5 \\ 5 & 5 \end{pmatrix}_{5\times 2} \qquad (136)$$
and
$$\omega_2 = \begin{pmatrix} 4 & 2 \\ 5 & 0 \\ 5 & 2 \\ 3 & 2 \\ 5 & 3 \\ 6 & 3 \end{pmatrix}_{6\times 2}. \qquad (137)$$
Eqs. (136) and (137) are the two classes respectively extracted from X. We shall now implement the steps of the above algorithm.
38. Note that the size of class ω1 is n1 = 5 and that of ω2 is n2 = 6. Here we have used eq.(130).
Using eq.(132) we will calculate the between class variance matrix S_B, and for that we need to calculate S_{B_1} and S_{B_2} for each of the two classes:
$$S_B = S_{B_1} + S_{B_2} = n_1(\mu_1 - \mu)^T(\mu_1 - \mu) + n_2(\mu_2 - \mu)^T(\mu_2 - \mu). \qquad (141)$$
In this way we have
$$S_{B_1} = 5 \times \{(3,\, 3.6) - (3.91,\, 2.727)\}^T\{(3,\, 3.6) - (3.91,\, 2.727)\} = 5 \times \begin{pmatrix} -0.91 \\ 0.873 \end{pmatrix}\begin{pmatrix} -0.91 & 0.873 \end{pmatrix} = \begin{pmatrix} 4.14 & -3.97 \\ -3.97 & 3.81 \end{pmatrix} \qquad (142)$$
and
$$S_{B_2} = 6 \times \{(4.67,\, 2) - (3.91,\, 2.727)\}^T\{(4.67,\, 2) - (3.91,\, 2.727)\} = 6 \times \begin{pmatrix} 0.76 \\ -0.727 \end{pmatrix}\begin{pmatrix} 0.76 & -0.727 \end{pmatrix} = \begin{pmatrix} 3.47 & -3.32 \\ -3.32 & 3.17 \end{pmatrix}. \qquad (143)$$
Consequently the between class matrix is
$$S_B = S_{B_1} + S_{B_2} = \begin{pmatrix} 7.61 & -7.29 \\ -7.29 & 6.98 \end{pmatrix}. \qquad (144)$$
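The numbers in eqs.(141)-(144) can be verified with a few lines of numpy; the following is our own check (the small differences from the figures above come only from the rounding of the means used in the notes).

import numpy as np

w1 = np.array([[1, 2], [2, 3], [3, 3], [4, 5], [5, 5]], dtype=float)
w2 = np.array([[4, 2], [5, 0], [5, 2], [3, 2], [5, 3], [6, 3]], dtype=float)

mu1, mu2 = w1.mean(axis=0), w2.mean(axis=0)                    # class means
mu = (len(w1) * mu1 + len(w2) * mu2) / (len(w1) + len(w2))     # total mean

SB1 = len(w1) * np.outer(mu1 - mu, mu1 - mu)   # eq.(142)
SB2 = len(w2) * np.outer(mu2 - mu, mu2 - mu)   # eq.(143)
SB = SB1 + SB2                                 # eq.(144)
print(np.round(SB, 2))   # approx [[ 7.58 -7.27]
                         #         [-7.27  6.98]]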
Using eq.(133) we now calculate the within class matrix for both the classes ω1 and ω2, which we denote by S_{W_1} and S_{W_2}, and consequently the total within class variance, which we denote by S_W:
$$S_{W_1} = \{(1, 2) - (3, 3.6)\}^T\{(1, 2) - (3, 3.6)\} + \{(2, 3) - (3, 3.6)\}^T\{(2, 3) - (3, 3.6)\} + \{(3, 3) - (3, 3.6)\}^T\{(3, 3) - (3, 3.6)\} + \{(4, 5) - (3, 3.6)\}^T\{(4, 5) - (3, 3.6)\} + \{(5, 5) - (3, 3.6)\}^T\{(5, 5) - (3, 3.6)\}, \qquad (145)$$
so that we have
$$S_{W_1} = \begin{pmatrix} 10 & 8 \\ 8 & 7.2 \end{pmatrix}. \qquad (146)$$
Again, for class 2 we have
$$S_{W_2} = \{(4, 2) - (4.67, 2)\}^T\{(4, 2) - (4.67, 2)\} + \{(5, 0) - (4.67, 2)\}^T\{(5, 0) - (4.67, 2)\} + \{(5, 2) - (4.67, 2)\}^T\{(5, 2) - (4.67, 2)\} + \{(3, 2) - (4.67, 2)\}^T\{(3, 2) - (4.67, 2)\} + \{(5, 3) - (4.67, 2)\}^T\{(5, 3) - (4.67, 2)\} + \{(6, 3) - (4.67, 2)\}^T\{(6, 3) - (4.67, 2)\}, \qquad (147)$$
and consequently we have
$$S_{W_2} = \begin{pmatrix} 5.34 & 1 \\ 1 & 6 \end{pmatrix}. \qquad (148)$$
Thus
$$S_W = S_{W_1} + S_{W_2} = \begin{pmatrix} 15.34 & 9 \\ 9 & 13.2 \end{pmatrix}. \qquad (149)$$
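Eqs.(145)-(149) can be checked in the same way (again our own sketch, not part of the notes).

import numpy as np

w1 = np.array([[1, 2], [2, 3], [3, 3], [4, 5], [5, 5]], dtype=float)
w2 = np.array([[4, 2], [5, 0], [5, 2], [3, 2], [5, 3], [6, 3]], dtype=float)

def within_scatter(w):
    # sum over the samples of (x - mu)^T (x - mu), with x and mu as row vectors
    d = w - w.mean(axis=0)
    return d.T @ d

SW1 = within_scatter(w1)   # eq.(146): [[10, 8], [8, 7.2]]
SW2 = within_scatter(w2)   # eq.(148): approx [[5.33, 1], [1, 6]]
SW = SW1 + SW2             # eq.(149): approx [[15.33, 9], [9, 13.2]]
print(np.round(SW, 2))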
We now find the eigenvalues and eigenvectors of the matrix W from eq.(151). The eigenvalues are λ1 = −0.0027 and λ2 = 2.783, and the corresponding eigenvectors are $v_1 = \begin{pmatrix} -0.693 \\ -0.721 \end{pmatrix}$ and $v_2 = \begin{pmatrix} 0.677 \\ -0.736 \end{pmatrix}$ respectively.
From the previous step it is clear that λ2 > λ1, so the second eigenvector v2, corresponding to λ2, is more robust than the first one, v1, corresponding to λ1. Therefore v2 is selected to construct the lower dimensional space, which will be a good discriminator between the classes. The original data are projected onto this lower dimensional space as y_i = ω_i v_2, where y_i, of order n_i × 1 (here n_1 = 5 for class 1 and n_2 = 6 for class 2), represents the data of the ith class after projection. Its values are as follows:
$$y_1 = \omega_1 v_2 = \begin{pmatrix} 1 & 2 \\ 2 & 3 \\ 3 & 3 \\ 4 & 5 \\ 5 & 5 \end{pmatrix}_{5\times 2}\begin{pmatrix} 0.6772 \\ -0.7358 \end{pmatrix}_{2\times 1} = \begin{pmatrix} -0.7944 \\ -0.8530 \\ -0.1758 \\ -0.9702 \\ -0.2930 \end{pmatrix}_{5\times 1} \qquad (152)$$
and
$$y_2 = \omega_2 v_2 = \begin{pmatrix} 4 & 2 \\ 5 & 0 \\ 5 & 2 \\ 3 & 2 \\ 5 & 3 \\ 6 & 3 \end{pmatrix}_{6\times 2}\begin{pmatrix} 0.6772 \\ -0.7358 \end{pmatrix}_{2\times 1} = \begin{pmatrix} 1.2372 \\ 3.3860 \\ 1.9144 \\ 0.5600 \\ 1.1786 \\ 1.8558 \end{pmatrix}_{6\times 1}. \qquad (153)$$
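The eigen-analysis of S_W^{-1} S_B and the projections of eqs.(152)-(153) can be reproduced as follows (our own sketch; np.linalg.eig may return the eigenvector with the opposite sign, in which case the projected values simply flip sign).

import numpy as np

w1 = np.array([[1, 2], [2, 3], [3, 3], [4, 5], [5, 5]], dtype=float)
w2 = np.array([[4, 2], [5, 0], [5, 2], [3, 2], [5, 3], [6, 3]], dtype=float)
SB = np.array([[7.61, -7.29], [-7.29, 6.98]])   # eq.(144)
SW = np.array([[15.34, 9.0], [9.0, 13.2]])      # eq.(149)

eigvals, eigvecs = np.linalg.eig(np.linalg.inv(SW) @ SB)
eigvals, eigvecs = eigvals.real, eigvecs.real   # eigenvalues are real here
k = int(np.argmax(eigvals))                     # dominant eigenvalue lambda_2
v2 = eigvecs[:, k]                              # approx (0.677, -0.736)

y1 = w1 @ v2    # eq.(152)
y2 = w2 @ v2    # eq.(153)
print(np.round(eigvals, 4))
print(np.round(y1, 4))
print(np.round(y2, 4))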
Compute the total mean µ of all the data (here k denotes the number of classes, each class has samples of dimension M, and N is the total number of samples over all the classes):
$$\mu = \frac{1}{N}\sum_{i=1}^{N} x_i = \sum_{i=1}^{k}\frac{n_i}{N}\mu_i, \qquad (156)$$
where µ is of order 1 × M.
S_B is of order M × M.
for all classes i, i = 1, 2, · · · , c, do
Compute the within class matrix of each class (which we denote by S_{W_i}; it is of order M × M), given by
$$S_{W_i} = \sum_{x_j \in \omega_i}(x_j - \mu_i)^T(x_j - \mu_i). \qquad (158)$$
The eigenvalues λ_i and the eigenvectors V_i of each transformation matrix W_i (where $W_i = S_{W_i}^{-1}S_B$, cf. eq.(162) below) are then calculated; λ_i and V_i represent the calculated eigenvalues and eigenvectors of the ith class, respectively.
Sort the eigenvectors in descending order according to their corresponding eigenvalues. The first k eigenvectors are then used to construct a lower dimensional space $V_k^i$ for each class.
Project the samples of each class ω_j onto their lower dimensional space $V_k^j$, as follows:
$$\Omega_j = x_i V_k^j, \quad x_i \in \omega_j, \qquad (160)$$
end for
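Below is a minimal sketch (ours, not from the notes) of the class dependent algorithm applied to the same two-class example: for each class it forms W_i = S_{W_i}^{-1} S_B, takes the dominant eigenvector and projects that class onto it, as in eqs.(158)-(160). As before, eigenvector signs may come out flipped.

import numpy as np

classes = {
    1: np.array([[1, 2], [2, 3], [3, 3], [4, 5], [5, 5]], dtype=float),
    2: np.array([[4, 2], [5, 0], [5, 2], [3, 2], [5, 3], [6, 3]], dtype=float),
}
X = np.vstack(list(classes.values()))
mu = X.mean(axis=0)                                  # total mean, eq.(156)

# between class matrix S_B, as in the class independent case
SB = sum(len(w) * np.outer(w.mean(axis=0) - mu, w.mean(axis=0) - mu)
         for w in classes.values())

projections = {}
for i, w in classes.items():
    d = w - w.mean(axis=0)
    SWi = d.T @ d                                    # within class matrix, eq.(158)
    Wi = np.linalg.inv(SWi) @ SB                     # transformation matrix of class i
    eigvals, eigvecs = np.linalg.eig(Wi)
    k = int(np.argmax(eigvals.real))                 # dominant eigenvalue
    Vi = eigvecs[:, k].real                          # lower dimensional space of class i
    projections[i] = w @ Vi                          # projection, eq.(160)

print(np.round(projections[1], 2))   # approx the values of eq.(163)
print(np.round(projections[2], 2))   # approx the values of eq.(164)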
Eqs. (154)-(157) are similar to eqs. (129)-(132). Using these equations we get the same S_B matrix as before, which is given in eq.(144). We now use the next set of equations given in the class dependent algorithm to find the within class matrix of each class. For the problem that we have considered above, S_{W_1} and S_{W_2} correspond to the within class matrices for the classes ω1 and ω2 respectively; their inverses are given by
$$S_{W_1}^{-1} = \begin{pmatrix} 0.9 & -1 \\ -1 & 1.25 \end{pmatrix}, \qquad S_{W_2}^{-1} = \begin{pmatrix} 0.1933 & -0.0322 \\ -0.0322 & 0.1720 \end{pmatrix}. \qquad (161)$$
Consequently we have
$$W_1 = S_{W_1}^{-1}S_B = \begin{pmatrix} 14.13 & -13.54 \\ -16.72 & 16.01 \end{pmatrix}, \qquad W_2 = S_{W_2}^{-1}S_B = \begin{pmatrix} 1.705 & -1.634 \\ -1.499 & 1.435 \end{pmatrix}. \qquad (162)$$
The eigenvalues and eigenvectors of W_1 are λ1 = −0.0056 and λ2 = 30.145 with $V_1 = \begin{pmatrix} -0.6916 \\ -0.7221 \end{pmatrix}$ and $V_2 = \begin{pmatrix} 0.6456 \\ -0.7637 \end{pmatrix}$, while the eigenvalues and eigenvectors of W_2 are λ1 = 3.1389 and λ2 = 0.00105 with $V_1 = \begin{pmatrix} 0.7508 \\ -0.6605 \end{pmatrix}$ and $V_2 = \begin{pmatrix} 0.6913 \\ -0.7226 \end{pmatrix}$.
We see that the second eigenvector V_2 for the class ω1 has the larger corresponding eigenvalue (λ2 = 30.145); thus the second eigenvector is used as the lower dimensional space for the first class. However, in the class ω2 the first eigenvector has the larger corresponding eigenvalue (λ1 = 3.1389), hence the first eigenvector is used to project the data of the second class onto its lower dimensional space. Thus the projected vectors for the classes ω1 and ω2 will respectively be
$$\Omega_{\omega_1} = \omega_1 V_2 = \begin{pmatrix} 1 & 2 \\ 2 & 3 \\ 3 & 3 \\ 4 & 5 \\ 5 & 5 \end{pmatrix}\begin{pmatrix} 0.6456 \\ -0.7637 \end{pmatrix} = \begin{pmatrix} -0.88 \\ -0.99 \\ -0.35 \\ -1.24 \\ -0.59 \end{pmatrix} \qquad (163)$$
and
$$\Omega_{\omega_2} = \omega_2 V_1 = \begin{pmatrix} 4 & 2 \\ 5 & 0 \\ 5 & 2 \\ 3 & 2 \\ 5 & 3 \\ 6 & 3 \end{pmatrix}\begin{pmatrix} 0.7508 \\ -0.6605 \end{pmatrix} = \begin{pmatrix} 1.68 \\ 3.75 \\ 2.43 \\ 0.93 \\ 1.77 \\ 2.52 \end{pmatrix}. \qquad (164)$$
If we compare the two LDA methods (i.e. class dependent (CDLDA) and class independent (CILDA)), we find that, although both are used to calculate a lower dimensional space, CDLDA calculates a separate lower dimensional space for each class, which has two main limitations: (1) it needs more CPU time and more calculations than CILDA; (2) it may lead to the SSS (Small Sample Size) problem, because the number of samples in each class affects the singularity of S_{W_i}. These findings explain why the standard LDA technique uses the CILDA method rather than the CDLDA method.
We now try to understand the generalized scenario behind what we have learned so far, and to learn about the prior probabilities and the parameters of the linear discriminant function.
We consider below the binary classification problem with two classes (i.e. k = 2), where we wish to discriminate between the two classes ω1 and ω2. Let P(X ∈ ω_i) = Π_i, i = 1, 2, be the prior probability that a randomly selected observation X = x belongs to ω1 or ω2 respectively. Suppose also that the conditional multivariate probability density of X for the ith class is
$$f_i(x) = f(x \mid X \in \omega_i), \qquad i = 1, 2. \qquad (165)$$
The function f_i on the right hand side of eq.(165) may be discrete or continuous, may be a finite mixture distribution, or may even have a singular covariance matrix. Bayes' theorem yields the posterior probability
$$P(\omega_i \mid x) = P(X \in \omega_i \mid X = x) = \frac{P[(X \in \omega_i) \cap (X = x)]}{P(X = x)} = \frac{f_i(x)\Pi_i}{f_1(x)\Pi_1 + f_2(x)\Pi_2}. \qquad (166)$$
We assign x to ω1 if
$$\frac{P(\omega_1 \mid x)}{P(\omega_2 \mid x)} > 1, \qquad (167)$$
and we assign x to ω2 otherwise. The ratio $\frac{P(\omega_1 \mid x)}{P(\omega_2 \mid x)}$ is referred to as the odds ratio that ω1 rather than ω2 is the correct class, given the information in x. Substituting eq.(166) into eq.(167), the Bayes rule classifier assigns x to ω1 if $\frac{f_1(x)}{f_2(x)} > \frac{\Pi_2}{\Pi_1}$ and to ω2 otherwise.
2
Note: In the dimensionality reduction problem considered in the previous section, the proba-
5 6
bilities i.e. P (X ∈ ω1 ) = Π1 = 11 = 0.4545 and P (X ∈ ω2 ) = Π2 = 11 = 0.5454 are the prior
probabilities.
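To make the Bayes rule concrete, the sketch below (ours) assumes, purely for illustration, that the class-conditional densities f1 and f2 are bivariate normal densities estimated from the two classes of the numerical example, with the priors Π1 = 5/11 and Π2 = 6/11; a point x is then assigned to ω1 exactly when f1(x)Π1 > f2(x)Π2, which is eq.(167) rewritten.

import numpy as np
from scipy.stats import multivariate_normal

w1 = np.array([[1, 2], [2, 3], [3, 3], [4, 5], [5, 5]], dtype=float)
w2 = np.array([[4, 2], [5, 0], [5, 2], [3, 2], [5, 3], [6, 3]], dtype=float)

pi1, pi2 = len(w1) / 11, len(w2) / 11        # prior probabilities

# assumed Gaussian class-conditional densities (an assumption of this sketch)
f1 = multivariate_normal(mean=w1.mean(axis=0), cov=np.cov(w1.T))
f2 = multivariate_normal(mean=w2.mean(axis=0), cov=np.cov(w2.T))

def classify(x):
    # assign x to omega_1 iff f1(x)/f2(x) > Pi_2/Pi_1, i.e. f1(x)*Pi_1 > f2(x)*Pi_2
    return 1 if f1.pdf(x) * pi1 > f2.pdf(x) * pi2 else 2

print(classify([2.0, 3.0]))   # a point near the centre of class 1 -> 1
print(classify([5.0, 1.0]))   # a point near the centre of class 2 -> 2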
3 Cluster Analysis:
An ability that distinguishes humans from other creatures is the ability to classify things. Classification is fundamental to most branches of science and is extremely important for their study. In biology, the theory and practice of classifying organisms is generally known as taxonomy (we often encounter this term in Mendel's work, Darwin's work and so on). Day by day the amount of data is becoming larger, and so is the importance of managing these huge data sets. In market research analysis we have seen that online portals such as Amazon and Flipkart often try to guess consumer choices and accordingly send us hints of products that we might like to buy. The exploration of such databases using cluster analysis and other multivariate analysis techniques is now often called data mining. In the 21st century, data mining has become of particular interest for investigating material on the World Wide Web, where the aim is to extract useful information or knowledge from web page contents. At first glance the terms cluster, group and class seem to be similar kinds of words, but there are subtle differences among them (in the mathematical sense). It is not possible to give a universally unique definition of a cluster. It is not entirely clear how a 'cluster' is recognized when displayed in the plane, but one feature of the recognition process would appear to involve assessment of the relative distances between points. Figure 9 gives a few schematic representations of clusters which the human mind can often process.
3.1 A few clustering techniques:
Hierarchical clustering:
In a hierarchical classification the data are not partitioned into a particular number of classes or
clusters at a single step. Instead the classification consists of a series of partitions, which may run
from a single cluster containing all individuals, to n clusters each containing a single individual.
Hierarchical clustering techniques may be subdivided into agglomerative methods, which proceed by a series of successive fusions of the n individuals into groups, and divisive methods, which separate the n individuals successively into finer groupings. Figure 10 shows a schematic of hierarchical clustering. From figure 10 it is observed that initially the black dots are all considered together; gradual grouping/clustering is then applied until all the black dots become individual clusters (shown inside the red rectangular box). With hierarchical methods, divisions or fusions, once made, are irrevocable: when an agglomerative algorithm has joined two individuals they cannot subsequently be separated, and when a divisive algorithm has made a split it cannot be undone.
Analysis:
Let us first analyze the above distance matrix. What does it represent? It gives an impression of five points, numbered 1, 2, · · · , 5, scattered around, and it records the distances between these points. Take, for example, cell (1, 1), where the value is 0.0: it means that the distance of point 1 from itself is zero. Hence all the diagonal values of table 10 are zero. Take again, for example, cell (5, 4), where the value is 3.0: this means that the distance between points 4 and 5 is 3. You see that some cells are empty; they simply have not been filled in. Take for example cell (4, 5), which seems to have no value. But you know that dist(4, 5) = dist(5, 4), and the value of dist(5, 4) is 3.0, and so on. You can read the table values from rows to columns or from columns to rows. We begin our process now.
We club points 1 and 2 into a single cluster, because cell (1, 2) contains the smallest non-zero entry, 2.0. From this cluster we then calculate the distances to the other three points 3, 4 and 5, as shown below: dist(12)3 = min{dist13, dist23} = min{6.0, 5.0} = 5.0. In a similar manner we compute the other distances and we subsequently get the following table.
In table 11 the lowest non-zero entry is 3.0, which appears in cell (4, 5) (or (5, 4)). So we club points 4 and 5 and make another cluster, (4, 5). Now we have two clusters, (1, 2) and (4, 5). We will calculate the distances dist(12)3, dist(12)(45) and dist(45)3. (We show one example below and proceed likewise.) So dist(12)(45) = min{dist(12)4, dist(12)5} = min{9.0, 8.0} = 8.0, and so on. Thus we get the distance matrix of table 12.
The smallest entry in table 12 is now d(45)3, and individual 3 is added to the cluster containing individuals 4 and 5. Finally the groups containing individuals (1, 2) and (3, 4, 5) are combined into a single cluster. We calculate the distance dist(12)(345), which one can easily calculate to be 5.0.
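The single linkage update used in these steps (the distance from a merged cluster to anything else is the minimum of the individual distances) can be written as a tiny helper; a sketch of ours, reproducing the values computed above.

# single linkage: distance from the merged cluster (A u B) to C is the minimum
# of the two distances; complete linkage would use max instead of min
def merged_distance(d_AC, d_BC, linkage=min):
    return linkage(d_AC, d_BC)

print(merged_distance(6.0, 5.0))   # dist(12)3   = min{6.0, 5.0} = 5.0
print(merged_distance(9.0, 8.0))   # dist(12)(45) = min{9.0, 8.0} = 8.0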
So what we have seen is that we formed the two clusters (1, 2) and (3, 4, 5), which are finally merged into a single cluster. The above methodology of forming clusters can be plotted using a dendrogram. Look at figure 11 and tally it with the tables of the single linkage method; see how the dendrogram is made (try to understand the similarity!).
Q: What are the various types of hierarchical clustering? Explain. What is a dendrogram? [2023-24 ODD]
Ans: Hierarchical clustering techniques proceed by either a series of successive mergers or a series of successive divisions. Agglomerative hierarchical methods start with the individual objects; thus there are initially as many clusters as objects. The most similar objects are first grouped, and then these initial groups are merged according to their similarities. Eventually, as the similarity decreases, all subgroups are fused into a single cluster (see figure 10, the direction being from right to left). Divisive hierarchical methods work in the opposite direction. An initial single group of objects is divided into two subgroups such that the objects in one subgroup are far from the objects in the other. These subgroups are then further divided into dissimilar subgroups; the process continues until there are as many subgroups as objects, that is, until each object forms a group (see figure 10, direction as shown). The results of both agglomerative and divisive methods may be displayed in the form of a two-dimensional diagram known as a dendrogram. Linkage methods are suitable for clustering items as well as variables; this is not true for all hierarchical agglomerative procedures. Some of the agglomerative processes are (i) the single linkage method, (ii) the complete linkage method, and (iii) the average linkage method. (Students are now supposed to write each of these methods in brief, as required in the MAKAUT question paper.)
Q: What is agglomerative and divisive analysis in hierarchical clustering methods? [2023-24 ODD]
Ans: Write the same answer as above. Here you do not have to discuss the methodologies in detail; just state the difference between the two techniques.
Q: Solve the following distance matrix by the complete linkage technique and construct its dendrogram [2023-24 ODD].
Points a b c d e
a 0 9 3 6 11
b 9 0 7 5 10
c 3 7 0 9 2
d 6 5 9 0 8
e 11 10 2 8 0
Sol: In the given table the minimum off-diagonal value is 2, occurring at cell (c, e). So at the first stage objects c and e are merged, giving the cluster (ce). At stage II we compute the following:
dist(ce)a = max{dist_ca, dist_ea} = max{3, 11} = 11,
dist(ce)b = max{dist_cb, dist_eb} = 10,
dist(ce)d = max{dist_cd, dist_ed} = 9,   (169)
and the modified distance matrix is as follows:
Points (ce) a b d
(ce) 0
a 11 0
b 10 9 0
d 9 6 5 0
The next merger occurs between the most similar groups b and d, giving the cluster (bd), since the minimum value is now 5, appearing at cell (b, d). At stage III we have
dist(bd)a = max{dist_ba, dist_da} = max{9, 6} = 9,
dist(bd)(ce) = max{dist_b(ce), dist_d(ce)} = max{10, 9} = 10.
Consequently the distance matrix is
Points (ce) (bd) a
(ce) 0
(bd) 10 0
a 11 9 0
You see that in this distance matrix the minimum element is 9, appearing at cell (a, bd); hence we merge these two and get the cluster (abd). The remaining distance is calculated as
dist(abd)(ce) = max{dist_a(ce), dist_(bd)(ce)} = max{11, 10} = 11,
so (abd) and (ce) are finally merged at distance 11. This completes the solution to the problem. The dendrogram is shown in the following figure 12.
The dendrogram is drawn as follows.
1. Draw the two axes and calibrate them: the horizontal axis is calibrated with the points which need to be clustered, and the vertical axis with distances.
2. Mark the minimum distance found in the distance matrix on the vertical axis, and mark the corresponding cell number (row-wise and column-wise) on the horizontal axis. Raise two vertical bars from these cell numbers on the horizontal axis up to the value of the minimum distance and join them overhead.
3. After constructing the next distance matrix, mark the new minimum distance and the corresponding cell number. Club the new cell number with the old ones and make the new cluster.
4. Repeat step 3 until all the clusters are made.
Note: For both the single linkage method and the complete linkage method, the process of drawing the dendrogram is the same. The only difference lies in the construction of the distance matrix: in single linkage we take the minimum of the distances, while in complete linkage we take the maximum.
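The complete linkage solution above, and its dendrogram, can be checked with scipy (a sketch of ours, using the a-e distance matrix of the question; with method='single' the same code performs single linkage clustering).

import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram
from scipy.spatial.distance import squareform

labels = ['a', 'b', 'c', 'd', 'e']
D = np.array([[ 0,  9,  3,  6, 11],
              [ 9,  0,  7,  5, 10],
              [ 3,  7,  0,  9,  2],
              [ 6,  5,  9,  0,  8],
              [11, 10,  2,  8,  0]], dtype=float)

# linkage() expects a condensed distance vector, so convert the square matrix
Z = linkage(squareform(D), method='complete')
print(Z)   # merge history: (c,e) at height 2, (b,d) at 5, a+(bd) at 9, all at 11

dendrogram(Z, labels=labels)
plt.show()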
Figure 3: Purity vs hydrocarbon level. The figure shows that the data has a linear tendency in the increasing direction.
Figure 4: Purity vs hydrocarbon level. The figure shows the fitted regression line with the scattered points.
Figure 6: 3D scatter plot of data$Pull.Strength, data$Wire.Length and data$Die.Height, considering Pull Strength as the response variable and Wire Length and Die Height as the input variables.
Figure 7: This figure shows how T transforms the vectors (1, 0) and (0, 1) in the coordinate plane.
Figure 8: This figure shows how T transforms the vectors (−1, 1) and (1, 1) in the coordinate plane.
Figure 11: Dendrogram of hierarchical clustering (single linkage method)