
Computational Statistics (BSC-301)

Session: 2024-25 (ODD)


CSE (CSBS) - 3rd Semester

Dr. Sovik Roy1 *


1
Department of Mathematics, Techno Main Salt Lake Engg. College

December 1, 2024

Abstract
Disclaimer: This Statistics course is an application-oriented course designed for students of computer science and business systems. It deals with the basics of multivariate statistical analysis. This course is taught in various universities following different pedagogical points of view. Keeping engineering students in mind, I personally feel that the applications of methodologies specifically designed for dealing with multivariate data have to be taught. This note is a summary of what has been taught in the class and in no way can be considered complete. The students are encouraged to study a good and convenient book which tallies with the syllabus of BSC-301.
This course is intended to explain the fundamentals of multivariate statistical techniques and to apply the concepts in multivariate data analysis.

After attending the course students will be able to:

BSC301.1: Organize multidimensional/multivariate data.

BSC301.2: Identify the utility of the multivariate normal distribution in multivariate statistics.

BSC301.3: Apply multiple regression techniques to build empirical models for engineering and scientific data.

BSC301.4: Apply dimensionality reduction techniques to multivariate data.

BSC301.5: Make use of clustering techniques on multi-dimensional/multivariate data.

Keywords: Multivariate Normal Distribution, Multiple Regression, Discriminant Analysis, Principal Component Analysis, Factor Analysis, Cluster Analysis, Python

* sovikroy.tisl@gmail.com

Class 1:
Pre-requisites 1
Multivariate data arise whenever an investigator, seeking to understand a social or physical phenomenon, selects a number $p \geq 1$ of variables or characters to record. Here $x_{jk}$ denotes the measurement of the $k$-th variable on the $j$-th item. Therefore $n$ measurements on $p$ variables can be displayed as follows:

$$X = \begin{pmatrix}
x_{11} & x_{12} & \cdots & x_{1k} & \cdots & x_{1p} \\
x_{21} & x_{22} & \cdots & x_{2k} & \cdots & x_{2p} \\
\cdots & \cdots & \cdots & \cdots & \cdots & \cdots \\
x_{j1} & x_{j2} & \cdots & x_{jk} & \cdots & x_{jp} \\
\cdots & \cdots & \cdots & \cdots & \cdots & \cdots \\
x_{n1} & x_{n2} & \cdots & x_{nk} & \cdots & x_{np}
\end{pmatrix}$$
Before proceeding we agree to represent the variables column-wise and the measurements row-wise (i.e. the first row indicates that the 1st measurement has been taken on variables $1, 2, \cdots, k, \cdots, p$, and so on).

Just as in the univariate case we calculate the measure of central tendency (such as the mean), the measure of dispersion (such as the variance) and measures of linear association (covariance and correlation), in the multivariate case too we will learn how these measures are computed. With respect to the above matrix $X$, these measures are defined as follows:

$$\bar{x}_k = \frac{1}{n}\sum_{j=1}^{n} x_{jk}, \qquad k = 1, 2, \cdots, p. \qquad (1)$$

Eq.(1) calculates the mean of each column of the matrix $X$.


$$s_{kk} = \frac{1}{n}\sum_{j=1}^{n}\left(x_{jk} - \bar{x}_k\right)^2, \qquad k = 1, 2, \cdots, p. \qquad (2)$$

Eq.(2) measures the spread of the multivariate data for each variable; $s_{kk}$ is the sample variance of the $k$-th variable (its positive square root is the sample standard deviation).

To calculate the linear association between a pair of variables we look at the sub-data of the matrix $X$ (say, taking only variables 1 and 2): $\begin{pmatrix} x_{11} \\ x_{12} \end{pmatrix}, \begin{pmatrix} x_{21} \\ x_{22} \end{pmatrix}, \cdots, \begin{pmatrix} x_{n1} \\ x_{n2} \end{pmatrix}$. The sample covariance between the pair of variables 1 and 2 is defined as

$$s_{12} = \frac{1}{n}\sum_{j=1}^{n}\left(x_{j1} - \bar{x}_1\right)\left(x_{j2} - \bar{x}_2\right). \qquad (3)$$

In general, for any pair of variables $i$ and $k$ we have

$$s_{ik} = \frac{1}{n}\sum_{j=1}^{n}\left(x_{ji} - \bar{x}_i\right)\left(x_{jk} - \bar{x}_k\right), \qquad i, k = 1, 2, \cdots, p. \qquad (4)$$

Covariance reduces to sample variance when i = k. If large values for one variable are observed in
conjunction with large values for the other variable and the small values also occur together, then
sik will be positive. If large values from one variable occur with small values for the other variable,
then sik will be negative. If there is no particular association between the pair of variables, then
sik = 0. Also remember sik = ski , ∀i, k.

Another measure of linear association between a pair of variables, namely the correlation, is independent of the units of measurement. The sample correlation coefficient (or Pearson's product moment correlation coefficient) is given by

$$r_{ik} = \frac{s_{ik}}{\sqrt{s_{ii}}\sqrt{s_{kk}}}, \qquad i, k = 1, 2, \cdots, p. \qquad (5)$$

Note that $r_{ik} = r_{ki}, \ \forall i, k$.

These measures of mean, variance, covariance and correlation are known as descriptive statistics. Thus we can consider the following matrices with reference to the matrix $X$ (respectively known as the average (mean) matrix, the variance-covariance matrix and the correlation matrix):

$$\bar{X} = \begin{pmatrix} \bar{x}_1 \\ \bar{x}_2 \\ \cdots \\ \bar{x}_p \end{pmatrix}, \qquad
\bar{S}_n = \begin{pmatrix} s_{11} & s_{12} & \cdots & s_{1p} \\ s_{21} & s_{22} & \cdots & s_{2p} \\ \cdots & \cdots & \cdots & \cdots \\ s_{p1} & s_{p2} & \cdots & s_{pp} \end{pmatrix}, \qquad
\bar{R} = \begin{pmatrix} 1 & r_{12} & \cdots & r_{1p} \\ r_{21} & 1 & \cdots & r_{2p} \\ \cdots & \cdots & \cdots & \cdots \\ r_{p1} & r_{p2} & \cdots & 1 \end{pmatrix}. \qquad (6)$$
Exercise 1. A selection of four receipts from a university book-store was obtained in order to investigate the nature of book sales. Each receipt provided, among other things, the number of books sold and the total amount of each sale. Let the first variable be total dollar sales and the second variable be the number of books sold. Then one can regard the corresponding numbers on the receipts as four measurements on two variables.

Variable 1 (dollar sales): 42 52 48 58

Variable 2 (no. of books sold): 4 5 4 3

(a) Represent this multivariate data in matrix form. (b) Construct $\bar{X}$, $\bar{S}_n$ and $\bar{R}$.$^1$ [Hints: $\bar{R} = \begin{pmatrix} 1 & -0.36 \\ -0.36 & 1 \end{pmatrix}$.]

Class 2:
Exercise 2. Consider the following seven pairs of measurements $(x_1, x_2)$:

$x_1$: 3   4   2   6   8   2   5
$x_2$: 5   5.5   4   7   10   5   7.5

Calculate the sample means $\bar{x}_1, \bar{x}_2$, the sample variances $s_{11}, s_{22}$ and the sample covariance $s_{12}$. [Hints: $\bar{X} = \begin{pmatrix} 4.28 \\ 6.28 \end{pmatrix}$, $S_n = \begin{pmatrix} 4.20 & 3.70 \\ 3.70 & 3.56 \end{pmatrix}$.]

$^1$ Lab teachers may show how to implement this problem using Python.
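Following up on the footnote above, here is a minimal Python sketch (assuming NumPy is available in the lab environment) that computes the mean vector, variance-covariance matrix and correlation matrix of Exercise 2; the same code applies to Exercise 1 with its data.

```python
import numpy as np

# Data of Exercise 2: rows are the n = 7 items, columns are the p = 2 variables
X = np.array([[3, 5], [4, 5.5], [2, 4], [6, 7], [8, 10], [2, 5], [5, 7.5]])

x_bar = X.mean(axis=0)                      # mean of each column, eq. (1)
S_n   = np.cov(X, rowvar=False, bias=True)  # divide by n, matching eqs. (2)-(4)
R     = np.corrcoef(X, rowvar=False)        # correlation matrix, eq. (5)

print(x_bar)   # compare with the hinted mean vector
print(S_n)     # compare with the hinted variance-covariance matrix
print(R)
```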

Exercise 3: A morning newspaper lists the following used-car prices for a foreign compact, with age $x_1$ measured in years and selling price $x_2$ measured in thousands of dollars.

$x_1$: 1   2   3   3   4   5   6   8   9   11
$x_2$: 18.95   19.00   17.95   15.54   14.00   12.95   8.94   7.49   6.00   3.99

(a) Construct a scatter plot of the data and marginal dot diagrams as well.

(b) Infer the sign of the sample covariance $s_{12}$ from the scatter plot.

(c) Compute the sample means $\bar{x}_1, \bar{x}_2$ and the sample variances $s_{11}, s_{22}$. Compute the sample covariance $s_{12}$ and the sample correlation coefficient $r_{12}$. Interpret these quantities.

(d) Display the sample mean array $\bar{X}$, the sample variance-covariance array $S_n$ and the sample correlation array $R$.$^2$ [Hints: $\bar{X} = \begin{pmatrix} 5.2 \\ 12.4 \end{pmatrix}$, $S_n = \begin{pmatrix} 9.56 & -15.9 \\ -15.9 & 27.9 \end{pmatrix}$, $R = \begin{pmatrix} 1 & -0.97 \\ -0.97 & 1 \end{pmatrix}$.]
In multivariate statistics there is sometimes a significant requirement for distance measurement. If $P = (x_1, x_2)$ is any point and $O = (0, 0)$ is the origin, then the Euclidean distance is calculated as

$$d(O, P) = \sqrt{x_1^2 + x_2^2}. \qquad (7)$$

If $P = (x_1, x_2, \cdots, x_p)$ is a $p$-dimensional point, then its Euclidean distance from the origin $O = (0, 0, \cdots, 0)$ is

$$d(O, P) = \sqrt{x_1^2 + x_2^2 + \cdots + x_p^2}. \qquad (8)$$

Also, if $P = (x_1, x_2, \cdots, x_p)$ is any point and $Q = (y_1, y_2, \cdots, y_p)$ is a fixed point, then

$$d(P, Q) = \sqrt{(x_1 - y_1)^2 + (x_2 - y_2)^2 + \cdots + (x_p - y_p)^2}. \qquad (9)$$

Straight-line (Euclidean) distance is unsatisfactory for most statistical purposes. This is because each coordinate contributes equally to the calculation of Euclidean distance, which is not the case in statistical analysis. When the coordinates represent measurements that are subject to random fluctuations of different magnitudes, it is often desirable to weight coordinates subject to a great deal of variability less heavily than those that are not highly variable. One way to proceed is to divide each coordinate by the sample standard deviation; upon division by the standard deviations we obtain standardized coordinates. Thus a statistical distance of the point $P = (x_1, x_2, \cdots, x_p)$ from the origin $O = (0, 0, \cdots, 0)$ can be computed from its standardized coordinates as

$$d_s(O, P) = \sqrt{\frac{x_1^2}{s_{11}} + \frac{x_2^2}{s_{22}} + \cdots + \frac{x_p^2}{s_{pp}}}. \qquad (10)$$

Similarly, the statistical distance from an arbitrary point $P = (x_1, x_2)$ to a fixed point $Q = (y_1, y_2)$ is given by

$$d_s(P, Q) = \sqrt{\frac{(x_1 - y_1)^2}{s_{11}} + \frac{(x_2 - y_2)^2}{s_{22}}}. \qquad (11)$$
$^2$ All the above problems can be implemented in the laboratory class.

In general, the statistical distance from an arbitrary point $P = (x_1, x_2, \cdots, x_p)$ to a fixed point $Q = (y_1, y_2, \cdots, y_p)$ is given by

$$d_s(P, Q) = \sqrt{\frac{(x_1 - y_1)^2}{s_{11}} + \frac{(x_2 - y_2)^2}{s_{22}} + \cdots + \frac{(x_p - y_p)^2}{s_{pp}}}, \qquad (12)$$

where $s_{11}, s_{22}, \cdots, s_{pp}$ are the sample variances of the $p$ components $x_1, x_2, \cdots, x_p$.

Example 1: Consider Exercise 2 above. Let the data be represented by the random variable X, which takes the individual values of the variables $x_1, x_2$ pairwise (each pair may be considered as a point $P$). Let the origin be denoted by $O$. The Euclidean distance and the statistical distance of each point $P$ from the fixed origin are calculated in the table below.

$P(x_1, x_2)$   $O(0, 0)$   $d(P, O)$                $d_s(P, O)$
(3, 5)          (0, 0)      $\sqrt{3^2 + 5^2}$       $\sqrt{3^2/4.2 + 5^2/3.56}$
(4, 5.5)        (0, 0)      $\sqrt{4^2 + 5.5^2}$     $\sqrt{4^2/4.2 + 5.5^2/3.56}$
(2, 4)          (0, 0)      $\sqrt{2^2 + 4^2}$       $\sqrt{2^2/4.2 + 4^2/3.56}$
(6, 7)          (0, 0)      $\sqrt{6^2 + 7^2}$       $\sqrt{6^2/4.2 + 7^2/3.56}$
(8, 10)         (0, 0)      $\sqrt{8^2 + 10^2}$      $\sqrt{8^2/4.2 + 10^2/3.56}$
(2, 5)          (0, 0)      $\sqrt{2^2 + 5^2}$       $\sqrt{2^2/4.2 + 5^2/3.56}$
(5, 7.5)        (0, 0)      $\sqrt{5^2 + 7.5^2}$     $\sqrt{5^2/4.2 + 7.5^2/3.56}$

Here 4.2 and 3.56 are respectively the variances of the $x_1$ and $x_2$ components of the point $P$.

Exercise 4: Consider the following data. Considering $Q$ to be a fixed point, find the Euclidean distance and the statistical distance of the point $P$ from $Q$.

$P(x_1, x_2)$:  (3, 5)   (4, 5.5)   (2, 4)   (6, 7)   (8, 10)   (2, 5)   (5, 7.5)
$Q(y_1, y_2)$:  (1, 2)   (2.5, 3)   (4, 3)   (2, 1)   (1, 2)    (1, 0)   (3, 4)

Now, if in eq.(12) we set $y_1 = y_2 = \cdots = y_p = 0$, then the formula reduces to the form of eq.(10). However, if $s_{11} = s_{22} = \cdots = s_{pp}$, the Euclidean distance formula of eq.(8) is appropriate.

In the above problem, for each pair $(P, Q)$, one has to apply the required distance formulae (see the sketch below).
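A minimal Python sketch for Exercise 4, assuming NumPy is available and reusing the variances $s_{11} = 4.2$ and $s_{22} = 3.56$ from Exercise 2 for the statistical distance:

```python
import numpy as np

P = np.array([[3, 5], [4, 5.5], [2, 4], [6, 7], [8, 10], [2, 5], [5, 7.5]])
Q = np.array([[1, 2], [2.5, 3], [4, 3], [2, 1], [1, 2], [1, 0], [3, 4]])
s = np.array([4.2, 3.56])   # sample variances s11, s22 of the x1 and x2 data

d_euclid = np.sqrt(((P - Q) ** 2).sum(axis=1))        # eq. (9)
d_stat   = np.sqrt((((P - Q) ** 2) / s).sum(axis=1))  # eq. (12)

for p, q, de, ds in zip(P, Q, d_euclid, d_stat):
    print(p, q, round(de, 3), round(ds, 3))
```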

Note: Some other important distance measures are as follows:

$$d_m(P, Q) = \left(\sum_{i=1}^{p} |x_i - y_i|^m\right)^{\frac{1}{m}}, \qquad (13)$$

$$d_c(P, Q) = \sum_{i=1}^{p} \frac{|x_i - y_i|}{(x_i + y_i)}, \qquad (14)$$

$$d_{cz}(P, Q) = 1 - 2\,\frac{\sum_{i=1}^{p} \min(x_i, y_i)}{\sum_{i=1}^{p} (x_i + y_i)}. \qquad (15)$$

The above formulae are respectively known as the Minkowski metric, the Canberra metric and the Czekanowski metric.
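A short sketch of these three metrics in plain Python/NumPy (the helper functions below are illustrative, written only to mirror the formulae above):

```python
import numpy as np

def minkowski(x, y, m):
    # eq. (13): m = 1 gives the city-block distance, m = 2 the Euclidean distance
    return (np.abs(x - y) ** m).sum() ** (1.0 / m)

def canberra(x, y):
    # eq. (14): assumes non-negative coordinates so that x_i + y_i > 0
    return (np.abs(x - y) / (x + y)).sum()

def czekanowski(x, y):
    # eq. (15): also intended for non-negative data
    return 1 - 2 * np.minimum(x, y).sum() / (x + y).sum()

x = np.array([3.0, 5.0])
y = np.array([1.0, 2.0])
print(minkowski(x, y, 2), canberra(x, y), czekanowski(x, y))
```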

It should be remembered, however, that eq.(12) is still not a complete, universal formula: for a few typical cases it does not cover some significant situations which may arise in statistical analysis. Let us consider the following plot. The scatter plot in fig.1 below depicts a two-dimensional situation in which the $x_1$ measurements do not vary independently of the $x_2$ measurements. In fact, the coordinates of the pairs $(x_1, x_2)$ exhibit a tendency to be large or small together, and the sample correlation coefficient is positive. In such a situation a more meaningful distance measure should be implemented$^3$.

Figure 1: The figure represents the movement of two data sets one designated by blue dots and
the other designated by green dots. The figure also shows that the two data sets are positively
correlated.

$^3$ Beyond the scope of our requirement.

Class 3
Multivariate Normal Distribution:
The multivariate Gaussian (or Normal) distribution is a generalization to two or more dimensions
of the univariate Gaussian distribution, which is often characterized by its resemblance to the shape
of a bell. In fact, in either of its univariate or multivariate incarnations, it is popularly referred to
as the Bell curve. It has many applications in theoretical and applied statistical research. Although
it is well known that real data rarely obey the dictates of the Gaussian distribution, this idealization
does provide us with a useful approximation to reality.

It is known that if the real-valued univariate random variable $X$ has the Gaussian distribution with mean $\mu$ and variance $\sigma^2$ (written as $X \sim N(\mu, \sigma^2)$), then its density function is given by the curve

$$f(x|\mu, \sigma) = \frac{1}{(2\pi\sigma^2)^{\frac{1}{2}}}\, e^{-\frac{1}{2\sigma^2}(x-\mu)^2}, \qquad x \in \mathbb{R}, \qquad (16)$$

where $-\infty < \mu < \infty$ and $\sigma > 0$. The constant multiplier $c = (2\pi\sigma^2)^{-\frac{1}{2}}$ ensures that the exponential function in the formula integrates to unity over the whole real line.

Now if we have $r$ random variables $X_1, X_2, \cdots, X_r$, each defined on the real line, we can write them as the $r$-dimensional column vector $\mathbf{X} = (X_1, X_2, \cdots, X_r)^T$, which we henceforth call a "random vector"$^4$. Also, if $A$ is a $j \times j$ matrix and $x$ is a $j$-vector, then the quadratic form is $x^T A x = \sum_{k=1}^{j}\sum_{l=1}^{j} A_{kl}\, x_k x_l$. A $j \times j$ matrix $A$ is positive definite if for any $j$-vector $x \neq 0$ the quadratic form $x^T A x > 0$. A positive definite matrix considered here is a symmetric matrix in which every eigenvalue is positive.

Problem: Verify whether the given matrix $A$ is positive definite, where $A = \begin{pmatrix} 1 & 2 \\ 2 & 1 \end{pmatrix}$.
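A minimal Python sketch for this check, assuming NumPy: a symmetric matrix is positive definite exactly when all of its eigenvalues are positive.

```python
import numpy as np

A = np.array([[1.0, 2.0],
              [2.0, 1.0]])

eigenvalues = np.linalg.eigvalsh(A)   # eigenvalues of a symmetric matrix
print(eigenvalues)                    # [-1.  3.]
print(np.all(eigenvalues > 0))        # False, so this A is not positive definite
```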
The random $r$-vector $\mathbf{X}$ is said to have the $r$-variate Gaussian (or Normal) distribution with mean $r$-vector $\mu$ and positive definite, symmetric $r \times r$ covariance matrix $\Sigma$ if its density function is given by the curve

$$f(x|\mu, \Sigma) = (2\pi)^{-\frac{r}{2}}\, |\Sigma|^{-\frac{1}{2}}\, e^{-\frac{1}{2}(x-\mu)^T \Sigma^{-1} (x-\mu)}, \qquad x \in \mathbb{R}^r. \qquad (17)$$

The square root, $\Delta$, of the quadratic form

$$\Delta^2 = (x-\mu)^T \Sigma^{-1} (x-\mu) \qquad (18)$$

is referred to as the Mahalanobis distance from $x$ to $\mu$. The multivariate Gaussian density is unimodal, always positive and integrates to unity. We write

$$\mathbf{X} \sim N_r(\mu, \Sigma) \qquad (19)$$

when we mean that $\mathbf{X}$ has the above $r$-variate Gaussian (or Normal) distribution. If $\Sigma$ is singular, then, almost surely, $\mathbf{X}$ lives on some reduced-dimensionality hyperplane so that its density function does not exist; in that case we say that $\mathbf{X}$ has a singular Gaussian or singular Normal distribution.
$^4$ $A^T$ is the transpose of the matrix $A$.
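A short hedged sketch, assuming NumPy and SciPy are available, showing how eq.(17) and the Mahalanobis distance of eq.(18) can be evaluated numerically for an illustrative (assumed) $\mu$ and $\Sigma$:

```python
import numpy as np
from scipy.stats import multivariate_normal

# Illustrative parameters of a bivariate normal distribution (assumed values)
mu = np.array([0.0, 1.0])
Sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])
x = np.array([1.0, 2.0])

# Mahalanobis distance, eq. (18)
diff = x - mu
delta2 = diff @ np.linalg.inv(Sigma) @ diff
print(np.sqrt(delta2))

# Density of eq. (17), evaluated directly and via SciPy for comparison
r = len(mu)
density = (2 * np.pi) ** (-r / 2) * np.linalg.det(Sigma) ** (-0.5) * np.exp(-0.5 * delta2)
print(density, multivariate_normal(mean=mu, cov=Sigma).pdf(x))
```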

A result due to Cramér and Wold states that the distribution of a random $r$-vector $\mathbf{X}$ is completely determined by its one-dimensional linear projections, $\alpha^T \mathbf{X}$, for every $r$-vector $\alpha$. This result allows us to give a more useful definition of the multivariate Gaussian distribution: the random $r$-vector $\mathbf{X}$ has the multivariate Gaussian distribution if and only if every linear function of $\mathbf{X}$ has the univariate Gaussian distribution.

If $\Sigma = \sigma^2 I_r$, then the multivariate Gaussian density function (17) reduces to

$$f(x|\mu, \sigma) = (2\pi)^{-\frac{r}{2}}\, \sigma^{-r}\, e^{-\frac{1}{2\sigma^2}(x-\mu)^T(x-\mu)}, \qquad (20)$$

and this is termed a spherical Gaussian density because $(x-\mu)^T(x-\mu) = a^2$ is the equation of an $r$-dimensional sphere centred at $\mu$. In general, the equation $(x-\mu)^T \Sigma^{-1} (x-\mu) = a^2$ is an ellipsoid centred at $\mu$, with $\Sigma$ determining its orientation and shape, and the multivariate Gaussian density function is constant along these ellipsoids.

When $r = 2$, the multivariate Gaussian density can be written out explicitly. Suppose

$$\mathbf{X} = (X_1, X_2)^T \sim N_2(\mu, \Sigma), \qquad (21)$$

where

$$\mu = (\mu_1, \mu_2)^T, \qquad \Sigma = \begin{pmatrix} \sigma_{11}^2 & \rho\,\sigma_{11}\sigma_{22} \\ \rho\,\sigma_{11}\sigma_{22} & \sigma_{22}^2 \end{pmatrix}, \qquad (22)$$

$\sigma_{11}^2$ is the variance of $X_1$, $\sigma_{22}^2$ is the variance of $X_2$, and

$$\rho = \frac{\mathrm{cov}(X_1, X_2)}{\sqrt{V(X_1)}\sqrt{V(X_2)}} = \frac{\sigma_{12}}{\sigma_{11}\sigma_{22}} \qquad (23)$$

is the correlation between $X_1$ and $X_2$.$^5$ (Here $\sigma_{12} = \mathrm{cov}(X_1, X_2)$ and, in this bivariate notation, $\sigma_{11}$ and $\sigma_{22}$ denote the standard deviations of $X_1$ and $X_2$.) It follows from eq.(22) that the determinant of $\Sigma$ is

$$|\Sigma| = (1 - \rho_{12}^2)\,\sigma_{11}^2 \sigma_{22}^2, \qquad (24)$$

so that the inverse of $\Sigma$ is

$$\Sigma^{-1} = \frac{1}{1 - \rho_{12}^2}\begin{pmatrix} \frac{1}{\sigma_{11}^2} & -\frac{\rho}{\sigma_{11}\sigma_{22}} \\ -\frac{\rho}{\sigma_{11}\sigma_{22}} & \frac{1}{\sigma_{22}^2} \end{pmatrix}. \qquad (25)$$

The bivariate Gaussian density function of $\mathbf{X}$ is therefore given by

$$f(x|\mu, \Sigma) = \frac{1}{2\pi\sigma_{11}\sigma_{22}\sqrt{1 - \rho_{12}^2}}\, e^{-\frac{1}{2}Q}, \qquad (26)$$

where

$$Q = \frac{1}{1 - \rho_{12}^2}\left\{\left(\frac{x_1 - \mu_1}{\sigma_{11}}\right)^2 - 2\rho_{12}\left(\frac{x_1 - \mu_1}{\sigma_{11}}\right)\left(\frac{x_2 - \mu_2}{\sigma_{22}}\right) + \left(\frac{x_2 - \mu_2}{\sigma_{22}}\right)^2\right\}. \qquad (27)$$

$^5$ $V(X)$ is the variance of $X$ and $\mathrm{cov}(X_1, X_2)$ is the covariance between $X_1$ and $X_2$.

If $X_1$ and $X_2$ are uncorrelated, $\rho = 0$, and the middle term in the expression (27) vanishes. In that case, the bivariate Gaussian density function reduces to the product of two univariate Gaussian densities,

$$f(x|\mu_1, \mu_2, \sigma_{11}^2, \sigma_{22}^2) = (2\pi\sigma_{11}\sigma_{22})^{-1}\, e^{-\frac{1}{2\sigma_{11}^2}(x_1 - \mu_1)^2}\, e^{-\frac{1}{2\sigma_{22}^2}(x_2 - \mu_2)^2} = f(x_1|\mu_1, \sigma_{11}^2)\, f(x_2|\mu_2, \sigma_{22}^2), \qquad (28)$$

implying that $X_1$ and $X_2$ are independent.

The Gaussian density curve (or normal curve) in the bivariate system is shown in figure 2.

Figure 2: The bivariate Gaussian (normal) density surface.

Problem: If the exponent of $e$ of a bivariate normal density is $-\frac{1}{54}(x^2 + 4y^2 + 2xy + 2x + 8y + 4)$, find $\sigma_1$, $\sigma_2$ and $\rho$, given that $\mu_1 = 0$ and $\mu_2 = -1$.

Sol:

$$-\frac{1}{54}(x^2 + 4y^2 + 2xy + 2x + 8y + 4) = -\frac{1}{2(1-\rho^2)}\left[\frac{x^2}{\sigma_1^2} - \frac{2\rho\, x(y+1)}{\sigma_1\sigma_2} + \frac{(y+1)^2}{\sigma_2^2}\right] = -\frac{1}{2(1-\rho^2)\sigma_1^2}\left[x^2 - 2\rho\frac{\sigma_1}{\sigma_2}x(y+1) + \left(\frac{\sigma_1}{\sigma_2}\right)^2 (y+1)^2\right] \qquad (29)$$

Comparing the coefficients we get

$$2(1-\rho^2)\sigma_1^2 = 54, \qquad 2\rho\left(\frac{\sigma_1}{\sigma_2}\right) = -2, \qquad \left(\frac{\sigma_1}{\sigma_2}\right)^2 = 4. \qquad (30)$$

Solving the set of eqs.(30) we get $\sigma_2 = 3$, $\sigma_1 = 6$ and $\rho = -0.5$.$^6$
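A small SymPy check (an optional sketch, not part of the original solution) confirming that $\sigma_1 = 6$, $\sigma_2 = 3$ and $\rho = -\tfrac{1}{2}$ reproduce the given exponent:

```python
import sympy as sp

x, y = sp.symbols('x y')
s1, s2, rho = 6, 3, sp.Rational(-1, 2)

# Exponent of the bivariate normal density with mu1 = 0, mu2 = -1
Q = -(sp.Rational(1, 2) / (1 - rho**2)) * ((x / s1)**2
      - 2 * rho * (x / s1) * ((y + 1) / s2) + ((y + 1) / s2)**2)

target = -sp.Rational(1, 54) * (x**2 + 4*y**2 + 2*x*y + 2*x + 8*y + 4)
print(sp.simplify(Q - target))   # prints 0, so the parameters are consistent
```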

Definition: If $X$ and $Y$ are continuous random variables and $f(x, y)$ is the value of their joint probability density at $(x, y)$, the function given by

$$g(x) = \int_{-\infty}^{\infty} f(x, y)\, dy, \qquad -\infty < x < \infty \qquad (31)$$

is called the marginal density of $X$. Correspondingly, the function given by

$$h(y) = \int_{-\infty}^{\infty} f(x, y)\, dx, \qquad -\infty < y < \infty \qquad (32)$$

is called the marginal density of $Y$.$^7$

Now we know that a pair of random variables $X$ and $Y$ have a bivariate normal distribution if their joint density function is of the form

$$f(x, y) = \frac{\exp\left[-\frac{1}{2(1-\rho_{12}^2)}\left\{\left(\frac{x-\mu_1}{\sqrt{\sigma_{11}}}\right)^2 - 2\rho_{12}\left(\frac{x-\mu_1}{\sqrt{\sigma_{11}}}\right)\left(\frac{y-\mu_2}{\sqrt{\sigma_{22}}}\right) + \left(\frac{y-\mu_2}{\sqrt{\sigma_{22}}}\right)^2\right\}\right]}{2\pi\sqrt{\sigma_{11}\sigma_{22}}\sqrt{1-\rho_{12}^2}}, \qquad (33)$$

for $-\infty < x < \infty$, $-\infty < y < \infty$, $\sigma_1, \sigma_2 > 0$ and $-1 < \rho < 1$. Eq.(33) is similar to eq.(26). For simplicity we denote $\rho_{12}$ by $\rho$ and $\sqrt{\sigma_{11}}$, $\sqrt{\sigma_{22}}$ simply by $\sigma_1$ and $\sigma_2$ respectively.$^8$ Then the marginal density of $X$ is given by

$$g(x) = \frac{e^{-\frac{1}{2(1-\rho^2)}\left(\frac{x-\mu_1}{\sigma_1}\right)^2}}{2\pi\sigma_1\sigma_2\sqrt{1-\rho^2}} \int_{-\infty}^{\infty} e^{-\frac{1}{2(1-\rho^2)}\left[\left(\frac{y-\mu_2}{\sigma_2}\right)^2 - 2\rho\left(\frac{x-\mu_1}{\sigma_1}\right)\left(\frac{y-\mu_2}{\sigma_2}\right)\right]}\, dy. \qquad (34)$$

$^6$ One can take the reciprocal of $\sigma_2$ as the common factor as well. If we do not take the reciprocal of $\sigma_1$ or $\sigma_2$ as a common factor, then the $\rho$ adjustment cannot be done in this problem.
$^7$ For the discrete case, however, the integration is replaced by summation.
$^8$ We generally denote the standard deviation of a variable by $\sigma$. In this note, however, by $\sigma_{11}$ we mean the variance of the variable about its mean.

Substituting $u = \frac{x-\mu_1}{\sigma_1}$ and $v = \frac{y-\mu_2}{\sigma_2}$, so that $\sigma_1\, du = dx$ and $\sigma_2\, dv = dy$, we get from eq.(34)

$$\begin{aligned}
g(x) &= \frac{e^{-\frac{u^2}{2(1-\rho^2)}}}{2\pi\sigma_1\sigma_2\sqrt{1-\rho^2}} \int_{-\infty}^{\infty} e^{-\frac{1}{2(1-\rho^2)}\left[v^2 - 2\rho u v\right]}\, \sigma_2\, dv \\
&= \frac{e^{-\frac{u^2}{2(1-\rho^2)}}}{2\pi\sigma_1\sqrt{1-\rho^2}} \int_{-\infty}^{\infty} e^{-\frac{1}{2(1-\rho^2)}\left[(v-\rho u)^2 - (\rho u)^2\right]}\, dv \\
&= \frac{e^{-\frac{u^2}{2(1-\rho^2)}}\, e^{\frac{(\rho u)^2}{2(1-\rho^2)}}}{2\pi\sigma_1\sqrt{1-\rho^2}} \int_{-\infty}^{\infty} e^{-\frac{1}{2}\left(\frac{v-\rho u}{\sqrt{1-\rho^2}}\right)^2}\, dv \\
&= \frac{e^{-\frac{1}{2}u^2}}{\sqrt{2\pi}\,\sigma_1} \left[\frac{1}{\sqrt{2\pi}\sqrt{1-\rho^2}} \int_{-\infty}^{\infty} e^{-\frac{1}{2}\left(\frac{v-\rho u}{\sqrt{1-\rho^2}}\right)^2}\, dv\right]
\end{aligned} \qquad (35)$$

Identifying the quantity in brackets in eq.(35) as the integral of a normal density from $-\infty$ to $\infty$, and hence equal to 1, we get

$$g(x) = \frac{e^{-\frac{1}{2}u^2}}{\sigma_1\sqrt{2\pi}} = \frac{e^{-\frac{1}{2}\left(\frac{x-\mu_1}{\sigma_1}\right)^2}}{\sigma_1\sqrt{2\pi}}, \qquad -\infty < x < \infty. \qquad (36)$$

Hence the marginal density of $X$ is a normal distribution with mean $\mu_1$ and standard deviation $\sigma_1$. By symmetry one can easily show that the marginal density of $Y$ is a normal distribution with mean $\mu_2$ and standard deviation $\sigma_2$.

Definition: If $f(x, y)$ is the value of the joint density of the continuous random variables $X$ and $Y$ at $(x, y)$, and $h(y)$ is the value of the marginal density of $Y$ at $y$, the function given by

$$f(x|y) = \frac{f(x, y)}{h(y)}, \qquad h(y) \neq 0, \qquad (37)$$

for $-\infty < x < \infty$, is called the conditional density of $X$ given $Y = y$. Correspondingly, if $g(x)$ is the value of the marginal density of $X$ at $x$, the function given by

$$w(y|x) = \frac{f(x, y)}{g(x)}, \qquad g(x) \neq 0, \qquad (38)$$

for $-\infty < y < \infty$, is called the conditional density of $Y$ given $X = x$.

Theorem: If $X$ and $Y$ have a bivariate normal distribution, the conditional density of $Y$ given $X = x$ is a normal distribution with mean $\mu_{Y|x} = \mu_2 + \rho\frac{\sigma_2}{\sigma_1}(x - \mu_1)$ and variance $\sigma_{Y|x}^2 = \sigma_2^2(1-\rho^2)$, and the conditional density of $X$ given $Y = y$ is a normal distribution with mean $\mu_{X|y} = \mu_1 + \rho\frac{\sigma_1}{\sigma_2}(y - \mu_2)$ and variance $\sigma_{X|y}^2 = \sigma_1^2(1-\rho^2)$.

Proof: With $u = \frac{x-\mu_1}{\sigma_1}$ and $v = \frac{y-\mu_2}{\sigma_2}$, we have

$$w(y|x) = \frac{\frac{1}{2\pi\sigma_1\sigma_2\sqrt{1-\rho^2}}\, e^{-\frac{1}{2(1-\rho^2)}\left[u^2 - 2\rho uv + v^2\right]}}{\frac{1}{\sqrt{2\pi}\,\sigma_1}\, e^{-\frac{1}{2}u^2}} = \frac{1}{\sqrt{2\pi}\,\sigma_2\sqrt{1-\rho^2}}\, e^{-\frac{1}{2(1-\rho^2)}\left[u^2 - 2\rho uv + v^2\right]}\, e^{\frac{1}{2}u^2}. \qquad (39)$$

Now

$$e^{-\frac{1}{2(1-\rho^2)}\left[u^2 - 2\rho uv + v^2\right]}\, e^{\frac{1}{2}u^2} = e^{-\frac{(v-\rho u)^2}{2(1-\rho^2)}}\, e^{\frac{(\rho u)^2}{2(1-\rho^2)}}\, e^{-\frac{u^2}{2(1-\rho^2)}}\, e^{\frac{1}{2}u^2}. \qquad (40)$$

The right-hand side of eq.(40) is obtained by writing $u^2 - 2\rho uv + v^2$ as $(v - \rho u)^2 - (\rho u)^2 + u^2$. After a little simplification of eq.(40) (the last three exponential factors multiply to 1), we can write

$$e^{-\frac{1}{2(1-\rho^2)}\left[u^2 - 2\rho uv + v^2\right]}\, e^{\frac{1}{2}u^2} = e^{-\frac{(v-\rho u)^2}{2(1-\rho^2)}} = e^{-\frac{1}{2}\left[\frac{v-\rho u}{\sqrt{1-\rho^2}}\right]^2}. \qquad (41)$$

Thereby from eq.(39) we get

$$w(y|x) = \frac{1}{\sqrt{2\pi}\,\sigma_2\sqrt{1-\rho^2}}\, e^{-\frac{1}{2}\left[\frac{v-\rho u}{\sqrt{1-\rho^2}}\right]^2}. \qquad (42)$$

$v - \rho u$ can, however, be expressed as

$$v - \rho u = \frac{y - \mu_2}{\sigma_2} - \rho\,\frac{x - \mu_1}{\sigma_1} = \frac{y - \left(\mu_2 + \rho\frac{\sigma_2}{\sigma_1}(x - \mu_1)\right)}{\sigma_2}. \qquad (43)$$

Using (43) in eq.(42) we get

$$w(y|x) = \frac{1}{\sqrt{2\pi}\,\sigma_2\sqrt{1-\rho^2}}\, e^{-\frac{1}{2}\left[\frac{y - \left(\mu_2 + \rho\frac{\sigma_2}{\sigma_1}(x - \mu_1)\right)}{\sigma_2\sqrt{1-\rho^2}}\right]^2}, \qquad -\infty < y < \infty, \qquad (44)$$

and it can be seen by inspection that this is a normal density with mean and variance

$$\mu_{Y|x} = \mu_2 + \rho\,\frac{\sigma_2}{\sigma_1}(x - \mu_1), \qquad (45)$$
$$\sigma_{Y|x}^2 = \sigma_2^2(1 - \rho^2). \qquad (46)$$

The corresponding result for the conditional density of $X$ given $Y = y$ follows by symmetry.

Note (Remember): If $\mathbf{X}$ is a random $r$-vector with values in $\mathbb{R}^r$, then its expected value is the $r$-vector

$$\mu_X = E(\mathbf{X}) = (E(X_1), \cdots, E(X_r))^T = (\mu_1, \cdots, \mu_r)^T, \qquad (47)$$

and the $r \times r$ covariance matrix of $\mathbf{X}$ is given by

$$\Sigma_{XX} = \mathrm{cov}(\mathbf{X}, \mathbf{X}) = E\{(\mathbf{X} - \mu_X)(\mathbf{X} - \mu_X)^T\} = \begin{pmatrix} \sigma_{11}^2 & \sigma_{12} & \cdots & \sigma_{1r} \\ \sigma_{21} & \sigma_{22}^2 & \cdots & \sigma_{2r} \\ \cdots & \cdots & \cdots & \cdots \\ \sigma_{r1} & \sigma_{r2} & \cdots & \sigma_{rr}^2 \end{pmatrix}, \qquad (48)$$

where

$$\sigma_{ii}^2 = \mathrm{Var}(X_i) = E\{(X_i - \mu_i)^2\} \qquad (49)$$

is the variance of $X_i$, $i = 1, 2, \cdots, r$, and

$$\sigma_{ij} = \mathrm{cov}(X_i, X_j) = E\{(X_i - \mu_i)(X_j - \mu_j)\} \qquad (50)$$

is the covariance between $X_i$ and $X_j$, $i, j = 1, 2, \cdots, r$ $(i \neq j)$.

Problem: Show that $\Sigma_{XX} = E(\mathbf{X}\mathbf{X}^T) - \mu_X\mu_X^T$.

Problem: Deduce eqs.(24) - (27).

Suppose we have two random vectors, $\mathbf{X}$ and $\mathbf{Y}$, where $\mathbf{X}$ has $r$ components and $\mathbf{Y}$ has $s$ components. Let $\mathbf{Z}$ be the random $(r+s)$-vector

$$\mathbf{Z} = \begin{pmatrix} \mathbf{X} \\ \mathbf{Y} \end{pmatrix}. \qquad (51)$$

Then the expected value of $\mathbf{Z}$ is the $(r+s)$-vector

$$\mu_Z = E(\mathbf{Z}) = \begin{pmatrix} E(\mathbf{X}) \\ E(\mathbf{Y}) \end{pmatrix} = \begin{pmatrix} \mu_X \\ \mu_Y \end{pmatrix}, \qquad (52)$$

and the covariance matrix of $\mathbf{Z}$ is the partitioned $(r+s) \times (r+s)$ matrix

$$\Sigma_{ZZ} = E\{(\mathbf{Z} - \mu_Z)(\mathbf{Z} - \mu_Z)^T\} = \begin{pmatrix} \mathrm{cov}(\mathbf{X}, \mathbf{X}) & \mathrm{cov}(\mathbf{X}, \mathbf{Y}) \\ \mathrm{cov}(\mathbf{Y}, \mathbf{X}) & \mathrm{cov}(\mathbf{Y}, \mathbf{Y}) \end{pmatrix} = \begin{pmatrix} \Sigma_{XX} & \Sigma_{XY} \\ \Sigma_{YX} & \Sigma_{YY} \end{pmatrix}, \qquad (53)$$

where

$$\Sigma_{XY} = \mathrm{cov}(\mathbf{X}, \mathbf{Y}) = E\{(\mathbf{X} - \mu_X)(\mathbf{Y} - \mu_Y)^T\} = \Sigma_{YX}^T \qquad (54)$$

is an $r \times s$ matrix.

If $\mathbf{Y}$ is linearly related to $\mathbf{X}$ in the sense that $\mathbf{Y} = A\mathbf{X} + b$, where $A$ is a fixed $s \times r$ matrix and $b$ is a fixed $s$-vector, then the mean vector and covariance matrix of $\mathbf{Y}$ are given by

$$\mu_Y = A\mu_X + b, \qquad (55)$$
$$\Sigma_{YY} = A\,\Sigma_{XX}\,A^T, \qquad (56)$$

respectively.

0.1 Moment about origin (univariate case):

The $r$-th moment about the origin of a random variable $X$, denoted by $\mu'_r$, is the expected value of $X^r$. Symbolically,

$$\mu'_r = E(X^r) = \begin{cases} \sum_x x^r f(x), & X \text{ is discrete} \\ \int_{-\infty}^{\infty} x^r f(x)\, dx, & X \text{ is continuous.} \end{cases}$$

The moment generating function of a random variable $X$, where it exists, is given by

$$M_X(t) = E(e^{tX}) = \begin{cases} \sum_x e^{tx} f(x), & X \text{ is discrete} \\ \int_{-\infty}^{\infty} e^{tx} f(x)\, dx, & X \text{ is continuous.} \end{cases}$$

Now, as $e^{tx} = 1 + (tx) + \frac{(tx)^2}{2!} + \cdots + \frac{(tx)^r}{r!} + \cdots$,

$$\begin{aligned}
M_X(t) &= \sum_x \left[1 + (tx) + \frac{(tx)^2}{2!} + \cdots + \frac{(tx)^r}{r!} + \cdots\right] f(x) \\
&= \sum_x f(x) + t\sum_x x f(x) + \frac{t^2}{2!}\sum_x x^2 f(x) + \cdots + \frac{t^r}{r!}\sum_x x^r f(x) + \cdots \\
&= 1 + \mu t + \mu'_2\,\frac{t^2}{2!} + \cdots + \mu'_r\,\frac{t^r}{r!} + \cdots
\end{aligned} \qquad (57)$$

From this it is clear that $E(X^r) = \mu'_r$ (which is the coefficient of $\frac{t^r}{r!}$).

Problem: Show that the moment generating function of the normal distribution is $M_X(t) = e^{\mu t + \frac{1}{2}\sigma^2 t^2}$.

0.2 Moment generating function in bivariate distribution:

Let $X$ and $Y$ be two random variables having joint probability density function $f(x, y)$. Then the moment generating function (m.g.f.) of $X$ and $Y$ is

$$M_{X,Y}(t_1, t_2) = E[\exp(t_1 X + t_2 Y)] = \int\!\!\int_{\mathbb{R}^2} \exp(t_1 x + t_2 y)\, f(x, y)\, dx\, dy. \qquad (58)$$

The m.g.f. of the bivariate normal distribution is

$$M_{X,Y}(t_1, t_2) = \exp\left[t_1\mu_X + t_2\mu_Y + \frac{1}{2}\left(t_1^2\sigma_X^2 + t_2^2\sigma_Y^2 + 2\rho_{XY}\, t_1 t_2\, \sigma_X \sigma_Y\right)\right], \qquad (59)$$

where $\rho_{XY}$ is the correlation coefficient between $X$ and $Y$, and $\sigma^2$ and $\mu$ are respectively the variance and mean of the corresponding random variable. When $X$ and $Y$ are uncorrelated, $\rho_{XY} = 0$ and consequently

$$M_{X,Y}(t_1, t_2) = \exp\left[t_1\mu_X + t_2\mu_Y + \frac{1}{2}\left(t_1^2\sigma_X^2 + t_2^2\sigma_Y^2\right)\right]. \qquad (60)$$

Problem: Show that if $X$ and $Y$ are standard normal variates with correlation coefficient $\rho$ between them, then the correlation coefficient between $X^2$ and $Y^2$ is $\rho^2$.

Sol: Since $X$ and $Y$ are standard normal variates, we have $E(X) = E(Y) = 0$ and $V(X) = V(Y) = 1$. Now we know $V(X) = E(X^2) - (E(X))^2 = 1$ and similarly $V(Y) = E(Y^2) - (E(Y))^2 = 1$, so that $E(X^2) = 1$ and $E(Y^2) = 1$. It is also given that $\rho$ is the correlation coefficient between $X$ and $Y$ (whose computational formula is given in eq.(23)), i.e.

$$\rho_{XY} = \frac{\mathrm{Cov}(X, Y)}{\sqrt{V(X)}\sqrt{V(Y)}}, \qquad (61)$$

where $\mathrm{Cov}(X, Y) = E[XY] - E[X]E[Y]$ and $V(X)$ and $V(Y)$ are defined as usual. Then

$$\rho_{X^2 Y^2} = \frac{\mathrm{Cov}(X^2, Y^2)}{\sqrt{V(X^2)}\sqrt{V(Y^2)}} = \frac{E[X^2 Y^2] - E[X^2]E[Y^2]}{\sqrt{E[(X^2)^2] - (E[X^2])^2}\sqrt{E[(Y^2)^2] - (E[Y^2])^2}} = \frac{E[X^2 Y^2] - E(X^2)E(Y^2)}{\sqrt{E(X^4) - 1}\sqrt{E(Y^4) - 1}}. \qquad (62)$$

Now, as the variables $X$ and $Y$ have mean 0 and variance 1, using eq.(59) the moment generating function is

$$M_{X,Y}(t_1, t_2) = \exp\left[\frac{1}{2}\left(t_1^2 + t_2^2 + 2\rho t_1 t_2\right)\right] = 1 + \frac{1}{2}\left(t_1^2 + t_2^2 + 2\rho t_1 t_2\right) + \frac{1}{2!}\left[\frac{1}{2}\left(t_1^2 + t_2^2 + 2\rho t_1 t_2\right)\right]^2 + \frac{1}{3!}\left[\frac{1}{2}\left(t_1^2 + t_2^2 + 2\rho t_1 t_2\right)\right]^3 + \cdots \qquad (63)$$

From eq.(63) we get $E(X^2 Y^2)$ = coefficient of $\frac{t_1^2}{2!}\frac{t_2^2}{2!}$, which is $2\rho^2 + 1$. The coefficient of $\frac{t_1^4}{4!}$ is $E(X^4)$, whose value is 3, while the coefficient of $\frac{t_2^4}{4!}$ is $E(Y^4)$, whose value is also 3. Substituting these values in eq.(62) we get

$$\rho_{X^2 Y^2} = \frac{2\rho^2 + 1 - 1}{\sqrt{3 - 1}\sqrt{3 - 1}} = \rho^2. \qquad \text{Q.E.D.}$$
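An optional Monte Carlo sketch (assuming NumPy) that illustrates this result numerically for one choice of $\rho$:

```python
import numpy as np

rng = np.random.default_rng(0)
rho = 0.6
cov = [[1.0, rho], [rho, 1.0]]

# Draw a large sample from a standard bivariate normal with correlation rho
X, Y = rng.multivariate_normal([0.0, 0.0], cov, size=1_000_000).T

print(np.corrcoef(X**2, Y**2)[0, 1])   # close to rho**2 = 0.36
```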

Problem: The variables $X$ and $Y$, with zero means and standard deviations $\sigma_1$ and $\sigma_2$, are normally correlated with correlation coefficient $\rho$. Show that $U$ and $V$, defined as $U = \frac{X}{\sigma_1} + \frac{Y}{\sigma_2}$ and $V = \frac{X}{\sigma_1} - \frac{Y}{\sigma_2}$, are independent normal variates with variances $2(1+\rho)$ and $2(1-\rho)$ respectively.

Sol: From eqs.(26) and (27) we see that the probability function (for the bivariate normal distribution) is

$$dF(x, y) = \frac{1}{2\pi\sigma_1\sigma_2\sqrt{1-\rho^2}} \exp\left[-\frac{1}{2(1-\rho^2)}\left\{\left(\frac{x-\mu_1}{\sigma_1}\right)^2 + \left(\frac{y-\mu_2}{\sigma_2}\right)^2 - 2\rho\left(\frac{x-\mu_1}{\sigma_1}\right)\left(\frac{y-\mu_2}{\sigma_2}\right)\right\}\right] dx\, dy. \qquad (64)$$

According to the given condition $\mu_1 = 0$ and $\mu_2 = 0$, so that

$$dF(x, y) = \frac{1}{2\pi\sigma_1\sigma_2\sqrt{1-\rho^2}} \exp\left[-\frac{1}{2(1-\rho^2)}\left\{\left(\frac{x}{\sigma_1}\right)^2 + \left(\frac{y}{\sigma_2}\right)^2 - 2\rho\left(\frac{x}{\sigma_1}\right)\left(\frac{y}{\sigma_2}\right)\right\}\right] dx\, dy, \qquad (65)$$

where $-\infty < x, y < \infty$. Now we put $u = \frac{x}{\sigma_1} + \frac{y}{\sigma_2}$ and $v = \frac{x}{\sigma_1} - \frac{y}{\sigma_2}$. Solving these two equations we get

$$x = \frac{\sigma_1}{2}(u + v), \qquad y = \frac{\sigma_2}{2}(u - v). \qquad (66)$$

The Jacobian is therefore calculated as

$$J = \frac{\partial(x, y)}{\partial(u, v)} = \begin{vmatrix} \frac{\partial x}{\partial u} & \frac{\partial x}{\partial v} \\ \frac{\partial y}{\partial u} & \frac{\partial y}{\partial v} \end{vmatrix} = \begin{vmatrix} \frac{1}{2}\sigma_1 & \frac{1}{2}\sigma_1 \\ \frac{1}{2}\sigma_2 & -\frac{1}{2}\sigma_2 \end{vmatrix} = -\frac{\sigma_1\sigma_2}{2}. \qquad (67)$$

We also calculate the following:

$$\frac{x^2}{\sigma_1^2} + \frac{y^2}{\sigma_2^2} = \frac{u^2 + v^2}{2}, \qquad (68)$$

and

$$xy = \frac{\sigma_1\sigma_2}{4}(u^2 - v^2). \qquad (69)$$

Therefore, by transformation of random variables,

$$dF(u, v) = \frac{1}{2\pi\sigma_1\sigma_2\sqrt{1-\rho^2}} \exp\left[-\frac{1}{2(1-\rho^2)}\left\{\frac{u^2 + v^2}{2} - \frac{2\rho}{\sigma_1\sigma_2}\cdot\frac{\sigma_1\sigma_2}{4}(u^2 - v^2)\right\}\right] \left|-\frac{\sigma_1\sigma_2}{2}\right|\, du\, dv. \qquad (70)$$

Simplifying eq.(70),

$$dF(u, v) = \frac{1}{\sqrt{2\pi}\sqrt{2(1+\rho)}} \exp\left[-\frac{u^2}{4(1+\rho)}\right] du\ \cdot\ \frac{1}{\sqrt{2\pi}\sqrt{2(1-\rho)}} \exp\left[-\frac{v^2}{4(1-\rho)}\right] dv = f_1(u)\, du\ f_2(v)\, dv. \qquad (71)$$

Hence $U$ and $V$ are independently distributed, $U$ as $N(0, 2(1+\rho))$ and $V$ as $N(0, 2(1-\rho))$.
Q.E.D

0.2.1 A few important facts:

If $\vec{X}$ is distributed as $N_p(\vec{\mu}, \vec{\Sigma})$, then any linear combination of the variables, $\vec{a}'\vec{X} = a_1 X_1 + a_2 X_2 + \cdots + a_p X_p$, is distributed as $N(\vec{a}'\vec{\mu}, \vec{a}'\vec{\Sigma}\vec{a})$. Conversely, if $\vec{a}'\vec{X}$ is distributed as $N(\vec{a}'\vec{\mu}, \vec{a}'\vec{\Sigma}\vec{a})$ for every $\vec{a}$, then $\vec{X}$ must follow $N_p(\vec{\mu}, \vec{\Sigma})$.

Problem: Let $\vec{X}$ be $N_3(\vec{\mu}, \vec{\Sigma})$ with $\vec{\mu}' = [2, -3, 1]$ and $\vec{\Sigma} = \begin{pmatrix} 1 & 1 & 1 \\ 1 & 3 & 2 \\ 1 & 2 & 2 \end{pmatrix}$. Find the distribution of $3X_1 - 2X_2 + X_3$.

Sol. From the expression $3X_1 - 2X_2 + X_3$ it is clear that $\vec{a}' = [3, -2, 1]$, so that $3X_1 - 2X_2 + X_3 = \vec{a}'\vec{X}$, where $\vec{X} = [X_1, X_2, X_3]^T$. Hence it is easy to see that $\vec{a}'\vec{\mu} = 13$. Also it is straightforward to see that $\vec{a}'\vec{\Sigma}\vec{a} = 9$. Hence the distribution of $3X_1 - 2X_2 + X_3$ is $N(13, 9)$.
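A minimal NumPy check of this computation (sketch only):

```python
import numpy as np

a     = np.array([3, -2, 1])
mu    = np.array([2, -3, 1])
Sigma = np.array([[1, 1, 1],
                  [1, 3, 2],
                  [1, 2, 2]])

mean_lc = a @ mu          # a' mu
var_lc  = a @ Sigma @ a   # a' Sigma a
print(mean_lc, var_lc)    # 3*X1 - 2*X2 + X3 ~ N(13, 9)
```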

Results:

• If $\vec{X}_1$ ($q_1 \times 1$) and $\vec{X}_2$ ($q_2 \times 1$) are independent, then $\mathrm{Cov}(\vec{X}_1, \vec{X}_2) = \vec{0}$ (a $q_1 \times q_2$ matrix of zeroes).

• If $\begin{pmatrix} \vec{X}_1 \\ \vec{X}_2 \end{pmatrix}$ is $N_{q_1+q_2}\left(\begin{pmatrix} \vec{\mu}_1 \\ \vec{\mu}_2 \end{pmatrix}, \begin{pmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{pmatrix}\right)$, then $\vec{X}_1$ and $\vec{X}_2$ are independent if and only if $\Sigma_{12} = 0 = \Sigma_{21}$.

• If $\vec{X}_1$ and $\vec{X}_2$ are independent and are distributed as $N_{q_1}(\vec{\mu}_1, \Sigma_{11})$ and $N_{q_2}(\vec{\mu}_2, \Sigma_{22})$ respectively, then $\begin{pmatrix} \vec{X}_1 \\ \vec{X}_2 \end{pmatrix}$ has the multivariate normal distribution $N_{q_1+q_2}\left(\begin{pmatrix} \vec{\mu}_1 \\ \vec{\mu}_2 \end{pmatrix}, \begin{pmatrix} \Sigma_{11} & 0 \\ 0 & \Sigma_{22} \end{pmatrix}\right)$.
X µ⃗2 0 Σ22
 
Problem: Let $\vec{X}$ be $N_3(\vec{\mu}, \vec{\Sigma})$ with $\vec{\mu}' = [-3, 1, 4]$ and $\vec{\Sigma} = \begin{pmatrix} 1 & -2 & 0 \\ -2 & 5 & 0 \\ 0 & 0 & 2 \end{pmatrix}$. (a) Are $X_1$ and $X_2$ independent? (b) Are $X_2$ and $X_3$ independent? (c) Are $(X_1, X_2)$ and $X_3$ independent? (d) Are $\frac{X_1 + X_2}{2}$ and $X_3$ independent? (e) Are $X_2$ and $X_2 - \frac{5}{2}X_1 - X_3$ independent?

Sol: (a) From the given variance-covariance matrix, $\sigma_{12} = -2 = \sigma_{21} \neq 0$, hence $X_1$ and $X_2$ are not independent. (b) $\sigma_{23} = 0 = \sigma_{32} = \mathrm{Cov}(X_2, X_3)$, so $X_2$ and $X_3$ are independent. (c) (Hint:) Represent the variance-covariance matrix as a block matrix and the result will follow. (d) and (e) require the following theorem.

Theorem: Let $\vec{X}_1, \vec{X}_2, \cdots, \vec{X}_n$ be mutually independent with $\vec{X}_j$ distributed as $N_p(\vec{\mu}_j, \vec{\Sigma})$ [each $\vec{X}_j$ has the same covariance matrix $\vec{\Sigma}$]. Then $V_1 = c_1\vec{X}_1 + c_2\vec{X}_2 + \cdots + c_n\vec{X}_n$ is distributed as $N_p\left(\sum_{j=1}^{n} c_j\vec{\mu}_j, \left(\sum_{j=1}^{n} c_j^2\right)\vec{\Sigma}\right)$. Moreover, $V_1$ and $V_2 = b_1\vec{X}_1 + b_2\vec{X}_2 + \cdots + b_n\vec{X}_n$ are jointly multivariate normal with covariance matrix

$$\begin{pmatrix} \left(\sum_{j=1}^{n} c_j^2\right)\vec{\Sigma} & (\vec{b}'\vec{c})\vec{\Sigma} \\ (\vec{b}'\vec{c})\vec{\Sigma} & \left(\sum_{j=1}^{n} b_j^2\right)\vec{\Sigma} \end{pmatrix}.$$

Consequently $V_1$ and $V_2$ are independent if $\vec{b}'\vec{c} = \sum_{j=1}^{n} c_j b_j = 0$.

Now, by the above theorem, we have two linear combinations, viz. $V_1 = \frac{1}{2}X_1 + \frac{1}{2}X_2 + 0\cdot X_3$ and $V_2 = 0\cdot X_1 + 0\cdot X_2 + 1\cdot X_3$. Since $\vec{b}'\vec{c} = \begin{bmatrix} \frac{1}{2} & \frac{1}{2} & 0 \end{bmatrix}\begin{bmatrix} 0 \\ 0 \\ 1 \end{bmatrix} = 0$, $V_1$ and $V_2$ are independent. Again take $U_1 = 0\cdot X_1 + 1\cdot X_2 + 0\cdot X_3$ and $U_2 = -\frac{5}{2}X_1 + 1\cdot X_2 + (-1)X_3$. As before, $\vec{b}'\vec{c} = \begin{bmatrix} 0 & 1 & 0 \end{bmatrix}\begin{bmatrix} -\frac{5}{2} \\ 1 \\ -1 \end{bmatrix} = 1 \neq 0$. Hence $X_2$ and $X_2 - \frac{5}{2}X_1 - X_3$ are not independent.

0.2.2 What is Simple Linear Regression?
Regression analysis is used to model and explore empirical relationships between variables. The relationships are non-deterministic in nature. Two types of variables are considered in regression analysis: (a) the predictor or regressor (often called the independent variable) and (b) the dependent or response variable. Suppose you have considered various flats in a locality. You want to find whether there is any relationship between the square-foot areas of these flats and the electricity consumption of the respective occupants. If there is any such relationship, it is obviously a non-deterministic one: a particular flat with a large area may have less electricity consumption than a flat with a small area, and vice versa.

In simple linear regression analysis there is a single regressor variable, which we denote by $x$, and one response variable, denoted by $Y$. Suppose that the true relationship between $x$ and $Y$ is a straight line and that the observation $Y$ at each level of $x$ is a random variable. This supposition is based upon the scatter diagram of the bivariate distribution. Let us consider the following case. In a chemical distillation process one is interested in finding the non-deterministic relationship between hydrocarbon level (%) and purity (%). The chart is shown below in Table 1.
Table 1: Hydrocarbon level vs Purity in a chemical distillation process


Observation Number Hydrocarbon Level x (%) Purity y(%)
1 0.99 90.01
2 1.02 89.05
3 1.15 91.43
4 1.29 93.74
5 1.46 96.73
6 1.36 94.45
7 0.87 87.59
8 1.23 91.77
9 1.55 99.42
10 1.40 93.65
11 1.19 93.54
12 1.15 92.52
13 0.98 90.56
14 1.01 89.54
15 1.11 89.85
16 1.20 90.39
17 1.26 93.25
18 1.32 93.41
19 1.43 94.98
20 0.95 87.33

Let us draw the scatter diagram taking the values of the regressor and the response from Table 1; it is shown in figure 1. From the figure it appears justified that the possible non-deterministic relationship between the regressor and the response is linear in nature. We now assume that, for each value of $x$, the observation $Y$ is described by the model

$$Y = \beta_0 + \beta_1 x + \epsilon, \qquad (72)$$

where $\epsilon$ is a random error with mean zero and (unknown) variance $\sigma^2$, and $\beta_0$ and $\beta_1$ are the regression coefficients, so that the expected value of $Y$ for each $x$ is $E(Y|x) = \beta_0 + \beta_1 x$. The estimation of these regression coefficients is done by the method of least squares. For $n$ pairs of observations $(x_1, y_1), (x_2, y_2), \cdots, (x_n, y_n)$, eq.(72) is re-written as

$$y_i = \beta_0 + \beta_1 x_i + \epsilon_i, \qquad i = 1, 2, \cdots, n. \qquad (73)$$

The sum of the squares of the deviations of the observations from the true regression line is

$$L = \sum_{i=1}^{n} \epsilon_i^2 = \sum_{i=1}^{n} (y_i - \beta_0 - \beta_1 x_i)^2. \qquad (74)$$

Let the estimators of $\beta_0$ and $\beta_1$ be denoted by $\hat{\beta}_0$ and $\hat{\beta}_1$. For these two estimators to be the least squares estimators, the following conditions must hold:

$$\left.\frac{\partial L}{\partial \beta_0}\right|_{\hat{\beta}_0, \hat{\beta}_1} = -2\sum_{i=1}^{n}(y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i) = 0,$$
$$\left.\frac{\partial L}{\partial \beta_1}\right|_{\hat{\beta}_0, \hat{\beta}_1} = -2\sum_{i=1}^{n}(y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i)\,x_i = 0. \qquad (75)$$

Simplifying the equations of (75) yields

$$n\hat{\beta}_0 + \hat{\beta}_1\sum_{i=1}^{n} x_i = \sum_{i=1}^{n} y_i,$$
$$\hat{\beta}_0\sum_{i=1}^{n} x_i + \hat{\beta}_1\sum_{i=1}^{n} x_i^2 = \sum_{i=1}^{n} y_i x_i. \qquad (76)$$

The pair of equations in (76) are known as the least squares normal equations. The solution of the normal equations gives the least squares estimators $\hat{\beta}_0$ and $\hat{\beta}_1$. Thus the least squares estimates of the intercept $\beta_0$ and the slope $\beta_1$ in the simple linear regression model are

$$\hat{\beta}_0 = \bar{y} - \hat{\beta}_1\bar{x}, \qquad (77)$$

$$\hat{\beta}_1 = \frac{\sum_{i=1}^{n} y_i x_i - \frac{\left(\sum_{i=1}^{n} y_i\right)\left(\sum_{i=1}^{n} x_i\right)}{n}}{\sum_{i=1}^{n} x_i^2 - \frac{\left(\sum_{i=1}^{n} x_i\right)^2}{n}}, \qquad (78)$$

where $\bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_i$ and $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$. The numerator of $\hat{\beta}_1$ is often denoted by $S_{xy}$ and the denominator by $S_{xx}$, so that $\hat{\beta}_1 = \frac{S_{xy}}{S_{xx}}$ (see the sketch below).
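A minimal Python sketch of eqs.(77)-(78) for the hydrocarbon-purity data of Table 1 (NumPy assumed; the appendix referred to in these notes uses R instead):

```python
import numpy as np

x = np.array([0.99, 1.02, 1.15, 1.29, 1.46, 1.36, 0.87, 1.23, 1.55, 1.40,
              1.19, 1.15, 0.98, 1.01, 1.11, 1.20, 1.26, 1.32, 1.43, 0.95])
y = np.array([90.01, 89.05, 91.43, 93.74, 96.73, 94.45, 87.59, 91.77, 99.42, 93.65,
              93.54, 92.52, 90.56, 89.54, 89.85, 90.39, 93.25, 93.41, 94.98, 87.33])

n = len(x)
Sxx = np.sum(x**2) - np.sum(x)**2 / n            # denominator of eq. (78)
Sxy = np.sum(x*y) - np.sum(x) * np.sum(y) / n    # numerator of eq. (78)

beta1_hat = Sxy / Sxx                            # slope, approx. 14.947
beta0_hat = y.mean() - beta1_hat * x.mean()      # intercept, eq. (77), approx. 74.283
print(beta0_hat, beta1_hat)
```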

Corresponding to the data given in Table 1, and looking at the requirements of eqs.(77)-(78), the following table (Table 2) should be constructed. Using Table 2, the values of the estimates are $\hat{\beta}_0 = 74.28341$ and $\hat{\beta}_1 = 14.9474$, and the fitted regression line is $\hat{y} = 74.28341 + 14.9474\,x$.$^9$ The fitted regression line is shown below in figure 2.

$^9$ See Appendix for R codes.
Table 2: Hydrocarbon level vs Purity in a chemical distillation process
Observation Number Hydrocarbon Level xi (%) Purity yi (%) xi yi x2i
1 0.99 90.01 89.1099 0.9801
2 1.02 89.05 90.831 1.0404
3 1.15 91.43 105.1445 1.3225
4 1.29 93.74 120.9246 1.6641
5 1.46 96.73 141.2258 2.1316
6 1.36 94.45 128.452 1.8496
7 0.87 87.59 76.2033 0.7569
8 1.23 91.77 112.8771 1.5129
9 1.55 99.42 154.101 2.4025
10 1.40 93.65 131.11 1.96
11 1.19 93.54 111.3126 1.4161
12 1.15 92.52 106.398 1.3225
13 0.98 90.56 88.7488 0.9604
14 1.01 89.54 90.4354 1.0201
15 1.11 89.85 99.7335 1.2321
16 1.20 90.39 108.468 1.44
17 1.26 93.25 117.495 1.5876
18 1.32 93.41 123.3012 1.7424
19 1.43 94.98 135.8214 2.0449
20 0.95 87.33 82.9635 0.9025

One can easily see from these plots that some of the scattered points (which are the observed values) fall on the straight line and some fall above or below it.
We next calculate the deviations by considering the error sum of squares, which is

$$SS_E = \sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2. \qquad (79)$$

Dividing $SS_E$ by $n - 2$ gives the best estimate of $\sigma^2$; the unbiased estimator of $\sigma^2$ is

$$\hat{\sigma}^2 = \frac{SS_E}{n - 2}. \qquad (80)$$
Another table is now constructed below which calculates the right hand side of eq.(79).

From Table 3 we get the value of $SS_E = \sum_{i=1}^{n}(y_i - \hat{y}_i)^2$ as 21.24982. Therefore $\hat{\sigma}^2 = \frac{SS_E}{20 - 2} = 1.180545$. The residual standard error at $n - 2$ degrees of freedom is the square root of $\hat{\sigma}^2$, which in this case is approximately 1.0865.

The following figure shows the plot of all the scattered points and how they deviate from the regression line.

It is a known fact that population parameters cannot be found deterministically, and so we estimate them by sample statistics. A sample statistic $\theta$ is an unbiased estimator of a population parameter $\mu$ if $E(\theta) = \mu$.
Table 3: Error estimation Hydrocarbon level vs Purity in a chemical distillation process
Observation Number yˆi (yi − yˆi )2
1 89.08134 0.862417
2 89.52976 0.230168
3 91.47292 0.001842
4 93.56556 0.030431
5 96.10661 0.38861
6 94.61187 0.026203
7 87.28765 0.091417
8 92.66871 0.807683
9 97.45188 3.873496
10 95.20977 2.432882
11 92.07082 2.158502
12 91.47292 1.096377
13 88.93186 2.650833
14 89.38028 0.025509
15 90.87502 1.050674
16 92.22029 3.349961
17 93.11713 0.017653
18 94.01398 0.364789
19 95.65819 0.459944
20 88.48344 1.330424

In this case the estimators $\hat{\beta}_0$ and $\hat{\beta}_1$ of $\beta_0$ and $\beta_1$ satisfy the following relations.

$$E(\hat{\beta}_1) = \beta_1, \qquad E(\hat{\beta}_0) = \beta_0,$$
$$V(\hat{\beta}_1) = \frac{\sigma^2}{S_{xx}}, \qquad V(\hat{\beta}_0) = \sigma^2\left[\frac{1}{n} + \frac{\bar{x}^2}{S_{xx}}\right]. \qquad (81)$$

It can also be shown that the covariance of $\hat{\beta}_0$ and $\hat{\beta}_1$ is given by

$$\mathrm{Cov}(\hat{\beta}_0, \hat{\beta}_1) = -\frac{\sigma^2\bar{x}}{S_{xx}}. \qquad (82)$$

Problem: With reference to the hydrocarbon-purity problem, determine $\mathrm{Cov}(\hat{\beta}_0, \hat{\beta}_1)$. [Ans: $-2.07$]

Problem: The number of pounds of steam used per month by a chemical plant is thought to be related to the average ambient temperature (in degrees Fahrenheit) for that month. The past year's usage and temperature are shown in the following table.

Table 4: Usage vs Temperature
Months  Temp.  Usage/1000   Months  Temp.  Usage/1000
Jan.    21     185.79       Jul.    68     621.55
Feb.    24     214.47       Aug.    74     675.06
Mar.    32     288.03       Sep.    62     562.03
Apr.    47     424.84       Oct.    50     452.93
May     50     454.58       Nov.    41     369.95
Jun.    59     539.03       Dec.    30     273.98

(a) Assuming that a simple linear regression model is appropriate, fit the regression model relating steam usage ($Y$) to the average temperature ($x$). What is the estimate of $\sigma^2$? (b) What is the estimate of expected steam usage when the average temperature is 55° F? (c) Calculate all the fitted values and residuals. (d) Form a scatter plot and show how the observed values deviate from the fitted line (using R). (e) Suppose the monthly average temperature is 47° F. Calculate the fitted value of $y$ and the corresponding residual.

0.2.3 Hypothesis design and testing:

An important part of assessing the adequacy of a linear regression model is testing statistical hypotheses about the model parameters and constructing certain confidence intervals. The general design of the null and alternative hypotheses would be

$$H_0: \beta_1 = \beta_{1,0}, \qquad H_1: \beta_1 \neq \beta_{1,0}, \qquad (83)$$

a special case of which is

$$H_0: \beta_1 = 0, \qquad H_1: \beta_1 \neq 0. \qquad (84)$$

To test the hypothesis one uses the $t$-test. The test statistic with respect to eq.(83) is

$$T_0 = \frac{\hat{\beta}_1 - \beta_{1,0}}{\sqrt{\frac{\hat{\sigma}^2}{S_{xx}}}}, \qquad (85)$$

whereas under the assumption of eq.(84) the test statistic is defined as

$$t_0 = \frac{\hat{\beta}_1}{\sqrt{\frac{\hat{\sigma}^2}{S_{xx}}}}. \qquad (86)$$

With respect to the example considered above, we find $\hat{\beta}_1 = 14.9474$, $n = 20$, $\hat{\sigma}^2 = 1.180545$ and $S_{xx} = \sum_{i=1}^{n} x_i^2 - \frac{\left(\sum_{i=1}^{n} x_i\right)^2}{n} = 0.68088$, so that using eq.(86) the test statistic value is 11.35166777 (approx.). Similarly, to test the hypothesis

$$H_0: \beta_0 = \beta_{0,0}, \qquad H_1: \beta_0 \neq \beta_{0,0}, \qquad (87)$$

the test statistic used is

$$T_0 = \frac{\hat{\beta}_0 - \beta_{0,0}}{\sqrt{\hat{\sigma}^2\left[\frac{1}{n} + \frac{\bar{x}^2}{S_{xx}}\right]}}. \qquad (88)$$

Consequently, as a special case, whenever we consider the hypothesis

$$H_0: \beta_0 = 0, \qquad H_1: \beta_0 \neq 0, \qquad (89)$$

the test statistic is given by

$$t_0 = \frac{\hat{\beta}_0}{\sqrt{\hat{\sigma}^2\left[\frac{1}{n} + \frac{\bar{x}^2}{S_{xx}}\right]}}. \qquad (90)$$

Again considering the above example with Table 2 and using eq.(90), we get the value of the test statistic as 46.61729725 (approx.).$^{10}$ Using statistical software such as R, the $P$-value for the test can easily be calculated.$^{11}$ In this problem the $P$-value is $1.23 \times 10^{-9}$. The $t$-test statistics given in eqs.(86) and (90) follow the $t$-distribution with $n - 2$ degrees of freedom. We would reject the null hypotheses given in eqs.(84) and (89) if $|t_0| > t_{\frac{\alpha}{2}, n-2}$ (here $\alpha$ is the level of significance). In the above problem, if we take $\alpha = 0.01$, i.e. the 1% level of significance, we find from the table that, for $20 - 2 = 18$ degrees of freedom, $t_{0.005, 18} = 2.88$. Hence for both $\beta_0$ and $\beta_1$ the calculated $t$ is greater than the tabulated value. The final decision is to reject both null hypotheses (a Python sketch of this calculation follows).
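A minimal sketch (NumPy/SciPy assumed) of the two $t$ statistics, the two-sided $P$-value and the critical value, continuing from the earlier regression sketch:

```python
import numpy as np
from scipy import stats

# Quantities computed earlier for the hydrocarbon-purity data
n, Sxx, x_bar = 20, 0.68088, 1.196
beta0_hat, beta1_hat, sigma2_hat = 74.28341, 14.9474, 1.180545

t_slope = beta1_hat / np.sqrt(sigma2_hat / Sxx)                           # eq. (86)
t_intercept = beta0_hat / np.sqrt(sigma2_hat * (1/n + x_bar**2 / Sxx))    # eq. (90)

# Two-sided P-value for the slope and the critical value at the 1% level
p_slope = 2 * stats.t.sf(abs(t_slope), df=n - 2)
t_crit = stats.t.ppf(1 - 0.01 / 2, df=n - 2)
print(round(t_slope, 2), round(t_intercept, 2), p_slope, round(t_crit, 2))
```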

In addition to the point estimates of the slope and intercept, it is possible to obtain confidence interval estimates of these parameters. The width of these confidence intervals is a measure of the overall quality of the regression line. Under the assumption that the observations are normally and independently distributed, a $100(1-\alpha)\%$ confidence interval on the slope $\beta_1$ in simple linear regression is

$$\hat{\beta}_1 - t_{\frac{\alpha}{2}, n-2}\sqrt{\frac{\hat{\sigma}^2}{S_{xx}}} \ \leq\ \beta_1\ \leq\ \hat{\beta}_1 + t_{\frac{\alpha}{2}, n-2}\sqrt{\frac{\hat{\sigma}^2}{S_{xx}}}, \qquad (91)$$

and that on the intercept $\beta_0$ is

$$\hat{\beta}_0 - t_{\frac{\alpha}{2}, n-2}\sqrt{\hat{\sigma}^2\left[\frac{1}{n} + \frac{\bar{x}^2}{S_{xx}}\right]} \ \leq\ \beta_0\ \leq\ \hat{\beta}_0 + t_{\frac{\alpha}{2}, n-2}\sqrt{\hat{\sigma}^2\left[\frac{1}{n} + \frac{\bar{x}^2}{S_{xx}}\right]}. \qquad (92)$$

Problem: Consider the hydrocarbon-purity problem considered above. Find a 95% confidence interval on the slope of the regression line. Also find the 95% confidence interval on the intercept $\beta_0$. Again, find the 99% confidence intervals on both the intercept $\beta_0$ and the slope $\beta_1$. [Answer: (12.1809, 17.7139), (11.5995, 18.2953), (11.1577, 18.7370), (10.7266, 19.5334).]

It is to be remembered that the limits shown in eqs.(91) and (92) are only calculated when $\beta_1 \neq 0$ and $\beta_0 \neq 0$ (as justified by the hypothesis testing), and that $\beta_0$ and $\beta_1$ here are the intercept and slope, i.e. the parameters of the linear regression model that may be fitted to the population data from which the sample has been selected.

0.2.4 Correlation coefficient in bi-variate distribution:

One of the most important parts of bivariate analysis is to study the correlation coefficient between the two variables involved in the distribution. The correlation between random variables $X$ and $Y$, denoted by $\rho_{XY}$, is

$$\rho_{XY} = \frac{\mathrm{Cov}(X, Y)}{\sqrt{V(X)V(Y)}} = \frac{\sigma_{XY}}{\sigma_X\sigma_Y}, \qquad (93)$$

where $\mathrm{Cov}(X, Y)$ is the covariance between $X$ and $Y$, and $V(X)$, $V(Y)$ are respectively the variances of $X$ and $Y$. If we take eq.(93) to represent the population correlation coefficient, then it is best estimated by the sample correlation coefficient, which is defined as

$$R_{XY} = \frac{\sum_{i=1}^{n} Y_i(X_i - \bar{X})}{\left[\sum_{i=1}^{n}(X_i - \bar{X})^2 \sum_{i=1}^{n}(Y_i - \bar{Y})^2\right]^{\frac{1}{2}}}. \qquad (94)$$

We already know from the sections above that $\hat{\beta}_1 = \frac{S_{xy}}{S_{xx}}$ (which, in terms of the random variables $X$ and $Y$, is $\hat{\beta}_1 = \frac{S_{XY}}{S_{XX}}$). Now, as $S_{XX} = \sum_{i=1}^{n}(X_i - \bar{X})^2$, $S_{XY} = \sum_{i=1}^{n} Y_i(X_i - \bar{X})$ and $SS_T = \sum_{i=1}^{n}(Y_i - \bar{Y})^2$, eq.(94) can be re-written as

$$R = \frac{S_{XY}}{(S_{XX}\, SS_T)^{\frac{1}{2}}}. \qquad (95)$$

$^{10}$ All these results are reflected in the Appendix, which shows R code for the same.
$^{11}$ The $P$-value is the smallest level of significance that would lead to rejection of the null hypothesis $H_0$ with the given data.
Problem: Consider the problem based on hydrocarbon and purity (explained above). Defining
these two as random variables X and Y , find the correlation coefficient between them. [Answer:
0.9367154]

0.2.5 Correlation coefficient and its connection to regression parameters:

From eq.(77), and assuming random variables $X$ and $Y$ for the given set of data, we already know that

$$\hat{\beta}_0 = \bar{Y} - \hat{\beta}_1\bar{X}, \qquad \hat{\beta}_1 = \frac{S_{XY}}{S_{XX}}, \qquad (96)$$

where $S_{XX}$ and $S_{XY}$ have their usual meanings. Using eqs.(95) and (96) we finally obtain

$$\hat{\beta}_1 = \left(\frac{SS_T}{S_{XX}}\right)^{\frac{1}{2}} R. \qquad (97)$$

Remember that there are two lines of regression. One is the regression equation of $Y$ on $X$, where the regression coefficient is defined as in eq.(96), while for the regression equation of $X$ on $Y$ one must use $\hat{\beta}_1 = \left(\frac{S_{XX}}{SS_T}\right)^{\frac{1}{2}} R$.

0.3 What is Multiple Linear Regression?

Multiple linear regression modelling is the straightforward generalization of the simple linear regression model encountered in the previous section. Here the dependent (response) variable $Y$ may be related to $k$ independent (regressor) variables. The model

$$Y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k + \epsilon \qquad (98)$$

is called a multiple linear regression model with $k$ regressor variables. The parameters $\beta_j$, $j = 0, 1, \cdots, k$, are called the regression coefficients. As a special case, one can consider a cubic polynomial in a single variable $x$, governed by the equation

$$Y = \beta_0 + \beta_1 x + \beta_2 x^2 + \beta_3 x^3 + \epsilon, \qquad (99)$$

which can be treated as a multiple linear regression model with three regressor variables ($x_1 = x$, $x_2 = x^2$, $x_3 = x^3$).

0.3.1 Least square estimation of the parameters:

Consider the following data, and suppose that there are $n > k$ observations available. Let $x_{ij}$ denote the $i$-th observation (or level) of variable $x_j$.

Table 5: Multivariable Data
$y$        $x_1$      $x_2$      $\cdots$   $x_k$
$y_1$      $x_{11}$   $x_{12}$   $\cdots$   $x_{1k}$
$y_2$      $x_{21}$   $x_{22}$   $\cdots$   $x_{2k}$
$\cdots$   $\cdots$   $\cdots$   $\cdots$   $\cdots$
$y_n$      $x_{n1}$   $x_{n2}$   $\cdots$   $x_{nk}$

The observations are $(x_{i1}, x_{i2}, \cdots, x_{ik}, y_i)$, $i = 1, 2, \cdots, n$. Each observation satisfies the model in eq.(98), i.e.

$$y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_k x_{ik} + \epsilon_i = \beta_0 + \sum_{j=1}^{k}\beta_j x_{ij} + \epsilon_i, \qquad i = 1, 2, \cdots, n. \qquad (100)$$

The least squares function is

$$L = \sum_{i=1}^{n} \epsilon_i^2 = \sum_{i=1}^{n}\left(y_i - \beta_0 - \sum_{j=1}^{k}\beta_j x_{ij}\right)^2. \qquad (101)$$

To minimize $L$ with respect to $\beta_0, \beta_1, \cdots, \beta_k$, the least squares estimates of these quantities must satisfy the following conditions:

$$\left.\frac{\partial L}{\partial \beta_0}\right|_{\hat{\beta}_0, \hat{\beta}_1, \cdots, \hat{\beta}_k} = -2\sum_{i=1}^{n}\left(y_i - \hat{\beta}_0 - \sum_{j=1}^{k}\hat{\beta}_j x_{ij}\right) = 0,$$
$$\left.\frac{\partial L}{\partial \beta_j}\right|_{\hat{\beta}_0, \hat{\beta}_1, \cdots, \hat{\beta}_k} = -2\sum_{i=1}^{n}\left(y_i - \hat{\beta}_0 - \sum_{j=1}^{k}\hat{\beta}_j x_{ij}\right)x_{ij} = 0, \qquad j = 1, 2, \cdots, k. \qquad (102)$$

Simplifying eq.(102), we obtain the least squares normal equations as

$$n\hat{\beta}_0 + \hat{\beta}_1\sum_{i=1}^{n} x_{i1} + \hat{\beta}_2\sum_{i=1}^{n} x_{i2} + \cdots + \hat{\beta}_k\sum_{i=1}^{n} x_{ik} = \sum_{i=1}^{n} y_i,$$
$$\hat{\beta}_0\sum_{i=1}^{n} x_{i1} + \hat{\beta}_1\sum_{i=1}^{n} x_{i1}^2 + \hat{\beta}_2\sum_{i=1}^{n} x_{i1}x_{i2} + \cdots + \hat{\beta}_k\sum_{i=1}^{n} x_{i1}x_{ik} = \sum_{i=1}^{n} x_{i1} y_i,$$
$$\cdots$$
$$\hat{\beta}_0\sum_{i=1}^{n} x_{ik} + \hat{\beta}_1\sum_{i=1}^{n} x_{ik}x_{i1} + \hat{\beta}_2\sum_{i=1}^{n} x_{ik}x_{i2} + \cdots + \hat{\beta}_k\sum_{i=1}^{n} x_{ik}^2 = \sum_{i=1}^{n} x_{ik} y_i. \qquad (103)$$

As a straightforward generalization of the simple linear regression model, in multiple linear regression with $p$ parameters a logical estimator for $\sigma^2$ is $\hat{\sigma}^2 = \frac{SS_E}{n - p}$ (where $SS_E$ has the usual meaning as before).

The following table shows data on the pull strength of a wire bond in a semiconductor manufacturing process, together with the wire length and die height, to illustrate building an empirical model. We try to show the dependence of pull strength on wire length and die height by fitting a multiple linear regression model, which is assumed to be

$$Y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \epsilon, \qquad (104)$$

where $Y$ = pull strength, $x_1$ = wire length and $x_2$ = die height.$^{12}$

Table 6: Multivariable Chart


O.N P. S (y) W. L (x1 ) D. H (x2 ) O.N P.S(y) W.L (x1 ) D. H (x2 )
1 9.95 2 50 14 11.66 2 360
2 24.45 8 110 15 21.65 4 205
3 31.75 11 120 16 17.89 4 400
4 35.00 10 550 17 69.00 20 600
5 25.02 8 295 18 10.30 1 585
6 16.86 4 200 19 34.93 10 540
7 14.38 2 375 20 46.59 15 250
8 9.60 2 52 21 44.88 15 290
9 24.35 9 100 22 54.12 16 510
10 27.50 8 300 23 56.63 17 590
11 17.08 4 412 24 22.13 6 100
12 37.00 11 400 25 21.15 5 400
13 41.95 12 500 −−− −−− −−− −−−

The 3-dimensional scatter plot of the above data is shown in figure 3, where we consider the pull strength to be dependent on the wire length and die height.

Using eq.(103), the least squares normal equations are given by

$$n\hat{\beta}_0 + \hat{\beta}_1\sum_{i=1}^{n} x_{i1} + \hat{\beta}_2\sum_{i=1}^{n} x_{i2} = \sum_{i=1}^{n} y_i,$$
$$\hat{\beta}_0\sum_{i=1}^{n} x_{i1} + \hat{\beta}_1\sum_{i=1}^{n} x_{i1}^2 + \hat{\beta}_2\sum_{i=1}^{n} x_{i1}x_{i2} = \sum_{i=1}^{n} x_{i1} y_i,$$
$$\hat{\beta}_0\sum_{i=1}^{n} x_{i2} + \hat{\beta}_1\sum_{i=1}^{n} x_{i1}x_{i2} + \hat{\beta}_2\sum_{i=1}^{n} x_{i2}^2 = \sum_{i=1}^{n} x_{i2} y_i. \qquad (105)$$

Solving eq.(105) for the three unknowns $\hat{\beta}_0$, $\hat{\beta}_1$ and $\hat{\beta}_2$, one gets

$$\hat{\beta}_0 = \bar{y} - \{\hat{\beta}_1\bar{x}_1 + \hat{\beta}_2\bar{x}_2\},$$
$$\hat{\beta}_1 = \frac{c\,d_2 - c_2\,d}{c_1 d_2 - c_2 d_1},$$
$$\hat{\beta}_2 = \frac{c_1\,d - c\,d_1}{c_1 d_2 - c_2 d_1}, \qquad (106)$$

$^{12}$ O.N = Observation Number, P.S = Pull Strength, D.H = Die Height and W.L = Wire Length

where

$$c_1 = \sum_{i=1}^{n} x_{i1}^2 - \bar{x}_1\sum_{i=1}^{n} x_{i1}, \qquad c_2 = \sum_{i=1}^{n} x_{i1}x_{i2} - \bar{x}_2\sum_{i=1}^{n} x_{i1},$$
$$d_1 = \sum_{i=1}^{n} x_{i1}x_{i2} - \bar{x}_1\sum_{i=1}^{n} x_{i2}, \qquad d_2 = \sum_{i=1}^{n} x_{i2}^2 - \bar{x}_2\sum_{i=1}^{n} x_{i2},$$
$$c = \sum_{i=1}^{n} x_{i1}y_i - \bar{y}\sum_{i=1}^{n} x_{i1}, \qquad d = \sum_{i=1}^{n} x_{i2}y_i - \bar{y}\sum_{i=1}^{n} x_{i2}, \qquad (107)$$

and $\bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_i$, $\bar{x}_1 = \frac{1}{n}\sum_{i=1}^{n} x_{i1}$, $\bar{x}_2 = \frac{1}{n}\sum_{i=1}^{n} x_{i2}$.

For the data of Table 6 we have: $n$ = total number of observations = 25, $\sum_{i=1}^{n} x_{i1}$ = sum of the $x_1$ column = 206, $\sum_{i=1}^{n} x_{i2}$ = sum of the $x_2$ column = 8294, $\sum_{i=1}^{n} x_{i1}^2$ = sum of the $x_1^2$ column = 2396, $\sum_{i=1}^{n} x_{i2}^2$ = sum of the $x_2^2$ column = 3531848, $\sum_{i=1}^{n} x_{i1}x_{i2}$ = sum of the products of the $x_1$ and $x_2$ columns = 77177, $\sum_{i=1}^{n} y_i$ = sum of the $y$ column = 725.82, $\sum_{i=1}^{n} x_{i1}y_i$ = sum of the products of the $x_1$ and $y$ columns = 8008.47, and $\sum_{i=1}^{n} x_{i2}y_i$ = sum of the products of the $x_2$ and $y$ columns = 274816.71.

Inserting the computed summations in eq.(105) we get

$$25\hat{\beta}_0 + 206\hat{\beta}_1 + 8294\hat{\beta}_2 = 725.82,$$
$$206\hat{\beta}_0 + 2396\hat{\beta}_1 + 77177\hat{\beta}_2 = 8008.47,$$
$$8294\hat{\beta}_0 + 77177\hat{\beta}_1 + 3531848\hat{\beta}_2 = 274816.71. \qquad (108)$$

Solving the set of equations given in (108), one gets $\hat{\beta}_0 = 2.26379$, $\hat{\beta}_1 = 2.74427$ and $\hat{\beta}_2 = 0.01253$, so that the fitted regression equation is $\hat{y} = 2.26379 + 2.74427\,x_1 + 0.01253\,x_2$.$^{13}$

$^{13}$ For the 3-D scatter plot and the fitting of the regression plane, see Appendix.
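A minimal NumPy sketch solving the normal equations (108) directly:

```python
import numpy as np

# Coefficient matrix and right-hand side of the normal equations (108)
A = np.array([[25.0,    206.0,    8294.0],
              [206.0,   2396.0,   77177.0],
              [8294.0,  77177.0,  3531848.0]])
b = np.array([725.82, 8008.47, 274816.71])

beta_hat = np.linalg.solve(A, b)
print(beta_hat)   # approximately [2.26379, 2.74427, 0.01253]
```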

In this way, if we consider the set of eqs.(103), the fitted regression equation is

$$\hat{y}_i = \hat{\beta}_0 + \sum_{j=1}^{k}\hat{\beta}_j x_{ij}, \qquad i = 1, 2, \cdots, n. \qquad (109)$$

0.4 Validation of model assumptions in linear regression analysis:

In linear regression analysis we try to find whether the response variable is governed by one or more input variables. If the response variable is governed by one predictor (input) variable, then it is known as simple linear regression analysis. On the other hand, if the response variable is governed by more than one input variable, then it is called multiple linear regression analysis.$^{14}$ For validation of the model we make five key assumptions. They are:

• Linear relationship

• No or little multicollinearity

• Multivariate normality

• No auto-correlation

• Homoscedasticity

1) To check whether the model satisfies linearity, we first consider the bivariate (two-variable) distribution and draw a scatter plot of the data. By seeing the movement of the response variable in accordance with the movement of the input variable(s), it is decided whether the linearity assumption is justified. If we look at figure 3 we find that the distribution of points is no doubt linear. At the same time, the experimenter also needs to be aware that no outlier should creep into the data. An outlier is an observation that does not appear to fit with the rest of the data. We must always get rid of any such outliers if they exist; hence the identification of outliers is also important before we proceed with our analysis.$^{15}$

2) In multiple regression problems we expect to find dependencies between the response variable $Y$ and the regressor(s) $x_j$. In most regression problems, however, it is found that there are also some dependencies among the regressor variables $x_j$ themselves. In situations where these dependencies are strong, we say that multicollinearity exists. A very useful measure of multicollinearity is the variance inflation factor: the larger the variance inflation factor, the more severe the multicollinearity in the system. In one of the next sections we will discuss how the variance inflation factor is calculated.

0.5 Analysis of Variance (ANOVA) and Regression model testing:

Analysis of Variance (ANOVA) consists of calculations that provide information about the levels of variability within a regression model and form a basis for tests of significance. ANOVA is based upon partitioning the total variability in the response variable. We begin with the following identity:

$$y_i - \bar{y} = (\hat{y}_i - \bar{y}) + (y_i - \hat{y}_i)$$
$$\Rightarrow \sum_{i=1}^{n}(y_i - \bar{y})^2 = \sum_{i=1}^{n}(\hat{y}_i - \bar{y})^2 + \sum_{i=1}^{n}(y_i - \hat{y}_i)^2 + 2\sum_{i=1}^{n}(\hat{y}_i - \bar{y})(y_i - \hat{y}_i). \qquad (110)$$

It is to be noted, however, that $2\sum_{i=1}^{n}(\hat{y}_i - \bar{y})(y_i - \hat{y}_i) = 2\sum_{i=1}^{n}\hat{y}_i(y_i - \hat{y}_i) - 2\bar{y}\sum_{i=1}^{n}(y_i - \hat{y}_i) = 2\sum_{i=1}^{n}\hat{y}_i e_i - 2\bar{y}\sum_{i=1}^{n} e_i = 0$. This is because the sum of the residuals (or errors) is always zero, and the sum of the residuals weighted by the corresponding fitted values $\hat{y}_i$ is also zero. Then from eq.(110) we get

$$\sum_{i=1}^{n}(y_i - \bar{y})^2 = \sum_{i=1}^{n}(\hat{y}_i - \bar{y})^2 + \sum_{i=1}^{n}(y_i - \hat{y}_i)^2. \qquad (111)$$
$^{14}$ Remember, with respect to the given data, the basic assumption that we make is that the model is linear.
$^{15}$ See Appendix to know how to identify the outliers!

The term on the left-hand side of eq.(111) is called the total sum of squares, denoted by $SS_T$; the first and second terms on the right-hand side of eq.(111) are respectively known as the regression sum of squares (denoted by $SS_{reg}$) and the residual sum of squares (or error sum of squares), denoted by $SS_E$. Thus eq.(111) can be re-written as

$$SS_T = SS_{reg} + SS_E. \qquad (112)$$

Eq.(112) is the fundamental analysis of variance identity for a regression model. Just like the partitioning of variabilities made in eq.(112), the degrees of freedom are also partitioned as below:

$$d.f._T = d.f._{reg} + d.f._E, \qquad n - 1 = 1 + (n - 2). \qquad (113)$$

One can now use the usual ANOVA $F$ test to test the hypothesis $H_0: \beta_1 = 0$ against $H_1: \beta_1 \neq 0$. The $F$-statistic is given by

$$F = \frac{SS_{reg}/d.f._{reg}}{SS_E/d.f._E} = \frac{SS_{reg}/1}{SS_E/(n-2)} = \frac{MS_{reg}}{MS_E}, \qquad (114)$$

provided $MS_{reg} \gg MS_E$.$^{16}$ The statistic in eq.(114) follows the $F$-distribution with $(1, n-2)$ degrees of freedom. The design of the ANOVA table is shown below (Table 7).

Table 7: ANOVA table I
Source of variation    Sum of squares   Degrees of freedom   Mean squares   F
Regression             $SS_{reg}$       1                    $MS_{reg}$     $F = MS_{reg}/MS_E$
Residual (error)       $SS_E$           $n-2$                $MS_E$         ---
Total                  $SS_T$           $n-1$                ---            ---

If we consider the problem of hydrocarbon level versus purity from Table 1 and manually calculate the terms $SS_T$, $SS_{reg}$ and $SS_E$, we subsequently get the following ANOVA table (Table 8).

Table 8: ANOVA table II


Source of variation Sum of squares Degree of freedom Mean Squares F
Regression 152.1253891 1 152.1253891 F = 128.860381
Residual (error) 21.2497975 18 1.180544306 −−−
Total 173.376895 19 −−− −−−

Now, at $(1, 18)$ degrees of freedom, the tabulated $F$-statistic value is 4.41 at the 5% level of significance. The calculated $F$ is much larger than the tabulated $F$; hence we reject the null hypothesis and consequently conclude that the slope $\beta_1 \neq 0$. Another justification can be put in the following way. We know that the $P$-value is the lowest level of significance at which we reject our null hypothesis. The $P$-value here comes out to be 0.00000000122, which is much less than our chosen level of significance of 0.05. So we reject our null hypothesis.$^{17}$
$^{16}$ Remember that if $MS_E \geq MS_{reg}$, then $F = \frac{MS_E}{MS_{reg}}$.
$^{17}$ The $P$-value approach is mostly applied.

0.5.1 R code for performing ANOVA (for regression model testing):
(1) Feed the .csv data to R. (2) Fit the regression model with the lm function and assign it to an object (say "regline"). (3) Simply use the command anova(regline).
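For labs that use Python instead of R, here is a hedged sketch of the same ANOVA computation done manually with NumPy/SciPy; it reproduces the figures of Table 8 for the hydrocarbon-purity data.

```python
import numpy as np
from scipy import stats

x = np.array([0.99, 1.02, 1.15, 1.29, 1.46, 1.36, 0.87, 1.23, 1.55, 1.40,
              1.19, 1.15, 0.98, 1.01, 1.11, 1.20, 1.26, 1.32, 1.43, 0.95])
y = np.array([90.01, 89.05, 91.43, 93.74, 96.73, 94.45, 87.59, 91.77, 99.42, 93.65,
              93.54, 92.52, 90.56, 89.54, 89.85, 90.39, 93.25, 93.41, 94.98, 87.33])

n = len(x)
Sxx = np.sum(x**2) - np.sum(x)**2 / n
b1 = (np.sum(x*y) - np.sum(x) * np.sum(y) / n) / Sxx
b0 = y.mean() - b1 * x.mean()
y_hat = b0 + b1 * x

SS_T   = np.sum((y - y.mean())**2)      # total sum of squares
SS_reg = np.sum((y_hat - y.mean())**2)  # regression sum of squares
SS_E   = np.sum((y - y_hat)**2)         # error sum of squares

F = (SS_reg / 1) / (SS_E / (n - 2))     # eq. (114)
p_value = stats.f.sf(F, 1, n - 2)
print(SS_T, SS_reg, SS_E, F, p_value)   # compare with Table 8
```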

0.6 Coefficient of Determination:

An important measure to judge the adequacy of a regression model is called the coefficient of determination and is denoted by $R^2$. This is defined as

$$R^2 = \frac{SS_{reg}}{SS_T}. \qquad (115)$$

Using eq.(112) it can easily be shown that

$$R^2 = 1 - \frac{SS_E}{SS_T}. \qquad (116)$$

Since $SS_T$ is a measure of the variability in $y$ without considering the effect of the regressor variable $x$, and $SS_E$ is a measure of the variability in $y$ remaining after $x$ has been considered, $R^2$ is often called the proportion of variation explained by the regressor $x$. Values of $R^2$ close to 1 imply that most of the variability in $y$ is explained by the regression model.$^{18}$ It is also easy to show that $0 \leq R^2 \leq 1$. In the case where $X$ and $Y$ are jointly distributed random variables, $R^2$ is the square of the correlation coefficient between $X$ and $Y$.$^{19}$ For the hydrocarbon level versus purity problem (discussed in Table 1), it is found that the coefficient of determination is $R^2 = 0.8774$.

0.7 ANOVA for Multiple Linear Regression:

Multiple linear regression attempts to fit a regression line for a response variable using more than one explanatory variable. The ANOVA calculations for multiple regression are nearly identical to the calculations for simple linear regression, except that the degrees of freedom are adjusted to reflect the number of explanatory variables included in the model. For $p$ explanatory (or input) variables, the model (or regression) degrees of freedom ($d.f._{reg}$) are equal to $p$, the error degrees of freedom ($d.f._E$) are equal to $(n - p - 1)$, and the total degrees of freedom ($d.f._T$) are equal to $(n - 1)$, which is the sum of the former two. The corresponding ANOVA table is shown below (Table 9).

Table 9: ANOVA table III (multiple regression)
Source of variation    Sum of squares                                Degrees of freedom   Mean squares   F
Regression             $SS_{reg} = \sum_i(\hat{y}_i - \bar{y})^2$    $p$                  $MS_{reg}$     $F = MS_{reg}/MS_E$
Residual (error)       $SS_E = \sum_i(y_i - \hat{y}_i)^2$            $n - p - 1$          $MS_E$         ---
Total                  $SS_T = \sum_i(y_i - \bar{y})^2$              $n - 1$              ---            ---

MS
regression, the test statistic F = M SregE
follows F − distribution with (p, n − p − 1) degrees of
freedom. The null hypothesis states that β1 = β2 = · · · = βp = 0 against the alternative hypothesis
simply states that at least one of the parameters βj ̸= 0, j = 1, 2, · · · , p. Large values of the test
statistic provide evidence against the null hypothesis. The F test does not indicate which of the
18
The statistic R2 should be used with caution, since it is always possible to make R2 large by adding enough
terms to the model.
19
Remember X(Y ) is the random variable, x(y) are values of the corresponding random variable.

30
parameters βj ≠ 0, only that at least one of them is linearly related to the response variable. The
ratio \frac{SS_{reg}}{SS_T} is known as the squared multiple correlation coefficient. This value is the proportion of
the variation in the response variable that is explained by the explanatory variables. The square root
of R^2 is called the multiple correlation coefficient; it is the correlation between the observations yi and
the fitted values ŷi.
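The F test just described can be carried out numerically as in the following hedged sketch; the small data set is invented purely for illustration, with n = 8 observations and p = 2 explanatory variables.

import numpy as np
from scipy import stats

X = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0],
              [5.0, 6.0], [6.0, 5.0], [7.0, 8.0], [8.0, 7.0]])   # invented regressors
y = np.array([3.1, 3.9, 7.2, 7.8, 11.1, 11.9, 15.2, 15.8])       # invented responses
n, p = X.shape

A = np.column_stack([np.ones(n), X])            # design matrix with an intercept column
beta, *_ = np.linalg.lstsq(A, y, rcond=None)    # least-squares estimates
yhat = A @ beta

ss_reg = np.sum((yhat - y.mean()) ** 2)         # SS_reg with p degrees of freedom
ss_e   = np.sum((y - yhat) ** 2)                # SS_E with n - p - 1 degrees of freedom
F = (ss_reg / p) / (ss_e / (n - p - 1))         # F = MS_reg / MS_E
p_value = stats.f.sf(F, p, n - p - 1)           # upper tail of the F(p, n-p-1) distribution
print(F, p_value)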

1 Multivariate Regression

2 Discriminant Analysis:
2.1 Introduction (Linear Discriminant Analysis):
One of the most important problems in areas such as machine learning and information retrieval
is dimensionality reduction. The main objective of dimensionality reduction
techniques is to reduce the number of dimensions by removing redundant and dependent features,
transforming the features from a higher dimensional space to a space with lower dimensions20. Two
families of techniques are mainly used for this purpose, viz. (a) unsupervised techniques and (b) supervised
techniques. Among the supervised techniques the most famous is linear discriminant
analysis (abbreviated as LDA)21. This category of dimensionality reduction techniques is used in
biometrics, bio-informatics and chemistry. There are two types of LDA techniques to deal with
classes: class independent and class dependent techniques. The difference between these two types
is that, in class independent LDA, each class is considered as a separate class against the other
classes. In this type, there is just one lower dimensional space for all classes to project their data
on it. On the other hand, in the class dependent LDA, one separate lower dimensional space is
calculated for each class to project its data on it. Although these two techniques are mostly used,
they have their own drawbacks too. In the class independent LDA approach, if different classes are
non-linearly separable, the LDA cannot discriminate between these classes. The problem is called
the linearity problem. On the other hand, the class dependent LDA fails to find the lower dimensional
space if the number of dimensions is much higher than the number of samples in the data matrix; the
within class matrix then becomes singular22. This problem is known as the Small Sample Size (SSS) problem.
However, these problems have their respective solutions: the linearity problem is tackled
by the kernel function method, while the SSS problem is resolved (i) by removing the null space of the within
class matrix or (ii) by converting the within class matrix to a full rank matrix.

2.2 Eigenvalues and Eigenvectors


One of the most important concepts in matrix analysis is the diagonalizability problem. It is always
easier to deal with a diagonal matrix than with a non-diagonal one. For instance, even
when the order of a square matrix is very large, if the matrix is diagonal then its determinant
is simply the product of its diagonal elements; this is not the case when the matrix is
non-diagonal. A given matrix is not always in diagonal form, but it can sometimes
be brought into this form by well-defined operations. Hence it is a fundamental problem in
linear algebra to determine whether a given matrix is diagonalizable. More technically, in the
language of linear algebra the question can be put forward as follows. (a) Can we get an
ordered basis β for the vector space V such that the matrix representation of the linear operator
20
Linear Discriminant Analysis: A Detailed Tutorial, by A. Tharwat, T. Gaber, A Ibrahim and A. E. Hassanien.
21
Among various unsupervised techniques the most famous is PCA, abbreviation for Principal Component Analysis.
22
Between class matrix and within class matrix are two different matrices which are constructed in LDA.

T defined on V with respect to that basis (which we denote by [T ]β ) is a diagonal matrix? (b) If
such an ordered basis exists, how can one find it? The answers to these questions lead us to the
concept of eigenvalues and eigenvectors. Let us first technically define the term diagonalizability
from the linear algebraic point of view.

Definition: A linear operator T defined on a finite dimensional vector space V is called diag-
onalizable if there exists an ordered basis β for V such that [T ]β is a diagonal matrix. A square
matrix A is called diagonalizable if LA is diagonalizable23 .

It is to be remembered that if A is an m × n matrix with entries from a field F, then by


LA we mean the mapping from F n to F m defined by LA (x) = Ax (i.e. the matrix product of A
and x) for each column vector x ∈ F n .

Let us suppose that β = {v1 , v2 , · · · , vn } is the ordered basis for the vector space V and T is
the linear operator defined on V such that [T ]β is a diagonal matrix. Then for each vector vj ∈ V ,
we have
T(v_j) = \sum_{i=1}^{n} D_{ij} v_i = D_{jj} v_j = \lambda_j v_j, \qquad \lambda_j = D_{jj}.   (117)

Now for the choice of the basis β = {v1 , v2 , · · · , vn } and the linear operator, as before, let T (vj ) =
λj vj for some scalars λj , where j = 1, 2, · · · , n. Then by the definition of the matrix representation
of the linear transformation we clearly have

[T]_\beta = \begin{pmatrix} \lambda_1 & 0 & \cdots & 0 \\ 0 & \lambda_2 & \cdots & 0 \\ \cdots & \cdots & \cdots & \cdots \\ 0 & 0 & \cdots & \lambda_n \end{pmatrix}   (118)

It is clear from the above discussion that each vector v in the basis β satisfies the condition that
T(v) = λv for some scalar λ. Moreover, since v lies in a basis, v is always non-zero; this is because a
basis is a set of linearly independent vectors, and any set containing the zero vector is
linearly dependent. This vector v is precisely an eigenvector. Let us now put forward the formal
definition below.

Definition: Let T be the linear operator defined on a vector space V . A non-zero vector v ∈ V
is called an eigenvector of T if there exists a scalar λ such that T (v) = λv. The scalar λ is called
the eigenvalue corresponding to the eigenvector v. Another perspective is as follows. Let A be in
Mn×n (F )24 . A non-zero vector v ∈ F n25 is called an eigenvector of A if v is an eigenvector of LA ;
that is Av = λv for some scalar λ. The scalar λ is called the eigenvalue of A corresponding to the
eigenvector v.

Thus we have the following theorem.

Theorem: A linear operator T defined on a finite dimensional vector space V is diagonalizable if


23
LA is defined in the previous section.
24
Mn×n (F ) is the vector space of all square matrices of order n, where the elements by which the matrices are
constructed have been chosen from the field F .
25
Remember that the set of all n−tuples with entries from a field F is denoted by F n .

and only if there exists an ordered basis β for V consisting of eigenvectors of T . Furthermore, if T
is diagonalizable, β = {v1 , v2 , · · · , vn } is an ordered basis of eigenvectors of T , and D = [T ]β , then
D is a diagonal matrix and Djj is the eigenvalue corresponding to vj , for 1 ≤ j ≤ n.

It is clear from the above theorem that to diagonalize a matrix or a linear operator we need
to find a basis of eigenvectors and its corresponding eigenvalues. Below we jot down a few the-
orems/definitions and the working principle to find the eigenvalues and eigenvectors for a given
matrix or a linear operator.

Theorem (eigen,a): Let A ∈ M_{n×n}(F). Then a scalar λ is an eigenvalue of A if and only
if det(A − λI_n) = 0^{26}.

Definition (eigen,a): Let A ∈ Mn×n (F ). The polynomial f (t) = det(A − tIn ) is called the
characteristic polynomial of A.

Theorem (eigen, a) above states that the eigenvalues of a matrix are the zeros of its character-
istic polynomial. When determining the eigenvalues of a matrix or a linear operator, we normally
compute its characteristic polynomial.

Definition (eigen,b): Let T be a linear operator on an n−dimensional vector space V with


ordered basis β. We define the characteristic polynomial f (t) of T to be the characteristic polyno-
mial of A = [T ]β . That is f (t) = det(A − tIn ).

Problem: Let T be a linear operator defined on P2 (R) defined by T (f (x)) = f (x) + (x + 1)f ′ (x).
Determine the eigenvalues of T .

Theorem (eigen,b): Let A ∈ Mn×n (F ). (a) The characteristic polynomial of A is a polyno-


mial of degree n with leading coefficient (−1)n . (b) A has at most n distinct eigenvalues.

Theorem (eigen,c): Let T be a linear operator on a vector space V, and let λ be an eigenvalue
of T. A vector v ∈ V is an eigenvector of T corresponding to λ if and only if v ≠ 0 and
v ∈ N(T − λI)^{27}, i.e. v is a non-zero solution of (T − λI)v = 0.
v ∈ N (T − λI)27 i.e. to find a vector/s such that (T − λI)v = 0.

Working Principle to find eigenvalue/s and eigenvector/s:


• Given a matrix A, find the characteristic polynomial and hence the characteristic equation.

• Solve this characteristic equation to find the roots, which are the eigenvalues.

• Assume a vector v for each eigenvalue λ so that Av = λv.

• Av = λv is equivalent to writing (A − λI)v = 0 (we call it the eigenvalue-eigenvector equation).
Bring the coefficient matrix (A − λI) into row reduced echelon form28. Let the row reduced
echelon form be denoted by B, so that we have Bv = 0.

• Solve this system and find the eigenvector/s.


26 det is the determinant function.
27 For a linear operator T, N(T) is the null-space of T: the null space of a linear operator T on V is the set of all vectors x in V such that T(x) = 0.
28 It is sometimes also known as the reduced row echelon form.

Let us now define a linear operator T 29 on a finite dimensional vector space R2 by T (x, y) = (y, x).
We know that the standard ordered basis β of R2 is {(1, 0), (0, 1)}. We will check first how this
basis vectors change under the given linear transformation. Below we write the transformation of
the basis vectors (1, 0) and (0, 1) and represent the outcome in terms of (1, 0) and (0, 1) once again
as range has also the same basis.

T (1, 0) = (0, 1) = 0.(1, 0) + 1.(0, 1)


T (0, 1) = (1, 0) = 1.(1, 0) + 0.(0, 1). (119)

Thus we have

[T]_\beta = \begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix}.   (120)

The characteristic equation is det([T]_β − λI_2) = λ^2 − 1 = 0 (using definition (eigen,b)), whose roots
are −1 and 1. Now let v = (v_1, v_2)^T be the eigenvector corresponding to the eigenvalue λ_1 = −1, so
that the eigenvalue-eigenvector equation is (using theorem (eigen,c))

([T]_\beta - \lambda_1 I_2)v = \begin{pmatrix} 1 & 1 \\ 1 & 1 \end{pmatrix}\begin{pmatrix} v_1 \\ v_2 \end{pmatrix} = \begin{pmatrix} 0 \\ 0 \end{pmatrix}.   (121)

The reduced row echelon form of the coefficient matrix of eq.(121) gives the equivalent system30

\begin{pmatrix} 1 & 1 \\ 0 & 0 \end{pmatrix}\begin{pmatrix} v_1 \\ v_2 \end{pmatrix} = \begin{pmatrix} 0 \\ 0 \end{pmatrix},   (122)

solving which we get v_1 = −v_2. The set of eigenvectors is thus given by E_{λ_1} = {a(−1, 1)^T : a ∈
R, a ≠ 0}. Each vector of this set E_{λ_1} is an eigenvector corresponding to the eigenvalue λ_1 = −1, which
means that corresponding to an eigenvalue there exist infinitely many eigenvectors. Proceeding in
a similar manner, we assume a vector v' = (v_1', v_2')^T as the eigenvector corresponding to the
eigenvalue λ_2 = 1, so that the eigenvalue-eigenvector equation is

([T]_\beta - \lambda_2 I_2)v' = \begin{pmatrix} -1 & 1 \\ 1 & -1 \end{pmatrix}\begin{pmatrix} v_1' \\ v_2' \end{pmatrix} = \begin{pmatrix} 0 \\ 0 \end{pmatrix}.   (123)

The reduced row echelon form of the coefficient matrix of eq.(123) gives the equivalent system

\begin{pmatrix} 1 & -1 \\ 0 & 0 \end{pmatrix}\begin{pmatrix} v_1' \\ v_2' \end{pmatrix} = \begin{pmatrix} 0 \\ 0 \end{pmatrix},   (124)

solving which we get v_1' = v_2'. As before, the set of eigenvectors is E_{λ_2} = {b(1, 1)^T : b ∈ R, b ≠ 0}.
Each vector of the set E_{λ_2} is an eigenvector corresponding to the eigenvalue λ_2 = 1, which again
implies that there are infinitely many choices31.
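These hand computations can be cross-checked numerically. The short sketch below is only an illustration and is not part of the original derivation; note that numpy returns unit-length eigenvectors, so they may differ from (−1, 1)^T and (1, 1)^T by a non-zero scalar multiple, which is exactly the non-uniqueness discussed above.

import numpy as np

T_beta = np.array([[0.0, 1.0],
                   [1.0, 0.0]])      # matrix of the flip operator in the standard basis, eq.(120)

eigenvalues, eigenvectors = np.linalg.eig(T_beta)
print(eigenvalues)                   # approximately 1 and -1 (the order may vary)
print(eigenvectors)                  # columns proportional to (1, 1)^T and (-1, 1)^T

# verify the defining relation T v = lambda v for each eigenpair
for lam, v in zip(eigenvalues, eigenvectors.T):
    print(np.allclose(T_beta @ v, lam * v))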

Now as we have found the eigenvalues and eigenvectors, let us go back to the discussion of di-
agonalizability with which we started this section.
29 T is a function from R^2 to R^2. This operator is called the flip operator.
30 I_2 = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}.
31 For λ_1 = −1 the solution space is S_{λ_1} = {a(−1, 1)^T : a ∈ R}, while for λ_2 = 1 the solution space is S_{λ_2} = {a(1, 1)^T : a ∈ R}.

Theorem: Let T be a linear operator defined on a vector space V , and let λ1 , λ2 , · · · , λk be
distinct eigenvalues of T . If v1 , v2 , · · · , vk are eigenvectors of T such that λi corresponds to vi
(1 ≤ i ≤ k), then {v1 , v2 , · · · , vk } is linearly independent.

This theorem has a corollary which we note down below.

Corollary: Let T be a linear operator defined on an n− dimensional vector space V . If T has n


distinct eigenvalues, then T is diagonalizable.

Proof: Suppose T has n− distinct eigenvalues λ1 , λ2 , · · · , λn . For each i, choose an eigenvec-


tor vi corresponding to λi . Then by the above theorem, {v1 , v2 , · · · , vn } is linearly independent
and since dim(V) = n, {v1, v2, · · · , vn} is a basis, and this basis consists of eigenvectors. Thus T is
diagonalizable. Q.E.D

If we consider the above problem in which the linear transformation is represented by eq.(119),
and if we construct the set β1 = {(−1, 1)T , (1, 1)T }, it is easy to verify that the set β1 is linearly
independent.32 By the above corollary we immediately can say then that the linear operator T is
diagonalizable, as dim(R2 ) = 2 and β1 contains two linearly independent vectors which are eigen-
vectors.

We shall later find the diagonalized representation of T of eq.(119).

Let us now try to understand geometrically what is happening with respect to the linear oper-
ator T defined in eq.(119). At first the operator takes the point (1, 0) to (0, 1) and vice-versa.
These vectors are orthogonal to each other (i.e. they are perpendicular). Let us now consider
figures 7 and 8, from which we observe the following geometrical facts. The points
(1, 0), (0, 1), (−1, 1), (1, 1) all lie in the two dimensional plane R^2, but there is a difference among
these points with respect to how the linear transformation T of eq.(119) acts on them. We see that
the vectors (1, 0) and (0, 1) are transformed to the vectors (0, 1) and (1, 0) respectively, i.e. the
original and transformed points are non-collinear, while the vector (−1, 1) is transformed to (1, −1)
and the vector (1, 1) does not change at all (here the original and transformed points are collinear
with one another). The eigenvalues tell us by what factor the lengths of the vectors which are
collinear with their original counterparts have been stretched, shrunk, or left unchanged.

How do eigenvalues work? If λ is an eigenvalue of a linear operator T defined on a vector space V,
then we can have the following cases.

• Case I: If λ > 1, then T moves vectors in V farther from the origin by a factor of λ.

• Case II: If λ = 1, then T acts as the identity operator on V.

• Case III: If 0 < λ < 1, then T moves vectors in V closer to the origin by a factor of λ.

• Case IV: If λ = 0, then T acts as the zero transformation on V.


32
If the vectors in a set are linearly independent then the set is linearly independent.

• Case V: If λ < 0, then T reverses the orientation of vectors in V, i.e. T moves vectors in V
from one side of the origin to the other33.
In the example that we have considered above, case V happened with vector (−1, 1) and case II
happened with vector (1, 1).

Problem: Consider the transformation T (x, y) = (x, −y) defined on R2 . Find eigenvalues and
eigenvector of T . Is T diagonalizable? Geometrically interpret your result. This transformation is
also known as reflection about the x−axis.

Problem: Consider the transformation T (x, y) = i(−y, x) defined on C^2. Find the eigenvalues and
eigenvectors of T. Is T diagonalizable? Geometrically interpret your result. (Remember that i is the
imaginary unit.)

Note: The matrix representations of these transformations discussed above as an example and
as set in the problem are very important. These matrices are of order 2 and they are Pauli spin
matrices which play a fundamental role in designing quantum circuits in quantum computation and
quantum information34 .

Problem: Consider the transformation T (x, y) = (−x, y) defined on R2 . Find eigenvalues and
eigenvector of T . Is T diagonalizable? This transformation is reflection about y−axis.

Problem: Consider the transformation T (x, y) = (x, 0) defined on R2 . Find eigenvalues and
eigenvector of T . Is T diagonalizable? This is called projection on the x−axis.

Problem: Consider the transformation T (x, y) = (0, y) defined on R2 . Find eigenvalues and
eigenvector of T . Is T diagonalizable? This is called projection on the y−axis.

Let T be the linear operator on R^2 that rotates each vector in the plane through an angle of
π/2. It is clear geometrically that for any non-zero vector v, the vectors v and T(v) are not collinear.
Hence T (v) is not a multiple of v. Therefore T has no eigenvectors and, consequently, no eigenval-
ues. Thus there exist operators (and matrices) with no eigenvalues or eigenvectors. Of course such
operators and matrices are not diagonalizable.

Now that we know how to identify the diagonalizability of an operator, the next question is
how to find the diagonal form of the given operator/matrix. For this we need the following
theorem and its corresponding corollary.

Theorem: Let T be a linear operator on a finite dimensional vector space V , and let β and
β ′ be ordered bases for V . Suppose that Q is the change of coordinate matrix that changes
β ′ −coordinates into β−coordinates. Then we have

[T ]β ′ = Q−1 [T ]β Q. (125)

Corollary: Suppose that β is a basis for F n consisting of eigenvectors of A. Then if Q is the n × n


matrix whose columns are the vectors in β, then Q−1 AQ is a diagonal matrix.
33
The word origin is used with respect to R2 , but in general we can term it as zero vector in case of any arbitrary
example of vector space.
34
Identity operator is also a kind of Pauli spin operator.

We have already seen that the operator T (x, y) = (y, x) is diagonalizable and the matrix rep-
resentation of this operator with respect to the standard ordered basis of R2 is given in eq.(120).
With the eigenvectors {(−1, 1)^T, (1, 1)^T} of this operator we construct the matrix Q as

Q = \begin{pmatrix} -1 & 1 \\ 1 & 1 \end{pmatrix},   (126)

where the first column of the matrix contains the eigenvector associated with eigenvalue −1 and
the second column of the matrix Q contains the eigenvector associated with the eigenvalue 1. By
applying the above corollary we thus find the diagonalized form of the operator of eq.(120) as

D = \begin{pmatrix} -1 & 0 \\ 0 & 1 \end{pmatrix}.   (127)

It is to be noted that the principal diagonal of the matrix in eq.(127) contains the eigenvalues of the
operator, namely −1 and 1.
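This similarity transformation is easy to verify numerically as well; the following sketch (illustrative only) builds Q from the eigenvectors of eq.(126) and checks that Q^{-1}[T]_β Q reproduces the diagonal matrix of eq.(127).

import numpy as np

T_beta = np.array([[0.0, 1.0],
                   [1.0, 0.0]])      # [T]_beta from eq.(120)
Q = np.array([[-1.0, 1.0],
              [ 1.0, 1.0]])          # columns are the eigenvectors, as in eq.(126)

D = np.linalg.inv(Q) @ T_beta @ Q    # the change-of-basis relation of eq.(125)
print(np.round(D, 10))               # diag(-1, 1), matching eq.(127)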

Problem: Find the diagonalized forms of all the operators of the problems discussed above.

Definition: A polynomial f (t) in P (F )35 splits over F if there are scalars c, a1 , a2 , · · · , an (not
necessarily distinct) in F such that

f (t) = c(t − a1 )(t − a2 ) · · · (t − an ). (128)

If we consider the linear operator defined in eq.(120) the characteristic polynomial f (t) = t2 − 1 =
(t − 1)(t + 1). Comparing this with eq.(128) we find that c = 1 and a1 = 1, a2 = −1.

Theorem: The characteristic polynomial of any diagonalizable linear operator splits.

Proof: Let T be a diagonalizable linear operator defined on the n−dimensional vector space
V, and let β be an ordered basis for V such that [T]_β = D is a diagonal matrix. Suppose that

D = \begin{pmatrix} \lambda_1 & 0 & \cdots & 0 \\ 0 & \lambda_2 & \cdots & 0 \\ \cdots & \cdots & \cdots & \cdots \\ 0 & 0 & \cdots & \lambda_n \end{pmatrix}

and let f(t) be the characteristic polynomial of T. Then

f(t) = \det(D - tI) = \begin{vmatrix} \lambda_1 - t & 0 & \cdots & 0 \\ 0 & \lambda_2 - t & \cdots & 0 \\ \cdots & \cdots & \cdots & \cdots \\ 0 & 0 & \cdots & \lambda_n - t \end{vmatrix} = (\lambda_1 - t)(\lambda_2 - t)\cdots(\lambda_n - t) = (-1)^n (t - \lambda_1)(t - \lambda_2)\cdots(t - \lambda_n). Q.E.D.

Definition: Let λ be an eigenvalue of a linear operator or matrix with characteristic polyno-


mial f (t). The algebraic multiplicity of λ is the largest positive integer k for which (t − λ)k is a
factor of f (t).

The algebraic multiplicity of both the eigenvalues of the linear operator defined in eq.(120) is
1.
35
The set of all polynomials with coefficients from field F is a vector space, which we denote by P (F ).

Problem: Find the algebraic multiplicity of each eigenvalue of the linear operators discussed
in the preceding paragraphs.

If T is a diagonalizable linear operator on a finite dimensional vector space V , then there is an


ordered basis β for V consisting of eigenvectors of T . Also it is known that if [T ]β is a diagonal
matrix, the diagonal entries are the eigenvalues of T . Since the characteristic polynomial of T is
det([T]_β − tI), it is easily seen that each eigenvalue of T must occur as a diagonal entry of [T]_β
exactly as many times as its multiplicity. (For example, if we consider the matrix
A = \begin{pmatrix} 3 & 1 & 3 \\ 0 & 3 & 4 \\ 0 & 0 & 4 \end{pmatrix},
it has characteristic polynomial f(t) = −(t − 3)^2(t − 4); here λ = 3 is an eigenvalue of A
with multiplicity 2, and λ = 4 is an eigenvalue of A with multiplicity 1.) Hence β contains as
many (linearly independent) eigenvectors corresponding to an eigenvalue as the multiplicity of that
eigenvalue. Therefore the number of linearly independent eigenvectors corresponding to a given
eigenvalue is of interest in determining whether an operator can be diagonalized.

Definition: Let T be a linear operator on a vector space V , and let λ be an eigenvalue of T .


Define Eλ = {x ∈ V : T (x) = λx} = N (T − λIV )36 . The set Eλ is called the eigenspace of T
corresponding to the eigenvalue λ. Analogously, we define the eigenspace of a square matrix A to
be the eigenspace of LA .

Clearly, Eλ is a subspace of V consisting of the zero vector and the eigenvectors of T corresponding
to the eigenvalue λ. The maximum number of linearly independent eigenvectors of T corresponding
to the eigenvalue λ is therefore the dimension of Eλ .

Theorem: Let T be a linear operator on a finite dimensional vector space V , and let λ be an
eigenvalue of T having multiplicity m. Then 1 ≤ dim(Eλ ) ≤ m37 .

Theorem: Let T be a linear operator on a finite dimensional vector space V such that the
characteristic polynomial of T splits. Let λ1 , λ2 , · · · , λk be the distinct eigenvalues of T . Then
(a) T is diagonalizable if and only if the multiplicity of λi is equal to dim(Eλi ) for all i. (b) If T is
diagonalizable and βi is an ordered basis for Eλi for each i, then β = β1 ∪ β2 ∪ · · · ∪ βk is an ordered
basis for V consisting of eigenvectors of T .

For the linear operator defined in eq.(120), we have two eigenvalues, respectively λ1 = −1 and
λ2 = 1. Then by definition Eλ1 = {(0, 0)T , k(−1, 1)T , k ∈ R}. The maximum number of lin-
early independent eigenvectors of T corresponding to the eigenvalue −1 is one i.e. (−1, 1)T . Thus
dim(Eλ1 ) = 1 which is equal to the multiplicity of −1. It can also be shown that the multiplicity
of the other eigenvalue 1 is also equal to dim(Eλ2 ). Thus, by the above theorem we can deduce
that T is diagonalizable.
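The dimension of an eigenspace can also be obtained numerically as n minus the rank of (A − λI). The sketch below (an added illustration, not part of the original text) performs this check for the flip operator: both eigenvalues give dimension 1, equal to their algebraic multiplicities, which is consistent with diagonalizability.

import numpy as np

A = np.array([[0.0, 1.0],
              [1.0, 0.0]])                        # the flip operator of eq.(120)
n = A.shape[0]

for lam in (-1.0, 1.0):
    dim_E = n - np.linalg.matrix_rank(A - lam * np.eye(n))   # dim(E_lambda)
    print(lam, dim_E)                             # prints 1 for both eigenvalues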
 
Problem: Prove that \begin{pmatrix} 1 & 1 \\ 0 & 1 \end{pmatrix} is not diagonalizable. [Hints: Show that the above theorem doesn't hold in this case].
36
Identity operator defined on a vector space V .
37
dim stands for dimension.

2.3 LDA methodology:
The goal of the LDA technique is to project the original data matrix onto a lower dimensional
space. Three steps need to be performed for this purpose.

• Calculate the separability between the different classes (i.e. the distance between the means
of the different classes). This is called the between class variance or between class matrix.

• Calculate the distance between the mean and the samples of each class. This is known as
the within class variance or within class matrix.

• Construct the lower dimensional space which maximizes the between class variance and
minimizes the within class variance.

2.3.1 Algorithm (LDA class independent method):


• Given a set of N samples \{x_i\}_{i=1}^{N}, each of which is represented as a row of length M, the
data matrix X is given as follows, which is of course of order N × M:

X = \begin{pmatrix} x_{(1,1)} & x_{(1,2)} & \cdots & x_{(1,M)} \\ x_{(2,1)} & x_{(2,2)} & \cdots & x_{(2,M)} \\ \cdots & \cdots & \cdots & \cdots \\ x_{(N,1)} & x_{(N,2)} & \cdots & x_{(N,M)} \end{pmatrix}_{N \times M}   (129)
One can think of a class of samples containing 10 different individuals, where 5 physical features
are measured on each individual. We can divide the class into two subclasses. So if we
consider the subscripts i, j, k, then i = 1, 2, · · · , 10, j = 1, 2, · · · , 5 and k = 1, 2.
• We divide the samples into k classes. Compute the mean µ_k of each class ω_k as

\mu_k = \frac{1}{n_k} \sum_{x_i \in \omega_k} x_i,   (130)

where each µ_k is of order 1 × M (remember j = 1, 2, · · · , M) and n_k is the size of the k-th class. It
is to be noted that the sizes of the classes may or may not be the same.
• Compute the total mean µ of all the data (where k denotes the number of classes, each class
has samples of dimension M, and N is the total number of samples over all the classes),

\mu = \frac{1}{N} \sum_{i=1}^{N} x_i = \sum_{i=1}^{k} \frac{n_i}{N} \mu_i,   (131)

where µ is of order 1 × M.
• The term (µ_k − µ)^T(µ_k − µ) represents the separation distance between the mean of the k-th
class (µ_k) and the total mean µ. We calculate this quantity (weighted by the class size) for each
class and denote it by S_{B_k}, the between class variance of class k; similarly we can calculate
it for the other classes. Hence the total between class variance (or between class matrix) is

S_B = \sum_{l=1}^{k} S_{B_l} = \sum_{l=1}^{k} n_l (\mu_l - \mu)^T (\mu_l - \mu).   (132)

SB is of order M × M .
• Compute the within class variance (or within class matrix) as

S_W = \sum_{l=1}^{k} \sum_{i=1}^{n_l} (x_{il} - \mu_l)^T (x_{il} - \mu_l),   (133)

where x_{il} represents the i-th sample in the l-th class. S_W is of order M × M.


• After calculating the between class variance (S_B) and within class variance (S_W) matrices,
the transformation matrix W can be calculated as W = S_W^{-1} S_B. The eigenvalues λ and
eigenvectors V of W are then calculated. The solution of this problem is obtained by
calculating the eigenvalues (λ = {λ_1, λ_2, · · · , λ_M}) and eigenvectors (V = {v_1, v_2, · · · , v_M})
of W = S_W^{-1} S_B, provided S_W is non-singular. The eigenvalues are scalar values, whereas the
eigenvectors are non-zero vectors; together they provide us with the information about the LDA
space. The eigenvectors represent the directions of the new space, and the corresponding
eigenvalues represent the scaling factor, length, or magnitude of the eigenvectors. Thus
each eigenvector represents one axis of the LDA space, and the associated eigenvalue represents
the robustness of this eigenvector. The robustness of an eigenvector reflects its ability
to discriminate between different classes, i.e. to increase the between class variance and decrease
the within class variance of each class, and hence meets the LDA objective.
• We next sort the eigenvectors in descending order according to their corresponding eigenvalues.
The eigenvectors with the k highest eigenvalues are used to construct the lower dimensional space (V_k),
while the other eigenvectors ({v_{k+1}, v_{k+2}, · · · , v_M}) are neglected.
• Project all the original samples (X) onto the lower dimensional space of LDA. In the lower
dimensional space of the LDA technique, the dimension of the original data matrix (X ∈
R^{N×M}) is reduced by projecting it onto the lower dimensional space of LDA (V_k ∈ R^{M×k}).
The dimension of the data after projection is k; hence M − k features are ignored or deleted
from each sample. Thus each sample (x_i), which was represented as a point in an M-dimensional
space, is projected onto the lower dimensional space (V_k) as y_i = x_i V_k. In general,

Y = X V_k.   (134)

To understand the above algorithm we below consider one numerical example. Suppose the data
matrix is given as

X = \begin{pmatrix} 1 & 2 \\ 2 & 3 \\ 3 & 3 \\ 4 & 5 \\ 5 & 5 \\ 4 & 2 \\ 5 & 0 \\ 5 & 2 \\ 3 & 2 \\ 5 & 3 \\ 6 & 3 \end{pmatrix}_{11 \times 2},   (135)

where we divide the matrix into two classes ω_1 and ω_2. Here ω_1 is the matrix with the first 5 rows
and ω_2 is the matrix with the last 6 rows; the total number of rows is thus N = 11. Each class has
samples of dimension 2. Thus we have

\omega_1 = \begin{pmatrix} 1 & 2 \\ 2 & 3 \\ 3 & 3 \\ 4 & 5 \\ 5 & 5 \end{pmatrix}_{5 \times 2}   (136)

and

\omega_2 = \begin{pmatrix} 4 & 2 \\ 5 & 0 \\ 5 & 2 \\ 3 & 2 \\ 5 & 3 \\ 6 & 3 \end{pmatrix}_{6 \times 2}.   (137)

The eqs. (136) and (137) are the two classes respectively extracted from X. We shall now imple-
ment the steps of the above algorithm38

• The sample mean µ_1 of class ω_1 is

\mu_1 = \frac{1}{5}\{(1, 2) + (2, 3) + (3, 3) + (4, 5) + (5, 5)\} = (3, 3.6).   (138)

The sample mean µ_2 of class ω_2 is

\mu_2 = \frac{1}{6}\{(4, 2) + (5, 0) + (5, 2) + (3, 2) + (5, 3) + (6, 3)\} = (4.67, 2).   (139)

Here the size of class ω_1 is n_1 = 5 and that of ω_2 is n_2 = 6. We have used eq.(130).

• Using eq.(131) we calculate the grand mean, which is

\mu = \frac{n_1 \mu_1}{N} + \frac{n_2 \mu_2}{N} = \frac{1}{11}\{(1, 2) + (2, 3) + \cdots + (6, 3)\} = \frac{5}{11}(3, 3.6) + \frac{6}{11}(4.67, 2) = (3.91, 2.727).   (140)

In the above calculations µ1 , µ2 and µ can be considered as matrices of order 1 × 2.


38
It is to be remembered that although in the following calculations of µ1 ,µ2 and µ we considered the elements
column-wise, we can also consider the sample points of each class as row vectors. When sample points are taken as
column vector we use the form (µ1 − µ)(µ1 − µ)T whereas if we consider the sample points as row vectors then one
must use (µ1 − µ)T (µ1 − µ).

• Using eq.(132) we calculate the between class variance matrix S_B, for which we need S_{B_1}
and S_{B_2} for each of the two classes:

S_B = S_{B_1} + S_{B_2} = n_1(\mu_1 - \mu)^T(\mu_1 - \mu) + n_2(\mu_2 - \mu)^T(\mu_2 - \mu).   (141)

In this way we have

S_{B_1} = 5 \times \{(3, 3.6) - (3.91, 2.727)\}^T\{(3, 3.6) - (3.91, 2.727)\} = 5 \times \begin{pmatrix} -0.91 \\ 0.873 \end{pmatrix}\begin{pmatrix} -0.91 & 0.873 \end{pmatrix} = \begin{pmatrix} 4.14 & -3.97 \\ -3.97 & 3.81 \end{pmatrix}   (142)

and

S_{B_2} = 6 \times \{(4.67, 2) - (3.91, 2.727)\}^T\{(4.67, 2) - (3.91, 2.727)\} = 6 \times \begin{pmatrix} 0.76 \\ -0.727 \end{pmatrix}\begin{pmatrix} 0.76 & -0.727 \end{pmatrix} = \begin{pmatrix} 3.47 & -3.32 \\ -3.32 & 3.17 \end{pmatrix}.   (143)

Consequently the between class matrix is

S_B = S_{B_1} + S_{B_2} = \begin{pmatrix} 7.61 & -7.29 \\ -7.29 & 6.98 \end{pmatrix}.   (144)

• Using eq.(133) we now calculate the within class matrices for the two classes ω_1 and ω_2, which
we denote by S_{W_1} and S_{W_2}, and consequently the total within class variance S_W. We have

S_{W_1} = \{(1, 2) - (3, 3.6)\}^T\{(1, 2) - (3, 3.6)\} + \{(2, 3) - (3, 3.6)\}^T\{(2, 3) - (3, 3.6)\} + \{(3, 3) - (3, 3.6)\}^T\{(3, 3) - (3, 3.6)\} + \{(4, 5) - (3, 3.6)\}^T\{(4, 5) - (3, 3.6)\} + \{(5, 5) - (3, 3.6)\}^T\{(5, 5) - (3, 3.6)\},   (145)

so that we have

S_{W_1} = \begin{pmatrix} 10 & 8 \\ 8 & 7.2 \end{pmatrix}.   (146)

Again, for class 2 we have

S_{W_2} = \{(4, 2) - (4.67, 2)\}^T\{(4, 2) - (4.67, 2)\} + \{(5, 0) - (4.67, 2)\}^T\{(5, 0) - (4.67, 2)\} + \{(5, 2) - (4.67, 2)\}^T\{(5, 2) - (4.67, 2)\} + \{(3, 2) - (4.67, 2)\}^T\{(3, 2) - (4.67, 2)\} + \{(5, 3) - (4.67, 2)\}^T\{(5, 3) - (4.67, 2)\} + \{(6, 3) - (4.67, 2)\}^T\{(6, 3) - (4.67, 2)\},   (147)

and consequently we have

S_{W_2} = \begin{pmatrix} 5.34 & 1 \\ 1 & 6 \end{pmatrix}.   (148)

The total within class variance is therefore

S_W = S_{W_1} + S_{W_2} = \begin{pmatrix} 15.34 & 9 \\ 9 & 13.2 \end{pmatrix}.   (149)

• Now, as S_W from eq.(149) is non-singular, we can find its inverse:

S_W^{-1} = \begin{pmatrix} 0.1086 & -0.0740 \\ -0.0740 & 0.1263 \end{pmatrix}.   (150)

• Using eqs.(144) and (150) we calculate the W matrix, which is

W = S_W^{-1} S_B = \begin{pmatrix} 1.36 & -1.31 \\ -1.48 & 1.42 \end{pmatrix}.   (151)

• We now find the eigenvalues and eigenvectors of the matrix W from eq.(151). The eigenvalues
are λ_1 = −0.0027 and λ_2 = 2.783, and the corresponding eigenvectors are
v_1 = \begin{pmatrix} -0.693 \\ -0.721 \end{pmatrix} and v_2 = \begin{pmatrix} 0.677 \\ -0.736 \end{pmatrix} respectively.
• From the previous step it is clear that λ_2 > λ_1, so the second eigenvector v_2, corresponding
to λ_2, is more robust than the first one, v_1, corresponding to λ_1. Therefore v_2 is selected to
construct the lower dimensional space, which will be a good discriminator between the classes.
The original data are projected onto the lower dimensional space as y_i = ω_i v_2, where
y_i, of order n_i × 1 (here n_1 = 5 for class 1 and n_2 = 6 for class 2), represents the data of the
i-th class after projection. Its values are as follows:

y_1 = \omega_1 v_2 = \begin{pmatrix} 1 & 2 \\ 2 & 3 \\ 3 & 3 \\ 4 & 5 \\ 5 & 5 \end{pmatrix}_{5 \times 2} \begin{pmatrix} 0.6772 \\ -0.7358 \end{pmatrix}_{2 \times 1} = \begin{pmatrix} -0.7944 \\ -0.8530 \\ -0.1758 \\ -0.9702 \\ -0.2930 \end{pmatrix}_{5 \times 1}   (152)

and

y_2 = \omega_2 v_2 = \begin{pmatrix} 4 & 2 \\ 5 & 0 \\ 5 & 2 \\ 3 & 2 \\ 5 & 3 \\ 6 & 3 \end{pmatrix}_{6 \times 2} \begin{pmatrix} 0.6772 \\ -0.7358 \end{pmatrix}_{2 \times 1} = \begin{pmatrix} 1.2372 \\ 3.3860 \\ 1.9144 \\ 0.5600 \\ 1.1786 \\ 1.8558 \end{pmatrix}_{6 \times 1}.   (153)
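The whole class independent calculation can be reproduced with a few lines of numpy. The sketch below is an illustration of the algorithm of Section 2.3.1 applied to the same 11 × 2 data matrix; up to rounding, the printed quantities should agree with eqs.(144), (149), (151) and (152)-(153), although numpy may return the eigenvectors with a different sign or ordering.

import numpy as np

X = np.array([[1, 2], [2, 3], [3, 3], [4, 5], [5, 5],
              [4, 2], [5, 0], [5, 2], [3, 2], [5, 3], [6, 3]], dtype=float)
w1, w2 = X[:5], X[5:]                        # the two classes omega_1 and omega_2
mu1, mu2 = w1.mean(axis=0), w2.mean(axis=0)  # class means, eq.(130)
mu = X.mean(axis=0)                          # grand mean, eq.(131)

# between class matrix, eq.(132); compare eq.(144)
S_B = len(w1) * np.outer(mu1 - mu, mu1 - mu) + len(w2) * np.outer(mu2 - mu, mu2 - mu)

# within class matrix, eq.(133); compare eq.(149)
S_W = (w1 - mu1).T @ (w1 - mu1) + (w2 - mu2).T @ (w2 - mu2)

W = np.linalg.inv(S_W) @ S_B                 # compare eq.(151)
eigvals, eigvecs = np.linalg.eig(W)
v = eigvecs[:, np.argmax(eigvals)]           # most robust direction (largest eigenvalue)

print(S_B, S_W, W, sep="\n")
print(w1 @ v)                                # projection of class 1, compare eq.(152)
print(w2 @ v)                                # projection of class 2, compare eq.(153)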

2.3.2 Algorithm (LDA class dependent method):


The first few steps (steps 1-3) of the class dependent method are similar to those of the class independent
method. We nevertheless rewrite these steps below to show a complete picture of the algorithm.
• Given a set of N samples \{x_i\}_{i=1}^{N}, each of which is represented as a row of length M, the
data matrix X is given as follows, which is of course of order N × M:

X = \begin{pmatrix} x_{(1,1)} & x_{(1,2)} & \cdots & x_{(1,M)} \\ x_{(2,1)} & x_{(2,2)} & \cdots & x_{(2,M)} \\ \cdots & \cdots & \cdots & \cdots \\ x_{(N,1)} & x_{(N,2)} & \cdots & x_{(N,M)} \end{pmatrix}_{N \times M}   (154)

• Compute the mean µ_k of each class ω_k as

\mu_k = \frac{1}{n_k} \sum_{x_i \in \omega_k} x_i,   (155)

where each µ_k is of order 1 × M.

• Compute the total mean µ of all the data (where k denotes the number of classes, each class
has samples of dimension M, and N is the total number of samples over all the classes),

\mu = \frac{1}{N} \sum_{i=1}^{N} x_i = \sum_{i=1}^{k} \frac{n_i}{N} \mu_i,   (156)

where µ is of order 1 × M.

• Compute the between class variance (or between class matrix) as

S_B = \sum_{l=1}^{k} n_l (\mu_l - \mu)^T (\mu_l - \mu).   (157)

S_B is of order M × M.

• For each class i, i = 1, 2, · · · , c, do:

Compute the within class matrix of the class (which we denote by S_{W_i}; it is of order M × M), given by

S_{W_i} = \sum_{x_j \in \omega_i} (x_j - \mu_i)^T (x_j - \mu_i).   (158)

Construct a transformation matrix for the class, W_i, as follows:

W_i = S_{W_i}^{-1} S_B.   (159)

The eigenvalues λ^i and eigenvectors V^i of each transformation matrix W_i are then calculated, where
λ^i and V^i represent the eigenvalues and eigenvectors of the i-th class, respectively.

Sort the eigenvectors in descending order according to their corresponding eigenvalues. The first k
eigenvectors are then used to construct a lower dimensional space V_k^i for the class.

Project the samples of each class ω_j onto their lower dimensional space V_k^j, as follows:

\Omega_j = x_i V_k^j, \quad x_i \in \omega_j,   (160)

where Ω_j represents the projected samples of the class ω_j.

• end for

Eqs. (154)-(157) are similar to eqs. (129)-(132). Using them we get the S_B matrix as before, which
is given in eq.(144). We now use the remaining equations of the class dependent algorithm to find
the within class matrix of each class. For the problem considered above, S_{W_1} and S_{W_2} correspond
to the within class matrices of classes ω_1 and ω_2 respectively (they coincide with eqs.(146) and (148)),
and their inverses are given by

S_{W_1}^{-1} = \begin{pmatrix} 0.9 & -1 \\ -1 & 1.25 \end{pmatrix}, \quad S_{W_2}^{-1} = \begin{pmatrix} 0.1933 & -0.0322 \\ -0.0322 & 0.1720 \end{pmatrix}.   (161)

Consequently we have

[W_1]_{s_{w_1}} = \begin{pmatrix} 14.13 & -13.54 \\ -16.72 & 16.01 \end{pmatrix}, \quad [W_2]_{s_{w_2}} = \begin{pmatrix} 1.705 & -1.634 \\ -1.499 & 1.435 \end{pmatrix},   (162)

where [W_1]_{s_{w_1}} = S_{W_1}^{-1} S_B and [W_2]_{s_{w_2}} = S_{W_2}^{-1} S_B. The eigenvalue-eigenvector pairs of [W_1]_{s_{w_1}} are
λ_1 = −0.0056 and λ_2 = 30.145 with V_1 = \begin{pmatrix} -0.6916 \\ -0.7221 \end{pmatrix} and V_2 = \begin{pmatrix} 0.6456 \\ -0.7637 \end{pmatrix}, while the
eigenvalue-eigenvector pairs of [W_2]_{s_{w_2}} are λ^1 = 3.1389 and λ^2 = 0.00105 with V^1 = \begin{pmatrix} 0.7508 \\ -0.6605 \end{pmatrix}
and V^2 = \begin{pmatrix} 0.6913 \\ -0.7226 \end{pmatrix}.
We see that the second eigenvector V_2 for the class ω_1 has a larger corresponding eigenvalue
(λ_2 = 30.145) than the first one; thus, the second eigenvector is used as the lower dimensional space
for the first class. However, in the class ω_2 the first eigenvector has a larger corresponding eigenvalue
(λ^1 = 3.1389) than the second one; hence the first eigenvector is used to project the data of
the second class onto its lower dimensional space. Thus the projected vectors for the classes ω_1 and ω_2
will respectively be

\Omega_{\omega_1} = \omega_1 V_2 = \begin{pmatrix} 1 & 2 \\ 2 & 3 \\ 3 & 3 \\ 4 & 5 \\ 5 & 5 \end{pmatrix} \begin{pmatrix} 0.6456 \\ -0.7637 \end{pmatrix} = \begin{pmatrix} -0.88 \\ -0.99 \\ -0.35 \\ -1.24 \\ -0.59 \end{pmatrix},   (163)

and

\Omega_{\omega_2} = \omega_2 V^1 = \begin{pmatrix} 4 & 2 \\ 5 & 0 \\ 5 & 2 \\ 3 & 2 \\ 5 & 3 \\ 6 & 3 \end{pmatrix} \begin{pmatrix} 0.7508 \\ -0.6605 \end{pmatrix} = \begin{pmatrix} 1.68 \\ 3.75 \\ 2.43 \\ 0.93 \\ 1.77 \\ 2.52 \end{pmatrix}.   (164)
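The class dependent variant can be sketched in the same spirit. The fragment below is again only an illustration on the same 11 × 2 data: a separate transformation matrix is built for each class and each class is projected onto its own leading eigenvector, to be compared with eqs.(163)-(164) up to sign and rounding.

import numpy as np

X = np.array([[1, 2], [2, 3], [3, 3], [4, 5], [5, 5],
              [4, 2], [5, 0], [5, 2], [3, 2], [5, 3], [6, 3]], dtype=float)
classes = [X[:5], X[5:]]
mu = X.mean(axis=0)

# the between class matrix S_B is common to all classes, eq.(157)
S_B = sum(len(w) * np.outer(w.mean(axis=0) - mu, w.mean(axis=0) - mu) for w in classes)

for w in classes:
    mu_k = w.mean(axis=0)
    S_Wk = (w - mu_k).T @ (w - mu_k)         # within class matrix of this class, eq.(158)
    W_k = np.linalg.inv(S_Wk) @ S_B          # per-class transformation matrix, eq.(159)
    vals, vecs = np.linalg.eig(W_k)
    v = vecs[:, np.argmax(vals)]             # leading eigenvector of this class
    print(w @ v)                             # projected samples, compare eqs.(163)-(164)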

If we compare the two LDA methods (i.e. class dependent (CDLDA) and class independent (CILDA)),
we find that, although both are used to calculate a lower dimensional space, CDLDA calculates a
separate lower dimensional space for each class, which has two main limitations: (1) it needs more
CPU time and more calculations than CILDA; (2) it may lead to the SSS (Small Sample Size) problem,
because the number of samples in each class affects the singularity of S_{W_i}. These considerations
explain why the standard LDA technique uses the CILDA method rather than the CDLDA method.

We now try to understand the generalized scenario about what we have learned so far and to
know about the prior probabilities and the linear discriminant function parameters.

2.4 Classes and features (with binary classification):


Suppose the population P is partitioned into k unordered classes39 , groups or sub-populations,
which we denote by ω1 , ω2 , · · · , ωk . Each item in P is classified into one (and only one) of those
classes. Measurements on a sample of items are to be used to help assign future unclassified
items to one of the designated classes. The r measurements on an item are represented by
X = (X1 , X2 , · · · , Xr )τ . The variables X1 , X2 , · · · , Xr are likely to be chosen because of their
suspected ability to distinguish among the k− classes. The variables X1 , X2 , · · · , Xr are hence
called discriminating or feature variables and the vector X is called feature vector.

We consider below the binary classification problem with two classes (i.e. k = 2), where we
wish to discriminate between two classes ω1 and ω2. Let P(X ∈ ωi) = Πi, i = 1, 2, be the prior
probabilities that a randomly selected observation X = x belongs to either ω1 or ω2. Suppose
also that the conditional multivariate probability density of X for ith class is

P (X = x|X ∈ ωi ) = fi (x), i = 1, 2. (165)


39
By unordered we mean, any of the classes may be 1st , 2nd , · · · or kth classes.

The function f_i on the right hand side of eq.(165) may be discrete or continuous, may be a finite
mixture distribution, or may even have a singular covariance matrix. Bayes' theorem yields the
posterior probability

P(\omega_i|x) = P(X \in \omega_i \mid X = x) = \frac{P[(X \in \omega_i) \cap (X = x)]}{P(X = x)} = \frac{f_i(x)\Pi_i}{f_1(x)\Pi_1 + f_2(x)\Pi_2},   (166)

that the observed x belongs to ω_i, i = 1, 2. For a given X, a reasonable classification strategy is
to assign x to the class with the higher posterior probability. This strategy is called the Bayes'
rule classifier. In other words, we assign x ∈ ω_1 if

\frac{P(\omega_1|x)}{P(\omega_2|x)} > 1,   (167)

and we assign x to ω_2 otherwise. The ratio \frac{P(\omega_1|x)}{P(\omega_2|x)} is referred to as the odds-ratio that ω_1 rather
than ω_2 is the correct class given the information in x. Substituting eq.(166) into eq.(167), the
Bayes' rule classifier assigns x to ω_1 if \frac{f_1(x)}{f_2(x)} > \frac{\Pi_2}{\Pi_1}, and to ω_2 otherwise.

Note: In the dimensionality reduction problem considered in the previous section, the probabilities
P(X ∈ ω_1) = Π_1 = 5/11 = 0.4545 and P(X ∈ ω_2) = Π_2 = 6/11 = 0.5454 are the prior
probabilities.
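To make the rule concrete, the following sketch (an illustration added to these notes) applies the Bayes' rule classifier to the two classes of the earlier numerical example, assuming, purely for illustration, that each class-conditional density f_i is bivariate normal with mean and covariance estimated from that class; the priors Π_1 = 5/11 and Π_2 = 6/11 are those noted above.

import numpy as np
from scipy.stats import multivariate_normal

w1 = np.array([[1, 2], [2, 3], [3, 3], [4, 5], [5, 5]], dtype=float)
w2 = np.array([[4, 2], [5, 0], [5, 2], [3, 2], [5, 3], [6, 3]], dtype=float)
priors = np.array([len(w1), len(w2)], dtype=float) / (len(w1) + len(w2))   # 5/11 and 6/11

# assumed class-conditional densities: bivariate normals fitted to each class
densities = [multivariate_normal(mean=w.mean(axis=0), cov=np.cov(w.T)) for w in (w1, w2)]

def classify(x):
    # assign x to the class with the larger value of f_i(x) * Pi_i (Bayes' rule)
    scores = [priors[i] * densities[i].pdf(x) for i in range(2)]
    return 1 if scores[0] >= scores[1] else 2

print(classify(np.array([2.0, 3.0])))   # falls in omega_1 under these assumptions
print(classify(np.array([5.0, 1.0])))   # falls in omega_2 under these assumptions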

3 Cluster Analysis:
An ability that distinguishes humans from other creatures is their ability to classify things.
Classification is fundamental to most branches of science and is extremely important for their study.
In biology, the theory and practice of classifying organisms is generally known as taxonomy (we have
often encountered this term in Mendel's work, Darwin's work and so on). Day by day the amount of
data is becoming larger, and so is the importance of managing these huge data sets. In market
research analysis we have seen that online portals such as Amazon and Flipkart often try to guess
consumer choices and accordingly send us hints of products that we might like to buy. The
exploration of such databases using cluster analysis and other multivariate analysis techniques is
now often called data mining. In the 21st century, data mining has become of particular interest for
investigating material on the World Wide Web, where the aim is to extract useful information or
knowledge from web page contents. At first glance the terms cluster, group and class seem to be
similar kinds of words, but there are subtle differences among these terms (in the mathematical
sense), and it is not possible to give a universally unique definition of a cluster. It is not entirely
clear how a 'cluster' is recognized when displayed in the plane, but one feature of the recognition
process would appear to involve assessment of the relative distances between points. Figure 9
shows a few schematic representations of clusters of the kind that human minds can readily recognize.

3.1 A few clustering techniques:
Hierarchical clustering:
In a hierarchical classification the data are not partitioned into a particular number of classes or
clusters at a single step. Instead the classification consists of a series of partitions, which may run
from a single cluster containing all individuals, to n clusters each containing a single individual.
Hierarchical clustering techniques may be subdivided into agglomerative methods, which proceed
by a series of successive fusions of the n individuals into groups, and divisive methods, which sep-
arate the n individuals successively into finer groupings. The following figure 10 shows a schematic
of hierarchical clustering. From figure 10 it is observed that initially all the black dots are
considered together; gradual grouping/clustering is then applied until all the black dots become
individual clusters (shown inside the red rectangular box). With hierarchical methods,
divisions or fusions, once made, are irrevocable, so that when an agglomerative algorithm has joined
two individuals they cannot subsequently be separated, and when a divisive algorithm has made a
split it cannot be undone.

Below we give an example of agglomerative method. We consider a distance matrix which is


given by

Table 10: Distance matrix I


Points 1 2 3 4 5
1 0.0
2 2.0 0.0
3 6.0 5.0 0.0
4 10.0 9.0 4.0 0.0
5 9.0 8.0 5.0 3.0 0.0

Single Linkage Method:


The simplest of the hierarchical clustering methods is single linkage, also known as the nearest-neighbour
technique. It was first described by Florek et al. (1951) and later by Sneath (1957) and Johnson
(1967). Single linkage operates directly on a proximity matrix.

Analysis:
Let us first analyze the above distance matrix. What does it represent? It gives an impression of five
points, numbered 1, 2, · · · , 5, scattered around; the distances between the points are recorded.
Take, for example, cell (1, 1), where the value is 0.0: it means the distance of point 1 from itself
is zero. Hence we see that all the diagonal values of table 10 are zero. Take again, for example,
cell (5, 4), where the value is 3.0: this means that the distance between points 4 and 5 is 3. You
see that some cells are empty; they have simply not been filled in. Take for example cell (4, 5),
which seems to have no value; but you know dist(4, 5) = dist(5, 4), and the value of dist(5, 4) is 3.0,
and so on. You can read the table values from row to column or from column to row. We begin
our process now.

• We club points 1 and 2 into a single cluster and then calculate the distances of the other three
points 3, 4 and 5 from this cluster. This is because the cell for the pair (1, 2) contains the smallest
non-zero distance, 2.0. For example, dist(12)3 = min{dist13, dist23} = min{6.0, 5.0} = 5.0.
In a similar manner we compute the other distances and we subsequently get the following table.

Table 11: Distance matrix II


Points (12) 3 4 5
(12) 0.0
3 5.0 0.0
4 9.0 4.0 0.0
5 8.0 5.0 3.0 0.0

• In table 11, the lowest non-zero entry is 3.0, which appears in cell (4, 5). So we
club points 4 and 5 and make another cluster (4, 5). Now we have two clusters, (1, 2) and
(4, 5). We calculate the distances dist(12)3, dist(12)(45) and dist(45)3; for example,
dist(12)(45) = min{dist(12)4, dist(12)5} = min{9.0, 8.0} = 8.0, and so on. Thus we get the distance
matrix of table 12.

Table 12: Distance matrix III


Points (12) 3 (45)
(12) 0.0
3 5.0 0.0
(45) 8.0 4.0 0.0

• The smallest entry in table 12 is now d(45)3 = 4.0, so individual 3 is added to the cluster containing
individuals 4 and 5. Finally the groups containing individuals (1, 2) and (3, 4, 5) are combined
into a single cluster. We calculate the distance dist(12)(345), which one can easily verify to
be 5.0.

Table 13: Distance matrix IV


Points (12) (345)
(12) 0.0
(345) 5.0 0.0

So altogether we have formed two clusters. The above methodology of forming clusters can be
plotted using a dendogram. Look at figure 11 and tally it with the tables of the single linkage
method to see how the dendogram is made (try to understand the similarity!).
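For completeness, the same single linkage clustering can be obtained with scipy; the sketch below is an illustration and not part of the original notes. The linkage matrix Z records which clusters are merged and at what distance, which is exactly the information used to draw the dendogram of figure 11; changing method='single' to method='complete' gives the complete linkage variant discussed below.

import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram
from scipy.spatial.distance import squareform

# full symmetric form of the distance matrix of Table 10
D = np.array([[0,  2,  6, 10,  9],
              [2,  0,  5,  9,  8],
              [6,  5,  0,  4,  5],
              [10, 9,  4,  0,  3],
              [9,  8,  5,  3,  0]], dtype=float)

Z = linkage(squareform(D), method='single')   # condensed distances -> merge history
print(Z)                                      # merges occur at distances 2.0, 3.0, 4.0 and 5.0

# dendrogram(Z, labels=[1, 2, 3, 4, 5]) together with matplotlib reproduces figure 11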

Q: What are the various types of hierarchical clustering? Explain them. What is a dendogram? [2023-24 ODD]

Ans: Hierarchical clustering techniques proceed by either a series of successive mergers or a series
of successive divisions. Agglomerative hierarchical methods start with the individual objects.
Thus, there are initially as many clusters as objects. The most similar objects are first grouped, and
then these initial groups are merged according to their similarities. Eventually, as the similarity
decreases, all subgroups are fused into a single cluster (see figure 10; the direction is from right to
left). Divisive hierarchical methods work in the opposite direction. An initial single group of objects
is divided into two subgroups such that the objects in one subgroup are far from the objects in the
other. These subgroups are then further divided into dissimilar subgroups; the process continues
until there are as many subgroups as objects, that is, until each object forms a group (see figure 10,
direction as shown). The results of both agglomerative and divisive methods may be displayed in
the form of a two-dimensional diagram known as a dendogram. Linkage methods are suitable for
clustering items as well as variables. This is not true for all hierarchical agglomerative procedures.
Some of the agglomerative processes are (i) the single linkage method, (ii) the complete linkage method,
and (iii) the average linkage method40.

Q: What is agglomerative and divisive analysis in hierarchical clustering methods? [2023-24 ODD]

Ans: Write the same answer as above. Here you don't have to discuss the methodologies in detail;
just state the difference between the two techniques.

Complete Linkage Method:


Complete linkage clustering proceeds in much the same manner as single linkage clustering, with
one important exception. At each stage, the distance (similarity) between clusters is determined
by the distance (similarity) between the two elements, one from each cluster, that are most distant.
Thus, complete linkage ensures that all items in a cluster are within some maximum distance (or
minimum similarity) of each other. The general agglomerative algorithm again starts by finding
the minimum entry in D = {dik } and merging the corresponding objects, such as U and V , to get
cluster (U V ). Then the distance between (U V ) and any other cluster W are computed by
dist(U V )W = max{distU W , distV W }. (168)
Here, distU W and distV W are the distances between the most distant members of clusters U and
W and clusters V and W , respectively.

Q: Solve the following distance matrix by complete linkage technique and construct dendogram
of it [2023-24 ODD].

Sol: In the given table, among all the values the minimum is 2, occurring at cell (c, e). So at the
first stage objects c and e are merged. This gives the cluster (ce). At stage II, we compute the
following.
dist(ce)a = max{distca , distea } = max{3, 11} = 11,
dist(ce)b = max{distcb , disteb } = 10,
dist(ce)d = max{distcd , disted } = 9. (169)
40
Students are now supposed to write in brief each of these methods as required in the MAKAUT question paper.

Points a b c d e
a 0 9 3 6 11
b 9 0 7 5 10
c 3 7 0 9 2
d 6 5 9 0 8
e 11 10 2 8 0

and the modified distance matrix is as follows:

Points (ce) a b d
(ce) 0
a 11 0
b 10 9 0
d 9 6 5 0

The next merger occurs between the most similar groups b and d, giving the cluster (bd), as the minimum value 5 appears at cell (b, d). At stage
III, we have

dist_{(bd)(ce)} = max{dist_{b(ce)}, dist_{d(ce)}} = max{10, 9} = 10,

dist_{(bd)a} = max{dist_{ba}, dist_{da}} = 9.   (170)

Consequently the distance matrix is as follows:

Points (ce) (bd) a


(ce) 0
(bd) 10 0
a 11 9 0

You see that in this distance matrix the minimum element is 9, appearing at cell (a, (bd)); hence we
merge these two and get the cluster (abd). The distance is calculated as

dist(abd)(ce) = max{dista(ce) , dist(bd)(ce) } = max{11, 10} = 11. (171)

This completes the solution to the problem. The dendogram is shown in the following figure 12.
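The same answer can be cross-checked with scipy (again only an illustrative sketch): with method='complete' the merge heights come out as 2, 5, 9 and 11, matching the fusion levels obtained by hand and displayed in the dendogram of figure 12.

import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import squareform

D = np.array([[0,  9,  3,  6, 11],
              [9,  0,  7,  5, 10],
              [3,  7,  0,  9,  2],
              [6,  5,  9,  0,  8],
              [11, 10, 2,  8,  0]], dtype=float)   # distance matrix for points a, b, c, d, e

Z = linkage(squareform(D), method='complete')
print(Z)   # merges (c,e) at 2, (b,d) at 5, a with (bd) at 9, and the two groups at 11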

Procedure of drawing Dendogram:


• Draw horizontal and vertical axes.

• Calibrate them: the horizontal axis is calibrated with the points which need to be clustered, and
the vertical axis is calibrated with distances.

• Find the minimum distance in the distance matrix and mark it on the vertical axis. Mark the
corresponding pair of points (the row and column labels of that cell) on the horizontal axis. Raise
a vertical bar from each of these two points up to the value of the minimum distance and join
them overhead.

• After constructing the next distance matrix, mark its minimum distance and the corresponding
cell, and club the new point (or cluster) with the old ones to make a new cluster.

• Repeat the previous two steps until all the points have been merged into a single cluster.

Note: For both the single linkage method and the complete linkage method, the process of drawing the
dendogram is the same. The only difference is in the construction of the distance matrix: in single
linkage we take the minimum of the relevant distances, while in complete linkage we take the maximum.

Q: Give one methodology for non-hierarchical clustering.

Ans: K-Means clustering profiling.
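The notes do not elaborate further; as a minimal illustration (an addition to these notes, re-using the 11 × 2 data of the LDA example purely for convenience), K-means can be run with scikit-learn as follows.

import numpy as np
from sklearn.cluster import KMeans

X = np.array([[1, 2], [2, 3], [3, 3], [4, 5], [5, 5],
              [4, 2], [5, 0], [5, 2], [3, 2], [5, 3], [6, 3]], dtype=float)

# partition the points into k = 2 clusters by minimizing the within-cluster sum of squares
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(kmeans.labels_)            # cluster label of each point
print(kmeans.cluster_centers_)   # the two cluster centroids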

[Figure placeholder: scatter plot of Purity (vertical axis, 88-98) against Hydrocarbon level (horizontal axis, 0.9-1.5).]

Figure 3: The figure shows that the data has a linear tendency in the increasing direction.

[Figure placeholder: scatter plot of Purity against Hydrocarbon level with the fitted line.]

Figure 4: The figure shows the fitted regression line with the scattered points.

[Figure placeholder: scatter plot of Purity against Hydrocarbon.Level with points coloured by their deviation from the regression line.]

Figure 5: The different colours show how the observed values are scattered. Red coloured dots are
the points which deviate most from the regression line, while green coloured points are those that
deviate least from it.
[Figure placeholder: 3D scatter plot of data$Pull.Strength, data$Wire.Length and data$Die.Height.]

Figure 6: The figure shows the 3D scatter plot considering Pull Strength as the response variable and
Wire Length and Die Height as the input variables.

Figure 7: This figure shows how T transforms the vectors (1, 0) and (0, 1) in the coordinate plane.

Figure 8: This figure shows how T transforms the vectors (−1, 1) and (1, 1) in the coordinate plane.

Figure 9: How humans draw clusters

Figure 10: Schematic of hierarchical clustering

Figure 11: Dendogram of hierarchical clustering (single linkage method)

Figure 12: Dendogram of hierarchical clustering (complete linkage method)

