Unit-13 Feature Selection and Extraction
13.1 INTRODUCTION
Data sets are made up of numerous data columns, which are also referred to
as data attributes. These data columns can be interpreted as dimensions of an
n-dimensional feature space, and data rows can be interpreted as points inside
that space. Viewing a dataset geometrically in this manner can give a better
understanding of it. In practice, several of these attributes may be measurements
of the same underlying entity; their presence can confuse the algorithm's logic
and change how well the model performs.
Input variables are the columns of data that are fed into a model in order to
predict a target variable. When the data is given in the form of rows and columns,
such as in a spreadsheet, "features" is another term that can be used
interchangeably with input variables. A large number of dimensions in the
feature space implies that the volume of that space is enormous, and as a result
the points (data rows) in that space may form a small and non-representative
sample of the space. The performance of machine learning algorithms can
degrade when there are an excessive number of input variables; this effect of
applying an algorithm to data with a large number of input attributes is referred
to as the "curse of dimensionality." As a consequence, one of the most common
goals is to cut down the number of input features. The process of decreasing the
number of dimensions that characterise a feature space is referred to as
"dimensionality reduction."
Too much information can sometimes hinder the usefulness of data mining.
Some of the data columns (attributes) that have been compiled for building and
testing a model do not offer any information that is significant to the model,
and some actually reduce the reliability and precision of the model.
For instance, suppose you want to build a model that forecasts the incomes of
people already employed in their respective fields. Data columns like cellphone
number, house number, and so on will not truly contribute any value to the
dataset, and they can therefore be omitted. Such irrelevant attributes introduce
noise into the data and affect the accuracy of the model. Additionally, because
of the noise, the size of the model as well as the amount of time and system
resources required for model construction and scoring are both increased.
At this point we need to put the concept of dimensionality reduction into
practice. This can be done in one of two ways: either by feature selection or by
feature extraction. Both of these approaches are broken down in greater detail
below. Dimension reduction is one of the preprocessing steps of the data mining
process, and it can be beneficial in minimising the effects of noise, correlation,
and excessive dimensionality.
Some more examples are presented below to help you understand what
dimensionality reduction has to do with machine learning and predictive
modelling:
• A simple e-mail classification problem, in which we must decide whether
or not a certain email is spam, can serve as a practical illustration of
dimensionality reduction. The features can include whether or not the email
has a standard subject line, the content of the email, whether or not it uses
a template, and so on. However, some of these features may overlap with
one another.
• A classification problem that involves humidity and rainfall can sometimes
be simplified down to just one underlying feature as a result of the strong
relationship that exists between the two variables. As a direct consequence
of this, the number of characteristics could get cut down in some
circumstances.
• A classification problem in three dimensions can be difficult to visualise,
whereas the same data can often be mapped to a fundamental two-dimensional
space, or even to a one-dimensional line. This concept is depicted in the
figure that follows, which shows how a three-dimensional feature space can
be reduced to two-dimensional and one-dimensional feature spaces, with the
number of features being reduced even further if some of them are found to
be related.
In the context of dimensionality reduction, techniques such as Principal
Component Analysis, Linear Discriminant Analysis and Singular Value
Decomposition are frequently used. In this unit we will discuss all of these
concepts related to dimensionality reduction.
[Figure: Dimensionality reduction illustrated on a feature space with axes X and Y, shown at successively lower dimensions]
13.2 DIMENSIONALITY REDUCTION
Both data mining and machine learning methodologies face processing
challenges when working with large amounts of data (many attributes). In fact,
the dimensionality of the feature space used by the approach, i.e. the number of
model attributes, plays the most important role: processing algorithms become
more difficult and time-consuming to run as the dimensionality of the
processing space increases.
These attributes, also known as the model attributes, are the fundamental
qualities of the data, and they may be called variables or features. When there
are more features, it is harder to visualise them all, and the work on the training
set becomes more complex as well. The complexity increases further when a
significant number of the features are correlated, which can make the resulting
classification unreliable. In circumstances like these, techniques for reducing
the number of dimensions can prove highly beneficial. In a nutshell, "the process
of deriving a set of principal variables from a large number of random variables
is what is referred to as dimension reduction." When conducting data mining,
dimension reduction can serve as a preprocessing step that lessens the negative
effects of noise, correlation and excessive dimensionality.
Dimension reduction can be accomplished in two ways:
• Feature selection: In this approach, a subset of the complete set of variables
is selected, so the number of attributes used to describe the problem is
narrowed down. It is normally done in one of three ways:
○ Filter method
○ Wrapper method
○ Embedded method
• Feature extraction: It transforms the data from a space with many dimensions
into a space with fewer dimensions.
13.2.1 Feature Selection
It is the process of selecting some attributes from a given collection of
prospective features and discarding the rest. Feature selection is done for one
of two reasons: to keep a limited number of features in order to prevent
overfitting, or to remove features that are redundant or irrelevant. For data
scientists, the ability
to pick features is a vital asset. It is essential to the success of the machine
learning algorithm that you have a solid understanding of how to choose the
most relevant features to analyse. Features that are irrelevant, redundant, or
noisy can contaminate an algorithm, which can have a detrimental impact on
the learning performance, accuracy, and computing cost. The importance of
feature selection is only going to increase as the size and complexity of the
typical dataset continues to balloon at an exponential rate.
Feature Selection Methods: Feature selection methods can be divided into
two broad categories: supervised methods, which are appropriate for use with
labelled data, and unsupervised methods, which are appropriate for use with
unlabelled data. The supervised approaches are commonly grouped into four
categories: filter methods, wrapper methods, embedded methods, and hybrid
methods:
• Filter methods: Filter methods choose features on the basis of statistical
measures rather than on how well they perform in cross-validation of a
model. Using a chosen metric, irrelevant attributes are identified and
removed. Filter methods can be either univariate, in which an ordered
ranking of individual features is produced and used to choose the final
subset of features, or multivariate, in which the relevance of the features is
evaluated jointly in order to find features that are redundant or irrelevant.
• Wrapper methods: Wrapper feature selection methods treat the choice of a
set of features as a search problem: candidate subsets of features are
prepared, evaluated and compared with other subsets by training a model
on them. This approach makes it easier to detect possible interactions
between variables. Wrapper methods search for the subset of features that
most improves the quality of the results produced by the learning algorithm
used for the selection. Popular examples are Boruta feature selection and
forward feature selection.
• Embedded methods: Embedded feature selection approaches build feature
selection into the machine learning algorithm itself, so that classification
and feature selection take place simultaneously during training. At each
iteration of model training, careful consideration is given to extracting the
features that make the greatest contribution. A few examples of common
embedded approaches are the LASSO feature selection algorithm, the
random forest feature selection algorithm, and the decision tree feature
selection algorithm. (A short code sketch after this list illustrates typical
usage of the filter, wrapper and embedded families.)
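As a rough illustration, the sketch below shows how the three families are commonly invoked with scikit-learn. The synthetic dataset, the choice of estimators and the number of features to keep are assumptions made only for this example, not part of the unit's material.

# Hedged sketch: the filter, wrapper and embedded families in scikit-learn.
# The synthetic dataset, the estimators and k=5 are illustrative assumptions.
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE, SelectFromModel, SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, n_features=20,
                           n_informative=5, random_state=0)

# Filter: rank features with a univariate statistic (ANOVA F-score), keep the top 5.
X_filter = SelectKBest(score_func=f_classif, k=5).fit_transform(X, y)

# Wrapper: repeatedly fit a model and eliminate the weakest features (RFE).
X_wrapper = RFE(LogisticRegression(max_iter=1000),
                n_features_to_select=5).fit_transform(X, y)

# Embedded: L1 (LASSO-style) regularisation zeroes out weak coefficients during training.
l1_model = LogisticRegression(penalty="l1", solver="liblinear", C=0.1)
X_embedded = SelectFromModel(l1_model).fit_transform(X, y)

print(X_filter.shape, X_wrapper.shape, X_embedded.shape)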
Among all these approaches, the most conventional is forward feature
selection.
Forward feature selection: The first step is to evaluate each individual feature
and choose the one that results in the most effective model. After that, each
possible combination of the selected feature and one of the remaining features
is analysed, and a second feature is selected, and so on, until the required number
of features is chosen. The operation of the forward feature selection algorithm
is illustrated by the example below.
Suppose we first train the model on each feature individually: using only
Calories Burnt we obtain an accuracy of 87%, while using only the Gender
feature we obtain an accuracy of 80%. The single feature giving the best model
(here, Calories Burnt) is kept. Next, every combination of the selected feature
with one of the remaining features is evaluated: combining Calories Burnt with
Gender gives an accuracy of 88%, whereas combining Plays Sport with Calories
Burnt gives 91%. The variable that yields the greatest improvement is kept,
which makes natural sense: since Plays Sport together with Calories Burnt gives
the better result, we keep it and use it in our model. We keep repeating this
process until all the features have been considered for improving the model's
performance (or the required number of features has been selected).
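The same greedy procedure is available off the shelf. The sketch below uses scikit-learn's SequentialFeatureSelector on synthetic data; the dataset and the target of three selected features are assumptions for illustration, not the Gender / Plays Sport / Calories Burnt example above.

# Hedged sketch of greedy forward feature selection with scikit-learn.
# The synthetic data and n_features_to_select=3 are illustrative assumptions.
from sklearn.datasets import make_classification
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=300, n_features=8,
                           n_informative=3, random_state=0)

# direction="forward": start from an empty set, add the single feature that most
# improves cross-validated accuracy, and repeat until 3 features are selected.
sfs = SequentialFeatureSelector(LogisticRegression(max_iter=1000),
                                n_features_to_select=3,
                                direction="forward",
                                scoring="accuracy",
                                cv=5)
sfs.fit(X, y)
print("Selected feature indices:", sfs.get_support(indices=True))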
[Figure: Original feature axes f1 and f2 together with the principal component axes e1 and e2]
When applying the PCA method, the following are the primary steps to be
followed:
1. Obtain the dataset you need.
2. Calculate the mean vector (µ).
3. Subtract the mean from each of the given data vectors.
4. Compute the covariance matrix.
5. Determine the eigenvectors and eigenvalues of the covariance matrix.
6. Create a feature vector by deciding which components are the major ones,
i.e. the principal components.
7. Create a new data set by projecting the data onto the chosen eigenvectors.
As a result, we have a smaller number of eigenvectors, and some data may
have been lost in the process. However, the remaining eigenvectors should
retain the most significant variances.
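A minimal NumPy sketch of these seven steps is given below; the random data matrix and the choice of two retained components are assumptions for illustration only.

# Hedged sketch: the seven PCA steps above, implemented directly with NumPy.
# The random data matrix and k=2 retained components are assumptions.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))            # step 1: dataset (100 samples, 5 features)

mu = X.mean(axis=0)                      # step 2: mean vector
Xc = X - mu                              # step 3: subtract the mean from each sample
C = (Xc.T @ Xc) / X.shape[0]             # step 4: covariance matrix
eigvals, eigvecs = np.linalg.eigh(C)     # step 5: eigenvalues and eigenvectors

order = np.argsort(eigvals)[::-1]        # step 6: rank components by eigenvalue
k = 2
W = eigvecs[:, order[:k]]                # feature vector = the k principal components

Z = Xc @ W                               # step 7: project the data onto the new axes
print(Z.shape)                           # (100, 2) -- the reduced representation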
Merits of Dimensionality Reduction
• It helps to compress data, which reduces the amount of space needed to
store it and the amount of time it takes to process it.
• If there are any redundant features, it also helps to get rid of them.
Limitations of Dimensionality Reduction
• You might lose some data.
• PCA fails when the mean and covariance are not enough to describe a
dataset.
• We do not know in advance how many principal components to keep; in
practice, some rules of thumb are followed.
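One widely used rule of thumb is to keep the smallest number of components whose cumulative explained variance reaches a chosen target. The sketch below illustrates this with scikit-learn; the data and the 95% threshold are assumptions for illustration.

# Hedged sketch: keep the smallest number of components whose cumulative
# explained variance reaches a target share (95% is an assumed threshold).
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))                       # placeholder data for illustration

pca = PCA().fit(X)
cumulative = np.cumsum(pca.explained_variance_ratio_)
n_keep = int(np.searchsorted(cumulative, 0.95)) + 1  # smallest k explaining >= 95% variance
print("components kept:", n_keep)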
Below is a practice question on Principal Component Analysis (PCA):
Problem-01: The given data are x = 2, 3, 4, 5, 6, 7 and y = 1, 5, 3, 6, 7, 8. Using
the PCA algorithm, calculate the principal component.
OR
Consider the two-dimensional patterns (2, 1), (3, 5), (4, 3), (5, 6), (6, 7) and (7, 8).
Using the PCA algorithm, calculate the principal component.
OR
Calculate the principal component of the following data:
x : 2  3  4  5  6  7
y : 1  5  3  6  7  8
Solution :
Step-01: Get the data. The feature vectors are x1 = (2, 1), x2 = (3, 5), x3 = (4, 3),
x4 = (5, 6), x5 = (6, 7) and x6 = (7, 8).
Step-02: Compute the mean vector (µ).
µ = ( (2 + 3 + 4 + 5 + 6 + 7)/6 , (1 + 5 + 3 + 6 + 7 + 8)/6 )
Thus, Mean vector (µ) = (4.5, 5)
Step-03:
Subtract the mean vector (µ) from each of the given feature vectors:
• x1 - µ = (2 - 4.5, 1 - 5) = (-2.5, -4), and similarly for the others.
The feature vectors (xi - µ) generated after subtraction are (written as columns):
[ -2.5  -1.5  -0.5   0.5   1.5   2.5 ]
[ -4     0    -2     1     2     3   ]
Step-04:
Now find the covariance matrix (matrices below are written row by row, with
rows separated by semicolons):
Covariance Matrix = Σ (xi - µ)(xi - µ)^T / n
m1 = (x1 - µ)(x1 - µ)^T = [ -2.5 ; -4 ] [ -2.5  -4 ] = [ 6.25  10 ; 10  16 ]
m2 = (x2 - µ)(x2 - µ)^T = [ -1.5 ; 0 ] [ -1.5  0 ] = [ 2.25  0 ; 0  0 ]
m3 = (x3 - µ)(x3 - µ)^T = [ -0.5 ; -2 ] [ -0.5  -2 ] = [ 0.25  1 ; 1  4 ]
m4 = (x4 - µ)(x4 - µ)^T = [ 0.5 ; 1 ] [ 0.5  1 ] = [ 0.25  0.5 ; 0.5  1 ]
m5 = (x5 - µ)(x5 - µ)^T = [ 1.5 ; 2 ] [ 1.5  2 ] = [ 2.25  3 ; 3  4 ]
m6 = (x6 - µ)(x6 - µ)^T = [ 2.5 ; 3 ] [ 2.5  3 ] = [ 6.25  7.5 ; 7.5  9 ]
Covariance Matrix = (1/6) [ 17.5  22 ; 22  34 ]
Covariance Matrix = [ 2.92  3.67 ; 3.67  5.67 ]
Step-05:
Find the eigen values and eigen vectors of the covariance matrix:
det( [ 2.92  3.67 ; 3.67  5.67 ] - λ [ 1  0 ; 0  1 ] ) = 0
det [ 2.92 - λ   3.67 ; 3.67   5.67 - λ ] = 0
From here,
(2.92 - λ)(5.67 - λ) - (3.67 × 3.67) = 0
16.56 - 2.92λ - 5.67λ + λ² - 13.47 = 0
λ² - 8.59λ + 3.09 = 0
Solving this quadratic equation, we get λ = 8.22, 0.38
Thus, the two eigen values are λ1 = 8.22 and λ2 = 0.38.
Clearly, the second eigen value is very small compared to the first eigen value,
so the second eigen vector can be left out.
The eigen vector corresponding to the greatest eigen value is the principal
component for the given data set. So, we find the eigen vector corresponding to
the eigen value λ1.
We use the following equation to find the eigen vector:
MX = λX
where
• M = Covariance matrix ; X = Eigen vector, and λ = Eigen value
Substituting the values in the above equation, we get
[ 2.92  3.67 ; 3.67  5.67 ] [ X1 ; X2 ] = 8.22 [ X1 ; X2 ]
Solving, the eigen vector corresponding to λ1 = 8.22 is (proportional to)
X = [ 2.55 ; 3.67 ], which is the principal component of the given data set; it is
used in Problem-02 below.
Problem-02:
Use the PCA algorithm to transform the pattern (2, 1) onto the eigen vector
obtained in the previous question.
Solution-
The given feature vector is (2, 1), i.e. Given Feature Vector = [ 2 ; 1 ]
The feature vector gets transformed to:
= Transpose of Eigen vector × (Feature Vector - Mean Vector)
= [ 2.55  3.67 ] × ( [ 2 ; 1 ] - [ 4.5 ; 5 ] )
= [ 2.55  3.67 ] × [ -2.5 ; -4 ]
= -21.055
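The hand computation can be cross-checked with NumPy. Note that NumPy returns a unit-length eigenvector, so the projected value equals the -21.055 above divided by the length of (2.55, 3.67), and its sign is arbitrary.

# Hedged NumPy check of the hand computation in Problem-01 / Problem-02.
import numpy as np

X = np.array([[2, 1], [3, 5], [4, 3], [5, 6], [6, 7], [7, 8]], dtype=float)
mu = X.mean(axis=0)                       # mean vector -> [4.5, 5.0]
C = (X - mu).T @ (X - mu) / len(X)        # covariance -> approx. [[2.92, 3.67], [3.67, 5.67]]

eigvals, eigvecs = np.linalg.eigh(C)      # eigenvalues -> approx. [0.38, 8.21]
pc = eigvecs[:, np.argmax(eigvals)]       # unit-length direction, parallel to (2.55, 3.67)

# Problem-02: project the pattern (2, 1). NumPy's eigenvector has unit length, so
# this equals -21.055 divided by the length of (2.55, 3.67); the sign is arbitrary.
z = pc @ (np.array([2.0, 1.0]) - mu)
print(np.round(eigvals, 2), np.round(pc, 2), np.round(z, 2))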
[Figure: The two classes plotted against features X1 and X2 (both axes from 0 to 10), together with the LDA projection direction W_LDA]
Solution :
To understand the working of LDA, let us take an example and work it out step
by step.
Step 1:- Compute the within-class scatter matrix (Sw). Sw measures how the
data is scattered within each class, and is obtained using the means of the
classes, µ1 and µ2, where µ1 and µ2 are the means of class C1 and class C2
respectively.
Now, Sw is given by
Sw = S1 + S2
where, S1 is the covariance matrix for the class C1 and
S2 is the covariance matrix for the class C2
Let’s find the covariance matrices S1 & S2 of each class
S1 = Σ over x ∈ C1 of (x - µ1)(x - µ1)^T
where µ1 is the mean of class C1, which is computed by averaging the
coordinates of dataset X1:
µ1 = ( (4 + 2 + 2 + 3 + 4)/5 , (1 + 4 + 3 + 6 + 4)/5 )
µ1 = [ 3.00 ; 3.60 ]
Subtracting µ1 from each point of class C1 gives the deviation vectors (written
as columns):
(x1 - µ1) = [ 1  -1  -1   0   1 ; -2.6  0.4  -0.6  2.4  0.4 ]
Now, for each x in C1 we calculate (x - µ1)(x - µ1)^T, so we will have 5 such
matrices, one for each column of (x1 - µ1). Solving them one by one:
[ 1 ; -2.6 ] [ 1  -2.6 ] = [ 1  -2.6 ; -2.6  6.76 ]   (first matrix)
Similarly for the remaining columns:
[ -1 ; 0.4 ] [ -1  0.4 ] = [ 1  -0.4 ; -0.4  0.16 ]
[ -1 ; -0.6 ] [ -1  -0.6 ] = [ 1  0.6 ; 0.6  0.36 ]
[ 0 ; 2.4 ] [ 0  2.4 ] = [ 0  0 ; 0  5.76 ]
[ 1 ; 0.4 ] [ 1  0.4 ] = [ 1  0.4 ; 0.4  0.16 ]
Step 2:- Compute the between-class scatter matrix (SB):
SB = (µ1 - µ2)(µ1 - µ2)^T = [ -5.4 ; -4 ] [ -5.4  -4 ] = [ 29.16  21.6 ; 21.6  16.00 ]
Step 3:- Find the projection vector V, the eigen vector of Sw⁻¹SB corresponding
to its largest eigen value:
[ 11.89  8.81 ; 5.08  3.76 ] [ V1 ; V2 ] = 15.65 [ V1 ; V2 ]
We get [ V1 ; V2 ] = [ 0.91 ; 0.39 ]
Or we can directly solve
V = Sw⁻¹ (µ1 - µ2) = [ 0.38  0.032 ; 0.032  0.1921 ] [ -5.4 ; -4 ],
which after normalisation gives [ 0.91  0.39 ]^T (up to sign).
This is the projection vector corresponding to the highest eigen value.
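For completeness, the sketch below reproduces the two-class Fisher LDA computation in NumPy. The class C1 points are the ones used above; the class C2 points are an assumed placeholder, chosen only to be consistent with the mean difference µ1 - µ2 = (-5.4, -4) used in the example.

# Hedged sketch of the two-class Fisher LDA computation with NumPy.
# C1 holds the coordinates used above; the C2 points are an assumed placeholder,
# consistent with the mean difference mu1 - mu2 = (-5.4, -4).
import numpy as np

C1 = np.array([[4, 1], [2, 4], [2, 3], [3, 6], [4, 4]], dtype=float)
C2 = np.array([[9, 10], [6, 8], [9, 5], [8, 7], [10, 8]], dtype=float)  # assumption

mu1, mu2 = C1.mean(axis=0), C2.mean(axis=0)      # class means
S1 = (C1 - mu1).T @ (C1 - mu1)                   # scatter of class C1
S2 = (C2 - mu2).T @ (C2 - mu2)                   # scatter of class C2
Sw = S1 + S2                                     # within-class scatter matrix

w = np.linalg.solve(Sw, mu1 - mu2)               # direction proportional to Sw^-1 (mu1 - mu2)
w = w / np.linalg.norm(w)                        # normalise; the overall sign is arbitrary
print(np.round(w, 2))                            # approx. [0.92, 0.39] up to sign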
13.5 SINGULAR VALUE DECOMPOSITION
The Singular Value Decomposition (SVD) method is a well-known technique
for decomposing a matrix into constituent component matrices. This method is
valuable because it reveals many of the interesting and helpful characteristics
of the original matrix. We can use SVD to discover the optimal
lower-rank approximation to the matrix, determine the rank of the matrix, or
test a linear system's sensitivity to numerical error.
Singular value decomposition is a method of decomposing a matrix into three
smaller matrices.
A = UΣV^T
Where:
• A : is an m × n matrix
• U : is an m × m orthogonal matrix
• Σ : is an m × n diagonal matrix containing the singular values
• V : is an n × n orthogonal matrix
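A quick way to see these shapes in practice is NumPy's np.linalg.svd; the random 5 × 3 matrix below is an arbitrary example, not taken from the unit.

# Hedged sketch: full SVD with NumPy, showing the shapes of U, Sigma and V^T.
# The random 5 x 3 matrix is an arbitrary example.
import numpy as np

A = np.random.default_rng(0).normal(size=(5, 3))   # an m x n matrix with m=5, n=3

U, s, Vt = np.linalg.svd(A, full_matrices=True)    # s holds the singular values
Sigma = np.zeros_like(A)                           # build the m x n diagonal matrix
Sigma[:len(s), :len(s)] = np.diag(s)

print(U.shape, Sigma.shape, Vt.shape)              # (5, 5) (5, 3) (3, 3)
print(np.allclose(A, U @ Sigma @ Vt))              # True: A = U Sigma V^T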
Below is a practice problem based on singular value decomposition.
Problem-03: Find the SVD of the matrix A = [ 1  1 ; 0  1 ; -1  1 ]
Solution :
Step-1 : Compute V
A^T A = [ 1  0  -1 ; 1  1  1 ] × [ 1  1 ; 0  1 ; -1  1 ] = [ 2  0 ; 0  3 ]
The eigen values of A^T A are λ1 = 3 and λ2 = 2. For λ1 = 3, let X1 = [ a ; b ]
be the eigen vector of A^T A, i.e. (A^T A) X1 = λ1 X1.
Substituting the values, we get
[ 2  0 ; 0  3 ] [ a ; b ] = 3 [ a ; b ]
Solving, we get
-a + 0·b = 0 ------------------------(1)
0·a + 0·b = 0 ------------------------(2)
From (1), -a + 0·b = 0; substituting a = 0 and b = 1 satisfies the equations, so
the eigen vector is
X1 = [ a ; b ] = [ 0 ; 1 ]
Now we need to normalise this eigen vector X1, which is done by dividing each
element of X1 by the length of X1. Since the length of X1 is already 1,
N(X1) = [ 0 ; 1 ]. Similarly, for λ2 = 2 the eigen vector is X2 = [ 1 ; 0 ], so
V = [ 0  1 ; 1  0 ] and hence V^T = [ 0  1 ; 1  0 ].
Step-2 : Compute ∑
To compute ∑ we need to note that the order of ∑ must be equal to the order of
A, and we need to calculate the singular values σ1 and σ2, where σ1 = √λ1 and
σ2 = √λ2, i.e. σ1 = √3 and σ2 = √2. Also, recall the relation between the rank of
a matrix and its eigen values: the number of non-zero eigen values is equal to
the rank of the matrix. Since the number of non-zero eigen values is two, the
rank (R) of the matrix is also two, i.e. 2.
In block form, ∑ consists of a diagonal block ∑1 padded with zeros,
where ∑1 is a diagonal matrix whose diagonal elements are σ1 and σ2, and the
order of ∑1 is R × R, where R is the rank of the matrix; thus the order of ∑1 is
2 × 2 (since R = 2).
Therefore, ∑ = [ √3  0 ; 0  √2 ; 0  0 ]; we add zeros in the 3rd row because the
order of ∑ must be equal to the order of A, which is 3 × 2.
Step-3 : Compute U
Calculate A·A^T = [ 1  1 ; 0  1 ; -1  1 ] × [ 1  0  -1 ; 1  1  1 ] = [ 2  1  0 ; 1  1  1 ; 0  1  2 ]
Now, find the eigen values and eigen vectors of A·A^T = [ 2  1  0 ; 1  1  1 ; 0  1  2 ],
as we did earlier for the matrix V.
The procedure for finding the eigen values (λ) of A·A^T is as follows.
Let λ³ - S1λ² + S2λ - S3 = 0 be the characteristic equation of A·A^T, where
S1 = Trace of A·A^T = Tr(A·A^T) = (2 + 1 + 2) = 5
Note : the trace is the sum of the diagonal elements.
S2 = Sum of the minors of the diagonal elements of A·A^T
S2 = det[ 1  1 ; 1  2 ] + det[ 2  0 ; 0  2 ] + det[ 2  1 ; 1  1 ] = 1 + 4 + 1 = 6
And
S3 = Determinant of A·A^T = det[ 2  1  0 ; 1  1  1 ; 0  1  2 ] = 2(1) - 1(2) + 0 = 0
So the characteristic equation is λ³ - 5λ² + 6λ = 0, which gives the eigen values
λ = 3, 2 and 0.
For λ = 3, let X1 = [ a ; b ; c ] be the eigen vector of A·A^T; then (A·A^T) X1 = λX1.
Thus, (A·A^T) X1 = λX1 => [ 2  1  0 ; 1  1  1 ; 0  1  2 ] [ a ; b ; c ] = 3 [ a ; b ; c ];
solving the expression we get
-a + b + 0·c = 0 -----------------------(1’)
a - 2b + c = 0 ------------------------(2’)
0·a + b - c = 0 ------------------------(3’)
Solving equations (1’), (2’) and (3’) we get X1 = [ 1 ; 1 ; 1 ]
Now we need to normalise this eigen vector X1, and for this we divide each
element of X1 by the length of X1.
Therefore, N(X1), i.e. the normalised eigen vector X1 = [ 1/√3 ; 1/√3 ; 1/√3 ]
Similarly, X2 = [ 1 ; 0 ; -1 ]; therefore N(X2), the normalised eigen vector X2 = [ 1/√2 ; 0 ; -1/√2 ]
Similarly, X3 = [ 1 ; -2 ; 1 ]; therefore N(X3), the normalised eigen vector X3 = [ 1/√6 ; -2/√6 ; 1/√6 ]
Now, matrix U can be written using N(X1), N(X2) and N(X3), i.e.
U = [ 1/√3  1/√2  1/√6 ; 1/√3  0  -2/√6 ; 1/√3  -1/√2  1/√6 ]
Since U ∑ V^T is the Singular Value Decomposition (SVD) of the matrix A, we
substitute the values of U, ∑ and V^T to obtain the SVD of A:
A = [ 1  1 ; 0  1 ; -1  1 ] = U ∑ V^T
  = [ 1/√3  1/√2  1/√6 ; 1/√3  0  -2/√6 ; 1/√3  -1/√2  1/√6 ] × [ √3  0 ; 0  √2 ; 0  0 ] × [ 0  1 ; 1  0 ]
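The hand-computed decomposition can be verified numerically; columns of U and V returned by the library may differ from the computation above by a sign, which does not change the product.

# Hedged NumPy check of Problem-03: the singular values of A are sqrt(3) and sqrt(2).
import numpy as np

A = np.array([[1, 1], [0, 1], [-1, 1]], dtype=float)

U, s, Vt = np.linalg.svd(A, full_matrices=True)
print(np.round(s, 4))                     # [1.7321 1.4142], i.e. sqrt(3) and sqrt(2)

Sigma = np.zeros((3, 2))
Sigma[:2, :2] = np.diag(s)
print(np.allclose(A, U @ Sigma @ Vt))     # True: the decomposition reconstructs A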
13.6 SUMMARY
In this unit we learned about the concept of dimensionality reduction, covering
the basics of feature selection and feature extraction techniques. This was
followed by an explicit discussion of Principal Component Analysis (PCA),
Linear Discriminant Analysis (LDA) and Singular Value Decomposition (SVD).