
Unit –II : Feature Engineering

CO2: Apply various data pre-processing techniques to simplify and speed up


machine learning algorithms
2
Unit –II : Feature Engineering

● Part-1 Concept of Feature, Preprocessing of data: Normalization and Scaling, Standardization,


Managing missing values,
● Part-2 Introduction to Dimensionality Reduction, Principal Component Analysis (PCA), Feature
Extraction: Kernel PCA, Local Binary Pattern.
● Part-3 Introduction to various Feature Selection Techniques, Sequential Forward Selection,
Sequential Backward Selection.
● Part-4 Statistical feature engineering: Mean, Median, Mode etc. based feature vector creation.
● Part-5 Multidimensional Scaling, Matrix Factorization Techniques.
3
Feature Engineering

● Feature engineering is the pre-processing step of machine learning, which is used to transform raw data into features that can be used for creating a predictive model using machine learning or statistical modelling.
● Feature engineering in machine learning aims to improve the performance of models.

4
https://www.javatpoint.com/feature-engineering-for-machine-learning
Feature Engineering

What is a Feature?

● Generally, all machine learning algorithms take input data to


generate the output.
● The input data is usually in a tabular form consisting of rows (instances or observations) and columns (variables or attributes), and these attributes are often known as features.
● For example, an image is an instance in computer vision, but a line in
the image could be the feature.
5
https://www.javatpoint.com/feature-engineering-for-machine-learning
Feature Engineering

What is Feature Engineering?

● Feature engineering is the pre-processing step of machine learning,


which extracts features from raw data.
● It helps to improve the accuracy of the model for unseen data.
● The predictive model contains predictor variables and an outcome variable; the feature engineering process selects the most useful predictor variables for the model.

https://www.javatpoint.com/feature-engineering-for-machine-learning
Feature Engineering

What is Feature Engineering?

7
https://www.javatpoint.com/feature-engineering-for-machine-learning
Feature Engineering

Feature Creation
Feature Engineering Process:

Transformations

Feature Extraction

Feature Selection

8
https://www.javatpoint.com/feature-engineering-for-machine-learning
Feature Engineering
Feature Engineering Process:
Feature Creation
● Feature creation is finding the most useful
variables to be used in a predictive model.
Transformations

Feature Extraction

Feature Selection

9
https://www.javatpoint.com/feature-engineering-for-machine-learning
Feature Engineering
Feature Engineering Process:
Feature Creation
Transformations
● The transformation step of feature engineering involves adjusting the predictor variables to improve the accuracy and performance of the model. For example, it ensures that the model is flexible to take input of a variety of data; it ensures that all the variables are on the same scale, making the model easier to understand.
Feature Extraction
Feature Selection
10
https://www.javatpoint.com/feature-engineering-for-machine-learning
Feature Engineering
Feature Engineering Process:
Feature Creation
Transformations
Feature Extraction
● Feature extraction is an automated feature engineering process that generates new variables by extracting them from the raw data.
● Feature extraction methods include cluster analysis, text analytics, edge detection algorithms, and principal components analysis (PCA).
Feature Selection
11
https://www.javatpoint.com/feature-engineering-for-machine-learning
Feature Engineering
Feature Engineering Process:
Feature Creation
Transformations
Feature Extraction
Feature Selection
● Feature selection is a way of selecting the subset of the most relevant features from the original feature set by removing the redundant, irrelevant, or noisy features.
12
https://www.javatpoint.com/feature-engineering-for-machine-learning
Feature Engineering
Benefits of Feature Engineering:

● It helps in avoiding the curse of dimensionality.


● It helps in the simplification of the model so that the researchers can
easily interpret it.
● It reduces the training time.
● It reduces overfitting hence enhancing the generalization.

13
https://www.javatpoint.com/feature-engineering-for-machine-learning
Unit –II : Feature Engineering

● Part-1 Concept of Feature, Preprocessing of data: Normalization and Scaling, Standardization,


Managing missing values
● Part-2 Introduction to Dimensionality Reduction, Principal Component Analysis (PCA), Feature
Extraction: Kernel PCA, Local Binary Pattern.
● Part-3 Introduction to various Feature Selection Techniques, Sequential Forward Selection,
Sequential Backward Selection.
● Part-4 Statistical feature engineering: Mean, Median, Mode etc. based feature vector creation.
● Part-5 Multidimensional Scaling, Matrix Factorization Techniques.

14
Part-1
Preprocessing of data:
Concept of Feature, Normalization and Scaling, Standardization, Managing missing values

15
https://www.youtube.com/watch?v=AOfzlVi-NJs
Part-1
Preprocessing of data:
Concept of Feature, Normalization and Scaling, Standardization, Managing missing values

16
https://www.youtube.com/watch?v=AOfzlVi-NJs
Part-1
Preprocessing of data:
Concept of Feature, Normalization and Scaling, Standardization, Managing missing values

Scaling:

● Numerical features of a dataset do not share a fixed range; they differ from one another.
● We can't expect the age and income columns to have the same range.
● But from a machine learning point of view, how can these two columns be compared?
● After a scaling process, the continuous features become identical in terms of range.

17
Part-1
Preprocessing of data:
Concept of Feature, Normalization and Scaling, Standardization, Managing missing values

Scaling:

Two common ways of scaling:

1) Normalization

X_norm = (X - X_min) / (X_max - X_min)

Normalization (or min-max normalization) scales all values into a fixed range between 0 and 1.

data = {'value': [2, 45, -23, 85, 28, 2, 35, -12]}

      value   normalized
0       2       0.23
1      45       0.63
2     -23       0.00
3      85       1.00
4      28       0.47
5       2       0.23
6      35       0.54
7     -12       0.10
18
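A minimal sketch of the min-max calculation on the data above, using pandas (scikit-learn's MinMaxScaler, shown in the comment, gives the same result):

```python
import pandas as pd

df = pd.DataFrame({'value': [2, 45, -23, 85, 28, 2, 35, -12]})   # data from the slide

# X_norm = (X - X_min) / (X_max - X_min)
x_min, x_max = df['value'].min(), df['value'].max()
df['normalized'] = (df['value'] - x_min) / (x_max - x_min)
print(df.round(2))

# Equivalent with scikit-learn:
# from sklearn.preprocessing import MinMaxScaler
# df['normalized'] = MinMaxScaler().fit_transform(df[['value']])
```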
Part-1
Preprocessing of data:
Concept of Feature, Normalization and Scaling, Standardization, Managing missing values

Scaling:

2) Standardization or Z-Score Normalization:

Transformation of features by subtracting the mean and dividing by the standard deviation.

This is often called the Z-score.
X_new = (X - mean) / standard deviation

Standardisation is more robust to outliers, and in many cases it is preferable over min-max normalization.

19
Part-1
Preprocessing of data:
Concept of Feature, Normalization and Scaling, Standardization, Managing missing values

Scaling:

2) Standardization or Z-Score Normalization:

The standard deviation formula may look confusing, but it will make sense after we break it down.
Step 1: Find the mean.
Step 2: For each data point, find the square of its distance to the mean.
Step 3: Sum the values from Step 2.
Step 4: Divide by the number of data points.
Step 5: Take the square root.
20
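A short sketch that follows the five steps above and then computes the z-score (this uses the population standard deviation, i.e. dividing by the number of data points, as scikit-learn's StandardScaler also does):

```python
import numpy as np

x = np.array([2, 45, -23, 85, 28, 2, 35, -12], dtype=float)   # same data as before

mean = x.mean()                           # Step 1: find the mean
sq_dist = (x - mean) ** 2                 # Step 2: squared distance of each point to the mean
std = np.sqrt(sq_dist.sum() / len(x))     # Steps 3-5: sum, divide by n, take the square root

z = (x - mean) / std                      # X_new = (X - mean) / standard deviation
print(round(mean, 2), round(std, 2))
print(z.round(2))
```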
Part-1
Preprocessing of data:
Concept of Feature, Normalization and Scaling, Standardization, Managing missing values

Scaling: 2) Standardization or Z-Score Normalization:

21
Part-1
Preprocessing of data:
Concept of Feature, Normalization and Scaling, Standardization, Managing missing values

Scaling: 2) Standardization or Z-Score Normalization:

University Question:

Consider a vector x = (23, 29, 52, 31, 45, 19, 18, 27) Apply feature scaling and find out min-max scaled
values as well as z-score values.

22
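A sketch that answers the question numerically (assuming the population standard deviation for the z-scores):

```python
import numpy as np

x = np.array([23, 29, 52, 31, 45, 19, 18, 27], dtype=float)

minmax = (x - x.min()) / (x.max() - x.min())   # min-max scaled values
z = (x - x.mean()) / x.std()                   # z-score values (ddof=0)

print("min-max:", minmax.round(3))
print("z-score:", z.round(3))
```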
Part-1
Preprocessing of data:
Concept of Feature, Normalization and Scaling, Standardization, Managing missing values

Scaling:

2) Standardization or Z-Score Normalization:

● A z-score tells us how many standard deviations away a value is from the mean.
● If a value has a z-score equal to 0, then the value is equal to the mean.
● If a value has a z-score equal to -1.3, then the value is 1.3 standard deviations below the mean.
● If a value has a z-score equal to 2.2, then the value is 2.2 standard deviations above the mean.
23
Part-1
Preprocessing of data:
Concept of Feature, Normalization and Scaling, Standardization, Managing missing values

24
Part-1
Preprocessing of data:
Concept of Feature, Normalization and Scaling, Standardization, Managing missing values

Even when the data is in the correct format, some values may be missing. Suppose we have student data and one of the features is the landline contact number. For some records this value is missing, simply because some students do not have a landline at home.
Another case of missing data arises from the data-collection process itself. For example, if student data was initially collected for only one region, the pincode may not have been recorded; later, when the dataset is expanded to cover all regions, the pincode, which is now a necessary feature, is missing.

25
Part-1
Preprocessing of data:
Concept of Feature, Normalization and Scaling, Standardization, Managing missing values

Types of missing data:


1. Missing Completely at Random (MCAR)
2. Missing at Random (MAR)
3. Missing Not at Random (MNAR)

26
Part-1
Preprocessing of data:
Concept of Feature, Normalization and Scaling, Standardization, Managing missing values

1. Missing Completely at Random (MCAR)


● There’s no relationship between whether a data
point is missing and any values in the data set
● For example, the thermometer cannot measure
temperature as it has been damaged. So
temperature data is missing.
● The missing data are nothing but a random
subset of the data.
● Other variables are not affected by the
missingness.
● It rarely happens that data is MCAR.
27
Part-1
Preprocessing of data:
Concept of Feature, Normalization and Scaling, Standardization, Managing missing values

2.Missing at Random (MAR)


● Missing at Random means the data is missing
relative to the observed data.
● It is not related to the specific missing values.
● For example, a student cannot take admission
because his/her score is less than the merit
score.
● The data is not missing across all observations
but only within sub-samples of the data.
● We could easily notice that IQ score is missing
for youngsters (<40)
28
Part-1
Preprocessing of data:
Concept of Feature, Normalization and Scaling, Standardization, Managing missing values

3.Missing Not at Random (MNAR)


● It is neither MCAR nor MAR; the data is missing based on the values of the missing column itself.

● The MNAR category applies when the missing


data has a structure to it. In other words, there
appear to be reasons the data is missing.

● For example, IQ scores are missing only for the people who have a low score.

29
Part-1
Preprocessing of data:
Concept of Feature, Normalization and Scaling, Standardization, Managing missing values

So how can we handle missing data?

Some ways to handle missing data are,


● Deleting the record with missing data
● Replacing missing data with constant
● Imputation.(Mean, Median etc for numerical variables.)

30
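A hedged sketch of the three options with pandas and scikit-learn; the student DataFrame below is made up for illustration:

```python
import pandas as pd
from sklearn.impute import SimpleImputer

df = pd.DataFrame({'age':   [18, 19, None, 20, 18],
                   'score': [75, None, 82, 90, None]})   # hypothetical data with gaps

dropped  = df.dropna()          # 1) delete records with missing data
constant = df.fillna(0)         # 2) replace missing data with a constant

# 3) imputation with the mean (use strategy='median' for the median)
imputed = pd.DataFrame(SimpleImputer(strategy='mean').fit_transform(df),
                       columns=df.columns)
print(dropped, constant, imputed, sep='\n\n')
```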
Unit –II : Feature Engineering

● Part-1 Concept of Feature, Preprocessing of data: Normalization and Scaling, Standardization,


Managing missing values
● Part-2 Introduction to Dimensionality Reduction, Principal Component Analysis (PCA), Feature Extraction: Kernel PCA, Local Binary Pattern.
● Part-3 Introduction to various Feature Selection Techniques, Sequential Forward Selection,
Sequential Backward Selection.
● Part-4 Statistical feature engineering: Mean, Median, Mode etc. based feature vector creation.
● Part-5 Multidimensional Scaling, Matrix Factorization Techniques.

31
Introduction to Dimensionality Reduction

● Dimensionality reduction is the process of reducing the number of random variables under
consideration, by obtaining a set of principal variables.

● In machine learning classification problems, there are often too many factors on the basis of which
the final classification is done.
● These factors are basically variables called features.
● The higher the number of features, the harder it gets to visualize the training set and then work on
it.
● Sometimes, most of these features are correlated, and hence redundant. This is where dimensionality
reduction algorithms come into play.

32
https://www.geeksforgeeks.org/dimensionality-reduction/
Introduction to Dimensionality Reduction

33
https://www.geeksforgeeks.org/dimensionality-reduction/
Introduction to Dimensionality Reduction

There are two components of dimensionality reduction:


Selection: Choosing a subset of the original pool of features.
Extraction: Getting useful features from existing data.

34
https://www.geeksforgeeks.org/dimensionality-reduction/
Why Feature Selection?

1
Introduction to Dimensionality Reduction

Feature selection: In this, we try to find a subset of the original set of variables, or features, to get a
smaller subset which can be used to model the problem. It usually involves three ways:

○ Filter
○ Wrapper
○ Embedded

36
https://www.geeksforgeeks.org/dimensionality-reduction/
Introduction to Dimensionality Reduction : Feature selection

Filter Method:

○ These methods use statistical measures to rank


features based on their relevance to the target
variable.
○ Features with high scores are considered more
important.
○ Common filter methods include Pearson correlation,
Chi-square test, and Information Gain
37
https://www.geeksforgeeks.org/dimensionality-reduction/
Introduction to Dimensionality Reduction : Feature selection

Wrapper Method:

○ In wrapper methodology, selection of features is


done by considering it as a search problem, in
which different combinations are made, evaluated,
and compared with other combinations.
○ It trains the algorithm by using the subset of
features iteratively.
○ Forward Selection, Backward Elimination etc.
38
https://www.geeksforgeeks.org/dimensionality-reduction/
Introduction to Dimensionality Reduction : Feature selection

Embedded Method:

○ These approaches combine feature selection with


the model training process.
○ The model itself decides which features are
essential and which ones can be discarded.
○ Lasso and Ridge regression are examples of
embedded methods.

39
https://www.geeksforgeeks.org/dimensionality-reduction/
Introduction to Dimensionality Reduction : Feature selection

40
https://www.geeksforgeeks.org/dimensionality-reduction/
Introduction to Dimensionality Reduction : Feature selection

41
https://www.geeksforgeeks.org/dimensionality-reduction/
Introduction to Dimensionality Reduction : Feature selection

Filter method | Wrapper method
Measures the relevance of features with the dependent variable | Measures the usefulness of a subset of features
This method is fast and computationally less expensive | This method is slow and computationally more expensive
Useful for large datasets | Useful for small datasets
Might fail to find the best subset of features | Always provides the best subset of features
Avoids overfitting | Prone to overfitting


42
https://www.geeksforgeeks.org/dimensionality-reduction/
Introduction to Dimensionality Reduction

Feature Extraction: By finding a smaller set of new variables, each being a combination of the input
variables, containing basically the same information as the input variables.
The various methods used for dimensionality reduction include:
● Principal Component Analysis (PCA)
● Kernel PCA
● Linear Discriminant Analysis (LDA)

https://www.analyticsvidhya.com/blog/2021/04/guide-for-feature-extraction-techniques/

43
https://www.geeksforgeeks.org/dimensionality-reduction/
Introduction to Dimensionality Reduction

Methods of Dimensionality Reduction


The various methods used for dimensionality reduction include:
● Principal Component Analysis (PCA)
○ Unsupervised algorithm useful for dimensionality reduction.
● Linear Discriminant Analysis (LDA)
● Generalized Discriminant Analysis (GDA)

44
https://www.geeksforgeeks.org/dimensionality-reduction/
Introduction to Dimensionality Reduction

Methods of Dimensionality Reduction


The various methods used for dimensionality reduction include:
● Principal Component Analysis (PCA)
● Linear Discriminant Analysis (LDA)
○ Projects data in such a way that separability is maximised.
● Generalized Discriminant Analysis (GDA)

45
https://www.geeksforgeeks.org/dimensionality-reduction/
Introduction to Dimensionality Reduction

Methods of Dimensionality Reduction


The various methods used for dimensionality reduction include:
● Principal Component Analysis (PCA)
● Linear Discriminant Analysis (LDA)
● Generalized Discriminant Analysis (GDA)
○ It is effective approach for extracting nonlinear features

46
https://www.geeksforgeeks.org/dimensionality-reduction/
Overfitting and Underfitting
(Figures: two Y-vs-X plots illustrating an overfitted and an underfitted model.)
(Example: an object that is a sphere, can be played with, cannot be eaten, and has a radius of 5 cm is a Ball; is a 10 cm sphere still a ball?)
Principal Component Analysis
(Figure: 2-D data scatter with its principal axes PC1 and PC2.)
Covariance Matrix
(Figure: covariance matrix formula; not reproduced.)
Eigenvalues and Eigenvectors
● For a square matrix A, a non-zero vector x is called an eigenvector if multiplication by A results in a scalar multiple of x:

A x = λ x

● The scalar λ is called the eigenvalue associated with the eigenvector.

● There exist n eigenvalues (λ1, λ2, ..., λn) for an n × n matrix. The eigenvalues are calculated by using the formula:

det(A − λ I) = 0

● where A is the covariance matrix and I is the identity matrix.

● The determinant of the resulting matrix yields a polynomial of order n.
Eigenvalues and Eigenvectors
● The determinant of the resulting matrix yields a polynomial of order n.

● By setting this polynomial equal to zero and solving for λ, the desired eigenvalues are generated.

● Here n solutions are generated; it means that n eigenvalues are derived.

● It is not essential that all eigenvalues are unique.
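A small NumPy sketch of the above: the eigenvalues and eigenvectors of a covariance matrix computed from a made-up 2-D dataset:

```python
import numpy as np

# Hypothetical 2-D data; rows are samples, columns are features
X = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9],
              [1.9, 2.2], [3.1, 3.0], [2.3, 2.7]])

A = np.cov(X, rowvar=False)           # covariance matrix
eigvals, eigvecs = np.linalg.eig(A)   # solves det(A - lambda*I) = 0

print(eigvals)                        # n eigenvalues (not necessarily unique or sorted)
print(eigvecs)                        # columns are the corresponding eigenvectors
print(np.allclose(A @ eigvecs[:, 0], eigvals[0] * eigvecs[:, 0]))   # checks A x = lambda x
```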


Eigenvalues and Eigenvectors for PCA
(Slides: step-by-step working of PCA shown as figures; not reproduced.)
PCA in Face Recognition
Python code PCA
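The slide's original code is not reproduced in this extract; below is a minimal sketch of PCA via the covariance matrix and its eigendecomposition, cross-checked against scikit-learn (component signs may differ, hence the absolute values):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3)) * [5.0, 2.0, 0.5]   # toy data with unequal variances

Xc = X - X.mean(axis=0)                  # 1) center the data
C = np.cov(Xc, rowvar=False)             # 2) covariance matrix
eigvals, eigvecs = np.linalg.eigh(C)     #    eigendecomposition (symmetric matrix)
order = np.argsort(eigvals)[::-1]        # 3) sort by decreasing eigenvalue
eigvecs = eigvecs[:, order]

k = 2
Z = Xc @ eigvecs[:, :k]                  # 4) project onto the top-k principal components

Z_sk = PCA(n_components=k).fit_transform(X)
print(np.allclose(np.abs(Z), np.abs(Z_sk)))   # same projection up to sign
```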
Kernel Trick
https://www.youtube.com/watch?v=vMmG_7JcfIc
(Slides 82-93: figures illustrating the kernel trick; images not reproduced.)
Kernel Trick
● Let us say that we have two points, x= (2, 3, 4) and y= (3, 4, 5)
● As we have seen, K(x, y) = < f(x), f(y) >.
● Let us first calculate < f(x), f(y) >
○ f(x)=(x1x1, x1x2, x1x3, x2x1, x2x2, x2x3, x3x1, x3x2, x3x3)
○ f(y)=(y1y1, y1y2, y1y3, y2y1, y2y2, y2y3, y3y1, y3y2, y3y3)
● so,
○ f(2, 3, 4)=(4, 6, 8, 6, 9, 12, 8, 12, 16)and
○ f(3 ,4, 5)=(9, 12, 15, 12, 16, 20, 15, 20, 25)
● so the dot product,
○ f (x). f (y) = f(2,3,4) . f(3,4,5)=
○ (36 + 72 + 120 + 72 +144 + 240 + 120 + 240 + 400)=1444
94
Kernel Trick
● Another way x= (2, 3, 4) and y= (3, 4, 5)
● K(x, y) =
○ (2*3 + 3*4 + 4*5) ^2
○ =(6 + 12 + 20)^2
○ =38*38
○ =1444.

95
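A quick NumPy check of the two calculations above:

```python
import numpy as np

x = np.array([2, 3, 4])
y = np.array([3, 4, 5])

def f(v):
    # explicit mapping to 9 dimensions: all pairwise products v_i * v_j
    return np.outer(v, v).ravel()

explicit = f(x) @ f(y)        # <f(x), f(y)> computed in the 9-dimensional space
kernel   = (x @ y) ** 2       # kernel K(x, y) = (x . y)^2 computed in the original space

print(explicit, kernel)       # both print 1444
```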
Why Kernel Trick?
● Thus, as we found out, f(x)·f(y) and K(x, y) give us the same result.

● The first method required a lot of calculations (because of projecting 3 dimensions into 9 dimensions).

● Using the kernel, it was much easier.

96
Types of Kernel

● Linear Kernel
● Polynomial Kernel
● Exponential Kernel
● Gaussian Kernel
● Sigmoid Kernel
● And Many more
97
Types of Kernel - Linear Kernel

● Let us say that we have two vectors named x1 and x2; then

the linear kernel is defined by the dot product of these two

vectors:

● K(x1, x2) = x1 . x2

98
Types of Kernel - Polynomial Kernel

● A polynomial kernel is defined by the following equation:

● K(x1, x2) = (x1 · x2 + 1)^d,

● Where, d is the degree of the polynomial and x1 and x2 are

vectors

99
Types of Kernel - Gaussian Kernel
● This kernel is an example of a radial basis function (RBF) kernel: K(x1, x2) = exp(−‖x1 − x2‖² / (2σ²)).

● The given sigma plays a very important role in the performance of the Gaussian kernel and should neither be overestimated nor underestimated; it should be carefully tuned according to the problem.
100
Types of Kernel - Exponential Kernel
● This is in close relation with the previous kernel i.e. the

Gaussian kernel with the only difference is – the square of the

norm is removed.

● The exponential kernel is defined as: K(x1, x2) = exp(−‖x1 − x2‖ / (2σ²))

101
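Minimal implementations of the kernels listed above (the sigma and degree values are assumptions to be tuned per problem):

```python
import numpy as np

def linear_kernel(x1, x2):
    return x1 @ x2

def polynomial_kernel(x1, x2, d=2):
    return (x1 @ x2 + 1) ** d

def gaussian_kernel(x1, x2, sigma=1.0):      # RBF kernel
    return np.exp(-np.sum((x1 - x2) ** 2) / (2 * sigma ** 2))

def exponential_kernel(x1, x2, sigma=1.0):   # like the Gaussian, without the squared norm
    return np.exp(-np.linalg.norm(x1 - x2) / (2 * sigma ** 2))

x1, x2 = np.array([2.0, 3.0, 4.0]), np.array([3.0, 4.0, 5.0])
for k in (linear_kernel, polynomial_kernel, gaussian_kernel, exponential_kernel):
    print(k.__name__, k(x1, x2))
```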
Linear PCA vs. Kernel PCA
(Figure: comparison of linear PCA and kernel PCA projections; not reproduced.)
Kernel PCA
● PCA is a linear method. It works great for linearly separable
datasets.
● However, if the dataset has non-linear relationships, then it
produces undesirable results.
● Kernel PCA is a technique which uses the so-called kernel trick
and projects the linearly inseparable data into a higher
dimension where it is linearly separable.
● There are various kernels that are popularly used; some of them
are linear, polynomial, RBF, and sigmoid.
Kernel PCA
Step 1:
● First choose a kernel function k(x_i, x_j), and let T be the corresponding transformation to a higher dimension.
Step 2:
● Find the covariance matrix of the data; here, the kernel function is used to calculate this matrix. It is the matrix that results from applying the kernel function to all pairs of data points:

K = T(X) T(X)^T
Kernel PCA
Step 3:

● Center the kernel matrix (this is equivalent to subtracting the mean of the transformed data):

● K_new = K − 1_n K − K 1_n + 1_n K 1_n

● where 1_n is the n × n matrix whose elements are all equal to 1/n (n being the number of samples).
Kernel PCA
Step 4:
● Find the eigenvectors and eigenvalues of this matrix.
● Sort the eigenvectors by their corresponding eigenvalues in decreasing order.
● Choose the number of dimensions needed in the reduced dataset; let's call it k.
● Take the first k eigenvectors and concatenate them into one matrix.
● Finally, calculate the product of that matrix with your data. The result will be the reduced dataset.
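A hedged sketch of Steps 1-4 with an RBF kernel in NumPy, compared with scikit-learn's KernelPCA (the gamma value and toy data are assumptions; signs of the components may differ):

```python
import numpy as np
from sklearn.decomposition import KernelPCA
from sklearn.metrics.pairwise import rbf_kernel

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))                      # toy data
gamma, k = 0.5, 2

K = rbf_kernel(X, X, gamma=gamma)                 # Steps 1-2: kernel matrix
n = K.shape[0]
one_n = np.full((n, n), 1.0 / n)
K_c = K - one_n @ K - K @ one_n + one_n @ K @ one_n   # Step 3: center the kernel matrix

eigvals, eigvecs = np.linalg.eigh(K_c)            # Step 4: eigendecomposition
idx = np.argsort(eigvals)[::-1][:k]               # top-k eigenvectors
Z = eigvecs[:, idx] * np.sqrt(eigvals[idx])       # reduced dataset

Z_sk = KernelPCA(n_components=k, kernel='rbf', gamma=gamma).fit_transform(X)
print(np.allclose(np.abs(Z), np.abs(Z_sk), atol=1e-6))
```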
Local Binary Pattern
● Local Binary Pattern (LBP) is a simple but powerful technique

used in machine learning to analyze textures and patterns in

images.

● It's especially useful for tasks like image classification, face

recognition, and texture analysis.

1
Local Binary Pattern
● It combined statistical and structural methods and was first

described in 1994.

● The Local Binary Pattern is a technique of local

representation of a picture.

● It comprises relative values by comparing each pixel with its

neighboring pixels.
1
Local Binary Pattern
(Figure: 3×3 pixel neighbourhood and its thresholded binary values; not reproduced.)
Local Binary Pattern

● Binary value = 11100010

● Decimal value = 226
● This decimal value (226) is the LBP label assigned to the central pixel; it encodes the comparison of the eight neighbouring pixels with the central value.
Local Binary Pattern

● Binary value = 11100010


● Decimal Value = 226

1
Local Binary Pattern
The original LBP operator labels the pixels of an image with decimal
numbers, called Local Binary Patterns which encode the local
structure around each pixel.

1. Each pixel is compared with its eight neighbors in a 3x3


neighborhood by subtracting the center pixel value.
2. The resulting strictly negative values are encoded with 0 and
the others with 1.
3. A binary number is obtained by concatenating all these binary
codes in a clockwise direction starting from the top-left one
and its corresponding decimal value is used for labeling. 1
The LBP descriptor is defined as a grey-scale invariant texture
measure derived from a general definition of texture in a local
neighborhood.

115
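A small sketch of the three steps above for a single 3×3 neighbourhood; the pixel values are made up, but chosen so that the result reproduces the slide's example (binary 11100010, decimal 226):

```python
import numpy as np

def lbp_code(patch):
    # LBP label of the centre pixel of a 3x3 neighbourhood
    c = patch[1, 1]
    # neighbours in clockwise order, starting from the top-left pixel
    order = [(0, 0), (0, 1), (0, 2), (1, 2), (2, 2), (2, 1), (2, 0), (1, 0)]
    # strictly negative differences -> 0, others -> 1
    bits = ''.join('1' if patch[i, j] - c >= 0 else '0' for i, j in order)
    return int(bits, 2), bits

patch = np.array([[90, 80, 70],
                  [20, 60, 50],
                  [65, 30, 40]])     # hypothetical grey values, centre pixel = 60

print(lbp_code(patch))               # (226, '11100010')
```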
Unit –II : Feature Engineering

● Part-1 Concept of Feature, Preprocessing of data: Normalization and Scaling, Standardization,


Managing missing values
● Part-2 Introduction to Dimensionality Reduction, Principal Component Analysis (PCA), Feature Extraction: Kernel PCA, Local Binary Pattern.
● Part-3 Introduction to various Feature Selection Techniques, Sequential Forward Selection,
Sequential Backward Selection.
● Part-4 Statistical feature engineering: Mean, Median, Mode etc. based feature vector creation.
● Part-5 Multidimensional Scaling, Matrix Factorization Techniques.

139
Feature Selection

• Find a smaller subset of a many-dimensional data set to create a data model.
• Find the k features out of the d dimensions that give us the most information and discard the other (d − k) dimensions.
• Subset selection is one of the most widely used methods.
Forward Selection
o It starts with no variables (the null model).
o At each step, it adds one feature that has not been considered before.
o After adding each feature, the error is checked.
o The process continues until it finds the subset of features that decreases the error the most, or until any further addition does not decrease the error (a code sketch covering both forward and backward selection follows the Backward Elimination slide below).
Algorithm -Forward Selection
Algorithm -Backward Elimination
1. Start with F containing all features
2. Remove one attribute from F that causes the least error

3. Stop if removing a feature does not decrease the error


Comment
The complexity of backward search has the same order of
complexity as forward search, except that training a system with
more features is costlier than training a system with fewer
features, and forward search may be preferable especially if we
expect many useless features.
148
149
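A sketch of both procedures with scikit-learn's SequentialFeatureSelector (the estimator and dataset are just examples):

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
estimator = LogisticRegression(max_iter=1000)

# Forward selection: start from the null model and add one feature at a time
sfs = SequentialFeatureSelector(estimator, n_features_to_select=2,
                                direction='forward').fit(X, y)
# Backward elimination: start from all features and remove one at a time
sbs = SequentialFeatureSelector(estimator, n_features_to_select=2,
                                direction='backward').fit(X, y)

print("forward :", sfs.get_support())
print("backward:", sbs.get_support())
```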
Unit –II : Feature Engineering

● Part-1 Concept of Feature, Preprocessing of data: Normalization and Scaling, Standardization,


Managing missing values
● Part-2 Introduction to Dimensionality Reduction, Principal Component Analysis (PCA), Feature Extraction: Kernel PCA, Local Binary Pattern.
● Part-3 Introduction to various Feature Selection Techniques, Sequential Forward Selection,
Sequential Backward Selection.
● Part-4 Statistical feature engineering: Mean, Median, Mode etc. based feature vector creation.
● Part-5 Multidimensional Scaling, Matrix Factorization Techniques.

150
Statistical Feature Engineering
Feature engineering refers to a process of selecting & transforming variables/features in
your dataset when creating a predictive model using machine learning.
• Therefore you have to extract the features from the raw dataset you have collected before
training your data in machine learning algorithms.
• Feature engineering has two goals:
– Preparing the proper input dataset, compatible with the machine learning algorithm
requirements.
– Improving the performance of machine learning models.

151
Feature Vector Creation
● There are various techniques for creating feature vectors, each
tailored to different types of data and tasks.
● Here are some common techniques for creating feature vectors
○ Counter based
○ Mean
○ Median
○ Mode
● Other are: LBP, MDS, Label encoding, One hot Encoding,TF IDF,
Histograms etc. 1
Counter based Vectorization

1
Counter based Vectorization
● Counter-based feature vector creation is a technique that
involves counting the occurrences of certain elements or
events in a dataset and representing these counts as features
in a vector format.
● This technique is commonly used in natural language
processing (NLP) for text analysis, where words or phrases are
counted to create feature vectors
1
Counter based Vectorization
● Original Data:
○ Review 1: "The product is great and durable."
○ Review 2: "I am satisfied with this purchase."
○ Review 3: "Not worth the money, very disappointed."
● consider the words "product," "great," "durable," "satisfied," "purchase,"
"worth," "money," and "disappointed" as our vocabulary.
● Arrange in Alphabetical Order

1
Counter based Vectorization

● consider the words "product," "great," "durable," "satisfied," "purchase,"


"worth," "money," and "disappointed" as our vocabulary.
● "disappointed", "durable”, "great", "money", "product", “purchase",
"satisfied", “worth”
● Feature Vectors:
○ Review 1 Feature Vector: [0,1,1,0,1,0,0,0]
○ Review 2 Feature Vector: [0,0,0,0,0,1,1,0]
○ Review 3 Feature Vector: [1,0,0,1,0,0,0,1]
1
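A sketch with scikit-learn's CountVectorizer, restricted to the slide's eight-word vocabulary so that the vectors match the ones above:

```python
from sklearn.feature_extraction.text import CountVectorizer

reviews = ["The product is great and durable.",
           "I am satisfied with this purchase.",
           "Not worth the money, very disappointed."]

vocab = ["disappointed", "durable", "great", "money",
         "product", "purchase", "satisfied", "worth"]      # alphabetical order

vec = CountVectorizer(vocabulary=vocab)
X = vec.fit_transform(reviews)
print(X.toarray())
# [[0 1 1 0 1 0 0 0]
#  [0 0 0 0 0 1 1 0]
#  [1 0 0 1 0 0 0 1]]
```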

Mean-Based Feature Extraction
● Mean-based feature extraction involves taking the average of
specific attributes or measurements for a set of data points.
● This can be useful when you want to represent the typical or
average value of certain characteristics within a group.
● It creates a new feature that captures the overall performance of each instance.
● This can be useful in cases where you want to simplify the
representation of data or when the average behavior of a group is
of interest. 1
Mean based Vectorization
● Original Data:
○ Student 1: Math = 85, English = 75, Science = 90
○ Student 2: Math = 70, English = 80, Science = 85
○ Student 3: Math = 95, English = 92, Science = 88
● Calculate Mean of each instance
● Feature Vectors:
○ New Feature for Student 1: Mean Score = 83.33
○ New Feature for Student 2: Mean Score = 78.33
○ New Feature for Student 3: Mean Score = 91.67
1
Median-Based Feature Extraction
● Median-based feature extraction creates a feature that captures the middle value for a set of data points.
● The median is useful when you want to understand the typical or
central value while being less affected by extreme values
(outliers).
● It is particularly relevant when dealing with data that might have
outliers or skewed distributions, as it provides a more robust
measure of central tendency compared to the mean. 1
Median based Vectorization
● Original Data:
○ Student 1: Math = 85, English = 75, Science = 90
○ Student 2: Math = 70, English = 80, Science = 85
○ Student 3: Math = 95, English = 92, Science = 88
● Calculate the median of each instance
● Feature Vectors:
○ New Feature for Student 1: Median Score = 85
○ New Feature for Student 2: Median Score = 80
○ New Feature for Student 3: Median Score = 92
1
Mode-Based Feature Extraction
● It involves calculating the mode, which is the most frequently
occurring value, of certain attributes or measurements for a group
of data points
● The mode is useful when you want to identify the most common
attribute value in a dataset.
● Mode-based feature extraction is particularly relevant when
dealing with categorical data or discrete variables, where you're
interested in identifying the most typical or popular value within a
group. 1
Mode based Vectorization
● Original Data:
○ Student 1: Math = 85, English = 75, Science = 90
○ Student 2: Math = 70, English = 80, Science = 85
○ Student 3: Math = 95, English = 92, Science = 88
● Calculate the mode of each instance
● Feature Vectors:
○ New Feature for Student 1: Mode Score = None (No mode)
○ New Feature for Student 2: Mode Score = None (No mode)
○ New Feature for Student 3: Mode Score = None (No mode)
1
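A combined sketch of the mean-, median- and mode-based features for the student data above, using pandas:

```python
import pandas as pd

scores = pd.DataFrame({'Math':    [85, 70, 95],
                       'English': [75, 80, 92],
                       'Science': [90, 85, 88]},
                      index=['Student 1', 'Student 2', 'Student 3'])

features = pd.DataFrame({
    'mean':   scores.mean(axis=1),      # 83.33, 78.33, 91.67
    'median': scores.median(axis=1),    # 85, 80, 92
    # mode per row; None when every value occurs only once (no mode)
    'mode':   scores.apply(lambda r: r.mode().iloc[0]
                           if r.value_counts().max() > 1 else None, axis=1),
})
print(features)
```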
Multidimensional Scaling
● Multidimensional Scaling (MDS) is a way to show how different
things are from each other.
● Imagine you have things like colors, faces, or even opinions
about politics.
● MDS helps us see how similar or different these things are by
putting them on a graph.
● Things that are very similar are close together on the graph, and
things that are less similar are farther apart
1
Multidimensional Scaling
● MDS can also help us with a tricky problem.
● Imagine you have a lot of information about things, but it's hard
to understand because there's too much.
● MDS can simplify this by making the information simpler, like
turning a big puzzle into a smaller one.
● This smaller puzzle still keeps the important parts of the big
one.

1
Multidimensional Scaling
● The "multi" part means that MDS is not just for
two-dimensional pictures.
● It can work for 3D, 4D, or even more dimensions. This is like
having more layers in your graph.
● People use MDS in many different areas. It's like a tool that can
help us understand things better.

1
Multidimensional Scaling
● The term scaling comes from psychometrics, where abstract
concepts (“objects”) are assigned numbers according to a rule
● For example, you may want to quantify a person’s attitude to
global warming. You could assign a “1” to “doesn’t believe in
global warming”, a 10 to “firmly believes in global warming”
and a scale of 2 to 9 for attitudes in between.

1
Multidimensional Scaling
● You can also think of “scaling” as the fact that you’re essentially
scaling down the data (i.e. making it simpler by creating
lower-dimensional data).
● Data that is scaled down in dimension keeps similar properties.
For example, two data points that are close together in
high-dimensional space will also be close together in
low-dimensional space

1
Multidimensional Scaling
For example, if you had a list
of cities and only knew how
far apart they are, MDS
could help you create a map
that shows their distances
and positions, even if you
don't know exactly where
they are.

1
Multidimensional Scaling
Step 1: Assign a number of points to coordinates in n-dimensional
space.
● N-dimensional space could be 2-dimensional, 3-dimensional, or
higher spaces (at least, theoretically, because 4-dimensional
spaces and above are difficult to model). The orientation of the
coordinate axes is arbitrary and is mostly the researcher’s
choice.
● For maps like the one in the simple example above, axes that
represent north/south and east/west make the most sense
1
Multidimensional Scaling
Step 2: Calculate Euclidean distances for all pairs of points.

● The Euclidean distance is the “as the crow flies”


straight-line distance between two points x and y in
Euclidean space. It’s calculated using the Pythagorean
theorem (c² = a² + b²),
● although it becomes somewhat more complicated for
n-dimensional space. This results in the similarity matrix.

1
Multidimensional Scaling
Step 3: Compare the similarity matrix with the original input
matrix by evaluating the stress function.

● Stress is a goodness-of-fit measure, based on differences


between predicted and actual distances.
● In his original 1964 MDS paper, Kruskal wrote that fits close to
zero are excellent, while anything over 0.2 should be
considered “poor”.

1
Multidimensional Scaling
Step 4: Adjust coordinates, if necessary, to minimize stress.

1
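A sketch of metric MDS on a precomputed distance matrix with scikit-learn (the four "cities" and their distances are hypothetical):

```python
import numpy as np
from sklearn.manifold import MDS

# Hypothetical pairwise distances (km) between cities A, B, C and D
D = np.array([[  0, 300, 500, 400],
              [300,   0, 250, 350],
              [500, 250,   0, 200],
              [400, 350, 200,   0]])

mds = MDS(n_components=2, dissimilarity='precomputed', random_state=0)
coords = mds.fit_transform(D)

print(coords)        # 2-D coordinates whose pairwise distances approximate D
print(mds.stress_)   # stress: goodness of fit (lower is better)
```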
Why MDS?

1
Types of MDS

1
Types of MDS
Metric MDS :
● also known as Principal Coordinate Analysis (PCoA).
● Make sure not to confuse it with Principal Component Analysis (PCA), a
separate yet similar technique.
● Metric MDS attempts to model the similarity/dissimilarity of data by
calculating distances between each pair of points using their geometric
coordinates.
● The key here is the ability to measure a distance using a linear scale.
● Eg if the distance between two points is 10 units, it means they're twice as
far apart as two points that are only 5 units apart.
1
Types of MDS
Non Metric MDS :
● It is used when you have data with ordered values, like ratings.
● It's about showing the relationships between items based on their order,
rather than exact distances.
● Imagine you asked people to rate products from 1 to 5.
● In non-metric MDS, the focus is on the order of ratings (1 < 2 < 3 < 4 < 5)
rather than the actual numerical difference between them.
● It helps create a map that captures the ranking relationships between
items, even if you can't say exactly how much better one item is compared to
another.
1
MDS for Face Recognition

1
Matrix Factorization
● The goal here is expressing a matrix as the product of two smaller
matrices.
● In the image below the blue matrix is your data where each row is a
sample and each column is a feature.
● The Archetypes are the simplest forms you are going to use to
reconstruct your data

1
Matrix Factorization
● One row of your data will be expressed as a linear combination of your
archetypes.
● The coefficients of your linear combination (in red in the image) are
your low dimensional representation. And that is basically Matrix
Factorization.

1
Matrix Factorization

1
Matrix Factorization
● In the above graph, on the left-hand side, we have depicted individual preferences, wherein 4 individuals have been asked to provide a rating on safety and mileage.
● Cars are rated based on the number of features (items) they offer. A rating of 4 implies high features, and 1 depicts fewer features.
● The blue-coloured '?' marks are the sparse (missing) values: either the person does not know about the car, the car is not part of their consideration list, or they have forgotten to rate it.

1
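A minimal sketch of matrix factorization for such a ratings matrix: gradient descent on the observed entries only, so the '?' cells get filled in by the reconstruction (the numbers below are hypothetical stand-ins for the figure's ratings):

```python
import numpy as np

# Rows: 4 people, columns: 4 cars; np.nan marks the '?' entries
R = np.array([[4,      3,      np.nan, 1],
              [np.nan, 2,      3,      1],
              [1,      1,      np.nan, 4],
              [1,      np.nan, 1,      4]], dtype=float)

k = 2                                     # number of archetypes / latent factors
rng = np.random.default_rng(0)
P = rng.random((R.shape[0], k))           # low-dimensional representation of the people
Q = rng.random((R.shape[1], k))           # archetypes (item factors)

mask = ~np.isnan(R)
lr, reg = 0.01, 0.02
for _ in range(5000):                     # minimise the error on observed entries only
    E = np.where(mask, R - P @ Q.T, 0.0)
    P += lr * (E @ Q - reg * P)
    Q += lr * (E.T @ P - reg * Q)

print(np.round(P @ Q.T, 2))               # reconstruction; '?' cells are now predictions
```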
Thank You !!!

186
Unit –II : Feature Engineering

● Part-1 Concept of Feature, Preprocessing of data: Normalization and Scaling, Standardization,


Managing missing values
● Part-2 Introduction to Dimensionality Reduction, Principal Component Analysis (PCA), Feature Extraction: Kernel PCA, Local Binary Pattern.
● Part-3 Introduction to various Feature Selection Techniques, Sequential Forward Selection,
Sequential Backward Selection.
● Part-4 Statistical feature engineering: Mean, Median, Mode etc. based feature vector creation.
● Part-5 Multidimensional Scaling, Matrix Factorization Techniques.

187
