Unit-II: Feature Engineering
● Feature engineering is the pre-processing step of machine learning that transforms raw data into features which can be used to build a predictive model with machine learning or statistical modelling.
● Feature engineering in machine learning aims to improve the performance of models.
https://www.javatpoint.com/feature-engineering-for-machine-learning
What is a Feature?
● A feature is an individual measurable property or attribute of the data (for example age, income, or pixel intensity) that is used as an input to a model.
Feature Engineering Process:
The feature engineering process consists of four steps:
● Feature Creation
● Transformations
● Feature Extraction
● Feature Selection

Feature Creation
● Feature creation is finding the most useful variables to be used in a predictive model.
Unit-II: Feature Engineering
Part-1: Preprocessing of Data
Concepts of feature normalization, scaling, standardization, and managing missing values
https://www.youtube.com/watch?v=AOfzlVi-NJs
Scaling:
● The numerical features of a dataset do not share a common range; each feature varies on its own scale.
● We cannot expect the age and income columns, for example, to have the same range.
● From a machine learning point of view, how can these two columns be compared?
● After a scaling process, the continuous features become comparable in terms of their range.
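A minimal sketch of scaling an age column and an income column onto a comparable range with scikit-learn (the values below are invented for illustration):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical data: each row is a person, columns are [age, income].
X = np.array([[25, 30000],
              [40, 52000],
              [58, 41000],
              [33, 78000]], dtype=float)

# Min-max scaling maps every column to the [0, 1] range.
X_minmax = MinMaxScaler().fit_transform(X)

# Standardization (z-score) gives every column mean 0 and standard deviation 1.
X_std = StandardScaler().fit_transform(X)

print(X_minmax)
print(X_std)
```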
● Standardisation is more robust to outliers and, in many cases, it is preferable to min-max normalization.
The standard deviation formula may look confusing, but it will make sense after we break it down.
Step 1: Find the mean.
Step 2: For each data point, find the square of its distance to the mean.
Step 3: Sum the values from Step 2.
Step 4: Divide by the number of data points.
Step 5: Take the square root.
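A minimal sketch of those five steps in plain Python, using the population form of the standard deviation (dividing by the number of data points, as the steps above describe). The vector is the one used in the exercise that follows:

```python
import math

def std_dev(values):
    # Step 1: find the mean.
    mean = sum(values) / len(values)
    # Step 2: for each data point, find the square of its distance to the mean.
    squared_distances = [(v - mean) ** 2 for v in values]
    # Step 3: sum the values from Step 2.
    total = sum(squared_distances)
    # Step 4: divide by the number of data points.
    variance = total / len(values)
    # Step 5: take the square root.
    return math.sqrt(variance)

print(std_dev([23, 29, 52, 31, 45, 19, 18, 27]))  # mean 30.5, std ≈ 11.36
```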
Consider a vector x = (23, 29, 52, 31, 45, 19, 18, 27). Apply feature scaling and find the min-max scaled values as well as the z-score values.
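A sketch of this exercise in NumPy (min-max scaling maps the values to [0, 1]; the z-scores here use the population standard deviation):

```python
import numpy as np

x = np.array([23, 29, 52, 31, 45, 19, 18, 27], dtype=float)

# Min-max scaling: (x - min) / (max - min), so 18 -> 0.0 and 52 -> 1.0.
x_minmax = (x - x.min()) / (x.max() - x.min())

# Z-score standardization: (x - mean) / std, with mean = 30.5 and std = sqrt(129) ≈ 11.36.
x_z = (x - x.mean()) / x.std()

print(np.round(x_minmax, 3))
print(np.round(x_z, 3))
```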
● A z-score tells us how many standard deviations away a value is from the mean.
● If a value has a z-score equal to 0, then the value is equal to the mean.
● If a value has a z-score equal to -1.3, then the value is 1.3 standard deviations below the mean.
● If a value has a z-score equal to 2.2, then the value is 2.2 standard deviations above the mean.
Managing missing values:
Even when the data is in the correct format, some values may be missing. Suppose we have data about students and one of the features is the landline contact number. For some records the value of this feature is missing, simply because some students do not have a landline at home.
Missing data can also arise from the way the data was collected. For example, if student data was initially collected for only one region, the pincode may not have been recorded. Later, when the dataset is expanded to cover all regions, the pincode, which is now a necessary feature, is missing for the earlier records.
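A minimal sketch of common strategies for managing missing values with pandas (the student records below are invented for illustration):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "name": ["Asha", "Ravi", "Meera", "John"],
    "landline": ["020-1234", np.nan, np.nan, "022-5678"],
    "marks": [78, 85, np.nan, 90],
})

# Option 1: drop rows (or columns) that contain missing values.
dropped = df.dropna()

# Option 2: fill missing numeric values with a statistic such as the mean.
df["marks"] = df["marks"].fillna(df["marks"].mean())

# Option 3: fill missing categorical values with a placeholder.
df["landline"] = df["landline"].fillna("not available")

print(df)
```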
Introduction to Dimensionality Reduction
● Dimensionality reduction is the process of reducing the number of random variables under consideration by obtaining a set of principal variables.
● In machine learning classification problems, there are often too many factors on the basis of which the final classification is done.
● These factors are basically variables called features.
● The higher the number of features, the harder it gets to visualize the training set and then work on it.
● Sometimes, most of these features are correlated and hence redundant. This is where dimensionality reduction algorithms come into play.
https://www.geeksforgeeks.org/dimensionality-reduction/
Why Feature Selection?
● Irrelevant or redundant features increase training time, make models harder to interpret, and can lead to overfitting; selecting a smaller subset of informative features mitigates these problems.
Feature selection: In this, we try to find a subset of the original set of variables, or features, to get a smaller subset which can be used to model the problem. It usually involves three approaches:
○ Filter
○ Wrapper
○ Embedded
Introduction to Dimensionality Reduction: Feature Selection
Filter Method:
● Features are scored with statistical measures (such as correlation with the target, chi-square, or information gain), independently of any learning algorithm, and the top-scoring features are kept.
Wrapper Method:
● Subsets of features are evaluated by actually training a model on them (for example forward selection, backward elimination, or recursive feature elimination) and the best-performing subset is kept.
Embedded Method:
● Feature selection is performed as part of model training itself, as with Lasso (L1) regularization or the feature importances of tree-based models.
Filter method vs. wrapper method:
● Filter: measures the relevance of features with the dependent variable. Wrapper: measures the usefulness of a subset of features.
● Filter: fast and computationally less expensive. Wrapper: slow and computationally more expensive.
● Filter: might fail to find the best subset of features. Wrapper: can always provide the best subset of features.
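A minimal sketch contrasting a filter method (univariate statistical scores) with a wrapper method (recursive feature elimination) in scikit-learn; the iris dataset is just a stand-in example:

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import RFE, SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

# Filter method: score each feature against the target (ANOVA F-test) and keep the top 2.
filter_selector = SelectKBest(score_func=f_classif, k=2).fit(X, y)
print("filter keeps features:", filter_selector.get_support(indices=True))

# Wrapper method: repeatedly fit a model and drop the weakest feature until 2 remain.
wrapper_selector = RFE(LogisticRegression(max_iter=1000), n_features_to_select=2).fit(X, y)
print("wrapper keeps features:", wrapper_selector.get_support(indices=True))
```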
Feature Extraction: Find a smaller set of new variables, each a combination of the input variables, containing essentially the same information as the input variables.
The various methods used for dimensionality reduction include:
● Principal Component Analysis (PCA)
● Kernel PCA
● Linear Discriminant Analysis (LDA)
https://www.analyticsvidhya.com/blog/2021/04/guide-for-feature-extraction-techniques/
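A minimal sketch of feature extraction with PCA in scikit-learn, projecting the 4 iris features down to 2 new variables (the dataset and the choice of 2 components are illustration choices):

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X, _ = load_iris(return_X_y=True)

# Each principal component is a linear combination of the original 4 features.
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)

print(X_reduced.shape)                # (150, 2)
print(pca.explained_variance_ratio_)  # share of variance kept by each component
```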
Feature Engineering Process: Transformations
● The transformation step of feature engineering involves adjusting the predictor variables to improve the accuracy and performance of the model. For example, it ensures that the model is flexible enough to take input from a variety of data, and it ensures that all the variables are on the same scale, making the model easier to understand.
Overfitting and Underfitting
[Figure: two plots of Y against X contrasting an overfitted model with an underfitted model.]
[Example: a "ball" is learned from the features sphere, can be played with, cannot be eaten, radius 5 cm; given a new sphere of radius 10 cm, should the model still call it a ball?]
Principal Component Analysis
[Figure: data points projected onto the principal component axes PC1 and PC2.]
Covariance Matrix
● PCA first computes the covariance matrix of the (mean-centred) data; the eigenvectors of this matrix give the directions of the principal components, and the eigenvalues give the variance captured along each direction.
Eigenvalues and Eigenvectors
● For a square matrix A, a non-zero vector x is called an eigenvector if multiplication by A results in a scalar multiple of x:
A*x = λ*x
● The eigenvalues λ are the solutions of the characteristic equation |A - λ*I| = 0, which is a polynomial in λ of degree n.
● By setting this polynomial equal to zero and solving for λ, the desired eigenvalues are obtained.
● Here n solutions are generated; it means that n eigenvalues are derived.
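A minimal sketch of this eigen-decomposition in NumPy, applied to the covariance matrix of some made-up 2-D data (the data itself is only for illustration):

```python
import numpy as np

# Made-up 2-D data: rows are samples, columns are features.
X = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9],
              [1.9, 2.2], [3.1, 3.0], [2.3, 2.7]])

# Centre the data and compute its covariance matrix A.
X_centred = X - X.mean(axis=0)
A = np.cov(X_centred, rowvar=False)

# Solve A*x = λ*x; eigh is used because a covariance matrix is symmetric.
eigenvalues, eigenvectors = np.linalg.eigh(A)

print(eigenvalues)   # variance along each principal direction
print(eigenvectors)  # each column is an eigenvector (principal direction)
```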
Kernel Trick
● A kernel function computes the inner product of two points after they are mapped into a higher-dimensional feature space f, that is K(x, y) = < f(x), f(y) >, without ever computing the mapping f explicitly.
● Let us say that we have two points, x = (2, 3, 4) and y = (3, 4, 5).
● As we have seen, K(x, y) = < f(x), f(y) >.
● Let us first calculate < f(x), f(y) >:
○ f(x) = (x1x1, x1x2, x1x3, x2x1, x2x2, x2x3, x3x1, x3x2, x3x3)
○ f(y) = (y1y1, y1y2, y1y3, y2y1, y2y2, y2y3, y3y1, y3y2, y3y3)
● So,
○ f(2, 3, 4) = (4, 6, 8, 6, 9, 12, 8, 12, 16) and
○ f(3, 4, 5) = (9, 12, 15, 12, 16, 20, 15, 20, 25)
● So the dot product,
○ f(x) · f(y) = f(2, 3, 4) · f(3, 4, 5)
○ = (36 + 72 + 120 + 72 + 144 + 240 + 120 + 240 + 400) = 1444
● Another way, with x = (2, 3, 4) and y = (3, 4, 5), using the kernel K(x, y) = (x · y)^2:
○ K(x, y) = (2*3 + 3*4 + 4*5)^2
○ = (6 + 12 + 20)^2
○ = 38*38
○ = 1444
Why the Kernel Trick?
● As we found out, f(x) · f(y) and K(x, y) give us the same result, but the kernel computes it directly in the original 3-dimensional space instead of the 9-dimensional feature space.
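A minimal sketch verifying this equivalence in NumPy for the two points above:

```python
import numpy as np

x = np.array([2, 3, 4])
y = np.array([3, 4, 5])

# Explicit mapping f: all pairwise products of the components (9 dimensions).
f_x = np.outer(x, x).ravel()
f_y = np.outer(y, y).ravel()
explicit = np.dot(f_x, f_y)

# Kernel trick: the same value computed directly in the original 3-dimensional space.
kernel = np.dot(x, y) ** 2

print(explicit, kernel)  # both print 1444
```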
Types of Kernel
● Linear Kernel
● Polynomial Kernel
● Exponential Kernel
● Gaussian Kernel
● Sigmoid Kernel
● And many more
Types of Kernel - Linear Kernel
● Let us say that we have two vectors named x1 and x2; then the linear kernel is simply the dot product of these two vectors:
● K(x1, x2) = x1 . x2
Types of Kernel - Polynomial Kernel
● The polynomial kernel is a more generalized form of the linear kernel; for two vectors x1 and x2 it is defined as
● K(x1, x2) = (x1 . x2 + 1)^d, where d is the degree of the polynomial.
Types of Kernel - Gaussian Kernel
● This kernel is an example of a radial basis function (RBF) kernel:
● K(x1, x2) = exp(-||x1 - x2||^2 / (2σ^2))
● The adjustable parameter σ plays a major role in the performance of the kernel and should be carefully tuned to the problem.
Types of Kernel - Exponential Kernel
● This is in close relation with the previous kernel, i.e. the Gaussian kernel, with the only difference that the square of the norm is removed:
● K(x1, x2) = exp(-||x1 - x2|| / (2σ^2))
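A minimal sketch evaluating these kernels for two small vectors (σ and the polynomial degree d are arbitrary illustration values):

```python
import numpy as np

x1 = np.array([2.0, 3.0, 4.0])
x2 = np.array([3.0, 4.0, 5.0])
d, sigma = 2, 1.5  # illustration values for the degree and the width parameter

linear = np.dot(x1, x2)                 # x1 . x2
polynomial = (np.dot(x1, x2) + 1) ** d  # (x1 . x2 + 1)^d
gaussian = np.exp(-np.linalg.norm(x1 - x2) ** 2 / (2 * sigma ** 2))
exponential = np.exp(-np.linalg.norm(x1 - x2) / (2 * sigma ** 2))

print(linear, polynomial, gaussian, exponential)
```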
Linear PCA vs. Kernel PCA
● PCA is a linear method. It works great for linearly separable datasets.
● However, if the dataset has non-linear relationships, then it produces undesirable results.
● Kernel PCA is a technique which uses the so-called kernel trick and projects the linearly inseparable data into a higher dimension where it is linearly separable.
● There are various kernels that are popularly used; some of them are linear, polynomial, RBF, and sigmoid.
Kernel PCA
Step 1:
● First choose a kernel function k(x_i, x_j) and let T be the corresponding transformation to a higher dimension.
Step 2:
● Find the covariance matrix of the data, but here the kernel function is used to calculate this matrix: it is the matrix that results from applying the kernel function to all pairs of data points,
K = T(X) T(X)^T
Step 3:
● Centre the kernel matrix K so that the (implicitly) transformed data has zero mean in the higher-dimensional feature space; the centring is applied to K itself because the transformation T is never computed explicitly.
Step 4:
● Find the eigenvectors and eigenvalues of this matrix.
● Sort the eigenvectors based on their corresponding eigenvalues in decreasing order.
● Choose the number of dimensions needed in the reduced dataset; let's call it k.
● Choose the first k eigenvectors and concatenate them into one matrix.
● Finally, calculate the product of that matrix with your data. The result will be the reduced dataset.
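A minimal sketch of kernel PCA with scikit-learn on the classic two-circles data, which is linearly inseparable until it is projected through an RBF kernel (the gamma value is just an illustration choice):

```python
from sklearn.datasets import make_circles
from sklearn.decomposition import KernelPCA

# Two concentric circles: not linearly separable in the original 2-D space.
X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

# Kernel PCA with an RBF kernel implicitly maps the data to a higher dimension
# and returns the leading principal components computed there.
kpca = KernelPCA(n_components=2, kernel="rbf", gamma=10)
X_kpca = kpca.fit_transform(X)

print(X_kpca.shape)  # (200, 2)
```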
Local Binary Pattern
● Local Binary Pattern (LBP) is a simple but powerful technique used to describe the local texture of images.
● It combines statistical and structural methods and was first described in 1994.
● LBP builds a local representation of a picture by comparing each pixel with its neighboring pixels.
Local Binary Pattern
The original LBP operator labels the pixels of an image with decimal
numbers, called Local Binary Patterns which encode the local
structure around each pixel.
115
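A minimal sketch of the original 3x3 LBP operator for one pixel: the 8 neighbours are compared with the centre, each comparison contributes one bit, and the resulting 8-bit number is the pixel's label (the tiny image patch below is made up):

```python
import numpy as np

def lbp_code(patch):
    """Compute the LBP label for the centre pixel of a 3x3 patch."""
    center = patch[1, 1]
    # Neighbours visited clockwise, starting at the top-left corner.
    neighbours = [patch[0, 0], patch[0, 1], patch[0, 2], patch[1, 2],
                  patch[2, 2], patch[2, 1], patch[2, 0], patch[1, 0]]
    bits = [1 if n >= center else 0 for n in neighbours]
    # Interpret the 8 bits as a decimal number between 0 and 255.
    return sum(bit << i for i, bit in enumerate(bits))

patch = np.array([[6, 5, 2],
                  [7, 6, 1],
                  [9, 8, 7]])
print(lbp_code(patch))
```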
Statistical Feature Engineering
Feature engineering refers to a process of selecting and transforming variables/features in your dataset when creating a predictive model using machine learning.
• Therefore, you have to extract the features from the raw dataset you have collected before training your data in machine learning algorithms.
• Feature engineering has two goals:
– Preparing the proper input dataset, compatible with the machine learning algorithm requirements.
– Improving the performance of machine learning models.
Feature Vector Creation
● There are various techniques for creating feature vectors, each tailored to different types of data and tasks.
● Here are some common techniques for creating feature vectors:
○ Counter based
○ Mean
○ Median
○ Mode
● Others are: LBP, MDS, label encoding, one-hot encoding, TF-IDF, histograms, etc.
Counter based Vectorization
● Counter-based feature vector creation is a technique that involves counting the occurrences of certain elements or events in a dataset and representing these counts as features in a vector format.
● This technique is commonly used in natural language processing (NLP) for text analysis, where words or phrases are counted to create feature vectors.
● Original data:
○ Review 1: "The product is great and durable."
○ Review 2: "I am satisfied with this purchase."
○ Review 3: "Not worth the money, very disappointed."
● Consider the words "product," "great," "durable," "satisfied," "purchase," "worth," "money," and "disappointed" as our vocabulary.
● Arrange the vocabulary in alphabetical order and, for each review, count how many times each vocabulary word occurs; the counts form that review's feature vector.
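A minimal sketch of this counting with scikit-learn's CountVectorizer, restricted to the vocabulary listed above (in alphabetical order, matching the slide's instruction):

```python
from sklearn.feature_extraction.text import CountVectorizer

reviews = [
    "The product is great and durable.",
    "I am satisfied with this purchase.",
    "Not worth the money, very disappointed.",
]
vocabulary = ["disappointed", "durable", "great", "money",
              "product", "purchase", "satisfied", "worth"]

# Each review becomes a vector of word counts over the fixed vocabulary.
vectorizer = CountVectorizer(vocabulary=vocabulary)
X = vectorizer.fit_transform(reviews)

print(vectorizer.get_feature_names_out())
print(X.toarray())
# Review 1 -> [0 1 1 0 1 0 0 0], Review 2 -> [0 0 0 0 0 1 1 0], Review 3 -> [1 0 0 1 0 0 0 1]
```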
Multidimensional Scaling
● Multidimensional Scaling (MDS) places objects as points in a low-dimensional space so that the distances between the points reflect, as closely as possible, the given dissimilarities between the objects.
● The "multi" part means that MDS is not just for two-dimensional pictures.
● It can work for 3D, 4D, or even more dimensions. This is like having more layers in your graph.
● People use MDS in many different areas. It's like a tool that can help us understand things better.
● The term scaling comes from psychometrics, where abstract concepts ("objects") are assigned numbers according to a rule.
● For example, you may want to quantify a person's attitude to global warming. You could assign a "1" to "doesn't believe in global warming", a 10 to "firmly believes in global warming", and a scale of 2 to 9 for attitudes in between.
● You can also think of "scaling" as the fact that you're essentially scaling down the data (i.e. making it simpler by creating lower-dimensional data).
● Data that is scaled down in dimension keeps similar properties. For example, two data points that are close together in high-dimensional space will also be close together in low-dimensional space.
● For example, if you had a list of cities and only knew how far apart they are, MDS could help you create a map that shows their distances and positions, even if you don't know exactly where they are.
MDS algorithm:
Step 1: Assign a number of points to coordinates in n-dimensional space.
● N-dimensional space could be 2-dimensional, 3-dimensional, or higher spaces (at least theoretically, because 4-dimensional spaces and above are difficult to model). The orientation of the coordinate axes is arbitrary and is mostly the researcher's choice.
● For maps like the one in the simple example above, axes that represent north/south and east/west make the most sense.
Step 2: Calculate the Euclidean distances for all pairs of points.
Step 3: Compare the similarity matrix with the original input matrix by evaluating the stress function. The smaller the stress, the better the low-dimensional configuration reproduces the original distances; the point coordinates are adjusted iteratively to reduce the stress.
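A minimal sketch of metric MDS with scikit-learn, recovering 2-D positions for four cities from nothing but a made-up pairwise distance matrix:

```python
import numpy as np
from sklearn.manifold import MDS

# Hypothetical pairwise distances (in km) between four cities: symmetric, zero diagonal.
distances = np.array([
    [0, 450, 1150, 800],
    [450, 0, 700, 1200],
    [1150, 700, 0, 1400],
    [800, 1200, 1400, 0],
], dtype=float)

# dissimilarity="precomputed" tells MDS we are passing distances, not raw features.
mds = MDS(n_components=2, dissimilarity="precomputed", random_state=0)
coords = mds.fit_transform(distances)

print(coords)       # one (x, y) position per city
print(mds.stress_)  # the stress value from Step 3: lower means a better fit
```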
Why MDS?
● MDS is useful when you only know how similar or dissimilar objects are (for example, the distances between cities) and you want a low-dimensional picture that preserves those relationships as well as possible.
Types of MDS
Metric MDS:
● Also known as Principal Coordinate Analysis (PCoA).
● Make sure not to confuse it with Principal Component Analysis (PCA), a separate yet similar technique.
● Metric MDS attempts to model the similarity/dissimilarity of data by calculating distances between each pair of points using their geometric coordinates.
● The key here is the ability to measure a distance using a linear scale.
● E.g. if the distance between two points is 10 units, it means they're twice as far apart as two points that are only 5 units apart.
Non-Metric MDS:
● It is used when you have data with ordered values, like ratings.
● It's about showing the relationships between items based on their order, rather than exact distances.
● Imagine you asked people to rate products from 1 to 5.
● In non-metric MDS, the focus is on the order of ratings (1 < 2 < 3 < 4 < 5) rather than the actual numerical difference between them.
● It helps create a map that captures the ranking relationships between items, even if you can't say exactly how much better one item is compared to another.
MDS for Face Recognition
[Figure: example of MDS applied to face recognition.]
Matrix Factorization
● The goal here is to express a matrix as the product of two smaller matrices.
● In the accompanying figure, the blue matrix is your data, where each row is a sample and each column is a feature.
● The archetypes are the simplest forms you are going to use to reconstruct your data.
● One row of your data will be expressed as a linear combination of your archetypes.
● The coefficients of your linear combination (shown in red in the image) are your low-dimensional representation. And that is basically matrix factorization.
● In the ratings example, on the left-hand side we have individual preferences, wherein 4 individuals have been asked to provide a rating on safety and mileage.
● Cars are rated based on the number of features (items) they offer: a rating of 4 implies high features, and 1 depicts fewer features.
● The blue-coloured ? mark is a sparse value, where the person either does not know about the car, does not have it in the consideration list for buying, or has forgotten to rate it.
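A minimal sketch of matrix factorization on a small ratings matrix like the one described above, using non-negative matrix factorization from scikit-learn (the numbers are made up, and treating the "?" entries as 0 is a simplification; real recommenders use masked or specialised factorizations):

```python
import numpy as np
from sklearn.decomposition import NMF

# Hypothetical ratings of 4 people for 4 cars (1-4 scale); 0 stands in for the unknown "?" entries.
R = np.array([
    [4, 3, 0, 1],
    [3, 0, 2, 1],
    [1, 1, 0, 4],
    [0, 1, 3, 4],
], dtype=float)

# Factorize R ≈ W @ H with 2 latent factors (the "archetypes").
model = NMF(n_components=2, init="random", random_state=0, max_iter=500)
W = model.fit_transform(R)  # low-dimensional representation of each person
H = model.components_       # archetypes expressed over the cars

print(np.round(W @ H, 2))   # reconstructed ratings, including estimates for the 0 entries
```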
Thank You !!!