Chapter 2
Data Handling Techniques
Content
• Data cleaning
• Data Transformations
• Outlier detection and visualization
• Imbalanced dataset handling
• Feature Selection and extraction
Importance of data preprocessing
• It improves accuracy and reliability. Preprocessing data removes
missing or inconsistent data values resulting from human or
computer error, which can improve the accuracy and quality of a
dataset, making it more reliable.
• It makes data consistent. When collecting data, it's possible to have
data duplicates, and discarding them during preprocessing can
ensure the data values for analysis are consistent, which helps
produce accurate results.
• It increases the data's algorithm readability. Preprocessing
enhances the data's quality and makes it easier for machine learning
algorithms to read, use, and interpret it.
Features of data preprocessing
• Preprocessing has many features that make it an important
preparation step for data analysis. The following are the two
main features with a brief explanation:
• Data validation
• Data imputation
Features of data preprocessing
• Data validation: This is the process where businesses analyze and
assess the raw data for a project to determine if it's complete and
accurate to achieve the best results.
• Data imputation: Data imputation is where you fill in missing values
and rectify data errors found during the validation process, either
manually or through programming, such as business process automation.
Data Cleaning
Effective Strategies for Handling Missing Values in Data
Analysis (How to handle missing values?)
• What Is a Missing Value?
• Missing data is defined as the values or data that is not stored
(or not present) for some variable/s in the given dataset. Below
is a sample of the missing data from the Titanic dataset. You can
see the columns ‘Age’ and ‘Cabin’ have some missing values.
Effective Strategies for Handling Missing
Values in Data Analysis
• How Is a Missing Value Represented in a Dataset?
• In the dataset, the blank shows the missing values.
• In Pandas, usually, missing values are represented by NaN. It
stands for Not a Number.
Why Is Data Missing From the Dataset?
• Past data might get corrupted due to improper maintenance.
• Observations were never recorded for certain fields, or the recording of the
values failed due to human error.
• The user did not provide the values intentionally.
• Item nonresponse: the participant refused to respond to a particular item.
Types of Missing Values
• Formally the missing values are categorized as follows:
• Missing Completely At Random (MCAR)
• Missing At Random (MAR)
• Missing Not At Random (MNAR)
Missing Completely At Random (MCAR)
• In MCAR, the probability of data being missing is the same for all the observations. In this case, there is no
relationship between the missing data and any other values observed or unobserved (the data which is not
recorded) within the given dataset. That is, missing values are completely independent of other data. There is no
pattern.
• In the case of MCAR data, the value could be missing due to human error, some system/equipment failure, loss
of sample, or some unsatisfactory technicalities while recording the values. For Example, suppose in a library
there are some overdue books. Some values of overdue books in the computer system are missing. The reason
might be a human error, like the librarian forgetting to type in the values. So, the missing values of overdue books
are not related to any other variable/data in the system. MCAR should not be assumed by default, as it is a rare case in practice. The
advantage of such data is that the statistical analysis remains unbiased.
Missing At Random (MAR)
• MAR data means that the reason for missing values can be explained by variables on which you have complete information, as there is
some relationship between the missing data and other values/data. In this case, the data is not missing for all the observations. It is
missing only within sub-samples of the data, and there is some pattern in the missing values.
• For example, if you check the survey data, you may find that all the people have answered their ‘Gender,’ but ‘Age’ values
are mostly missing for people who have answered their ‘Gender’ as ‘female.’ (The reason being most of the females don’t want to reveal
their age.)
• So, the probability of data being missing depends only on the observed value or data. In this case, the variables ‘Gender’ and ‘Age’ are
related. The reason for missing values of the ‘Age’ variable can be explained by the ‘Gender’ variable, but you can not predict the missing
value itself.
• Suppose a poll is taken for overdue books in a library. Gender and the number of overdue books are asked in the poll. Assume that most of
the females answer the poll and men are less likely to answer. So why the data is missing can be explained by another factor, that is gender.
In this case, the statistical analysis might result in bias. Getting an unbiased estimate of the parameters can be done only by modeling the
missing data.
Missing Not At Random (MNAR)
• Missing values depend on the unobserved data. If there is some structure/pattern in missing data and other observed data can
not explain it, then it is considered to be Missing Not At Random (MNAR).
• If the missing data does not fall under the MCAR or MAR, it can be categorized as MNAR. It can happen due to the reluctance of
people to provide the required information. A specific group of respondents may not answer some questions in a survey.
• For example, suppose the name and the number of overdue books are asked in the poll for a library. So most of the people having
no overdue books are likely to answer the poll. People having more overdue books are less likely to answer the poll. So, in this
case, the missing value of the number of overdue books depends on the people who have more books overdue.
• Another example is that people having less income may refuse to share some information in a survey or questionnaire.
• In the case of MNAR as well, the statistical analysis might result in bias.
Handling Missing Data
• Why Do We Need to Care About Handling Missing Data?
• It is important to handle the missing values appropriately.
• Many machine learning algorithms fail if the dataset contains missing values. However,
some algorithms, such as k-nearest neighbours and naive Bayes, can handle data with missing values.
• You may end up building a biased machine learning model, leading to incorrect results if
the missing values are not handled properly.
• Missing data can lead to a lack of precision in the statistical analysis.
Handling Missing Values
• Now that you have found the missing data, how do you handle the missing values?
• Analyze each column with missing values carefully to understand the reasons behind the
missing of those values, as this information is crucial to choose the strategy for handling
the missing values.
• There are two primary ways of handling missing values:
1. Deleting the missing values
2. Imputing the missing values
Deleting the Missing value
• Generally, this approach is not recommended. It is one of the
quick-and-dirty techniques one can use to deal with missing
values. If the missing value is of the type Missing Not At Random
(MNAR), then it should not be deleted.
• If the missing value is of type Missing At Random (MAR) or
Missing Completely At Random (MCAR), then it can be deleted.
(In such an analysis, all cases with available data are utilized;
missing observations are assumed to be completely random
(MCAR) and addressed through listwise or pairwise deletion.)
Deleting the Missing value
• The disadvantage of this method is one might end up deleting
some useful data from the dataset.
• There are 2 ways one can delete the missing data values:
• Deleting the entire row (listwise deletion)
• If a row has many missing values, you can drop the entire row. If
every row has some (column) value missing, you might end up
deleting the whole data. The code to drop the entire row is as
follows:
Deleting an entire Row or Column

# Deleting entire rows that contain missing values (listwise deletion)
df = train_df.dropna(axis=0)
df.isnull().sum()      # the full rows with missing values are deleted

# Deleting an entire column
df = train_df.drop(['Dependents'], axis=1)
df.isnull().sum()      # the full 'Dependents' column is deleted
Imputing the Missing Value
• There are many imputation methods for replacing the missing
values. You can use different Python libraries such as pandas
and scikit-learn to do this. Let’s go through some of the ways
of replacing the missing values.
Replacing with an arbitrary value
• If you can make an educated guess about the missing value, then
you can replace it with some arbitrary value using the following
code. E.g., in the following code, we are replacing the missing
values of the ‘Dependents’ column with ‘0’.
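The code itself appears only as an image on the original slide; here is a minimal sketch of the idea, assuming the loan dataset used in these examples is loaded as train_df (the file name 'train.csv' is a hypothetical placeholder) and that 'Dependents' stores its values as strings:

import pandas as pd

# assuming train_df is the loan dataset used throughout these slides
train_df = pd.read_csv('train.csv')                            # hypothetical file name
train_df['Dependents'] = train_df['Dependents'].fillna('0')    # replace missing values with the arbitrary value '0'
print(train_df['Dependents'].isnull().sum())                   # should now print 0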
Replacing with the mean
• This is the most common method of
imputing missing values of numeric
columns. If there are outliers, then the
mean will not be appropriate. In such
cases, outliers need to be treated first.
• use the ‘fillna’ method for imputing
the columns ‘LoanAmount’ and
‘Credit_History’ with the mean of the
respective column values.
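A minimal sketch, continuing with the train_df assumed above, using fillna with the column mean for the two numeric columns named on the slide:

# impute numeric columns with the mean of the respective column
train_df['LoanAmount'] = train_df['LoanAmount'].fillna(train_df['LoanAmount'].mean())
train_df['Credit_History'] = train_df['Credit_History'].fillna(train_df['Credit_History'].mean())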
Replacing with the mode
• Mode is the most
frequently occurring value.
It is used in the case of
categorical features. You
can use the ‘fillna’ method
for imputing the
categorical columns
‘Gender,’ ‘Married,’ and
‘Self_Employed.’
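A minimal sketch for the categorical columns named on the slide; mode()[0] picks the most frequent value of each column:

# impute categorical columns with the most frequent value (mode)
for col in ['Gender', 'Married', 'Self_Employed']:
    train_df[col] = train_df[col].fillna(train_df[col].mode()[0])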
Replacing with the median
• The median is the middlemost value. It’s better to use the
median value for imputation in the case of outliers. You can use
the ‘fillna’ method for imputing the column ‘Loan_Amount_Term’
with the median value.
train_df['Loan_Amount_Term']= train_df['Loan_Amount_Term'].fillna(train_df['Loan_Amount_Term'].median())
Replacing with the previous value – forward
fill
• In some cases, imputing the
values with the previous value
instead of the mean, mode, or
median is more appropriate.
• This is called forward fill.
• It is mostly used in time series
data. You can use the ‘fillna’
function with the parameter
‘method = ffill’
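A minimal sketch of forward fill, using the 'LoanAmount' column purely for illustration; newer pandas versions prefer the equivalent .ffill() method over method='ffill':

# forward fill: propagate the last valid observation forward
train_df['LoanAmount'] = train_df['LoanAmount'].fillna(method='ffill')
# equivalently, in recent pandas:
# train_df['LoanAmount'] = train_df['LoanAmount'].ffill()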
Replacing with the next value – backward fill
• In backward fill, the
missing value is
imputed using the next
value.
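Backward fill is the mirror image; again the column choice is only illustrative:

# backward fill: use the next valid observation to fill the gap
train_df['LoanAmount'] = train_df['LoanAmount'].fillna(method='bfill')
# equivalently: train_df['LoanAmount'] = train_df['LoanAmount'].bfill()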
Interpolation
• Missing values can also be
imputed using interpolation.
Pandas’ interpolate method
can be used to replace the
missing values with different
interpolation methods like
‘polynomial,’ ‘linear,’ and
‘quadratic.’ The default
method is ‘linear.’
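A minimal sketch of interpolation with pandas; 'linear' is the default, and 'polynomial' requires an order argument:

# linear interpolation (the default)
train_df['LoanAmount'] = train_df['LoanAmount'].interpolate(method='linear')
# polynomial interpolation of order 2 (needs a numeric index)
# train_df['LoanAmount'] = train_df['LoanAmount'].interpolate(method='polynomial', order=2)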
Data Transformation
Data Transformation-in detail
• Data transformation is used when data needs to be converted to match that of
the destination system
• Data transformation is the process of converting raw data into a format or
structure that would be more suitable for model building and also data discovery
in general.
• It is an imperative step in feature engineering that facilitates discovering insights.
Feature Transforming techniques
1) MinMax Scaler
2) Standard Scaler
3) MaxAbsScaler
4) Robust Scaler
5) Quantile Transformer Scaler
6) Power Transformer Scaler
7) Unit Vector Scaler/Normalizer
8) Custom Transformer
Base code in python
Data set with income, Age
and Department for some
firm and its employee
Base code in python
• Observe that ‘Department’ is a categorical column, so only the numeric columns ‘Income’ and ‘Age’ are scaled.
df_scaled = df.copy()
col_names = ['Income', 'Age']
features = df_scaled[col_names]
We will execute this snippet before using a new scaler every time.
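The dataset itself is shown only as an image on the slide. A minimal sketch that recreates a comparable DataFrame is given below; the Income and Age values match the worked examples that follow, while the Department labels are assumptions for illustration.

import pandas as pd

# hypothetical employee data matching the numbers used in the worked examples
df = pd.DataFrame({
    'Income':     [15000, 1800, 120000, 10000],
    'Age':        [25, 18, 45, 51],
    'Department': ['HR', 'Sales', 'IT', 'Finance'],   # assumed labels
})

df_scaled = df.copy()
col_names = ['Income', 'Age']
features = df_scaled[col_names]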
MinMax Scaler
• The MinMax scaler is one of the simplest scalers to
understand. It just scales all the data between 0 and 1.
• x_scaled = (x – x_min)/(x_max – x_min)
Though (0, 1) is the default range, we can define our range of max and
min values as well.
MinMax Scaler
Worked example on the ‘Income’ column, using x_scaled = (x – x_min)/(x_max – x_min)
with x_min = 1,800 and x_max = 1,20,000, so x_max – x_min = 1,18,200:

Income (given)   x – x_min                      x_scaled
15,000           15,000 – 1,800 = 13,200        13,200/1,18,200 = 0.1117
1,800            1,800 – 1,800 = 0              0
1,20,000         1,20,000 – 1,800 = 1,18,200    1
10,000           10,000 – 1,800 = 8,200         8,200/1,18,200 = 0.0694

Observe each column carefully: the minimum scaled value is 0 and the maximum is 1,
hence the name MinMax scaler.
E.g., if you do not want Age = 0, the range can also be set from 5 to 10. How?
MinMax Scaler
The min-max scaler lets you set the range in which you want the variables to be.
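A minimal sketch using scikit-learn's MinMaxScaler on the base-code DataFrame; the (5, 10) range is just an illustration of a custom range:

from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()                              # default range (0, 1)
df_scaled[col_names] = scaler.fit_transform(features.values)

scaler = MinMaxScaler(feature_range=(5, 10))         # custom range, e.g. to avoid Age = 0
df_scaled[col_names] = scaler.fit_transform(features.values)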
Standard Scaler
• For each feature, the Standard Scaler scales the values such that the
mean is 0 and the standard deviation (and hence the variance) is 1.
x_scaled = (x – mean)/std_dev
Standard Scaler assumes that the distribution of the variable is normal.
In case, the variables are not normally distributed,
either choose a different scaler
Or
first, convert the variables to a normal distribution and then apply this scaler
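A minimal sketch with scikit-learn's StandardScaler, again continuing from the base code:

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
df_scaled[col_names] = scaler.fit_transform(features.values)
# each scaled column now has mean ~ 0 and standard deviation ~ 1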
Standard Scaler- Normal data distribution
• Histogram will help to identify the normal distribution of data
Standard Scaler- Normal data distribution
Worked example on the ‘Age’ column (values 25, 18, 45, 51, so mean = 34.75):

Age (given)   X – mean   (X – mean)^2   X_scaled = (X – mean)/SD
25            -9.75       95.06         -0.71
18            -16.75      280.56        -1.23
45             10.25      105.06         0.75
51             16.25      264.06         1.19

Sum of squares = 744.75; variance = 744.75/4 = 186.19;
standard deviation SD = √186.19 ≈ 13.64.
MaxAbsScaler
• In simplest terms, the MaxAbs scaler takes the absolute
maximum value of each column and divides each value in the
column by the maximum value.
• It first takes the absolute value of each value in the column and
then takes the maximum value out of those. This operation
scales the data between the range [-1, 1].
MaxAbsScaler
Age and income cannot be negative, so let us take one more variable, Balance
(maximum absolute value = 2000):

Balance value   Absolute value   Divide each value by the max absolute value
100             100              100/2000 = 0.05
-263            263              -263/2000 = -0.1315
2000            2000             2000/2000 = 1
-5              5                -5/2000 = -0.0025

Compare this with marks expressed as a percentage of the maximum.
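A minimal sketch with scikit-learn's MaxAbsScaler, using the Balance values from the table above:

import pandas as pd
from sklearn.preprocessing import MaxAbsScaler

balance = pd.DataFrame({'Balance': [100, -263, 2000, -5]})
balance['Balance_scaled'] = MaxAbsScaler().fit_transform(balance[['Balance']])
print(balance)   # scaled values: 0.05, -0.1315, 1.0, -0.0025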
Robust Scaler
• If you have noticed in the scalers we used so far, each of them
was using values like
• the mean, maximum and minimum values of the columns.
• All these values are sensitive to outliers.
• If there are too many outliers in the data, they will influence the
mean and the max value or the min value.
• Thus, even if we scale this data using the above methods, we
cannot guarantee a balanced data with a normal distribution.
Robust Scaler
The Robust Scaler, as the name suggests is not sensitive to
outliers. This scaler-
1. removes the median from the data
2. scales the data by the Inter-Quartile Range (IQR)
• Q1 = median of the first half of the data (25th percentile)
• Q2 = the actual median
• Q3 = median of the second half of the data (75th percentile)
IQR = Q3 – Q1
x_scaled = (x – Q2)/(Q3 – Q1)
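A minimal sketch with scikit-learn's RobustScaler, which centres on the median and scales by the IQR, continuing from the base code:

from sklearn.preprocessing import RobustScaler

scaler = RobustScaler()   # subtracts the median, divides by IQR = Q3 - Q1
df_scaled[col_names] = scaler.fit_transform(features.values)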
Quantile Transformer Scaler
• The Quantile Transformer Scaler converts the variable
distribution to a normal distribution and scales it accordingly.
• A few important points regarding the Quantile Transformer Scaler:
• 1. It computes the cumulative distribution function of the variable
• 2. It uses this cdf to map the values to a normal distribution
• 3. Maps the obtained values to the desired output distribution using the
associated quantile function
Quantile Transformer Scaler
• This scaler changes the very distribution of the variables, so linear
relationships among variables may be destroyed by using this
scaler.
• Thus, it is best suited to non-linear data.
• It is most useful for larger datasets.
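A minimal sketch with scikit-learn's QuantileTransformer; n_quantiles is lowered here only because the toy DataFrame has very few rows:

from sklearn.preprocessing import QuantileTransformer

scaler = QuantileTransformer(output_distribution='normal',
                             n_quantiles=4,          # must not exceed the number of samples
                             random_state=42)
df_scaled[col_names] = scaler.fit_transform(features.values)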
Log Transform
• It is primarily used to convert a skewed distribution to a normal
distribution/less-skewed distribution.
• In this transform, we take the log of the values in a column and
use these values as the column instead.
• Why does it work? It is because the log function is equipped to
deal with large numbers. Here is an example-
• log(10) = 1
• log(100) = 2, and
• log(10000) = 4.
Log Transform
• Thus, in our example, the histogram of Income ranges from
1,800 to 1,20,000:
While our Income column had extreme values ranging from 1,800 to 1,20,000, the log values
now range from approximately 7.5 to 11.7. Thus, the log operation has a dual role:
• Reducing the impact of too-low values
• Reducing the impact of too-high values.
If our data has negative values, zeros or NaN, we cannot apply the log transform directly,
since the log of zero and of negative numbers is undefined (and the log of values between 0 and 1 is negative).
Original data and Log Transformed data
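A minimal sketch of the log transform on the Income column of the base-code DataFrame; np.log is the natural log, which gives the 7.5 to 11.7 range mentioned above:

import numpy as np

df['Income_log'] = np.log(df['Income'])   # ln(1800) ~ 7.5, ln(120000) ~ 11.7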
Power Transformer Scaler
• The Power Transformer also changes the distribution of the
variable, i.e. it makes it more Gaussian (normal).
• The Power Transformer automates the choice of transform
by introducing a parameter called lambda. It decides on a
generalized power transform by finding the best value of lambda
using either:
• 1. the Box-Cox transform (only +ve values), or
• 2. the Yeo-Johnson transform (both +ve and –ve values)
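A minimal sketch with scikit-learn's PowerTransformer; 'yeo-johnson' is the default and handles negative values, while 'box-cox' requires strictly positive data:

from sklearn.preprocessing import PowerTransformer

scaler = PowerTransformer(method='yeo-johnson')      # or method='box-cox' for +ve data only
df_scaled[col_names] = scaler.fit_transform(features.values)
print(scaler.lambdas_)                               # the fitted lambda per column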
Unit Vector Scaler/Normalizer
• Normalization is the process of scaling individual samples to
have unit norm
• The most interesting part is that unlike the other scalers which
work on the individual column values, the Normalizer works on
the rows
• Each row of the dataframe with at least one non-zero
component is rescaled independently of other samples so that
its norm (l1, l2, or inf) equals one.
Unit Vector Scaler/Normalizer
• If we are using the L1 norm, the values in each row are scaled so that
the sum of their absolute values along the row = 1
• If we are using the L2 norm, the values in each row are scaled so that
the sum of their squares along the row = 1
If you check the first row,
(0.999999)^2 + (0.001667)^2 = 1.000 (approx.)
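A minimal sketch with scikit-learn's Normalizer, which rescales each row (not each column) to unit norm:

from sklearn.preprocessing import Normalizer

scaler = Normalizer(norm='l2')                       # 'l1' and 'max' are also available
df_scaled[col_names] = scaler.fit_transform(features.values)
# each row of the scaled columns now has a sum of squares equal to 1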
Custom Transformer
• You can write your own transform as per the needs of the data.
• Suppose our feature transformation technique involves taking
log to the base 2 of the values. NumPy has a function
called log2 (and similarly log10), so let us use it.
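A minimal sketch of a custom transformer built from np.log2 using scikit-learn's FunctionTransformer; np.log10 can be swapped in the same way:

import numpy as np
from sklearn.preprocessing import FunctionTransformer

log2_transformer = FunctionTransformer(np.log2)
df_scaled[col_names] = log2_transformer.fit_transform(features.values)
# for log base 10, use FunctionTransformer(np.log10) instead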
Categorical Feature Handling
• Categorical data can be converted to numbers using Label Encoding (each
category is mapped to an integer) or One-Hot Encoding (each category becomes
its own binary column, e.g. Feature 0, Feature 1, Feature 2).
• With label encoding, there is a very high probability that the model captures a
relationship between countries such as India < Japan < the US, even though no
such order exists.
One-Hot Encoding vs Label Encoding
We apply One-Hot Encoding when:
1. The categorical feature is not ordinal (like the countries above).
2. The number of categories is small, so one-hot encoding can be effectively applied.
We apply Label Encoding when:
1. The categorical feature is ordinal (like Jr. KG, Sr. KG, primary school, high school).
2. The number of categories is quite large, as one-hot encoding can lead to high memory consumption.
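A minimal sketch of both encodings, using a hypothetical Country column to mirror the India/Japan/US example:

import pandas as pd
from sklearn.preprocessing import LabelEncoder

countries = pd.DataFrame({'Country': ['India', 'Japan', 'US', 'India']})   # hypothetical data

# Label Encoding: each category becomes an integer (implies an order!)
countries['Country_label'] = LabelEncoder().fit_transform(countries['Country'])

# One-Hot Encoding: one binary column per category
one_hot = pd.get_dummies(countries['Country'], prefix='Country')
countries = pd.concat([countries, one_hot], axis=1)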
How to Impute Missing Values for Categorical
Features?
• Strategy 1: Delete the missing observations.
• Strategy 2: Replace missing values with the most frequent value.
• Strategy 3: Delete the variable which has the missing values.
• Strategy 4: Develop a model to predict the missing values, as outlined below:
Read and Load the Encoded Dataset.
– Make missing records as our Testing data.
– Make non-missing records as our Training data.
– Separate Dependent and Independent variables.
– Fit our Logistic Regression model.
– Predict the class for missing records.
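A minimal sketch of Strategy 4 following the steps above, assuming an encoded loan DataFrame df_encoded in which the categorical column 'Gender' still has missing values and all other feature columns are numeric (these names are assumptions):

from sklearn.linear_model import LogisticRegression

target_col = 'Gender'                                          # assumed column with missing values
feature_cols = [c for c in df_encoded.columns if c != target_col]

test_data = df_encoded[df_encoded[target_col].isnull()]        # missing records -> testing data
train_data = df_encoded[df_encoded[target_col].notnull()]      # non-missing records -> training data

X_train, y_train = train_data[feature_cols], train_data[target_col]
X_test = test_data[feature_cols]

model = LogisticRegression(max_iter=1000)                      # fit our Logistic Regression model
model.fit(X_train, y_train)

# predict the class for the missing records and fill them in
df_encoded.loc[df_encoded[target_col].isnull(), target_col] = model.predict(X_test)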
Outlier detection and visualization
Scatter plot
Box Plot
Box plot is another very simple visualization tool to detect outliers;
it uses the Inter-Quartile Range (IQR) technique.
Outlier detection and Visualization
There are several ways to treat outliers in a dataset, depending on the
nature of the outliers and the problem being solved.
• Trimming
• It excludes the outlier values from our analysis. By applying this technique, the
dataset becomes smaller when many outliers are present. Its main
advantage is that it is the fastest approach.
• Capping
• In this technique, we cap the outlier values: we fix a limit, and every value
above (or below) that limit is considered an outlier and is replaced by the
limit value itself.
• Discretization
• In this technique, by making the groups, we include the outliers in a particular
group and force them to behave in the same manner as those of other points
in that group. This technique is also known as Binning.
Normal, Skewed and Other Distributions
• For normal distributions: use the empirical relations of the normal distribution.
Data points that fall below mean – 3*(sigma) or above mean + 3*(sigma) are outliers,
where mean and sigma are the average value and standard deviation of the particular column.
• For skewed distributions: use the Inter-Quartile Range (IQR) proximity rule.
Data points that fall below Q1 – 1.5 IQR or above Q3 + 1.5 IQR are outliers,
where Q1 and Q3 are the 25th and 75th percentiles of the dataset, respectively,
and IQR = Q3 – Q1.
• For other distributions: use a percentile-based approach. For example, data points
that are above the 99th percentile or below the 1st percentile are considered outliers.
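A minimal sketch of the IQR proximity rule (detection and capping) on the Income column of the base-code DataFrame:

q1 = df['Income'].quantile(0.25)
q3 = df['Income'].quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

outliers = df[(df['Income'] < lower) | (df['Income'] > upper)]     # detection (trimming drops these rows)
df['Income_capped'] = df['Income'].clip(lower=lower, upper=upper)  # capping at the limits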
Imbalanced data handling
Data imbalance Problem
• Classification problems are quite common in the machine learning world. As we
know in the classification problem we try to predict the class label by studying the
input data or predictor where the target or output variable is a categorical variable
in nature.
• Imbalanced data refers to those types of datasets where the target class has an
uneven distribution of observations, i.e one class label has a very high number of
observations and the other has a very low number of observations.
• understand imbalanced dataset handling with an example.
Data imbalance
• Let’s assume that XYZ is a bank that issues credit cards to its customers.
The bank is concerned that some fraudulent transactions are going on,
and when it checks its data it finds that for every 2,000
transactions there are only 30 recorded frauds.
• So, the number of frauds per 100 transactions is less than 2%; in other words,
more than 98% of the transactions are “No Fraud” in nature. Here, the class “No
Fraud” is called the majority class, and the much smaller “Fraud”
class is called the minority class.
Data imbalance
More examples of imbalanced data are:
• Disease diagnosis
• Customer churn prediction
• Fraud detection
• Natural disaster prediction
Data imbalance
• The main problem with imbalanced dataset prediction is: how
accurately are we actually predicting both the majority and minority
class?
• Let’s explain it with an example of disease diagnosis.
• Let’s assume we are going to predict disease from an existing dataset
where for every 100 records only 5 patients are diagnosed with the
disease. What are the majority class and the minority class?
• The majority class is 95% with no disease and the minority class is
only 5% with the disease.
• Now, a machine learning model might predict that all 100 out of 100 patients have
no disease.
Data imbalance
• Sometimes when the records of a certain class are much more than the other class, our
classifier may get biased towards the prediction.
• In this case, the confusion matrix for the classification problem shows how well our model
classifies the target classes and we arrive at the accuracy of the model from the confusion
matrix.
• It is calculated based on the total no of correct predictions by the model divided by the
total no of predictions. In the above case it is (0+95)/(0+95+0+5)=0.95 or 95%. It means
that the model fails to identify the minority class yet the accuracy score of the model will
be 95%.
Data imbalance
• Thus our traditional approach of classification and model
accuracy calculation is not useful in the case of the imbalanced
dataset.
ML Classifier with an Imbalanced Dataset
• Consider a classifier trained on a majority class of “healthy” patients and a
minority class of “heart disease” patients, and a new case to classify.
• If the classifier identifies the minority class poorly, i.e. more of this class is
wrongfully predicted as the majority class, then false negatives (FN) increase.
• If the classifier predicts the minority class but the prediction is erroneous,
then false positives (FP) increase and the precision metric will be low.
Approach to deal with the imbalanced dataset
problem
• In rare cases like fraud detection or disease prediction, it is vital
to identify the minority classes correctly. Techniques for the
same are
1. Choose Proper Evaluation Metric
• The accuracy of a classifier is the total number of correct predictions by the
classifier divided by the total number of predictions. This may be good
enough for a well-balanced class distribution but is not ideal for the imbalanced class
problem. Among the other metrics, precision is the measure of how
accurate the classifier’s predictions of a specific class are, and recall is the
measure of the classifier’s ability to identify a class.
• For an imbalanced class dataset, the F1 score is a more appropriate metric. It is
the harmonic mean of precision and recall:
F1 = 2 * (Precision * Recall) / (Precision + Recall)
Choose Proper Evaluation Metric
If the classifier predicts the minority class but the prediction is erroneous and false positives
increase, the precision metric will be low, and so will the F1 score. Also, if the classifier identifies
the minority class poorly, i.e. more of this class is wrongfully predicted as the majority class,
then false negatives will increase, so recall and the F1 score will be low. The F1 score only increases if
both the number and the quality of predictions improve.
F1 score keeps the balance between precision and recall and improves the score only if the
classifier identifies more of a certain class correctly.
Choose Proper Evaluation Metric
• Precision:
• the number of true positives divided by all positive predictions.
• Precision is also called Positive Predictive Value.
• It is a measure of a classifier’s exactness.
• Low precision indicates a high number of false positives.
• Recall:
• the number of true positives divided by the number of positive values in the test data.
• The recall is also called Sensitivity or the True Positive Rate. It is a measure of a classifier’s
completeness.
• Low recall indicates a high number of false negatives.
• F1 Score:
• the harmonic mean of precision and recall.
• Area Under ROC Curve (AUROC):
• AUROC represents the likelihood of your model distinguishing observations from two classes.
• In other words, if you randomly select one observation from each class, what’s the
probability that your model will be able to “rank” them correctly?
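A minimal sketch of computing these metrics with scikit-learn, assuming y_test holds the true labels and y_pred / y_proba are the model's predicted labels and class-1 probabilities:

from sklearn.metrics import precision_score, recall_score, f1_score, roc_auc_score

# y_test: true labels, y_pred: predicted labels, y_proba: predicted probability of class 1
print('Precision:', precision_score(y_test, y_pred))
print('Recall   :', recall_score(y_test, y_pred))
print('F1 score :', f1_score(y_test, y_pred))
print('AUROC    :', roc_auc_score(y_test, y_proba))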
2. Resampling (Oversampling and Under sampling)
• This technique is used to upsample or downsample the minority or majority class.
• When we are using an imbalanced dataset, we can oversample the minority class using sampling with replacement.
• This technique is called oversampling.
• Similarly, we can randomly delete rows from the majority class to match them with the minority class, which
is called undersampling.
• After resampling the data we get a balanced dataset for both majority and minority classes. So, when both
classes have a similar number of records present in the dataset, we can assume that the classifier will give
equal importance to both classes.
2. Resampling (Oversampling and Under sampling)
• An example of this technique using the sklearn library’s resample() is
shown Here, Is_Lead is our target variable. Let’s see the distribution
of the classes in the target.
2. Resampling (Oversampling and Under sampling)
It has been observed that our target class has an imbalance. So,
we’ll try to upsample the data so that the minority class
matches with the majority class.
from sklearn.utils import resample
#create two different dataframe of majority and minority class
df_majority = df_train[(df_train['Is_Lead']==0)]
df_minority = df_train[(df_train['Is_Lead']==1)]
# upsample minority class
df_minority_upsampled = resample(df_minority,
replace=True, # sample with replacement
n_samples= 131177, # to match majority class
random_state=42) # reproducible results
# Combine majority class with upsampled minority class
df_upsampled = pd.concat([df_minority_upsampled, df_majority])
After upsampling, the distribution of class is balanced as below –
2. Resampling (Oversampling and Under sampling)
sklearn.utils.resample can be used both to undersample the majority class and to oversample the minority class.
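A minimal sketch of undersampling with the same resample() utility, reusing the df_majority and df_minority DataFrames defined in the upsampling example above; the target size is taken from the minority class itself:

import pandas as pd
from sklearn.utils import resample

# randomly delete rows from the majority class to match the minority class
df_majority_downsampled = resample(df_majority,
                                   replace=False,                 # sample without replacement
                                   n_samples=len(df_minority),    # match minority class size
                                   random_state=42)               # reproducible results
df_downsampled = pd.concat([df_majority_downsampled, df_minority])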
3. SMOTE
• Synthetic Minority Oversampling Technique or SMOTE is another
technique to oversample the minority class.
• Simply adding duplicate records of the minority class often doesn’t add any new
information to the model.
• In SMOTE, new instances are synthesized from the existing data. In simple
words, SMOTE looks at minority class instances, uses k nearest neighbours to
select a random nearest neighbour, and creates a synthetic instance at a random
point in feature space between the two.
SMOTE algorithm works in 4 simple steps:
1. Choose a minority class as the input vector.
2. Find its k nearest neighbors (k_neighbors is specified as an
argument in the SMOTE() function).
3. Choose one of these neighbors and place a synthetic point
anywhere on the line joining the point under consideration and
its chosen neighbor.
4. Repeat the steps until the data is balanced.
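A minimal sketch using the SMOTE implementation from the imbalanced-learn package (installed separately as imbalanced-learn); X and y are assumed to be the numeric feature matrix and target of the same dataset:

from imblearn.over_sampling import SMOTE

smote = SMOTE(k_neighbors=5, random_state=42)   # k_neighbors as described in step 2
X_resampled, y_resampled = smote.fit_resample(X, y)
print(y_resampled.value_counts())               # the classes are now balanced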
Feature selection and Extraction
Feature Selection and extraction
• In real-life machine learning problems, it’s almost rare that all the
variables in the dataset are useful for building a model.
• Adding redundant variables reduces the model’s generalization
capability and may also reduce the overall accuracy of a classifier.
• Furthermore, adding more variables to a model increases the overall
complexity of the model.
Feature Selection Techniques in Machine Learning
• Feature selection is the process of selecting the subset
of the relevant features and leaving out the irrelevant
features present in a dataset to build a model of high
accuracy.
• In other words, it is a way of selecting the optimal
features from the input dataset.
• Three methods used for feature selection are:
1. Filter methods
2. Wrapper methods
3. Embedded methods
1. Filter Methods
• Features are selected on the basis of statistical measures.
• The filter method filters out irrelevant features and
redundant columns from the model by ranking them using different
metrics.
Some common techniques of Filter methods are
• Information Gain
• Chi-square Test
• Fisher’s Score
• Correlation Coefficient
• Variance Threshold
• Mean Absolute Difference (MAD)
2. Wrapper Methods
• In wrapper methodology, the selection of features is done by
considering it as a search problem, in which different combinations are
made, evaluated, and compared with other combinations.
• It trains the algorithm by using the subset of features iteratively.
• On the basis of the output of the model, features are added or
subtracted, and with this feature set, the model is trained again.
Some techniques of wrapper methods are
• Forward Feature Selection
• Backward Feature Elimination
• Exhaustive Feature Selection
• Recursive Feature Elimination
3. Embedded Methods
• Embedded methods combine the advantages of both filter
and wrapper methods by considering the interaction of
features along with low computational cost.
• These are fast processing methods similar to the filter
method but more accurate than the filter method.
• These methods are also iterative, which evaluates each
iteration, and optimally finds the most important features
that contribute the most to training in a particular iteration.
Some techniques of embedded methods are:
• Regularization (L1)
• Tree-based methods
Types of Feature Selection Methods in ML
• Filter Methods: Filter methods pick up the intrinsic properties of the features measured
via univariate statistics instead of cross-validation performance. These methods are faster
and less computationally expensive than wrapper methods. When dealing with
high-dimensional data, it is computationally cheaper to use filter methods.
Techniques: Information Gain, Chi-square Test, Fisher’s Score, Correlation Coefficient,
Variance Threshold, Mean Absolute Difference (MAD).
• Wrapper Methods: Wrappers require some method to search the space of all possible subsets
of features, assessing their quality by learning and evaluating a classifier with that feature
subset. The wrapper methods usually result in better predictive accuracy than filter methods.
Techniques: Forward Feature Selection, Backward Feature Elimination, Exhaustive Feature
Selection, Recursive Feature Elimination.
• Embedded Methods: These methods encompass the benefits of both the wrapper and filter
methods by including interactions of features while maintaining reasonable computational
costs. Embedded methods are iterative in the sense that they take care of each iteration of
the model training process and carefully extract those features which contribute the most
to the training for a particular iteration.
Techniques: Regularization (L1), Tree-based methods.
Information Gain
• Information gain calculates the
reduction in entropy from the
transformation of a dataset.
• It can be used for feature selection by
evaluating the Information gain of each
variable in the context of the target
variable.
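A minimal sketch using scikit-learn's mutual_info_classif, which estimates the information gain (mutual information) of each feature with respect to the target; X is assumed to be a numeric feature DataFrame and y the class labels:

import pandas as pd
from sklearn.feature_selection import mutual_info_classif

info_gain = mutual_info_classif(X, y, random_state=42)
print(pd.Series(info_gain, index=X.columns).sort_values(ascending=False))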
Chi-square Test
• The Chi-square test is used for categorical
features in a dataset.
• A chi-square test is a statistical test that is
used to compare observed and expected
results.
• The goal of this test is to identify whether a
disparity between actual and predicted data
is due to chance or to a link between the
variables under consideration.
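A minimal sketch of the chi-square test with SelectKBest; it expects non-negative (e.g. count-encoded or one-hot) features, and X, y are the same assumed feature matrix and labels as above:

from sklearn.feature_selection import SelectKBest, chi2

selector = SelectKBest(score_func=chi2, k=5)     # keep the 5 highest-scoring features
X_selected = selector.fit_transform(X, y)
print(selector.scores_)                          # chi-square score per feature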
Fisher’s Score
• Fisher’s Score is calculated as the ratio of between-class and within-class
variance. For the i-th feature it can be written as
F_i = Σ_j n_j (μ_ij – μ_i)^2 / Σ_j n_j ρ_ij
where μ_ij and ρ_ij are the mean and the variance of the i-th feature in the j-th
class, respectively, n_j is the number of instances in the j-th class, and μ_i is the
overall mean of the i-th feature.
• A higher Fisher’s Score implies the feature is more discriminative and valuable
for the study.
Correlation Coefficient
• A correlation coefficient is a numerical measure of some
type of correlation, meaning a statistical relationship
between two variables. It lies between -1 to +1.
• Good variables correlate highly with the target.
Furthermore, variables should be correlated with the
target but uncorrelated among themselves.
• Higher the correlation with the target variable better the
chances of the variable to be included in the model.
• We need to set an absolute value, say 0.5, as the
threshold for selecting the variables. If we find that the
predictor variables are correlated, we can drop the
variable with a lower correlation coefficient value than
the target variable.
Correlation coefficient formula
r = Σ (x_i – x̄)(y_i – ȳ) / √( Σ (x_i – x̄)^2 · Σ (y_i – ȳ)^2 )
• r = correlation coefficient
• x_i = values of the x-variable in a sample
• y_i = values of the y-variable in a sample
• x̄ and ȳ = means of x and y, respectively
Variance Threshold
• It removes all features whose variance
doesn’t meet some threshold.
• By default, it removes all zero-variance
features, i.e., features with the same value in all samples.
• The assumption made by this method is that higher-variance
features are likely to contain more information.
• The get_support method returns a Boolean vector where True
means the variable does not have zero variance.
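A minimal sketch with scikit-learn's VarianceThreshold; the 0.1 threshold is only an illustration, and X is again the assumed numeric feature DataFrame:

from sklearn.feature_selection import VarianceThreshold

selector = VarianceThreshold(threshold=0.1)      # default threshold is 0.0 (zero variance)
selector.fit(X)
print(selector.get_support())                    # True = variance above the threshold
X_reduced = X.loc[:, selector.get_support()]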
Mean Absolute Difference (MAD)
• The mean absolute difference
(MAD) computes the absolute
difference from the mean value.
• The higher the MAD, the higher
the discriminatory power.
• This method is similar to the variance
threshold method, but the difference
is that there is no squaring in MAD.
(Wrapper Methods)
Forward Feature Selection and Backward Feature Elimination
• Forward selection – This method is an iterative approach
where we initially start with an empty set of features and
keep adding the feature which best improves our model
after each iteration. We stop when the
addition of a new variable no longer improves the
performance of the model.
• Backward elimination – This method is also an iterative
approach where we initially start with all features and,
after each iteration, remove the least significant
feature. We stop when no improvement in
the performance of the model is observed after a
feature is removed.
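A minimal sketch of both directions using scikit-learn's SequentialFeatureSelector (available in scikit-learn 0.24 and later); the estimator and n_features_to_select are illustrative choices:

from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

estimator = LogisticRegression(max_iter=1000)

forward = SequentialFeatureSelector(estimator, n_features_to_select=5, direction='forward')
forward.fit(X, y)

backward = SequentialFeatureSelector(estimator, n_features_to_select=5, direction='backward')
backward.fit(X, y)

print(forward.get_support(), backward.get_support())   # True for the selected features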
Recursive Feature Elimination
• Given an external estimator that assigns weights to features (e.g., the
coefficients of a linear model), the goal of recursive feature
elimination (RFE) is to select features by recursively considering
smaller and smaller sets of features.
• First, the estimator is trained on the initial set of features, and each
feature’s importance is obtained either through a coef_ attribute or a
feature_importances_ attribute.
• Then, the least important features are pruned from the current set of
features. That procedure is recursively repeated on the pruned set
until the desired number of features to select is eventually reached
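A minimal sketch of RFE with a linear model as the external estimator; the number of features to keep is an illustrative choice:

from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

rfe = RFE(estimator=LogisticRegression(max_iter=1000), n_features_to_select=5)
rfe.fit(X, y)
print(rfe.support_)    # True for the selected features
print(rfe.ranking_)    # 1 = selected, higher values were eliminated earlier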
Exhaustive Feature Selection
• it tries every possible
combination of the variables
and returns the best-
performing subset.
• This can be computationally
expensive, especially with a large
number of features.
Embedded Methods
Regularization (L1)
• This method adds a penalty to different parameters of the machine
learning model to avoid over-fitting of the model.
• This approach of feature selection uses Lasso (L1 regularization) and
Elastic nets (L1 and L2 regularization). The penalty is applied over the
coefficients, thus bringing down some coefficients to zero. The
features having zero coefficient can be removed from the dataset.
• Lasso or L1 has the property that can shrink some of the coefficients
to zero. Therefore, that feature can be removed from the model.
• (Note: Ridge Regression allows coefficients to be very close to zero but never
actually zero)
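A minimal sketch of L1-based selection with Lasso wrapped in SelectFromModel; the alpha value is illustrative and would normally be tuned:

from sklearn.linear_model import Lasso
from sklearn.feature_selection import SelectFromModel

lasso = Lasso(alpha=0.01)                        # illustrative regularization strength
selector = SelectFromModel(lasso).fit(X, y)      # keeps features with non-zero coefficients
X_selected = selector.transform(X)
print(selector.get_support())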
Tree-based methods
• Methods such as Random Forest and Gradient
Boosting provide us with feature importances that we
can use to select features as well.
• Nodes with the greatest decrease in impurity
happen at the start of the trees, while nodes with
the least decrease in impurity occur at the end of
the trees.
• Thus, by pruning trees below a particular node,
we can create a subset of the most important
features.
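A minimal sketch of tree-based selection via Random Forest feature importances; X, y are the assumed feature DataFrame and target:

import pandas as pd
from sklearn.ensemble import RandomForestClassifier

forest = RandomForestClassifier(n_estimators=200, random_state=42).fit(X, y)
importances = pd.Series(forest.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False))  # keep the top-ranked features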
Feature Extraction techniques
• Feature Extraction aims to reduce the number of features in a dataset by
creating new features from the existing ones (and then discarding the
original features).
• These new reduced set of features should then be able to summarize most
of the information contained in the original set of features.
• In this way, a summarized version of the original features can be created
from a combination of the original set.
Techniques:
• PCA (Principle Components Analysis)
• ICA (Independent Component Analysis)
• LDA (Linear Discriminant Analysis)
• Auto encoders
Principle Components Analysis (PCA)
• PCA is one of the most widely used linear dimensionality reduction techniques.
• When using PCA, we take as input our original data and try to find a
combination of the input features which can best summarize the original
data distribution so that to reduce its original dimensions.
• PCA is able to do this by maximizing variance and minimizing the
reconstruction error based on pairwise distances.
• In PCA, our original data is projected into a set of orthogonal axes and each
of the axes gets ranked in order of importance.
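A minimal sketch of PCA with scikit-learn; standardizing first is common practice, and n_components=2 is an illustrative choice:

from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

X_std = StandardScaler().fit_transform(X)        # PCA is sensitive to feature scale
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_std)
print(pca.explained_variance_ratio_)             # variance captured by each component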
Independent Component Analysis (ICA)
• ICA is a linear dimensionality reduction method which takes as input data a
mixture of independent components and it aims to correctly identify each of
them (deleting all the unnecessary noise).
• Two input features can be considered independent if both their linear and not
linear dependence is equal to zero.
• Independent Component Analysis is commonly used in medical applications such
as EEG and fMRI analysis to separate useful signals from unhelpful ones.
• As a simple example of an ICA application, let’s consider we are given an audio
recording in which there are two different people talking.
• Using ICA we could, for example, try to identify the two different independent
components in the recording (the two different people).
• In this way, we could make our unsupervised learning algorithm recognize
between the different speakers in the conversation.
Linear Discriminant Analysis (LDA)
• LDA aims to maximize the distance between the mean of each class and
minimize the spreading within the class itself.
• LDA uses therefore within classes and between classes as measures. This is
a good choice because maximizing the distance between the means of
each class when projecting the data in a lower-dimensional space can lead
to better classification results
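A minimal sketch of LDA as a supervised dimensionality reduction step; n_components can be at most (number of classes – 1), so 1 is used here as an illustration for a two-class problem:

from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

lda = LinearDiscriminantAnalysis(n_components=1)   # at most n_classes - 1 components
X_lda = lda.fit_transform(X, y)                    # uses the class labels y, unlike PCA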
Autoencoder
• Autoencoders are a family of Machine Learning algorithms which can
be used as a dimensionality reduction technique.
• The main difference between Autoencoders and other dimensionality
reduction techniques is that Autoencoders use non-linear
transformations to project data from a high dimension to a lower one.
• There exist different types of Autoencoders, such as:
1. Denoising Autoencoder
2. Variational Autoencoder
3. Convolutional Autoencoder
4. Sparse Autoencoder
Autoencoder
1. Encoder: takes the input data and compresses it,
so as to remove all the possible noise and
unhelpful information. The output of the
Encoder stage is usually called the bottleneck or
latent space.
2. Decoder: takes as input the encoded latent
space and tries to reproduce the original
Autoencoder input using just its compressed
form (the encoded latent space).