UNIT-4: FEATURE
ENGINEERING
Prof. Atmiya Patel
Feature Transformation
■ It transforms features so that they conform to the assumptions of a model.
■ Important tool for dimensionality reduction.
■ Two goals of feature transformation:
– Achieving best reconstruction in the original features.
– Achieving highest efficiency in the learning task.
■ It can be applied to numeric as well as non-numeric features (e.g., text and
images).
Feature Construction
■ The process discovers missing information about the relationships
between features and expands the feature space by creating
additional features.
■ It adds more features to the data set.
■ Techniques:
– Quantization or Binning
– Log Transform
– Feature Scaling or Normalization etc...
Quantization or Binning
■ The original data values which fall into a given small interval, a bin,
are replaced by a value representative of that interval, often the
central value. It is a form of quantization.
■ Statistical data binning is a way to group a range of more or less
continuous values into a smaller number of "bins".
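■ A minimal sketch of fixed-width binning, assuming pandas is available; the ages, bin edges and labels below are illustrative:

import pandas as pd

# Illustrative ages; each value is replaced by the bin (interval) it falls into.
ages = pd.Series([3, 17, 25, 34, 51, 62, 78])
binned = pd.cut(ages, bins=[0, 18, 40, 60, 100],
                labels=["child", "young adult", "middle-aged", "senior"])
print(binned.tolist())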
Log Transform
■ A powerful tool for dealing with large positive numbers that have a heavy-tailed
distribution.
■ It compresses the long tail in the high end of the distribution into a
shorter tail and expands the low end into a longer head.
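■ A small sketch, assuming NumPy; the values are illustrative of a heavy-tailed positive feature:

import numpy as np

x = np.array([1, 10, 100, 1000, 100000], dtype=float)
# log1p(x) = log(1 + x) compresses the long right tail and spreads out the low end.
x_log = np.log1p(x)
print(x_log)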
Feature Scaling / Normalization
■ Some features are bounded in value, while other numeric features increase
without bound; models can be affected by the scale of the input.
■ If the model is sensitive to the scale of input features, feature scaling
could help.
■ It is also called feature normalization.
■ It is done individually for each feature.
Min-Max Scaling
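■ Min-max scaling rescales each feature to the [0, 1] range using x' = (x - min) / (max - min). A minimal sketch, assuming NumPy and illustrative values:

import numpy as np

x = np.array([2.0, 5.0, 10.0, 20.0])
# Subtract the minimum and divide by the range so the result lies in [0, 1].
x_scaled = (x - x.min()) / (x.max() - x.min())
print(x_scaled)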
Variance Scaling
■ Standardization (or Z-Score normalization or Variance scaling) scales
the values taking standard deviation of the features into account.
■ The mean of the feature is subtracted from each value, and the result is
divided by the standard deviation of the feature.
■ The resulting feature has a mean 0 and a standard deviation of 1.
■ If the original feature follows a normal distribution, then the scaled feature
also follows a normal distribution.
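■ A minimal sketch using scikit-learn's StandardScaler; the height values are illustrative:

import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[160.0], [170.0], [180.0], [190.0]])
# Subtract the mean and divide by the standard deviation of the feature.
X_std = StandardScaler().fit_transform(X)
print(X_std.mean(), X_std.std())   # approximately 0 and 1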
l2 Normalization
■ It normalizes the original feature values by the l2 norm, which is also known as
the Euclidean norm.
■ The l2 norm measures the length of the vector in coordinate space.
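■ A minimal sketch for a single vector, assuming NumPy; the values are illustrative:

import numpy as np

x = np.array([3.0, 4.0])
l2 = np.sqrt(np.sum(x ** 2))       # Euclidean (l2) norm = 5.0
x_normalized = x / l2              # resulting vector has length 1
print(x_normalized, np.linalg.norm(x_normalized))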
Encoding Categorical Variables
■ A categorical variable is used to represent categories or labels.
■ Large categorical variables are particularly common in transactional
records, such as IP addresses.
■ Even though user IDs and IP addresses are numeric, their magnitude is
not relevant to the task.
■ The IP address might be relevant when doing fraud detection on
individual transactions.
■ The categories of a categorical variable are usually not numeric, so an
encoding method is needed to turn these non-numeric categories into
numbers.
One-Hot Encoding
■ It creates new (binary) columns, indicating the presence of each
possible value from the original data.
■ Each bit represents a possible category.
■ One-Hot Encoding is simple, but it uses more bits than are strictly
necessary.
■ The sum of all the bits must be equal to 1.
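■ A minimal sketch with pandas; the colour values are illustrative:

import pandas as pd

colours = pd.Series(["Black", "Brown", "Blue"])
# One binary column per category; exactly one bit is 1 in each row.
one_hot = pd.get_dummies(colours, dtype=int)
print(one_hot)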
Dummy Coding
■ The Problem with One-Hot encoding is that it allows for k degrees of
freedom, while the variable itself needs only k-1.
■ Dummy coding removes the extra degree of freedom by using only k-
1 features in the representation.
■ One feature is disregarded and is represented by the vector of all
zeros. This is known as the reference category.
■ In the example with colour categories “Black”, “Brown” and “Blue”, the column
“Blue” is deleted and acts as the reference category.
■ A row with both “Black” and “Brown” as 0 then means that “Blue” must be 1.
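■ A minimal sketch mirroring the colour example above, assuming pandas; dropping “Blue” makes it the reference category:

import pandas as pd

colours = pd.Series(["Black", "Brown", "Blue"])
# Encode all categories, then drop "Blue"; a row of all zeros now means "Blue".
dummies = pd.get_dummies(colours, dtype=int).drop(columns=["Blue"])
print(dummies)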
Feature Hashing
■ Large categorical features, such as user ID, website URL, IP address
etc., pose computation challenges in terms of memory efficiency and
storage.
■ To overcome this problem, Feature Hashing is used; it makes working with
large categorical variables less computation-intensive and yet produces
accurate models that are fast to train.
■ Hashing, in general, is the process of taking input information of any length
and producing a fixed-length representation of that input.
■ This representation is called the message digest (or hash value)
corresponding to the input information.
■ It can be used in several different domains such as information
security, cryptocurrency, high-performance programming and for
creating quick lookup tables.
■ In machine learning, hash functions can be constructed for any object
that can be represented numerically, such as numbers, strings, complex
structures, etc.
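■ A minimal sketch using scikit-learn's FeatureHasher; the records and the choice of 8 hash columns are illustrative:

from sklearn.feature_extraction import FeatureHasher

records = [{"ip": "192.168.1.10"}, {"ip": "10.0.0.7"}, {"ip": "192.168.1.10"}]
# Each category is hashed into a fixed number of columns instead of one column
# per distinct value; collisions are possible, but memory use stays bounded.
hasher = FeatureHasher(n_features=8, input_type="dict")
X = hasher.transform(records)
print(X.toarray())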
Handling Textual Features
■ We often need to apply machine learning to textual features such as product
reviews, comments, story lines, news reports, etc.
■ List of techniques (a brief sketch of both follows the list):
– Bag-of-Words
– Bag-of-n-Grams
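■ A minimal sketch of both techniques using scikit-learn's CountVectorizer; the reviews are illustrative:

from sklearn.feature_extraction.text import CountVectorizer

reviews = ["good product", "not a good product"]
# Bag-of-Words: one count column per word.
bow = CountVectorizer()
print(bow.fit_transform(reviews).toarray(), bow.get_feature_names_out())
# Bag-of-n-Grams: also count contiguous word pairs (bigrams).
bigrams = CountVectorizer(ngram_range=(1, 2))
bigrams.fit(reviews)
print(bigrams.get_feature_names_out())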
Feature Extraction
■ It is the process of extracting or creating a new set of features from the
current dataset using some functional mapping.
■ It is used for dimensionality reduction.
■ It can be done in a supervised or an unsupervised manner.
■ Popular methods for the feature extractions are:
– Principal Components Analysis (PCA)
– Singular Value Decomposition (SVD)
– Linear Discriminant Analysis (LDA)
■ PCA and LDA are both linear projection methods; PCA is an unsupervised
method, while LDA is a supervised method.
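■ A minimal sketch of PCA with scikit-learn; the 3-feature data points are illustrative:

import numpy as np
from sklearn.decomposition import PCA

X = np.array([[2.5, 2.4, 1.0],
              [0.5, 0.7, 0.2],
              [2.2, 2.9, 1.1],
              [1.9, 2.2, 0.9]])
# Project the data onto its 2 principal components (unsupervised linear projection).
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)
print(X_reduced.shape, pca.explained_variance_ratio_)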
Feature Subset Selection
■ Feature selection technique discards unnecessary features to reduce
the complexity of the resulting model.
■ It is a similar activity to dimensionality reduction.
■ The goal is a parsimonious model that is fast to compute, with little or no
degradation in predictive accuracy.
Key Drivers of Feature Selection
■ Which features should be selected?
■ Which features should be excluded?
■ There are two key drivers for selecting features:
– Feature Relevance
– Feature Redundancy
Feature Relevance
■ Any feature, which is irrelevant in the context of machine learning
task on hand, is a potential candidate for rejection when selecting
subset of features.
■ This is decided on a case-by-case basis.
■ For example, in a data set used for age prediction, the “Name” feature is the
most irrelevant one and can be rejected.
Feature Redundancy
■ A feature may contribute information which is similar to the
information contributed by one or more other features in the same data
set.
■ All features having potential redundancy are candidates for rejection
in the final feature subset.
■ For example, “Site Length”, “Site Breadth” and “Site Area” all reveal the
dimensions of the site, so the redundant ones can be removed.
Overall Feature Selection Process
■ Generation of possible subsets.
■ Subset evaluation.
■ Stopping the search based on some stopping criterion.
■ Validation of the result with respect to the chosen subset.
Feature Selection Approaches
1. Filter:- Features are pre-processed to remove the ones that are
unlikely to be useful for the model (a small filter-style sketch follows this list).
2. Wrapper:- Tries out different subsets of features and evaluates each with the model.
3. Hybrid:- Takes the advantages of both the filter and wrapper approaches.
4. Embedded:- Performs feature selection as part of the model training
process.
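■ A minimal sketch of the filter approach using scikit-learn's SelectKBest on the built-in iris data; the choice of k=2 and the ANOVA F-test are illustrative:

from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif

X, y = load_iris(return_X_y=True)
# Score each feature against the target and keep the 2 best, independently of any model.
X_selected = SelectKBest(score_func=f_classif, k=2).fit_transform(X, y)
print(X_selected.shape)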
Thank you…