UNIT-4: FEATURE
ENGINEERING
Prof. Atmiya Patel
Feature Transformation
■ It transforms features so that they conform to the assumptions of a model.
■ Important tool for dimensionality reduction.
■ Two goals of feature transformation:
– Achieving best reconstruction in the original features.
– Achieving highest efficiency in the learning task.
■ It can be applied to numeric as well as non-numeric features (e.g., text and
images).
Feature Construction
■ The process discovers missing information about the relationships
between features and expands the feature space by creating
additional features.
■ It adds more features to the data set.
■ Techniques:
– Quantization or Binning
– Log Transform
– Feature Scaling or Normalization etc...
Quantization or Binning
■ The original data values which fall into a given small interval, a bin,
are replaced by a value representative of that interval, often the
central value. It is a form of quantization.
■ Statistical data binning is a way to group a range of more or less
continuous values into a smaller number of "bins".
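■ A minimal sketch of fixed-width binning, assuming pandas is available; the ages, bin edges and labels below are illustrative:

import pandas as pd

# Illustrative ages; each value is replaced by the bin (interval) it falls into.
ages = pd.Series([3, 17, 25, 34, 51, 62, 78])
binned = pd.cut(ages, bins=[0, 18, 40, 60, 100],
                labels=["child", "young adult", "middle-aged", "senior"])
print(binned.tolist())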
Log Transform
■ A powerful tool for dealing with large positive numbers that have a heavy-tailed
distribution.
■ It compresses the long tail in the high end of the distribution into a
shorter tail and expands the low end into a longer head.
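■ A small sketch, assuming NumPy; the values are illustrative of a heavy-tailed positive feature:

import numpy as np

x = np.array([1, 10, 100, 1000, 100000], dtype=float)
# log1p(x) = log(1 + x) compresses the long right tail and spreads out the low end.
x_log = np.log1p(x)
print(x_log)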
Feature Scaling / Normalization
■ Some features are bounded in value, while other numeric features increase
without bound; models can be affected by the scale of the input.
■ If the model is sensitive to the scale of input features, feature scaling
could help.
■ It is also called feature normalization.
■ It is done individually for each feature.
Min-Max Scaling
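■ Min-max scaling rescales each feature to the [0, 1] range using x' = (x - min) / (max - min). A minimal sketch, assuming NumPy and illustrative values:

import numpy as np

x = np.array([2.0, 5.0, 10.0, 20.0])
# Subtract the minimum and divide by the range so the result lies in [0, 1].
x_scaled = (x - x.min()) / (x.max() - x.min())
print(x_scaled)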
Variance Scaling
■ Standardization (or Z-Score normalization or Variance scaling) scales
the values taking standard deviation of the features into account.
■ The mean of the feature is subtracted from each value, and the result is
divided by the standard deviation of the feature.
■ The resulting feature has a mean 0 and a standard deviation of 1.
■ If the original feature follows a normal distribution, then the scaled feature
also follows a normal distribution.
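■ A minimal sketch using scikit-learn's StandardScaler; the height values are illustrative:

import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[160.0], [170.0], [180.0], [190.0]])
# Subtract the mean and divide by the standard deviation of the feature.
X_std = StandardScaler().fit_transform(X)
print(X_std.mean(), X_std.std())   # approximately 0 and 1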
l2 Normalization
■ It normalizes the original feature values by the l2 norm, which is also known as
the Euclidean norm.
■ The l2 norm measures the length of the vector in coordinate space.
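■ A minimal sketch for a single vector, assuming NumPy; the values are illustrative:

import numpy as np

x = np.array([3.0, 4.0])
l2 = np.sqrt(np.sum(x ** 2))       # Euclidean (l2) norm = 5.0
x_normalized = x / l2              # resulting vector has length 1
print(x_normalized, np.linalg.norm(x_normalized))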
Encoding Categorical Variables
■ A categorical variable is used to represent categories or labels.
■ Large categorical variables are particularly common in transactional
records, such as IP addresses.
■ Even though user IDs and IP addresses are numeric, their magnitude is
not relevant to the task.
■ The IP address might be relevant when doing fraud detection on
individual transactions.
■ The categories of a categorical variable are usually not numeric, so an
encoding method is needed to turn these non-numeric categories into
numbers.
One-Hot Encoding
■ It creates new (binary) columns, indicating the presence of each
possible value from the original data.
■ Each bit represents a possible category.
■ One-Hot Encoding is simple, but it uses more bits than are strictly
necessary.
■ The sum of all the bits must be equal to 1.
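■ A minimal sketch with pandas; the colour values are illustrative:

import pandas as pd

colours = pd.Series(["Black", "Brown", "Blue"])
# One binary column per category; exactly one bit is 1 in each row.
one_hot = pd.get_dummies(colours, dtype=int)
print(one_hot)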
Dummy Coding
■ The Problem with One-Hot encoding is that it allows for k degrees of
freedom, while the variable itself needs only k-1.
■ Dummy coding removes the extra degree of freedom by using only k-
1 features in the representation.
■ One feature is disregarded and is represented by the vector of all
zeros. This is known as the reference category.
■ In the example with colour categories “Black”, “Brown” and “Blue”, the column
“Blue” is deleted and acts as the reference category.
■ A row with both “Black” and “Brown” as 0 then means that “Blue” must be 1.
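■ A minimal sketch mirroring the colour example above, assuming pandas; dropping “Blue” makes it the reference category:

import pandas as pd

colours = pd.Series(["Black", "Brown", "Blue"])
# Encode all categories, then drop "Blue"; a row of all zeros now means "Blue".
dummies = pd.get_dummies(colours, dtype=int).drop(columns=["Blue"])
print(dummies)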
Feature Hashing
■ Large categorical features, such as user ID, website URL, IP address
etc., pose computation challenges in terms of memory efficiency and
storage.
■ To overcome this problem, Feature Hashing is used; it makes working with
large categorical variables less computation-intensive and yet produces
accurate models that are fast to train.
■ Hashing, in general, is the process of taking input information of any length
and producing a fixed-length representation of that input.
■ This representation is called the message digest (or hash value)
corresponding to the input information.
■ It can be used in several different domains such as information
security, cryptocurrency, high-performance programming and for
creating quick lookup tables.
■ In machine learning, hash functions can be constructed for any object
that can be represented numerically, such as numbers, strings, complex
structures, etc.
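■ A minimal sketch using scikit-learn's FeatureHasher; the records and the choice of 8 hash columns are illustrative:

from sklearn.feature_extraction import FeatureHasher

records = [{"ip": "192.168.1.10"}, {"ip": "10.0.0.7"}, {"ip": "192.168.1.10"}]
# Each category is hashed into a fixed number of columns instead of one column
# per distinct value; collisions are possible, but memory use stays bounded.
hasher = FeatureHasher(n_features=8, input_type="dict")
X = hasher.transform(records)
print(X.toarray())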
Handling Textual Features
■ We often need to apply machine learning to textual features such as product
reviews, comments, story lines, news reports, etc.
■ List of techniques (a brief sketch of both follows the list):
– Bag-of-Words
– Bag-of-n-Grams
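■ A minimal sketch of both techniques using scikit-learn's CountVectorizer; the reviews are illustrative:

from sklearn.feature_extraction.text import CountVectorizer

reviews = ["good product", "not a good product"]
# Bag-of-Words: one count column per word.
bow = CountVectorizer()
print(bow.fit_transform(reviews).toarray(), bow.get_feature_names_out())
# Bag-of-n-Grams: also count contiguous word pairs (bigrams).
bigrams = CountVectorizer(ngram_range=(1, 2))
bigrams.fit(reviews)
print(bigrams.get_feature_names_out())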
Feature Extraction
■ It is the process of extracting or creating a new set of features from the
current dataset using some functional mapping.
■ It is used for dimensionality reduction.
■ It can be done in a supervised or an unsupervised manner.
■ Popular methods for the feature extractions are:
– Principal Components Analysis (PCA)
– Singular Value Decomposition (SVD)
– Linear Discriminant Analysis (LDA)
■ PCA and LDA are both linear projection methods; PCA is an unsupervised
method, while LDA is a supervised method.
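■ A minimal sketch of PCA with scikit-learn; the 3-feature data points are illustrative:

import numpy as np
from sklearn.decomposition import PCA

X = np.array([[2.5, 2.4, 1.0],
              [0.5, 0.7, 0.2],
              [2.2, 2.9, 1.1],
              [1.9, 2.2, 0.9]])
# Project the data onto its 2 principal components (unsupervised linear projection).
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)
print(X_reduced.shape, pca.explained_variance_ratio_)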
Feature Subset Selection
■ Feature selection technique discards unnecessary features to reduce
the complexity of the resulting model.
■ It is a similar activity to dimensionality reduction.
■ The goal is a parsimonious model that is fast to compute, with little or no
degradation in predictive accuracy.
Key Drivers of Feature Selection
■ Which features should be selected?
■ Which features should be excluded?
■ There are two key drivers for selecting features:
– Feature Relevance
– Feature Redundancy
Feature Relevance
■ Any feature, which is irrelevant in the context of machine learning
task on hand, is a potential candidate for rejection when selecting
subset of features.
■ This is decided on a case-by-case basis.
■ For example, in a data set used for age prediction, the “Name” feature is the
most irrelevant one and can be rejected.
Feature Redundancy
■ A feature may contribute information which is similar to the
information contributed by one or more other features in the same data
set.
■ All features having potential redundancy are candidates for rejection
in the final feature subset.
■ For example, “Site Length”, “Site Breadth” and “Site Area” all reveal the
dimensions of the site, so the redundant ones can be removed.
Overall Feature Selection Process
■ Generation of possible subsets.
■ Subset evaluation.
■ Stopping the search based on some stopping criterion.
■ Validation of the result with respect to the chosen subset.
Feature Selection Approaches
1. Filter:- Features are pre-processed to remove the ones that are
unlikely to be useful for the model (a small filter-style sketch follows this list).
2. Wrapper:- Tries out different subsets of features and evaluates each with the model.
3. Hybrid:- Takes the advantages of both the filter and wrapper approaches.
4. Embedded:- Performs feature selection as part of the model training
process.
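■ A minimal sketch of the filter approach using scikit-learn's SelectKBest on the built-in iris data; the choice of k=2 and the ANOVA F-test are illustrative:

from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif

X, y = load_iris(return_X_y=True)
# Score each feature against the target and keep the 2 best, independently of any model.
X_selected = SelectKBest(score_func=f_classif, k=2).fit_transform(X, y)
print(X_selected.shape)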
Thank you…