Data Preprocessing
Dr. Indu Joshi
Assistant Professor at
Indian Institute of Technology Mandi
8 August 2025
What is Data?
• Collection of data objects and their attributes
• An attribute is a property or characteristic of an object
• Examples: eye color of a person, temperature, etc.
• Attribute is also known as variable, field, characteristic, or
feature
• A collection of attributes describe an object
• Object is also known as record, point, case, sample, entity, or
instance
Data: Example
Attributes
Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes
Types of Attributes
• There are different types of attributes
– Nominal (Name only, no order)
• Examples: ID numbers, eye color, zip codes
– Ordinal (Order matters, but not the exact difference)
• Examples: rankings (e.g., taste of potato chips on a scale from
1–10), grades, height in {tall, medium, short}
– Numeric (Interval-Scaled) (Order exists, No true zero
point)
• Examples: calendar dates, temperatures in Celsius or
Fahrenheit
– Numeric (Ratio-Scaled) (Ratios matter because zero is
absolute)
• Examples: temperature in Kelvin, length, time, counts
Properties of Attribute Values
• The type of an attribute depends on which of the following
properties it possesses:
• Distinctness: =, ≠
• Order: <, >
• Addition: +, −
• Multiplication: ∗, /
Nominal attribute: distinctness
Ordinal attribute: distinctness & order
Interval attribute: distinctness, order & addition
Ratio attribute: all 4 properties
Attribute Types
Nominal: The values of a nominal attribute are just different names, i.e., nominal attributes provide only enough information to distinguish one object from another. (=, ≠)
  Examples: zip codes, employee ID numbers, eye color, sex: {male, female}
Ordinal: The values of an ordinal attribute provide enough information to order objects (<, >).
  Examples: hardness of minerals, {good, better, best}, grades, street numbers
Attribute Types
Interval: For interval attributes, the differences between values are meaningful, i.e., a unit of measurement exists. (+, −)
  Examples: calendar dates, temperature in Celsius or Fahrenheit
Ratio: For ratio variables, both differences and ratios are meaningful. (∗, /)
  Examples: temperature in Kelvin, monetary quantities, counts, age, mass, length, electrical current
Discrete and Continuous Attributes
• Discrete Attribute
• Has only a finite or countably infinite set of values
• Examples: zip codes, counts, or the set of words in a collection
of documents
• Often represented as integer variables.
• Note: binary attributes are a special case of discrete attributes
• Continuous Attribute
• Has real numbers as attribute values
• Examples: temperature, height, or weight.
• Practically, real values can only be measured and represented
using a finite number of digits.
• Continuous attributes are typically represented as
floating-point variables.
Types of data sets
• Record
• Data Matrix
• Document Data
• Transaction Data
• Graph
• World Wide Web
• Molecular Structures
• Ordered
• Spatial Data
• Temporal Data
• Sequential Data
• Genetic Sequence Data
Record Data
• Data that consists of a collection of records, each of which
consists of a fixed set of attributes
Tid Refund Marital Status Taxable Income Cheat
1 Yes Single 125K No
2 No Married 100K No
3 No Single 70K No
4 Yes Married 120K No
5 No Divorced 95K Yes
6 No Married 60K No
7 Yes Divorced 220K No
8 No Single 85K Yes
9 No Married 75K No
10 No Single 90K Yes
Data Matrix
• If data objects have the same fixed set of numeric attributes,
then the data objects can be thought of as points in a
multi-dimensional space, where each dimension represents a
distinct attribute.
• Such a data set can be represented by an m × n matrix, where
there are m rows, one for each object, and n columns, one for
each attribute.
Projection of x Load  Projection of y Load  Distance  Load  Thickness
10.23                 5.27                  15.22     2.7   1.2
12.65                 6.25                  16.22     2.2   1.1
Text Data
• Each document becomes a ‘term’ vector,
• each term is a component (attribute) of the vector,
• the value of each component is the number of times the
corresponding term occurs in the document.
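A minimal sketch of building such a term vector in Python (the vocabulary and document here are made up for illustration):

```python
from collections import Counter

def term_vector(doc, vocabulary):
    """Count how often each vocabulary term occurs in the document."""
    counts = Counter(doc.lower().split())
    return [counts[term] for term in vocabulary]

vocab = ["team", "coach", "play", "ball", "score"]
doc = "play ball play score"
print(term_vector(doc, vocab))  # [0, 0, 2, 1, 1]
```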
Graph Data
• Examples: Facebook graph and HTML Links
Transaction Data
• A special type of record data, where
• each record (transaction) involves a set of items.
• For example, consider a grocery store. The set of products
purchased by a customer during one shopping trip constitute a
transaction, while the individual products that were purchased
are the items.
TID Items
1 Bread, Coke, Milk
2 Beer, Bread
3 Beer, Coke, Diaper, Milk
4 Beer, Bread, Diaper, Milk
5 Coke, Diaper, Milk
Ordered Data
• Genomic sequence data
GGTTCGCCTTCAGCCCGGGC
CGCAGGGCCCGCCGCGGTCC
GAGAAGGGCGCCTGCCGTGC
GGGGAGGGGCCCGCGGAGGG
CCAACCGAGTCGACAGTGGC
CCCTCTGCTTAGACCTGAGG
GCTCATTAGGCCGAGGCTGG
GCCAAGTAGAACGGGCCAGG
TGGGTCGCCGCGGACCAGGG
Data Quality
• What kinds of data quality problems?
• How can we detect problems with the data?
• What can we do about these problems?
• Examples of data quality problems:
• Noise and outliers
• Missing values
• Duplicate data
Noise
• Noise refers to modification of original values
• Examples: distortion of a person’s voice when talking on a
poor phone line and “snow” (Random white and black dots)
on television screen
Outliers
• Outliers are data objects with characteristics that are
considerably different than most of the other data objects in
the data set
Missing Values
• Reasons for missing values
• Information is not collected (e.g., people decline to give their
age and weight)
• Attributes may not be applicable to all cases (e.g., annual
income is not applicable to children)
• Handling missing values
• Eliminate Data Objects
• Estimate Missing Values
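Both strategies can be sketched in a few lines of Python; `handle_missing` and the use of None as a missing-value marker are illustrative, not a standard API:

```python
def handle_missing(rows, strategy="drop"):
    """rows: list of attribute dicts; None marks a missing value."""
    if strategy == "drop":
        # Eliminate data objects that have any missing value
        return [r for r in rows if None not in r.values()]
    # Estimate missing values with the attribute mean (mean imputation)
    keys = list(rows[0].keys())
    means = {}
    for k in keys:
        observed = [r[k] for r in rows if r[k] is not None]
        means[k] = sum(observed) / len(observed)
    return [{k: (means[k] if r[k] is None else r[k]) for k in keys} for r in rows]

data = [{"age": 25, "weight": 70},
        {"age": None, "weight": 80},
        {"age": 35, "weight": None}]
print(handle_missing(data))          # keeps only the complete first object
print(handle_missing(data, "mean"))  # fills age with 30.0, weight with 75.0
```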
Duplicate Data
• Data set may include data objects that are duplicates, or
almost duplicates of one another
• Major issue when merging data from heterogeneous sources
• Examples:
• Same person with multiple email addresses
• Data cleaning
• Process of dealing with duplicate data issues
Data Preprocessing
• Aggregation
• Sampling
• Dimensionality Reduction
• Feature subset selection
• Feature creation
• Discretization and Binarization
• Attribute Transformation
Aggregation
• Combining two or more attributes (or objects) into a single
attribute (or object)
• Example: instead of keeping both height and weight attributes, combine them into body mass index (BMI), or combine daily sales data into monthly sales data
• Purpose
• Data reduction
• Reduce the number of attributes or objects
• Change of scale
• Cities aggregated into regions, states, countries, etc
• More “stable” data
• Aggregated data tends to have less variability
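The daily-to-monthly example can be sketched as follows; the sales records are made up for illustration:

```python
from collections import defaultdict

# Hypothetical (date, sales) records; keying on the "YYYY-MM" prefix
# aggregates daily sales into monthly sales.
daily_sales = [("2025-01-03", 120.0), ("2025-01-17", 80.0), ("2025-02-02", 200.0)]

monthly = defaultdict(float)
for date, amount in daily_sales:
    monthly[date[:7]] += amount

print(dict(monthly))  # {'2025-01': 200.0, '2025-02': 200.0}
```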
Sampling
• Sampling is the main technique employed for data selection.
• It is often used for both the preliminary investigation of the
data and the final data analysis.
• Statisticians sample because obtaining the entire set of
data of interest is too expensive or time consuming.
• Sampling is used in data mining because processing the
entire set of data of interest is too expensive or time
consuming.
Sample Size
[Figure: the same data set sampled at 8000, 2000, and 500 points]
Sampling . . .
• The key principle for effective sampling is the following:
• using a sample will work almost as well as using the entire
data set, if the sample is representative
• A sample is representative if it has approximately the same
property (of interest) as the original set of data
Types of Sampling
• Simple Random Sampling
There is an equal probability of selecting any particular item
• Sampling without replacement
As each item is selected, it is removed from the population
• Sampling with replacement
Objects are not removed from the population as they are
selected for the sample.
• In sampling with replacement, the same object can be picked
up more than once
• Stratified sampling
Split the data into several partitions; then draw random
samples from each partition
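The three schemes can be sketched with the standard library; the even/odd strata here are an arbitrary illustration:

```python
import random
from collections import defaultdict

random.seed(0)  # fixed seed for a repeatable demonstration
population = list(range(100))

# Simple random sampling without replacement: no object appears twice
without_repl = random.sample(population, 10)

# Sampling with replacement: the same object can be picked more than once
with_repl = random.choices(population, k=10)

# Stratified sampling: partition the data, then sample within each partition
strata = defaultdict(list)
for x in population:
    strata["even" if x % 2 == 0 else "odd"].append(x)
stratified = [x for group in strata.values() for x in random.sample(group, 5)]

print(len(without_repl), len(set(without_repl)))  # 10 10
```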
Curse of Dimensionality
• When dimensionality increases, data becomes increasingly
sparse in the space that it occupies
• Definitions of density and distance between points, which are
critical for clustering and outlier detection, become less
meaningful
Dimensionality Reduction
Purpose:
• Avoid curse of dimensionality
• Reduce amount of time and memory required by machine
learning algorithms
• Allow data to be more easily visualized
• May help to eliminate irrelevant features or reduce noise
Techniques:
• Principal Component Analysis
• Singular Value Decomposition
Feature Subset Selection
• Another way to reduce dimensionality of data
• Redundant features
• duplicate much or all of the information contained in one or
more other attributes
• Example: purchase price of a product and the amount of sales
tax paid
• Irrelevant features
• contain no information that is useful for the machine learning
task at hand
• Example: students’ ID is often irrelevant to the task of
predicting students’ GPA
Feature Creation
• Create new attributes that can capture the important
information in a data set much more efficiently than the
original attributes
• General methodologies:
• Feature Extraction
• domain-specific
• Mapping Data to New Space
Discretization
[Figure: four views of the same 2-D data: the original data, and discretizations by equal interval width, equal frequency, and K-means]
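Equal-interval-width discretization can be sketched in a few lines; `equal_width_bins` is an illustrative helper, and values are assumed not to be all identical:

```python
def equal_width_bins(values, k):
    """Assign each value to one of k equal-width intervals."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    # clamp the maximum value into the last bin
    return [min(int((v - lo) / width), k - 1) for v in values]

print(equal_width_bins([1, 2, 3, 10, 11, 12], 2))  # [0, 0, 0, 1, 1, 1]
```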
Attribute Transformation
• A function that maps the entire set of values of a given
attribute to a new set of replacement values such that each
old value can be identified with one of the new values
• Simple functions: x^k, log(x), e^x, |x|
• Standardization and Normalization
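Both transformations can be sketched with the standard library; the score values are made up for illustration:

```python
from statistics import mean, stdev

def standardize(xs):
    """z-scores: shift to zero mean, scale to unit standard deviation."""
    m, s = mean(xs), stdev(xs)
    return [(x - m) / s for x in xs]

def min_max_normalize(xs):
    """Map the values linearly onto [0, 1]."""
    lo, hi = min(xs), max(xs)
    return [(x - lo) / (hi - lo) for x in xs]

scores = [60, 70, 85, 90]
print(min_max_normalize(scores))  # extremes map to 0.0 and 1.0
```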
Similarity and Dissimilarity
• Similarity
• Numerical measure of how alike two data objects are.
• Is higher when objects are more alike.
• Often falls in the range [0,1]
• Dissimilarity
• Numerical measure of how different two data objects are
• Lower when objects are more alike
• Minimum dissimilarity is often 0
• Proximity refers to a similarity or dissimilarity
Similarity/Dissimilarity for Simple Attributes
p and q are the attribute values for two data objects
Euclidean Distance
• Euclidean Distance
dist = √( Σ_{k=1}^{n} (p_k − q_k)² )

• Where n is the number of dimensions (attributes) and p_k and
q_k are, respectively, the k-th attributes (components) of data
objects p and q.
• Standardization is necessary, if scales differ.
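This formula translates directly into Python:

```python
from math import sqrt

def euclidean(p, q):
    """Euclidean distance between two points with the same attributes."""
    return sqrt(sum((pk - qk) ** 2 for pk, qk in zip(p, q)))

print(euclidean((0, 0), (3, 4)))  # 5.0
```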
Mahalanobis Distance
Definition
mahalanobis(p, q) = (p − q) Σ⁻¹ (p − q)ᵀ
Covariance Matrix of Input Data X
Σ_{j,k} = (1/(n−1)) Σ_{i=1}^{n} (X_{ij} − X̄_j)(X_{ik} − X̄_k)
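A minimal 2-D sketch of this quadratic form in pure Python, using the sample covariance (n − 1 denominator); `mahalanobis_2d` and the four-point data set are illustrative:

```python
def mahalanobis_2d(p, q, data):
    """Mahalanobis distance (p − q) Σ⁻¹ (p − q)ᵀ for 2-D points."""
    n = len(data)
    mx = sum(x for x, _ in data) / n
    my = sum(y for _, y in data) / n
    # sample covariance matrix entries
    sxx = sum((x - mx) ** 2 for x, _ in data) / (n - 1)
    syy = sum((y - my) ** 2 for _, y in data) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in data) / (n - 1)
    # inverse of the 2x2 covariance matrix
    det = sxx * syy - sxy * sxy
    ixx, iyy, ixy = syy / det, sxx / det, -sxy / det
    dx, dy = p[0] - q[0], p[1] - q[1]
    return dx * dx * ixx + 2 * dx * dy * ixy + dy * dy * iyy

data = [(0, 0), (1, 0), (0, 1), (1, 1)]
print(mahalanobis_2d((0, 0), (1, 1), data))  # ≈ 6.0 for this data
```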
Cosine Similarity
• If d1 and d2 are two document vectors, then
cos(d1, d2) = (d1 · d2) / (∥d1∥ ∥d2∥)
where · indicates vector dot product and ∥d∥ is the length of
vector d.
Example:
d1 = (3, 2, 0, 5, 0, 0, 0, 2, 0, 0)
d2 = (1, 0, 0, 0, 0, 0, 0, 1, 0, 2)
d1 · d2 = 3∗1 + 2∗0 + 0∗0 + 5∗0 + 0∗0 + 0∗0 + 0∗0 + 2∗1 + 0∗0 + 0∗2 = 5
Cosine Similarity
∥d1∥ = √(3² + 2² + 0² + 5² + 0² + 0² + 0² + 2² + 0² + 0²) = √42 ≈ 6.481

∥d2∥ = √(1² + 0² + 0² + 0² + 0² + 0² + 0² + 1² + 0² + 2²) = √6 ≈ 2.449

cos(d1, d2) = 5 / (6.481 × 2.449) ≈ 0.3150
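The worked example translates directly into Python:

```python
from math import sqrt

def cosine(d1, d2):
    """Cosine similarity: dot product over the product of vector lengths."""
    dot = sum(a * b for a, b in zip(d1, d2))
    return dot / (sqrt(sum(a * a for a in d1)) * sqrt(sum(b * b for b in d2)))

d1 = (3, 2, 0, 5, 0, 0, 0, 2, 0, 0)
d2 = (1, 0, 0, 0, 0, 0, 0, 1, 0, 2)
print(round(cosine(d1, d2), 4))  # 0.315
```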
Similarity Between Binary Vectors
• Common situation is that objects, p and q, have only binary
attributes
• Compute similarities using the following quantities:
• M01 = the number of attributes where p was 0 and q was 1
• M10 = the number of attributes where p was 1 and q was 0
• M00 = the number of attributes where p was 0 and q was 0
• M11 = the number of attributes where p was 1 and q was 1
• Simple Matching and Jaccard Coefficients:
• Simple Matching Coefficient (SMC):
SMC = (number of matches) / (number of attributes)
    = (M11 + M00) / (M01 + M10 + M11 + M00)
• Jaccard Coefficient (J):
J = (number of 11 matches) / (number of not-both-zero attribute values)
  = M11 / (M01 + M10 + M11)
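Both coefficients can be computed in a few lines; the binary vectors here are an illustrative example:

```python
def smc_and_jaccard(p, q):
    """Return (SMC, Jaccard) for two binary vectors of equal length."""
    m11 = sum(1 for a, b in zip(p, q) if a == 1 and b == 1)
    m00 = sum(1 for a, b in zip(p, q) if a == 0 and b == 0)
    n = len(p)
    smc = (m11 + m00) / n
    jaccard = m11 / (n - m00) if n > m00 else 0.0  # guard all-zero vectors
    return smc, jaccard

p = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
q = [0, 0, 0, 0, 0, 0, 1, 0, 0, 1]
print(smc_and_jaccard(p, q))  # (0.7, 0.0)
```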
Correlation
• Correlation measures the linear relationship between objects.
• To compute correlation, we standardize data objects, p and q,
and then take their dot product:
p′_k = (p_k − mean(p)) / std(p)

q′_k = (q_k − mean(q)) / std(q)

correlation(p, q) = (p′ · q′) / (n − 1)

where n is the number of attributes.
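A short sketch of this computation with the standard library; the two example objects are made up to show perfect positive and negative correlation:

```python
from statistics import mean, stdev

def correlation(p, q):
    """Pearson correlation: standardize both objects, then take the
    dot product of the z-scores divided by n - 1."""
    mp, sp = mean(p), stdev(p)
    mq, sq = mean(q), stdev(q)
    dot = sum((a - mp) / sp * ((b - mq) / sq) for a, b in zip(p, q))
    return dot / (len(p) - 1)

print(round(correlation([1, 2, 3, 4], [2, 4, 6, 8]), 6))  # 1.0
print(round(correlation([1, 2, 3, 4], [8, 6, 4, 2]), 6))  # -1.0
```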
Visually Evaluating Correlation
[Figure: scatter plots of pairs of attributes with correlations ranging from −1 to 1]
Correlation
• +1: Perfect positive correlation: The points lie exactly in an
upward-sloping straight line. As x increases, y increases
proportionally.
• -1: Perfect negative correlation: Points lie exactly on a
downward-sloping straight line. As x increases, y decreases
proportionally.
• 0: No correlation: Points are scattered randomly; no linear
relationship.
Thank You
Contact: indujoshi@iitmandi.ac.in