Data Preprocessing 09112023 065121pm

1. The document discusses various concepts related to data preprocessing including data types, attributes, objects, structured data characteristics, types of datasets, and common data quality issues. 2. Key steps in data preprocessing are discussed such as data aggregation, sampling, dimensionality reduction, feature selection/creation, and data transformation techniques. 3. The goal of data preprocessing is to handle data quality issues, reduce data size, and transform the data into a format that is suitable for data mining and machine learning algorithms.

Uploaded by

AHSAN HAMEED


DATA PREPROCESSING

WHAT IS DATA?

● Collection of data objects and their attributes

● An attribute is a property or characteristic of an object
– Examples: eye color of a person, temperature, etc.
– Attribute is also known as variable, field, characteristic, or feature
● A collection of attributes describes an object
– Object is also known as record, point, case, sample, entity, or instance

TYPES OF ATTRIBUTES

● There are different types of attributes

– Nominal
  ▪ Examples: ID numbers, eye color, zip codes
– Ordinal
  ▪ Examples: rankings (e.g., taste of potato chips on a scale from 1-10), grades, height in {tall, medium, short}
– Interval
  ▪ Examples: calendar dates, temperatures in Celsius or Fahrenheit
– Ratio
  ▪ Examples: temperature in Kelvin, length, time, counts
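
The difference between interval and ratio attributes can be checked with a little arithmetic. The sketch below (the function name and the two readings are illustrative, not from the slides) converts two Celsius temperatures to Kelvin: differences agree on both scales, but only the Kelvin ratio is physically meaningful because Kelvin has a true zero.

```python
def celsius_to_kelvin(c: float) -> float:
    """Convert a Celsius temperature to Kelvin."""
    return c + 273.15

# Two hypothetical readings.
a_c, b_c = 10.0, 20.0
a_k, b_k = celsius_to_kelvin(a_c), celsius_to_kelvin(b_c)

# Differences are meaningful on an interval scale and agree across both scales.
diff_c = b_c - a_c
diff_k = b_k - a_k

# Ratios are only meaningful on a ratio scale (one with a true zero point).
ratio_c = b_c / a_c   # 2.0, yet 20 °C is not "twice as hot" as 10 °C
ratio_k = b_k / a_k   # roughly 1.035
```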

DISCRETE AND CONTINUOUS ATTRIBUTES

● Discrete Attribute
– Has only a finite or countably infinite set of values
– Examples: zip codes, counts, or the set of words in a collection
of documents
– Often represented as integer variables.
– Note: binary attributes are a special case of discrete
attributes

● Continuous Attribute
– Has real numbers as attribute values
– Examples: temperature, height, or weight.
– Practically, real values can only be measured and represented
using a finite number of digits.
– Continuous attributes are typically represented as floating-point variables.

TYPES OF DATA SETS
● Record
– Data Matrix
– Document Data
– Transaction Data
● Graph
– World Wide Web
– Molecular Structures
● Ordered
– Spatial Data
– Temporal Data
– Sequential Data
– Genetic Sequence Data

IMPORTANT CHARACTERISTICS OF STRUCTURED DATA

– Dimensionality
  ▪ Curse of Dimensionality

– Sparsity
  ▪ Only presence counts

– Resolution
  ▪ Patterns depend on the scale

RECORD DATA

● Data that consists of a collection of records, each of which consists of a fixed set of attributes

DATA MATRIX

● If data objects have the same fixed set of numeric attributes, then the data objects can be thought of as points in a multi-dimensional space, where each dimension represents a distinct attribute

● Such a data set can be represented by an m by n matrix, where there are m rows, one for each object, and n columns, one for each attribute
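
As a minimal sketch of the m by n layout described above, the snippet below stores a small data matrix as a list of rows. The attribute names and values are made up for illustration.

```python
# Hypothetical data: each row is one object, each column one numeric attribute.
attribute_names = ["height_cm", "weight_kg", "age_years"]
data_matrix = [
    [170.0, 65.0, 30.0],
    [182.0, 80.0, 41.0],
    [158.0, 52.0, 25.0],
    [175.0, 72.0, 36.0],
]

m = len(data_matrix)        # number of objects (rows)
n = len(data_matrix[0])     # number of attributes (columns)

# Every object must have the same fixed set of attributes.
assert all(len(row) == n for row in data_matrix)

# Column j holds the values of attribute j across all objects.
columns = [list(col) for col in zip(*data_matrix)]
```

Each row can be read as a point in 3-dimensional space, one coordinate per attribute.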

DOCUMENT DATA

● Each document becomes a 'term' vector,
– each term is a component (attribute) of the vector,
– the value of each component is the number of times the corresponding term occurs in the document.
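
The term-vector idea can be sketched with a word counter. The two documents below are a hypothetical toy corpus, and splitting on whitespace is an assumed, deliberately naive tokenization.

```python
from collections import Counter

documents = {  # hypothetical toy corpus
    "doc1": "team coach play ball score game win",
    "doc2": "ball play play score lost game",
}

# Vocabulary: one vector component per distinct term, in sorted order.
vocabulary = sorted({term for text in documents.values() for term in text.split()})

def term_vector(text: str) -> list[int]:
    """Count how often each vocabulary term occurs in the text."""
    counts = Counter(text.split())
    return [counts[term] for term in vocabulary]

vectors = {name: term_vector(text) for name, text in documents.items()}
```

Stacking these vectors row by row gives a document-term matrix, which is one form of the data matrix described earlier.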

TRANSACTION DATA

● A special type of record data, where
– each record (transaction) involves a set of items.
– For example, consider a grocery store. The set of products purchased by a customer during one shopping trip constitutes a transaction, while the individual products that were purchased are the items.
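
A minimal sketch of the grocery example, with made-up shopping trips: each transaction is a set of items, and the same data can also be viewed as record data with one binary attribute per item.

```python
transactions = [  # hypothetical shopping trips
    {"bread", "coke", "milk"},
    {"beer", "bread"},
    {"beer", "coke", "diaper", "milk"},
]

# All distinct items seen across the transactions.
items = sorted(set().union(*transactions))

# Binary record view: 1 if the item appears in the transaction, else 0.
binary_rows = [[1 if item in t else 0 for item in items] for t in transactions]
```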

[Figure: transactions and their items]

GRAPH DATA

● Examples: Generic graph and HTML links

<a href="papers/papers.html#bbbb">Data Mining</a>
<li> <a href="papers/papers.html#aaaa">Graph Partitioning</a>
<li> <a href="papers/papers.html#aaaa">Parallel Solution of Sparse Linear System of Equations</a>
<li> <a href="papers/papers.html#ffff">N-Body Computation and Dense Linear System Solvers</a>

CHEMICAL DATA

● Benzene molecule: C6H6

ORDERED DATA

● Sequences of transactions
[Figure labels: "Items/Events"; "An element of the sequence"]

ORDERED DATA

● Genomic sequence data

ORDERED DATA

● Spatio-Temporal Data
[Figure: average monthly temperature of land and ocean; trajectories of moving objects]

DATA QUALITY

● What kinds of data quality problems?
● How can we detect problems with the data?
● What can we do about these problems?

● Examples of data quality problems:
– Noise and outliers
– Missing values
– Duplicate data

NOISE

● Noise refers to modification of original values
– Examples: distortion of a person's voice when talking on a poor phone connection and "snow" on a television screen

[Figure: two panels, "Two Sine Waves" and "Two Sine Waves + Noise"]

OUTLIERS

● Outliers are data objects with characteristics that are considerably different from most of the other data objects in the data set
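
The slides do not prescribe a detection method. As one common heuristic, the sketch below flags values whose z-score (distance from the mean in standard deviations) exceeds a cut-off. The data and the threshold of 2.0 are assumptions chosen for illustration.

```python
import statistics

values = [10.1, 9.8, 10.3, 9.9, 10.0, 10.2, 25.0, 9.7]  # hypothetical measurements

mean = statistics.mean(values)
stdev = statistics.pstdev(values)  # population standard deviation

threshold = 2.0  # assumed cut-off; choose per application
outliers = [v for v in values if abs(v - mean) / stdev > threshold]
```

Note that a single extreme value inflates both the mean and the standard deviation, so on small data sets a z-score rule can miss outliers that a more robust statistic (such as the median) would catch.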

DEVIATION/ANOMALY DETECTION

● Outliers are useful when we need to detect significant deviations from normal behavior
● Applications:
– Credit Card Fraud Detection
– Network Intrusion Detection

MISSING VALUES

● Reasons for missing values
– Information is not collected (e.g., people decline to give their age and weight)
– Attributes may not be applicable to all cases (e.g., annual income is not applicable to children)

● Handling missing values
– Eliminate data objects
– Estimate missing values
– Ignore the missing value during analysis
– Replace with all possible values (weighted by their probabilities)
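
The first three handling strategies can be sketched on a single attribute. The ages below are hypothetical, `None` is used as the missing-value marker, and estimating with the mean of the observed values is just one simple choice of estimator.

```python
import statistics

ages = [25, None, 31, 47, None, 38]  # None marks a missing value (hypothetical data)

# Option 1: eliminate data objects with a missing value.
eliminated = [a for a in ages if a is not None]

# Option 2: estimate missing values, here with the mean of the observed ones.
observed_mean = statistics.mean(eliminated)
estimated = [a if a is not None else observed_mean for a in ages]

# Option 3: ignore missing values during analysis (skip them per computation).
def mean_ignoring_missing(values):
    present = [v for v in values if v is not None]
    return statistics.mean(present)
```

Elimination shrinks the data set, while estimation keeps every object at the cost of introducing values that were never measured.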

DUPLICATE DATA

● Data set may include data objects that are duplicates, or almost duplicates, of one another
– Major issue when merging data from heterogeneous sources

● Examples:
– Same person with multiple email addresses

● Data cleaning
– Process of dealing with duplicate data issues
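
A minimal sketch of the email example, assuming (hypothetically) that two records refer to the same person when their normalized name and birth date match. Real data cleaning usually needs a more careful matching rule; the records and field names here are invented.

```python
records = [  # hypothetical customer records merged from two sources
    {"name": "Ada Lovelace", "born": "1815-12-10", "email": "ada@example.com"},
    {"name": "ada  lovelace ", "born": "1815-12-10", "email": "a.lovelace@example.org"},
    {"name": "Alan Turing", "born": "1912-06-23", "email": "alan@example.com"},
]

def identity_key(record):
    # Assumed matching rule: same normalized name and same birth date.
    name = " ".join(record["name"].lower().split())
    return (name, record["born"])

# Merge duplicates: one entry per person, collecting all of their email addresses.
merged = {}
for record in records:
    entry = merged.setdefault(identity_key(record), {"emails": set()})
    entry["emails"].add(record["email"])
```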

DATA PREPROCESSING
● Aggregation
● Sampling
● Dimensionality Reduction
● Feature subset selection
● Feature creation
● Discretization and Binarization
● Attribute Transformation

AGGREGATION

● Combining two or more attributes (or objects) into a single attribute (or object)

● Purpose
– Data reduction
  ▪ Reduce the number of attributes or objects
– Change of scale
  ▪ Cities aggregated into regions, states, countries, etc.
– More "stable" data
  ▪ Aggregated data tends to have less variability
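
The sketch below aggregates city-level sales into regions. The figures are invented and deliberately constructed so that the regional totals vary less than the individual city series; real data will not always behave this cleanly.

```python
import statistics
from collections import defaultdict

# Hypothetical daily sales per city: (region, city, sales for each of four days).
city_sales = [
    ("North", "A", [12, 30, 8, 25]),
    ("North", "B", [28, 10, 33, 14]),
    ("South", "C", [5, 22, 18, 9]),
    ("South", "D", [21, 6, 11, 24]),
]

# Change of scale: sum city-level series into one series per region.
region_sales = defaultdict(lambda: [0, 0, 0, 0])
for region, _city, series in city_sales:
    for day, amount in enumerate(series):
        region_sales[region][day] += amount

def relative_spread(series):
    """Standard deviation divided by the mean (coefficient of variation)."""
    return statistics.pstdev(series) / statistics.mean(series)
```

Four city rows become two region rows (data reduction), and `relative_spread` of a regional series can be compared against that of a city series to see the stabilizing effect.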

SAMPLING

● Sampling is the main technique employed for data selection.
– It is often used for both the preliminary investigation of the data and the final data analysis.

● Statisticians sample because obtaining the entire set of data of interest is too expensive or time consuming.

● Sampling is used in data mining because processing the entire set of data of interest is too expensive or time consuming.

SAMPLING …

● The key principle for effective sampling is the following:
– Using a sample will work almost as well as using the entire data set, if the sample is representative
– A sample is representative if it has approximately the same property (of interest) as the original set of data
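
As a sketch of what "representative" means, the snippet below draws a simple random sample without replacement and compares its mean, taken here as the property of interest, with the mean of the full data. The synthetic population, the sample size of 500, and the choice of simple random sampling are all assumptions made for illustration.

```python
import random
import statistics

rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Hypothetical population: 10,000 values drawn from a normal distribution.
population = [rng.gauss(50.0, 10.0) for _ in range(10_000)]

# Simple random sampling without replacement.
sample = rng.sample(population, 500)

population_mean = statistics.mean(population)
sample_mean = statistics.mean(sample)
```

If the two means are close, the sample is representative with respect to that property; a different property (for example, the presence of a rare class) may need a larger or stratified sample.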

SAMPLE SIZE

[Figure: three panels showing 8000 points, 2000 points, and 500 points]

CURSE OF DIMENSIONALITY

● When dimensionality increases, data becomes increasingly sparse in the space that it occupies

● Definitions of density and distance between points, which are critical for clustering and outlier detection, become less meaningful.
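
One way to see distances losing meaning is to measure how much the nearest and farthest points differ relative to the nearest one. The sketch below does this for uniformly random points; the function name, point count, and dimensions are illustrative choices, not taken from the slides.

```python
import math
import random

def distance_contrast(dim: int, n_points: int = 200, seed: int = 0) -> float:
    """Return (max distance - min distance) / min distance from one query point."""
    rng = random.Random(seed)
    query = [rng.random() for _ in range(dim)]
    distances = []
    for _ in range(n_points):
        point = [rng.random() for _ in range(dim)]
        distances.append(math.dist(query, point))
    return (max(distances) - min(distances)) / min(distances)

contrast_low = distance_contrast(dim=2)     # few dimensions
contrast_high = distance_contrast(dim=500)  # many dimensions
```

In low dimensions the farthest point is many times farther away than the nearest one. In high dimensions the contrast shrinks, so "nearest" carries much less information.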

DIMENSIONALITY REDUCTION

● Purpose:
– Avoid curse of dimensionality
– Reduce amount of time and memory required by data mining algorithms
– Allow data to be more easily visualized
– May help to eliminate irrelevant features or reduce noise

FEATURE SUBSET SELECTION

● Reduce dimensionality of data

Remove:
● Redundant features
– duplicate much or all of the information contained in one or more other attributes
– Example: purchase price of a product and the amount of sales tax paid

● Irrelevant features
– contain no information that is useful for the data mining task at hand
– Example: students' ID is often irrelevant to the task of predicting students' GPA

FEATURE SUBSET SELECTION

● Techniques:
– Brute-force approach:
  ▪ Try all possible feature subsets as input to data mining algorithm
– Embedded approaches:
  ▪ Feature selection occurs naturally as part of the data mining algorithm
– Filter approaches:
  ▪ Features are selected before data mining algorithm is run
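
As a minimal sketch of a filter approach, the snippet below drops features that are almost perfectly correlated with a feature already kept, before any mining algorithm runs. The feature values, the 8% tax rate, and the 0.95 threshold are assumptions; correlation is only one of several possible filter criteria.

```python
import math

# Hypothetical features; sales_tax is a fixed 8% of price, so it is redundant.
features = {
    "price": [10.0, 25.0, 40.0, 15.0, 60.0],
    "sales_tax": [0.8, 2.0, 3.2, 1.2, 4.8],
    "weight_kg": [1.2, 0.4, 2.5, 3.1, 0.9],
}

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def filter_redundant(features, threshold=0.95):
    """Keep a feature only if it is not highly correlated with one already kept."""
    kept = []
    for name, values in features.items():
        if all(abs(pearson(values, features[k])) < threshold for k in kept):
            kept.append(name)
    return kept

selected = filter_redundant(features)
```

By contrast, the brute-force approach would have to evaluate every non-empty subset, and with n features there are 2**n - 1 of them, which is why it is rarely practical beyond a handful of features.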
