DATA (PRE-)PROCESSING

In the previous class,
• We discussed various types of data, with examples
In this class,
• We focus on data pre-processing – “an important milestone of the Data Mining Process”
Data analysis pipeline
• Mining is not the only step in the analysis process
• Preprocessing: real data is noisy, incomplete and inconsistent;
  data cleaning is required to make sense of the data
  • Techniques: sampling, dimensionality reduction, feature selection
• Post-processing: make the results actionable and useful to the user
  • Statistical analysis of importance & visualization
Why Preprocess the Data?

Measures for Data Quality: A Multidimensional View
• Accuracy: correct or wrong, accurate or not
• Completeness: not recorded, unavailable, …
• Consistency: some entries modified but others not, …
• Timeliness: timely update?
• Believability: how trustworthy is the data?
• Interpretability: how easily can the data be understood?
Why Data Preprocessing?
• Data in the real world is dirty
• incomplete: lacking attribute values, lacking certain attributes
of interest, or containing only aggregate data
• noisy: containing errors or outliers
• inconsistent: containing discrepancies in codes or names
• No quality data, no quality mining results!
• Quality decisions must be based on quality data
• Data warehouse needs consistent integration of quality data
• Required for both OLAP and Data Mining!
Why can Data be Incomplete?
• Attributes of interest are not available (e.g., customer
  information for sales transaction data)
• Data were not considered important at the time of the
  transactions, so they were not recorded!
• Data were not recorded because of misunderstanding or
  malfunctions
• Data may have been recorded and later deleted!
• Missing/unknown values for some data
Attribute Values
Data is described using attribute values.
Attribute values are numbers or symbols assigned to an attribute.
• Distinction between attributes and attribute values
  • The same attribute can be mapped to different attribute values
    Example: height can be measured in feet or meters
  • Different attributes can be mapped to the same set of values
    Example: attribute values for ID and age are integers
  • But the properties of the attribute values can be different
    • ID has no limit, but age has a maximum and minimum value
Types of Attributes
There are different types of attributes:
• Nominal
  • Examples: ID numbers, eye color, zip codes
• Ordinal
  • Examples: rankings (e.g., taste of potato chips on a scale from 1-10), grades, height in {tall, medium, short}
• Interval
  • Examples: calendar dates
• Ratio
  • Examples: length, time, counts
Discrete and Continuous Attributes
• Discrete Attribute
  • Has only a finite or countably infinite set of values
  • Examples: zip codes, counts, or the set of words in a collection of documents
  • Often represented as integer variables

• Continuous Attribute
  • Has real numbers as attribute values
  • Examples: temperature, height, or weight
  • Practically, real values can only be measured and represented using a finite number of digits
Data Preprocessing
Major Tasks in Data Preprocessing
• Data cleaning
  • Fill in missing values, smooth noisy data, identify or remove outliers (outliers = exceptions!),
    and resolve inconsistencies
• Data integration
• Integration of multiple databases, data cubes, or files
• Data transformation
• Normalization and aggregation
• Data reduction
• Obtains reduced representation in volume but produces the same or
similar analytical results
• Data discretization
• Part of data reduction but with particular importance, especially for
numerical data
Forms of data preprocessing
Data Cleaning
Data in the real world is dirty: lots of potentially incorrect data, e.g.,
instrument faults, human or computer error, transmission errors
• Data cleaning tasks
  • Fill in missing values
  • Identify outliers and smooth out noisy data
  • Correct inconsistent data
How to Handle Missing Data?
• Ignore the tuple: usually done when the class label is missing (assuming the task is
classification) — not effective when the percentage of missing values per
attribute varies considerably.

• Fill in the missing value manually: tedious + infeasible?

• Use a global constant to fill in the missing value: e.g., “unknown”, a new class?!

• Use the attribute mean to fill in the missing value

• Use the attribute mean for all samples belonging to the same class to fill in the
missing value: smarter

• Use the most probable value to fill in the missing value: inference-based such
as Bayesian formula or decision tree
How to Handle Missing Data?

Age   Income   Religion    Gender
23    24,200   Muslim      M
39    ?        Christian   F
45    45,390   ?           F

Fill missing values using aggregate functions (e.g., average) or probabilistic estimates
based on the global value distribution.
E.g., put the average income here, or the most probable income given that the person is
39 years old.
E.g., put the most frequent religion here.
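A minimal pandas sketch of these strategies, using the toy table above (the DataFrame itself and the use of Gender as a stand-in class label are illustrative assumptions):

```python
# Minimal sketch: filling missing values with pandas, using the slide's example data.
import pandas as pd
import numpy as np

df = pd.DataFrame({
    "Age":      [23, 39, 45],
    "Income":   [24200, np.nan, 45390],
    "Religion": ["Muslim", "Christian", np.nan],
    "Gender":   ["M", "F", "F"],
})

# Strategy 1: attribute mean for numerics, most frequent value for categoricals.
filled = df.copy()
filled["Income"] = filled["Income"].fillna(filled["Income"].mean())
filled["Religion"] = filled["Religion"].fillna(filled["Religion"].mode()[0])

# Strategy 2 (smarter): attribute mean within the same class
# (Gender stands in for a class label here, purely for illustration).
by_class = df.copy()
by_class["Income"] = by_class.groupby("Gender")["Income"].transform(lambda s: s.fillna(s.mean()))

print(filled)
print(by_class)
```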
Data Quality

Data has attribute values.
Then,
how good is our data with respect to these attribute values?
Data Quality
• Examples of data quality problems:
  • Noise and outliers
  • Missing values
  • Duplicate data
Data Quality: Noise
• Noise refers to modification of original values
  • Example: distortion of a person’s voice when talking on a poor phone connection
Data Quality: Outliers
• Outliers are data objects with characteristics that are considerably different from most of the other data objects in the data set
Data Quality: Missing Values
• Reasons for missing values
  • Information is not collected (e.g., people decline to give their age and weight)
  • Attributes may not be applicable to all cases (e.g., annual income is not applicable to children)
• Handling missing values
  • Eliminate data objects
  • Estimate missing values
  • Ignore the missing value during analysis
  • Replace with all possible values (weighted by their probabilities)
Data Quality: Duplicate Data
• Data sets may include data objects that are duplicates, or almost duplicates, of one another
  • A major issue when merging data from heterogeneous sources
• Example:
  • The same person with multiple email addresses
• Data cleaning
  • The process of dealing with duplicate-data issues
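A minimal sketch of duplicate handling with pandas; the names and email addresses are invented, and the name-normalization step is only a crude stand-in for real entity resolution:

```python
# Minimal sketch: detecting and dropping duplicates with pandas (toy data).
import pandas as pd

people = pd.DataFrame({
    "name":  ["J. D. Smith", "John Smith", "Alice Lee"],
    "email": ["jd@example.com", "jd@example.com", "alice@example.com"],
})

# Exact duplicates on a key column are easy to drop.
deduped = people.drop_duplicates(subset="email", keep="first")

# A simple normalization step before matching: lower-case names and strip non-letters,
# so formatting differences do not hide duplicates (real entity resolution needs more).
people["name_key"] = people["name"].str.lower().str.replace(r"[^a-z]", "", regex=True)

print(deduped)
print(people)
```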
Data Quality: Handling Noise
• Binning
  • Sort the data and partition it into (equi-depth) bins
  • Smooth by bin means, bin medians, bin boundaries, etc.
• Regression
  • Smooth by fitting a regression function
• Clustering
  • Detect and remove outliers
• Combined computer and human inspection
  • Detect suspicious values automatically and check them by hand
Data Quality: Handling Noise (Binning)
• Equal-width binning
  • Divides the range into N intervals of equal size (width of intervals: W = (B - A)/N)
  • Simple
  • Outliers may dominate the result

• Equal-depth binning
  • Divides the range into N intervals, each containing approximately the same number of records
  • Skewed data is also handled well
Data Quality: Handling Noise (Binning)
Example: sorted price values: 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
• Partition into three (equi-depth) bins
- Bin 1: 4, 8, 9, 15
- Bin 2: 21, 21, 24, 25
- Bin 3: 26, 28, 29, 34
• Smoothing by bin means
- Bin 1: 9, 9, 9, 9
- Bin 2: 23, 23, 23, 23
- Bin 3: 29, 29, 29, 29
• Smoothing by bin boundaries
- Bin 1: 4, 4, 4, 15
- Bin 2: 21, 21, 25, 25
- Bin 3: 26, 26, 26, 34
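The same example can be reproduced with a few lines of plain Python; rounding the bin means to integers mirrors the slide's values:

```python
# Minimal sketch reproducing the slide's equi-depth binning and smoothing.
prices = sorted([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34])
n_bins = 3
depth = len(prices) // n_bins          # 4 values per bin (equi-depth)
bins = [prices[i * depth:(i + 1) * depth] for i in range(n_bins)]

# Smoothing by bin means: every value in a bin is replaced by the bin mean.
by_means = [[round(sum(b) / len(b)) for _ in b] for b in bins]

# Smoothing by bin boundaries: each value is replaced by the closer of the two bin boundaries.
by_boundaries = [[min(b[0], b[-1], key=lambda edge: abs(v - edge)) for v in b] for b in bins]

print(bins)           # [[4, 8, 9, 15], [21, 21, 24, 25], [26, 28, 29, 34]]
print(by_means)       # [[9, 9, 9, 9], [23, 23, 23, 23], [29, 29, 29, 29]]
print(by_boundaries)  # [[4, 4, 4, 15], [21, 21, 25, 25], [26, 26, 26, 34]]
```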
Data Quality: Handling Noise (Regression)
• Replace noisy or missing values by predicted values
• Requires a model of attribute dependencies (which may be wrong!)
• Can be used for data smoothing or for handling missing data
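A minimal sketch of regression-based imputation with numpy, assuming a simple linear dependence of income on age (the data pairs are invented):

```python
# Minimal sketch: regression-based smoothing/imputation with numpy (made-up age/income pairs).
import numpy as np

age    = np.array([23, 31, 39, 45, 52, 60], dtype=float)
income = np.array([24200, 31000, np.nan, 45390, 51000, 58000])

mask = ~np.isnan(income)
slope, intercept = np.polyfit(age[mask], income[mask], deg=1)   # fit income ~ age

# Replace the missing (or noisy) value with the model's prediction.
income[~mask] = slope * age[~mask] + intercept
print(income.round(0))
```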
Data Integration
The process of combining multiple sources into a single dataset. The data
integration process is one of the main components of data management.
• Data integration:
  • Combines data from multiple sources into a coherent store
• Schema integration: integrate metadata from different sources
  • Metadata: data about the data (i.e., data descriptors)
• Entity identification problem: identify real-world entities from multiple data sources,
  e.g., A.cust-id ≡ B.cust-#
• Detecting and resolving data value conflicts: for the same real-world entity, attribute
  values from different sources may differ (e.g., "J. D. Smith" and "John Smith" may refer
  to the same person)
  • Possible reasons: different representations, different scales, e.g., metric vs. British units (inches vs. cm)
Handling Redundant Data in Data Integration
• Redundant data occur often when integrating multiple databases
  • The same attribute may have different names in different databases
  • One attribute may be a “derived” attribute in another table, e.g., annual revenue
• Redundant data may be detected by correlation analysis
• Careful integration of data from multiple sources may help reduce/avoid redundancies
  and inconsistencies and improve mining speed and quality
Data Transformation
• Smoothing: remove noise from data
• Aggregation: summarization, data cube construction
• Generalization: concept hierarchy climbing
• Normalization: scaled to fall within a small, specified range
• min-max normalization
• z-score normalization
• normalization by decimal scaling
• Attribute/feature construction
• New attributes constructed from the given ones
Normalization: Why normalize?
• Speeds up some learning techniques (e.g., neural networks)
• Helps prevent attributes with large ranges from outweighing ones with small ranges
• Example:
  • income has range 3,000-200,000
  • age has range 10-80
  • gender has domain M/F
Data Transformation
Data has attribute values.
Then,
can we compare these attribute values?
For example, compare the following two pairs of records:
(1) (5.9 ft, 50 Kg)
(2) (4.6 ft, 55 Kg)
vs.
(3) (5.9 ft, 50 Kg)
(4) (5.6 ft, 56 Kg)
We need data transformation to make records comparable across different dimensions (attributes)...
Data Transformation Techniques
• Normalization: scaled to fall within a small, specified range
  • min-max normalization
  • z-score normalization
  • normalization by decimal scaling
• Centralization:
  • Based on fitting a distribution to the data
  • Distance functions between distributions
    • KL distance
  • Mean centering
Data Transformation: Normalization
Example: Data Transformation
- Assume min and max values for height and weight.
- Now apply min-max normalization to both attributes as follows:
(1) (5.9 ft, 50 Kg)
(2) (4.6 ft, 55 Kg)
vs.
(1) (5.9 ft, 50 Kg)
(2) (5.6 ft, 56 Kg)
- Compare your results...
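A minimal numpy sketch of min-max and z-score normalization; since the slide leaves the assumed min/max open, the per-column min and max are taken from the records themselves:

```python
# Minimal sketch of min-max and z-score normalization on the slide's records.
import numpy as np

records = np.array([[5.9, 50.0],
                    [4.6, 55.0],
                    [5.6, 56.0]])          # columns: height (ft), weight (kg)

# Min-max normalization to [0, 1]: v' = (v - min) / (max - min)
col_min, col_max = records.min(axis=0), records.max(axis=0)
minmax = (records - col_min) / (col_max - col_min)

# Z-score normalization: v' = (v - mean) / std
zscore = (records - records.mean(axis=0)) / records.std(axis=0)

# Decimal scaling would instead divide each column by 10^j for the smallest j with |v'| < 1.
print(minmax.round(2))
print(zscore.round(2))
```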
Data Transformation: Aggregation
• Combining two or more attributes (or objects) into a single attribute (or object)
• Purpose
  • Data reduction
    • Reduce the number of attributes or objects
  • Change of scale
    • Cities aggregated into regions, states, countries, etc.
  • More “stable” data
    • Aggregated data tends to have less variability
Data Transformation: Discretization
• Motivation for discretization
  • Some data mining algorithms only accept categorical attributes
  • May improve understandability of patterns
Data Transformation: Discretization
• Task
  • Reduce the number of values for a given continuous attribute by partitioning the range of the attribute into intervals
  • Interval labels replace actual attribute values
• Methods
  • Binning (as explained earlier)
  • Cluster analysis (discussed later)
  • Entropy-based discretization (supervised)
Simple Discretization Methods: Binning
• Equal-width (distance) partitioning:
  • Divides the range into N intervals of equal size: uniform grid
  • If A and B are the lowest and highest values of the attribute, the width of the intervals is W = (B - A)/N
  • The most straightforward, but outliers may dominate the presentation
  • Skewed data is not handled well

• Equal-depth (frequency) partitioning:
  • Divides the range into N intervals, each containing approximately the same number of samples
  • Good data scaling
  • Managing categorical attributes can be tricky
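A minimal numpy sketch of equal-width partitioning using W = (B - A)/N, reusing the price values from the earlier binning example:

```python
# Minimal sketch of equal-width partitioning with numpy.
import numpy as np

values = np.array([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34])
N = 3
A, B = values.min(), values.max()
W = (B - A) / N                                   # interval width: (34 - 4) / 3 = 10
edges = A + W * np.arange(N + 1)                  # [4, 14, 24, 34]

# np.digitize assigns each value to its interval (1..N); clip so the maximum lands in bin N.
bin_ids = np.clip(np.digitize(values, edges), 1, N)
print(edges, bin_ids)
```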
Data Reduction Strategies
Warehouse may store terabytes of data: Complex data analysis/mining
may take a very long time to run on the complete data set
• Data reduction
• Obtains a reduced representation of the data set that is much
smaller in volume but yet produces the same (or almost the same)
analytical results
• Data reduction strategies
• Data cube aggregation
• Dimensionality reduction
• Data compression
• Numerosity reduction
• Discretization and concept hierarchy generation
Techniques of Data Reduction
Techniques or methods of data reduction in data mining include the following.
Dimensionality Reduction
• Reduce the number of random variables or attributes so that the dimensionality of the data set is reduced
• Combine and merge attributes of the data without losing its essential characteristics; this also reduces storage space and computation time
Numerosity Reduction: Reduce the volume of data
The representation of the data is made smaller by reducing its volume, with little or no loss of information.
• Parametric methods
  • Assume the data fits some model, estimate the model parameters, store only the parameters, and discard the data (except possible outliers)
  • Log-linear models: obtain the value at a point in m-D space as a product over appropriate marginal subspaces
• Non-parametric methods
  • Do not assume models
  • Major families: histograms, clustering, sampling
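A minimal sketch of sampling as a non-parametric numerosity-reduction step; the 10,000-row data set is synthetic:

```python
# Minimal sketch: simple random sampling without replacement to reduce data volume.
import numpy as np

rng = np.random.default_rng(seed=42)
data = rng.normal(loc=50, scale=10, size=(10_000, 3))   # stand-in for the full data set

sample_size = 500
idx = rng.choice(len(data), size=sample_size, replace=False)
sample = data[idx]

# The sample preserves the broad distribution at a fraction of the volume.
print(data.mean(axis=0).round(2), sample.mean(axis=0).round(2))
```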
Data Cube Aggregation
• The lowest level of a data cube
• the aggregated data for an individual entity of interest
• e.g., a customer in a phone calling data warehouse.
• Multiple levels of aggregation in data cubes
• Further reduce the size of data to deal with
• Reference appropriate levels
• Use the smallest representation which is enough to solve the task
• Queries regarding aggregated information should be answered using
data cube, when possible
Data Compression

[Figure: original data can be compressed losslessly (fully recoverable) or lossily (only an approximation of the original data can be reconstructed).]
Histograms
• A popular data reduction technique
• Divide data into buckets and store the average (or sum) for each bucket
• Can be constructed optimally in one dimension using dynamic programming
• Related to quantization problems
[Figure: example histogram of prices, with buckets ranging from 10,000 to 90,000.]
Histogram types
• Equal-width histograms:
• It divides the range into N intervals of equal size
• Equal-depth (frequency) partitioning:
• It divides the range into N intervals, each containing approximately same number of
samples
• V-optimal:
• It considers all histogram types for a given number of buckets and chooses the one
with the least variance.
• MaxDiff:
• After sorting the data to be approximated, it defines the borders of the buckets at
points where the adjacent values have the maximum difference
• Example: split 1, 1, 4, 5, 5, 7, 9, 14, 16, 18, 27, 30, 30, 32 into three buckets
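A minimal sketch of the MaxDiff example above: the two largest adjacent gaps (18→27 and 9→14) become the bucket borders:

```python
# Minimal sketch of MaxDiff bucket borders for the slide's example (3 buckets).
values = sorted([1, 1, 4, 5, 5, 7, 9, 14, 16, 18, 27, 30, 30, 32])
n_buckets = 3

# Adjacent differences; the (n_buckets - 1) largest gaps define the bucket borders.
gaps = [(values[i + 1] - values[i], i) for i in range(len(values) - 1)]
cut_positions = sorted(i for _, i in sorted(gaps, reverse=True)[: n_buckets - 1])

buckets, start = [], 0
for cut in cut_positions + [len(values) - 1]:
    buckets.append(values[start : cut + 1])
    start = cut + 1

print(buckets)   # [[1, 1, 4, 5, 5, 7, 9], [14, 16, 18], [27, 30, 30, 32]]
```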
Clustering
• Partitions the data set into clusters, and models it by one representative from each cluster
• Can be very effective if the data is clustered, but not if the data is “smeared”
• There are many choices of clustering definitions and clustering algorithms
Cluster Analysis
[Figure: salary vs. age scatter plot; the distance between points in the same cluster should be small, while an outlier lies far from any cluster.]
Hierarchical Reduction
• Use multi-resolution structure with different degrees of reduction
• Hierarchical clustering is often performed but tends to define
partitions of data sets rather than “clusters”
• Parametric methods are usually not amenable to hierarchical
representation
• Hierarchical aggregation
• An index tree hierarchically divides a data set into partitions by value range
of some attributes
• Each partition can be considered as a bucket
• Thus an index tree with aggregates stored at each node is a hierarchical
histogram
Discretization
• Three types of attributes:
• Nominal — values from an unordered set
• Ordinal — values from an ordered set
• Continuous — real numbers
• Discretization:
• divide the range of a continuous attribute into intervals
• why?
• Some classification algorithms only accept categorical
attributes.
• Reduce data size by discretization
• Prepare for further analysis
Discretization and Concept hierarchy
• Discretization
• reduce the number of values for a given continuous attribute
by dividing the range of the attribute into intervals. Interval
labels can then be used to replace actual data values.

• Concept hierarchies
• reduce the data by collecting and replacing low level concepts
(such as numeric values for the attribute age) by higher level
concepts (such as young, middle-aged, or senior).
Discretization and concept hierarchy
generation for numeric data
• Binning/Smoothing

• Histogram analysis

• Clustering analysis

• Entropy-based discretization

• Segmentation by natural partitioning


Entropy-Based Discretization
• Entropy of an interval S1 with m classes:
  Ent(S1) = - Σ_{i=1..m} p_i log2(p_i)
• Given a set of samples S, if S is partitioned into two intervals S1 and S2 using boundary T,
  the expected information I(S, T) after partitioning is
  I(S, T) = (|S1| / |S|) · Ent(S1) + (|S2| / |S|) · Ent(S2)
• The boundary that maximizes the information gain over all possible boundaries is selected
  as a binary discretization.
• The process is recursively applied to the partitions obtained as long as the information gain
  exceeds a threshold δ, i.e., while Ent(S) - I(S, T) > δ; it stops once the gain falls below δ.
• Experiments show that it may reduce data size and improve classification accuracy.
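A minimal sketch of a single entropy-based split; the ages and class labels are invented, and a full method would recurse on each partition until the gain drops below δ:

```python
# Minimal sketch of one entropy-based split: try every candidate boundary T and keep
# the one giving the lowest weighted entropy I(S, T), i.e., the highest information gain.
import math
from collections import Counter

def entropy(labels):
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def best_split(values, labels):
    pairs = sorted(zip(values, labels))
    xs, ys = [p[0] for p in pairs], [p[1] for p in pairs]
    best = None
    for i in range(1, len(xs)):
        if xs[i] == xs[i - 1]:
            continue
        T = (xs[i] + xs[i - 1]) / 2                     # candidate boundary between two values
        left, right = ys[:i], ys[i:]
        weighted = (len(left) * entropy(left) + len(right) * entropy(right)) / len(ys)
        if best is None or weighted < best[1]:
            best = (T, weighted)
    return best

ages   = [23, 25, 31, 39, 45, 52, 60, 64]               # invented toy data
labels = ["no", "no", "no", "yes", "yes", "yes", "yes", "no"]
T, I_ST = best_split(ages, labels)
print(T, round(entropy(labels) - I_ST, 3))              # chosen boundary and its information gain
```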
Segmentation by natural partitioning
• Users often like to see numerical ranges partitioned into relatively uniform, easy-to-read
  intervals that appear intuitive or “natural”, e.g., [50-60] rather than [51.223-60.812]
• The 3-4-5 rule can be used to segment numerical data into relatively uniform, “natural” intervals:
  * If an interval covers 3, 6, 7 or 9 distinct values at the most significant digit, partition
    the range into 3 equi-width intervals (2-3-2 for 7)
  * If it covers 2, 4, or 8 distinct values at the most significant digit, partition the range
    into 4 equi-width intervals
  * If it covers 1, 5, or 10 distinct values at the most significant digit, partition the range
    into 5 equi-width intervals
• The rule can be applied recursively to the resulting intervals
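A minimal sketch of one level of the 3-4-5 rule as stated above; the full procedure also trims outliers (e.g., to an inner percentile range) before partitioning and then recurses, which is omitted here:

```python
# Minimal sketch of a single top-level 3-4-5 partition (simplified, no outlier trimming).
import math

def three_four_five(low, high):
    msd = 10 ** math.floor(math.log10(high - low))        # unit of the most significant digit
    lo, hi = math.floor(low / msd) * msd, math.ceil(high / msd) * msd
    distinct = round((hi - lo) / msd)                     # distinct values at the msd

    if distinct == 7:                                     # 7 -> 2-3-2 split
        w = (hi - lo) / 7
        return [lo, lo + 2 * w, lo + 5 * w, hi]
    # 3/6/9 -> 3 intervals, 2/4/8 -> 4 intervals, 1/5/10 -> 5 intervals (default 3 otherwise)
    n = {3: 3, 6: 3, 9: 3, 2: 4, 4: 4, 8: 4, 1: 5, 5: 5, 10: 5}.get(distinct, 3)
    w = (hi - lo) / n
    return [lo + i * w for i in range(n + 1)]

print(three_four_five(51.223, 60.812))   # [51.0, 53.0, 55.0, 57.0, 59.0, 61.0]
```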
Python - Data Wrangling
Data wrangling is the process of cleaning and unifying messy and complex data sets
for easy access and analysis.
Working with raw data sucks.

• Data comes in all shapes and sizes – CSV files, PDFs, stone tablets, .jpg…

• Different files have different formatting – spaces instead of NULLs, extra rows

• “Dirty” data – unwanted anomalies – duplicates
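A minimal pandas sketch of such a wrangling pass; the toy DataFrame stands in for a messy CSV export, and the column names are made up:

```python
# Minimal sketch of a typical wrangling pass with pandas (toy "raw" data).
import pandas as pd

raw = pd.DataFrame({
    " Amount ": ["100", " ", "100", "abc"],
    "Region":   ["north", "north", "north", "south"],
})

raw.columns = raw.columns.str.strip().str.lower()                  # fix inconsistent headers
clean = raw.replace(r"^\s*$", pd.NA, regex=True)                   # spaces instead of NULLs
clean = clean.drop_duplicates()                                    # unwanted duplicate rows
clean["amount"] = pd.to_numeric(clean["amount"], errors="coerce")  # coerce dirty numerics
clean = clean.dropna(subset=["amount"])                            # drop rows left unusable
print(clean)
```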


Principal Component Analysis

Principal Component Analysis, or PCA, is a dimensionality-reduction method that is often
used to reduce the dimensionality of large data sets by transforming a large set of variables
into a smaller one that still contains most of the information in the large set.
Principal Component Analysis or Karhunen-Loève (K-L) Method
• Given N data vectors in k dimensions, find c <= k orthogonal vectors that can best be used to represent the data
  • The original data set is reduced to one consisting of N data vectors on c principal components (reduced dimensions)
• Each data vector is a linear combination of the c principal component vectors
• Works for numeric data only
• Used when the number of dimensions is large
Principal Component Analysis
[Figure: X1, X2 are the original axes (attributes); Y1, Y2 are the principal components, with Y1 the significant component (high variance).]
Order the principal components by significance and eliminate the weaker ones.
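A minimal numpy sketch of PCA via eigendecomposition of the covariance matrix; the 2-D synthetic data is for illustration only:

```python
# Minimal sketch of PCA: mean-center, eigendecompose the covariance matrix,
# order components by variance, and keep the top-c components.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2)) @ np.array([[3.0, 1.0], [0.0, 0.5]])   # correlated attributes X1, X2

Xc = X - X.mean(axis=0)                        # mean centering
cov = np.cov(Xc, rowvar=False)                 # covariance of the attributes
eigvals, eigvecs = np.linalg.eigh(cov)         # eigh: for symmetric matrices, ascending eigenvalues

order = np.argsort(eigvals)[::-1]              # order components by significance (variance)
components = eigvecs[:, order]

c = 1                                          # keep only the most significant component (Y1)
reduced = Xc @ components[:, :c]
print(eigvals[order].round(3), reduced.shape)  # explained variances and the reduced data
```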
