Data Preprocessing 09112023 065121pm

1. The document discusses various concepts related to data preprocessing including data types, attributes, objects, structured data characteristics, types of datasets, and common data quality issues. 2. Key steps in data preprocessing are discussed such as data aggregation, sampling, dimensionality reduction, feature selection/creation, and data transformation techniques. 3. The goal of data preprocessing is to handle data quality issues, reduce data size, and transform the data into a format that is suitable for data mining and machine learning algorithms.

Uploaded by

AHSAN HAMEED

DATA PREPROCESSING

WHAT IS DATA?

● Collection of data objects and their attributes

● An attribute is a property or characteristic of an object
– Examples: eye color of a person, temperature, etc.
– Attribute is also known as variable, field, characteristic, or feature
● A collection of attributes describes an object
– Object is also known as record, point, case, sample, entity, or instance

[Figure labels: Attributes, Objects]

TYPES OF ATTRIBUTES

● There are different types of attributes
– Nominal
  • Examples: ID numbers, eye color, zip codes
– Ordinal
  • Examples: rankings (e.g., taste of potato chips on a scale from 1-10), grades, height in {tall, medium, short}
– Interval
  • Examples: calendar dates, temperatures in Celsius or Fahrenheit
– Ratio
  • Examples: temperature in Kelvin, length, time, counts
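
The four attribute types differ in which operations on their values are meaningful. A minimal sketch of that idea follows. The `MEANINGFUL_OPS` table and the `is_meaningful` helper are illustrative names that do not appear in the slides, and the operation sets follow the usual textbook summary of measurement scales.

```python
# Operations that are meaningful for each attribute type (illustrative table).
MEANINGFUL_OPS = {
    "nominal": {"==", "!="},
    "ordinal": {"==", "!=", "<", ">"},
    "interval": {"==", "!=", "<", ">", "+", "-"},
    "ratio": {"==", "!=", "<", ">", "+", "-", "*", "/"},
}

def is_meaningful(attr_type: str, op: str) -> bool:
    """Return True if `op` makes sense for values of `attr_type`."""
    return op in MEANINGFUL_OPS[attr_type]

# Ratios are meaningful in Kelvin (ratio) but not in Celsius (interval).
celsius_a, celsius_b = 10.0, 20.0
kelvin_a, kelvin_b = celsius_a + 273.15, celsius_b + 273.15

print(celsius_b / celsius_a)  # 2.0, yet 20 °C is not "twice as hot" as 10 °C
print(kelvin_b / kelvin_a)    # roughly 1.035, the physically meaningful ratio
print(is_meaningful("interval", "/"), is_meaningful("ratio", "/"))
```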

DISCRETE AND CONTINUOUS ATTRIBUTES

● Discrete Attribute
– Has only a finite or countably infinite set of values
– Examples: zip codes, counts, or the set of words in a collection
of documents
– Often represented as integer variables.
– Note: binary attributes are a special case of discrete
attributes

● Continuous Attribute
– Has real numbers as attribute values
– Examples: temperature, height, or weight.
– Practically, real values can only be measured and represented
using a finite number of digits.
– Continuous attributes are typically represented as floating-
point variables.
TYPES OF DATA SETS
● Record
– Data Matrix
– Document Data
– Transaction Data
● Graph
– World Wide Web
– Molecular Structures
● Ordered
– Spatial Data
– Temporal Data
– Sequential Data
– Genetic Sequence Data

IMPORTANT CHARACTERISTICS OF STRUCTURED DATA

– Dimensionality
  • Curse of Dimensionality

– Sparsity
  • Only presence counts

– Resolution
  • Patterns depend on the scale

RECORD DATA

● Data that consists of a collection of records, each of which consists of a fixed set of attributes

DATA MATRIX

● If data objects have the same fixed set of numeric attributes, then the data objects can be thought of as points in a multi-dimensional space, where each dimension represents a distinct attribute

● Such a data set can be represented by an m by n matrix, where there are m rows, one for each object, and n columns, one for each attribute
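
As a rough sketch of the m by n representation, the block below builds a data matrix from a few objects and treats two rows as points in space. The attribute names and values are hypothetical and chosen only for illustration.

```python
import math

# Hypothetical attribute names and values, for illustration only.
attributes = ["height_cm", "weight_kg", "age_years"]
objects = [
    {"height_cm": 170.0, "weight_kg": 65.0, "age_years": 30.0},
    {"height_cm": 182.0, "weight_kg": 80.0, "age_years": 41.0},
    {"height_cm": 158.0, "weight_kg": 52.0, "age_years": 25.0},
]

# m-by-n data matrix: one row per object, one column per attribute.
data_matrix = [[obj[name] for name in attributes] for obj in objects]
m, n = len(data_matrix), len(attributes)
print(m, n)

# Each row is a point in n-dimensional space, so a distance is defined.
print(math.dist(data_matrix[0], data_matrix[1]))
```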

DOCUMENT DATA

● Each document becomes a 'term' vector,
– each term is a component (attribute) of the vector,
– the value of each component is the number of times the corresponding term occurs in the document.
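
The term-vector idea can be sketched with a simple word count. The two documents and the whitespace tokenizer below are assumptions made for illustration; real pipelines usually add steps such as stop-word removal.

```python
from collections import Counter

# Hypothetical documents, for illustration only.
documents = [
    "data mining finds patterns in data",
    "graph mining studies graph structure",
]

def tokenize(text):
    """Very simple tokenizer: lowercase and split on whitespace."""
    return text.lower().split()

# The vocabulary fixes the order of the vector components (attributes).
vocabulary = sorted({term for doc in documents for term in tokenize(doc)})

def term_vector(text):
    """Count how often each vocabulary term occurs in `text`."""
    counts = Counter(tokenize(text))
    return [counts[term] for term in vocabulary]

vectors = [term_vector(doc) for doc in documents]
print(vocabulary)
for vector in vectors:
    print(vector)
```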

TRANSACTION DATA

● A special type of record data, where
– each record (transaction) involves a set of items.
– For example, consider a grocery store. The set of products purchased by a customer during one shopping trip constitutes a transaction, while the individual products that were purchased are the items.
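
One possible way to hold transaction data in code is a mapping from transaction ID to a set of items. The shopping trips below are hypothetical. The second half shows why transaction data counts as record data: each transaction can be rewritten as a record with one binary attribute per item.

```python
from collections import Counter

# Hypothetical shopping trips: transaction ID -> set of items purchased.
transactions = {
    1: {"bread", "coke", "milk"},
    2: {"beer", "bread"},
    3: {"beer", "coke", "diaper", "milk"},
}

# Number of transactions that contain each item.
item_counts = Counter(item for items in transactions.values() for item in items)

# Transaction data as record data: one binary attribute per item.
all_items = sorted(item_counts)
binary_records = {
    tid: [1 if item in items else 0 for item in all_items]
    for tid, items in transactions.items()
}

print(all_items)
print(binary_records[2])
```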

[Figure labels: item, transaction]

GRAPH DATA

● Examples: Generic graph and HTML links

<a href="papers/papers.html#bbbb">Data Mining</a>
<li><a href="papers/papers.html#aaaa">Graph Partitioning</a>
<li><a href="papers/papers.html#aaaa">Parallel Solution of Sparse Linear System of Equations</a>
<li><a href="papers/papers.html#ffff">N-Body Computation and Dense Linear System Solvers</a>
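
HTML links form a graph in which pages are nodes and each link is a directed edge. The sketch below extracts the `href` targets from a fragment like the one above using Python's standard `html.parser`. The source page name `index.html` is an assumption, since the slides do not say which page holds the links.

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag seen."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

fragment = """
<a href="papers/papers.html#bbbb">Data Mining</a>
<li><a href="papers/papers.html#aaaa">Graph Partitioning</a>
<li><a href="papers/papers.html#aaaa">Parallel Solution of Sparse Linear System of Equations</a>
<li><a href="papers/papers.html#ffff">N-Body Computation and Dense Linear System Solvers</a>
"""

collector = LinkCollector()
collector.feed(fragment)

SOURCE_PAGE = "index.html"  # hypothetical name of the page holding the links
edges = [(SOURCE_PAGE, target) for target in collector.links]
distinct_targets = {target for _source, target in edges}

print(len(edges), "edges to", len(distinct_targets), "distinct targets")
```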

CHEMICAL DATA

● Benzene molecule: C6H6

ORDERED DATA

● Sequences of transactions

[Figure labels: items/events; an element of the sequence]

ORDERED DATA

● Genomic sequence data

ORDERED DATA

● Spatio-Temporal Data

[Figure captions: average monthly temperature of land and ocean; trajectories of moving objects]

DATA QUALITY

● What kinds of data quality problems?
● How can we detect problems with the data?
● What can we do about these problems?

● Examples of data quality problems:
– Noise and outliers
– Missing values
– Duplicate data

NOISE

● Noise refers to modification of original values
– Examples: distortion of a person’s voice when talking on a poor phone and “snow” on television screen

[Figure: two sine waves; two sine waves + noise]
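
The effect of noise on a signal made of two sine waves can be imitated in a few lines. This is only a sketch: the frequencies (1 Hz and 3 Hz), the sampling step, and the Gaussian noise with standard deviation 0.5 are assumptions, not values taken from the slides.

```python
import math
import random

random.seed(0)  # fixed seed so the sketch is repeatable

def two_sine_waves(t):
    """Sum of two sine waves; the frequencies are assumed for illustration."""
    return math.sin(2 * math.pi * 1.0 * t) + math.sin(2 * math.pi * 3.0 * t)

times = [i / 100 for i in range(200)]
clean = [two_sine_waves(t) for t in times]

# Noise modifies the original values: here, additive Gaussian noise.
noisy = [value + random.gauss(0.0, 0.5) for value in clean]

# Root-mean-square difference between the clean and the noisy signal.
rmse = math.sqrt(sum((c - v) ** 2 for c, v in zip(clean, noisy)) / len(clean))
print(f"RMS distortion introduced by noise: {rmse:.3f}")
```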
OUTLIERS

● Outliers are data objects with characteristics that are considerably different than most of the other data objects in the data set
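
The slides do not prescribe a detection method. One common heuristic, used here purely as an example, flags values that lie more than 1.5 interquartile ranges outside the quartiles. The readings are hypothetical.

```python
from statistics import quantiles

def iqr_outliers(values, k=1.5):
    """Return the values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _median, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]

readings = [10, 11, 12, 11, 10, 12, 11, 95]  # hypothetical measurements
print(iqr_outliers(readings))  # [95]
```

The multiplier `k` is a tuning choice: larger values flag fewer points.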

DEVIATION/ANOMALY DETECTION

● Outliers are useful when we need to detect significant deviations from normal behavior
● Applications:
– Credit Card Fraud Detection
– Network Intrusion Detection

MISSING VALUES

● Reasons for missing values
– Information is not collected (e.g., people decline to give their age and weight)
– Attributes may not be applicable to all cases (e.g., annual income is not applicable to children)

● Handling missing values
– Eliminate Data Objects
– Estimate Missing Values
– Ignore the Missing Value During Analysis
– Replace with all possible values (weighted by their probabilities)
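
The first three handling strategies can be sketched as follows, with `None` standing in for a missing value. The records are hypothetical, and mean imputation is only one of many ways to estimate a missing value. The fourth strategy (weighting all possible values by probability) is not covered here.

```python
from statistics import mean

# Hypothetical records; None marks a missing value.
records = [
    {"age": 25, "weight": 70.0},
    {"age": None, "weight": 82.5},
    {"age": 40, "weight": None},
    {"age": 31, "weight": 64.0},
]

def eliminate(rows):
    """Eliminate data objects that have any missing value."""
    return [row for row in rows if None not in row.values()]

def impute_mean(rows, attr):
    """Estimate missing values of `attr` with the mean of the observed ones."""
    fill = mean(row[attr] for row in rows if row[attr] is not None)
    return [{**row, attr: fill if row[attr] is None else row[attr]} for row in rows]

def mean_ignoring_missing(rows, attr):
    """Ignore missing values during the analysis itself."""
    return mean(row[attr] for row in rows if row[attr] is not None)

print(len(eliminate(records)))
print(impute_mean(records, "age")[1])
print(mean_ignoring_missing(records, "weight"))
```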

DUPLICATE DATA

● Data set may include data objects that are duplicates, or almost duplicates, of one another
– Major issue when merging data from heterogeneous sources

● Examples:
– Same person with multiple email addresses

● Data cleaning
– Process of dealing with duplicate data issues
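
A minimal sketch of merging near-duplicate person records follows. It assumes that a normalized name plus birth date identifies a person, which is a simplification chosen for illustration. The records and the `person_key` helper are hypothetical.

```python
# Hypothetical records merged from two sources.
records = [
    {"name": "Jane Doe", "birth_date": "1990-04-12", "email": "jane@example.com"},
    {"name": "jane  doe ", "birth_date": "1990-04-12", "email": "j.doe@example.org"},
    {"name": "John Roe", "birth_date": "1985-01-30", "email": "john@example.com"},
]

def person_key(record):
    """Assumed identity key: case- and whitespace-normalized name plus birth date."""
    normalized_name = " ".join(record["name"].lower().split())
    return (normalized_name, record["birth_date"])

def merge_duplicates(rows):
    """Collapse records with the same key, keeping every distinct email."""
    merged = {}
    for row in rows:
        key = person_key(row)
        entry = merged.setdefault(
            key, {"name": key[0], "birth_date": key[1], "emails": []}
        )
        if row["email"] not in entry["emails"]:
            entry["emails"].append(row["email"])
    return list(merged.values())

people = merge_duplicates(records)
for person in people:
    print(person)
```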

DATA PREPROCESSING
● Aggregation
● Sampling
● Dimensionality Reduction
● Feature subset selection
● Feature creation
● Discretization and Binarization
● Attribute Transformation
AGGREGATION

● Combining two or more attributes (or objects) into a single attribute (or object)

● Purpose
– Data reduction
  • Reduce the number of attributes or objects
– Change of scale
  • Cities aggregated into regions, states, countries, etc.
– More “stable” data
  • Aggregated data tends to have less variability
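
The purposes of aggregation listed above can be illustrated with made-up numbers. The first part rolls cities up into regions (data reduction and change of scale). The second part averages daily values in blocks of four and compares the spread before and after. All names and figures are hypothetical.

```python
from collections import defaultdict
from statistics import mean, pstdev

# Hypothetical sales per city, each tagged with its region.
city_sales = [
    ("city_a", "north", 120),
    ("city_b", "north", 80),
    ("city_c", "south", 200),
    ("city_d", "south", 60),
]

# Change of scale: four city objects become two region objects.
region_sales = defaultdict(int)
for _city, region, sales in city_sales:
    region_sales[region] += sales
print(dict(region_sales))

# More "stable" data: block averages vary less than the raw daily values.
daily = [12, 18, 9, 21, 15, 11, 19, 14, 16, 10, 20, 15]
block = 4
block_means = [mean(daily[i:i + block]) for i in range(0, len(daily), block)]
print(pstdev(daily), pstdev(block_means))
```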
SAMPLING
● Sampling is the main technique employed for data selection.
– It is often used for both the preliminary investigation of the data and the final data analysis.

● Statisticians sample because obtaining the entire set of data of interest is too expensive or time consuming.

● Sampling is used in data mining because processing the entire set of data of interest is too expensive or time consuming.

SAMPLING …

● The key principle for effective sampling is the following:
– Using a sample will work almost as well as using the entire data set, if the sample is representative
– A sample is representative if it has approximately the same property (of interest) as the original set of data
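
A rough way to check representativeness is to compare a property of interest, here the mean, between a sample and the full data set. The sketch uses simple random sampling. The synthetic Gaussian population of 8000 values and the choice of the mean as the property are assumptions for illustration.

```python
import random
from statistics import mean

rng = random.Random(42)  # fixed seed so the sketch is repeatable

# Hypothetical population of 8000 values.
population = [rng.gauss(50.0, 10.0) for _ in range(8000)]

def sample_without_replacement(data, k):
    """Simple random sample: each object is picked at most once."""
    return rng.sample(data, k)

def sample_with_replacement(data, k):
    """Objects may be picked more than once."""
    return rng.choices(data, k=k)

samples = {k: sample_without_replacement(population, k) for k in (2000, 500)}
for k, sample in samples.items():
    gap = abs(mean(sample) - mean(population))
    print(f"{k} points: |sample mean - population mean| = {gap:.3f}")
```

Smaller samples tend to show a larger gap, which is the trade-off behind choosing a sample size.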

SAMPLE SIZE

[Figure: the same data set shown with 8000 points, 2000 points, and 500 points]

CURSE OF DIMENSIONALITY

● When dimensionality increases, data becomes increasingly sparse in the space that it occupies

● Definitions of density and distance between points, which are critical for clustering and outlier detection, become less meaningful.
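
One way to see distances losing meaning is to measure how much the farthest neighbor differs from the nearest one as the number of dimensions grows. The sketch below uses uniformly random points in the unit hypercube. The point count, the dimensions tried, and the `relative_contrast` name are illustrative choices.

```python
import math
import random

def relative_contrast(dim, n_points=200, seed=0):
    """(max distance - min distance) / min distance from one query point."""
    rng = random.Random(seed)
    points = [[rng.random() for _ in range(dim)] for _ in range(n_points)]
    query = points[0]
    distances = [math.dist(query, p) for p in points[1:]]
    return (max(distances) - min(distances)) / min(distances)

for dim in (2, 10, 100, 1000):
    print(dim, round(relative_contrast(dim), 2))
```

As the contrast shrinks toward zero, the nearest and farthest points become hard to tell apart, which weakens distance-based clustering and outlier detection.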
DIMENSIONALITY REDUCTION

● Purpose:
– Avoid curse of dimensionality
– Reduce amount of time and memory required by data
mining algorithms
– Allow data to be more easily visualized
– May help to eliminate irrelevant features or reduce
noise

FEATURE SUBSET SELECTION

● Reduce dimensionality of data

Remove:
● Redundant features
– duplicate much or all of the information contained in one or
more other attributes
– Example: purchase price of a product and the amount of
sales tax paid

● Irrelevant features
– contain no information that is useful for the data mining
task at hand
– Example: students' ID is often irrelevant to the task of predicting students' GPA
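
The redundancy in the price and sales tax example can be checked numerically: if tax is a fixed fraction of price, the two features are perfectly correlated, so one of them adds no information. The prices and the 8% rate below are hypothetical, and Pearson correlation is just one possible redundancy check.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)

TAX_RATE = 0.08  # hypothetical flat sales tax rate
prices = [10.0, 25.0, 40.0, 55.0, 80.0]  # hypothetical purchase prices
sales_tax = [price * TAX_RATE for price in prices]

correlation = pearson(prices, sales_tax)
print(round(correlation, 6))
```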
FEATURE SUBSET SELECTION

● Techniques:
– Brute-force approach:
  • Try all possible feature subsets as input to data mining algorithm
– Embedded approaches:
  • Feature selection occurs naturally as part of the data mining algorithm
– Filter approaches:
  • Features are selected before data mining algorithm is run
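
The brute-force and filter approaches can be contrasted in a small sketch. The `evaluate` function is a stand-in for running a real data mining algorithm on a subset, and the feature names, usefulness scores, and per-feature cost are all hypothetical. Embedded approaches are not shown because they depend on the specific algorithm.

```python
from itertools import combinations

# Hypothetical usefulness of each feature for predicting GPA.
USEFULNESS = {"study_hours": 0.6, "attendance": 0.3, "student_id": 0.0}
COST_PER_FEATURE = 0.05  # assumed penalty for each extra feature

def all_subsets(features):
    """Yield every non-empty subset: 2**n - 1 candidates for n features."""
    for size in range(1, len(features) + 1):
        yield from combinations(features, size)

def evaluate(subset):
    """Stand-in for training and validating a model on `subset`."""
    return sum(USEFULNESS[f] for f in subset) - COST_PER_FEATURE * len(subset)

def brute_force_select(features):
    """Brute force: score every subset and keep the best one."""
    return max(all_subsets(features), key=evaluate)

def filter_select(features, threshold=0.1):
    """Filter: keep features by their own score, before any model is run."""
    return [f for f in features if USEFULNESS[f] >= threshold]

features = ["study_hours", "attendance", "student_id"]
print(brute_force_select(features))
print(filter_select(features))
```

Brute force needs an exponential number of evaluations, which is why it is practical only for a handful of features.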
