Data Mining
DATA PRE-PROCESSING
Instructor: Qurat-ul-Ain
quratulain.ssc@stmu.edu.pk
Background
Other names: also called data scrubbing or data cleansing.
More than data arranging: a DWH is NOT just about arranging data; the data must also be clean for the overall health of the organization.
Big problem, big effect: an enormous problem, because most real-world data is dirty.
Data duplication: the original problem was removing duplicates within one system; it is compounded by duplicates coming from many systems.
Three Classes of Anomalies
Syntactically Dirty Data
Lexical Errors
Irregularities
Semantically Dirty Data
Integrity Constraint Violation
Business rule contradiction
Duplication
Coverage Anomalies
Missing Attributes
Missing Records
Syntactically Dirty Data
1. Lexical Errors
Discrepancies between the structure of the data items and the specified format of the stored values, e.g. the number of columns used for a tuple is unexpected (mixed-up number of attributes).
Example: every tuple is expected to consist of 4 values, one for each attribute. Here, a missing ZIP value caused the HDATE value to shift into the ZIP column. This may occur when an employee's ZIP code is unavailable at the time the data is transferred to the data warehouse.
Syntactically Dirty Data
2. Irregularities
“irregularities are concerned with the non-uniform use
of values, units and abbreviations”
For example, giving only the annual salary without stating the currency, i.e. US$ or PK Rs.
Semantically Dirty Data
1. Integrity Constraint Violation
2. Contradiction (business rules contradicted, e.g. DoB later than hiring date)
3. Duplication
Duplicate anomalies are often caused during the process of
integrating different data sources into a single data
warehouse.
Coverage or lack of it
1. Missing Attribute
Missing attribute values are the most common failure in data warehouses.
They result from omissions while collecting the data.
Coverage or Lack of it
2. Missing Tuple
Missing tuples can occur in a data warehouse if some entities that exist in the real world are not represented by tuples in the database.
For example, suppose the table "Employee" consists of three tuples. The employee "Smith" is working for the company, but the corresponding tuple is missing from the table.
Why Coverage Anomalies?
Equipment malfunction (bar code reader, keyboard etc.)
Inconsistent with other recorded data and thus deleted.
Data not entered due to misunderstanding/illegibility.
Data not considered important at the time of entry (e.g.
Y2K).
Handling Missing Data
Dropping records.
“Manually” filling missing values.
Using a global constant as filler.
Using the attribute mean (or median) as filler.
Using the most probable value as filler.
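A minimal pandas sketch of these options (the employee table and its column names are made up for illustration):
```python
# Sketch (assumed example data): the common missing-value strategies with pandas.
import pandas as pd

df = pd.DataFrame({
    "EmpNo": [101, 102, 103, 104],
    "Salary": [50000, None, 62000, None],
    "ZIP": ["44000", "46000", None, "45710"],
})

dropped   = df.dropna()                                   # 1. drop records with missing values
constant  = df.fillna({"ZIP": "unknown"})                 # 2. global constant as filler
mean_fill = df.fillna({"Salary": df["Salary"].mean()})    # 3. attribute mean as filler
med_fill  = df.fillna({"Salary": df["Salary"].median()})  # 4. attribute median as filler
print(mean_fill)
```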
Data Preprocessing: An
Overview
No quality data, no quality results!
Quality decisions must be based on quality data
e.g., duplicate or missing data may cause incorrect
or even misleading statistics.
Data warehouse needs consistent integration of quality
data
Data Preprocessing
Data Quality:
Accuracy
Completeness
Consistency
Problems:
Inaccuracy
Incompleteness
Inconsistency
Major Tasks in Data
Preprocessing
Data cleaning
Parsing, correcting, standardization, matching, consolidation
Dealing with missing data.
Dealing with incorrect and noisy data.
Fill in missing values, smooth noisy data, identify or remove
outliers, and resolve inconsistencies
Data integration
Integration of multiple databases or files
The same concept may have different names in different sources, causing
inconsistencies and redundancies.
Some attributes may be inferred from others, e.g. annual revenue.
Major Tasks in Data
Preprocessing
Data reduction
Obtain a reduced representation of the data set that is much smaller in volume, yet produces the same (or almost the same) analytical results. The volume of data is reduced to make analysis easier.
Why data reduction?
A database/data warehouse may store terabytes of data.
Complex data analysis may take a very long time to run on
the complete data set. Dimensionality reduction: Attribute
subset selection, attribute construction.
Data Reduction Strategies
There are three data reduction strategies
Dimensionality Reduction
Numerosity Reduction
Data Compression
Data Reduction Strategies
Numerosity reduction: data is replaced by an alternative, smaller representation (e.g. histograms, clusters, regression models). It can be of two types:
Parametric: only the model parameters (and possibly outliers) are stored instead of the actual data (regression and log-linear models).
Non-parametric: data is stored in the form of histograms, clusters, or samples.
Major Tasks in Data
Preprocessing
Data transformation and data discretization
Normalization: Scaling to a smaller range. E.g. [0-1]
E.g. Normalizing the age and salary to [0-1]
Without normalization, distance based calculations may
generate skewed results.
Concept hierarchy generation: Raw data values are
replaced by ranges.
Data Cleaning
Data in the real world is dirty: lots of potentially incorrect data, e.g., faulty instruments, human or computer error, transmission errors
Incomplete: lacking attribute values, lacking certain
attributes of interest, or containing only aggregate data e.g.,
Occupation=“ ” (missing data)
Noisy: containing noise, errors, or outliers e.g., Salary=“−10”
(an error)
Inconsistent: containing discrepancies in codes or names,
e.g.,
Age=“42”, Birthday=“03/07/2010”
Incomplete (Missing) Data
Data is not always available
E.g., many tuples have no recorded value for several
attributes, such as customer income in sales data
Missing data may be due to
equipment malfunction
inconsistent with other recorded data and thus deleted
certain data may not be considered important at the time
of entry
Missing data may need to be inferred
Manual Entry (mean /probable)
How to Handle Missing Data?
Ignore the tuple:
Usually done when class label is missing (when doing
classification)—not effective when the % of missing values
per attribute varies considerably
Fill in the missing value manually:
tedious + infeasible?
Fill in it automatically with
A global constant : e.g., “unknown”, a new class?
Use any central tendency measure:
Mean for symmetric
Median for skewed
How to Handle Missing Data?
The attribute mean for all samples belonging to the same
class:
For example, while filling values of annual income
according to credit risk, fill value from the incomes of
customers of same class.
The most probable value: inference-based such as
Bayesian formula or decision tree
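A small sketch of the class-conditional fill described above, using pandas with hypothetical column names (credit_risk, annual_income):
```python
# Sketch: fill missing annual income with the mean income of customers in the same
# credit-risk class (hypothetical column names and values).
import pandas as pd

df = pd.DataFrame({
    "credit_risk":   ["low", "low", "high", "high", "low"],
    "annual_income": [90000, None, 30000, None, 80000],
})

df["annual_income"] = df.groupby("credit_risk")["annual_income"] \
                        .transform(lambda s: s.fillna(s.mean()))
print(df)
```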
Noisy Data
Noise:
Random error in a measured variable
Incorrect attribute values may be due to
faulty data collection instruments
data entry problems
data transmission problems
How to Handle Noisy Data?
Regression
Smooth by fitting the data into regression functions.
Finding the best line to fit two attributes so that one attribute can be used to predict the other's value (see the sketch at the end of this slide).
Clustering
Detect and remove outliers
Combined computer and human inspection
Detect suspicious values and check by human (e.g., deal
with possible outliers)
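A rough sketch of regression-based smoothing and outlier flagging with numpy; the data and the 2-sigma threshold are illustrative assumptions, not from the slides:
```python
# Sketch: smooth one attribute by fitting a regression line on another,
# then flag points that sit far from the line for human inspection.
import numpy as np

x = np.array([1, 2, 3, 4, 5, 6], dtype=float)    # predictor attribute
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8, 30.0])    # noisy attribute (30.0 looks suspicious)

w, b = np.polyfit(x, y, deg=1)                   # best-fit line y ~ w*x + b
y_smooth = w * x + b                             # smoothed (fitted) values

residuals = np.abs(y - y_smooth)
suspects = x[residuals > 2 * residuals.std()]    # possible outliers to check manually
print(y_smooth.round(2), suspects)
```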
Dimensionality Reduction
Dimensionality reduction, e.g., remove unimportant attributes or reduce the number of input variables.
It represents the original data in a compressed or reduced form by applying data encoding and transformation.
An encoding mechanism is used to reduce the data size.
If the original data can be reconstructed from the compressed data without losing any information, the reduction is called lossless.
If the reconstructed data is only an approximation of the original data, it is called lossy.
Wavelet transforms
Principal Components Analysis (PCA)
Feature subset selection, feature creation
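A minimal scikit-learn sketch of PCA as dimensionality reduction on synthetic data (the choice of 2 components is arbitrary, for illustration only):
```python
# Sketch: reduce 6 attributes to 2 principal components.
import numpy as np
from sklearn.decomposition import PCA

X = np.random.default_rng(0).normal(size=(100, 6))   # 100 tuples, 6 attributes (synthetic)

pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)                      # reduced representation, shape (100, 2)
print(X_reduced.shape, pca.explained_variance_ratio_)
```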
Feature/Attribute Selection
Feature selection is the method of reducing the input variables to your model by using only relevant data and getting rid of noise in the data.
It is the process of automatically choosing relevant features for your machine learning model based on the type of problem you are trying to solve. The following are feature selection methods (a small sketch follows the list):
Filter methods: information gain, chi-square test, correlation coefficient
Wrapper methods: recursive feature elimination, genetic algorithms
Embedded methods: decision trees
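The sketch referenced above: a filter method that ranks features with the chi-square test via scikit-learn's SelectKBest (the Iris data set and k=2 are just illustrative choices):
```python
# Sketch of a filter method: keep the k features with the highest chi-square scores.
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, chi2

X, y = load_iris(return_X_y=True)          # chi2 requires non-negative feature values
selector = SelectKBest(score_func=chi2, k=2)
X_selected = selector.fit_transform(X, y)  # keep the 2 highest-scoring features
print(selector.scores_, X_selected.shape)
```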
Feature/Attribute Subset Selection
Remove redundant and irrelevant attributes
Redundant attributes
Duplicate much or all of the information contained in one or more
other attributes
E.g., purchase price of a product and the amount of sales tax paid
Irrelevant attributes
Contain no information that is useful for the data mining task at
hand
E.g., students' ID is often irrelevant to the task of predicting
students' GPA
Domain Expert
Heuristic Search
Heuristic Search in Attribute
Selection
There are 2^d possible attribute combinations of d attributes, so exhaustive search is not feasible.
Typical heuristic attribute selection methods:
Best single attribute under the attribute independence
assumption: choose by significance tests (in Chapter 7)
Best step-wise feature selection:
The best single attribute is picked first
Then the next best attribute conditioned on the first, ...
Step-wise attribute elimination:
Repeatedly eliminate the worst attribute
Best combined attribute selection and elimination
Decision Tree Induction
A decision tree constructs a flow-chart-like structure where each internal node denotes a test on an attribute, each branch corresponds to an outcome of the test, and each leaf denotes a class prediction.
At each node, the algorithm selects the best attribute to partition the data into subclasses.
All non-selected attributes are considered irrelevant.
Decision Tree Induction
Initial attribute set: {A1, A2, A3, A4, A5, A6}
[Tree: the root tests A4, the next level tests A1 and A6, and the leaves predict Class 1 or Class 2]
Reduced attribute set: {A1, A4, A6}
Numerosity Reduction
Numerosity reduction (some simply call it: Data
Reduction)
Replace original data volume by alternative, smaller forms of
data representation
Parametric (Regression):
Assume the data fits some model, estimate model
parameters, store only the parameters, and discard the
data (except possible outliers)
Non-Parametric:
Do not assume models
Major families: histograms, clustering, sampling, …
Data compression
Regression and Log-Linear
Models
Linear Regression
Data modeled to fit a straight line
Y = w X + b
Two regression coefficients, w and b, specify the line and
are to be estimated by using the data at hand
Multiple Regression
Allows a response variable Y to be modeled as a linear function of a multidimensional feature vector
Y = b0 + b1 X1 + b2 X2
Many nonlinear functions can be transformed into the
above
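A small numpy sketch of regression as parametric numerosity reduction: only the two coefficients w and b are kept in place of the raw points (synthetic data):
```python
# Sketch: fit Y = w*X + b and keep only the two parameters instead of the 10 data points.
import numpy as np

X = np.arange(1, 11, dtype=float)
Y = 3.0 * X + 5.0 + np.random.default_rng(1).normal(scale=0.5, size=10)

w, b = np.polyfit(X, Y, deg=1)     # two numbers summarize the whole (X, Y) sample
print(f"Y ~ {w:.2f} * X + {b:.2f}")
```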
Histogram Analysis
Divide data into buckets and store average (sum) for each
bucket
Partitioning rules:
Equal-width: equal bucket range
Equal-frequency (or equal-depth)
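A minimal numpy sketch of both partitioning rules, reusing the price list from the binning example later in the section:
```python
# Sketch: equal-width buckets via np.histogram, equal-frequency boundaries via quantiles.
import numpy as np

prices = np.array([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34])

counts, edges = np.histogram(prices, bins=3)           # equal-width buckets
eq_freq_edges = np.quantile(prices, [0, 1/3, 2/3, 1])  # equal-frequency (equi-depth) boundaries
print(counts, edges, eq_freq_edges)
```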
Clustering
Partition data set into clusters based on similarity, and
store cluster representation (e.g., centroid and diameter)
only
Centroid
Diameter
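A rough scikit-learn sketch that keeps only each cluster's centroid and diameter (synthetic 2-D data; the brute-force diameter computation is for illustration only):
```python
# Sketch: replace the raw points of each cluster by centroid + diameter.
import numpy as np
from sklearn.cluster import KMeans

X = np.random.default_rng(2).normal(size=(60, 2))
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

for c in range(3):
    members = X[km.labels_ == c]
    diameter = max(np.linalg.norm(a - b) for a in members for b in members)
    print("centroid:", km.cluster_centers_[c].round(2), "diameter:", round(diameter, 2))
```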
Sampling
Sampling: obtaining a small sample s to represent the whole
data set N
Key principle: Choose a representative subset of the data
Simple random sampling may have very poor performance
in the presence of skew
Develop adaptive sampling methods, e.g., stratified
sampling:
Types of Sampling
Simple random sampling
There is an equal probability of selecting any particular item
Sampling without replacement
Once an object is selected, it is removed from the population
Sampling with replacement
A selected object is not removed from the population
Stratified sampling:
Partition the data set, and draw samples from each partition
(proportionally, i.e., approximately the same percentage of the
data)
Used in conjunction with skewed data
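A minimal pandas sketch of the three sampling types on a made-up, skewed data set (70% grade A, 30% grade B):
```python
# Sketch: SRSWOR, SRSWR, and proportional stratified sampling with pandas.
import pandas as pd

df = pd.DataFrame({"grade": ["A"] * 70 + ["B"] * 30, "score": range(100)})

srswor = df.sample(n=10, replace=False, random_state=0)          # without replacement
srswr  = df.sample(n=10, replace=True,  random_state=0)          # with replacement
strat  = df.groupby("grade", group_keys=False) \
           .apply(lambda g: g.sample(frac=0.1, random_state=0))  # ~10% from each stratum
print(strat["grade"].value_counts())
```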
Sampling: With or without
Replacement
[Figure: from the raw data, a simple random sample is drawn either without replacement (SRSWOR) or with replacement (SRSWR)]
Data Compression
String Compression
There are extensive theories and well-tuned algorithms
(Huffman Encoding)
Typically lossless, but only limited manipulation is
possible without expansion (Text Encryption)
Audio/Video Compression
Typically lossy compression, with progressive refinement
Sometimes small fragments of signal can be
reconstructed without reconstructing the whole
Compression ratio = compressed size / original size
Compression factor = 1 / compression ratio
Data Compression
[Figure: lossless compression maps the original data to compressed data and reconstructs it exactly; lossy compression reconstructs only an approximation of the original data]
Discretization
Discretization divides the range of a continuous attribute into intervals so as to reduce the number of values for that attribute.
Splitting: top-down (the attribute's range is split into sub-ranges)
Merging: bottom-up (initially all values are treated as intervals; some are removed by merging)
Interval labels can then be used to replace actual data values
Reduce data size by discretization
Supervised vs. unsupervised (if class information is used)
Discretization can be performed recursively on an attribute
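A minimal pandas sketch of discretization by splitting: pd.cut gives equal-width intervals and pd.qcut gives equal-frequency intervals (the age values and labels are illustrative):
```python
# Sketch: replace continuous ages by interval labels.
import pandas as pd

age = pd.Series([13, 15, 16, 19, 20, 21, 22, 25, 30, 35, 40, 45, 46, 52, 70])

equal_width = pd.cut(age, bins=3, labels=["young", "middle", "old"])   # equal-width intervals
equal_depth = pd.qcut(age, q=3, labels=["low", "mid", "high"])         # equal-frequency intervals
print(pd.DataFrame({"age": age, "width": equal_width, "depth": equal_depth}))
```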
Discretization
Concept Hierarchy
Helps in reducing the data by collecting and replacing low-level concepts with high-level concepts, e.g. replacing mobile number and landline number with telephone number.
Three types of attributes
Nominal—values from an unordered set, e.g., color,
profession
Ordinal—values from an ordered set, e.g., military or
academic rank
Numeric—real numbers, e.g., integer or real numbers
Data Discretization Methods
Typical methods: All the methods can be applied recursively
Binning /Histogram
Top-down split, unsupervised
Clustering analysis (unsupervised, top-down split or
bottom-up merge)
Decision-tree analysis (supervised, top-down split)
Correlation (e.g., χ²) analysis (supervised, bottom-up merge)
Discretization by Classification
& Correlation Analysis
Classification (e.g., decision tree analysis)
Supervised: Given class labels, e.g., cancerous vs. benign
Using entropy to determine split point (discretization point)
Top-down, recursive split
Details to be covered in Chapter 7
Correlation analysis (e.g., Chi-merge: χ2-based discretization)
Supervised: use class information
Bottom-up merge: find the best neighboring intervals (those having
similar distributions of classes, i.e., low χ2 values) to merge
Merge is performed recursively, until a predefined stopping condition is met
How to Handle Noisy Data?
Binning
Smooths sorted data by consulting its neighborhood.
First sort the data and partition it into (equal-frequency) bins.
Then smooth by bin means, bin medians, bin boundaries, etc.
Simple Discretization:
Binning
Equal-width (distance) partitioning
Divides the range into N intervals of equal size: uniform
grid
If A and B are the lowest and highest values of the attribute, the width of the intervals will be W = (B - A)/N.
The most straightforward approach, but outliers may dominate the presentation.
Equal-depth (frequency) partitioning
Divides the range into N intervals, each containing approximately the same number of samples
Binning Methods for Data
Smoothing
Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25,
26, 28, 29, 34
* Partition into equal-frequency (equi-depth) bins:
- Bin 1: 4, 8, 9, 15
- Bin 2: 21, 21, 24, 25
- Bin 3: 26, 28, 29, 34
* Smoothing by bin means:
- Bin 1: 9, 9, 9, 9
- Bin 2: 23, 23, 23, 23
- Bin 3: 29, 29, 29, 29
* Smoothing by bin boundaries:
- Bin 1: 4, 4, 4, 15
- Bin 2: 21, 21, 25, 25
- Bin 3: 26, 26, 26, 34
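A small numpy sketch that reproduces the worked example above (equi-depth bins, then smoothing by bin means and by bin boundaries):
```python
# Sketch: 3 equal-frequency bins of the sorted prices, smoothed two ways.
import numpy as np

prices = np.array([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34])
bins = prices.reshape(3, 4)                       # 3 bins of 4 sorted values each

by_means = np.repeat(bins.mean(axis=1).round().astype(int), 4).reshape(3, 4)

by_bounds = bins.copy()
for b in by_bounds:                               # snap each value to the nearest bin boundary
    lo, hi = b[0], b[-1]
    b[:] = np.where(np.abs(b - lo) <= np.abs(b - hi), lo, hi)

print(by_means)    # [[ 9  9  9  9] [23 23 23 23] [29 29 29 29]]
print(by_bounds)   # [[ 4  4  4 15] [21 21 25 25] [26 26 26 34]]
```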
Data Integration
Combines data from multiple sources into a coherent store
Careful integration can help reduce and avoid redundancies
and inconsistencies.
Entity identification problem:
Identify real world entities from multiple data sources, e.g.,
Bill Clinton = William Clinton
Be careful during schema integration
E.g., A.cust-id ≡ B.cust-#
Detecting and resolving data value conflicts
For the same real world entity, attribute values from
different sources are different
Possible reasons: different representations, different
scales, e.g., metric vs. British units
Data Integration
Integrate metadata from different sources; metadata can be used to improve the integration.
What is metadata?
Redundant data often occur when integrating multiple databases:
Derivable data: one attribute may be a "derived" attribute in another table, e.g., annual revenue
Redundant attributes may be detected by correlation analysis and covariance analysis
Careful integration of the data from multiple sources may help reduce/avoid redundancies and inconsistencies and improve mining speed and quality
Correlation Analysis (Nominal
Data)
χ² (chi-square) test:
χ² = Σ (Observed - Expected)² / Expected
The larger the χ² value, the more likely the variables are related.
The cells that contribute the most to the χ² value are those whose actual count is very different from the expected count.
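A minimal scipy sketch of the chi-square test on a 2x2 contingency table; the observed counts are hypothetical:
```python
# Sketch: chi-square test of independence between two nominal attributes.
import numpy as np
from scipy.stats import chi2_contingency

# Observed counts, e.g. rows = attribute A categories, columns = attribute B categories
observed = np.array([[250,  200],
                     [ 50, 1000]])

chi2, p, dof, expected = chi2_contingency(observed, correction=False)
print(round(chi2, 1), p, expected.round(1))   # large chi2 / small p => likely related
```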
Correlation Analysis (Numeric
Data)
Correlation coefficient (also called Pearson’s product moment
coefficient)
rA,B = Σ (ai - mean(A))(bi - mean(B)) / (n σA σB) = (Σ ai·bi - n·mean(A)·mean(B)) / (n σA σB)
where n is the number of tuples, mean(A) and mean(B) are the respective means of A and B, σA and σB are the respective standard deviations of A and B, and Σ ai·bi is the sum of the AB cross-product.
If rA,B > 0, A and B are positively correlated (A's values increase as B's do). The higher the value, the stronger the correlation.
rA,B = 0: uncorrelated (no linear relationship); rA,B < 0: negatively correlated
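A small numpy check of the formula above on illustrative data (the stock values reused from the covariance example later in the section):
```python
# Sketch: Pearson correlation via np.corrcoef vs. the formula above.
import numpy as np

A = np.array([2, 3, 5, 4, 6], dtype=float)
B = np.array([5, 8, 10, 11, 14], dtype=float)

r = np.corrcoef(A, B)[0, 1]
manual = ((A * B).sum() - len(A) * A.mean() * B.mean()) / (len(A) * A.std() * B.std())
print(round(r, 3), round(manual, 3))   # both give the same value
```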
Covariance (Numeric Data)
Covariance is similar to correlation:
Cov(A,B) = E[(A - mean(A))·(B - mean(B))] = Σ (ai - mean(A))(bi - mean(B)) / n
Correlation coefficient: rA,B = Cov(A,B) / (σA σB)
where n is the number of tuples, mean(A) and mean(B) are the respective mean or expected values of A and B, and σA and σB are the respective standard deviations of A and B.
Covariance (Numeric Data)
Positive covariance: If CovA,B > 0, then A and B both tend to be
larger than their expected values.
Negative covariance: If CovA,B < 0 then if A is larger than its
expected value, B is likely to be smaller than its expected value.
Independence: CovA,B = 0 but the converse is not true:
Some pairs of random variables may have a covariance of 0
but are not independent. Only under some additional
assumptions (e.g., the data follow multivariate normal
distributions) does a covariance of 0 imply independence
Co-Variance: An Example
It can be simplified in computation as: Cov(A,B) = E(A·B) - mean(A)·mean(B)
Co-Variance: An Example
Suppose two stocks A and B have the following values in one
week:
(2, 5), (3, 8), (5, 10), (4, 11), (6, 14).
Question: If the stocks are affected by the same industry
trends, will their prices rise or fall together?
E(A) = (2 + 3 + 5 + 4 + 6)/ 5 = 20/5 = 4
E(B) = (5 + 8 + 10 + 11 + 14) /5 = 48/5 = 9.6
Cov(A,B) = (2×5+3×8+5×10+4×11+6×14)/5 − 4 × 9.6 = 4
Thus, A and B rise together since Cov(A, B) > 0.
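A short numpy check of the stock example using the simplified formula Cov(A,B) = E(A·B) - mean(A)·mean(B):
```python
# Sketch: verify the worked covariance example.
import numpy as np

A = np.array([2, 3, 5, 4, 6], dtype=float)
B = np.array([5, 8, 10, 11, 14], dtype=float)

cov = (A * B).mean() - A.mean() * B.mean()   # 42.4 - 4 * 9.6 = 4.0
print(cov, np.cov(A, B, bias=True)[0, 1])    # bias=True divides by n, matching the slide
```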
Normalization
Min-max normalization: to [new_minA, new_maxA]
v' = ((v - minA) / (maxA - minA)) × (new_maxA - new_minA) + new_minA
Ex. Let income range from $12,000 to $98,000 be normalized to [0.0, 1.0]. Then $73,600 is mapped to
((73,600 - 12,000) / (98,000 - 12,000)) × (1.0 - 0) + 0 = 0.716
Normalization
Z-score normalization (μ: mean, σ: standard deviation):
v' = (v - μA) / σA
Ex. Let μ = 54,000, σ = 16,000. Then $73,600 is mapped to (73,600 - 54,000) / 16,000 = 1.225
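A small numpy sketch applying both normalizations to the $73,600 value from the examples above (the other income values in the array are illustrative additions):
```python
# Sketch: min-max and z-score normalization of income values.
import numpy as np

income = np.array([12000, 54000, 73600, 98000], dtype=float)

# Min-max to [0, 1]
min_max = (income - income.min()) / (income.max() - income.min())

# Z-score with the slide's mean and standard deviation
mu, sigma = 54000.0, 16000.0
z = (income - mu) / sigma

print(min_max.round(3))   # 73,600 -> 0.716
print(z.round(3))         # 73,600 -> 1.225
```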
Z-score Normalization
Simply put, a z-score (also called a standard score) gives
you an idea of how far from the mean a data point is.
However, technically it’s a measure of how many standard
deviations below or above the population mean a raw score
is.
A z-score can be placed on a normal distribution curve. Z-
scores range from -3 standard deviations (which would fall
to the far left of the normal distribution curve) up to +3
standard deviations (which would fall to the far right of the
normal distribution curve).
In order to use a z-score, you need to know the mean μ and the population standard deviation σ.
Z-score Normalization
Z-scores are a way to compare results to a “normal”
population.
Results from tests or surveys have thousands of possible
results and units; those results can often seem meaningless.
For example, knowing that someone’s weight is 150 pounds
might be good information, but if you want to compare it to
the “average” person’s weight, looking at a vast table of
data can be overwhelming (especially if some weights are
recorded in kilograms).
A z-score can tell you where that person’s weight is
compared to the average population’s mean weight.
Self Study
Following topics are not included in syllabus for
exams/quizzes.
If you are interested in any of these topics, you can come to
my office to learn/discuss.
Segmentation by Natural
Partitioning
A simple 3-4-5 rule can be used to segment numeric data into relatively uniform, "natural" intervals.
If an interval covers 3, 6, 7 or 9 distinct values at the
most significant digit, partition the range into 3
(relatively) equi-width intervals
If it covers 2, 4, or 8 distinct values at the most
significant digit, partition the range into 4 intervals
If it covers 1, 5, or 10 distinct values at the most
significant digit, partition the range into 5 intervals
Example of 3-4-5 Rule
Step 1: From the profit data: Min = -$351, Low (5th percentile) = -$159, High (95th percentile) = $1,838, Max = $4,700.
Step 2: msd = 1,000; rounding Low down and High up at the msd gives Low' = -$1,000 and High' = $2,000, i.e. the range (-$1,000 - $2,000).
Step 3: The range covers 3 distinct values at the msd, so partition it into 3 equi-width intervals: (-$1,000 - 0), (0 - $1,000), ($1,000 - $2,000).
Step 4: Adjust the boundaries to cover Min and Max: the first interval shrinks to (-$400 - 0) and the interval ($2,000 - $5,000) is added, giving the overall range (-$400 - $5,000).
Step 5: Recursively apply the 3-4-5 rule to each interval:
(-$400 - 0): (-$400 - -$300), (-$300 - -$200), (-$200 - -$100), (-$100 - 0)
(0 - $1,000): (0 - $200), ($200 - $400), ($400 - $600), ($600 - $800), ($800 - $1,000)
($1,000 - $2,000): ($1,000 - $1,200), ($1,200 - $1,400), ($1,400 - $1,600), ($1,600 - $1,800), ($1,800 - $2,000)
($2,000 - $5,000): ($2,000 - $3,000), ($3,000 - $4,000), ($4,000 - $5,000)
Concept Hierarchy Generation
Concept hierarchy organizes concepts (i.e., attribute values)
hierarchically
Concept hierarchy formation: Recursively reduce the data by
collecting and replacing low level concepts (such as numeric
values for age) by higher level concepts (such as youth,
adult, or senior)
Concept hierarchies can be explicitly specified by domain
experts and/or data warehouse designers
Concept hierarchy can be automatically formed for both
numeric and nominal data. For numeric data, use 3-4-5 rule.
Concept Hierarchy Generation
for Nominal Data
Specification of a partial/total ordering of attributes explicitly
at the schema level by users or experts
street < city < state < country
Specification of only a partial set of attributes
E.g., only street < city, not others
Automatic Concept Hierarchy Generation
Some hierarchies can be automatically generated based on the
analysis of the number of distinct values per attribute in the data set
The attribute with the most distinct values is placed at the lowest
level of the hierarchy
Exceptions, e.g., weekday, month, quarter, year
country 15 distinct values
province_or_ state 365 distinct values
city 3567 distinct values
street 674,339 distinct values
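A minimal pandas sketch of this heuristic: count distinct values per attribute and order them (the tiny data frame is synthetic; real counts would come from the warehouse):
```python
# Sketch: order location attributes by number of distinct values to suggest a hierarchy.
import pandas as pd

df = pd.DataFrame({
    "country":           ["PK", "PK", "PK", "US"],
    "province_or_state": ["Punjab", "Sindh", "Punjab", "Texas"],
    "city":              ["Lahore", "Karachi", "Multan", "Austin"],
    "street":            ["Mall Rd", "Clifton Rd", "Bosan Rd", "6th St"],
})

order = df.nunique().sort_values()   # fewest distinct values -> top of the hierarchy
print(order)                         # country < province_or_state < city < street
```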