
Statistical Transform

Data Cleaning
Data Cleansing
• Data Cleansing is the process of analyzing data to find incorrect,
corrupt, and missing values and fixing them so the data is suitable as
input to data analytics and various machine learning algorithms.
• It is the first and fundamental step performed before any analysis
can be done on the data.
• There are no set rules to be followed for data cleansing.
• It depends entirely on the quality of the dataset and the level of
accuracy to be achieved.
Reasons for data corruption:

• Data is collected from various structured and unstructured sources
and then combined, leading to duplicated and mislabelled values.
• Different data dictionary definitions for data stored at various
locations.
• Manual entry errors/typos.
• Incorrect capitalization.
• Mislabelled categories/classes.
Data Quality

• Data quality is of utmost importance for the analysis. There are
several quality criteria that need to be checked:
Data Quality Attributes

• Completeness: It is defined as the percentage of entries that are filled in
the dataset. The percentage of missing values in the dataset is a good
indicator of the quality of the dataset.
• Accuracy: It is defined as the extent to which the entries in the dataset are
close to their actual values.
• Uniformity: It is defined as the extent to which data is specified using the
same unit of measure.
• Consistency: It is defined as the extent to which the data is consistent
within the same dataset and across multiple datasets.
• Validity: It is defined as the extent to which data conforms to the
constraints applied by the business rules.
Validity is typically enforced through constraints such as data-type, range,
uniqueness, and mandatory (not-null) constraints.
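• As a quick illustration of the completeness attribute above, the sketch below computes the
percentage of missing entries per column; the toy DataFrame and its column names are hypothetical.

import pandas as pd

# Hypothetical toy dataset used only to illustrate the completeness check
df = pd.DataFrame({
    "price": [250000, None, 310000, 450000],
    "suburb": ["Richmond", "Carlton", None, "Fitzroy"],
})

# Percentage of missing entries per column: a simple completeness indicator
missing_percent = df.isnull().mean() * 100
print(missing_percent)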
Data Profiling Report

• Data Profiling is the process of exploring our data and finding insights
from it. Pandas profiling report is the quickest way to extract
complete information about the dataset. The first step for data
cleansing is to perform exploratory data analysis.
How to use pandas profiling:
• Step 1: The first step is to install the pandas profiling package using the pip command:

• pip install pandas-profiling

• Step 2: Load the dataset using pandas:

• import pandas as pd
• df = pd.read_csv(r"C:UsersDellDesktopDatasethousing.csv")

• Step 3: Read the first five rows:

• df.head()
• Step 4: Generate the profiling report using the following commands:

• from pandas_profiling import ProfileReport


• prof = ProfileReport(df)
• prof.to_file(output_file='output.html')
Profiling Report:
• The profiling report consists of five parts: overview, variables, interactions, correlation,
and missing values.
• 1. Overview gives the general statistics about the number of variables, number of
observations, missing values, duplicates, and number of categorical and numeric
variables.
• 2. Variables gives detailed information about the distinct values,
missing values, mean, median, etc.; the report shows these statistics for both
categorical and numerical variables.
• 3. Correlation is defined as the degree to which two variables are related
to each other. The profiling report describes the correlation of the different
variables with each other in the form of a heatmap.
• 4. Interactions: This part of the report shows the interactions of the variables
with each other. You can select any variable on the respective axes.
• 5. Missing values: It depicts the number of missing values in each column.
Data Cleansing Techniques
• Handling missing values:
• Handling missing values is the most important step of data cleansing.
• The first question you should ask yourself is: why is the data missing?
• Is it missing just because it was not recorded by the data entry operator, or was it
intentionally left empty?
• You can also go through the documentation to find the reason.
There are different ways to handle these missing values:

• 1. Drop missing values: The easiest way to handle them is to simply drop all the rows that
contain missing values.
• If you don’t want to figure out why the values are missing and just have a small percentage
of missing values you can just drop them using the following command:

• df.dropna()

• This is not always advisable, though, because every record is important and holds
significance for the overall results.
• Often the percentage of missing entries in a particular column is high, so dropping
those rows is not a good option. A short sketch of the dropping approach follows.
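• A minimal sketch of dropping missing values, assuming a pandas DataFrame df; the subset
argument shown here is a standard pandas option, not something prescribed by the text.

import pandas as pd

# Hypothetical toy data for illustration
df = pd.DataFrame({
    "price": [250000, None, 310000],
    "rooms": [3, 2, None],
})

# Drop every row that contains at least one missing value
cleaned = df.dropna()

# Or drop rows only when a specific column is missing
cleaned_subset = df.dropna(subset=["price"])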
• 2. Imputation:
• Imputation is the process of replacing the null/missing values with some
value.
• For numeric columns, one option is to replace each missing entry in the
column with the mean value or median value.
• Another option could be generating random numbers between a range of
values suitable for the column.
• The range could be between the mean and standard deviation of the column.
• You can simply import an imputer from the scikit-learn package and perform
imputation as follows:

• from sklearn.impute import SimpleImputer

• # Imputation: by default SimpleImputer replaces each missing value with the column mean
• # (this assumes df contains only numeric columns)
• my_imputer = SimpleImputer()
• imputed_df = pd.DataFrame(my_imputer.fit_transform(df))
• imputed_df.columns = df.columns  # fit_transform returns a plain array, so restore the column names
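• If the median is preferred over the mean, the same imputer can be configured accordingly.
A minimal sketch, assuming df from the earlier steps and restricting imputation to its numeric columns:

from sklearn.impute import SimpleImputer

# Impute only the numeric columns, replacing missing values with each column's median
numeric_cols = df.select_dtypes(include="number").columns
median_imputer = SimpleImputer(strategy="median")
df[numeric_cols] = median_imputer.fit_transform(df[numeric_cols])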
• Handling Duplicates:
• Duplicate rows occur usually when the data is combined from multiple
sources.
• It gets replicated sometimes. A common problem is when users have
the same identity number or the form has been submitted twice.
• The solution to these duplicate tuples is to simply remove them.
• You can use the unique() function to find out the unique values present
in a column and then decide which values need to be discarded; a short sketch follows.
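• A minimal sketch of duplicate handling in pandas. The toy DataFrame and its 'id' column are
hypothetical, and drop_duplicates() is the standard pandas call for removing repeated rows
(the text itself only mentions unique()):

import pandas as pd

# Hypothetical records where one form was submitted twice
df = pd.DataFrame({
    "id": [101, 102, 102, 103],
    "name": ["asha", "ravi", "ravi", "meera"],
})

# Inspect the distinct identity numbers before deciding what to discard
print(df["id"].unique())

# Remove exact duplicate rows, keeping the first occurrence
df = df.drop_duplicates()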
Scaling and Normalization
• Scaling refers to transforming the range of data and shifting it to some other
value range. This is beneficial when we want to compare different attributes
on the same footing. One useful example could be currency conversion.

• For example, we will create 100 random points from an exponential distribution
and then plot them. Finally, we will convert them to a scaled version using the
Python mlxtend package, as shown in the sketch after the imports below.

• # for min-max scaling
• from mlxtend.preprocessing import minmax_scaling

• # plotting packages
• import seaborn as sns
• import matplotlib.pyplot as plt
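• A minimal end-to-end sketch of that example (100 exponential points, min-max scaled with
mlxtend); the column name 'original' and the random seed are arbitrary choices made here
for illustration.

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from mlxtend.preprocessing import minmax_scaling

# 100 random points drawn from an exponential distribution
np.random.seed(0)
original_data = pd.DataFrame({"original": np.random.exponential(size=100)})

# Min-max scaling shifts the values into the range [0, 1]
scaled_data = minmax_scaling(original_data, columns=["original"])

# Plot both versions side by side to compare the ranges
fig, ax = plt.subplots(1, 2, figsize=(10, 4))
sns.histplot(original_data["original"], ax=ax[0]).set_title("Original data")
sns.histplot(scaled_data["original"], ax=ax[1]).set_title("Min-max scaled data")
plt.show()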
Handling Dates
• The date field is an important attribute that needs to be handled during the cleansing of
data. There are multiple formats in which dates can be entered into the dataset.
• Therefore, standardizing the date column is a critical task. Some people may have treated
the date as a string column, some as a DateTime column.
• When the dataset gets combined from different sources then this might create a problem
for analysis.

• The solution is to first find the type of date column using the following command.

• df['Date'].dtype

• If the type of the column is other than DateTime, convert it to DateTime using the
following command:

• import datetime df['Date_parsed'] = pd.to_datetime(df['Date'], format="%m/%d/%y")


Handling inconsistent data entry issues
• There are a large number of inconsistent entries that cannot be found manually or
through direct computations.
• For example, if the same entry is written in upper case or lower case or a mixture of
upper case and lower case. Then such an entry should be standardized throughout the
column.

• One solution is to convert all the entries of a column to lowercase and trim the extra
space from each entry. This can later be reverted after the analysis is complete.

• # convert to lower case
• df['Regionname'] = df['Regionname'].str.lower()
• # remove leading and trailing white spaces
• df['Regionname'] = df['Regionname'].str.strip()
• Another solution is to use fuzzy matching to find which strings in the column are closest
to each other and then replace every entry whose similarity score is above a chosen threshold
with the main entry.

• Firstly we will find out the unique region names:

• region = df['Regionname'].unique()

• Then we calculate the scores using fuzzy matching:

• from fuzzywuzzy import fuzz, process

• matches = process.extract("western victoria", region, limit=10, scorer=fuzz.token_sort_ratio)
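• To actually apply the replacement, a helper along the following lines can rewrite every close
match with the canonical string. A minimal sketch, assuming df from the earlier steps; the
function name replace_matches_in_column and the ratio threshold of 90 are illustrative choices,
not part of the original text.

from fuzzywuzzy import fuzz, process

def replace_matches_in_column(df, column, string_to_match, min_ratio=90):
    """Replace values in `column` that closely match `string_to_match`."""
    strings = df[column].unique()
    # Score every unique value against the target string
    matches = process.extract(string_to_match, strings,
                              limit=10, scorer=fuzz.token_sort_ratio)
    # Keep only the matches above the chosen similarity threshold
    close_matches = [m[0] for m in matches if m[1] >= min_ratio]
    # Overwrite the close matches with the canonical spelling
    df.loc[df[column].isin(close_matches), column] = string_to_match

# Usage: standardize every variant spelling of "western victoria"
replace_matches_in_column(df, "Regionname", "western victoria")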
Support Vector Machines
• Support vector machines (SVMs) are a set of supervised learning methods used for
classification, regression and outliers detection.

• In machine learning, support vector machines (SVMs, also support vector networks)
are supervised learning models with associated learning algorithms that analyze
data used for classification and regression analysis.

• Support Vector Machine (SVM) is a supervised, linear machine learning
algorithm most commonly used for solving classification problems; in that setting it is also
referred to as Support Vector Classification.
• There is also a variant of SVM called SVR, which stands for Support Vector
Regression and uses the same principles to solve regression problems.

• SVM also supports the kernel method, also called kernel SVM, which allows us to
tackle non-linearity.
What is Support Vector Machine?
• An SVM model is a
representation of the examples
as points in space, mapped so
that the examples of the
separate categories are divided
by a clear gap that is as wide as
possible.

• In addition to performing linear classification, SVMs can efficiently
perform a non-linear classification, implicitly mapping their inputs into
high-dimensional feature spaces.
Now how would a machine using SVM classify a new fruit as either apple or
orange, based only on the size and weight of some 20 apples and oranges that
were observed and labelled?
• The objective of SVM is to draw a line that best separates the two classes of data points.
• SVM generates a line that can cleanly separate the two classes.
• There are many possible ways of drawing a line that separates the two classes, however, in SVM, it
is determined by the margins and the support vectors.
• The margin is the area separating the two dotted green lines, as shown in the image above. The
larger the margin, the better the classes are separated.
• The support vectors are the data points through which each of the green lines passes.
These points are called support vectors because they determine the margins and hence the classifier
itself.
• These support vectors are simply the data points lying closest to the border of either of the classes,
with a reasonable chance of belonging to either one.
• The SVM then generates the hyperplane with the maximum margin, in this case the black bold
line that separates the two classes at an optimum distance from both classes.
• In the case of more than 2 features and multiple dimensions, the line is replaced by a hyperplane that
separates multidimensional spaces. A small code sketch of the fruit example follows this list.
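• A minimal sketch of the fruit example using scikit-learn's SVC with a linear kernel; the size
and weight numbers below are made up purely for illustration.

import numpy as np
from sklearn.svm import SVC

# Hypothetical labelled fruit: [size in cm, weight in g]
X = np.array([
    [7.0, 150], [7.5, 170], [8.0, 180], [8.5, 200],   # apples
    [6.0, 120], [6.2, 130], [6.5, 140], [6.8, 145],   # oranges
])
y = ["apple", "apple", "apple", "apple",
     "orange", "orange", "orange", "orange"]

# Fit a linear SVM: it finds the maximum-margin hyperplane between the classes
clf = SVC(kernel="linear")
clf.fit(X, y)

# Classify a new fruit from its size and weight
print(clf.predict([[7.2, 160]]))   # predicted label
print(clf.support_vectors_)        # the support vectors found by the model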
• The advantages of support vector machines are:

• Effective in high-dimensional spaces.
• Still effective in cases where the number of dimensions is greater than the
number of samples.
• Uses a subset of training points in the decision function (called support
vectors), so it is also memory efficient.
• Versatile: different kernel functions can be specified for the decision
function. Common kernels are provided, but it is also possible to specify
custom kernels.
• The disadvantages of support vector machines include:

• If the number of features is much greater than the number of samples,
avoiding over-fitting in the choice of kernel functions and regularization term is
crucial.
• SVMs do not directly provide probability estimates; these are calculated
using an expensive five-fold cross-validation (see the sketch below).
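• As a short illustration of the probability point, scikit-learn's SVC can be asked to calibrate
probabilities at extra cost via internal cross-validation. A sketch reusing the hypothetical
X and y from the fruit example above:

from sklearn.svm import SVC

# probability=True triggers an internal cross-validation to calibrate probabilities,
# which is noticeably more expensive than plain classification
prob_clf = SVC(kernel="linear", probability=True)
prob_clf.fit(X, y)
print(prob_clf.predict_proba([[7.2, 160]]))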
Logistic Regression in Python With scikit-learn

• Scikit-learn can be used to perform various functions:

• Preprocess data
• Reduce the dimensionality of problems
• Validate models
• Select the most appropriate model
• Solve regression and classification problems
• Implement cluster analysis
Logistic Regression in Python With scikit-learn

• How to prepare your classification models (a short sketch of these steps follows the list):

• Import packages, functions, and classes
• Get data to work with and, if appropriate, transform it
• Create a classification model and train (or fit) it with your existing data
• Evaluate your model to see if its performance is satisfactory
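• A minimal sketch of those four steps with scikit-learn's LogisticRegression; the synthetic
data from make_classification is used only so the example runs on its own.

# 1. Import packages, functions, and classes
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# 2. Get data to work with (synthetic here) and split it
X, y = make_classification(n_samples=200, n_features=4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

# 3. Create a classification model and fit it to the training data
model = LogisticRegression()
model.fit(X_train, y_train)

# 4. Evaluate the model on held-out data
print(accuracy_score(y_test, model.predict(X_test)))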
