
Statistical Transform

Data Cleaning
Data Cleansing
• Data Cleansing is the process of analyzing data to find incorrect,
corrupt, and missing values and fixing them so the data is suitable as
input to data analytics and various machine learning algorithms.
• It is the first and fundamental step performed before any analysis
can be done on the data.
• There are no set rules to be followed for data cleansing.
• It depends entirely on the quality of the dataset and the level of
accuracy to be achieved.
Reasons for data corruption:

• Data is collected from various structured and unstructured sources
and then combined, leading to duplicated and mislabelled values.
• Different data dictionary definitions for data stored at various
locations.
• Manual entry errors/typos.
• Incorrect capitalization.
• Mislabelled categories/classes.
Data Quality

• Data quality is of utmost importance for the analysis. There are
several quality criteria that need to be checked:
Data Quality Attributes

• Completeness: It is defined as the percentage of entries that are filled in
the dataset. The percentage of missing values in the dataset is a good
indicator of the quality of the dataset.
• Accuracy: It is defined as the extent to which the entries in the dataset are
close to their actual values.
• Uniformity: It is defined as the extent to which data is specified using the
same unit of measure.
• Consistency: It is defined as the extent to which the data is consistent
within the same dataset and across multiple datasets.
• Validity: It is defined as the extent to which data conforms to the
constraints applied by the business rules.
Validity is typically enforced through constraints such as data-type, range,
uniqueness, and mandatory (not-null) constraints.
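• As a quick illustration of the completeness attribute above, the sketch below computes the
percentage of missing entries per column; the toy DataFrame and its column names are hypothetical.

import pandas as pd

# Hypothetical toy dataset used only to illustrate the completeness check
df = pd.DataFrame({
    "price": [250000, None, 310000, 450000],
    "suburb": ["Richmond", "Carlton", None, "Fitzroy"],
})

# Percentage of missing entries per column: a simple completeness indicator
missing_percent = df.isnull().mean() * 100
print(missing_percent)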
Data Profiling Report

• Data Profiling is the process of exploring our data and finding insights
from it. Pandas profiling report is the quickest way to extract
complete information about the dataset. The first step for data
cleansing is to perform exploratory data analysis.
How to use pandas profiling:
• Step 1: The first step is to install the pandas profiling package using the pip command:

• pip install pandas-profiling

• Step 2: Load the dataset using pandas:

• import pandas as pd
• df = pd.read_csv(r"C:UsersDellDesktopDatasethousing.csv")

• Step 3: Read the first five rows:

• df.head()
• Step 4: Generate the profiling report using the following commands:

• from pandas_profiling import ProfileReport


• prof = ProfileReport(df)
• prof.to_file(output_file='output.html')
Profiling Report:
• The profiling report consists of five parts: overview, variables, interactions, correlation,
and missing values.
• 1. Overview gives the general statistics about the number of variables, number of
observations, missing values, duplicates, and number of categorical and numeric
variables.
• 2. Variables gives detailed information about the distinct values,
missing values, mean, median, etc.; the report shows these statistics for both
categorical and numerical variables.
• 3. Correlation is defined as the degree to which two variables are related
to each other. The profiling report describes the correlation of the different
variables with each other in the form of a heatmap.
• 4. Interactions: This part of the report shows the interactions of the variables
with each other. You can select any variable on the respective axes.
• 5. Missing values: It depicts the number of missing values in each column.
Data Cleansing Techniques
• Handling missing values:
• Handling missing values is the most important step of data cleansing.
• The first question you should ask yourself is: why is the data missing?
• Is it missing just because it was not recorded by the data entry operator, or was it
intentionally left empty?
• You can also go through the documentation to find the reason.
There are different ways to handle these missing values:

• 1. Drop missing values: The easiest way to handle them is to simply drop all the rows that
contain missing values.
• If you don’t want to figure out why the values are missing and just have a small percentage
of missing values you can just drop them using the following command:

• df.dropna()

• This is not always advisable, though, because every record is important and holds
significance for the overall results.
• Often the percentage of missing entries in a particular column is high, so dropping
those rows is not a good option. A short sketch of the dropping approach follows.
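• A minimal sketch of dropping missing values, assuming a pandas DataFrame df; the subset
argument shown here is a standard pandas option, not something prescribed by the text.

import pandas as pd

# Hypothetical toy data for illustration
df = pd.DataFrame({
    "price": [250000, None, 310000],
    "rooms": [3, 2, None],
})

# Drop every row that contains at least one missing value
cleaned = df.dropna()

# Or drop rows only when a specific column is missing
cleaned_subset = df.dropna(subset=["price"])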
• 2. Imputation:
• Imputation is the process of replacing the null/missing values with some
value.
• For numeric columns, one option is to replace each missing entry in the
column with the mean value or median value.
• Another option could be generating random numbers between a range of
values suitable for the column.
• The range could be between the mean and standard deviation of the column.
• You can simply import an imputer from the scikit-learn package and perform
imputation as follows:

• from sklearn.impute import SimpleImputer

• # Imputation: by default SimpleImputer replaces each missing value with the column mean
• # (this assumes df contains only numeric columns)
• my_imputer = SimpleImputer()
• imputed_df = pd.DataFrame(my_imputer.fit_transform(df))
• imputed_df.columns = df.columns  # fit_transform returns a plain array, so restore the column names
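• If the median is preferred over the mean, the same imputer can be configured accordingly.
A minimal sketch, assuming df from the earlier steps and restricting imputation to its numeric columns:

from sklearn.impute import SimpleImputer

# Impute only the numeric columns, replacing missing values with each column's median
numeric_cols = df.select_dtypes(include="number").columns
median_imputer = SimpleImputer(strategy="median")
df[numeric_cols] = median_imputer.fit_transform(df[numeric_cols])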
• Handling Duplicates:
• Duplicate rows occur usually when the data is combined from multiple
sources.
• It gets replicated sometimes. A common problem is when users have
the same identity number or the form has been submitted twice.
• The solution to these duplicate tuples is to simply remove them.
• You can use the unique() function to find out the unique values present
in a column and then decide which values need to be discarded; a short sketch follows.
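• A minimal sketch of duplicate handling in pandas. The toy DataFrame and its 'id' column are
hypothetical, and drop_duplicates() is the standard pandas call for removing repeated rows
(the text itself only mentions unique()):

import pandas as pd

# Hypothetical records where one form was submitted twice
df = pd.DataFrame({
    "id": [101, 102, 102, 103],
    "name": ["asha", "ravi", "ravi", "meera"],
})

# Inspect the distinct identity numbers before deciding what to discard
print(df["id"].unique())

# Remove exact duplicate rows, keeping the first occurrence
df = df.drop_duplicates()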
Scaling and Normalization
• Scaling refers to transforming the range of data and shifting it to some other
value range. This is beneficial when we want to compare different attributes
on the same footing. One useful example could be currency conversion.

• For example, we will create 100 random points from an exponential distribution
and then plot them. Finally, we will convert them to a scaled version using the
Python mlxtend package, as shown in the sketch after the imports below.

• # for min-max scaling
• from mlxtend.preprocessing import minmax_scaling

• # plotting packages
• import seaborn as sns
• import matplotlib.pyplot as plt
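• A minimal end-to-end sketch of that example (100 exponential points, min-max scaled with
mlxtend); the column name 'original' and the random seed are arbitrary choices made here
for illustration.

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from mlxtend.preprocessing import minmax_scaling

# 100 random points drawn from an exponential distribution
np.random.seed(0)
original_data = pd.DataFrame({"original": np.random.exponential(size=100)})

# Min-max scaling shifts the values into the range [0, 1]
scaled_data = minmax_scaling(original_data, columns=["original"])

# Plot both versions side by side to compare the ranges
fig, ax = plt.subplots(1, 2, figsize=(10, 4))
sns.histplot(original_data["original"], ax=ax[0]).set_title("Original data")
sns.histplot(scaled_data["original"], ax=ax[1]).set_title("Min-max scaled data")
plt.show()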
Handling Dates
• The date field is an important attribute that needs to be handled during the cleansing of
data. There are multiple formats in which dates can be entered into the dataset.
• Therefore, standardizing the date column is a critical task. Some people may have treated
the date as a string column, some as a DateTime column.
• When the dataset gets combined from different sources then this might create a problem
for analysis.

• The solution is to first find the type of date column using the following command.

• df['Date'].dtype

• If the type of the column is other than DateTime, convert it to DateTime using the
following command:

• import datetime df['Date_parsed'] = pd.to_datetime(df['Date'], format="%m/%d/%y")


Handling inconsistent data entry issues
• There are a large number of inconsistent entries that cannot be found manually or
through direct computations.
• For example, if the same entry is written in upper case or lower case or a mixture of
upper case and lower case. Then such an entry should be standardized throughout the
column.

• One solution is to convert all the entries of a column to lowercase and trim the extra
space from each entry. This can later be reverted after the analysis is complete.

• # convert to lower case
• df['Regionname'] = df['Regionname'].str.lower()
• # remove leading and trailing white spaces
• df['Regionname'] = df['Regionname'].str.strip()
• Another solution is to use fuzzy matching to find which strings in the column are closest
to each other and then replace every entry whose similarity score is above a chosen threshold
with the main entry.

• Firstly we will find out the unique region names:

• region = df['Regionname'].unique()

• Then we calculate the scores using fuzzy matching:

• from fuzzywuzzy import fuzz, process

• matches = process.extract("western victoria", region, limit=10, scorer=fuzz.token_sort_ratio)
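• To actually apply the replacement, a helper along the following lines can rewrite every close
match with the canonical string. A minimal sketch, assuming df from the earlier steps; the
function name replace_matches_in_column and the ratio threshold of 90 are illustrative choices,
not part of the original text.

from fuzzywuzzy import fuzz, process

def replace_matches_in_column(df, column, string_to_match, min_ratio=90):
    """Replace values in `column` that closely match `string_to_match`."""
    strings = df[column].unique()
    # Score every unique value against the target string
    matches = process.extract(string_to_match, strings,
                              limit=10, scorer=fuzz.token_sort_ratio)
    # Keep only the matches above the chosen similarity threshold
    close_matches = [m[0] for m in matches if m[1] >= min_ratio]
    # Overwrite the close matches with the canonical spelling
    df.loc[df[column].isin(close_matches), column] = string_to_match

# Usage: standardize every variant spelling of "western victoria"
replace_matches_in_column(df, "Regionname", "western victoria")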
Support Vector Machines
• Support vector machines (SVMs) are a set of supervised learning methods used for
classification, regression and outliers detection.

• In machine learning, support vector machines (SVMs, also support vector networks)
are supervised learning models with associated learning algorithms that analyze
data used for classification and regression analysis.

• Support Vector Machine (SVM) is a supervised, linear machine learning
algorithm most commonly used for solving classification problems; in that setting it is also
referred to as Support Vector Classification.
• There is also a variant of SVM called SVR, which stands for Support Vector
Regression and uses the same principles to solve regression problems.

• SVM also supports the kernel method, also called kernel SVM, which allows us to
tackle non-linearity.
What is Support Vector Machine?
• An SVM model is a
representation of the examples
as points in space, mapped so
that the examples of the
separate categories are divided
by a clear gap that is as wide as
possible.

• In addition to performing linear classification, SVMs can efficiently
perform a non-linear classification, implicitly mapping their inputs into
high-dimensional feature spaces.
Now how would a machine using SVM classify a new fruit as either apple or
orange, based only on the size and weight of some 20 apples and oranges that
were observed and labelled?
• The objective of SVM is to draw a line that best separates the two classes of data points.
• SVM generates a line that can cleanly separate the two classes.
• There are many possible ways of drawing a line that separates the two classes, however, in SVM, it
is determined by the margins and the support vectors.
• The margin is the area separating the two dotted green lines, as shown in the image above. The
larger the margin, the better the classes are separated.
• The support vectors are the data points through which each of the green lines passes.
These points are called support vectors because they determine the margins and hence the classifier
itself.
• These support vectors are simply the data points lying closest to the border of either of the classes,
with a reasonable chance of belonging to either one.
• The SVM then generates the hyperplane with the maximum margin, in this case the black bold
line that separates the two classes at an optimum distance from both classes.
• In the case of more than 2 features and multiple dimensions, the line is replaced by a hyperplane that
separates multidimensional spaces. A small code sketch of the fruit example follows this list.
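• A minimal sketch of the fruit example using scikit-learn's SVC with a linear kernel; the size
and weight numbers below are made up purely for illustration.

import numpy as np
from sklearn.svm import SVC

# Hypothetical labelled fruit: [size in cm, weight in g]
X = np.array([
    [7.0, 150], [7.5, 170], [8.0, 180], [8.5, 200],   # apples
    [6.0, 120], [6.2, 130], [6.5, 140], [6.8, 145],   # oranges
])
y = ["apple", "apple", "apple", "apple",
     "orange", "orange", "orange", "orange"]

# Fit a linear SVM: it finds the maximum-margin hyperplane between the classes
clf = SVC(kernel="linear")
clf.fit(X, y)

# Classify a new fruit from its size and weight
print(clf.predict([[7.2, 160]]))   # predicted label
print(clf.support_vectors_)        # the support vectors found by the model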
• The advantages of support vector machines are:

• Effective in high-dimensional spaces.
• Still effective in cases where the number of dimensions is greater than the
number of samples.
• Uses a subset of training points in the decision function (called support
vectors), so it is also memory efficient.
• Versatile: different kernel functions can be specified for the decision
function. Common kernels are provided, but it is also possible to specify
custom kernels.
• The disadvantages of support vector machines include:

• If the number of features is much greater than the number of samples,
avoiding over-fitting in the choice of kernel functions and regularization term is
crucial.
• SVMs do not directly provide probability estimates; these are calculated
using an expensive five-fold cross-validation (see the sketch below).
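• As a short illustration of the probability point, scikit-learn's SVC can be asked to calibrate
probabilities at extra cost via internal cross-validation. A sketch reusing the hypothetical
X and y from the fruit example above:

from sklearn.svm import SVC

# probability=True triggers an internal cross-validation to calibrate probabilities,
# which is noticeably more expensive than plain classification
prob_clf = SVC(kernel="linear", probability=True)
prob_clf.fit(X, y)
print(prob_clf.predict_proba([[7.2, 160]]))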
Logistic Regression in Python With scikit-learn

• Scikit-learn can be used to perform various functions:

• Preprocess data
• Reduce the dimensionality of problems
• Validate models
• Select the most appropriate model
• Solve regression and classification problems
• Implement cluster analysis
Logistic Regression in Python With scikit-learn

• How to prepare your classification models (a short sketch of these steps follows the list):

• Import packages, functions, and classes
• Get data to work with and, if appropriate, transform it
• Create a classification model and train (or fit) it with your existing data
• Evaluate your model to see if its performance is satisfactory
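• A minimal sketch of those four steps with scikit-learn's LogisticRegression; the synthetic
data from make_classification is used only so the example runs on its own.

# 1. Import packages, functions, and classes
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# 2. Get data to work with (synthetic here) and split it
X, y = make_classification(n_samples=200, n_features=4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

# 3. Create a classification model and fit it to the training data
model = LogisticRegression()
model.fit(X_train, y_train)

# 4. Evaluate the model on held-out data
print(accuracy_score(y_test, model.predict(X_test)))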
