Statistical Transform Data Cleaning
Data Cleansing
• Data cleansing is the process of analyzing data to find incorrect,
corrupt, and missing values and correcting them, so that the data is
suitable as input to data analytics and various machine learning algorithms.
• It is the first and most fundamental step performed before any analysis
can be done on the data.
• There are no set rules to be followed for data cleansing.
• It depends entirely upon the quality of the dataset and the level of
accuracy to be achieved.
Reasons for data corruption:
• Data profiling is the process of exploring our data and finding insights
from it. A pandas-profiling report is the quickest way to extract
complete information about the dataset. The first step in data
cleansing is to perform exploratory data analysis.
How to use pandas profiling:
• Step 1: Install the pandas-profiling package using the pip command:
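• A likely form of the command, assuming the package is published on PyPI as pandas-profiling:
• pip install pandas-profiling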
• Step 2: Import pandas and load the dataset:
• import pandas as pd
• df = pd.read_csv(r"C:\Users\Dell\Desktop\Dataset\housing.csv")
• Step 3: Inspect the first few rows:
• df.head()
• Step 4: Generate the profiling report using the following commands:
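• A minimal sketch of these commands, assuming the pandas_profiling package and an HTML output file name of our choosing:
• from pandas_profiling import ProfileReport
• profile = ProfileReport(df, title="Housing Data Profiling Report")
• profile.to_file("housing_report.html")  # writes an interactive HTML report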
• 1. Drop missing values: The easiest way to handle them is to simply drop all the rows that
contain missing values.
• If you don’t want to figure out why the values are missing and only a small
percentage of values are missing, you can simply drop them using the following command:
• df.dropna()
• This is not always advisable, though, because every record is important and can
hold great significance for the overall results.
• When the percentage of missing entries in a particular column is high, dropping
those rows is not a good option.
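• A short sketch of this approach; note that dropna() returns a new DataFrame, so the result must be assigned back:
• print(df.isnull().sum())  # count missing values per column first
• df = df.dropna()  # drop every row containing at least one missing value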
• 2. Imputation:
• Imputation is the process of replacing the null/missing values with some
value.
• For numeric columns, one option is to replace each missing entry in the
column with the mean value or median value.
• Another option is to generate random numbers within a range of values
suitable for the column.
• For example, the range could run from one standard deviation below the mean
to one standard deviation above it.
• You can simply import an imputer from the scikit-learn package and perform
imputation as follows:
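• A minimal sketch using scikit-learn's SimpleImputer; the column name Price is an assumption for this housing dataset:
• import numpy as np
• from sklearn.impute import SimpleImputer
• # replace each missing entry with the mean of the column
• imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
• df[["Price"]] = imputer.fit_transform(df[["Price"]])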
• For example, we will create 100 random points from an exponential distribution
and then plot them. Finally, we will convert them to a scaled version using the
Python mlxtend package.
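• A sketch of this example, assuming mlxtend's minmax_scaling helper and matplotlib for the plots:
• import numpy as np
• import pandas as pd
• import matplotlib.pyplot as plt
• from mlxtend.preprocessing import minmax_scaling
• # 100 random points drawn from an exponential distribution
• original = pd.DataFrame(np.random.exponential(size=100), columns=["value"])
• scaled = minmax_scaling(original, columns=["value"])  # scale into [0, 1]
• fig, axes = plt.subplots(1, 2, figsize=(10, 4))
• axes[0].hist(original["value"]); axes[0].set_title("Original data")
• axes[1].hist(scaled["value"]); axes[1].set_title("Scaled data")
• plt.show()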
• If the dates in a column are not stored as a DateTime type, the solution is to first find the type of the date column using the following command:
• df['Date'].dtype
• If the type of the column is other than DateTime, convert it to DateTime using the
following command:
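• A likely form of the conversion, assuming the dates are written day-first (e.g. 3/12/2016):
• df['Date'] = pd.to_datetime(df['Date'], dayfirst=True)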
• Inconsistent entries (for example, the same category written with different
capitalization or stray spaces) also need cleaning. One solution is to convert all
the entries of a column to lowercase and trim the extra space from each entry.
This can be reverted after the analysis is complete.
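• A sketch of this normalization, using the Regionname column from the dataset above:
• df['Regionname'] = df['Regionname'].str.lower().str.strip()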
• region = df['Regionname'].unique()
• Then we calculate the scores using fuzzy matching:
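• A minimal sketch using the fuzzywuzzy package; the query string and the limit of ten matches are assumptions:
• from fuzzywuzzy import process
• # returns (match, similarity score) pairs sorted by score
• matches = process.extract("southern victoria", region, limit=10)
• print(matches)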
• In machine learning, support vector machines (SVMs, also support vector networks)
are supervised learning models with associated learning algorithms that analyze
data used for classification and regression analysis.
• SVM also supports the kernel method, also called kernel SVM, which allows us to
tackle non-linear problems.
What is a Support Vector Machine?
• An SVM model is a representation of the examples as points in space, mapped so
that the examples of the separate categories are divided by a clear gap that is
as wide as possible.
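• A minimal sketch of a kernel SVM classifier with scikit-learn, using the bundled iris dataset as stand-in data:
• from sklearn.datasets import load_iris
• from sklearn.svm import SVC
• X, y = load_iris(return_X_y=True)
• # the RBF kernel lets the SVM learn a non-linear decision boundary
• clf = SVC(kernel="rbf", gamma="scale")
• clf.fit(X, y)
• print(clf.score(X, y))  # mean accuracy on the training data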
With scikit-learn, you can:
• Preprocess data
• Reduce the dimensionality of problems
• Validate models
• Select the most appropriate model
• Solve regression and classification problems
• Implement cluster analysis
Logistic Regression in Python With scikit-learn
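• A minimal sketch of fitting a logistic regression classifier with scikit-learn on assumed toy data:
• import numpy as np
• from sklearn.linear_model import LogisticRegression
• # one feature, binary target
• x = np.arange(10).reshape(-1, 1)
• y = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 1])
• model = LogisticRegression(solver="liblinear", random_state=0)
• model.fit(x, y)
• print(model.predict(x))   # predicted class labels
• print(model.score(x, y))  # mean accuracy on the training data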