Practical Preprocessing
Data Manipulation and Cleaning
The right data cleaning technique depends on the types of error your data contain, but a handful of cleaning activities are so common that they can make up roughly half of data preprocessing.
Before we begin, recall that data manipulation topics such as slicing, drilling, and importing other data file formats are outside the scope of this module (you should study them on your own).
Setting the work folder/Directory
# check working directory
> getwd()
#set working directory
> setwd("C:/Users/ebenu/Downloads/COMP1810Web AnalyticsLectures")
Importation of files
CSV file
Note the separator: read.csv() defaults to sep = ",", so sep = "" here would mean "any whitespace" and misread a comma-separated file.
data2 <- read.csv("input.csv", sep = ",", header = TRUE)
From an online source
df <- read.table("https://s3.amazonaws.com/assets.datacamp.com/blog_assets/test.txt",
header = FALSE)
df1 <- read.table("https://s3.amazonaws.com/assets.datacamp.com/blog_assets/test.csv",
header = FALSE,
sep = ",")
df2 <- read.csv("https://s3.amazonaws.com/assets.datacamp.com/blog_assets/test.csv",
header = FALSE)
read.delim() for Delimited Files
For tab-delimited files, use the read.delim() and read.delim2() functions (the latter also treats a comma as the decimal point). Both behave like read.table() with different defaults, just as read.csv() does; the separator can still be overridden with sep=, as in the data below.
df <- read.delim("https://s3.amazonaws.com/assets.datacamp.com/blog_assets/test_delim.txt", sep="$")
df <- read.delim2("https://s3.amazonaws.com/assets.datacamp.com/blog_assets/test_delim.txt", sep="$")
Identifying and Imputing Missing Values
RStudio shortcuts: Run = Ctrl+Alt+Enter; pipe operator (%>%) = Ctrl+Shift+M
Install these packages, used for data cleaning and manipulation:
tidyr (Tidy Messy Data) - https://tidyr.tidyverse.org
# The easiest way to get tidyr is to install the whole tidyverse:
# You may also install the waldo package to use its compare() function
install.packages("tidyverse")
# Alternatively, install just tidyr:
install.packages("tidyr")
Data sets
Lemonade2016, or starwars from the above package
1. Import the file.
2. Identify missing values: character placeholders, NA, and NaN values.
3. Count the missing values (na, NA, blank) in each column.
4. Replace them with the required numeric values.
> data2 <- read.csv("lemonade2016.csv", header = TRUE)
> data2
When identifying missing values, only NA is recognised; placeholders such as "-" and lowercase "na" are not.
Count the total number of NA missing values
> sum(is.na(data2))
[1] 11
Missing values for each column
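The per-column count comes from colSums() over is.na(); a minimal sketch on a toy data frame (the column names here are stand-ins for the Lemonade2016 columns):

```r
# Toy stand-in for data2 with two missing entries
df <- data.frame(Lemon  = c(97, NA, 98),
                 Orange = c(67, 67, NA))

sum(is.na(df))      # total missing values: 2
colSums(is.na(df))  # missing values per column: Lemon 1, Orange 1
```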
Impute NA with the column mean:
data2$Lemon[is.na(data2$Lemon)] <- round(mean(data2$Lemon, na.rm = TRUE))
Recheck the number of missing values.
If the number of missing values is small enough not to affect the overall analysis, you may drop them. Drop the missing values in the Orange and Location columns.
Remove every row that contains an NA:
data2_new <- data2[complete.cases(data2), ]
Using dplyr package (Grammar for data manipulation)
dplyr Verbs
select() (Selecting columns)
mutate() (Add or change columns).
filter() (Selecting rows)
summarise() (Summary of group of rows)
arrange() (Ordering of the rows).
Using the starwars dataset that comes with dplyr
Install the magrittr package to be able to use the pipe operator %>% (dplyr also re-exports it).
Using filter() (selecting rows) to select rows where eye colour is, and then is not, black:
starwars %>%
filter(eye_color =="black")
starwars %>%
filter(eye_color !="black")
Using select() (Selecting columns)
Selecting columns by index
Selecting columns by variable (column) name
Selecting columns by a range of indices
Selecting columns using different constructs.
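The constructs above can be sketched as follows (assuming dplyr is loaded):

```r
library(dplyr)

starwars %>% select(name, height, mass)  # by column name
starwars %>% select(1, 2, 3)             # by column index
starwars %>% select(1:3)                 # by a range of indices
starwars %>% select(name, height:mass)   # mixed: a name plus a name range
```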
Select helper functions of dplyr
These functions are used inside select(), hence they are called helper functions.
Helpers Description
starts_with() Starts with a prefix
ends_with() Ends with a suffix
contains() Contains a literal string
matches() Matches a regular expression
num_range() Numerical range like x01, x02, x03.
one_of() Variables in character vector.
everything() All variables.
Re-arranging the columns using column names and everything()
How to use the helpers starts_with(), ends_with(), contains()
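A sketch of the helpers on starwars, including everything() to move a chosen column to the front:

```r
library(dplyr)

starwars %>% select(starts_with("h"))    # height, hair_color, homeworld
starwars %>% select(ends_with("color"))  # hair_color, skin_color, eye_color
starwars %>% select(contains("_"))       # every column whose name has an underscore
starwars %>% select(mass, everything())  # re-arrange: mass first, then the rest
```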
Using mutate() (Add or change columns).
Using starwars, create a BMI column:
BMI = mass / (height/100)^2
Round the BMI column to two decimal places:
starwars %>%
mutate(bmi = mass/ (height/100)^2) %>%
mutate(bmi= round(bmi,2)) %>%
select(name,height, mass,bmi)
Using factors with mutate()
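One way to use factors inside mutate() is to convert a character column such as gender to a factor; a small sketch:

```r
library(dplyr)

starwars %>%
  mutate(gender = as.factor(gender)) %>%  # character -> factor, one level per category
  select(name, gender)
```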
Using arrange() (Ordering of the rows).
By default, arrange() sorts in ascending order:
starwars %>%
arrange(height)
starwars %>%
arrange(desc(height))
Using summarise() with the mtcars dataset
group_by()
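A sketch of the two together on mtcars: group the rows by cylinder count, then summarise each group:

```r
library(dplyr)

mtcars %>%
  group_by(cyl) %>%                  # one group per value of cyl: 4, 6, 8
  summarise(mean_mpg = mean(mpg),    # average mpg within each group
            n = n())                 # number of cars in each group
```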
tidyr practical
Install and load the tidyr package.
pivot_longer()
pivot_wider()
Making a tidy dataset from wide to long; see below.
Load tidyr and use the relig_income dataset:
pivot_longer(relig_income, -religion, names_to ="income" , values_to ="count" )
Using billboard datasets in tidyr
write.csv(billboard, "billboard.csv",row.names = FALSE)
billboard %>%
pivot_longer(
cols = starts_with("wk"),
names_to ="week",
values_to ="rank"
)
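pivot_wider() reverses the operation. A minimal sketch (assuming tidyr is loaded) that spreads the long billboard table back into one column per week:

```r
library(tidyr)

# Lengthen as above, then widen back: one wk* column per week
long <- billboard %>%
  pivot_longer(cols = starts_with("wk"),
               names_to = "week",
               values_to = "rank")

wide <- long %>%
  pivot_wider(names_from = week, values_from = rank)

# The round trip restores the original shape
dim(billboard)
dim(wide)
```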
String Manipulation
install.packages("stringr") # install the package
library(stringr) # load the package
All stringr functions start with str_
Functions of stringr
Getting and setting individual characters
str_length("abc")
#> [1] 3
x <- c("abcdef", "ghifjk")
# The 3rd letter
str_sub(x, 3, 3)
#> [1] "c" "i"
# The 2nd to 2nd-to-last character
str_sub(x, 2, -2)
#> [1] "bcde" "hifj"
Whitespace
Three functions add, remove, or modify whitespace:
Add space: str_pad()
> x <- c("abc", "defghi")
> str_pad(x, 10) # default pads on left
combine str_pad() and str_trunc():
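The two can be combined so every string comes out exactly the same width; a small sketch:

```r
library(stringr)

x <- c("abc", "defghijklmnopqr")

# First truncate anything longer than 10 characters (adds "..."),
# then pad anything shorter than 10 on the left
str_pad(str_trunc(x, 10), 10)
#> [1] "       abc" "defghij..."
```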
In this table we will remove the "wk" prefix and convert all the week numbers to integers:
billboard %>%
mutate(week = substr(week,3,4),
week= as.integer(week))
Detecting strings using str_detect()
Regular expressions and str_detect()
To detect any letter, upper or lower case
str_replace(): replace matched patterns in a string.
Count the number of matches with str_count()
str_locate(): locate the position of patterns in a string
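A sketch of these functions on a small vector:

```r
library(stringr)

x <- c("apple", "banana", "pear")

str_detect(x, "an")        # TRUE only where "an" occurs
str_detect(x, "[A-Za-z]")  # regular expression: any letter, upper or lower case
str_replace(x, "a", "o")   # replace the first "a" in each string
str_count(x, "a")          # number of matches per string: 1, 3, 1
str_locate(x, "an")        # start and end of the first match (NA if none)
```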
Web Scraping
install.packages("rvest")
We will be scraping the IMDb Top 250 list (IMDb Top 250 - IMDb).
Using html_nodes() to extract
This allows you to extract any specific tag from the HTML:
html_nodes("div")    # extract all div tags
If an element has class="hello":
html_nodes(".hello") # extract elements with the hello class
If an element has id="hi":
html_nodes("#hi")    # extract the element with the hi id
Extract the text using html_text()
Assign the result to titles so it can be cleaned up.
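In the pipelines below, top_movies is assumed to be the IMDb chart page parsed with rvest's read_html() (the exact URL is an assumption, not given in these notes). read_html() also accepts a literal HTML string, which gives a self-contained sketch of the same html_nodes() + html_text() workflow:

```r
library(rvest)

# top_movies <- read_html("https://www.imdb.com/chart/top/")  # online source (assumed)

# Offline demonstration on an inline snippet shaped like an IMDb table cell
page <- read_html('<div class="titleColumn"><a>The Shawshank Redemption</a></div>')

page %>%
  html_nodes("div.titleColumn a") %>%  # select the anchor inside the cell
  html_text()
#> [1] "The Shawshank Redemption"
```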
Extract the title
titles<- top_movies %>%
html_nodes("tbody tr td.titleColumn ") %>%
html_text() %>%
str_trim() %>%
str_split("\n") %>%
lapply(function(movie){
movie[2] # extract the 2nd element, which is the title
}) %>%
unlist() %>%
str_trim()
Extracting the year
years<-top_movies %>%
html_nodes("tbody tr td.titleColumn ") %>%
html_text() %>%
str_trim() %>%
str_split("\n") %>%
lapply(function(movie){
movie[3] # extract the 3rd element, which is the year
}) %>%
unlist() %>%
str_trim() %>%
str_replace("\\(", "") %>% # remove the opening parenthesis
str_replace("\\)", "") %>% # remove the closing parenthesis
as.integer()
Extracting the ratings
ratings<-top_movies %>%
html_nodes(".imdbRating strong") %>%
html_text() %>%
as.numeric()
Ranks
This one is simple:
ranks<- 1:250
Putting it all together by creating a table, i.e. a tibble:
top_movies_tables<- tibble(
Rank = ranks,
Title = titles,
Year = years,
Rating = ratings
)
Cleaning Data
Package and dataset to install and load
library(dplyr)
library(tidyr)
library(skimr)
The starwars dataset
The skimr package is used to produce the data overview (skim) shown below:
Let's extract height, mass and gender from the dataset
data <- starwars %>%
select(height, mass, gender)
data
Splitting the data: install and load the rsample package
library(rsample)
data_split <- initial_split(data)
data_train <- training(data_split)
data_test <-testing(data_split)
Check the number of rows in each split.
Keep data_test aside for validation and clean only data_train.
Creating a new feature bmi
data_train<- data_train %>%
mutate(bmi = mass/(height*height))
data_train
To check for missing values
Use skim() to check for missing values:
skim(data_train)
Or use base R:
any(is.na(data_train))
colSums(is.na(data_train))
Dropping missing values when they are very few
Drop the rows with missing height and gender values.
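A sketch of that drop using tidyr's drop_na() (here data_train is rebuilt from starwars so the snippet is self-contained):

```r
library(dplyr)
library(tidyr)

data_train <- starwars %>% select(height, mass, gender)  # stand-in for the training split

# Keep only the rows where height and gender are both present
data_tr_dropped <- data_train %>% drop_na(height, gender)
```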
Imputation of missing values for mass and bmi
ifelse(condition, value_if_true, value_if_false)
data_tr_imputed<-data_train %>%
mutate(mass =ifelse(is.na(mass), mean(mass,na.rm = TRUE),mass),
bmi =ifelse(is.na(bmi), mean(bmi,na.rm = TRUE),bmi))
data_tr_imputed
gender is a categorical variable and must be encoded
data_tr_imputed_encoded<-data_tr_imputed %>%
mutate(gender_masculine = ifelse(gender =="masculine",1,0)) %>%
select(-gender)
data_tr_imputed_encoded
Feature Scaling
Creating a function for normalisation (subtract the mean, then divide by the standard deviation):
normalize <- function(feature){
(feature - mean(feature))/sd(feature)
}
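As a quick check (note the subtraction sign: the definition is (feature - mean)/sd), a standardised vector has mean 0 and standard deviation 1:

```r
# Standardisation: centre on the mean, scale by the standard deviation
normalize <- function(feature) {
  (feature - mean(feature)) / sd(feature)
}

z <- normalize(c(2, 4, 6, 8))
round(mean(z), 10)  # 0
round(sd(z), 10)    # 1
```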
Complete processing pipeline
Putting the whole data-cleaning process into one pipeline
Steps
I. Feature Engineering.
II. Missing values.
III. Encoding categorical variables.
IV. Feature Scaling.
data_train %>%
mutate(bmi = mass/(height*height)) %>%
drop_na(height,gender) %>%
mutate(mass =ifelse(is.na(mass), mean(mass,na.rm = TRUE),mass),
bmi =ifelse(is.na(bmi), mean(bmi,na.rm = TRUE),bmi)) %>%
mutate(gender_masculine = ifelse(gender =="masculine",1,0)) %>%
select(-gender) %>%
mutate_all(normalize)
Using recipes for the data cleaning pipeline
install.packages("recipes")
library(recipes)
The recipes package provides functions for all of the steps coded above.
data_train %>%
recipe() %>%
step_mutate(BMI=mass/(height*height)) %>%
step_naomit(height,gender) %>%
step_impute_mean(mass, BMI) %>% # formerly step_meanimpute()
step_dummy(gender) %>%
step_normalize(everything()) %>%
prep()
Encoding a categorical dataset using iris
iris %>%
mutate(Species_versicolor = ifelse(Species =="versicolor",1,0),
Species_virginica = ifelse(Species =="virginica",1,0)) %>% # then drop the original Species column
select(-Species)