Practical Preprocessing
Data Manipulation and Cleaning
The right data cleaning technique depends on the types of error your data contain, but a handful of cleaning activities are so common that they can make up roughly half of data preprocessing.
Before we begin, recall that data manipulation topics such as slicing, drilling, and importing other data file formats are outside the scope of this module (you should study them on your own).
Setting the work folder/Directory
# check working directory
> getwd()
#set working directory
> setwd("C:/Users/ebenu/Downloads/COMP1810Web AnalyticsLectures")
Importation of files
CSV file
Note the separator: read.csv() defaults to sep = ",", so sep = "" here would mean "any whitespace" and misread a comma-separated file.
data2 <- read.csv("input.csv", sep = ",", header = TRUE)
From an online source
df <- read.table("https://s3.amazonaws.com/assets.datacamp.com/blog_assets/test.txt",
header = FALSE)
df1 <- read.table("https://s3.amazonaws.com/assets.datacamp.com/blog_assets/test.csv",
header = FALSE,
sep = ",")
df2 <- read.csv("https://s3.amazonaws.com/assets.datacamp.com/blog_assets/test.csv",
header = FALSE)
read.delim() for Delimited Files
For tab-delimited files, use the read.delim() and read.delim2() functions (the latter also treats a comma as the decimal point). Both behave like read.table() with different defaults, just as read.csv() does; the separator can still be overridden with sep=, as in the data below.
df <- read.delim("https://s3.amazonaws.com/assets.datacamp.com/blog_assets/test_delim.txt", sep="$")
df <- read.delim2("https://s3.amazonaws.com/assets.datacamp.com/blog_assets/test_delim.txt", sep="$")
Identifying and Imputing Missing Values
RStudio shortcuts: Run = Ctrl+Alt+Enter; pipe operator (%>%) = Ctrl+Shift+M
Install these packages, used for data cleaning and manipulation:
tidyr (Tidy Messy Data) - https://tidyr.tidyverse.org
# The easiest way to get tidyr is to install the whole tidyverse:
# You may also install the waldo package to use its compare() function
install.packages("tidyverse")
# Alternatively, install just tidyr:
install.packages("tidyr")
Data sets
Lemonade2016, or starwars from the above package
1. Import the file.
2. Identify missing values: character placeholders, NA, and NaN values.
3. Count the missing values (na, NA, blank) in each column.
4. Replace them with the required numeric values.
> data2 <- read.csv("lemonade2016.csv", header = TRUE)
> data2
When identifying missing values, only NA is recognised; placeholders such as "-" and lowercase "na" are not.
Count the total number of NA missing values
> sum(is.na(data2))
[1] 11
Missing values for each column
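The per-column count comes from colSums() over is.na(); a minimal sketch on a toy data frame (the column names here are stand-ins for the Lemonade2016 columns):

```r
# Toy stand-in for data2 with two missing entries
df <- data.frame(Lemon  = c(97, NA, 98),
                 Orange = c(67, 67, NA))

sum(is.na(df))      # total missing values: 2
colSums(is.na(df))  # missing values per column: Lemon 1, Orange 1
```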
Impute NA with the column mean:
data2$Lemon[is.na(data2$Lemon)] <- round(mean(data2$Lemon, na.rm = TRUE))
Recheck the number of missing values.
If the number of missing values is small enough not to affect the overall analysis, you may drop them. Drop the missing values in the Orange and Location columns.
Remove every row that contains an NA:
data2_new <- data2[complete.cases(data2), ]
Using dplyr package (Grammar for data manipulation)
dplyr Verbs
select() (Selecting columns)
mutate() (Add or change columns).
filter() (Selecting rows)
summarise() (Summary of group of rows)
arrange() (Ordering of the rows).
Using the starwars dataset that comes with dplyr
Install the magrittr package to be able to use the pipe operator %>% (dplyr also re-exports it).
Using filter() (selecting rows) to select rows where eye colour is, and then is not, black:
starwars %>%
filter(eye_color =="black")
starwars %>%
filter(eye_color !="black")
Using select() (Selecting columns)
Selecting columns by index
Selecting columns by variable (column) name
Selecting columns by a range of indices
Selecting columns using different constructs.
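The constructs above can be sketched as follows (assuming dplyr is loaded):

```r
library(dplyr)

starwars %>% select(name, height, mass)  # by column name
starwars %>% select(1, 2, 3)             # by column index
starwars %>% select(1:3)                 # by a range of indices
starwars %>% select(name, height:mass)   # mixed: a name plus a name range
```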
Select helper functions of dplyr
These functions are used inside select(), hence they are called helper functions.
Helpers Description
starts_with() Starts with a prefix
ends_with() Ends with a suffix
contains() Contains a literal string
matches() Matches a regular expression
num_range() Numerical range like x01, x02, x03.
one_of() Variables in character vector.
everything() All variables.
Re-arranging the columns using column names and everything()
How to use the helpers starts_with(), ends_with(), contains()
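A sketch of the helpers on starwars, including everything() to move a chosen column to the front:

```r
library(dplyr)

starwars %>% select(starts_with("h"))    # height, hair_color, homeworld
starwars %>% select(ends_with("color"))  # hair_color, skin_color, eye_color
starwars %>% select(contains("_"))       # every column whose name has an underscore
starwars %>% select(mass, everything())  # re-arrange: mass first, then the rest
```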
Using mutate() (Add or change columns).
Using starwars, create a BMI column:
BMI = mass / (height/100)^2
Round the BMI column to two decimal places:
starwars %>%
mutate(bmi = mass/ (height/100)^2) %>%
mutate(bmi= round(bmi,2)) %>%
select(name,height, mass,bmi)
Using factors with mutate()
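One way to use factors inside mutate() is to convert a character column such as gender to a factor; a small sketch:

```r
library(dplyr)

starwars %>%
  mutate(gender = as.factor(gender)) %>%  # character -> factor, one level per category
  select(name, gender)
```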
Using arrange() (Ordering of the rows).
By default, arrange() sorts in ascending order:
starwars %>%
arrange(height)
starwars %>%
arrange(desc(height))
Using summarise() with the mtcars dataset
group_by()
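A sketch of the two together on mtcars: group the rows by cylinder count, then summarise each group:

```r
library(dplyr)

mtcars %>%
  group_by(cyl) %>%                  # one group per value of cyl: 4, 6, 8
  summarise(mean_mpg = mean(mpg),    # average mpg within each group
            n = n())                 # number of cars in each group
```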
tidyr practical
Install and load the tidyr package.
pivot_longer()
pivot_wider()
Making a tidy dataset from wide to long; see below.
Load tidyr and use the relig_income dataset:
pivot_longer(relig_income, -religion, names_to ="income" , values_to ="count" )
Using billboard datasets in tidyr
write.csv(billboard, "billboard.csv",row.names = FALSE)
billboard %>%
pivot_longer(
cols = starts_with("wk"),
names_to ="week",
values_to ="rank"
)
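pivot_wider() reverses the operation. A minimal sketch (assuming tidyr is loaded) that spreads the long billboard table back into one column per week:

```r
library(tidyr)

# Lengthen as above, then widen back: one wk* column per week
long <- billboard %>%
  pivot_longer(cols = starts_with("wk"),
               names_to = "week",
               values_to = "rank")

wide <- long %>%
  pivot_wider(names_from = week, values_from = rank)

# The round trip restores the original shape
dim(billboard)
dim(wide)
```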
String Manipulation
install.packages("stringr") # install the package
library(stringr) # load the package
All stringr functions start with str_
Functions of stringr
Getting and setting individual characters
str_length("abc")
#> [1] 3
x <- c("abcdef", "ghifjk")
# The 3rd letter
str_sub(x, 3, 3)
#> [1] "c" "i"
# The 2nd to 2nd-to-last character
str_sub(x, 2, -2)
#> [1] "bcde" "hifj"
Whitespace
Three functions add, remove, or modify whitespace:
Add space: str_pad()
> x <- c("abc", "defghi")
> str_pad(x, 10) # default pads on left
combine str_pad() and str_trunc():
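The two can be combined so every string comes out exactly the same width; a small sketch:

```r
library(stringr)

x <- c("abc", "defghijklmnopqr")

# First truncate anything longer than 10 characters (adds "..."),
# then pad anything shorter than 10 on the left
str_pad(str_trunc(x, 10), 10)
#> [1] "       abc" "defghij..."
```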
In this table we will remove the "wk" prefix and convert all the week numbers to integers:
billboard %>%
mutate(week = substr(week,3,4),
week= as.integer(week))
Detecting strings using str_detect()
Regular expressions and str_detect()
To detect any letter, upper or lower case
str_replace(): replace matched patterns in a string.
Count the number of matches with str_count()
str_locate(): locate the position of patterns in a string
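A sketch of these functions on a small vector:

```r
library(stringr)

x <- c("apple", "banana", "pear")

str_detect(x, "an")        # TRUE only where "an" occurs
str_detect(x, "[A-Za-z]")  # regular expression: any letter, upper or lower case
str_replace(x, "a", "o")   # replace the first "a" in each string
str_count(x, "a")          # number of matches per string: 1, 3, 1
str_locate(x, "an")        # start and end of the first match (NA if none)
```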
Web Scraping
install.packages("rvest")
We will be scraping the IMDb Top 250 list (IMDb Top 250 - IMDb).
Using html_nodes() to extract
This allows you to extract any specific tag from the HTML:
html_nodes("div")    # extract all div tags
If an element has class="hello":
html_nodes(".hello") # extract elements with the hello class
If an element has id="hi":
html_nodes("#hi")    # extract the element with the hi id
Extract the text using html_text()
Assign the result to titles so it can be cleaned up.
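In the pipelines below, top_movies is assumed to be the IMDb chart page parsed with rvest's read_html() (the exact URL is an assumption, not given in these notes). read_html() also accepts a literal HTML string, which gives a self-contained sketch of the same html_nodes() + html_text() workflow:

```r
library(rvest)

# top_movies <- read_html("https://www.imdb.com/chart/top/")  # online source (assumed)

# Offline demonstration on an inline snippet shaped like an IMDb table cell
page <- read_html('<div class="titleColumn"><a>The Shawshank Redemption</a></div>')

page %>%
  html_nodes("div.titleColumn a") %>%  # select the anchor inside the cell
  html_text()
#> [1] "The Shawshank Redemption"
```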
Extract the title
titles<- top_movies %>%
html_nodes("tbody tr td.titleColumn ") %>%
html_text() %>%
str_trim() %>%
str_split("\n") %>%
lapply(function(movie){
movie[2] # extract the 2nd element, which is the title
}) %>%
unlist() %>%
str_trim()
Extracting the year
years<-top_movies %>%
html_nodes("tbody tr td.titleColumn ") %>%
html_text() %>%
str_trim() %>%
str_split("\n") %>%
lapply(function(movie){
movie[3] # extract the 3rd element, which is the year
}) %>%
unlist() %>%
str_trim() %>%
str_replace("\\(", "") %>% # remove the opening parenthesis
str_replace("\\)", "") %>% # remove the closing parenthesis
as.integer()
Extracting the ratings
ratings<-top_movies %>%
html_nodes(".imdbRating strong") %>%
html_text() %>%
as.numeric()
Ranks
This one is simple:
ranks<- 1:250
Putting it all together by creating a table, i.e. a tibble:
top_movies_tables<- tibble(
Rank = ranks,
Title = titles,
Year = years,
Rating = ratings
)
Cleaning Data
Package and dataset to install and load
library(dplyr)
library(tidyr)
library(skimr)
The starwars dataset
The skimr package is used to produce the data overview (skim) shown below:
Let's extract height, mass and gender from the dataset
data <- starwars %>%
select(height, mass, gender)
data
Splitting the data: install and load the rsample package
library(rsample)
data_split <- initial_split(data)
data_train <- training(data_split)
data_test <-testing(data_split)
Check the number of rows in each split.
Keep data_test aside for validation and clean only data_train.
Creating a new feature bmi
data_train<- data_train %>%
mutate(bmi = mass/(height*height))
data_train
To check for missing values
Use skim() to check for missing values:
skim(data_train)
Or use base R:
any(is.na(data_train))
colSums(is.na(data_train))
Dropping missing values when they are very few
Drop the rows with missing height and gender values.
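A sketch of that drop using tidyr's drop_na() (here data_train is rebuilt from starwars so the snippet is self-contained):

```r
library(dplyr)
library(tidyr)

data_train <- starwars %>% select(height, mass, gender)  # stand-in for the training split

# Keep only the rows where height and gender are both present
data_tr_dropped <- data_train %>% drop_na(height, gender)
```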
Imputation of missing values for mass and bmi
ifelse(condition, value_if_true, value_if_false)
data_tr_imputed<-data_train %>%
mutate(mass =ifelse(is.na(mass), mean(mass,na.rm = TRUE),mass),
bmi =ifelse(is.na(bmi), mean(bmi,na.rm = TRUE),bmi))
data_tr_imputed
gender is a categorical variable and must be encoded
data_tr_imputed_encoded<-data_tr_imputed %>%
mutate(gender_masculine = ifelse(gender =="masculine",1,0)) %>%
select(-gender)
data_tr_imputed_encoded
Feature Scaling
Creating a function for normalisation (subtract the mean, then divide by the standard deviation):
normalize <- function(feature){
(feature - mean(feature))/sd(feature)
}
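As a quick check (note the subtraction sign: the definition is (feature - mean)/sd), a standardised vector has mean 0 and standard deviation 1:

```r
# Standardisation: centre on the mean, scale by the standard deviation
normalize <- function(feature) {
  (feature - mean(feature)) / sd(feature)
}

z <- normalize(c(2, 4, 6, 8))
round(mean(z), 10)  # 0
round(sd(z), 10)    # 1
```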
Complete processing pipeline
Putting the whole data-cleaning process into one pipeline
Steps
I. Feature Engineering.
II. Missing values.
III. Encoding categorical variables.
IV. Feature Scaling.
data_train %>%
mutate(bmi = mass/(height*height)) %>%
drop_na(height,gender) %>%
mutate(mass =ifelse(is.na(mass), mean(mass,na.rm = TRUE),mass),
bmi =ifelse(is.na(bmi), mean(bmi,na.rm = TRUE),bmi)) %>%
mutate(gender_masculine = ifelse(gender =="masculine",1,0)) %>%
select(-gender) %>%
mutate_all(normalize)
Using recipes for the data cleaning pipeline
install.packages("recipes")
library(recipes)
The recipes package provides functions for all of the steps coded above.
data_train %>%
recipe() %>%
step_mutate(BMI=mass/(height*height)) %>%
step_naomit(height,gender) %>%
step_impute_mean(mass, BMI) %>% # formerly step_meanimpute()
step_dummy(gender) %>%
step_normalize(everything()) %>%
prep()
Encoding a categorical dataset using iris
iris %>%
mutate(Species_versicolor = ifelse(Species =="versicolor",1,0),
Species_virginica = ifelse(Species =="virginica",1,0)) %>% # then drop the original Species column
select(-Species)