CleaningData Chapter 3

This document discusses cleaning data in R. It covers type conversions between character, numeric, integer, factor and logical variables. It also discusses the lubridate package for coercing strings to dates and times. Additionally, it covers string manipulation functions like str_trim(), str_pad() and str_replace() from the stringr package. Finally, it discusses handling missing values, outliers, and obvious errors, including functions like is.na(), na.omit(), boxplot(), and summary().

Uploaded by

Mahmoud Trigui

CLEANING DATA IN R

Type conversions

Types of variables in R

character: "treatment", "123", "A"

numeric: 23.44, 120, NaN, Inf

integer: 4L, 1123L

factor: factor("Hello"), factor(8)

logical: TRUE, FALSE, NA

Types of variables in R
> class("hello")
[1] "character"
> class(3.844)
[1] "numeric"
> class(77L)
[1] "integer"
> class(factor("yes"))
[1] "factor"
> class(TRUE)
[1] "logical"
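
When cleaning a data frame, the same check can be run on every column at once. The following is a minimal sketch, assuming a small hypothetical data frame called `patients` (its name and columns are made up for illustration and do not come from the slides):

```r
# Hypothetical data frame, used only for illustration
patients <- data.frame(
  id        = c(1L, 2L, 3L),
  weight    = c(70.5, 82.1, 64.0),
  treatment = c("A", "B", "A"),
  smoker    = c(TRUE, FALSE, NA),
  stringsAsFactors = FALSE
)

# Class of every column, returned as a named character vector
col_classes <- sapply(patients, class)
print(col_classes)

# Character columns are common candidates for conversion (e.g. to factor)
needs_review <- names(col_classes)[col_classes == "character"]
print(needs_review)
```

Here `treatment` is the only column flagged, since it was read in as character.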

Type conversions
> as.character(2016)
[1] "2016"
> as.numeric(TRUE)
[1] 1
> as.integer(99)
[1] 99
> as.factor("something")
[1] something
Levels: something
> as.logical(0)
[1] FALSE
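
Two conversions tend to cause trouble when cleaning real data: strings that are not numbers, and factors whose labels look like numbers. The sketch below uses made-up vectors (`raw` and `f` are hypothetical names) to show what base R does in each case:

```r
# Character to numeric: values that cannot be parsed become NA (with a warning)
raw  <- c("12", "7.5", "unknown")
nums <- suppressWarnings(as.numeric(raw))
print(nums)    # 12.0  7.5   NA

# Factor to numeric: as.numeric() returns the internal level codes,
# so convert to character first to recover the labels as numbers
f      <- factor(c("10", "20", "10"))
codes  <- as.numeric(f)
values <- as.numeric(as.character(f))
print(codes)   # 1 2 1
print(values)  # 10 20 10
```

The warning is suppressed here only to keep the output short; in practice it is a useful signal that some values did not convert.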

Overview of lubridate

Written by Garrett Grolemund & Hadley Wickham

Coerce strings to dates

Dates with lubridate

# Load the lubridate package
> library(lubridate)
# Experiment with basic lubridate functions
# ymd(): year-month-day
> ymd("2015-08-25")
[1] "2015-08-25 UTC"
> ymd("2015 August 25")
[1] "2015-08-25 UTC"
# mdy(): month-day-year
> mdy("August 25, 2015")
[1] "2015-08-25 UTC"
# hms(): hour-minute-second
> hms("13:33:09")
[1] "13H 33M 9S"
# ymd_hms(): year-month-day hour-minute-second
> ymd_hms("2015/08/25 13.33.09")
[1] "2015-08-25 13:33:09 UTC"
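
Once a string has been coerced to a date, its parts can be extracted and different input formats can be compared. This is a small sketch, assuming the lubridate package is installed; the variable names `d` and `bad` are made up for illustration, and the printed form of a parsed date may differ between lubridate versions:

```r
library(lubridate)

# Parse a date, then pull out its components
d <- ymd("2015-08-25")
print(year(d))
print(month(d))
print(day(d))

# Two different input formats resolve to the same date
same_day <- mdy("August 25, 2015") == d
print(same_day)

# A string that cannot be parsed becomes NA (with a warning)
bad <- suppressWarnings(ymd("not a date"))
print(is.na(bad))
```

Checking for `NA` after parsing is a quick way to find entries that did not match the expected order.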

String manipulation

Overview of stringr

R package written by Hadley Wickham

Suite of helpful functions for working with strings

Functions share consistent interface

Key functions in stringr for cleaning data

# Load the stringr package
> library(stringr)
# Trim leading and trailing white space
> str_trim("  this is a test     ")
[1] "this is a test"
# Pad string with zeros on the left, to a total width of 7
> str_pad("24493", width = 7, side = "left", pad = "0")
[1] "0024493"
# Create character vector of names
> friends <- c("Sarah", "Tom", "Alice")
# Search for string in vector
> str_detect(friends, "Alice")
[1] FALSE FALSE  TRUE
# Replace string in vector
> str_replace(friends, "Alice", "David")
[1] "Sarah" "Tom"   "David"

Key functions in stringr for cleaning data

str_trim() - Trim leading and trailing white space

str_pad() - Pad with additional characters

str_detect() - Detect a pattern

str_replace() - Find and replace a pattern

Other helpful functions in base R

tolower() - Make all lowercase

toupper() - Make all uppercase

# Make all lowercase
> tolower("I AM TALKING LOUDLY!!")
[1] "i am talking loudly!!"
# Make all uppercase
> toupper("I am whispering...")
[1] "I AM WHISPERING..."
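
In practice these functions are usually chained. The sketch below assumes the stringr package is installed and uses made-up vectors (`ids` and `people` are hypothetical) to combine trimming, padding, case conversion and pattern detection:

```r
library(stringr)

# Hypothetical messy values, made up for illustration
ids    <- c(" 24493", "512 ", "  7")
people <- c("  SARAH ", "tom", " Alice")

# Trim white space, then left-pad IDs with zeros to a fixed width of 7
clean_ids <- str_pad(str_trim(ids), width = 7, side = "left", pad = "0")
print(clean_ids)     # "0024493" "0000512" "0000007"

# Trim white space and standardize case
clean_people <- tolower(str_trim(people))
print(clean_people)  # "sarah" "tom" "alice"

# Flag entries that match a pattern
has_alice <- str_detect(clean_people, "alice")
print(has_alice)     # FALSE FALSE TRUE
```

Trimming before padding matters: padding a string that still carries stray spaces would count those spaces toward the width.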

Missing and special values

Missing values

May be random, but dangerous to assume

Sometimes associated with variable/outcome of interest

In R, represented as NA

May appear in other forms

#N/A (Excel)

Single dot (SPSS, SAS)

Empty string
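
One way to handle these other forms is to declare them as missing when the file is read. This is a minimal sketch using base R's `read.csv()` and its `na.strings` argument; the CSV content and the name `scores` are made up for illustration:

```r
# Hypothetical CSV content, made up for illustration
csv_text <- "id,score,group
1,10,A
2,#N/A,B
3,.,
4,7,A"

# Tell read.csv() which strings should be treated as NA
scores <- read.csv(text = csv_text,
                   na.strings = c("NA", "#N/A", ".", ""),
                   stringsAsFactors = FALSE)

print(scores)
print(sum(is.na(scores)))  # 3
```

Because the placeholder strings are converted on import, the `score` column is read as a number rather than as text.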

Special values

Inf - "Infinite value" (indicative of outliers?)

1/0

1/0 + 1/0

33333^33333

NaN - "Not a number" (rethink a variable?)

0/0

1/0 - 1/0
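
Base R has a test function for each kind of special value. The sketch below puts the expressions above into one vector (the name `x` is arbitrary) and adds an `NA` and an ordinary number for comparison:

```r
x <- c(1/0, 33333^33333, 0/0, 1/0 - 1/0, NA, 5)

inf_flags    <- is.infinite(x)  # TRUE only for Inf and -Inf
nan_flags    <- is.nan(x)       # TRUE only for NaN
na_flags     <- is.na(x)        # TRUE for both NA and NaN
finite_flags <- is.finite(x)    # TRUE only for ordinary numbers

print(inf_flags)     # TRUE TRUE FALSE FALSE FALSE FALSE
print(nan_flags)     # FALSE FALSE TRUE TRUE FALSE FALSE
print(na_flags)      # FALSE FALSE TRUE TRUE TRUE FALSE
print(finite_flags)  # FALSE FALSE FALSE FALSE FALSE TRUE
```

Note that `is.na()` also reports `NaN` as missing, so `is.nan()` is needed to tell the two apart.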

Finding missing values

# Create small dataset (4 rows, 3 columns)
> df <- data.frame(A = c(1, NA, 8, NA),
                   B = c(3, NA, 88, 23),
                   C = c(2, 45, 3, 1))
# Check for NAs (result has the same size: 4 rows, 3 columns)
> is.na(df)
         A     B     C
[1,] FALSE FALSE FALSE
[2,]  TRUE  TRUE FALSE
[3,] FALSE FALSE FALSE
[4,]  TRUE FALSE FALSE
# Are there any NAs?
> any(is.na(df))
[1] TRUE
# Count number of NAs
> sum(is.na(df))
[1] 3
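
The same logical matrix can also show where the missing values are. A short sketch, reusing the data frame from the slide; `na_per_column` and `rows_with_na` are names introduced here for illustration:

```r
df <- data.frame(A = c(1, NA, 8, NA),
                 B = c(3, NA, 88, 23),
                 C = c(2, 45, 3, 1))

# Number of NAs in each column
na_per_column <- colSums(is.na(df))
print(na_per_column)  # A: 2, B: 1, C: 0

# Row numbers that contain at least one NA
rows_with_na <- which(rowSums(is.na(df)) > 0)
print(rows_with_na)   # rows 2 and 4
```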

Finding missing values

# Use summary() to find NAs
> summary(df)
       A              B              C
 Min.   :1.00   Min.   : 3.0   Min.   : 1.00
 1st Qu.:2.75   1st Qu.:13.0   1st Qu.: 1.75
 Median :4.50   Median :23.0   Median : 2.50
 Mean   :4.50   Mean   :38.0   Mean   :12.75
 3rd Qu.:6.25   3rd Qu.:55.5   3rd Qu.:13.50
 Max.   :8.00   Max.   :88.0   Max.   :45.00
 NA's   :2      NA's   :1

Dealing with missing values

# Find rows with no missing values
> complete.cases(df)
[1] TRUE FALSE TRUE FALSE
# Subset data, keeping only complete cases
> df[complete.cases(df), ]
  A  B C
1 1  3 2
3 8 88 3
# Another way to remove rows with NAs
> na.omit(df)
  A  B C
1 1  3 2
3 8 88 3
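
Removing rows is not always necessary: many summary functions can skip missing values for a single calculation. The sketch below reuses the slide's data frame and also checks that the two row-removal approaches keep the same rows (`same_rows` is a name introduced here):

```r
df <- data.frame(A = c(1, NA, 8, NA),
                 B = c(3, NA, 88, 23),
                 C = c(2, 45, 3, 1))

# Summary functions return NA if any value is missing
mean_with_na <- mean(df$A)
print(mean_with_na)     # NA

# na.rm = TRUE drops the missing values for this calculation only
mean_without_na <- mean(df$A, na.rm = TRUE)
print(mean_without_na)  # 4.5

# complete.cases() and na.omit() keep the same rows
same_rows <- identical(rownames(na.omit(df)),
                       rownames(df[complete.cases(df), ]))
print(same_rows)        # TRUE
```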

Outliers and obvious errors

Outliers
# Simulate some data
> set.seed(10)
> x <- c(rnorm(30, mean = 15, sd = 5), -5, 28, 35)
# View a boxplot
> boxplot(x, horizontal = TRUE)

[Figure: horizontal boxplot of x, with the outliers labeled]

Outliers

Extreme values distant from other values

Several causes

Valid measurements

Variability in measurement

Experimental error

Data entry error

May be discarded or retained depending on cause
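
To list the outlying values rather than only plot them, base R's `boxplot.stats()` applies the same default rule as `boxplot()`: points more than 1.5 times the interquartile range beyond the box. This sketch uses a small fixed vector made up for illustration (not the simulated data from the slide) so the result is reproducible:

```r
# Hypothetical measurements with two extreme values appended
x <- c(12, 14, 15, 15, 16, 17, 18, -5, 35)

# boxplot.stats() returns the five-number summary and the flagged points
stats    <- boxplot.stats(x)
outliers <- stats$out
print(outliers)  # -5 35

# Positions of the flagged values in the original vector
outlier_positions <- which(x %in% outliers)
print(outlier_positions)  # 8 9
```

Whether a flagged value is discarded or retained still depends on its cause; the rule only identifies candidates.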

Obvious errors
What if these values are supposed to represent ages?

Obvious errors

May appear in many forms

Values so extreme they can't be plausible (e.g. person aged 243)

Values that don't make sense (e.g. negative age)

Several causes

Measurement error

Data entry error

Special code for missing data (e.g. -1 means missing)

Should generally be removed or replaced
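
One common way to replace such values is to set them to `NA` so they are handled like any other missing data. A minimal sketch with made-up ages; the plausible range of 0 to 120 is an assumption chosen for illustration, not a rule from the slides:

```r
# Hypothetical ages, including a missing-data code (-1) and an impossible value
ages <- c(25, 41, -1, 243, 33)

# Assumption: ages outside 0 to 120 are treated as errors
implausible <- ages < 0 | ages > 120

# Replace the flagged values with NA, keeping the original vector intact
ages_clean <- ages
ages_clean[implausible] <- NA
print(ages_clean)  # 25 41 NA NA 33
```

Keeping the original vector makes it possible to review which values were changed and why.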

Finding outliers and errors

# Create another small dataset
> df2 <- data.frame(A = rnorm(100, 50, 10),
                    B = c(rnorm(99, 50, 10), 500),
                    C = c(rnorm(99, 50, 10), -1))
# View a summary
> summary(df2)
       A              B               C
 Min.   :23.7   Min.   : 26.9   Min.   :-1.0
 1st Qu.:43.7   1st Qu.: 43.7   1st Qu.:40.3
 Median :51.9   Median : 49.8   Median :48.5
 Mean   :50.4   Mean   : 54.9   Mean   :47.8
 3rd Qu.:56.9   3rd Qu.: 56.6   3rd Qu.:56.3
 Max.   :77.2   Max.   :500.0   Max.   :75.1

Finding outliers and errors

# View a histogram
> hist(df2$B, breaks = 20)

Finding outliers and errors

# View a boxplot
> boxplot(df2)
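
After a summary or plot reveals a suspicious value, the next step is usually to locate its row. The sketch below rebuilds a data frame like `df2`; the `set.seed(1)` call is an addition made here so the example is reproducible, and the simulated values will not match the summary shown on the slide:

```r
set.seed(1)  # assumption: seed added only to make this sketch reproducible
df2 <- data.frame(A = rnorm(100, 50, 10),
                  B = c(rnorm(99, 50, 10), 500),
                  C = c(rnorm(99, 50, 10), -1))

# Minimum and maximum of each column in one call
col_ranges <- sapply(df2, range)
print(col_ranges)

# Row positions of the most extreme values
row_max_b <- which.max(df2$B)  # row holding the appended 500
row_min_c <- which.min(df2$C)  # row holding the appended -1
print(c(row_max_b, row_min_c))
```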
