[go: up one dir, main page]

0% found this document useful (0 votes)
42 views72 pages

Week5 Slides

The document provides an introduction to data manipulation using the dplyr package in R, which is essential for data wrangling tasks such as filtering, selecting, mutating, and arranging data. It emphasizes the importance of transforming data into a suitable format for analysis and introduces key functions like filter(), select(), and mutate(). The document also illustrates the use of the starwars dataset to demonstrate these functions and their advantages over base R methods.

Uploaded by

Tùng Nguyễn
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
42 views72 pages

Week5 Slides

The document provides an introduction to data manipulation using the dplyr package in R, which is essential for data wrangling tasks such as filtering, selecting, mutating, and arranging data. It emphasizes the importance of transforming data into a suitable format for analysis and introduces key functions like filter(), select(), and mutate(). The document also illustrates the use of the starwars dataset to demonstrate these functions and their advantages over base R methods.

Uploaded by

Tùng Nguyễn
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 72

DSA2101

Essential Data Analytics Tools: Data Visualization

Yuting Huang

AY24/25

Weeks 5 Introduction to dplyr

1 / 72
What is data manipulation/ wrangling?

“Data janitor work”


It is extremely rare that the data you obtain will be in precisely the
right format for the analysis that you wish to do.
Very often, we need some or all of the following:
▶ Select only a subset of rows/ columns
▶ Create new variables or summaries
▶ Rename the variables
▶ Re-order the data
▶ ...
Artwork by Allison Horst.
2 / 72
What is data manipulation/data wrangling?

dplyr is a grammar of data manipulation, providing a set of functions


that help us solve the most common data manipulation challenges.

1. Data transformation
▶ filter(), select(), mutate(), arrange(), summarize()
▶ group_by() and %>%
2. Tidy data
▶ gather(), spread(), separate(), unite()
3. Relational data

3 / 72
Pre-requisites

The easiest way to get dplyr is to install the tidyverse package


(https://www.tidyverse.org/packages/).
▶ We will use the starwars data set from the package.

# install.packages("tidyverse")
library(tidyverse)
# Load data set from the package
data(starwars)

Artwork by Allison Horst.


4 / 72
Starwars data set
▶ glimpse() from dplyr allows us to examine the data and see as
much info as possible.

glimpse(starwars)

## Rows: 87
## Columns: 14
## $ name <chr> "Luke Skywalker", "C-3PO", "R2-D2", "Darth Vader",
## $ height <int> 172, 167, 96, 202, 150, 178, 165, 97, 183, 182, 18
## $ mass <dbl> 77.0, 75.0, 32.0, 136.0, 49.0, 120.0, 75.0, 32.0,
## $ hair_color <chr> "blond", NA, NA, "none", "brown", "brown, grey", "
## $ skin_color <chr> "fair", "gold", "white, blue", "white", "light", "
## $ eye_color <chr> "blue", "yellow", "red", "yellow", "brown", "blue"
## $ birth_year <dbl> 19.0, 112.0, 33.0, 41.9, 19.0, 52.0, 47.0, NA, 24.
## $ sex <chr> "male", "none", "none", "male", "female", "male",
## $ gender <chr> "masculine", "masculine", "masculine", "masculine"
## $ homeworld <chr> "Tatooine", "Tatooine", "Naboo", "Tatooine", "Alde
## $ species <chr> "Human", "Droid", "Droid", "Human", "Human", "Huma
## $ films <list> <"A New Hope", "The Empire Strikes Back", "Return
## $ vehicles <list> <"Snowspeeder", "Imperial Speeder Bike">, <>, <>,
## $ starships <list> <"X-wing", "Imperial shuttle">, <>, <>, "TIE Adva
5 / 72
Starwars data set
head(starwars)

## # A tibble: 6 x 14
## name height mass hair_color skin_color eye_color birth_year s
## <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <
## 1 Luke Sky~ 172 77 blond fair blue 19 m
## 2 C-3PO 167 75 <NA> gold yellow 112 n
## 3 R2-D2 96 32 <NA> white, bl~ red 33 n
## 4 Darth Va~ 202 136 none white yellow 41.9 m
## 5 Leia Org~ 150 49 brown light brown 19 f
## 6 Owen Lars 178 120 brown, gr~ light blue 52 m
## # i 5 more variables: homeworld <chr>, species <chr>, films <list>,
## # vehicles <list>, starships <list>

class(starwars)

## [1] "tbl_df" "tbl" "data.frame"

6 / 72
Tibbles

The output shows that it is in fact not a data frame, it is a tibble.


▶ Base R functions import data as data frames.
▶ data.frame(), read.csv(), etc.
▶ dplyr imports data as tibbles.
▶ A modern version of data frame, typically useful for large data
sets.
▶ Easy to view the numbers of rows, columns, and variable types.

7 / 72
Key functions in dplyr

The following functions, and the combinations of them, will allow you
to accomplish the vast majority of data cleaning tasks.
▶ filter(): Select observations (rows) by the value in their
columns.
▶ select(): Select variables (columns) by their names.
▶ mutate(): Create new variables.
▶ arrange(): Reorder the rows by ascending or descending order.
▶ summarize() or summarise(): Collapse many values down to a
single value.

In conjunction with group_by(), which splits a data set by values in


a variable, these functions help us deal with common data
manipulation challenges.

8 / 72
Applying dplyr functions

Each of these functions is called in an identical manner.


▶ The first argument is the data frame (tibble).
▶ The subsequent arguments describe what to do with the data
frame, using variable names without quotes.
▶ The output is a new data frame.
▶ The original data frame is not modified.

These operations can be chained using the pipe operator %>%.

9 / 72
Let’s get started!

Artwork by Allison Horst.

10 / 72
The filter() function

The filter() function subsets the rows in a data frame by testing


against a conditional statement.
The output will be a data set with fewer rows than the original data.

11 / 72
filter(starwars, sex == "female")

## # A tibble: 16 x 14
## name height mass hair_color skin_color eye_color birth_year s
## <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <
## 1 Leia Or~ 150 49 brown light brown 19 f
## 2 Beru Wh~ 165 75 brown light blue 47 f
## 3 Mon Mot~ 150 NA auburn fair blue 48 f
## 4 Padmé A~ 185 45 brown light brown 46 f
## 5 Shmi Sk~ 163 NA black fair brown 72 f
## 6 Ayla Se~ 178 55 none blue hazel 48 f
## 7 Adi Gal~ 184 50 none dark blue NA f
## 8 Luminar~ 170 56.2 black yellow blue 58 f
## 9 Barriss~ 166 50 black yellow blue 40 f
## 10 Dormé 165 NA brown light brown NA f
## 11 Zam Wes~ 168 55 blonde fair, gre~ yellow NA f
## 12 Taun We 213 NA none grey black NA f
## 13 Jocasta~ 167 NA white fair blue NA f
## 14 Shaak Ti 178 57 none red, blue~ black NA f
## 15 Rey NA NA brown light hazel NA f
## 16 Captain~ NA NA none none unknown NA f
## # i 5 more variables: homeworld <chr>, species <chr>, films <list>,
## # vehicles <list>, starships <list>
12 / 72
The filter() function

filter(starwars, sex == "female")

filter() allows us to keep rows based on a specified condition.


▶ The first argument is the name of the data frame.
▶ The second and subsequent arguments are conditions that must
be true to keep the row(s).

When we run filter(), the function executes the filtering operation


and print the result.
▶ It does not modify the original data frame. We need to assign the
output to a new object in order to save it.

13 / 72
Logical operators

▶ We can combine conditions with & or , to indicate and (check for


both conditions).
▶ . . . with | to indicate or (check for either condition).
▶ To filter all female or male character whose skin color is blue.

filter(starwars,
(sex == "female" | sex == "male"), skin_color == "blue")

## # A tibble: 2 x 14
## name height mass hair_color skin_color eye_color birth_year s
## <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <
## 1 Ayla Sec~ 178 55 none blue hazel 48 f
## 2 Mas Amed~ 196 NA none blue blue NA m
## # i 5 more variables: homeworld <chr>, species <chr>, films <list>,
## # vehicles <list>, starships <list>

14 / 72
Logical operators

The %in% operator matches conditions provided in a vector


constructed with c().
▶ A useful shortcut when we need to check whether certain values
are present in a column.

filter(starwars,
sex %in% c("female", "male"), skin_color == "blue")

▶ The code reads:


Give me the characters in starwars with blue-color skin,
whose sex is either “female” or “male”.

15 / 72
Compared to base R functions
▶ Base R uses the bracket method to select rows that satisfy
certain conditions.

# Base R
starwars[((starwars$sex == "female" | starwars$sex == "male") &
starwars$skin_color == "blue"), ]

▶ You can see the advantage of the dplyr syntax:

# dplyr method 1
filter(starwars,
(sex == "female" | sex == "male"), skin_color == "blue")

# dplyr method 2
filter(starwars,
sex %in% c("female", "male"), skin_color == "blue")

16 / 72
The select() function

The select() function returns a subset of columns.


The output will be a data set with fewer columns than the original
data.

17 / 72
The select() function

It is not uncommon to get data sets with hundreds (or even


thousands) or variables.
▶ When we want to zoom in on a particular set of variables, we can
use select().
▶ To select columns by name:

select(starwars, hair_color, birth_year)

▶ Compared to the base R bracket method

starwars[ , c("hair_color", "birth_year")]

18 / 72
The select() function

Compared to the base R method, the dplyr verbs are much more
flexible.
▶ To select columns located between hair_color and eye_color
(inclusive).

select(starwars, hair_color:eye_color)

▶ To select columns except those located between hair_color and


eye_color (inclusive).

select(starwars, -(hair_color:eye_color))

19 / 72
Helper functions

There are a number of helper functions you can use within select():
▶ starts_with("abc") matches column names that begin with
“abc”.
▶ ends_with("xyz") matches column names that end with “xyz”.
▶ contains("ijk") matches column names that contain “ijk”.
▶ num_range("x", 1:3) matches columns x1, x2, and x3.

For example, to select all columns that end with color.

select(starwars, ends_with("color"))

20 / 72
The mutate() function

The mutate() function adds new columns of data, thus “mutating”


the dimensions of the original data set.
The output will be a data frame with more columns than the original
data.

▶ By default, mutate() adds the new columns to the right hand


side of the data frame.

21 / 72
The mutate() function

Let us first create a new data frame, df1, with fewer columns so we
can see the manipulation results more easily.

df1 <- select(starwars, name, height, mass, species)


head(df1)

## # A tibble: 6 x 4
## name height mass species
## <chr> <int> <dbl> <chr>
## 1 Luke Skywalker 172 77 Human
## 2 C-3PO 167 75 Droid
## 3 R2-D2 96 32 Droid
## 4 Darth Vader 202 136 Human
## 5 Leia Organa 150 49 Human
## 6 Owen Lars 178 120 Human

22 / 72
The mutate() function

The following code creates a new column to the right of the original
data frame.

df2 <- mutate(df1, height_m = height/100)


head(df2, 3)

## # A tibble: 3 x 5
## name height mass species height_m
## <chr> <int> <dbl> <chr> <dbl>
## 1 Luke Skywalker 172 77 Human 1.72
## 2 C-3PO 167 75 Droid 1.67
## 3 R2-D2 96 32 Droid 0.96

▶ What is the corresponding command in base R?

23 / 72
The mutate() function
By default, mutate() adds new columns to the right of the data
frame.
▶ We can use the .before argument to add the variable before an
existing variable.

df2 <- mutate(df1, height_m = height/100, .before = name)


head(df2, 3)

## # A tibble: 3 x 5
## height_m name height mass species
## <dbl> <chr> <int> <dbl> <chr>
## 1 1.72 Luke Skywalker 172 77 Human
## 2 1.67 C-3PO 167 75 Droid
## 3 0.96 R2-D2 96 32 Droid

▶ We can also add the new variable after name by .after = name.

24 / 72
The arrange() function

The arrange() function changes the order of observations in a


data frame.

▶ It takes a data frame and a set of column names to order by.


▶ If you provide more than one column name, each additional
column will be used to break ties in the values of preceding ones.

25 / 72
The arrange() function
▶ By default, the function arranges observations in ascending order
of the provided variable.

arrange(df1, mass)

## # A tibble: 87 x 4
## name height mass species
## <chr> <int> <dbl> <chr>
## 1 Ratts Tyerel 79 15 Aleena
## 2 Yoda 66 17 Yoda’s species
## 3 Wicket Systri Warrick 88 20 Ewok
## 4 R2-D2 96 32 Droid
## 5 R5-D4 97 32 Droid
## 6 Sebulba 112 40 Dug
## 7 Padmé Amidala 185 45 Human
## 8 Dud Bolt 94 45 Vulptereen
## 9 Wat Tambor 193 48 Skakoan
## 10 Sly Moore 178 48 <NA>
## # i 77 more rows
26 / 72
The arrange() function
▶ To arrange a column in descending order, use the desc()
operator inside of arrange().

arrange(df1, desc(mass))

## # A tibble: 87 x 4
## name height mass species
## <chr> <int> <dbl> <chr>
## 1 Jabba Desilijic Tiure 175 1358 Hutt
## 2 Grievous 216 159 Kaleesh
## 3 IG-88 200 140 Droid
## 4 Darth Vader 202 136 Human
## 5 Tarfful 234 136 Wookiee
## 6 Owen Lars 178 120 Human
## 7 Bossk 190 113 Trandoshan
## 8 Chewbacca 228 112 Wookiee
## 9 Jek Tono Porkins 180 110 <NA>
## 10 Dexter Jettster 198 102 Besalisk
## # i 77 more rows
27 / 72
Compared to base R functions

▶ To arrange the data set in ascending order of mass,

df1[order(df1$mass), ]

▶ To do so in descending order of mass,

df1[order(-df1$mass), ]

28 / 72
arrange() multiple columns
▶ To arrange the data first by mass (in ascending order), then by
height (in descending order):

arrange(df1, mass, desc(height))

## # A tibble: 87 x 4
## name height mass species
## <chr> <int> <dbl> <chr>
## 1 Ratts Tyerel 79 15 Aleena
## 2 Yoda 66 17 Yoda’s species
## 3 Wicket Systri Warrick 88 20 Ewok
## 4 R5-D4 97 32 Droid
## 5 R2-D2 96 32 Droid
## 6 Sebulba 112 40 Dug
## 7 Padmé Amidala 185 45 Human
## 8 Dud Bolt 94 45 Vulptereen
## 9 Wat Tambor 193 48 Skakoan
## 10 Sly Moore 178 48 <NA>
## # i 77 more rows
29 / 72
The summarize() function

The summarize(), or summarise(), function creates individual


summary statistics from large data sets.

30 / 72
The summarize() function

To compute the average height for Star Wars characters:

summarize(starwars, height = mean(height, na.rm = TRUE))

## # A tibble: 1 x 1
## height
## <dbl>
## 1 175.

▶ na.rm = TRUE removes NA values from the calculation.


▶ The output data set collapses to a 1 × 1 tibble, containing the
mean heights of all Star Wars characters.

31 / 72
The summarize() function

summarize() returns a single row summarizing all observations in the


input.
▶ Not particularly useful on its own.
▶ More often, we are interested in group summaries:
▶ Academic performance for students by school and major.
▶ Gross monthly income for fresh grads by university and year.

When paired with group_by(), we can change the unit of analysis


from the entire data set to individual groups.

32 / 72
The group_by() function

The group_by() operator changes the unit of analysis from the entire
data set to individual groups.
▶ It has no effect on the select() function.
▶ The filter() and mutate() functions work within the group.
▶ The arrange() function ignores groupings by default. We can
turn it on by .by_group = TRUE.
▶ When paired with summarize(), we can compute group
summary statistics.

33 / 72
Example
Let us first use a simple data frame to understand the concepts.

df3 <- tibble(name = c("Alex", "Jay", "Cam", "Lily", "Haley", "Joe"),


status = c("full time", "part time", "part time",
"part time", "full time", "unknown"),
age = c(19, 49, 34, NA, NA, 10),
phones = c(1, 1, 1, 0, 1, 0))
df3

## # A tibble: 6 x 4
## name status age phones
## <chr> <chr> <dbl> <dbl>
## 1 Alex full time 19 1
## 2 Jay part time 49 1
## 3 Cam part time 34 1
## 4 Lily part time NA 0
## 5 Haley full time NA 1
## 6 Joe unknown 10 0

34 / 72
Create a group

▶ Create a group using the character values in status.


▶ Notice the second line in the code output: Groups: status [3].

df4 <- group_by(df3, status)


df4

## # A tibble: 6 x 4
## # Groups: status [3]
## name status age phones
## <chr> <chr> <dbl> <dbl>
## 1 Alex full time 19 1
## 2 Jay part time 49 1
## 3 Cam part time 34 1
## 4 Lily part time NA 0
## 5 Haley full time NA 1
## 6 Joe unknown 10 0

35 / 72
summarize() by group

▶ Compute group mean and name the new variable as mean_age.

summarize(df4, mean_age = mean(age, na.rm = TRUE))

## # A tibble: 3 x 2
## status mean_age
## <chr> <dbl>
## 1 full time 19
## 2 part time 41.5
## 3 unknown 10

▶ What happens if we remove na.rm = TRUE?

36 / 72
filter() by group

▶ Within each group, keep rows with value larger than or equal to
the group mean.

filter(df4, age >= mean(age))

## # A tibble: 1 x 4
## # Groups: status [1]
## name status age phones
## <chr> <chr> <dbl> <dbl>
## 1 Joe unknown 10 0

▶ Why are the full time and part time groups excluded from
the result?

37 / 72
mutate() by group

▶ Add a column that calculates the cumulative sum within each


group.

# Replace NAs with 0


mutate(df4, sum_phones = cumsum(phones))

## # A tibble: 6 x 5
## # Groups: status [3]
## name status age phones sum_phones
## <chr> <chr> <dbl> <dbl> <dbl>
## 1 Alex full time 19 1 1
## 2 Jay part time 49 1 1
## 3 Cam part time 34 1 2
## 4 Lily part time NA 0 2
## 5 Haley full time NA 1 2
## 6 Joe unknown 10 0 0

38 / 72
arrange() by group
arrange(df4, desc(age), .by_group = TRUE)

## # A tibble: 6 x 4
## # Groups: status [3]
## name status age phones
## <chr> <chr> <dbl> <dbl>
## 1 Alex full time 19 1
## 2 Haley full time NA 1
## 3 Jay part time 49 1
## 4 Cam part time 34 1
## 5 Lily part time NA 0
## 6 Joe unknown 10 0

▶ By default, arrange() ignores grouping.


▶ We specify .by_group = TRUE to sort the data within each
pre-defined group.
▶ Also, notice that NAs are sorted to the end of each group.

39 / 72
ungroup() after each group_by()
It is a good habit to use ungroup() at the end of a series of grouped
operations.
▶ Otherwise, the groupings will be carried in downstream analysis,
which is not always desirable.

df5 <- ungroup(df4)


df5

## # A tibble: 6 x 4
## name status age phones
## <chr> <chr> <dbl> <dbl>
## 1 Alex full time 19 1
## 2 Jay part time 49 1
## 3 Cam part time 34 1
## 4 Lily part time NA 0
## 5 Haley full time NA 1
## 6 Joe unknown 10 0

40 / 72
▶ Notice what we do in the following code.

df3 %>% group_by(status) %>%


arrange(desc(age), .by_group = TRUE) %>%
ungroup()

## # A tibble: 6 x 4
## name status age phones
## <chr> <chr> <dbl> <dbl>
## 1 Alex full time 19 1
## 2 Haley full time NA 1
## 3 Jay part time 49 1
## 4 Cam part time 34 1
## 5 Lily part time NA 0
## 6 Joe unknown 10 0

▶ %>% is called the forward pipe operator.


▶ Essentially, we “pipe” df3 forward into group_by(), and then
the subsequent output from group_by() into arrange().
▶ At the end, we ungroup().
41 / 72
The pipe operator

%>% simplifies our code and make it cleaner and readable.


▶ Introduced by the magrittr package in 2014, which is made
available under tidyverse.
▶ In 2021, base R (4.1.0) introduced its own pipe: |>

In our lectures, we will mainly use %>%, but it’s still good to know the
base-R pipe because you are likely to encounter it in wild-caught code.

42 / 72
Back to the starwars data
starwars_by_sex <- group_by(starwars, sex)
summarize(starwars_by_sex, mean_mass = mean(mass, na.rm = TRUE))

## # A tibble: 5 x 2
## sex mean_mass
## <chr> <dbl>
## 1 female 54.7
## 2 hermaphroditic 1358
## 3 male 80.2
## 4 none 69.8
## 5 <NA> 81

▶ The following is an equivalent approach using %>%

starwars_by_sex <- starwars %>%


group_by(sex) %>%
summarize(mean_mass = mean(mass, na.rm = TRUE))

43 / 72
The pipe operator %>%

▶ To remove the missing values (NA) in sex before summarizing:

starwars %>%
filter(!is.na(sex)) %>%
group_by(sex) %>%
summarize(mean_mass = mean(mass))

## # A tibble: 4 x 2
## sex mean_mass
## <chr> <dbl>
## 1 female NA
## 2 hermaphroditic 1358
## 3 male NA
## 4 none NA

44 / 72
Revisit the monthly rainfall data

▶ Recall that last week, we used dplyr verbs to prepare the rainfall
data.
▶ Now you should be able to understand the syntax.

rainfall <- read.csv("../data/rainfall.csv",


header = TRUE, stringsAsFactors = TRUE)
df <- rainfall %>%
mutate(month = lubridate::ym(month),
year = year(month),
total_rainfall = as.numeric(total_rainfall)) %>%
filter(year >= 2020)

45 / 72
Useful summary() functions

Here are some useful summary functions that come with dplyr:
▶ Measures of center: mean(), median()
▶ Measures of spread: sd(), var(), IQR()
▶ Measures of range: min(), quantile(), max()
▶ Measures of positions: first(x), nth(x, 2), last(x)
▶ Measures of count: n(), n_distinct().

46 / 72
# Find the shortest character
starwars %>%
summarize(shortest = first(name, order_by = height))

## # A tibble: 1 x 1
## shortest
## <chr>
## 1 Yoda

# Count the number of characters by gender


starwars %>%
group_by(gender) %>%
summarize(n = n())

## # A tibble: 3 x 2
## gender n
## <chr> <int>
## 1 feminine 17
## 2 masculine 66
## 3 <NA> 4

47 / 72
When we need to perform the same function(s) across a set of
columns, we can use the across() method.
▶ . . . applies the same transformation to multiple columns.
▶ Based on the column names or a certain condition.

Artwork by Allison Horst.

48 / 72
The across() method

starwars %>% summarize(across(height:mass, min, na.rm = TRUE))

## # A tibble: 1 x 2
## height mass
## <int> <dbl>
## 1 66 15

starwars %>% summarize(across(where(is.numeric), mean, na.rm = TRUE))

## # A tibble: 1 x 3
## height mass birth_year
## <dbl> <dbl> <dbl>
## 1 175. 97.3 87.6

49 / 72
The across() method
▶ across() also works on mutate().
▶ You can type vignette("colwise") in your Console for more
details.

starwars %>% mutate(across(where(is.character), tolower)) %>% head()

## # A tibble: 6 x 14
## name height mass hair_color skin_color eye_color birth_year s
## <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <
## 1 luke sky~ 172 77 blond fair blue 19 m
## 2 c-3po 167 75 <NA> gold yellow 112 n
## 3 r2-d2 96 32 <NA> white, bl~ red 33 n
## 4 darth va~ 202 136 none white yellow 41.9 m
## 5 leia org~ 150 49 brown light brown 19 f
## 6 owen lars 178 120 brown, gr~ light blue 52 m
## # i 5 more variables: homeworld <chr>, species <chr>, films <list>,
## # vehicles <list>, starships <list>

50 / 72
Dealing with missing value

▶ There are missing values in the starwars tibble.

# Remove all missing values


starwars %>% na.omit()

# Remove missing values in a specified column


starwars %>% filter(!is.na(gender))

51 / 72
Dealing with missing value

▶ Sometimes missing value is not coded as NA.


▶ Missing values can be represented by a value (e.g., -1, 999), a
string (e.g., none, unknown), or just an empty cell.
▶ The following code reads none into NA across all columns:

starwars %>% mutate(across(where(is.character), na_if, "none"))

▶ After converting the value to NA, we can remove missing values


with na.omit().

starwars %>%
mutate(across(where(is.character), na_if, "none")) %>%
na.omit()

52 / 72
Other useful functions

▶ distinct() finds all unique rows in a data set.

# Remove duplicated rows, if any


starwars %>% distinct()

# Find all unique categories of sex


starwars %>% distinct(sex)

## # A tibble: 5 x 1
## sex
## <chr>
## 1 male
## 2 none
## 3 female
## 4 hermaphroditic
## 5 <NA>

53 / 72
▶ The count() function counts the number of occurrences.

# Count occurrences of unique sex


starwars %>% count(sex)

## # A tibble: 5 x 2
## sex n
## <chr> <int>
## 1 female 16
## 2 hermaphroditic 1
## 3 male 60
## 4 none 6
## 5 <NA> 4

54 / 72
# Count occurrences of unique combinations of sex and species
starwars %>% count(sex, species)

## # A tibble: 41 x 3
## sex species n
## <chr> <chr> <int>
## 1 female Clawdite 1
## 2 female Human 9
## 3 female Kaminoan 1
## 4 female Mirialan 2
## 5 female Tholothian 1
## 6 female Togruta 1
## 7 female Twi’lek 1
## 8 hermaphroditic Hutt 1
## 9 male Aleena 1
## 10 male Besalisk 1
## # i 31 more rows

55 / 72
# ... and sort in descending order of occurrences
starwars %>% count(sex, species, sort = TRUE)

## # A tibble: 41 x 3
## sex species n
## <chr> <chr> <int>
## 1 male Human 26
## 2 female Human 9
## 3 none Droid 6
## 4 <NA> <NA> 4
## 5 male Gungan 3
## 6 female Mirialan 2
## 7 male Wookiee 2
## 8 male Zabrak 2
## 9 female Clawdite 1
## 10 female Kaminoan 1
## # i 31 more rows

56 / 72
▶ The rename() function helps s to rename variables.
▶ The syntax is df %>% rename(new_name = old_name).

# Rename "name" as "character_name"


starwars %>% rename(character_name = name) %>% head(3)

## # A tibble: 3 x 14
## character_name height mass hair_color skin_color eye_color birth_
## <chr> <int> <dbl> <chr> <chr> <chr> <
## 1 Luke Skywalker 172 77 blond fair blue
## 2 C-3PO 167 75 <NA> gold yellow
## 3 R2-D2 96 32 <NA> white, blue red
## # i 6 more variables: gender <chr>, homeworld <chr>, species <chr>,
## # films <list>, vehicles <list>, starships <list>

57 / 72
▶ slice_head() and slice_tail() select the first or last rows.
▶ slice_min() and slice_max() select rows with the smallest or
largest values of a variable.

starwars %>%
group_by(gender) %>%
slice_max(mass, n = 1) %>%
relocate(gender, mass) # relocate column positions

## # A tibble: 3 x 14
## # Groups: gender [3]
## gender mass name height hair_color skin_color eye_color birth
## <chr> <dbl> <chr> <int> <chr> <chr> <chr>
## 1 feminine 75 Beru ~ 165 brown light blue
## 2 masculine 1358 Jabba~ 175 <NA> green-tan~ orange
## 3 <NA> 110 Jek T~ 180 brown fair blue
## # i 5 more variables: homeworld <chr>, species <chr>, films <list>,
## # vehicles <list>, starships <list>

58 / 72
The starwars tibble contains three columns of lists: films,
vehicles, and starships.
▶ The unnest() function can expand a list column of a tibble into
rows and columns.

starwars_films <- starwars %>%


select(name, films) %>%
unnest(films)
head(starwars_films)

## # A tibble: 6 x 2
## name films
## <chr> <chr>
## 1 Luke Skywalker A New Hope
## 2 Luke Skywalker The Empire Strikes Back
## 3 Luke Skywalker Return of the Jedi
## 4 Luke Skywalker Revenge of the Sith
## 5 Luke Skywalker The Force Awakens
## 6 C-3PO A New Hope

59 / 72
Case study: New York flights data

▶ The nycflights13::flights data contain all 336,776 flights


that departed from New York City in 2013.
▶ You can read its documentation in ?flights.
▶ We will use it to practice dplyr skills.

60 / 72
flights data

# install.packages("nycflights13")
library(nycflights13)
data(flights)
head(flights)

## # A tibble: 6 x 19
## year month day dep_time sched_dep_time dep_delay arr_time sched_
## <int> <int> <int> <int> <int> <dbl> <int>
## 1 2013 1 1 517 515 2 830
## 2 2013 1 1 533 529 4 850
## 3 2013 1 1 542 540 2 923
## 4 2013 1 1 544 545 -1 1004
## 5 2013 1 1 554 600 -6 812
## 6 2013 1 1 554 558 -4 740
## # i 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## # tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance
## # hour <dbl>, minute <dbl>, time_hour <dttm>

61 / 72
Subset observations by their values

▶ Select all flights on February 14.

flights %>% filter(month ==2, day == 14)

Exercises: Find all flights in 2013 that


▶ were delayed (on arrival or departure) by more than two hours.
▶ departed in summer (July, August, and September).
▶ had a missing dep_time.

62 / 72
Subset columns by their names

What do the following commands do? Try them out.

# 1
flights %>% select(dep_time, dep_delay, arr_time, arr_delay)

# 2
flights %>% select(starts_with("dep_"), starts_with("arr_"))

# 3
flights_JFK <- flights %>%
filter(origin == "JFK") %>%
select(year:day, dest, ends_with("delay"), distance, dep_time)

63 / 72
Extract hours and minutes from departure time

▶ Let’s work with the flights_JFK tibble from #3 on the previous


slide.
▶ Compute hour and minute from dep_time:

flights_JFK <- flights_JFK %>%


mutate(hour = dep_time %/% 100,
minute = dep_time %% 100, .after = day)
flights_JFK %>% head(3)

## # A tibble: 3 x 10
## year month day hour minute dest dep_delay arr_delay distance d
## <int> <int> <int> <dbl> <dbl> <chr> <dbl> <dbl> <dbl>
## 1 2013 1 1 5 42 MIA 2 33 1089
## 2 2013 1 1 5 44 BQN -1 -18 1576
## 3 2013 1 1 5 57 MCO -3 -8 944

64 / 72
Flight status

▶ Categorize the arrival status into three groups: late, on time,


and cancelled.
▶ Then summarize the occurrences of status.

flights_JFK %>%
mutate(arr_status = ifelse(is.na(arr_delay), "cancelled",
ifelse(arr_delay > 0, "late", "on time")))
count(arr_status)

## # A tibble: 3 x 2
## arr_status n
## <chr> <int>
## 1 cancelled 2200
## 2 late 42885
## 3 on time 66194

65 / 72
Bucket flight status

▶ There is a better option for categorization with more than two


categories. Here is how it works:

flights_JFK %>%
mutate(arr_status =
case_when(is.na(arr_delay) ~ "cancelled",
arr_delay <= 0 ~ "on time",
arr_delay > 0 ~ "late")) %>%
count(arr_status)

▶ Much clearer code. Also more robust since it does not depend on
the order we list the conditions.

66 / 72
More on case_when()

▶ case_when() vectorizes multiple ifelse() – more concise code!


▶ It accepts a sequence of two-sided formula: condition ~ result.
▶ TRUE can be specified as a catch-all, which is used when all
previous conditions return FALSE.

Artwork by Allison Horst.

67 / 72
Monthly mean departure delay
flights_JFK %>% group_by(month) %>%
summarize(mean_dep_delay = mean(dep_delay, na.rm = TRUE)) %>%
arrange(desc(mean_dep_delay))

## # A tibble: 12 x 2
## month mean_dep_delay
## <int> <dbl>
## 1 7 23.8
## 2 6 20.5
## 3 12 14.8
## 4 8 12.9
## 5 5 12.5
## 6 4 12.2
## 7 2 11.8
## 8 3 10.7
## 9 1 8.62
## 10 9 6.64
## 11 11 4.68
## 12 10 4.59

68 / 72
Destinations by the number of flights

▶ Destination airports with more than 5000 flights originated from


JFK in 2013.

flights %>% filter(origin == "JFK") %>%


count(dest) %>%
filter(n > 5000)

## # A tibble: 4 x 2
## dest n
## <chr> <int>
## 1 BOS 5898
## 2 LAX 11262
## 3 MCO 5464
## 4 SFO 8204

69 / 72
Summary

This week we learn the key dplyr functions that solve the vast
majority of data manipulation challenges:
▶ Subset observations by their values: filter()
▶ Subset variables by their names: select()
▶ Create new variables: mutate()
▶ Reorder the rows: arrange()
▶ Collapse many values down to a single summary: summarize()
or summarise()

These functions can be used in conjunction with group_by(), which


changes the scope of each function from operating on the entire data
set to operating on it within groups.

70 / 72
Summary

71 / 72
More on tidyverse

▶ tidyverse is a suite of R packages designed for data science.


▶ Its core packages include dplyr, readr, stringr, forcats,
purrr, ggplot2.
▶ Learn more about it at https://www.tidyverse.org/packages/.

72 / 72

You might also like