Week5 Slides
Week5 Slides
Yuting Huang
AY24/25
1 / 72
What is data manipulation/ wrangling?
1. Data transformation
▶ filter(), select(), mutate(), arrange(), summarize()
▶ group_by() and %>%
2. Tidy data
▶ gather(), spread(), separate(), unite()
3. Relational data
3 / 72
Pre-requisites
# install.packages("tidyverse")
library(tidyverse)
# Load data set from the package
data(starwars)
glimpse(starwars)
## Rows: 87
## Columns: 14
## $ name <chr> "Luke Skywalker", "C-3PO", "R2-D2", "Darth Vader",
## $ height <int> 172, 167, 96, 202, 150, 178, 165, 97, 183, 182, 18
## $ mass <dbl> 77.0, 75.0, 32.0, 136.0, 49.0, 120.0, 75.0, 32.0,
## $ hair_color <chr> "blond", NA, NA, "none", "brown", "brown, grey", "
## $ skin_color <chr> "fair", "gold", "white, blue", "white", "light", "
## $ eye_color <chr> "blue", "yellow", "red", "yellow", "brown", "blue"
## $ birth_year <dbl> 19.0, 112.0, 33.0, 41.9, 19.0, 52.0, 47.0, NA, 24.
## $ sex <chr> "male", "none", "none", "male", "female", "male",
## $ gender <chr> "masculine", "masculine", "masculine", "masculine"
## $ homeworld <chr> "Tatooine", "Tatooine", "Naboo", "Tatooine", "Alde
## $ species <chr> "Human", "Droid", "Droid", "Human", "Human", "Huma
## $ films <list> <"A New Hope", "The Empire Strikes Back", "Return
## $ vehicles <list> <"Snowspeeder", "Imperial Speeder Bike">, <>, <>,
## $ starships <list> <"X-wing", "Imperial shuttle">, <>, <>, "TIE Adva
5 / 72
Starwars data set
head(starwars)
## # A tibble: 6 x 14
## name height mass hair_color skin_color eye_color birth_year s
## <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <
## 1 Luke Sky~ 172 77 blond fair blue 19 m
## 2 C-3PO 167 75 <NA> gold yellow 112 n
## 3 R2-D2 96 32 <NA> white, bl~ red 33 n
## 4 Darth Va~ 202 136 none white yellow 41.9 m
## 5 Leia Org~ 150 49 brown light brown 19 f
## 6 Owen Lars 178 120 brown, gr~ light blue 52 m
## # i 5 more variables: homeworld <chr>, species <chr>, films <list>,
## # vehicles <list>, starships <list>
class(starwars)
6 / 72
Tibbles
7 / 72
Key functions in dplyr
The following functions, and the combinations of them, will allow you
to accomplish the vast majority of data cleaning tasks.
▶ filter(): Select observations (rows) by the value in their
columns.
▶ select(): Select variables (columns) by their names.
▶ mutate(): Create new variables.
▶ arrange(): Reorder the rows by ascending or descending order.
▶ summarize() or summarise(): Collapse many values down to a
single value.
8 / 72
Applying dplyr functions
9 / 72
Let’s get started!
10 / 72
The filter() function
11 / 72
filter(starwars, sex == "female")
## # A tibble: 16 x 14
## name height mass hair_color skin_color eye_color birth_year s
## <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <
## 1 Leia Or~ 150 49 brown light brown 19 f
## 2 Beru Wh~ 165 75 brown light blue 47 f
## 3 Mon Mot~ 150 NA auburn fair blue 48 f
## 4 Padmé A~ 185 45 brown light brown 46 f
## 5 Shmi Sk~ 163 NA black fair brown 72 f
## 6 Ayla Se~ 178 55 none blue hazel 48 f
## 7 Adi Gal~ 184 50 none dark blue NA f
## 8 Luminar~ 170 56.2 black yellow blue 58 f
## 9 Barriss~ 166 50 black yellow blue 40 f
## 10 Dormé 165 NA brown light brown NA f
## 11 Zam Wes~ 168 55 blonde fair, gre~ yellow NA f
## 12 Taun We 213 NA none grey black NA f
## 13 Jocasta~ 167 NA white fair blue NA f
## 14 Shaak Ti 178 57 none red, blue~ black NA f
## 15 Rey NA NA brown light hazel NA f
## 16 Captain~ NA NA none none unknown NA f
## # i 5 more variables: homeworld <chr>, species <chr>, films <list>,
## # vehicles <list>, starships <list>
12 / 72
The filter() function
13 / 72
Logical operators
filter(starwars,
(sex == "female" | sex == "male"), skin_color == "blue")
## # A tibble: 2 x 14
## name height mass hair_color skin_color eye_color birth_year s
## <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <
## 1 Ayla Sec~ 178 55 none blue hazel 48 f
## 2 Mas Amed~ 196 NA none blue blue NA m
## # i 5 more variables: homeworld <chr>, species <chr>, films <list>,
## # vehicles <list>, starships <list>
14 / 72
Logical operators
filter(starwars,
sex %in% c("female", "male"), skin_color == "blue")
15 / 72
Compared to base R functions
▶ Base R uses the bracket method to select rows that satisfy
certain conditions.
# Base R
starwars[((starwars$sex == "female" | starwars$sex == "male") &
starwars$skin_color == "blue"), ]
# dplyr method 1
filter(starwars,
(sex == "female" | sex == "male"), skin_color == "blue")
# dplyr method 2
filter(starwars,
sex %in% c("female", "male"), skin_color == "blue")
16 / 72
The select() function
17 / 72
The select() function
18 / 72
The select() function
Compared to the base R method, the dplyr verbs are much more
flexible.
▶ To select columns located between hair_color and eye_color
(inclusive).
select(starwars, hair_color:eye_color)
select(starwars, -(hair_color:eye_color))
19 / 72
Helper functions
There are a number of helper functions you can use within select():
▶ starts_with("abc") matches column names that begin with
“abc”.
▶ ends_with("xyz") matches column names that end with “xyz”.
▶ contains("ijk") matches column names that contain “ijk”.
▶ num_range("x", 1:3) matches columns x1, x2, and x3.
select(starwars, ends_with("color"))
20 / 72
The mutate() function
21 / 72
The mutate() function
Let us first create a new data frame, df1, with fewer columns so we
can see the manipulation results more easily.
## # A tibble: 6 x 4
## name height mass species
## <chr> <int> <dbl> <chr>
## 1 Luke Skywalker 172 77 Human
## 2 C-3PO 167 75 Droid
## 3 R2-D2 96 32 Droid
## 4 Darth Vader 202 136 Human
## 5 Leia Organa 150 49 Human
## 6 Owen Lars 178 120 Human
22 / 72
The mutate() function
The following code creates a new column to the right of the original
data frame.
## # A tibble: 3 x 5
## name height mass species height_m
## <chr> <int> <dbl> <chr> <dbl>
## 1 Luke Skywalker 172 77 Human 1.72
## 2 C-3PO 167 75 Droid 1.67
## 3 R2-D2 96 32 Droid 0.96
23 / 72
The mutate() function
By default, mutate() adds new columns to the right of the data
frame.
▶ We can use the .before argument to add the variable before an
existing variable.
## # A tibble: 3 x 5
## height_m name height mass species
## <dbl> <chr> <int> <dbl> <chr>
## 1 1.72 Luke Skywalker 172 77 Human
## 2 1.67 C-3PO 167 75 Droid
## 3 0.96 R2-D2 96 32 Droid
▶ We can also add the new variable after name by .after = name.
24 / 72
The arrange() function
25 / 72
The arrange() function
▶ By default, the function arranges observations in ascending order
of the provided variable.
arrange(df1, mass)
## # A tibble: 87 x 4
## name height mass species
## <chr> <int> <dbl> <chr>
## 1 Ratts Tyerel 79 15 Aleena
## 2 Yoda 66 17 Yoda’s species
## 3 Wicket Systri Warrick 88 20 Ewok
## 4 R2-D2 96 32 Droid
## 5 R5-D4 97 32 Droid
## 6 Sebulba 112 40 Dug
## 7 Padmé Amidala 185 45 Human
## 8 Dud Bolt 94 45 Vulptereen
## 9 Wat Tambor 193 48 Skakoan
## 10 Sly Moore 178 48 <NA>
## # i 77 more rows
26 / 72
The arrange() function
▶ To arrange a column in descending order, use the desc()
operator inside of arrange().
arrange(df1, desc(mass))
## # A tibble: 87 x 4
## name height mass species
## <chr> <int> <dbl> <chr>
## 1 Jabba Desilijic Tiure 175 1358 Hutt
## 2 Grievous 216 159 Kaleesh
## 3 IG-88 200 140 Droid
## 4 Darth Vader 202 136 Human
## 5 Tarfful 234 136 Wookiee
## 6 Owen Lars 178 120 Human
## 7 Bossk 190 113 Trandoshan
## 8 Chewbacca 228 112 Wookiee
## 9 Jek Tono Porkins 180 110 <NA>
## 10 Dexter Jettster 198 102 Besalisk
## # i 77 more rows
27 / 72
Compared to base R functions
df1[order(df1$mass), ]
df1[order(-df1$mass), ]
28 / 72
arrange() multiple columns
▶ To arrange the data first by mass (in ascending order), then by
height (in descending order):
## # A tibble: 87 x 4
## name height mass species
## <chr> <int> <dbl> <chr>
## 1 Ratts Tyerel 79 15 Aleena
## 2 Yoda 66 17 Yoda’s species
## 3 Wicket Systri Warrick 88 20 Ewok
## 4 R5-D4 97 32 Droid
## 5 R2-D2 96 32 Droid
## 6 Sebulba 112 40 Dug
## 7 Padmé Amidala 185 45 Human
## 8 Dud Bolt 94 45 Vulptereen
## 9 Wat Tambor 193 48 Skakoan
## 10 Sly Moore 178 48 <NA>
## # i 77 more rows
29 / 72
The summarize() function
30 / 72
The summarize() function
## # A tibble: 1 x 1
## height
## <dbl>
## 1 175.
31 / 72
The summarize() function
32 / 72
The group_by() function
The group_by() operator changes the unit of analysis from the entire
data set to individual groups.
▶ It has no effect on the select() function.
▶ The filter() and mutate() functions work within the group.
▶ The arrange() function ignores groupings by default. We can
turn it on by .by_group = TRUE.
▶ When paired with summarize(), we can compute group
summary statistics.
33 / 72
Example
Let us first use a simple data frame to understand the concepts.
## # A tibble: 6 x 4
## name status age phones
## <chr> <chr> <dbl> <dbl>
## 1 Alex full time 19 1
## 2 Jay part time 49 1
## 3 Cam part time 34 1
## 4 Lily part time NA 0
## 5 Haley full time NA 1
## 6 Joe unknown 10 0
34 / 72
Create a group
## # A tibble: 6 x 4
## # Groups: status [3]
## name status age phones
## <chr> <chr> <dbl> <dbl>
## 1 Alex full time 19 1
## 2 Jay part time 49 1
## 3 Cam part time 34 1
## 4 Lily part time NA 0
## 5 Haley full time NA 1
## 6 Joe unknown 10 0
35 / 72
summarize() by group
## # A tibble: 3 x 2
## status mean_age
## <chr> <dbl>
## 1 full time 19
## 2 part time 41.5
## 3 unknown 10
36 / 72
filter() by group
▶ Within each group, keep rows with value larger than or equal to
the group mean.
## # A tibble: 1 x 4
## # Groups: status [1]
## name status age phones
## <chr> <chr> <dbl> <dbl>
## 1 Joe unknown 10 0
▶ Why are the full time and part time groups excluded from
the result?
37 / 72
mutate() by group
## # A tibble: 6 x 5
## # Groups: status [3]
## name status age phones sum_phones
## <chr> <chr> <dbl> <dbl> <dbl>
## 1 Alex full time 19 1 1
## 2 Jay part time 49 1 1
## 3 Cam part time 34 1 2
## 4 Lily part time NA 0 2
## 5 Haley full time NA 1 2
## 6 Joe unknown 10 0 0
38 / 72
arrange() by group
arrange(df4, desc(age), .by_group = TRUE)
## # A tibble: 6 x 4
## # Groups: status [3]
## name status age phones
## <chr> <chr> <dbl> <dbl>
## 1 Alex full time 19 1
## 2 Haley full time NA 1
## 3 Jay part time 49 1
## 4 Cam part time 34 1
## 5 Lily part time NA 0
## 6 Joe unknown 10 0
39 / 72
ungroup() after each group_by()
It is a good habit to use ungroup() at the end of a series of grouped
operations.
▶ Otherwise, the groupings will be carried in downstream analysis,
which is not always desirable.
## # A tibble: 6 x 4
## name status age phones
## <chr> <chr> <dbl> <dbl>
## 1 Alex full time 19 1
## 2 Jay part time 49 1
## 3 Cam part time 34 1
## 4 Lily part time NA 0
## 5 Haley full time NA 1
## 6 Joe unknown 10 0
40 / 72
▶ Notice what we do in the following code.
## # A tibble: 6 x 4
## name status age phones
## <chr> <chr> <dbl> <dbl>
## 1 Alex full time 19 1
## 2 Haley full time NA 1
## 3 Jay part time 49 1
## 4 Cam part time 34 1
## 5 Lily part time NA 0
## 6 Joe unknown 10 0
In our lectures, we will mainly use %>%, but it’s still good to know the
base-R pipe because you are likely to encounter it in wild-caught code.
42 / 72
Back to the starwars data
starwars_by_sex <- group_by(starwars, sex)
summarize(starwars_by_sex, mean_mass = mean(mass, na.rm = TRUE))
## # A tibble: 5 x 2
## sex mean_mass
## <chr> <dbl>
## 1 female 54.7
## 2 hermaphroditic 1358
## 3 male 80.2
## 4 none 69.8
## 5 <NA> 81
43 / 72
The pipe operator %>%
starwars %>%
filter(!is.na(sex)) %>%
group_by(sex) %>%
summarize(mean_mass = mean(mass))
## # A tibble: 4 x 2
## sex mean_mass
## <chr> <dbl>
## 1 female NA
## 2 hermaphroditic 1358
## 3 male NA
## 4 none NA
44 / 72
Revisit the monthly rainfall data
▶ Recall that last week, we used dplyr verbs to prepare the rainfall
data.
▶ Now you should be able to understand the syntax.
45 / 72
Useful summary() functions
Here are some useful summary functions that come with dplyr:
▶ Measures of center: mean(), median()
▶ Measures of spread: sd(), var(), IQR()
▶ Measures of range: min(), quantile(), max()
▶ Measures of positions: first(x), nth(x, 2), last(x)
▶ Measures of count: n(), n_distinct().
46 / 72
# Find the shortest character
starwars %>%
summarize(shortest = first(name, order_by = height))
## # A tibble: 1 x 1
## shortest
## <chr>
## 1 Yoda
## # A tibble: 3 x 2
## gender n
## <chr> <int>
## 1 feminine 17
## 2 masculine 66
## 3 <NA> 4
47 / 72
When we need to perform the same function(s) across a set of
columns, we can use the across() method.
▶ . . . applies the same transformation to multiple columns.
▶ Based on the column names or a certain condition.
48 / 72
The across() method
## # A tibble: 1 x 2
## height mass
## <int> <dbl>
## 1 66 15
## # A tibble: 1 x 3
## height mass birth_year
## <dbl> <dbl> <dbl>
## 1 175. 97.3 87.6
49 / 72
The across() method
▶ across() also works on mutate().
▶ You can type vignette("colwise") in your Console for more
details.
## # A tibble: 6 x 14
## name height mass hair_color skin_color eye_color birth_year s
## <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <
## 1 luke sky~ 172 77 blond fair blue 19 m
## 2 c-3po 167 75 <NA> gold yellow 112 n
## 3 r2-d2 96 32 <NA> white, bl~ red 33 n
## 4 darth va~ 202 136 none white yellow 41.9 m
## 5 leia org~ 150 49 brown light brown 19 f
## 6 owen lars 178 120 brown, gr~ light blue 52 m
## # i 5 more variables: homeworld <chr>, species <chr>, films <list>,
## # vehicles <list>, starships <list>
50 / 72
Dealing with missing value
51 / 72
Dealing with missing value
starwars %>%
mutate(across(where(is.character), na_if, "none")) %>%
na.omit()
52 / 72
Other useful functions
## # A tibble: 5 x 1
## sex
## <chr>
## 1 male
## 2 none
## 3 female
## 4 hermaphroditic
## 5 <NA>
53 / 72
▶ The count() function counts the number of occurrences.
## # A tibble: 5 x 2
## sex n
## <chr> <int>
## 1 female 16
## 2 hermaphroditic 1
## 3 male 60
## 4 none 6
## 5 <NA> 4
54 / 72
# Count occurrences of unique combinations of sex and species
starwars %>% count(sex, species)
## # A tibble: 41 x 3
## sex species n
## <chr> <chr> <int>
## 1 female Clawdite 1
## 2 female Human 9
## 3 female Kaminoan 1
## 4 female Mirialan 2
## 5 female Tholothian 1
## 6 female Togruta 1
## 7 female Twi’lek 1
## 8 hermaphroditic Hutt 1
## 9 male Aleena 1
## 10 male Besalisk 1
## # i 31 more rows
55 / 72
# ... and sort in descending order of occurrences
starwars %>% count(sex, species, sort = TRUE)
## # A tibble: 41 x 3
## sex species n
## <chr> <chr> <int>
## 1 male Human 26
## 2 female Human 9
## 3 none Droid 6
## 4 <NA> <NA> 4
## 5 male Gungan 3
## 6 female Mirialan 2
## 7 male Wookiee 2
## 8 male Zabrak 2
## 9 female Clawdite 1
## 10 female Kaminoan 1
## # i 31 more rows
56 / 72
▶ The rename() function helps s to rename variables.
▶ The syntax is df %>% rename(new_name = old_name).
## # A tibble: 3 x 14
## character_name height mass hair_color skin_color eye_color birth_
## <chr> <int> <dbl> <chr> <chr> <chr> <
## 1 Luke Skywalker 172 77 blond fair blue
## 2 C-3PO 167 75 <NA> gold yellow
## 3 R2-D2 96 32 <NA> white, blue red
## # i 6 more variables: gender <chr>, homeworld <chr>, species <chr>,
## # films <list>, vehicles <list>, starships <list>
57 / 72
▶ slice_head() and slice_tail() select the first or last rows.
▶ slice_min() and slice_max() select rows with the smallest or
largest values of a variable.
starwars %>%
group_by(gender) %>%
slice_max(mass, n = 1) %>%
relocate(gender, mass) # relocate column positions
## # A tibble: 3 x 14
## # Groups: gender [3]
## gender mass name height hair_color skin_color eye_color birth
## <chr> <dbl> <chr> <int> <chr> <chr> <chr>
## 1 feminine 75 Beru ~ 165 brown light blue
## 2 masculine 1358 Jabba~ 175 <NA> green-tan~ orange
## 3 <NA> 110 Jek T~ 180 brown fair blue
## # i 5 more variables: homeworld <chr>, species <chr>, films <list>,
## # vehicles <list>, starships <list>
58 / 72
The starwars tibble contains three columns of lists: films,
vehicles, and starships.
▶ The unnest() function can expand a list column of a tibble into
rows and columns.
## # A tibble: 6 x 2
## name films
## <chr> <chr>
## 1 Luke Skywalker A New Hope
## 2 Luke Skywalker The Empire Strikes Back
## 3 Luke Skywalker Return of the Jedi
## 4 Luke Skywalker Revenge of the Sith
## 5 Luke Skywalker The Force Awakens
## 6 C-3PO A New Hope
59 / 72
Case study: New York flights data
60 / 72
flights data
# install.packages("nycflights13")
library(nycflights13)
data(flights)
head(flights)
## # A tibble: 6 x 19
## year month day dep_time sched_dep_time dep_delay arr_time sched_
## <int> <int> <int> <int> <int> <dbl> <int>
## 1 2013 1 1 517 515 2 830
## 2 2013 1 1 533 529 4 850
## 3 2013 1 1 542 540 2 923
## 4 2013 1 1 544 545 -1 1004
## 5 2013 1 1 554 600 -6 812
## 6 2013 1 1 554 558 -4 740
## # i 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## # tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance
## # hour <dbl>, minute <dbl>, time_hour <dttm>
61 / 72
Subset observations by their values
62 / 72
Subset columns by their names
# 1
flights %>% select(dep_time, dep_delay, arr_time, arr_delay)
# 2
flights %>% select(starts_with("dep_"), starts_with("arr_"))
# 3
flights_JFK <- flights %>%
filter(origin == "JFK") %>%
select(year:day, dest, ends_with("delay"), distance, dep_time)
63 / 72
Extract hours and minutes from departure time
## # A tibble: 3 x 10
## year month day hour minute dest dep_delay arr_delay distance d
## <int> <int> <int> <dbl> <dbl> <chr> <dbl> <dbl> <dbl>
## 1 2013 1 1 5 42 MIA 2 33 1089
## 2 2013 1 1 5 44 BQN -1 -18 1576
## 3 2013 1 1 5 57 MCO -3 -8 944
64 / 72
Flight status
flights_JFK %>%
mutate(arr_status = ifelse(is.na(arr_delay), "cancelled",
ifelse(arr_delay > 0, "late", "on time")))
count(arr_status)
## # A tibble: 3 x 2
## arr_status n
## <chr> <int>
## 1 cancelled 2200
## 2 late 42885
## 3 on time 66194
65 / 72
Bucket flight status
flights_JFK %>%
mutate(arr_status =
case_when(is.na(arr_delay) ~ "cancelled",
arr_delay <= 0 ~ "on time",
arr_delay > 0 ~ "late")) %>%
count(arr_status)
▶ Much clearer code. Also more robust since it does not depend on
the order we list the conditions.
66 / 72
More on case_when()
67 / 72
Monthly mean departure delay
flights_JFK %>% group_by(month) %>%
summarize(mean_dep_delay = mean(dep_delay, na.rm = TRUE)) %>%
arrange(desc(mean_dep_delay))
## # A tibble: 12 x 2
## month mean_dep_delay
## <int> <dbl>
## 1 7 23.8
## 2 6 20.5
## 3 12 14.8
## 4 8 12.9
## 5 5 12.5
## 6 4 12.2
## 7 2 11.8
## 8 3 10.7
## 9 1 8.62
## 10 9 6.64
## 11 11 4.68
## 12 10 4.59
68 / 72
Destinations by the number of flights
## # A tibble: 4 x 2
## dest n
## <chr> <int>
## 1 BOS 5898
## 2 LAX 11262
## 3 MCO 5464
## 4 SFO 8204
69 / 72
Summary
This week we learn the key dplyr functions that solve the vast
majority of data manipulation challenges:
▶ Subset observations by their values: filter()
▶ Subset variables by their names: select()
▶ Create new variables: mutate()
▶ Reorder the rows: arrange()
▶ Collapse many values down to a single summary: summarize()
or summarise()
70 / 72
Summary
71 / 72
More on tidyverse
72 / 72