Week5 Slides

The document provides an introduction to data manipulation using the dplyr package in R, which is essential for data wrangling tasks such as filtering, selecting, mutating, and arranging data. It emphasizes the importance of transforming data into a suitable format for analysis and introduces key functions like filter(), select(), and mutate(). The document also illustrates the use of the starwars dataset to demonstrate these functions and their advantages over base R methods.

DSA2101

Essential Data Analytics Tools: Data Visualization

Yuting Huang

AY24/25

Week 5: Introduction to dplyr

1 / 72
What is data manipulation/wrangling?

“Data janitor work”

It is extremely rare that the data you obtain will be in precisely the
right format for the analysis that you wish to do.
Very often, we need some or all of the following:
▶ Select only a subset of rows/columns
▶ Create new variables or summaries
▶ Rename the variables
▶ Re-order the data
▶ ...
Artwork by Allison Horst.
2 / 72
What is data manipulation/data wrangling?

dplyr is a grammar of data manipulation, providing a set of functions
that help us solve the most common data manipulation challenges.

1. Data transformation
▶ filter(), select(), mutate(), arrange(), summarize()
▶ group_by() and %>%
2. Tidy data
▶ gather(), spread(), separate(), unite()
3. Relational data

3 / 72
Pre-requisites

The easiest way to get dplyr is to install the tidyverse package
(https://www.tidyverse.org/packages/).
▶ We will use the starwars data set, which ships with dplyr.

# install.packages("tidyverse")
library(tidyverse)
# Load data set from the package
data(starwars)

Artwork by Allison Horst.

4 / 72
Starwars data set
▶ glimpse() from dplyr allows us to examine the data and see as
much info as possible.

glimpse(starwars)

## Rows: 87
## Columns: 14
## $ name <chr> "Luke Skywalker", "C-3PO", "R2-D2", "Darth Vader",
## $ height <int> 172, 167, 96, 202, 150, 178, 165, 97, 183, 182, 18
## $ mass <dbl> 77.0, 75.0, 32.0, 136.0, 49.0, 120.0, 75.0, 32.0,
## $ hair_color <chr> "blond", NA, NA, "none", "brown", "brown, grey", "
## $ skin_color <chr> "fair", "gold", "white, blue", "white", "light", "
## $ eye_color <chr> "blue", "yellow", "red", "yellow", "brown", "blue"
## $ birth_year <dbl> 19.0, 112.0, 33.0, 41.9, 19.0, 52.0, 47.0, NA, 24.
## $ sex <chr> "male", "none", "none", "male", "female", "male",
## $ gender <chr> "masculine", "masculine", "masculine", "masculine"
## $ homeworld <chr> "Tatooine", "Tatooine", "Naboo", "Tatooine", "Alde
## $ species <chr> "Human", "Droid", "Droid", "Human", "Human", "Huma
## $ films <list> <"A New Hope", "The Empire Strikes Back", "Return
## $ vehicles <list> <"Snowspeeder", "Imperial Speeder Bike">, <>, <>,
## $ starships <list> <"X-wing", "Imperial shuttle">, <>, <>, "TIE Adva
5 / 72
Starwars data set
head(starwars)

## # A tibble: 6 x 14
## name height mass hair_color skin_color eye_color birth_year s
## <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <
## 1 Luke Sky~ 172 77 blond fair blue 19 m
## 2 C-3PO 167 75 <NA> gold yellow 112 n
## 3 R2-D2 96 32 <NA> white, bl~ red 33 n
## 4 Darth Va~ 202 136 none white yellow 41.9 m
## 5 Leia Org~ 150 49 brown light brown 19 f
## 6 Owen Lars 178 120 brown, gr~ light blue 52 m
## # i 5 more variables: homeworld <chr>, species <chr>, films <list>,
## # vehicles <list>, starships <list>

class(starwars)

## [1] "tbl_df" "tbl" "data.frame"

6 / 72
Tibbles

The output shows that starwars is not a plain data frame: it is a tibble (class tbl_df), which extends data.frame.

▶ Base R functions create or import data as data frames.
▶ data.frame(), read.csv(), etc.
▶ Tidyverse functions create or import data as tibbles.
▶ tibble(), readr::read_csv(), etc.
▶ A modern version of the data frame, particularly convenient for large data
sets.
▶ Easy to view the number of rows, columns, and variable types.
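To make the relationship between the two classes concrete, the following sketch converts a small base R data frame into a tibble. The objects df and tb are made-up toy data; as_tibble() comes from the tibble package, which is installed with the tidyverse.

```r
library(tibble)

# A base R data frame (hypothetical toy data)
df <- data.frame(x = 1:3, y = c("a", "b", "c"))

# Convert it to a tibble
tb <- as_tibble(df)

class(df)          # "data.frame"
class(tb)          # "tbl_df" "tbl" "data.frame"
is.data.frame(tb)  # TRUE: a tibble is still a data frame
```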

7 / 72
Key functions in dplyr

The following functions, and the combinations of them, will allow you
to accomplish the vast majority of data cleaning tasks.
▶ filter(): Select observations (rows) by the value in their
columns.
▶ select(): Select variables (columns) by their names.
▶ mutate(): Create new variables.
▶ arrange(): Reorder the rows by ascending or descending order.
▶ summarize() or summarise(): Collapse many values down to a
single value.

In conjunction with group_by(), which splits a data set by values in
a variable, these functions help us deal with common data
manipulation challenges.

8 / 72
Applying dplyr functions

Each of these functions is called in an identical manner.

▶ The first argument is the data frame (tibble).
▶ The subsequent arguments describe what to do with the data
frame, using variable names without quotes.
▶ The output is a new data frame.
▶ The original data frame is not modified.

These operations can be chained using the pipe operator %>%.

9 / 72
Let’s get started!

Artwork by Allison Horst.

10 / 72
The filter() function

The filter() function subsets the rows in a data frame by testing each row
against a conditional statement.
The output will be a data set with the same columns and, typically, fewer rows than the original data.

11 / 72
filter(starwars, sex == "female")

## # A tibble: 16 x 14
## name height mass hair_color skin_color eye_color birth_year s
## <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <
## 1 Leia Or~ 150 49 brown light brown 19 f
## 2 Beru Wh~ 165 75 brown light blue 47 f
## 3 Mon Mot~ 150 NA auburn fair blue 48 f
## 4 Padmé A~ 185 45 brown light brown 46 f
## 5 Shmi Sk~ 163 NA black fair brown 72 f
## 6 Ayla Se~ 178 55 none blue hazel 48 f
## 7 Adi Gal~ 184 50 none dark blue NA f
## 8 Luminar~ 170 56.2 black yellow blue 58 f
## 9 Barriss~ 166 50 black yellow blue 40 f
## 10 Dormé 165 NA brown light brown NA f
## 11 Zam Wes~ 168 55 blonde fair, gre~ yellow NA f
## 12 Taun We 213 NA none grey black NA f
## 13 Jocasta~ 167 NA white fair blue NA f
## 14 Shaak Ti 178 57 none red, blue~ black NA f
## 15 Rey NA NA brown light hazel NA f
## 16 Captain~ NA NA none none unknown NA f
## # i 5 more variables: homeworld <chr>, species <chr>, films <list>,
## # vehicles <list>, starships <list>
12 / 72
The filter() function

filter(starwars, sex == "female")

filter() allows us to keep rows based on a specified condition.

▶ The first argument is the name of the data frame.
▶ The second and subsequent arguments are conditions that must
be true to keep the row(s).

When we run filter(), the function executes the filtering operation
and prints the result.
▶ It does not modify the original data frame. We need to assign the
output to a new object in order to save it.
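A minimal sketch of the difference between printing and saving a filtered result. The object name starwars_female is made up for illustration.

```r
library(dplyr)

n_before <- nrow(starwars)

# Printing only: the result is shown and then discarded
filter(starwars, sex == "female")

# Assigning: the result is stored in a new object
starwars_female <- filter(starwars, sex == "female")

nrow(starwars) == n_before        # TRUE: the original tibble is unchanged
nrow(starwars_female) < n_before  # TRUE: the new object holds only the subset
```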

13 / 72
Logical operators

▶ We can combine conditions with & or , to indicate and (check for
both conditions).
▶ . . . with | to indicate or (check for either condition).
▶ To filter all female or male characters whose skin color is blue:

filter(starwars,
(sex == "female" | sex == "male"), skin_color == "blue")

## # A tibble: 2 x 14
## name height mass hair_color skin_color eye_color birth_year s
## <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <
## 1 Ayla Sec~ 178 55 none blue hazel 48 f
## 2 Mas Amed~ 196 NA none blue blue NA m
## # i 5 more variables: homeworld <chr>, species <chr>, films <list>,
## # vehicles <list>, starships <list>

14 / 72
Logical operators

The %in% operator checks whether each value is one of the values in a vector
constructed with c().
▶ A useful shortcut when we need to check a column against several
possible values.

filter(starwars,
sex %in% c("female", "male"), skin_color == "blue")

▶ The code reads:

Give me the characters in starwars with blue-color skin,
whose sex is either “female” or “male”.

15 / 72
Compared to base R functions
▶ Base R uses the bracket method to select rows that satisfy
certain conditions.

# Base R
starwars[((starwars$sex == "female" | starwars$sex == "male") &
starwars$skin_color == "blue"), ]

▶ You can see the advantage of the dplyr syntax:

# dplyr method 1
filter(starwars,
(sex == "female" | sex == "male"), skin_color == "blue")

# dplyr method 2
filter(starwars,
sex %in% c("female", "male"), skin_color == "blue")
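Beyond readability, the two approaches can differ when the condition evaluates to NA. The sketch below uses a made-up data frame (toy) with one missing value in sex: filter() drops the row whose condition is NA, while bracket subsetting returns an extra row filled with NA.

```r
library(dplyr)

# Hypothetical toy data with one missing value in sex
toy <- data.frame(name = c("a", "b", "c"),
                  sex = c("female", NA, "male"),
                  skin_color = c("blue", "blue", "green"))

# dplyr: rows for which the condition is NA are dropped
by_filter <- filter(toy, sex %in% c("female", "male"), skin_color == "blue")

# Base R: an NA in the logical index produces a row filled with NA
by_bracket <- toy[(toy$sex == "female" | toy$sex == "male") &
                    toy$skin_color == "blue", ]

nrow(by_filter)   # 1
nrow(by_bracket)  # 2 (row "a" plus an all-NA row)
```

Wrapping the base R condition in which() is one way to drop the NA row.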

16 / 72
The select() function

The select() function returns a subset of columns.

The output will be a data set with fewer columns than the original
data.

17 / 72
The select() function

It is not uncommon to get data sets with hundreds (or even
thousands) of variables.
▶ When we want to zoom in on a particular set of variables, we can
use select().
▶ To select columns by name:

select(starwars, hair_color, birth_year)

▶ Compared to the base R bracket method

starwars[ , c("hair_color", "birth_year")]

18 / 72
The select() function

Compared to the base R method, the dplyr verbs are much more
flexible.
▶ To select columns located between hair_color and eye_color
(inclusive).

select(starwars, hair_color:eye_color)

▶ To select columns except those located between hair_color and
eye_color (inclusive).

select(starwars, -(hair_color:eye_color))

19 / 72
Helper functions

There are a number of helper functions you can use within select():
▶ starts_with("abc") matches column names that begin with
“abc”.
▶ ends_with("xyz") matches column names that end with “xyz”.
▶ contains("ijk") matches column names that contain “ijk”.
▶ num_range("x", 1:3) matches columns x1, x2, and x3.

For example, to select all columns that end with color.

select(starwars, ends_with("color"))
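The starwars data has no numbered columns, so the sketch below tries the helpers on a made-up tibble (toy) whose column names are chosen purely for illustration.

```r
library(dplyr)

# Hypothetical toy data with numbered and suffixed column names
toy <- tibble(id = 1:2,
              x1 = c(0.1, 0.2), x2 = c(1, 2), x3 = c(10, 20),
              hair_color = c("brown", "none"),
              eye_color = c("blue", "red"))

names(select(toy, starts_with("x")))     # "x1" "x2" "x3"
names(select(toy, ends_with("color")))   # "hair_color" "eye_color"
names(select(toy, contains("_")))        # "hair_color" "eye_color"
names(select(toy, num_range("x", 2:3)))  # "x2" "x3"
```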

20 / 72
The mutate() function

The mutate() function adds new columns of data, thus “mutating”
the dimensions of the original data set.
The output will be a data frame with more columns than the original
data.

▶ By default, mutate() adds the new columns to the right hand
side of the data frame.

21 / 72
The mutate() function

Let us first create a new data frame, df1, with fewer columns so we
can see the manipulation results more easily.

df1 <- select(starwars, name, height, mass, species)

head(df1)

## # A tibble: 6 x 4
## name height mass species
## <chr> <int> <dbl> <chr>
## 1 Luke Skywalker 172 77 Human
## 2 C-3PO 167 75 Droid
## 3 R2-D2 96 32 Droid
## 4 Darth Vader 202 136 Human
## 5 Leia Organa 150 49 Human
## 6 Owen Lars 178 120 Human

22 / 72
The mutate() function

The following code creates a new column to the right of the original
data frame.

df2 <- mutate(df1, height_m = height/100)

head(df2, 3)

## # A tibble: 3 x 5
## name height mass species height_m
## <chr> <int> <dbl> <chr> <dbl>
## 1 Luke Skywalker 172 77 Human 1.72
## 2 C-3PO 167 75 Droid 1.67
## 3 R2-D2 96 32 Droid 0.96

▶ What is the corresponding command in base R?
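One possible base R answer is sketched below: copy the data frame and assign the new column with $. The name df2_base is made up for illustration.

```r
library(dplyr)

df1 <- select(starwars, name, height, mass, species)

# Base R: copy the data, then assign a new column with $
df2_base <- df1
df2_base$height_m <- df2_base$height / 100

# dplyr version, for comparison
df2_dplyr <- mutate(df1, height_m = height / 100)

identical(df2_base$height_m, df2_dplyr$height_m)  # TRUE
```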

23 / 72
The mutate() function
By default, mutate() adds new columns to the right of the data
frame.
▶ We can use the .before argument to add the variable before an
existing variable.

df2 <- mutate(df1, height_m = height/100, .before = name)

head(df2, 3)

## # A tibble: 3 x 5
## height_m name height mass species
## <dbl> <chr> <int> <dbl> <chr>
## 1 1.72 Luke Skywalker 172 77 Human
## 2 1.67 C-3PO 167 75 Droid
## 3 0.96 R2-D2 96 32 Droid

▶ We can also add the new variable after name by .after = name.

24 / 72
The arrange() function

The arrange() function changes the order of observations in a
data frame.

▶ It takes a data frame and a set of column names to order by.

▶ If you provide more than one column name, each additional
column will be used to break ties in the values of preceding ones.

25 / 72
The arrange() function
▶ By default, the function arranges observations in ascending order
of the provided variable.

arrange(df1, mass)

## # A tibble: 87 x 4
## name height mass species
## <chr> <int> <dbl> <chr>
## 1 Ratts Tyerel 79 15 Aleena
## 2 Yoda 66 17 Yoda’s species
## 3 Wicket Systri Warrick 88 20 Ewok
## 4 R2-D2 96 32 Droid
## 5 R5-D4 97 32 Droid
## 6 Sebulba 112 40 Dug
## 7 Padmé Amidala 185 45 Human
## 8 Dud Bolt 94 45 Vulptereen
## 9 Wat Tambor 193 48 Skakoan
## 10 Sly Moore 178 48 <NA>
## # i 77 more rows
26 / 72
The arrange() function
▶ To arrange a column in descending order, wrap it in the desc()
function inside arrange().

arrange(df1, desc(mass))

## # A tibble: 87 x 4
## name height mass species
## <chr> <int> <dbl> <chr>
## 1 Jabba Desilijic Tiure 175 1358 Hutt
## 2 Grievous 216 159 Kaleesh
## 3 IG-88 200 140 Droid
## 4 Darth Vader 202 136 Human
## 5 Tarfful 234 136 Wookiee
## 6 Owen Lars 178 120 Human
## 7 Bossk 190 113 Trandoshan
## 8 Chewbacca 228 112 Wookiee
## 9 Jek Tono Porkins 180 110 <NA>
## 10 Dexter Jettster 198 102 Besalisk
## # i 77 more rows
27 / 72
Compared to base R functions

▶ To arrange the data set in ascending order of mass,

df1[order(df1$mass), ]

▶ To do so in descending order of mass,

df1[order(-df1$mass), ]

28 / 72
arrange() multiple columns
▶ To arrange the data first by mass (in ascending order), then by
height (in descending order):

arrange(df1, mass, desc(height))

## # A tibble: 87 x 4
## name height mass species
## <chr> <int> <dbl> <chr>
## 1 Ratts Tyerel 79 15 Aleena
## 2 Yoda 66 17 Yoda’s species
## 3 Wicket Systri Warrick 88 20 Ewok
## 4 R5-D4 97 32 Droid
## 5 R2-D2 96 32 Droid
## 6 Sebulba 112 40 Dug
## 7 Padmé Amidala 185 45 Human
## 8 Dud Bolt 94 45 Vulptereen
## 9 Wat Tambor 193 48 Skakoan
## 10 Sly Moore 178 48 <NA>
## # i 77 more rows
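For comparison, a base R sketch of the same two-column sort, using a small made-up tibble (toy) so that the tie on mass is easy to see. Negating a column inside order() only works for numeric columns.

```r
library(dplyr)

# Hypothetical toy data: "a" and "c" tie on mass
toy <- tibble(name = c("a", "b", "c", "d"),
              mass = c(32, 15, 32, NA),
              height = c(96, 79, 97, 100))

by_dplyr <- arrange(toy, mass, desc(height))
by_base <- toy[order(toy$mass, -toy$height), ]

by_dplyr$name                            # "b" "c" "a" "d"
identical(by_dplyr$name, by_base$name)   # TRUE
```

Both approaches place the row with a missing mass last.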
29 / 72
The summarize() function

The summarize(), or summarise(), function creates individual
summary statistics from large data sets.

30 / 72
The summarize() function

To compute the average height for Star Wars characters:

summarize(starwars, height = mean(height, na.rm = TRUE))

## # A tibble: 1 x 1
## height
## <dbl>
## 1 175.

▶ na.rm = TRUE removes NA values from the calculation.

▶ The output data set collapses to a 1 × 1 tibble, containing the
mean height of all Star Wars characters.
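The effect of na.rm is easiest to see on a tiny made-up tibble (toy) with one missing height:

```r
library(dplyr)

# Hypothetical toy data with one missing value
toy <- tibble(height = c(172, 167, NA))

without_rm <- summarize(toy, mean_height = mean(height))
with_rm <- summarize(toy, mean_height = mean(height, na.rm = TRUE))

without_rm$mean_height  # NA
with_rm$mean_height     # 169.5
```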

31 / 72
The summarize() function

summarize() returns a single row summarizing all observations in the
input.
▶ Not particularly useful on its own.
▶ More often, we are interested in group summaries:
▶ Academic performance for students by school and major.
▶ Gross monthly income for fresh grads by university and year.

When paired with group_by(), we can change the unit of analysis
from the entire data set to individual groups.

32 / 72
The group_by() function

The group_by() function changes the unit of analysis from the entire
data set to individual groups.
▶ It has little effect on select(), which simply keeps the grouping variables.
▶ The filter() and mutate() functions work within each group.
▶ The arrange() function ignores groupings by default. We can
sort within groups by setting .by_group = TRUE.
▶ When paired with summarize(), we can compute group
summary statistics.

33 / 72
Example
Let us first use a simple data frame to understand the concepts.

df3 <- tibble(name = c("Alex", "Jay", "Cam", "Lily", "Haley", "Joe"),
              status = c("full time", "part time", "part time",
                         "part time", "full time", "unknown"),
              age = c(19, 49, 34, NA, NA, 10),
              phones = c(1, 1, 1, 0, 1, 0))
df3

## # A tibble: 6 x 4
## name status age phones
## <chr> <chr> <dbl> <dbl>
## 1 Alex full time 19 1
## 2 Jay part time 49 1
## 3 Cam part time 34 1
## 4 Lily part time NA 0
## 5 Haley full time NA 1
## 6 Joe unknown 10 0

34 / 72
Create a group

▶ Create a group using the character values in status.

▶ Notice the second line in the code output: Groups: status [3].

df4 <- group_by(df3, status)

df4

## # A tibble: 6 x 4
## # Groups: status [3]
## name status age phones
## <chr> <chr> <dbl> <dbl>
## 1 Alex full time 19 1
## 2 Jay part time 49 1
## 3 Cam part time 34 1
## 4 Lily part time NA 0
## 5 Haley full time NA 1
## 6 Joe unknown 10 0

35 / 72
summarize() by group

▶ Compute group mean and name the new variable as mean_age.

summarize(df4, mean_age = mean(age, na.rm = TRUE))

## # A tibble: 3 x 2
## status mean_age
## <chr> <dbl>
## 1 full time 19
## 2 part time 41.5
## 3 unknown 10

▶ What happens if we remove na.rm = TRUE?
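A quick way to answer the question is to run the summary without na.rm. The sketch rebuilds the relevant columns of df3 so that it runs on its own; no_rm is a made-up object name.

```r
library(dplyr)

df3 <- tibble(status = c("full time", "part time", "part time",
                         "part time", "full time", "unknown"),
              age = c(19, 49, 34, NA, NA, 10))
df4 <- group_by(df3, status)

# Any group that contains an NA now has an NA mean
no_rm <- summarize(df4, mean_age = mean(age))
no_rm$mean_age  # NA NA 10
```

Only the unknown group, which has no missing ages, keeps a numeric mean.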

36 / 72
filter() by group

▶ Within each group, keep rows with value larger than or equal to
the group mean.

filter(df4, age >= mean(age))

## # A tibble: 1 x 4
## # Groups: status [1]
## name status age phones
## <chr> <chr> <dbl> <dbl>
## 1 Joe unknown 10 0

▶ Why are the full time and part time groups excluded from
the result?
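The groups disappear because mean(age) is NA whenever a group contains a missing age, so the comparison is NA for every row in that group and filter() drops those rows. A sketch of one possible fix, adding na.rm = TRUE to the group mean (kept is a made-up object name):

```r
library(dplyr)

df3 <- tibble(name = c("Alex", "Jay", "Cam", "Lily", "Haley", "Joe"),
              status = c("full time", "part time", "part time",
                         "part time", "full time", "unknown"),
              age = c(19, 49, 34, NA, NA, 10))
df4 <- group_by(df3, status)

# Compare each age with the group mean computed from the non-missing ages
kept <- filter(df4, age >= mean(age, na.rm = TRUE))
kept$name  # "Alex" "Jay" "Joe"
```

Rows whose own age is NA (Lily and Haley) are still dropped, because their comparison remains NA.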

37 / 72
mutate() by group

▶ Add a column that calculates the cumulative sum of phones within each
group.

# Cumulative number of phones within each status group
mutate(df4, sum_phones = cumsum(phones))

## # A tibble: 6 x 5
## # Groups: status [3]
## name status age phones sum_phones
## <chr> <chr> <dbl> <dbl> <dbl>
## 1 Alex full time 19 1 1
## 2 Jay part time 49 1 1
## 3 Cam part time 34 1 2
## 4 Lily part time NA 0 2
## 5 Haley full time NA 1 2
## 6 Joe unknown 10 0 0

38 / 72
arrange() by group
arrange(df4, desc(age), .by_group = TRUE)

## # A tibble: 6 x 4
## # Groups: status [3]
## name status age phones
## <chr> <chr> <dbl> <dbl>
## 1 Alex full time 19 1
## 2 Haley full time NA 1
## 3 Jay part time 49 1
## 4 Cam part time 34 1
## 5 Lily part time NA 0
## 6 Joe unknown 10 0

▶ By default, arrange() ignores grouping.

▶ We specify .by_group = TRUE to sort the data within each
pre-defined group.
▶ Also, notice that NAs are sorted to the end of each group.

39 / 72
ungroup() after each group_by()
It is a good habit to use ungroup() at the end of a series of grouped
operations.
▶ Otherwise, the groupings will be carried in downstream analysis,
which is not always desirable.

df5 <- ungroup(df4)

df5
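To see what carrying the groupings downstream means in practice, the sketch below compares the same summary on a grouped and an ungrouped copy of a small made-up tibble. group_vars() reports the current grouping variables.

```r
library(dplyr)

# Hypothetical toy data
df3 <- tibble(status = c("full time", "part time", "part time"),
              phones = c(1, 1, 0))
df4 <- group_by(df3, status)
df5 <- ungroup(df4)

group_vars(df4)  # "status"
group_vars(df5)  # character(0)

# The same call gives one row per group, or a single overall row
nrow(summarize(df4, total = sum(phones)))  # 2
nrow(summarize(df5, total = sum(phones)))  # 1
```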

40 / 72
▶ Notice what we do in the following code.

df3 %>% group_by(status) %>%
  arrange(desc(age), .by_group = TRUE) %>%
  ungroup()

## # A tibble: 6 x 4
## name status age phones
## <chr> <chr> <dbl> <dbl>
## 1 Alex full time 19 1
## 2 Haley full time NA 1
## 3 Jay part time 49 1
## 4 Cam part time 34 1
## 5 Lily part time NA 0
## 6 Joe unknown 10 0

▶ %>% is called the forward pipe operator.

▶ Essentially, we “pipe” df3 forward into group_by(), and then
the subsequent output from group_by() into arrange().
▶ At the end, we ungroup().
41 / 72
The pipe operator

%>% simplifies our code and makes it cleaner and more readable.

▶ Introduced by the magrittr package in 2014 and made
available when we load the tidyverse.
▶ In 2021, base R (4.1.0) introduced its own pipe: |>

In our lectures, we will mainly use %>%, but it’s still good to know the
base-R pipe because you are likely to encounter it in wild-caught code.
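For simple chains like the ones in these slides, the two pipes can be swapped freely. A small sketch, assuming R 4.1.0 or later and a made-up toy tibble:

```r
library(dplyr)

# Hypothetical toy data
toy <- tibble(status = c("full time", "part time", "part time"),
              age = c(19, 49, 34))

# magrittr pipe
res_magrittr <- toy %>%
  group_by(status) %>%
  summarize(mean_age = mean(age))

# Base R pipe (requires R >= 4.1.0)
res_native <- toy |>
  group_by(status) |>
  summarize(mean_age = mean(age))

isTRUE(all.equal(res_magrittr, res_native))  # TRUE
```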

42 / 72
Back to the starwars data
starwars_by_sex <- group_by(starwars, sex)
summarize(starwars_by_sex, mean_mass = mean(mass, na.rm = TRUE))

## # A tibble: 5 x 2
## sex mean_mass
## <chr> <dbl>
## 1 female 54.7
## 2 hermaphroditic 1358
## 3 male 80.2
## 4 none 69.8
## 5 <NA> 81

▶ The following is an equivalent approach using %>%

starwars_by_sex <- starwars %>%
  group_by(sex) %>%
  summarize(mean_mass = mean(mass, na.rm = TRUE))

43 / 72
The pipe operator %>%

▶ To remove the missing values (NA) in sex before summarizing:

starwars %>%
filter(!is.na(sex)) %>%
group_by(sex) %>%
summarize(mean_mass = mean(mass))

## # A tibble: 4 x 2
## sex mean_mass
## <chr> <dbl>
## 1 female NA
## 2 hermaphroditic 1358
## 3 male NA
## 4 none NA
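The NA values in this output come from missing values in mass, not in sex: filtering on sex does not remove rows whose mass is missing. A sketch of the same pipeline with na.rm = TRUE added, which should reproduce the group means shown on the earlier slide (mass_by_sex is a made-up object name):

```r
library(dplyr)

mass_by_sex <- starwars %>%
  filter(!is.na(sex)) %>%                         # drop rows with unknown sex
  group_by(sex) %>%
  summarize(mean_mass = mean(mass, na.rm = TRUE)) # ignore missing masses

mass_by_sex
```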

44 / 72
Revisit the monthly rainfall data

▶ Recall that last week, we used dplyr verbs to prepare the rainfall
data.
▶ Now you should be able to understand the syntax.

rainfall <- read.csv("../data/rainfall.csv",
                     header = TRUE, stringsAsFactors = TRUE)
# year() is qualified so the code does not rely on lubridate being attached
df <- rainfall %>%
  mutate(month = lubridate::ym(month),
         year = lubridate::year(month),
         total_rainfall = as.numeric(total_rainfall)) %>%
  filter(year >= 2020)

45 / 72
Useful summary functions

Here are some useful summary functions to use inside summarize() (first(), nth(), last(), n(), and n_distinct() come from dplyr; the others ship with base R):
▶ Measures of center: mean(), median()
▶ Measures of spread: sd(), var(), IQR()
▶ Measures of range: min(), quantile(), max()
▶ Measures of position: first(x), nth(x, 2), last(x)
▶ Measures of count: n(), n_distinct().
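Several of these functions can be combined in a single summarize() call. A small sketch on a made-up tibble (toy); the column and group names are hypothetical.

```r
library(dplyr)

# Hypothetical toy data: group "a" has 2 rows, group "b" has 3
toy <- tibble(g = c("a", "a", "b", "b", "b"),
              x = c(3, 1, 5, 5, 2))

res <- toy %>%
  group_by(g) %>%
  summarize(n = n(),                  # number of rows per group
            n_unique = n_distinct(x), # number of distinct values
            first_x = first(x),
            second_x = nth(x, 2),
            last_x = last(x),
            median_x = median(x))

res$n         # 2 3
res$n_unique  # 2 2
res$median_x  # 2 5
```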

46 / 72
# Find the shortest character
starwars %>%
summarize(shortest = first(name, order_by = height))

## # A tibble: 1 x 1
## shortest
## <chr>
## 1 Yoda

# Count the number of characters by gender

starwars %>%
group_by(gender) %>%
summarize(n = n())

## # A tibble: 3 x 2
## gender n
## <chr> <int>
## 1 feminine 17
## 2 masculine 66
## 3 <NA> 4

47 / 72
When we need to perform the same function(s) across a set of
columns, we can use the across() function.
▶ . . . applies the same transformation to multiple columns.
▶ Columns are chosen by name or by a condition on their type.

Artwork by Allison Horst.

48 / 72
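The across() function

# An anonymous function passes na.rm = TRUE; supplying extra arguments
# through across(...) is deprecated since dplyr 1.1.0
starwars %>% summarize(across(height:mass, \(x) min(x, na.rm = TRUE)))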

## # A tibble: 1 x 2
## height mass
## <int> <dbl>
## 1 66 15

starwars %>% summarize(across(where(is.numeric), \(x) mean(x, na.rm = TRUE)))

## # A tibble: 1 x 3
## height mass birth_year
## <dbl> <dbl> <dbl>
## 1 175. 97.3 87.6
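across() can also apply several functions at once when given a named list, with the .names argument controlling the output column names. A sketch on a made-up tibble (toy), assuming R 4.1.0 or later for the \(x) shorthand:

```r
library(dplyr)

# Hypothetical toy data
toy <- tibble(height = c(172, 96, NA),
              mass = c(77, 32, 136))

res <- toy %>%
  summarize(across(c(height, mass),
                   list(min = \(x) min(x, na.rm = TRUE),
                        max = \(x) max(x, na.rm = TRUE)),
                   .names = "{.col}_{.fn}"))

names(res)  # "height_min" "height_max" "mass_min" "mass_max"
```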

49 / 72
The across() function
▶ across() also works inside mutate().
▶ You can type vignette("colwise") in your Console for more
details.

starwars %>% mutate(across(where(is.character), tolower)) %>% head()

## # A tibble: 6 x 14
## name height mass hair_color skin_color eye_color birth_year s
## <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <
## 1 luke sky~ 172 77 blond fair blue 19 m
## 2 c-3po 167 75 <NA> gold yellow 112 n
## 3 r2-d2 96 32 <NA> white, bl~ red 33 n
## 4 darth va~ 202 136 none white yellow 41.9 m
## 5 leia org~ 150 49 brown light brown 19 f
## 6 owen lars 178 120 brown, gr~ light blue 52 m
## # i 5 more variables: homeworld <chr>, species <chr>, films <list>,
## # vehicles <list>, starships <list>

50 / 72
Dealing with missing values

▶ There are missing values in the starwars tibble.

# Remove every row that contains at least one missing value

starwars %>% na.omit()

# Remove missing values in a specified column

starwars %>% filter(!is.na(gender))

51 / 72
Dealing with missing values

▶ Sometimes a missing value is not coded as NA.

▶ Missing values can be represented by a number (e.g., -1, 999), a
string (e.g., none, unknown), or just an empty cell.
▶ The following code converts "none" to NA across all character columns:

starwars %>% mutate(across(where(is.character), \(x) na_if(x, "none")))

▶ After converting the value to NA, we can remove the rows with missing values
with na.omit().

starwars %>%
  mutate(across(where(is.character), \(x) na_if(x, "none"))) %>%
  na.omit()
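The same idea works for numeric sentinel values. The sketch below uses a made-up tibble (toy) in which "none" and -1 both stand for missing; the column names are hypothetical.

```r
library(dplyr)

# Hypothetical toy data: "none" and -1 both mean "missing"
toy <- tibble(name = c("a", "b", "c"),
              hair_color = c("brown", "none", NA),
              visits = c(3, -1, 5))

cleaned <- toy %>%
  mutate(across(where(is.character), \(x) na_if(x, "none")),
         visits = na_if(visits, -1))

cleaned$hair_color      # "brown" NA NA
cleaned$visits          # 3 NA 5
nrow(na.omit(cleaned))  # 1: only row "a" is complete
```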

52 / 72
Other useful functions

▶ distinct() finds all unique rows in a data set.

# Remove duplicated rows, if any

starwars %>% distinct()

# Find all unique categories of sex

starwars %>% distinct(sex)

## # A tibble: 5 x 1
## sex
## <chr>
## 1 male
## 2 none
## 3 female
## 4 hermaphroditic
## 5 <NA>

53 / 72
▶ The count() function counts the number of occurrences.

# Count occurrences of unique sex

starwars %>% count(sex)

## # A tibble: 5 x 2
## sex n
## <chr> <int>
## 1 female 16
## 2 hermaphroditic 1
## 3 male 60
## 4 none 6
## 5 <NA> 4
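count() can be read as shorthand for grouping followed by a row count. A sketch of the equivalence on a made-up tibble (toy):

```r
library(dplyr)

# Hypothetical toy data
toy <- tibble(sex = c("male", "female", "male", NA))

by_count <- toy %>% count(sex)

by_summary <- toy %>%
  group_by(sex) %>%
  summarize(n = n())

by_count$n                           # 1 2 1 (female, male, NA)
identical(by_count$n, by_summary$n)  # TRUE
```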

54 / 72
# Count occurrences of unique combinations of sex and species
starwars %>% count(sex, species)

## # A tibble: 41 x 3
## sex species n
## <chr> <chr> <int>
## 1 female Clawdite 1
## 2 female Human 9
## 3 female Kaminoan 1
## 4 female Mirialan 2
## 5 female Tholothian 1
## 6 female Togruta 1
## 7 female Twi’lek 1
## 8 hermaphroditic Hutt 1
## 9 male Aleena 1
## 10 male Besalisk 1
## # i 31 more rows

55 / 72
# ... and sort in descending order of occurrences
starwars %>% count(sex, species, sort = TRUE)

## # A tibble: 41 x 3
## sex species n
## <chr> <chr> <int>
## 1 male Human 26
## 2 female Human 9
## 3 none Droid 6
## 4 <NA> <NA> 4
## 5 male Gungan 3
## 6 female Mirialan 2
## 7 male Wookiee 2
## 8 male Zabrak 2
## 9 female Clawdite 1
## 10 female Kaminoan 1
## # i 31 more rows

56 / 72
▶ The rename() function helps us rename variables.
▶ The syntax is df %>% rename(new_name = old_name).

# Rename "name" as "character_name"

starwars %>% rename(character_name = name) %>% head(3)

## # A tibble: 3 x 14
## character_name height mass hair_color skin_color eye_color birth_
## <chr> <int> <dbl> <chr> <chr> <chr> <
## 1 Luke Skywalker 172 77 blond fair blue
## 2 C-3PO 167 75 <NA> gold yellow
## 3 R2-D2 96 32 <NA> white, blue red
## # i 6 more variables: gender <chr>, homeworld <chr>, species <chr>,
## # films <list>, vehicles <list>, starships <list>

57 / 72
▶ slice_head() and slice_tail() select the first or last rows.
▶ slice_min() and slice_max() select rows with the smallest or
largest values of a variable.

starwars %>%
group_by(gender) %>%
slice_max(mass, n = 1) %>%
relocate(gender, mass) # relocate column positions

## # A tibble: 3 x 14
## # Groups: gender [3]
## gender mass name height hair_color skin_color eye_color birth
## <chr> <dbl> <chr> <int> <chr> <chr> <chr>
## 1 feminine 75 Beru ~ 165 brown light blue
## 2 masculine 1358 Jabba~ 175 <NA> green-tan~ orange
## 3 <NA> 110 Jek T~ 180 brown fair blue
## # i 5 more variables: homeworld <chr>, species <chr>, films <list>,
## # vehicles <list>, starships <list>
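One detail worth knowing: slice_min() and slice_max() keep tied rows by default, so they can return more than n rows per group. A sketch on a made-up tibble (toy):

```r
library(dplyr)

# Hypothetical toy data: group "a" has two rows tied for the maximum
toy <- tibble(g = c("a", "a", "b"),
              x = c(1, 1, 2))

with_ties <- toy %>% group_by(g) %>% slice_max(x, n = 1)
no_ties <- toy %>% group_by(g) %>% slice_max(x, n = 1, with_ties = FALSE)

nrow(with_ties)  # 3: both tied rows of group "a" are kept
nrow(no_ties)    # 2: exactly one row per group
```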

58 / 72
The starwars tibble contains three list columns: films,
vehicles, and starships.
▶ The unnest() function from tidyr can expand a list column of a tibble into
rows and columns.

starwars_films <- starwars %>%
  select(name, films) %>%
  unnest(films)
head(starwars_films)

## # A tibble: 6 x 2
## name films
## <chr> <chr>
## 1 Luke Skywalker A New Hope
## 2 Luke Skywalker The Empire Strikes Back
## 3 Luke Skywalker Return of the Jedi
## 4 Luke Skywalker Revenge of the Sith
## 5 Luke Skywalker The Force Awakens
## 6 C-3PO A New Hope
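Once a list column is unnested, the usual verbs apply to it. The sketch below builds a tiny made-up tibble with a list column (the film assignments are illustrative, not the real data) and counts appearances per film.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data with a list column
toy <- tibble(name = c("Luke", "C-3PO"),
              films = list(c("A New Hope", "Return of the Jedi"),
                           "A New Hope"))

long <- toy %>% unnest(films)  # one row per character-film pair
nrow(long)                     # 3

film_counts <- long %>% count(films, sort = TRUE)
film_counts$films[1]           # "A New Hope"
film_counts$n[1]               # 2
```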

59 / 72
Case study: New York flights data

▶ The nycflights13::flights data contain all 336,776 flights
that departed from New York City in 2013.
▶ You can read its documentation in ?flights.
▶ We will use it to practice dplyr skills.

60 / 72
flights data

# install.packages("nycflights13")
library(nycflights13)
data(flights)
head(flights)

## # A tibble: 6 x 19
## year month day dep_time sched_dep_time dep_delay arr_time sched_
## <int> <int> <int> <int> <int> <dbl> <int>
## 1 2013 1 1 517 515 2 830
## 2 2013 1 1 533 529 4 850
## 3 2013 1 1 542 540 2 923
## 4 2013 1 1 544 545 -1 1004
## 5 2013 1 1 554 600 -6 812
## 6 2013 1 1 554 558 -4 740
## # i 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## # tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance
## # hour <dbl>, minute <dbl>, time_hour <dttm>

61 / 72
Subset observations by their values

▶ Select all flights on February 14.

flights %>% filter(month == 2, day == 14)

Exercises: Find all flights in 2013 that

▶ were delayed (on arrival or departure) by more than two hours.
▶ departed in summer (July, August, and September).
▶ had a missing dep_time.
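One possible way to approach these exercises is sketched below on a small made-up tibble (toy_flights) that borrows the column names of flights; the same filter() calls should carry over to the real data. Delays are assumed to be recorded in minutes, so two hours is 120.

```r
library(dplyr)

# Hypothetical toy data using the flights column names
toy_flights <- tibble(month = c(1, 7, 8, 9, 12),
                      dep_time = c(517, NA, 1830, 600, 2359),
                      dep_delay = c(2, NA, 130, -3, 15),
                      arr_delay = c(11, NA, 125, -10, 150))

# Delayed on arrival or departure by more than two hours
delayed <- toy_flights %>% filter(arr_delay > 120 | dep_delay > 120)

# Departed in July, August, or September
summer <- toy_flights %>% filter(month %in% 7:9)

# Missing departure time
no_dep <- toy_flights %>% filter(is.na(dep_time))

nrow(delayed)  # 2
nrow(summer)   # 3
nrow(no_dep)   # 1
```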

62 / 72
Subset columns by their names

What do the following commands do? Try them out.

# 1
flights %>% select(dep_time, dep_delay, arr_time, arr_delay)

# 2
flights %>% select(starts_with("dep_"), starts_with("arr_"))

# 3
flights_JFK <- flights %>%
filter(origin == "JFK") %>%
select(year:day, dest, ends_with("delay"), distance, dep_time)

63 / 72
Extract hours and minutes from departure time

▶ Let’s work with the flights_JFK tibble from #3 on the previous
slide.
▶ Compute hour and minute from dep_time:

flights_JFK <- flights_JFK %>%
  mutate(hour = dep_time %/% 100,
         minute = dep_time %% 100, .after = day)
flights_JFK %>% head(3)

## # A tibble: 3 x 10
## year month day hour minute dest dep_delay arr_delay distance d
## <int> <int> <int> <dbl> <dbl> <chr> <dbl> <dbl> <dbl>
## 1 2013 1 1 5 42 MIA 2 33 1089
## 2 2013 1 1 5 44 BQN -1 -18 1576
## 3 2013 1 1 5 57 MCO -3 -8 944
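The arithmetic works because dep_time stores clock times as numbers in HHMM form: integer division by 100 keeps the hours and the remainder keeps the minutes. A sketch on a few made-up times:

```r
# Hypothetical departure times in HHMM form
dep_time <- c(517, 1004, 2359)

hour <- dep_time %/% 100   # integer division: 5 10 23
minute <- dep_time %% 100  # remainder: 17 4 59

# Minutes since midnight, a form that is easier to do arithmetic on
mins_since_midnight <- hour * 60 + minute  # 317 604 1439
```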

64 / 72
Flight status

▶ Categorize the arrival status into three groups: late, on time,
and cancelled.
▶ Then summarize the occurrences of status.

flights_JFK %>%
  mutate(arr_status = ifelse(is.na(arr_delay), "cancelled",
                      ifelse(arr_delay > 0, "late", "on time"))) %>%
  count(arr_status)

## # A tibble: 3 x 2
## arr_status n
## <chr> <int>
## 1 cancelled 2200
## 2 late 42885
## 3 on time 66194

65 / 72
Bucket flight status

▶ There is a better option for categorization with more than two
categories. Here is how it works:

flights_JFK %>%
mutate(arr_status =
case_when(is.na(arr_delay) ~ "cancelled",
arr_delay <= 0 ~ "on time",
arr_delay > 0 ~ "late")) %>%
count(arr_status)

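▶ Much clearer code. Because these three conditions are mutually exclusive, the result
does not depend on the order in which we list them. (In general, case_when() uses the
first condition that is TRUE for each row.)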

66 / 72
More on case_when()

▶ case_when() is a vectorized alternative to nested ifelse() calls: more concise code!

▶ It accepts a sequence of two-sided formulas: condition ~ result.
▶ TRUE can be specified as a catch-all condition, which is used when none of the
previous conditions is TRUE.
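A sketch of the catch-all on a made-up tibble (delays). The extra "very late" category and its 120-minute threshold are hypothetical; they are there to show that when conditions overlap, the first matching one wins.

```r
library(dplyr)

# Hypothetical toy data
delays <- tibble(arr_delay = c(-5, 0, 12, NA, 200))

res <- delays %>%
  mutate(arr_status = case_when(
    is.na(arr_delay) ~ "cancelled",
    arr_delay > 120  ~ "very late",  # checked before the broader "late" rule
    arr_delay > 0    ~ "late",
    TRUE             ~ "on time"     # catch-all for everything else
  ))

res$arr_status
# "on time" "on time" "late" "cancelled" "very late"
```

Swapping the "very late" and "late" lines would label the 200-minute delay as "late", since it also satisfies arr_delay > 0.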

Artwork by Allison Horst.

67 / 72
Monthly mean departure delay
flights_JFK %>% group_by(month) %>%
summarize(mean_dep_delay = mean(dep_delay, na.rm = TRUE)) %>%
arrange(desc(mean_dep_delay))

## # A tibble: 12 x 2
## month mean_dep_delay
## <int> <dbl>
## 1 7 23.8
## 2 6 20.5
## 3 12 14.8
## 4 8 12.9
## 5 5 12.5
## 6 4 12.2
## 7 2 11.8
## 8 3 10.7
## 9 1 8.62
## 10 9 6.64
## 11 11 4.68
## 12 10 4.59

68 / 72
Destinations by the number of flights

▶ Destination airports that received more than 5000 flights from
JFK in 2013.

flights %>% filter(origin == "JFK") %>%
  count(dest) %>%
  filter(n > 5000)

## # A tibble: 4 x 2
## dest n
## <chr> <int>
## 1 BOS 5898
## 2 LAX 11262
## 3 MCO 5464
## 4 SFO 8204

69 / 72
Summary

This week we learned the key dplyr functions that solve the vast
majority of data manipulation challenges:
▶ Subset observations by their values: filter()
▶ Subset variables by their names: select()
▶ Create new variables: mutate()
▶ Reorder the rows: arrange()
▶ Collapse many values down to a single summary: summarize()
or summarise()

These functions can be used in conjunction with group_by(), which
changes the scope of each function from operating on the entire data
set to operating on it within groups.

70 / 72
Summary

71 / 72
More on tidyverse

▶ tidyverse is a suite of R packages designed for data science.

▶ Its core packages include dplyr, readr, stringr, forcats,
purrr, ggplot2.
▶ Learn more about it at https://www.tidyverse.org/packages/.

72 / 72

No ratings yet
Lab1 411 Eman Yahya 7773225
16 pages
Chapter 03 Wrangling
No ratings yet
Chapter 03 Wrangling
40 pages
R Sharing
No ratings yet
R Sharing
16 pages
R Programming Basics for Beginners
No ratings yet
R Programming Basics for Beginners
2 pages
Statistics and Data Science With R Part - 4
No ratings yet
Statistics and Data Science With R Part - 4
23 pages
Lect01 2
No ratings yet
Lect01 2
19 pages
Code Basics & Data Manipulation With R: Literature: Wickham & Grolemund R For Data Science Ch. 3, 16
No ratings yet
Code Basics & Data Manipulation With R: Literature: Wickham & Grolemund R For Data Science Ch. 3, 16
31 pages
Module IV
No ratings yet
Module IV
43 pages
Tidy Verse
No ratings yet
Tidy Verse
76 pages
Week6 Slides Updated
No ratings yet
Week6 Slides Updated
57 pages
Data Wrangling
No ratings yet
Data Wrangling
15 pages
R Data Subsetting & Manipulation Guide
No ratings yet
R Data Subsetting & Manipulation Guide
44 pages
R1 Uptovisualisation
No ratings yet
R1 Uptovisualisation
122 pages
Basics of R: Installation & Data Types
No ratings yet
Basics of R: Installation & Data Types
43 pages
Lab11
No ratings yet
Lab11
2 pages
Data Manipulation with dplyr
100% (1)
Data Manipulation with dplyr
39 pages
Week12 Slides
No ratings yet
Week12 Slides
46 pages
Week11 Slides
No ratings yet
Week11 Slides
27 pages
Week13 Slides Review
No ratings yet
Week13 Slides Review
23 pages
Week3 Slides
No ratings yet
Week3 Slides
36 pages
Week2 Slides
No ratings yet
Week2 Slides
76 pages
Faculty Recruitment Jul 2025 SRD
No ratings yet
Faculty Recruitment Jul 2025 SRD
4 pages
Federal Mutual 7400
No ratings yet
Federal Mutual 7400
13 pages
Construction of The New National Load Dispatch Center-3222
No ratings yet
Construction of The New National Load Dispatch Center-3222
2 pages
Good DESIGN THINKING Book by Dharam Mentor
100% (1)
Good DESIGN THINKING Book by Dharam Mentor
78 pages
NeighbourNet Project Report As On 1st June 2024
No ratings yet
NeighbourNet Project Report As On 1st June 2024
38 pages
Proposal of Masterplan Siam Maspion Terminal - English Version
100% (1)
Proposal of Masterplan Siam Maspion Terminal - English Version
9 pages
List of Hospital
No ratings yet
List of Hospital
61 pages
SSX Diaphragm: Truextent® Replacement Diaphragms For JBL and RADIAN
No ratings yet
SSX Diaphragm: Truextent® Replacement Diaphragms For JBL and RADIAN
2 pages
Home Science Extension and Community Development
No ratings yet
Home Science Extension and Community Development
2 pages
Introduction of Philosophy MODULE 5
100% (1)
Introduction of Philosophy MODULE 5
3 pages
Ce 3401 Ahe It 1 Set 2
No ratings yet
Ce 3401 Ahe It 1 Set 2
2 pages
Somatics & Phenomenology Study
No ratings yet
Somatics & Phenomenology Study
17 pages
Adani P001 Corrosion Protection
No ratings yet
Adani P001 Corrosion Protection
119 pages
Ecological Psychology Insights
No ratings yet
Ecological Psychology Insights
2 pages
3-IELTS Academic Reading True False Not Given-WITH ANSWERS
No ratings yet
3-IELTS Academic Reading True False Not Given-WITH ANSWERS
15 pages
Chi - Square - Test of Association Notes
No ratings yet
Chi - Square - Test of Association Notes
1 page
OpenVas Vulnerability Scanning
No ratings yet
OpenVas Vulnerability Scanning
7 pages
Q3SCIE7
No ratings yet
Q3SCIE7
6 pages
Notes - Working With Functions
No ratings yet
Notes - Working With Functions
13 pages
MXV-10 00
No ratings yet
MXV-10 00
2 pages
An Overview of Solid Waste Management Practices in Pune Maharashtra India
No ratings yet
An Overview of Solid Waste Management Practices in Pune Maharashtra India
13 pages
Custom BAPI Creation
No ratings yet
Custom BAPI Creation
24 pages
Oct 2024
No ratings yet
Oct 2024
15 pages
Jadual Mba Odl Sem 1 2022-2023 (Update 13102022) 3
No ratings yet
Jadual Mba Odl Sem 1 2022-2023 (Update 13102022) 3
16 pages
Principles of Psychology 1 Ed Jarvis Ebook and TestBank Bundle Fast Access
No ratings yet
Principles of Psychology 1 Ed Jarvis Ebook and TestBank Bundle Fast Access
316 pages
LDR 736 Week 4 Reflection
No ratings yet
LDR 736 Week 4 Reflection
3 pages
1986 Lionnet A Boiling House Recovery
No ratings yet
1986 Lionnet A Boiling House Recovery
3 pages
Job Safety Analysis Form: Law M. Mechanical Supervisor Alex A./ Egbejimi Adebayo PSC
0% (1)
Job Safety Analysis Form: Law M. Mechanical Supervisor Alex A./ Egbejimi Adebayo PSC
4 pages
The Book of (PLC & SCADA) Dosing System by HMI
No ratings yet
The Book of (PLC & SCADA) Dosing System by HMI
102 pages
Case Study - Bakery House App
No ratings yet
Case Study - Bakery House App
4 pages