diff --git a/.README.md.swp b/.README.md.swp deleted file mode 100644 index 7abe798..0000000 Binary files a/.README.md.swp and /dev/null differ diff --git a/.gitignore b/.gitignore index 7b732e7..76150ea 100644 --- a/.gitignore +++ b/.gitignore @@ -3,3 +3,4 @@ .RData .Ruserdata .DS_Store +*.swp diff --git a/automating-workflows-in-R.Rproj b/R-Functional-Programming.Rproj similarity index 100% rename from automating-workflows-in-R.Rproj rename to R-Functional-Programming.Rproj diff --git a/README.html b/README.html deleted file mode 100644 index f202ff8..0000000 --- a/README.html +++ /dev/null @@ -1,460 +0,0 @@ - - - - -
- - - - - - - - -by Jae Yeon Kim
-File an issue if you have problems, questions or suggestions.
-This workshop helps you to step up your R skills with functional programming. The purrr
package provides easy-to-use tools to automate repeated tasks throughout your R workflow (e.g., wrangling, modeling, and visualization). The end result is cleaner, faster, more readable, and more extensible code. I highly recommend this workshop if you (1) still write copy-and-paste code, (2) rely exclusively on for loops for automation, or (3) want to know about the joy and power of R functional programming.
How to use purrr
to automate workflow in a cleaner, faster, and more extendable way [Notebook]
How to use map2()
and pmap()
to avoid writing nested loops. [Notebook]
How to use map()
and glue()
to automate creating multiple plots [Notebook]
How to use reduce()
to automate joining multiple dataframes [Notebook]
How to use slowly()
and future_
to make an automated process either slower or faster [Notebook]
How to use safely()
and possibly()
to make error handling easier [Notebook]
Launch the . Please do so before attending the workshop, as it takes a while (especially if you are doing it for the first time).
This work is licensed under a Creative Commons Attribution 4.0 International License.
library(tidyverse)
-library(tictoc)
-library(broom)
-library(patchwork)
-purrr
to automate workflow in a cleaner, faster, and more extendable way
--Copy-and-paste programming, sometimes referred to as just pasting, is the production of highly repetitive computer programming code, as produced by copy and paste operations. It is primarily a pejorative term; those who use the term are often implying a lack of programming competence. It may also be the result of technology limitations (e.g., an insufficiently expressive development environment) as subroutines or libraries would normally be used instead. However, there are occasions when copy-and-paste programming is considered acceptable or necessary, such as for boilerplate, loop unrolling (when not supported automatically by the compiler), or certain programming idioms, and it is supported by some source code editors in the form of snippets. - Wikipedia
-
The following exercise was inspired by Wickham’s example.
Let’s imagine df
is a survey dataset.
a, b, c, d = Survey questions
-99: non-responses
Your goal: replace -99 with NA
# Data
-df <- tibble("a" = -99,
- "b" = -99,
- "c" = -99,
- "d" = -99)
-# Copy and paste
-df$a[df$a == -99] <- NA
-df$b[df$b == -99] <- NA
-df$c[df$c == -99] <- NA
-df$d[df$d == -99] <- NA
-df$a[df$a == -99] <- NA
has an error, how are you going to fix it?) A solution is not scalable if it is not automatable.
Let’s recall what a function is in R: input + computation + output
If you write a function, you gain efficiency because you don’t need to copy and paste the computation part.
` function(input){
-computation
-return(output)
-} `
-# Function
-fix_missing <- function(x) {
- x[x == -99] <- NA
- # This is better
- return(x)
-}
-
-# Apply function to each column (vector)
-df$a <- fix_missing(df$a)
-df$b <- fix_missing(df$b)
-df$c <- fix_missing(df$c)
-df$d <- fix_missing(df$d)
-Challenge 2 Why is using a function more efficient than 100% copying and pasting? Can you think of a way to automate the process?
Many options for automation in R: for loop
, apply
family, etc.
Here’s a tidy solution that comes from the purrr
package.
The power and joy of a one-liner.
purrr::map_df(df, fix_missing)
-map()
is a higher-order function that applies a given function to each element of a list/vector.
This is how map() works. It’s easier to understand with a picture.
-- Input: Takes a vector/list.
-
-- Computation: Calls the function once for each element of the vector
-
-- Output: Returns a list or whatever data format you prefer (e.g., the `_df` helper returns a dataframe)
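To make the input, computation, and output steps concrete, here is a minimal sketch. The `scores` list is a made-up example (it is not part of the workshop data), and the only point is to contrast the list returned by `map()` with the atomic vector returned by a typed helper such as `map_dbl()`.

```r
library(purrr)

# Hypothetical input: a named list of numeric vectors
scores <- list(a = c(1, 2, 3), b = c(10, 20), c = c(5, NA))

# map() calls mean() once per element and always returns a list
as_list <- map(scores, mean, na.rm = TRUE)

# map_dbl() runs the same computation but returns a named double vector
as_vector <- map_dbl(scores, mean, na.rm = TRUE)

str(as_list)
as_vector
```

The typed helpers fail loudly if the function returns something that does not fit the requested type, which is often what you want in a pipeline.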
-Challenge 3 If you run the code below, what’s going to be the data type of the output?
-map_chr(df, fix_missing)
-## a b c d
-## NA NA NA NA
-map()
is a good alternative to a for loop
. (For more information, watch Hadley Wickham’s talk titled “The Joy of Functional Programming (for Data Science)”.)
# Built-in data
-data("airquality")
-tic()
-
-out1 <- vector("double", ncol(airquality)) # Placeholder
-
-for (i in seq_along(airquality)) { # Sequence variable
-
- out1[[i]] <- mean(airquality[[i]], na.rm = TRUE) # Assign a computation result to each element
-
-}
-
-toc()
-## 0.006 sec elapsed
-tic()
-
-out1 <- airquality %>% map_dbl(mean, na.rm = TRUE)
-
-toc()
-## 0.033 sec elapsed
-In short, map()
is more readable, comparably fast at this scale (the two timings above differ by only milliseconds), and easy to extend to other data science tasks (e.g., wrangling, modeling, and visualization) using %>%
.
Final point: Why not base R apply
family?
Short answer: purrr::map()
is simpler to write. For instance,
map_dbl(x, mean, na.rm = TRUE)
= vapply(x, mean, na.rm = TRUE, FUN.VALUE = double(1))
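A quick way to convince yourself of this equivalence is to run both calls on the same input and compare. This is a small sketch; the list `x` is an invented example.

```r
library(purrr)

# Hypothetical input with one missing value
x <- list(c(1, NA, 3), c(4, 5, 6))

# purrr: the return type is encoded in the function name
via_purrr <- map_dbl(x, mean, na.rm = TRUE)

# base R: the return type is passed through FUN.VALUE
via_base <- vapply(x, mean, na.rm = TRUE, FUN.VALUE = double(1))

identical(via_purrr, via_base)
```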
Additional tips
-Performance testing (profiling) is an important part of programming. The tictoc
package measures the time it takes to run a target function once. If you want a more robust measure of timing as well as information on memory (speed and space both matter for performance testing), consider using the bench
package, which is designed for high-precision timing of R expressions.
map_mark <- bench::mark(
-
- out1 <- airquality %>% map_dbl(mean, na.rm = TRUE)
-
- )
-
-map_mark
-## # A tibble: 1 × 6
-## expression min median `itr/sec`
-## <bch:expr> <bch:tm> <bch:tm> <dbl>
-## 1 out1 <- airquality %>% map_dbl(mean, na.rm = TRUE) 62.1µs 78.1µs 11966.
-## # … with 2 more variables: mem_alloc <bch:byt>, gc/sec <dbl>
-map()
can be used to run regression models (or whatever model you want to run) on list-columns. No more copying and pasting to run many regression models on subgroups!
# Have you ever tried this?
-lm_A <- lm(y ~ x, subset(data, subgroup == "group_A"))
-lm_B <- lm(y ~ x, subset(data, subgroup == "group_B"))
-lm_C <- lm(y ~ x, subset(data, subgroup == "group_C"))
-lm_D <- lm(y ~ x, subset(data, subgroup == "group_D"))
-lm_E <- lm(y ~ x, subset(data, subgroup == "group_E"))
-# Function
-lm_model <- function(df) {
- lm(Temp ~ Ozone, data = df)
-}
-# Map
-models <- airquality %>%
- # Determines group variable
- group_by(Month) %>%
- nest() %>% # Create list-columns
- mutate(ols = map(data, lm_model)) # Map
-# Add tidying
-tidy_lm_model <- purrr::compose( # compose multiple functions
- broom::tidy, # convert lm objects into tidy tibbles
- lm_model
-)
-
-tidied_models <- airquality %>%
- group_by(Month) %>%
- nest() %>% # Create list-columns
- mutate(ols = map(data, tidy_lm_model))
-
-tidied_models$ols[1]
-## [[1]]
-## # A tibble: 2 × 5
-## term estimate std.error statistic p.value
-## <chr> <dbl> <dbl> <dbl> <dbl>
-## 1 (Intercept) 62.9 1.61 39.2 2.88e-23
-## 2 Ozone 0.163 0.0500 3.26 3.31e- 3
-A good friend of map()
function is rerun()
function. This combination is really useful for simulations. Consider the following example.
set.seed(1234)
-
-small_n <- 100 ; k <- 1000 ; mu <- 500 ; sigma <- 20
-
-y_list <- rep(list(NA), k)
-
-for (i in seq(k)) {
-
- y_list[[i]] <- rnorm(small_n, mu, sigma)
-
-}
-
-y_means <- unlist(lapply(y_list, mean))
-
-qplot(y_means) +
- geom_vline(xintercept = 500, linetype = "dotted", color = "red")
-## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
-small_n <- 100 ; k <- 1000
-
-y_tidy <- rerun(k, rnorm(small_n, mu, sigma))
-
-y_means_tidy <- map_dbl(y_tidy, mean)
-
-# Visualize
-(qplot(y_means) +
- geom_vline(xintercept = 500, linetype = "dotted", color = "red")) +
-(qplot(y_means_tidy) +
- geom_vline(xintercept = 500, linetype = "dotted", color = "red"))
-## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
-## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
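One caveat: `rerun()` is deprecated as of purrr 1.0.0. A minimal sketch of the same simulation written with `map()` alone is shown here, reusing the parameter values from the example above. The object names `y_sim` and `y_sim_means` are new and only for illustration.

```r
library(purrr)

set.seed(1234)

small_n <- 100; k <- 1000; mu <- 500; sigma <- 20

# Draw k samples of size small_n; the index produced by seq_len(k) is ignored
y_sim <- map(seq_len(k), ~ rnorm(small_n, mu, sigma))

# Mean of each simulated sample, returned as a double vector
y_sim_means <- map_dbl(y_sim, mean)

summary(y_sim_means)
```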
-library(tidyverse)
-map2()
and pmap()
to avoid writing nested loops.
[1] “University = Berkeley | Department = waterbenders”
-[1] “University = Berkeley | Department = earthbenders”
-[1] “University = Berkeley | Department = firebenders”
-[1] “University = Berkeley | Department = airbenders”
-[1] “University = Stanford | Department = waterbenders”
-[1] “University = Stanford | Department = earthbenders”
-[1] “University = Stanford | Department = firebenders”
-[1] “University = Stanford | Department = airbenders”
-paste("University = Berkeley | Department = CS")
-## [1] "University = Berkeley | Department = CS"
-A slightly more efficient way: using a for loop.
Think about which part of the statement is constant and which part varies ( = parameters).
-
Do we need a placeholder? No. We don’t need a placeholder because we don’t store the result of iterations.
Challenge 1: How many parameters do you need to solve the problem below?
-- Fun fact: The department names are from [Avatar: The Last Airbender](https://en.wikipedia.org/wiki/Avatar:_The_Last_Airbender).
-# Outer loop for univ variable
-
-for (univ in c("Berkeley", "Stanford")) {
-
- # Inner loop for dept variable
- for (dept in c("waterbenders", "earthbenders", "firebenders", "airbenders")) {
-
- print(paste("University =", univ, "|", "Department =", dept)) # paste() already inserts one space between arguments
-
- }
-
-}
-## [1] "University = Berkeley | Department = waterbenders"
-## [1] "University = Berkeley | Department = earthbenders"
-## [1] "University = Berkeley | Department = firebenders"
-## [1] "University = Berkeley | Department = airbenders"
-## [1] "University = Stanford | Department = waterbenders"
-## [1] "University = Stanford | Department = earthbenders"
-## [1] "University = Stanford | Department = firebenders"
-## [1] "University = Stanford | Department = airbenders"
---To become significantly more reliable, code must become more transparent. In particular, nested conditions and loops must be viewed with great suspicion. Complicated control flows confuse programmers. Messy code often hides bugs. — Bjarne Stroustrup
-
Challenge 2 Why are we using rep()
to create input vectors? For instance, for univ_list
why not just use c("Berkeley", "Stanford")
?
# Inputs (remember the length of these inputs should be identical)
-univ_list <- rep(c("Berkeley", "Stanford"),4)
-
-dept_list <- rep(c("waterbenders", "earthbenders", "firebenders", "airbenders"),2)
-# Function
-print_lists <- function(univ, dept){
-
- print(paste("University =", univ, "|", "Department =", dept)) # paste() already inserts one space between arguments
-
-}
-# Test
-print_lists(univ_list[1], dept_list[1])
-## [1] "University = Berkeley | Department = waterbenders"
-map2()
or pmap()
# 2 arguments
-map2_output <- map2(univ_list, dept_list,
- print_lists)
-## [1] "University = Berkeley | Department = waterbenders"
-## [1] "University = Stanford | Department = earthbenders"
-## [1] "University = Berkeley | Department = firebenders"
-## [1] "University = Stanford | Department = airbenders"
-## [1] "University = Berkeley | Department = waterbenders"
-## [1] "University = Stanford | Department = earthbenders"
-## [1] "University = Berkeley | Department = firebenders"
-## [1] "University = Stanford | Department = airbenders"
-# 3+ arguments
-pmap_output <- pmap(list(univ_list, dept_list), print_lists)
-## [1] "University = Berkeley | Department = waterbenders"
-## [1] "University = Stanford | Department = earthbenders"
-## [1] "University = Berkeley | Department = firebenders"
-## [1] "University = Stanford | Department = airbenders"
-## [1] "University = Berkeley | Department = waterbenders"
-## [1] "University = Stanford | Department = earthbenders"
-## [1] "University = Berkeley | Department = firebenders"
-## [1] "University = Stanford | Department = airbenders"
-Challenge 3 Have you noticed that we used a slightly different input for pmap()
compared to map()
or map2()
? What is the difference?
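A note on the inputs used above: `map2()` pairs its two inputs position by position, so alternating `rep()` calls yield only four distinct university and department pairs, each printed twice (you can see this in the output). If the goal is all eight combinations from the nested loop, one option is to repeat each university once per department. The sketch below assumes that goal; `univ_all`, `dept_all`, and `labels` are illustrative names.

```r
library(purrr)

univ <- c("Berkeley", "Stanford")
dept <- c("waterbenders", "earthbenders", "firebenders", "airbenders")

# Repeat each university once per department: B, B, B, B, S, S, S, S
univ_all <- rep(univ, each = length(dept))

# Cycle through the departments once per university
dept_all <- rep(dept, times = length(univ))

# map2_chr() returns the labels as a character vector instead of printing them
labels <- map2_chr(
  univ_all, dept_all,
  ~ paste("University =", .x, "|", "Department =", .y)
)

labels
```

Returning the strings rather than printing them inside the function keeps the side effect (printing) separate from the computation.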
library(tidyverse)
-library(glue)
-map()
and glue()
to automate creating multiple plots
data("airquality")
-
-airquality
-airquality %>%
- ggplot(aes(x = Ozone, y = Solar.R)) +
- geom_point() +
- labs(title = "Relationship between Ozone and Solar.R",
- y = "Solar.R")
-## Warning: Removed 42 rows containing missing values (geom_point).
-airquality %>%
- ggplot(aes(x = Ozone, y = Wind)) +
- geom_point() +
- labs(title = "Relationship between Ozone and Wind",
- y = "Wind")
-## Warning: Removed 37 rows containing missing values (geom_point).
-airquality %>%
- ggplot(aes(x = Ozone, y = Temp)) +
- geom_point() +
- labs(title = "Relationship between Ozone and Temp",
- y = "Temp")
-## Warning: Removed 37 rows containing missing values (geom_point).
-glue()
works.
glue()
combines strings and objects, and it is simpler and faster than paste()
or sprintf()
.
names <- c("Jae", "Aniket", "Avery")
-
-fields <- c("Political Science", "Law", "Public Health")
-
-library(glue)
-glue("{names} studies {fields}.")
-## Jae studies Political Science.
-## Aniket studies Law.
-## Avery studies Public Health.
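Because `glue()` is vectorized, the same idea produces one plot title per column name. A small sketch follows; the vector `y_vars` is written out by hand here purely for illustration.

```r
library(glue)

# Column names to plot against Ozone (typed by hand for this sketch)
y_vars <- c("Solar.R", "Wind", "Temp")

# One title per element of y_vars
titles <- glue("Relationship between Ozone and {y_vars}")

titles
```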
-So, our next step is to combine glue()
and map()
.
Let’s first think about writing a function that includes glue()
.
Challenge 1 How can you create the character vector of column names?
-Challenge 2 How can you make ggplot()
take strings as x and y variable names?
airquality %>%
- ggplot(aes(x = .data[[names(airquality)[1]]], y = .data[[names(airquality)[2]]]))+
- geom_point()+
- labs(title = glue("Relationship between Ozone and {names(airquality)[2]}"),
- y = glue("{names(airquality)[2]}"))
-The next step is to write an automatic plotting function.
-create_point_plot <- function(i){
-
- p <- airquality %>%
- ggplot(aes(x = .data[[names(airquality)[1]]], y = .data[[names(airquality)[i]]]))+
- geom_point() +
- labs(title = glue("Relationship between Ozone and {names(airquality)[i]}"),
- y = glue("{names(airquality)[i]}"))
- print(p)
-}
-walk()
.
walk(2:ncol(airquality), create_point_plot)
-library(tidyverse)
-reduce()
to automate row-binding multiple dataframes
df1 <- tibble(x = sample(1:10, size = 3, replace = TRUE),
- y = sample(1:10, size = 3, replace = TRUE),
- z = sample(1:10, size = 3, replace = TRUE))
-
-df2 <- tibble(x = sample(1:10, size = 3, replace = TRUE),
- y = sample(1:10, size = 3, replace = TRUE),
- z = sample(1:10, size = 3, replace = TRUE))
-
-df3 <- tibble(x = sample(1:10, size = 3, replace = TRUE),
- y = sample(1:10, size = 3, replace = TRUE),
- z = sample(1:10, size = 3, replace = TRUE))
-first_bind <- bind_rows(df1, df2)
-
-second_bind <- bind_rows(first_bind, df3)
-Challenge Why is the above solution not efficient?
-How reduce() works.
-- Input: Takes a vector of length n
-
-- Computation: Calls a function with a pair of values at a time
-
-- Output: Returns a vector of length 1
-reduced <- reduce(list(df1, df2, df3), bind_rows)
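To see the pairwise logic in isolation, it may help to try `reduce()` on plain numbers first. This is a minimal sketch; `accumulate()` is the purrr sibling that keeps every intermediate result, and the small data frames in `dfs` are invented for the example.

```r
library(purrr)

# reduce() folds pairwise from the left: ((1 + 2) + 3) + 4
total <- reduce(c(1, 2, 3, 4), `+`)

# accumulate() keeps each intermediate step of the same fold
running <- accumulate(c(1, 2, 3, 4), `+`)

# The same pattern stacks a list of data frames, two at a time
dfs <- list(data.frame(x = 1:2), data.frame(x = 3:4), data.frame(x = 5:6))
stacked <- reduce(dfs, rbind)

total
running
nrow(stacked)
```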
-library(tidyverse)
-library(tictoc)
-library(furrr)
-slowly()
and future_
to make an automated process either slower or faster
walk()
works the same as map()
but doesn’t store its output.
If you’re web scraping, one problem with this approach is it’s too fast by human standards.
tic()
-walk(1:50, function(x){message("Scraping page", x)}) # Anonymous function; I don't name the function
-## Scraping page1
-## Scraping page2
-## Scraping page3
-## Scraping page4
-## Scraping page5
-## Scraping page6
-## Scraping page7
-## Scraping page8
-## Scraping page9
-## Scraping page10
-## Scraping page11
-## Scraping page12
-## Scraping page13
-## Scraping page14
-## Scraping page15
-## Scraping page16
-## Scraping page17
-## Scraping page18
-## Scraping page19
-## Scraping page20
-## Scraping page21
-## Scraping page22
-## Scraping page23
-## Scraping page24
-## Scraping page25
-## Scraping page26
-## Scraping page27
-## Scraping page28
-## Scraping page29
-## Scraping page30
-## Scraping page31
-## Scraping page32
-## Scraping page33
-## Scraping page34
-## Scraping page35
-## Scraping page36
-## Scraping page37
-## Scraping page38
-## Scraping page39
-## Scraping page40
-## Scraping page41
-## Scraping page42
-## Scraping page43
-## Scraping page44
-## Scraping page45
-## Scraping page46
-## Scraping page47
-## Scraping page48
-## Scraping page49
-## Scraping page50
-toc()
-## 0.007 sec elapsed
---slowly() takes a function and modifies it to wait a given amount of time between each call. -
-purrr
package vignette
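Here is a minimal sketch of that idea. `scrape_page()` is a hypothetical stand-in that only prints a message, and the 0.1 second pause is an arbitrary choice to keep the example quick; a real scraper would usually wait longer.

```r
library(purrr)

# Hypothetical stand-in for a real scraping function
scrape_page <- function(x) message("Scraping page ", x)

# Wrap the function so that successive calls are spaced at least 0.1 seconds apart
slow_scrape <- slowly(scrape_page, rate = rate_delay(pause = 0.1))

# Time three calls; the pauses should dominate the elapsed time
elapsed <- system.time(walk(1:3, slow_scrape))[["elapsed"]]

elapsed
```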
In a different situation, you want to make your function run faster. This is common when you collect and analyze data at a large scale. You can solve this problem using parallel processing. For more on parallel processing in R, read this review.
-Parallel processing setup
-Step 1: Determine the maximum number of workers (availableCores()
)
Step 2: Determine the parallel processing mode (plan()
)
# Setup
-n_cores <- availableCores() - 1
-n_cores # This number depends on your computer spec.
-## system
-## 3
-plan(multisession, # 'multiprocess' is deprecated in future (>= 1.20.0); multisession works on every platform
- workers = n_cores) # the maximum number of workers
-tic()
-mean100 <- map(1:100000, mean)
-toc()
-## 0.392 sec elapsed
-tic()
-mean100 <- future_map(1:100000, mean)
-toc()
-## 0.657 sec elapsed
-library(tidyverse)
-library(rvest)
-safely()
and possibly()
to make error handling easier
Challenge 1
-map(url_lists, read_html)
url_lists <- c("https://en.wikipedia.org/wiki/University_of_California,_Berkeley",
-"https://en.wikipedia.org/wiki/Stanford_University",
-"https://en.wikipedia.org/wiki/Carnegie_Mellon_University",
-"https://DLAB"
-)
-map(url_lists, read_html)
-There are three kinds of messages you will run into, if your code has an error based on the following functions.
-The basic logic of try-catch
, R’s basic error handling function, works like the following.
tryCatch(
- {map(url_lists, read_html)
- }, warning = function(w) {
- "Warning"
- }, error = function(e) {
- "Error"
- }, finally = {
- "Message"
-})
-## [1] "Error"
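The same pattern can be tried without an internet connection. The sketch below wraps `log()` in a hypothetical helper, `safe_log()`, and feeds it a valid input, an input that triggers a warning, and an input that triggers an error. Note that the `finally` expression runs for its side effect only; its value is never returned, which is why the example above returns "Error" rather than "Message".

```r
# Hypothetical helper: report what happened instead of stopping
safe_log <- function(x) {
  tryCatch(
    log(x),
    warning = function(w) paste("Warning:", conditionMessage(w)),
    error = function(e) paste("Error:", conditionMessage(e)),
    # finally always runs, but its value is discarded
    finally = message("Finished log(", deparse(x), ")")
  )
}

ok <- safe_log(10)       # no condition: the numeric result is returned
warned <- safe_log(-1)   # log(-1) signals a warning
failed <- safe_log("a")  # log("a") signals an error

ok
warned
failed
```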
-purrr
version of the try-catch
mechanism (evaluates code and assigns exception handlers).
Outputs
-result: the result or NULL
-error: NULL or the error
test <- map(url_lists, safely(read_html))
-map(url_lists, safely(read_html)) %>%
- map("result") %>% # = map(function(x) x[["result"]]) = map(~ .x[["result"]])
- purrr::compact() # Remove empty elements
-## [[1]]
-## {html_document}
-## <html class="client-nojs" lang="en" dir="ltr">
-## [1] <head>\n<meta http-equiv="Content-Type" content="text/html; charset=UTF-8 ...
-## [2] <body class="mediawiki ltr sitedir-ltr mw-hide-empty-elt ns-0 ns-subject ...
-##
-## [[2]]
-## {html_document}
-## <html class="client-nojs" lang="en" dir="ltr">
-## [1] <head>\n<meta http-equiv="Content-Type" content="text/html; charset=UTF-8 ...
-## [2] <body class="mediawiki ltr sitedir-ltr mw-hide-empty-elt ns-0 ns-subject ...
-##
-## [[3]]
-## {html_document}
-## <html class="client-nojs" lang="en" dir="ltr">
-## [1] <head>\n<meta http-equiv="Content-Type" content="text/html; charset=UTF-8 ...
-## [2] <body class="mediawiki ltr sitedir-ltr mw-hide-empty-elt ns-0 ns-subject ...
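To inspect what `safely()` returns without depending on a live website, here is a small offline sketch that uses `log()` in place of `read_html()`. The `inputs` list is invented; its second element is deliberately invalid.

```r
library(purrr)

# Hypothetical inputs: the second one makes log() fail
inputs <- list(10, "a", 100)

# Each element of results is a list with two slots: result and error
results <- map(inputs, safely(log))

# Keep the successful values and drop the NULLs left by failures
values <- compact(map(results, "result"))

# Keep the captured error objects and drop the NULLs left by successes
errors <- compact(map(results, "error"))

length(values)
conditionMessage(errors[[1]])
```

Keeping the error objects around, rather than discarding them, makes it possible to report why each failed element failed.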
-What if the best way to solve the problem is not ignoring the error …
-# If error occurred, "The URL is broken." will be stored in that element(s).
-out <- map(url_lists,
-
- possibly(read_html,
- otherwise = "The URL is broken.")
-
- )
-
-out
-## [[1]]
-## {html_document}
-## <html class="client-nojs" lang="en" dir="ltr">
-## [1] <head>\n<meta http-equiv="Content-Type" content="text/html; charset=UTF-8 ...
-## [2] <body class="mediawiki ltr sitedir-ltr mw-hide-empty-elt ns-0 ns-subject ...
-##
-## [[2]]
-## {html_document}
-## <html class="client-nojs" lang="en" dir="ltr">
-## [1] <head>\n<meta http-equiv="Content-Type" content="text/html; charset=UTF-8 ...
-## [2] <body class="mediawiki ltr sitedir-ltr mw-hide-empty-elt ns-0 ns-subject ...
-##
-## [[3]]
-## {html_document}
-## <html class="client-nojs" lang="en" dir="ltr">
-## [1] <head>\n<meta http-equiv="Content-Type" content="text/html; charset=UTF-8 ...
-## [2] <body class="mediawiki ltr sitedir-ltr mw-hide-empty-elt ns-0 ns-subject ...
-##
-## [[4]]
-## [1] "The URL is broken."
-# Let's find the broken URL.
-url_lists[out[seq(out)] == "The URL is broken."]
-## [1] "https://DLAB"
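Another way to locate failures, sketched here offline with `log()` standing in for `read_html()`, is to pick an `otherwise` value of the same type as a successful result. The output can then be collected into an atomic vector and the failures found with `is.na()`. This is an illustration under that assumption, not a drop-in replacement for the scraping code above, where successful results are HTML documents.

```r
library(purrr)

# Hypothetical inputs: the second one makes log() fail
inputs <- list(10, "a", 100)

# Failed calls return NA_real_, so every result fits in a double vector
out <- map_dbl(inputs, possibly(log, otherwise = NA_real_))

# Logical index of the inputs that failed
failed <- is.na(out)

inputs[failed]
```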
-