diff --git a/.README.md.swp b/.README.md.swp deleted file mode 100644 index 7abe798..0000000 Binary files a/.README.md.swp and /dev/null differ diff --git a/.gitignore b/.gitignore index 7b732e7..76150ea 100644 --- a/.gitignore +++ b/.gitignore @@ -3,3 +3,4 @@ .RData .Ruserdata .DS_Store +*.swp diff --git a/automating-workflows-in-R.Rproj b/R-Functional-Programming.Rproj similarity index 100% rename from automating-workflows-in-R.Rproj rename to R-Functional-Programming.Rproj diff --git a/README.html b/README.html deleted file mode 100644 index f202ff8..0000000 --- a/README.html +++ /dev/null @@ -1,460 +0,0 @@ - - - - - - - - - - - - - -README.utf8 - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -


How to Automate Repeated Things in R

by Jae Yeon Kim

File an issue if you have problems, questions or suggestions.

Overview

This workshop helps you step up your R skills with functional programming. The purrr package provides easy-to-use tools to automate repeated tasks across your entire R workflow (e.g., wrangling, modeling, and visualization). The end result is cleaner, faster, more readable, and more extensible code. I highly recommend this workshop if you (1) still write copy-and-paste code, (2) rely exclusively on for loops for automation, or (3) want to learn about the joy and power of functional programming in R.

Learning objectives

How to use purrr to automate workflow in a cleaner, faster, and more extendable way [Notebook]
How to use map2() and pmap() to avoid writing nested loops. [Notebook]
How to use map() and glue() to automate creating multiple plots [Notebook]
How to use reduce() to automate joining multiple dataframes [Notebook]
How to use slowly() and the future_ family of functions to make an automation process either slower or faster [Notebook]
How to use safely() and possibly() to make error handling easier [Notebook]

Prerequisites

Some experience with writing functions in R/Python

Setup

Launch the Binder environment. Please do so before attending the workshop, as it takes a while (especially if you are doing it for the first time).

This work is licensed under a Creative Commons Attribution 4.0 International License.


- - - - - - - - - - - - - - - diff --git a/README.md b/README.md index 98a808a..116d500 100644 --- a/README.md +++ b/README.md @@ -1,36 +1,91 @@ -# R-Functional-Programming +# Functional Programming in R -## Workshop Description +## Workshop Goals -This workshop helps you to step up your R skills with functional programming. The `purrr` package provides easy-to-use tools to automate repeated things in your entire R workflow (e.g., wrangling, modeling, and visualization). The end result is cleaner, faster, more readable and extendable code. I highly recommend you to take this workshop (1) if you still write copy-and-paste code, (2) exclusively rely on for loops for automation, and (3) want to know about the joy and power of R functional programming. +This workshop helps you to step up your R skills with functional programming. The `purrr` package provides easy-to-use tools to automate repeated things in your entire R workflow (e.g., wrangling, modeling, and visualization). The end result is cleaner, faster, more readable and extendable code. -## Learning objectives +Prior experience with R at the level of [R Fundamentals](https://github.com/dlab-berkeley/R-Fundamentals) and [R Data Wrangling](https://github.com/dlab-berkeley/R-Data-Wrangling) is recommended. -1. How to use `purrr` to automate workflow in a cleaner, faster, and more extendable way [[Notebook](https://rawcdn.githack.com/dlab-berkeley/R-functional-programming/1650e53a815d7c6e5449e035fd61a21b646b43d7/lecture_notes/01_why_map.html)] +## Install Instructions -2. How to use `map2()` and `pmap()` to avoid writing nested loops. [[Notebook](https://rawcdn.githack.com/dlab-berkeley/R-functional-programming/1650e53a815d7c6e5449e035fd61a21b646b43d7/lecture_notes/02_more_inputs.html)] +We will use RStudio to go through the workshop materials, which requires installation of both the R language and the RStudio software. -3. 
How to use `map()` and `glue()` to automate creating multiple plots [[Notebook](https://rawcdn.githack.com/dlab-berkeley/R-functional-programming/1650e53a815d7c6e5449e035fd61a21b646b43d7/lecture_notes/03_map_glue.html)] +1. [Download R](https://www.r-project.org/): Follow the links according to the operating system that you are running. Download the package, and install R onto your compute. You should install the most recent version (at least version 4.0). -4. How to use `reduce()` to automate joining multiple dataframes [[Notebook](https://rawcdn.githack.com/dlab-berkeley/R-functional-programming/1650e53a815d7c6e5449e035fd61a21b646b43d7/lecture_notes/04_reduce_join.html)] +2. [Download RStudio](https://www.rstudio.com/products/rstudio/download/): Install RStudio Desktop. This should be free. Do this after you have already installed R. -5. How to use `slowly()` and `future_` to make automation process either slower or faster [[Notebook](https://rawcdn.githack.com/dlab-berkeley/R-functional-programming/1650e53a815d7c6e5449e035fd61a21b646b43d7/lecture_notes/05_slower_faster.html)] +3. Download these workshop materials: -6. How to use `safely()` and `possibly()` to make error handling easier [[Notebook](https://rawcdn.githack.com/dlab-berkeley/R-functional-programming/1650e53a815d7c6e5449e035fd61a21b646b43d7/lecture_notes/06_make_error_handling_easier.html)] + - Click the green "Code" button in the top right of the repository information. -## Prerequisites + - Click "Download Zip". -- Some experience with writing functions in R + - Extract this file to a folder on your computer where you can easily access it (we recommend Desktop). -## Setup +4. Optional: If you are familiar with `git`, you can instead clone this repository by opening a terminal and entering -Launch the [![Binder](https://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/gh/dlab-berkeley/R-functional-programming/master?urlpath=rstudio). 
Please do so before attending the worskshop as it would take a while (especially, if you do it for the first time). +5. Make sure that the following packages are installed on your computer ---- +``` +here +tibble +tidyverse +``` -### Contributions to these materials by: +## Run the Code +Now that you have all the required software and materials, you need to run the code: + +1. Launch the RStudio software. + +2. Use the file navigator to find the `R-functional-programming` folder that you downloaded. + +3. Open the `R-Functional-Programming.Rproj` by double clicking to open the code in an R project. + +4. Open up the file corresponding to the part of the workshop currently in focus. + +5. Place your cursor on a given line and press "Command + Enter" (Mac) or "Control + Enter" (PC) to run an individual line of code. + +6. The `solutions` folder contains the solutions to the challenge problems. + +## Is R not working on your laptop? + +If you do not have R installed and the materials loaded for your workshop by the time it starts, we *strongly* recommend using the UC Berkeley DataHub to run the materials. You can access the DataHub by clicking [this link](https://datahub.berkeley.edu/hub/user-redirect/git-pull?repo=https%3A%2F%2Fgithub.com%2Fdlab-berkeley%2FR-functional-programming&urlpath=rstudio%2F&branch=main) + +The DataHub downloads this repository, along with any necessary packages, and allows you to run the materials in an RStudio instance on UC Berkeley's servers. No installation is necessary from your end--you only need an internet browser and a CalNet ID to log in. By using the DataHub, you can save your work and come back to it at any time. When you want to return to your saved work, go straight to [DataHub](https://datahub.berkeley.edu/), sign in, and click on the `advanced-data-wrangling-in-R` folder. 
+ +## About the UC Berkeley D-Lab + +D-Lab works with Berkeley faculty, research staff, and students to advance data-intensive social science and humanities research. Our goal at D-Lab is to provide practical training, staff support, resources, and space to enable you to use R for your own research applications. Our services cater to all skill levels and no programming, statistical, or computer science backgrounds are necessary. We offer these services in the form of workshops, one-to-one consulting, and working groups that cover a variety of research topics, digital tools, and programming languages. + +Visit the [D-Lab homepage](https://dlab.berkeley.edu/) to learn more about us. You can view our [calendar](https://dlab.berkeley.edu/events/calendar) for upcoming events, learn about how to utilize our [consulting](https://dlab.berkeley.edu/consulting) and [data](https://dlab.berkeley.edu/data) services, and check out upcoming [workshops](https://dlab.berkeley.edu/events/workshops). + + +## Other D-Lab R Workshops + +Here are other R workshops offered by the D-Lab: + +### Basic Competency +- [R Fundamentals](https://github.com/dlab-berkeley/R-Fundamentals) +- [R Data Wrangling](https://github.com/dlab-berkeley/R-Data-Wrangling) +- [R Graphics with ggplot2](https://github.com/dlab-berkeley/R-graphics) +- [Project Management in R](https://github.com/dlab-berkeley/efficient-reproducible-project-management-in-R) +- [Geospatial Fundamentals in R with sf](https://github.com/dlab-berkeley/Geospatial-Fundamentals-in-R-with-sf) +- [Census Data in R](https://github.com/dlab-berkeley/Census-Data-in-R) + +### Intermediate/Advanced Competency +- [Advanced Data Wrangling in R](https://github.com/dlab-berkeley/advanced-data-wrangling-in-R) +- [Introduction to Machine Learning in R](https://github.com/dlab-berkeley/Machine-Learning-in-R) +- [Unsupervised Learning in R](https://github.com/dlab-berkeley/Unsupervised-Learning-in-R) +- [R Machine Learning with 
tidymodels](https://github.com/dlab-berkeley/Machine-Learning-with-tidymodels) +- [Introduction to Deep Learning in R](https://github.com/dlab-berkeley/Deep-Learning-in-R) +- [Fairness and Bias in Machine Learning](https://github.com/dlab-berkeley/fairML) +- [R Package Development](https://github.com/dlab-berkeley/R-package-development) + +## Contributors + +- [Alex Stephenson](https://github.com/asteves) - [Jae Yeon Kim](https://jaeyk.github.io) -- [Alex Stephenson](https://alexstephenson.me) +- [Avery Richards](https://github.com/Averysaurus) ![](https://i.creativecommons.org/l/by/4.0/88x31.png) This work is licensed under a [Creative Commons Attribution 4.0 International License](https://creativecommons.org/licenses/by/4.0/). diff --git a/install.R b/install.R deleted file mode 100644 index dc17146..0000000 --- a/install.R +++ /dev/null @@ -1,7 +0,0 @@ -install.packages(c("pacman", - "tidyverse", - "glue", - "furrr", - "tictoc", - "rvest", - "broom")) diff --git a/lecture_notes/01_why_map.html b/lecture_notes/01_why_map.html deleted file mode 100644 index 37df3b5..0000000 --- a/lecture_notes/01_why_map.html +++ /dev/null @@ -1,553 +0,0 @@ - - - - - - - - - - - - - - -Why Functional Programming: No More Copying and Pasting - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -


1 Setup

library(tidyverse)
-library(tictoc)
-library(broom)
-library(patchwork)

2 Objectives

How to use purrr to automate workflow in a cleaner, faster, and more extendable way

3 Copy-and-paste programming

-
Copy-and-paste programming, sometimes referred to as just pasting, is the production of highly repetitive computer programming code, as produced by copy and paste operations. It is primarily a pejorative term; those who use the term are often implying a lack of programming competence. It may also be the result of technology limitations (e.g., an insufficiently expressive development environment) as subroutines or libraries would normally be used instead. However, there are occasions when copy-and-paste programming is considered acceptable or necessary, such as for boilerplate, loop unrolling (when not supported automatically by the compiler), or certain programming idioms, and it is supported by some source code editors in the form of snippets. - Wikipedia
-

The following exercise was inspired by Wickham’s example.
Let’s imagine df is survey data.
-
- a, b, c, d = Survey questions
- -99: non-responses
- Your goal: replace -99 with NA

# Data
-df <- tibble("a" = -99,
-             "b" = -99,
-             "c" = -99,
-             "d" = -99)

# Copy and paste 
-df$a[df$a == -99] <- NA
-df$b[df$b == -99] <- NA
-df$c[df$c == -99] <- NA
-df$d[df$d == -99] <- NA

Challenge 1. Explain why this solution is not very efficient. (e.g., if df$a[df$a == -99] <- NA contains an error, how are you going to fix it?) A solution that cannot be automated is not scalable.

4 Using a function

Let’s recall what a function is in R: input + computation + output.
If you write a function, you gain efficiency because you don’t need to copy and paste the computation part.

function(input) {
  computation
  return(output)
}

# Function
-fix_missing <- function(x) {
-  x[x == -99] <- NA
-  # This is better 
-  return(x)
-}
-
-# Apply function to each column (vector)
-df$a <- fix_missing(df$a)
-df$b <- fix_missing(df$b)
-df$c <- fix_missing(df$c)
-df$d <- fix_missing(df$d)

Challenge 2 Why is using a function more efficient than copying and pasting everything? Can you think of a way to automate the process?
Many options for automation exist in R: for loops, the apply family, etc.
Here’s a tidy solution that comes from the purrr package.
The power and joy of a one-liner.

purrr::map_df(df, fix_missing)


map() is a higher-order function that applies a given function to each element of a list/vector.

This is how map() works. It’s easier to understand with a picture.

- Input: Takes a vector/list. 
-
-- Computation: Calls the function once for each element of the vector 
-
-- Output: Returns a list, or another data format of your choice (e.g., the `_df` helper returns a data frame)
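
The three steps above can be sketched with a small example. The list `scores` and its values are made up for illustration; only `map()`, `map_dbl()`, and `map_chr()` come from purrr.

```r
library(purrr)

# Hypothetical input: a named list of numeric vectors
scores <- list(a = c(1, 2, 3), b = c(4, 5, 6))

# map() calls mean() once per element and always returns a list
means_list <- map(scores, mean)

# Typed helpers return an atomic vector of the stated type instead
means_dbl <- map_dbl(scores, mean)
means_chr <- map_chr(scores, function(x) as.character(mean(x)))
```

The input names (`a`, `b`) carry over to the output in all three cases, which makes it easy to match each result to the element it came from.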

Challenge 3 If you run the code below, what’s going to be the data type of the output?

map_chr(df, fix_missing)

##  a  b  c  d 
-## NA NA NA NA

Why is map() a good alternative to a for loop? (For more information, watch Hadley Wickham’s talk titled “The Joy of Functional Programming (for Data Science)”.)

# Built-in data 
-data("airquality")

tic()
-
-out1 <- vector("double", ncol(airquality)) # Placeholder 
-
-for (i in seq_along(airquality)) { # Sequence variable 
-  
-  out1[[i]] <- mean(airquality[[i]], na.rm = TRUE) # Assign a computation result to each element 
-  
-}
-
-toc()

## 0.006 sec elapsed

tic()
-
-out1 <- airquality %>% map_dbl(mean, na.rm = TRUE)
-
-toc()

## 0.033 sec elapsed

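In short, map() is more readable and easily extensible to other data science tasks (e.g., wrangling, modeling, and visualization) using %>%. Note that in the single run timed above, map_dbl() was not faster than the for loop (0.033 vs. 0.006 seconds); a one-off tic()/toc() measurement of such a small task is too noisy to support a speed claim either way.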
Final point: Why not base R apply family?

Short answer: purrr::map() is simpler to write. For instance,

map_dbl(x, mean, na.rm = TRUE) = vapply(x, mean, na.rm = TRUE, FUN.VALUE = double(1))
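
A quick way to check the equivalence above is to run both calls on the built-in `airquality` data and compare the results. This is only a sketch; `all.equal()` is used for the comparison.

```r
library(purrr)

data("airquality")

# purrr version: the return type is encoded in the function name
means_purrr <- map_dbl(airquality, mean, na.rm = TRUE)

# Base R version: the return type is declared through FUN.VALUE
means_base <- vapply(airquality, mean, na.rm = TRUE, FUN.VALUE = double(1))

# Both give a named double vector with one mean per column
all.equal(means_purrr, means_base)
```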

Additional tips

Performance testing (benchmarking) is an important part of programming. The tic() and toc() functions from the tictoc package measure how long it takes to run the target code once. If you want a more robust measure of timing as well as information on memory (speed and space both matter for performance), consider using the bench package, which is designed for high-precision timing of R expressions.

map_mark <- bench::mark(
-
-  out1 <- airquality %>% map_dbl(mean, na.rm = TRUE)
-
-  )
-
-map_mark

## # A tibble: 1 × 6
-##   expression                                              min   median `itr/sec`
-##   <bch:expr>                                         <bch:tm> <bch:tm>     <dbl>
-## 1 out1 <- airquality %>% map_dbl(mean, na.rm = TRUE)   62.1µs   78.1µs    11966.
-## # … with 2 more variables: mem_alloc <bch:byt>, gc/sec <dbl>

5 Applications

Many models

One popular application of map() is to run regression models (or whatever model you want to run) on list-columns. No more copying and pasting for running many regression models on subgroups!

# Have you ever tried this?
-lm_A <- lm(y ~ x, subset(data, subgroup == "group_A"))
-lm_B <- lm(y ~ x, subset(data, subgroup == "group_B"))
-lm_C <- lm(y ~ x, subset(data, subgroup == "group_C"))
-lm_D <- lm(y ~ x, subset(data, subgroup == "group_D"))
-lm_E <- lm(y ~ x, subset(data, subgroup == "group_E"))

For more information on this technique, read the Many Models subchapter of R for Data Science.

# Function
-lm_model <- function(df) {
-  lm(Temp ~ Ozone, data = df)
-}

# Map
-models <- airquality %>%
-  # Determines group variable 
-  group_by(Month) %>%
-  nest() %>% # Create list-columns
-  mutate(ols = map(data, lm_model)) # Map

# Add tidying
-tidy_lm_model <- purrr::compose( # compose multiple functions
-  broom::tidy, # convert lm objects into tidy tibbles
-  lm_model
-)
-
-tidied_models <- airquality %>%
-  group_by(Month) %>%
-  nest() %>% # Create list-columns
-  mutate(ols = map(data, tidy_lm_model))
-
-tidied_models$ols[1]

## [[1]]
-## # A tibble: 2 × 5
-##   term        estimate std.error statistic  p.value
-##   <chr>          <dbl>     <dbl>     <dbl>    <dbl>
-## 1 (Intercept)   62.9      1.61       39.2  2.88e-23
-## 2 Ozone          0.163    0.0500      3.26 3.31e- 3

Simulations

A good friend of the map() function is the rerun() function. This combination is really useful for simulations. Consider the following example.

Base R approach

set.seed(1234)
-
-small_n <- 100 ; k <- 1000 ; mu <- 500 ; sigma <- 20
-
-y_list <- rep(list(NA), k)
-
-for (i in seq(k)) {
-        
-    y_list[[i]] <- rnorm(small_n, mu, sigma)
-        
-}
-
-y_means <- unlist(lapply(y_list, mean))
-
-qplot(y_means) +
-   geom_vline(xintercept = 500, linetype = "dotted", color = "red")

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

rerun() + map()

small_n <- 100 ; k <- 1000
-
-y_tidy <- rerun(k, rnorm(small_n, mu, sigma)) 
-
-y_means_tidy <- map_dbl(y_tidy, mean)
-
-# Visualize 
-(qplot(y_means) +
-   geom_vline(xintercept = 500, linetype = "dotted", color = "red")) +
-(qplot(y_means_tidy) +
-   geom_vline(xintercept = 500, linetype = "dotted", color = "red"))

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
-## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
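
Note that rerun() has been deprecated since purrr 1.0.0. A minimal sketch of the same simulation without it, assuming the same parameter values as above, replaces rerun() with map() over the sequence 1 to k:

```r
library(purrr)

set.seed(1234)

small_n <- 100; k <- 1000; mu <- 500; sigma <- 20

# Draw k samples of size small_n; the index passed by map() is ignored
y_tidy <- map(seq_len(k), function(i) rnorm(small_n, mu, sigma))

# One mean per simulated sample
y_means_tidy <- map_dbl(y_tidy, mean)
```

The resulting `y_means_tidy` can be plotted exactly as before.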


diff --git a/lecture_notes/02_more_inputs.html b/lecture_notes/02_more_inputs.html deleted file mode 100644 index 53a5bb0..0000000 --- a/lecture_notes/02_more_inputs.html +++ /dev/null @@ -1,445 +0,0 @@
-Automate 2 or 2+ Tasks


1 Setup

library(tidyverse)

2 Objectives

Learning how to use map2() and pmap() to avoid writing nested loops.

3 Problem

Problem: How can you create something like below?

[1] “University = Berkeley | Department = waterbenders”

[1] “University = Berkeley | Department = earthbenders”

[1] “University = Berkeley | Department = firebenders”

[1] “University = Berkeley | Department = airbenders”

[1] “University = Stanford | Department = waterbenders”

[1] “University = Stanford | Department = earthbenders”

[1] “University = Stanford | Department = firebenders”

[1] “University = Stanford | Department = airbenders”

The most manual way: You can copy and paste eight times.

paste("University = Berkeley | Department = CS")

## [1] "University = Berkeley | Department = CS"

4 For loop

A slightly more efficient way: using a for loop.
Think about which part of the statement is constant and which part varies ( = parameters).
-
Do we need a placeholder? No. We don’t need a placeholder because we don’t store the result of iterations.

Challenge 1: How many parameters do you need to solve the problem below?

- Fun fact: The department names are from [Avatar: The Last Airbender](https://en.wikipedia.org/wiki/Avatar:_The_Last_Airbender).

# Outer loop for univ variable 
-
-for (univ in c("Berkeley", "Stanford")) {
-
-  # Inner loop for dept variable 
-  for (dept in c("waterbenders", "earthbenders", "firebenders", "airbenders")) {
-
-    print(paste("University = ", univ, "|", "Department = ", dept))
-
-  }
-
-}

## [1] "University =  Berkeley | Department =  waterbenders"
-## [1] "University =  Berkeley | Department =  earthbenders"
-## [1] "University =  Berkeley | Department =  firebenders"
-## [1] "University =  Berkeley | Department =  airbenders"
-## [1] "University =  Stanford | Department =  waterbenders"
-## [1] "University =  Stanford | Department =  earthbenders"
-## [1] "University =  Stanford | Department =  firebenders"
-## [1] "University =  Stanford | Department =  airbenders"

This is not bad, but … n arguments -> n nested for loops. As the scale of your problem grows, your code gets really complicated.

-
To become significantly more reliable, code must become more transparent. In particular, nested conditions and loops must be viewed with great suspicion. Complicated control flows confuse programmers. Messy code often hides bugs. — Bjarne Stroustrup
-

5 map2 & pmap

Step 1: Define inputs and a function.

Challenge 2 Why are we using rep() to create input vectors? For instance, for univ_list why not just use c("Berkeley", "Stanford")?

# Inputs (remember the length of these inputs should be identical)
-# `each = 4` repeats every university four times in a row, so that each
-# university is paired with every department exactly once
-univ_list <- rep(c("Berkeley", "Stanford"), each = 4)
-
-dept_list <- rep(c("waterbenders", "earthbenders", "firebenders", "airbenders"), 2)

# Function 
-print_lists <- function(univ, dept){
-  
-  print(paste("University = ", univ, "|", "Department = ", dept))
-  
-}

# Test 
-print_lists(univ_list[1], dept_list[1])

## [1] "University =  Berkeley | Department =  waterbenders"

Step 2: Using map2() or pmap()

# 2 arguments 
-map2_output <- map2(univ_list, dept_list,
-                    print_lists)

## [1] "University =  Berkeley | Department =  waterbenders"
-## [1] "University =  Berkeley | Department =  earthbenders"
-## [1] "University =  Berkeley | Department =  firebenders"
-## [1] "University =  Berkeley | Department =  airbenders"
-## [1] "University =  Stanford | Department =  waterbenders"
-## [1] "University =  Stanford | Department =  earthbenders"
-## [1] "University =  Stanford | Department =  firebenders"
-## [1] "University =  Stanford | Department =  airbenders"

# 3+ arguments 
-pmap_output <- pmap(list(univ_list, dept_list), print_lists)

## [1] "University =  Berkeley | Department =  waterbenders"
-## [1] "University =  Berkeley | Department =  earthbenders"
-## [1] "University =  Berkeley | Department =  firebenders"
-## [1] "University =  Berkeley | Department =  airbenders"
-## [1] "University =  Stanford | Department =  waterbenders"
-## [1] "University =  Stanford | Department =  earthbenders"
-## [1] "University =  Stanford | Department =  firebenders"
-## [1] "University =  Stanford | Department =  airbenders"

Challenge 3 Have you noticed that we used a slightly different input for pmap() compared to map() or map2()? What is the difference?
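
Instead of building the repeated input vectors by hand with rep(), one option is to generate every university-department combination first. The sketch below assumes tidyr's `expand_grid()` for that step and uses `map2_chr()` to collect the labels rather than printing them; the label format is simplified with `paste0()`.

```r
library(tidyr)
library(purrr)

# Every combination of the two inputs, one row per pair
combos <- expand_grid(
  univ = c("Berkeley", "Stanford"),
  dept = c("waterbenders", "earthbenders", "firebenders", "airbenders")
)

# map2_chr() walks over both columns in parallel and returns a character vector
labels <- map2_chr(
  combos$univ, combos$dept,
  function(univ, dept) paste0("University = ", univ, " | Department = ", dept)
)

labels
```

Because the combinations are generated rather than typed out, adding a third university or a fifth department only requires changing the input vectors.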


diff --git a/lecture_notes/03_map_glue.html b/lecture_notes/03_map_glue.html deleted file mode 100644 index 2a9bb81..0000000 --- a/lecture_notes/03_map_glue.html +++ /dev/null @@ -1,426 +0,0 @@
-Automate Plotting


1 Setup

library(tidyverse)
-library(glue)

2 Objective

Learning how to use map() and glue() to automate creating multiple plots

3 Problem

Making the following data visualization process more efficient.

data("airquality")
-
-airquality


airquality %>%
-    ggplot(aes(x = Ozone, y = Solar.R)) +
-    geom_point() +
-    labs(title = "Relationship between Ozone and Solar.R",
-         y = "Solar.R")

## Warning: Removed 42 rows containing missing values (geom_point).

airquality %>%
-    ggplot(aes(x = Ozone, y = Wind)) +
-    geom_point() +
-    labs(title = "Relationship between Ozone and Wind",
-         y = "Wind")

## Warning: Removed 37 rows containing missing values (geom_point).

airquality %>%
-    ggplot(aes(x = Ozone, y = Temp)) +
-    geom_point() +
-    labs(title = "Relationship between Ozone and Temp",
-         y = "Temp")

## Warning: Removed 37 rows containing missing values (geom_point).

4 Solution

Learn how glue() works.
glue() combines strings and objects, and it is simpler to use than paste() or sprintf().

names <- c("Jae", "Aniket", "Avery")
-
-fields <- c("Political Science", "Law", "Public Health")
-
-library(glue)
-glue("{names} studies {fields}.")

## Jae studies Political Science.
-## Aniket studies Law.
-## Avery studies Public Health.

So, our next step is to combine glue() and map().
Let’s first think about writing a function that includes glue().

Challenge 1 How can you create the character vector of column names?

Challenge 2 How can you make ggplot() take strings as x and y variable names?

airquality %>%
-      ggplot(aes(x = .data[[names(airquality)[1]]], y = .data[[names(airquality)[2]]]))+
-    geom_point()+
-    labs(title = glue("Relationship between Ozone and {names(airquality)[2]}"),
-         y = glue("{names(airquality)[2]}"))

The next step is to write an automatic plotting function.
-
- Note that in the function i (abstract argument) replaced 2 (specific number).

create_point_plot <- function(i){
-  
-  p <- airquality %>%
-    ggplot(aes(x = .data[[names(airquality)[1]]], y = .data[[names(airquality)[i]]]))+
-    geom_point() +
-    labs(title = glue("Relationship between Ozone and {names(airquality)[i]}"),
-         y = glue("{names(airquality)[i]}"))
-  print(p)
-}

The final step is to put the function in walk().

walk(2:ncol(airquality), create_point_plot)


diff --git a/lecture_notes/04_reduce_bind.html b/lecture_notes/04_reduce_bind.html deleted file mode 100644 index 18f3772..0000000 --- a/lecture_notes/04_reduce_bind.html +++ /dev/null @@ -1,379 +0,0 @@
-Automate Binding


1 Setup

library(tidyverse)

2 Objective

Learning how to use reduce() to automate row-binding multiple dataframes

3 Problem

How can you make row-binding multiple dataframes more efficient?

df1 <- tibble(x = sample(1:10, size = 3, replace = TRUE),
-       y = sample(1:10, size = 3, replace = TRUE),
-       z = sample(1:10, size = 3, replace = TRUE))
-
-df2 <- tibble(x = sample(1:10, size = 3, replace = TRUE),
-       y = sample(1:10, size = 3, replace = TRUE),
-       z = sample(1:10, size = 3, replace = TRUE))
-
-df3 <- tibble(x = sample(1:10, size = 3, replace = TRUE),
-       y = sample(1:10, size = 3, replace = TRUE),
-       z = sample(1:10, size = 3, replace = TRUE))

4 Copy and paste

first_bind <- bind_rows(df1, df2)
-
-second_bind <- bind_rows(first_bind, df3)

Challenge: Why is the above solution not efficient?

5 reduce

How reduce() works.

- Input: Takes a vector of length n
-
-- Computation: Calls a function with a pair of values at a time
-
-- Output: Returns a vector of length 1

reduced <- reduce(list(df1, df2, df3), bind_rows)
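
To make the pairwise computation concrete, here is a small sketch. The first call folds a numeric vector with `+`; the second applies the same idea to joins, using three made-up data frames that share a hypothetical `id` column.

```r
library(purrr)
library(dplyr)
library(tibble)

# reduce() computes ((1 + 2) + 3) + 4
total <- reduce(1:4, `+`)

# Hypothetical data frames sharing an id column
df_a <- tibble(id = 1:3, x = c(10, 20, 30))
df_b <- tibble(id = 1:3, y = c("a", "b", "c"))
df_c <- tibble(id = 1:3, z = c(TRUE, FALSE, TRUE))

# Join the first two, then join that result with the third
joined <- reduce(
  list(df_a, df_b, df_c),
  function(left, right) full_join(left, right, by = "id")
)
```

The same call works unchanged for a list of any length, which is the advantage over chaining the joins by hand.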


diff --git a/lecture_notes/05_slower_faster.html b/lecture_notes/05_slower_faster.html deleted file mode 100644 index 6dd67e0..0000000 --- a/lecture_notes/05_slower_faster.html +++ /dev/null @@ -1,457 +0,0 @@
-Make Automation Slower or Faster


1 Setup

library(tidyverse)
-library(tictoc)
-library(furrr)

2 Objectives

Learning how to use slowly() and the future_ family of functions (from furrr) to make an automation process either slower or faster

3 How to Make Automation Slower

Suppose you are scraping 50 pages from a website and you don’t want to overload the server. How can you do that?

4 For loop

5 Map

walk() works the same as map() but doesn’t store its output.
If you’re web scraping, one problem with this approach is that it’s too fast by human standards.

tic()
-walk(1:50, function(x){message("Scraping page", x)}) # Anonymous function; I don't name the function

## Scraping page1

## Scraping page2

## Scraping page3

## Scraping page4

## Scraping page5

## Scraping page6

## Scraping page7

## Scraping page8

## Scraping page9

## Scraping page10

## Scraping page11

## Scraping page12

## Scraping page13

## Scraping page14

## Scraping page15

## Scraping page16

## Scraping page17

## Scraping page18

## Scraping page19

## Scraping page20

## Scraping page21

## Scraping page22

## Scraping page23

## Scraping page24

## Scraping page25

## Scraping page26

## Scraping page27

## Scraping page28

## Scraping page29

## Scraping page30

## Scraping page31

## Scraping page32

## Scraping page33

## Scraping page34

## Scraping page35

## Scraping page36

## Scraping page37

## Scraping page38

## Scraping page39

## Scraping page40

## Scraping page41

## Scraping page42

## Scraping page43

## Scraping page44

## Scraping page45

## Scraping page46

## Scraping page47

## Scraping page48

## Scraping page49

## Scraping page50

toc()

## 0.007 sec elapsed

If you want to make the function run slowly …

-
slowly() takes a function and modifies it to wait a given amount of time between each call. - purrr package vignette
-

If a function is a verb, then a helper function is an adverb (modifying the behavior of the verb).
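
A minimal sketch of slowly() applied to the scraping example above. The function `scrape_page()` is a hypothetical stand-in that only prints a message, and the 0.1-second pause is an arbitrary choice.

```r
library(purrr)

# Hypothetical stand-in for a real scraping function
scrape_page <- function(page) {
  message("Scraping page ", page)
  page
}

# slowly() returns a modified function that pauses between calls
slow_scrape <- slowly(scrape_page, rate = rate_delay(pause = 0.1))

# Same kind of call as before, now rate-limited
scraped <- map_int(1:3, slow_scrape)
```

The original `scrape_page()` is left untouched; the rate limit lives entirely in the wrapper, so it can be tuned or removed without editing the scraping logic.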

6 How to Make Automation Faster

In a different situation, you may want to make your function run faster. This is common when you collect and analyze data at a large scale. You can solve this problem using parallel processing. For more on parallel processing in R, read this review.

Parallel processing setup
-
- Step 1: Determine the number of max workers (availableCores())
- Step 2: Determine the parallel processing mode (plan())

# Setup 
-n_cores <- availableCores() - 1
-n_cores # This number depends on your computer spec.

## system 
-##      3

plan(multiprocess, # multicore, if supported, otherwise multisession
-     workers = n_cores) # the maximum number of workers

## Warning: Strategy 'multiprocess' is deprecated in future (>= 1.20.0). Instead,
-## explicitly specify either 'multisession' or 'multicore'. In the current R
-## session, 'multiprocess' equals 'multisession'.

## Warning in supportsMulticoreAndRStudio(...): [ONE-TIME WARNING] Forked
-## processing ('multicore') is not supported when running R from RStudio
-## because it is considered unstable. For more details, how to control forked
-## processing or not, and how to silence this warning in future R sessions, see ?
-## parallelly::supportsMulticore
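
As the warning above notes, the `multiprocess` strategy is deprecated. A sketch of the same setup with an explicit `multisession` plan follows; the worker count of 2 and the squaring task are arbitrary choices for illustration.

```r
library(furrr) # also attaches the future package

# Start background R sessions explicitly instead of using multiprocess
plan(multisession, workers = 2)

# future_map_dbl() mirrors map_dbl() but distributes the calls across workers
squares <- future_map_dbl(1:4, function(x) x^2)

# Return to sequential processing and shut the workers down
plan(sequential)
```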

tic()
-mean100 <- map(1:100000, mean)
-toc()

## 0.392 sec elapsed

tic()
-mean100 <- future_map(1:100000, mean)
-toc()

## 0.657 sec elapsed

Note that in this toy example future_map() is slower than map(): computing the mean of a single number is so cheap that the overhead of sending work to parallel workers outweighs any gain. Parallel processing pays off when each individual task is expensive.


diff --git a/lecture_notes/06_make_error_handling_easier.html b/lecture_notes/06_make_error_handling_easier.html deleted file mode 100644 index 774d522..0000000 --- a/lecture_notes/06_make_error_handling_easier.html +++ /dev/null @@ -1,455 +0,0 @@
-Make Error Handling Easier


1 Setup

library(tidyverse)
-library(rvest)

2 Learning objective

Learning how to use safely() and possibly() to make error handling easier

3 Problem

Challenge 1

Explain why we can’t run map(url_lists, read_html)

url_lists <- c("https://en.wikipedia.org/wiki/University_of_California,_Berkeley",
-"https://en.wikipedia.org/wiki/Stanford_University",
-"https://en.wikipedia.org/wiki/Carnegie_Mellon_University",
-"https://DLAB" 
-)

map(url_lists, read_html)

This is a toy example so it’s easy to tell where the problem is. How can you make your error more informative?

4 Solution

4.1 Try-catch

R signals three kinds of conditions, each raised by one of the following functions.
-
- stop(): errors; the function must stop.
- warning(): warnings; the function may still work, but something may have gone wrong.
- message(): messages; the function reports that some action happened.
The basic logic of tryCatch(), base R’s error-handling function, works as follows.

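tryCatch(
-  {
-    map(url_lists, read_html) # the expression to evaluate
-  }, warning = function(w) {
-    "Warning" # returned if a warning is signaled
-  }, error = function(e) {
-    "Error" # returned if an error is signaled
-  }, finally = {
-    "Message" # always evaluated on exit; its value is not returned
-})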

## [1] "Error"
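One way to make the error more informative is to use the condition object that tryCatch() passes to each handler. The sketch below relies only on base R. describe_condition() is a hypothetical helper name, and the messages are made up for illustration.

```r
# Hypothetical helper: evaluates `expr` and reports which kind of condition,
# if any, was signaled, together with its message.
describe_condition <- function(expr) {
  tryCatch(
    expr, # returned unchanged when no condition is signaled
    message = function(m) paste("Message:", trimws(conditionMessage(m))),
    warning = function(w) paste("Warning:", conditionMessage(w)),
    error   = function(e) paste("Error:", conditionMessage(e))
  )
}

describe_condition(message("download started"))
describe_condition(warning("the page is very large"))
describe_condition(stop("the URL is unreachable"))
describe_condition(1 + 1)
```

conditionMessage() extracts the text of the condition, so the handler can report what went wrong instead of returning a fixed string.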

Here is the purrr version of the try-catch mechanism (it evaluates code and assigns exception handlers).

4.2 safely and possibly

Outputs of safely()

result: the result, or NULL if an error occurred
error: NULL, or the error object if an error occurred

test <- map(url_lists, safely(read_html))
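To see what safely() returns without going to the network, the sketch below wraps log() instead of read_html(); the inputs are made up for illustration. Each element of the output is a list with a result and an error component: when the call succeeds, error is NULL, and when it fails, result is NULL.

```r
library(purrr)

safe_log <- safely(log) # wrapped version of log() that captures errors

res <- map(list(10, "a"), safe_log) # log(10) succeeds; log("a") fails

res[[1]]$result # the value of log(10)
res[[1]]$error  # NULL, because no error occurred

res[[2]]$result                  # NULL, because log("a") failed
conditionMessage(res[[2]]$error) # the error message captured by safely()

# Split the successes from the failures
results <- map(res, "result")
errors  <- map(res, "error")
failed  <- map_lgl(errors, ~ !is.null(.x)) # TRUE where an error was captured
```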

The easiest way to solve this problem is to simply ignore the error.

map(url_lists, safely(read_html)) %>%
-  map("result") %>% # = map(function(x) x[["result"]]) = map(~.x[["result"]])
-  purrr::compact() # Remove empty (NULL) elements

## [[1]]
-## {html_document}
-## <html class="client-nojs" lang="en" dir="ltr">
-## [1] <head>\n<meta http-equiv="Content-Type" content="text/html; charset=UTF-8 ...
-## [2] <body class="mediawiki ltr sitedir-ltr mw-hide-empty-elt ns-0 ns-subject  ...
-## 
-## [[2]]
-## {html_document}
-## <html class="client-nojs" lang="en" dir="ltr">
-## [1] <head>\n<meta http-equiv="Content-Type" content="text/html; charset=UTF-8 ...
-## [2] <body class="mediawiki ltr sitedir-ltr mw-hide-empty-elt ns-0 ns-subject  ...
-## 
-## [[3]]
-## {html_document}
-## <html class="client-nojs" lang="en" dir="ltr">
-## [1] <head>\n<meta http-equiv="Content-Type" content="text/html; charset=UTF-8 ...
-## [2] <body class="mediawiki ltr sitedir-ltr mw-hide-empty-elt ns-0 ns-subject  ...

4.3 possibly

What if ignoring the error is not the best way to solve the problem?

# If an error occurs, "The URL is broken." is stored in that element.
-out <- map(url_lists,
-           possibly(read_html,
-                    otherwise = "The URL is broken."))
-
-out

## [[1]]
-## {html_document}
-## <html class="client-nojs" lang="en" dir="ltr">
-## [1] <head>\n<meta http-equiv="Content-Type" content="text/html; charset=UTF-8 ...
-## [2] <body class="mediawiki ltr sitedir-ltr mw-hide-empty-elt ns-0 ns-subject  ...
-## 
-## [[2]]
-## {html_document}
-## <html class="client-nojs" lang="en" dir="ltr">
-## [1] <head>\n<meta http-equiv="Content-Type" content="text/html; charset=UTF-8 ...
-## [2] <body class="mediawiki ltr sitedir-ltr mw-hide-empty-elt ns-0 ns-subject  ...
-## 
-## [[3]]
-## {html_document}
-## <html class="client-nojs" lang="en" dir="ltr">
-## [1] <head>\n<meta http-equiv="Content-Type" content="text/html; charset=UTF-8 ...
-## [2] <body class="mediawiki ltr sitedir-ltr mw-hide-empty-elt ns-0 ns-subject  ...
-## 
-## [[4]]
-## [1] "The URL is broken."

# Let's find the broken URL.
-url_lists[map_lgl(out, ~ identical(.x, "The URL is broken."))]

## [1] "https://DLAB"
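The same pattern can be tried offline. The sketch below, with made-up inputs, wraps log10() in possibly() and uses NA as the fallback value, so the output stays a numeric vector and the failed inputs can be located with is.na().

```r
library(purrr)

inputs <- list(1, "a", 100) # the second element cannot be passed to log10()

safe_log10 <- possibly(log10, otherwise = NA_real_) # NA stands in for a failure

out <- map_dbl(inputs, safe_log10)

failed <- is.na(out) # TRUE where the fallback value was used
inputs[failed]       # the inputs that caused an error
```

This only works when the wrapped function never returns the fallback value on its own; if NA can be a legitimate result, safely() is the safer choice because it keeps the error object.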


diff --git a/lecture_notes/libs/header-attrs-2.3/header-attrs.js b/lecture_notes/libs/header-attrs-2.3/header-attrs.js
deleted file mode 100644
index dd57d92..0000000
--- a/lecture_notes/libs/header-attrs-2.3/header-attrs.js
+++ /dev/null
@@ -1,12 +0,0 @@
-// Pandoc 2.9 adds attributes on both header and div. We remove the former (to
-// be compatible with the behavior of Pandoc < 2.8).
-document.addEventListener('DOMContentLoaded', function(e) {
-  var hs = document.querySelectorAll("div.section[class*='level'] > :first-child");
-  var i, h, a;
-  for (i = 0; i < hs.length; i++) {
-    h = hs[i];
-    if (!/^h[1-6]$/i.test(h.tagName)) continue; // it should be a header h1-h6
-    a = h.attributes;
-    while (a.length > 0) h.removeAttribute(a[0].name);
-  }
-});
diff --git a/lecture_notes/libs/remark-css-0.0.1/default-fonts.css b/lecture_notes/libs/remark-css-0.0.1/default-fonts.css
deleted file mode 100644
index 8d035fa..0000000
--- a/lecture_notes/libs/remark-css-0.0.1/default-fonts.css
+++ /dev/null
@@ -1,10 +0,0 @@
-@import url(https://fonts.googleapis.com/css?family=Yanone+Kaffeesatz);
-@import url(https://fonts.googleapis.com/css?family=Droid+Serif:400,700,400italic);
-@import url(https://fonts.googleapis.com/css?family=Source+Code+Pro:400,700);
-
-body { font-family: 'Droid Serif', 'Palatino Linotype', 'Book Antiqua', Palatino, 'Microsoft YaHei', 'Songti SC', serif; }
-h1, h2, h3 {
-  font-family: 'Yanone Kaffeesatz';
-  font-weight: normal;
-}
-.remark-code, .remark-inline-code { font-family: 'Source Code Pro', 'Lucida Console', Monaco, monospace; }
diff --git a/lecture_notes/libs/remark-css-0.0.1/default.css b/lecture_notes/libs/remark-css-0.0.1/default.css
deleted file mode 100644
index cb9fc34..0000000
--- a/lecture_notes/libs/remark-css-0.0.1/default.css
+++ /dev/null
@@ -1,72 +0,0 @@
-a, a > code {
-  color: rgb(249, 38, 114);
-  text-decoration: none;
-}
-.footnote {
-  position: absolute;
-  bottom: 3em;
-  padding-right: 4em;
-  font-size: 90%;
-}
-.remark-code-line-highlighted { background-color: #ffff88; }
-
-.inverse {
-  background-color: #272822;
-  color: #d6d6d6;
-  text-shadow: 0 0 20px #333;
-}
-.inverse h1, .inverse h2, .inverse h3 {
-  color: #f3f3f3;
-}
-/* Two-column layout */
-.left-column {
-  color: #777;
-  width: 20%;
-  height: 92%;
-  float: left;
-}
-.left-column h2:last-of-type, .left-column h3:last-child {
-  color: #000;
-}
-.right-column {
-  width: 75%;
-  float: right;
-  padding-top: 1em;
-}
-.pull-left {
-  float: left;
-  width: 47%;
-}
-.pull-right {
-  float: right;
-  width: 47%;
-}
-.pull-right ~ * {
-  clear: both;
-}
-img, video, iframe {
-  max-width: 100%;
-}
-blockquote {
-  border-left: solid 5px lightgray;
-  padding-left: 1em;
-}
-.remark-slide table {
-  margin: auto;
-  border-top: 1px solid #666;
-  border-bottom: 1px solid #666;
-}
-.remark-slide table thead th { border-bottom: 1px solid #ddd; }
-th, td { padding: 5px; }
-.remark-slide thead, .remark-slide tfoot, .remark-slide tr:nth-child(even) { background: #eee }
-
-@page { margin: 0; }
-@media print {
-  .remark-slide-scaler {
-    width: 100% !important;
-    height: 100% !important;
-    transform: scale(1) !important;
-    top: 0 !important;
-    left: 0 !important;
-  }
-}
diff --git a/lecture_notes/01_why_map.Rmd b/lessons/01_why_map.Rmd
similarity index 100%
rename from lecture_notes/01_why_map.Rmd
rename to lessons/01_why_map.Rmd
diff --git a/lecture_notes/02_more_inputs.Rmd b/lessons/02_more_inputs.Rmd
similarity index 100%
rename from lecture_notes/02_more_inputs.Rmd
rename to lessons/02_more_inputs.Rmd
diff --git a/lecture_notes/03_map_glue.Rmd b/lessons/03_map_glue.Rmd
similarity index 100%
rename from lecture_notes/03_map_glue.Rmd
rename to lessons/03_map_glue.Rmd
diff --git a/lecture_notes/04_reduce_bind.Rmd b/lessons/04_reduce_bind.Rmd
similarity index 100%
rename from lecture_notes/04_reduce_bind.Rmd
rename to lessons/04_reduce_bind.Rmd
diff --git a/lecture_notes/05_slower_faster.Rmd b/lessons/05_slower_faster.Rmd
similarity index 100%
rename from lecture_notes/05_slower_faster.Rmd
rename to lessons/05_slower_faster.Rmd
diff --git a/lecture_notes/06_make_error_handling_easier.Rmd b/lessons/06_make_error_handling_easier.Rmd
similarity index 100%
rename from lecture_notes/06_make_error_handling_easier.Rmd
rename to lessons/06_make_error_handling_easier.Rmd
diff --git a/runtime.txt b/runtime.txt
deleted file mode 100644
index 6aa25e0..0000000
--- a/runtime.txt
+++ /dev/null
@@ -1 +0,0 @@
-r-3.6-2020-06-03