Tidyverse: Core Packages in Tidyverse

The document discusses the tidyverse package in R which contains packages for data science tasks like wrangling, visualization, and modeling. It provides examples of using dplyr to select, filter, group, and summarize data; using tidyr to gather, spread, separate, and unite data; using ggplot2 to create density and scatter plots; and using stargazer and lfe packages to output regression tables and fit linear models with group fixed effects.

Uploaded by

Abhishek
Tidyverse

The tidyverse is a collection of R packages for data science that share a common design and
work well together. Its packages cover the main steps of working with data, such as importing,
subsetting, transforming, and visualizing it.

● Start by installing and loading tidyverse:

install.packages("tidyverse")  # one-time download from CRAN
library(tidyverse)             # attach the core packages in each new session

Core packages in Tidyverse

Data Wrangling & Transformations | Data Import & Management | Data Visualization
dplyr                            | tibble                   | ggplot2
tidyr                            | readr                    |
stringr                          |                          |
forcats                          |                          |

dplyr

Key functions dplyr offers:

● select(): Select columns from your dataset
● filter(): Keep only the rows that meet your criteria
● group_by(): Group observations by one or more variables. The data itself does not change;
the result is a grouped data frame, and later steps such as summarise() run once per group
● summarise(): Collapse the data (or each group) into summary values such as a mean or a count
● arrange(): Sort rows by one or more columns, in ascending or descending order
● left_join(), right_join(), full_join(), inner_join(): Combine two tables on a shared key column
● mutate(): Create new columns while preserving the existing ones
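
The verbs listed above are usually chained with the pipe operator. The following is a minimal sketch on a small made-up data frame; the object names orders and by_type and all values are hypothetical, and only the column names are borrowed from the food demand dataset.

```r
library(dplyr)

# Hypothetical toy data; column names mimic the food demand dataset
orders <- data.frame(
  center_id      = c(10, 10, 11, 11, 12),
  center_type    = c("TYPE_A", "TYPE_A", "TYPE_B", "TYPE_B", "TYPE_A"),
  checkout_price = c(100, 150, 200, 120, 90),
  base_price     = c(120, 150, 220, 150, 100),
  num_orders     = c(20, 35, 10, 40, 55)
)

by_type <- orders %>%
  filter(num_orders >= 20) %>%                        # keep rows with at least 20 orders
  mutate(discount = base_price - checkout_price) %>%  # add a column, keep the others
  group_by(center_type) %>%                           # one group per center type
  summarise(avg_orders   = mean(num_orders),
            avg_discount = mean(discount)) %>%        # one row per group
  arrange(desc(avg_orders))                           # largest average first

by_type
```

Each step receives the result of the previous one, so the pipeline reads top to bottom in the order the operations are applied.
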
Throughout this subsection, the code examples use the food demand dataset, which
can be accessed here: https://www.kaggle.com/shivashi11/food-demand-prediction

Code examples:

library(dplyr)
# Assumes data (order records) and fc (center details) were read in beforehand
joined_data <- left_join(data, fc, by = "center_id")
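
The join functions differ only in which rows they keep. A small sketch on two made-up tables illustrates this; demand and centers are hypothetical stand-ins, not the actual Kaggle files.

```r
library(dplyr)

# Hypothetical miniature versions of the two tables
demand  <- data.frame(center_id = c(10, 11, 13), num_orders = c(20, 35, 50))
centers <- data.frame(center_id = c(10, 11, 12),
                      center_type = c("TYPE_A", "TYPE_B", "TYPE_C"))

left  <- left_join(demand, centers, by = "center_id")   # every row of demand
inner <- inner_join(demand, centers, by = "center_id")  # only ids present in both
full  <- full_join(demand, centers, by = "center_id")   # every id from either table

left   # center 13 has no match, so its center_type is NA
```
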

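Let's chain three dplyr functions to compute the average number of orders for one center type.

# center_type is assumed to come from fc, so use the joined table
joined_data %>%
  select(center_type, num_orders) %>%
  filter(center_type == "TYPE_A") %>%
  summarise(avg_A = mean(num_orders))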

tidyr

Key functions tidyr offers:

● gather(): “Gathers” multiple columns from your dataset and converts them into key-value
pairs, turning wide data into long data. Newer tidyr versions offer pivot_longer() for this
● spread(): Takes a key column and a value column and “spreads” them into multiple columns,
turning long data into wide data. Newer tidyr versions offer pivot_wider() for this
● separate(): Splits a single column into several columns
● unite(): The opposite of separate(). It combines two or more columns into one
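
As a rough illustration of the functions listed above, the sketch below reshapes a made-up wide table to long form, splits a column, and glues it back together. The data frame and its values are hypothetical, and pivot_longer() is used in place of gather().

```r
library(tidyr)

# Hypothetical toy data in wide form: one column per week
wide <- data.frame(center = c("A_north", "B_south"),
                   week1 = c(10, 20),
                   week2 = c(30, 40))

# Wide -> long: the week columns become name/value pairs
long <- pivot_longer(wide, cols = c(week1, week2),
                     names_to = "week", values_to = "num_orders")

# Split one column into two, then combine them again
parts  <- separate(long, center, into = c("type", "region"), sep = "_")
joined <- unite(parts, "center", type, region, sep = "_")
```

After the round trip, joined has the same center values as long, which shows that unite() undoes separate() when the same separator is used.
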

The code below unites two binary variables into a single column:

library(tidyr)
# The two values are pasted together with "_" (the default separator)
data %>%
  unite("email_home", emailer_for_promotion, homepage_featured) %>%
  head()

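Another example of how tidyr works:

# Named toy so that the food demand data is not overwritten
toy <- data.frame(variable1 = rep(LETTERS[1:3], each = 3),
                  variable2 = rep(paste0("factor", c(1, 2, 3)), 3),
                  num = 1:9)
head(toy)
# Turn the values of variable2 into columns filled with num
spread(toy, variable2, num)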
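
The same reshaping can be written with pivot_wider(), and pivot_longer() reverses it. The sketch below rebuilds the small example with the same values so that it runs on its own; the object names toy, wide and long are illustrative.

```r
library(tidyr)

# Same values as the small example above
toy <- data.frame(variable1 = rep(LETTERS[1:3], each = 3),
                  variable2 = rep(paste0("factor", c(1, 2, 3)), 3),
                  num = 1:9)

# pivot_wider() is the newer equivalent of spread()
wide <- pivot_wider(toy, names_from = variable2, values_from = num)

# pivot_longer() (like gather()) takes the wide table back to long form
long <- pivot_longer(wide, cols = starts_with("factor"),
                     names_to = "variable2", values_to = "num")

wide
```

The wide table has one row per value of variable1 and one column per factor, so nine rows become three rows and four columns.
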

ggplot2

ggplot2 is widely used to produce charts and visualizations because of its ease of use and
flexibility.

The example below creates a density chart:

library(ggplot2)  # already installed as part of tidyverse

# Distribution of the number of orders
ggplot(data, aes(x = num_orders)) +
  geom_density(adjust = 1, fill = "#0c4c8a") +
  theme_minimal()

The example below creates a scatterplot:

# Checkout price against base price, one point per row
ggplot(data, aes(x = checkout_price, y = base_price)) +
  geom_point(color = "#1f9e89") +
  theme_minimal()
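
Both plots follow the same pattern: a ggplot() call that sets the data and aesthetics, followed by layers added with +. To try the pattern without downloading the Kaggle files, here is a sketch that uses the built-in mtcars data instead; the choice of columns and labels is only illustrative.

```r
library(ggplot2)

# mtcars ships with R, so this runs without any external files
p <- ggplot(mtcars, aes(x = wt, y = mpg, colour = factor(cyl))) +
  geom_point() +                                   # one point per car
  labs(x = "Weight (1000 lbs)", y = "Miles per gallon",
       colour = "Cylinders") +                     # readable axis and legend titles
  theme_minimal()

p  # printing the object draws the plot
```

Storing the plot in an object makes it possible to add further layers later or to pass it to ggsave().
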

Stargazer

Stargazer is an R package that creates LaTeX code, HTML code, and ASCII text for well-formatted
regression tables, with multiple models side by side, as well as for summary statistics
tables, data frames, vectors, and matrices.

Stargazer excels in at least three respects: its ease of use, the large number of models it
supports, and its beautiful aesthetics.

● One can install stargazer from CRAN in the usual way:

install.packages("stargazer")
library(stargazer)

● To create a summary statistics table from the ‘attitude’ data frame (which ships with
the default installation of R), simply run the following:

stargazer(attitude)

● To output the contents of the first four rows of some data frame, specify the part of the
data frame you would like to see, and set the summary option to FALSE:

stargazer(attitude[1:4,], summary=FALSE, rownames=FALSE)
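
By default stargazer() prints LaTeX code. The sketch below uses the type argument to print a plain-text table in the console instead; the file name in the commented-out line is hypothetical.

```r
library(stargazer)

# type = "text" prints an ASCII table instead of the default LaTeX code
txt <- stargazer(attitude, type = "text")

# type = "html" combined with out = writes the table to a file
# ("attitude.html" is a hypothetical file name)
# stargazer(attitude, type = "html", out = "attitude.html")
```

stargazer() also returns the printed lines invisibly as a character vector, which is what txt holds here.
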


Now, let us create a simple regression table with three side-by-side models: two Ordinary
Least Squares (OLS) models and one probit regression model, fitted with the lm() and glm()
functions. Setting the align argument to TRUE aligns the coefficients in each column along
the decimal point.

stargazer(linear.1, linear.2, probit.model, title="Results", align=TRUE)
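
The stargazer() call above refers to three fitted models that are not defined in this text. A minimal sketch of how they could be fitted on the attitude data follows; the formulas and the 70-point threshold for the binary outcome are assumptions chosen for illustration.

```r
library(stargazer)

# Binary outcome for the probit model: TRUE when the overall rating exceeds 70
# (threshold and regressors are illustrative assumptions)
attitude$high.rating <- (attitude$rating > 70)

# Two OLS models with different sets of regressors
linear.1 <- lm(rating ~ complaints + privileges + learning + raises + critical,
               data = attitude)
linear.2 <- lm(rating ~ complaints + privileges + learning, data = attitude)

# Probit model fitted with glm() and a probit link
probit.model <- glm(high.rating ~ learning + critical + advance,
                    data = attitude, family = binomial(link = "probit"))

# type = "text" shows the table in the console; omit it to get LaTeX code
stargazer(linear.1, linear.2, probit.model, title = "Results",
          align = TRUE, type = "text")
```
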

lfe

lfe (Linear Group Fixed Effects) is a package intended for linear models with multiple group
fixed effects, i.e. with two or more factors with a large number of levels. It serves a similar
purpose to lm(), but it uses a special method for projecting the group fixed effects out of the
normal equations, which makes it faster in such cases. It is a generalization of the within
estimator. This may be required if the groups have high cardinality (many levels), which would
otherwise result in tens or hundreds of thousands of dummy variables. It is also useful if one
only wants to control for the group effects without actually estimating them.
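
To make the idea of the within estimator concrete, the sketch below uses base R only and a single grouping factor. It is not how lfe is implemented; it merely shows that subtracting group means gives the same slope as a regression with one dummy per group. All names and the simulated data are hypothetical.

```r
set.seed(1)

# Simulated data with one group fixed effect
n     <- 200
group <- factor(sample(5, n, replace = TRUE))
x     <- rnorm(n)
y     <- 2 * x + rnorm(nlevels(group))[group] + rnorm(n)

# Within estimator: subtract group means, then regress without an intercept
x_dm <- x - ave(x, group)
y_dm <- y - ave(y, group)
within_fit <- lm(y_dm ~ x_dm - 1)

# Dummy-variable regression: estimates one coefficient per group level
dummy_fit <- lm(y ~ x + group)

# Both approaches give the same slope on x
c(within = unname(coef(within_fit)["x_dm"]),
  dummy  = unname(coef(dummy_fit)["x"]))
```

With one factor, demeaning is enough. With several factors of many levels, the dummy approach becomes expensive, which is the case lfe targets.
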

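Code example:

library(lfe)

oldopts <- options(lfe.threads = 1)  # use one thread; keep the old setting

# Simulate two covariates and three grouping factors
x <- rnorm(1000)
x2 <- rnorm(length(x))
id <- factor(sample(10, length(x), replace = TRUE))
firm <- factor(sample(3, length(x), replace = TRUE, prob = c(2, 1.5, 1)))
year <- factor(sample(10, length(x), replace = TRUE, prob = c(2, 1.5, rep(1, 8))))

# One random effect per level of each factor
id.eff <- rnorm(nlevels(id))
firm.eff <- rnorm(nlevels(firm))
year.eff <- rnorm(nlevels(year))

# Outcome: covariates plus the three group effects plus noise
y <- x + 0.25 * x2 + id.eff[id] + firm.eff[firm] +
  year.eff[year] + rnorm(length(x))

# Factors after | are projected out instead of being estimated as dummies
est <- felm(y ~ x + x2 | id + firm + year)
summary(est)

# Recover the group effects with standard errors
getfe(est, se = TRUE)

# compare with an ordinary lm
summary(lm(y ~ x + x2 + id + firm + year - 1))
options(oldopts)  # restore the previous options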
