0% found this document useful (0 votes)

8 views32 pages

Data Wrangling

This document introduces the dplyr package from the tidyverse for data manipulation in R, focusing on filtering data frames and using the pipe operator (%>%) to link functions. It provides instructions on loading dplyr, importing data, and subsetting data using both base R and dplyr functions like select() and filter(). The document also includes examples of selecting and renaming columns from a differential expression results dataset.

Uploaded by

jeremybravo4229

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

8 views32 pages

Data Wrangling

Uploaded by

jeremybravo4229

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 32

70 Introduction to dplyr and the %>%

Introduction to dplyr and the %>%

Objectives
Today we will begin to wrangle data using the tidyverse package, dplyr. To this end, you will
learn:

1. how to filter data frames using dplyr

2. how to employ the pipe (%>%) operator to link functions

What is dplyr?
The package dplyr tries to provide easy tools for the most common data
manipulation tasks. It was built to work directly with data frames. The thinking
behind it was largely inspired by the package plyr which has been in use for some
time but suffered from being slow in some cases. --- datacarpentry.com (https://
datacarpentry.org/genomics-r-intro/05-dplyr/index.html)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.

## ✔ dplyr 1.1.3 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ ggplot2 3.4.4 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.0
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts(
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all co

Bioinformatics Training and Education Program

71 Introduction to dplyr and the %>%

Importing data
For this lesson, we will use sample metadata and differential expression results from the
airway RNA-Seq project.

Let's begin by importing the data.

#sample information
smeta<-read_delim("./data/airway_sampleinfo.txt")

## Rows: 8 Columns: 9
## ── Column specification ────────────────────────────────────────────────────
## Delimiter: "\t"
## chr (8): SampleName, cell, dex, albut, Run, Experiment, Sample, BioSample
## dbl (1): avgLength
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this mess

smeta

## # A tibble: 8 × 9
## SampleName cell dex albut Run avgLength Experiment Sample BioSa
## <chr> <chr> <chr> <chr> <chr> <dbl> <chr> <chr> <chr>
## 1 GSM1275862 N61311 untrt untrt SRR10395… 126 SRX384345 SRS50… SAMN0
## 2 GSM1275863 N61311 trt untrt SRR10395… 126 SRX384346 SRS50… SAMN0
## 3 GSM1275866 N052611 untrt untrt SRR10395… 126 SRX384349 SRS50… SAMN0
## 4 GSM1275867 N052611 trt untrt SRR10395… 87 SRX384350 SRS50… SAMN0
## 5 GSM1275870 N080611 untrt untrt SRR10395… 120 SRX384353 SRS50… SAMN0
## 6 GSM1275871 N080611 trt untrt SRR10395… 126 SRX384354 SRS50… SAMN0
## 7 GSM1275874 N061011 untrt untrt SRR10395… 101 SRX384357 SRS50… SAMN0
## 8 GSM1275875 N061011 trt untrt SRR10395… 98 SRX384358 SRS50… SAMN0

#let's use our differential expression results

dexp<-read_delim("./data/diffexp_results_edger_airways.txt")

## Rows: 15926 Columns: 10

## ── Column specification ────────────────────────────────────────────────────
## Delimiter: "\t"
## chr (4): feature, albut, transcript, ref_genome

Bioinformatics Training and Education Program

72 Introduction to dplyr and the %>%

## dbl (5): logFC, logCPM, F, PValue, FDR

## lgl (1): .abundant
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this mess

dexp

## # A tibble: 15,926 × 10
## feature albut transcript ref_genome .abundant logFC logCPM F PV
## <chr> <chr> <chr> <chr> <lgl> <dbl> <dbl> <dbl> <
## 1 ENSG000… untrt TSPAN6 hg38 TRUE -0.390 5.06 32.8 3.1
## 2 ENSG000… untrt DPM1 hg38 TRUE 0.198 4.61 6.90 2.8
## 3 ENSG000… untrt SCYL3 hg38 TRUE 0.0292 3.48 0.0969 7.6
## 4 ENSG000… untrt C1orf112 hg38 TRUE -0.124 1.47 0.377 5.5
## 5 ENSG000… untrt CFH hg38 TRUE 0.417 8.09 29.3 4.6
## 6 ENSG000… untrt FUCA2 hg38 TRUE -0.250 5.91 14.9 4.0
## 7 ENSG000… untrt GCLC hg38 TRUE -0.0581 4.84 0.167 6.9
## 8 ENSG000… untrt NFYA hg38 TRUE -0.509 4.13 44.9 1.0
## 9 ENSG000… untrt STPG1 hg38 TRUE -0.136 3.12 1.04 3.3
## 10 ENSG000… untrt NIPAL3 hg38 TRUE -0.0500 7.04 0.350 5.6
## # ℹ 15,916 more rows
## # ℹ 1 more variable: FDR <dbl>

We can get an idea of the structure of these data by using str() or glimpse(). glimpse(),
from tidyverse, is similar to str() but provides somewhat cleaner output.

glimpse(smeta)

## Rows: 8
## Columns: 9
## $ SampleName <chr> "GSM1275862", "GSM1275863", "GSM1275866", "GSM1275867", "
## $ cell <chr> "N61311", "N61311", "N052611", "N052611", "N080611", "N08
## $ dex <chr> "untrt", "trt", "untrt", "trt", "untrt", "trt", "untrt",
## $ albut <chr> "untrt", "untrt", "untrt", "untrt", "untrt", "untrt", "un
## $ Run <chr> "SRR1039508", "SRR1039509", "SRR1039512", "SRR1039513", "
## $ avgLength <dbl> 126, 126, 126, 87, 120, 126, 101, 98
## $ Experiment <chr> "SRX384345", "SRX384346", "SRX384349", "SRX384350", "SRX3
## $ Sample <chr> "SRS508568", "SRS508567", "SRS508571", "SRS508572", "SRS5
## $ BioSample <chr> "SAMN02422669", "SAMN02422675", "SAMN02422678", "SAMN0242

Bioinformatics Training and Education Program

73 Introduction to dplyr and the %>%

glimpse(dexp)

## Rows: 15,926
## Columns: 10
## $ feature <chr> "ENSG00000000003", "ENSG00000000419", "ENSG00000000457",
## $ albut <chr> "untrt", "untrt", "untrt", "untrt", "untrt", "untrt", "un
## $ transcript <chr> "TSPAN6", "DPM1", "SCYL3", "C1orf112", "CFH", "FUCA2", "G
## $ ref_genome <chr> "hg38", "hg38", "hg38", "hg38", "hg38", "hg38", "hg38", "
## $ .abundant <lgl> TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRU
## $ logFC <dbl> -0.390100222, 0.197802179, 0.029160865, -0.124382022, 0.4
## $ logCPM <dbl> 5.059704, 4.611483, 3.482462, 1.473375, 8.089146, 5.90966
## $ F <dbl> 3.284948e+01, 6.903534e+00, 9.685073e-02, 3.772134e-01, 2
## $ PValue <dbl> 0.0003117656, 0.0280616149, 0.7629129276, 0.5546956332, 0
## $ FDR <dbl> 0.002831504, 0.077013489, 0.844247837, 0.682326613, 0.003

Now that we have some data to work with, let's start subsetting.

Subsetting data in base R

Base R uses bracket notation for subsetting. For example, if we want to subset the data frame
iris to include only the first 5 rows and the first 3 columns, we could use

iris[1:5,1:3]

## Sepal.Length Sepal.Width Petal.Length

## 1 5.1 3.5 1.4
## 2 4.9 3.0 1.4
## 3 4.7 3.2 1.3
## 4 4.6 3.1 1.5
## 5 5.0 3.6 1.4

While this type of subsetting is useful, it is not always the most readable or easy to employ,
especially for beginners. This is where dplyr comes in. The dplyr package in the tidyverse
world simplifies data wrangling with easy to employ and easy to understand functions specific
for data manipulation in data frames.

Subsetting with dplyr

How can we select only columns of interest and rows of interest? We can use select() and
filter() from dplyr.

Bioinformatics Training and Education Program

74 Introduction to dplyr and the %>%

Subsetting by column (select())

To subset by column, we use the function select(). We can include and exclude columns,
reorder columns, and rename columns using select().

Select a few columns from our differential expression results (dexp).

We can select the columns we are interested in by first calling the data frame object (dexp)
followed by the columns we want to select (transcript,logFC,FDR). All arguments are
separated by a comma. The order of the arguments will determine the order of the columns in
the new data frame.

#select the gene / transcript, logFC, and FDR corrected p-value

#first argument is the df followed by columns to select
ex1<-select(dexp, transcript, logFC, FDR)
ex1

## # A tibble: 15,926 × 3
## transcript logFC FDR
## <chr> <dbl> <dbl>
## 1 TSPAN6 -0.390 0.00283
## 2 DPM1 0.198 0.0770
## 3 SCYL3 0.0292 0.844
## 4 C1orf112 -0.124 0.682
## 5 CFH 0.417 0.00376
## 6 FUCA2 -0.250 0.0186
## 7 GCLC -0.0581 0.794
## 8 NFYA -0.509 0.00126
## 9 STPG1 -0.136 0.478
## 10 NIPAL3 -0.0500 0.695
## # ℹ 15,916 more rows

We can rename while selecting.

#rename using the syntax new_name = old_name

ex1<-select(dexp, gene=transcript, logFoldChange = logFC, FDRpvalue=FDR)
ex1

## # A tibble: 15,926 × 3
## gene logFoldChange FDRpvalue
## <chr> <dbl> <dbl>
## 1 TSPAN6 -0.390 0.00283

Bioinformatics Training and Education Program

75 Introduction to dplyr and the %>%

## 2 DPM1 0.198 0.0770

## 3 SCYL3 0.0292 0.844
## 4 C1orf112 -0.124 0.682
## 5 CFH 0.417 0.00376
## 6 FUCA2 -0.250 0.0186
## 7 GCLC -0.0581 0.794
## 8 NFYA -0.509 0.00126
## 9 STPG1 -0.136 0.478
## 10 NIPAL3 -0.0500 0.695
## # ℹ 15,916 more rows

Note

If you want to retain all columns, you could also use rename() (https://dplyr.tidyverse.org/reference/rename.html)
from dplyr to rename columns.

Excluding columns

We can select all columns, leaving out ones that do not interest us using a - sign. This is helpful
if the columns to keep far outweigh those to exclude. We can similarly use the ! to negate a
selection.

ex2<-select(dexp, -feature)
ex2

## # A tibble: 15,926 × 9
## albut transcript ref_genome .abundant logFC logCPM F PValue
## <chr> <chr> <chr> <lgl> <dbl> <dbl> <dbl> <dbl> <
## 1 untrt TSPAN6 hg38 TRUE -0.390 5.06 32.8 0.000312 0.0
## 2 untrt DPM1 hg38 TRUE 0.198 4.61 6.90 0.0281 0.0
## 3 untrt SCYL3 hg38 TRUE 0.0292 3.48 0.0969 0.763 0.8
## 4 untrt C1orf112 hg38 TRUE -0.124 1.47 0.377 0.555 0.6
## 5 untrt CFH hg38 TRUE 0.417 8.09 29.3 0.000463 0.0
## 6 untrt FUCA2 hg38 TRUE -0.250 5.91 14.9 0.00405 0.0
## 7 untrt GCLC hg38 TRUE -0.0581 4.84 0.167 0.692 0.7
## 8 untrt NFYA hg38 TRUE -0.509 4.13 44.9 0.000100 0.0
## 9 untrt STPG1 hg38 TRUE -0.136 3.12 1.04 0.335 0.4
## 10 untrt NIPAL3 hg38 TRUE -0.0500 7.04 0.350 0.569 0.6
## # ℹ 15,916 more rows

ex2<-select(dexp, !feature)
ex2

Bioinformatics Training and Education Program

76 Introduction to dplyr and the %>%

We can reorder using select().

For readability, let's move the transcript column to the front.

#you can reorder columns and call a range of columns using select().
ex3<-select(dexp, transcript:FDR,albut)
ex3

## # A tibble: 15,926 × 9
## transcript ref_genome .abundant logFC logCPM F PValue FDR a
## <chr> <chr> <lgl> <dbl> <dbl> <dbl> <dbl> <dbl> <
## 1 TSPAN6 hg38 TRUE -0.390 5.06 32.8 0.000312 0.00283 u
## 2 DPM1 hg38 TRUE 0.198 4.61 6.90 0.0281 0.0770 u
## 3 SCYL3 hg38 TRUE 0.0292 3.48 0.0969 0.763 0.844 u
## 4 C1orf112 hg38 TRUE -0.124 1.47 0.377 0.555 0.682 u
## 5 CFH hg38 TRUE 0.417 8.09 29.3 0.000463 0.00376 u
## 6 FUCA2 hg38 TRUE -0.250 5.91 14.9 0.00405 0.0186 u
## 7 GCLC hg38 TRUE -0.0581 4.84 0.167 0.692 0.794 u
## 8 NFYA hg38 TRUE -0.509 4.13 44.9 0.000100 0.00126 u
## 9 STPG1 hg38 TRUE -0.136 3.12 1.04 0.335 0.478 u
## 10 NIPAL3 hg38 TRUE -0.0500 7.04 0.350 0.569 0.695 u
## # ℹ 15,916 more rows

Note

This also would have excluded the feature column.

Bioinformatics Training and Education Program

77 Introduction to dplyr and the %>%

Selecting a range of columns

Notice that we can select a range of columns using the :. We could also deselect a range of
columns or deselect a range of columns while adding a column back.

ex3<-select(dexp, -(albut:F),logFC)
ex3

## # A tibble: 15,926 × 4
## feature PValue FDR logFC
## <chr> <dbl> <dbl> <dbl>
## 1 ENSG00000000003 0.000312 0.00283 -0.390
## 2 ENSG00000000419 0.0281 0.0770 0.198
## 3 ENSG00000000457 0.763 0.844 0.0292
## 4 ENSG00000000460 0.555 0.682 -0.124
## 5 ENSG00000000971 0.000463 0.00376 0.417
## 6 ENSG00000001036 0.00405 0.0186 -0.250
## 7 ENSG00000001084 0.692 0.794 -0.0581
## 8 ENSG00000001167 0.000100 0.00126 -0.509
## 9 ENSG00000001460 0.335 0.478 -0.136
## 10 ENSG00000001461 0.569 0.695 -0.0500
## # ℹ 15,916 more rows

Helper functions

We can also include helper functions such as starts_with() and ends_with()

select(dexp, transcript, starts_with("log"), FDR)

## # A tibble: 15,926 × 4
## transcript logFC logCPM FDR
## <chr> <dbl> <dbl> <dbl>
## 1 TSPAN6 -0.390 5.06 0.00283
## 2 DPM1 0.198 4.61 0.0770
## 3 SCYL3 0.0292 3.48 0.844
## 4 C1orf112 -0.124 1.47 0.682
## 5 CFH 0.417 8.09 0.00376
## 6 FUCA2 -0.250 5.91 0.0186
## 7 GCLC -0.0581 4.84 0.794
## 8 NFYA -0.509 4.13 0.00126
## 9 STPG1 -0.136 3.12 0.478

Bioinformatics Training and Education Program

78 Introduction to dplyr and the %>%

## 10 NIPAL3 -0.0500 7.04 0.695

## # ℹ 15,916 more rows

There are a number of other selection helpers. See the help documentation for select (https://
dplyr.tidyverse.org/reference/select.html) for more information ?dplyr::select().

Select columns of a particular type

There are many other ways to select multiple columns. You may commonly be interested in
selecting all numeric columns or all factors. The syntax below can be used for this purpose.

select(dexp, where(is.numeric)) #or

## # A tibble: 15,926 × 5
## logFC logCPM F PValue FDR
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 -0.390 5.06 32.8 0.000312 0.00283
## 2 0.198 4.61 6.90 0.0281 0.0770
## 3 0.0292 3.48 0.0969 0.763 0.844
## 4 -0.124 1.47 0.377 0.555 0.682
## 5 0.417 8.09 29.3 0.000463 0.00376
## 6 -0.250 5.91 14.9 0.00405 0.0186
## 7 -0.0581 4.84 0.167 0.692 0.794
## 8 -0.509 4.13 44.9 0.000100 0.00126
## 9 -0.136 3.12 1.04 0.335 0.478
## 10 -0.0500 7.04 0.350 0.569 0.695
## # ℹ 15,916 more rows

select_if(dexp, is.numeric) #select_if is a scoped verb function

Bioinformatics Training and Education Program

79 Introduction to dplyr and the %>%

## 10 -0.0500 7.04 0.350 0.569 0.695

## # ℹ 15,916 more rows

Subsetting by row (filter())

To subset by row, we use the function filter().

filter() only includes rows where the condition is TRUE; it excludes both FALSE and
NA values. ---R4DS (https://r4ds.had.co.nz/transform.html#filter-rows-with-filter)

Now let's filter the rows from smeta based on a condition. Let's look at only the treated samples
in dex (i.e., trt) using the function filter(). The first argument is the data frame (e.g.,
smeta) followed by the expression(s) to filter the data frame.

filter(smeta, dex == "trt") #we've seen == notation before

To complete these filter phrases you will often need to include comparison operators such as
the == above. These operators help us evaluate relations. For example, a == b is asking if a
and b are equivalent. It is a logical comparison that when evaluated will return TRUE or FALSE.
The filter function will then return rows that evaluate to TRUE.

Try the following:

a <- 1
b <- 1
a == b

## [1] TRUE

Keep these comparison operators in mind for filtering.

Comparison operators
Comparison Operator Description
> greater than

>= greater than or equal to

< less than

<= less than or equal to

!= Not equal

== equal

Bioinformatics Training and Education Program

80 Introduction to dplyr and the %>%

Comparison Operator Description

a|b a or b

a&b a and b

We may want to combine filtering parameters using AND or OR phrasing and the operators &
and |.

For example, if we only wanted to return rows where dex == trt and cell==N61311, we can
use:

filter(smeta, dex == "trt" & cell == "N61311")

## # A tibble: 1 × 9
## SampleName cell dex albut Run avgLength Experiment Sample BioSa
## <chr> <chr> <chr> <chr> <chr> <dbl> <chr> <chr> <chr>
## 1 GSM1275863 N61311 trt untrt SRR1039509 126 SRX384346 SRS50… SAMN0

A , is treated the same as & in the case of filter().

filter(smeta, dex == "trt", cell == "N61311")

We can also filter by one condition or another using the |.

filter(smeta,cell == "N080611" | cell == "N61311")

## # A tibble: 4 × 9
## SampleName cell dex albut Run avgLength Experiment Sample BioSa
## <chr> <chr> <chr> <chr> <chr> <dbl> <chr> <chr> <chr>
## 1 GSM1275862 N61311 untrt untrt SRR10395… 126 SRX384345 SRS50… SAMN0
## 2 GSM1275863 N61311 trt untrt SRR10395… 126 SRX384346 SRS50… SAMN0
## 3 GSM1275870 N080611 untrt untrt SRR10395… 120 SRX384353 SRS50… SAMN0
## 4 GSM1275871 N080611 trt untrt SRR10395… 126 SRX384354 SRS50… SAMN0

Bioinformatics Training and Education Program

81 Introduction to dplyr and the %>%

The %in% operator

Used to match elements of a vector.

%in% returns a logical vector indicating if there is a match or not for its left operand.
--- match R Documentation.

The returned logical vector will be the length of the vector to the left. Its basic usage:

smeta$SampleName %in% c("GSM1275871","GSM1275863")

## [1] FALSE TRUE FALSE FALSE FALSE TRUE FALSE FALSE

c("GSM1275871","GSM1275863") %in% smeta$SampleName

## [1] TRUE TRUE

We can combine the %in% operator with filter().

#filter for two cell lines

filter(smeta,cell %in% c("N061011", "N052611"))

## # A tibble: 4 × 9
## SampleName cell dex albut Run avgLength Experiment Sample BioSa
## <chr> <chr> <chr> <chr> <chr> <dbl> <chr> <chr> <chr>
## 1 GSM1275866 N052611 untrt untrt SRR10395… 126 SRX384349 SRS50… SAMN0
## 2 GSM1275867 N052611 trt untrt SRR10395… 87 SRX384350 SRS50… SAMN0
## 3 GSM1275874 N061011 untrt untrt SRR10395… 101 SRX384357 SRS50… SAMN0
## 4 GSM1275875 N061011 trt untrt SRR10395… 98 SRX384358 SRS50… SAMN0

Including multiple phrases

#use `|` operator

#look at only results with named genes (not NAs)
#and those with a log fold change greater than 2
#and either a p-value or an FDR corrected p_value < or = to 0.01
#The comma acts as &
sig_annot_transcripts<-
filter(dexp, !is.na(transcript),

Bioinformatics Training and Education Program

82 Introduction to dplyr and the %>%

abs(logFC) > 2, (PValue | FDR <= 0.01))

Filtering across columns

Past versions of dplyr included powerful variants of filter, select, and other functions to
help perform tasks across columns. You may see functions such as filter_all, filter_if,
and filter_at. Functions like these can still be used but have been superseded by across
(https://dplyr.tidyverse.org/reference/across.html). However, across has been deprecated in
the case of filter and replaced by if_any() and if_all().

Both functions operate similarly to across() but go the extra mile of aggregating the
results to indicate if all the results are true when using if_all(), or if at least one is
true when using if_any() ---tidyverse.org (https://www.tidyverse.org/blog/2021/02/
dplyr-1-0-4-if-any/)

Let's briefly see this in action.

f<-filter(dexp, if_all(PValue:FDR, ~ . < 0.05))

Anonymous functions

The code above includes an anonymous function. Read more here (https://jennybc.github.io/purrr-tutorial/ls03_map-
function-syntax.html#anonymous_function,_formula). You may also find this stackoverflow post (https://
stackoverflow.com/questions/56532119/dplyr-piping-data-difference-between-and-x) useful.

Subsetting rows by position

There are times when you may want to subset your data by position, for example, the first or last
number of rows. There are a series of functions in the tidyverse that facilitate this type of
subsetting. The primary function is slice(), which has several commonly used helper
functions including slice_head(), slice_tail(), slice_min(), and slice_max(). See
the slice() (https://dplyr.tidyverse.org/reference/slice.html) documentation for more
information.

Introducing the pipe

Often we will apply multiple functions to wrangle a data frame into the state that we need it. For
example, maybe you want to select and filter. What are our options? We could run one step
after another, saving an object for each step, or we could nest a function within a function, but
these can affect code readability and clutter our work space, making it difficult to follow what
we or someone else did.

For example,

Bioinformatics Training and Education Program

83 Introduction to dplyr and the %>%

#Run one step at a time with intermediate objects.

#We've done this a few times above
#select gene, logFC, FDR
dexp_s<-select(dexp, transcript, logFC, FDR)

#Now filter for only the genes "TSPAN6" and DPM1

#Note: we could have used %in%
tspanDpm<- filter(dexp_s, transcript == "TSPAN6" | transcript=="DPM1")

#Nested code example

tspanDpm<- filter(select(dexp, c(transcript, logFC, FDR)),
transcript == "TSPAN6" | transcript=="DPM1" )

Let's explore how piping streamlines this. Piping (using %>%) allows you to employ multiple
functions consecutively, while improving readability. The output of one function is passed
directly to another without storing the intermediate steps as objects. You can pipe from the
beginning (reading in the data) all the way to plotting without storing the data or intermediate
objects, if you want. Pipes in R come from the magrittr package, which is a dependency of
dplyr.

Pipe

Read more info about the magrittr pipe here (https://magrittr.tidyverse.org/reference/pipe.html) . There is also a
native R pipe, |>, as of R 4.1.0. Read more about the difference between %>% and |> here (https://
www.tidyverse.org/blog/2023/04/base-vs-magrittr-pipe/).

To pipe, we have to first call the data and then pipe it into a function. The output of each step is
then piped into the next step.

Let's see how this works

tspanDpm <- dexp %>% #call the data and pipe to select()
select(transcript, logFC, FDR) %>% #select columns of interest
filter(transcript == "TSPAN6" | transcript=="DPM1" ) #filter

Notice that the data argument has been dropped from select() and filter(). This is
because the pipe passes the input from the left to the right. The %>% must be at the end of each
line.

Piping from the beginning:

read_delim("./data/diffexp_results_edger_airways.txt") %>% #read data

select(transcript, logFC, FDR) %>% #select columns of interest
filter(transcript == "TSPAN6" | transcript=="DPM1" ) %>% #filter

Bioinformatics Training and Education Program

84 Introduction to dplyr and the %>%

ggplot(aes(x=transcript,y=logFC,fill=FDR)) + #plot
geom_bar(stat = "identity") +
theme_classic() +
geom_hline(yintercept=0, linetype="dashed", color = "black")

## Rows: 15926 Columns: 10

## ── Column specification ────────────────────────────────────────────────────
## Delimiter: "\t"
## chr (4): feature, albut, transcript, ref_genome
## dbl (5): logFC, logCPM, F, PValue, FDR
## lgl (1): .abundant
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this mess

The dplyr functions by themselves are somewhat simple, but by combining them
into linear workflows with the pipe, we can accomplish more complex manipulations
of data frames. ---datacarpentry.org (https://datacarpentry.org/genomics-r-intro/05-
dplyr/index.html)

Bioinformatics Training and Education Program

85 Introduction to dplyr and the %>%

Reordering rows
There are many steps that can be taken following subsetting (i.e., filtering by rows and
columns); one of which is reordering rows. In the tidyverse, reordering rows is largely done by
arrange(). Arrange will reorder a variable from smallest to largest, or in the case of
characters, alphabetically, from a to z.

Let's arrange the genes in dexp.

dexp %>% arrange(transcript)

## # A tibble: 15,926 × 10
## feature albut transcript ref_genome .abundant logFC logCPM F PV
## <chr> <chr> <chr> <chr> <lgl> <dbl> <dbl> <dbl> <
## 1 ENSG0000… untrt A1BG-AS1 hg38 TRUE 0.513 1.02 9.22 1.4
## 2 ENSG0000… untrt A2M hg38 TRUE 0.528 10.1 3.57 9.2
## 3 ENSG0000… untrt A2M-AS1 hg38 TRUE -0.337 0.308 2.76 1.3
## 4 ENSG0000… untrt A4GALT hg38 TRUE 0.519 5.89 24.5 8.5
## 5 ENSG0000… untrt AAAS hg38 TRUE -0.0254 5.12 0.134 7.2
## 6 ENSG0000… untrt AACS hg38 TRUE -0.191 4.06 5.00 5.3
## 7 ENSG0000… untrt AADAT hg38 TRUE -0.642 2.67 16.9 2.7
## 8 ENSG0000… untrt AAGAB hg38 TRUE -0.165 5.08 5.82 3.9
## 9 ENSG0000… untrt AAK1 hg38 TRUE -0.188 3.82 2.29 1.6
## 10 ENSG0000… untrt AAMDC hg38 TRUE 0.447 2.42 8.52 1.7
## # ℹ 15,916 more rows
## # ℹ 1 more variable: FDR <dbl>

Let's arrange logFC from smallest to largest.

dexp %>% arrange(logFC)

## # A tibble: 15,926 × 10
## feature albut transcript ref_genome .abundant logFC logCPM F PV
## <chr> <chr> <chr> <chr> <lgl> <dbl> <dbl> <dbl> <
## 1 ENSG000002… untrt LINC00906 hg38 TRUE -4.59 0.473 139. 1.1
## 2 ENSG000001… untrt LRRTM2 hg38 TRUE -4.00 1.24 127. 1.6
## 3 ENSG000001… untrt VASH2 hg38 TRUE -3.95 0.0171 152. 7.7
## 4 ENSG000001… untrt VCAM1 hg38 TRUE -3.66 4.60 565. 2.8
## 5 ENSG000001… untrt SLC14A1 hg38 TRUE -3.63 1.38 42.3 1.2
## 6 ENSG000002… untrt FER1L6 hg38 TRUE -3.13 3.53 238. 1.1
## 7 ENSG000001… untrt SMTNL2 hg38 TRUE -3.12 1.46 134. 1.2
## 8 ENSG000001… untrt WNT2 hg38 TRUE -3.07 3.99 521. 4.0

Bioinformatics Training and Education Program

86 Introduction to dplyr and the %>%

## 9 ENSG000001… untrt EGR2 hg38 TRUE -3.04 -0.141 96.1 5.1

## 10 ENSG000001… untrt SLITRK6 hg38 TRUE -3.03 1.16 130. 1.4
## # ℹ 15,916 more rows
## # ℹ 1 more variable: FDR <dbl>

What if we want to arrange from largest to smallest? We can use desc().

dexp %>% arrange(desc(logFC))

## # A tibble: 15,926 × 10
## feature albut transcript ref_genome .abundant logFC logCPM F PV
## <chr> <chr> <chr> <chr> <lgl> <dbl> <dbl> <dbl> <
## 1 ENSG00000… untrt ALOX15B hg38 TRUE 10.1 1.62 554. 5.92
## 2 ENSG00000… untrt ZBTB16 hg38 TRUE 7.15 4.15 1429. 5.11
## 3 ENSG00000… untrt <NA> <NA> TRUE 6.17 1.35 380. 1.58
## 4 ENSG00000… untrt ANGPTL7 hg38 TRUE 5.68 3.51 483. 5.66
## 5 ENSG00000… untrt STEAP4 hg38 TRUE 5.22 3.66 445. 8.07
## 6 ENSG00000… untrt PRODH hg38 TRUE 4.85 1.29 253. 9.10
## 7 ENSG00000… untrt FAM107A hg38 TRUE 4.74 2.78 656. 1.51
## 8 ENSG00000… untrt LGI3 hg38 TRUE 4.68 -0.0503 106. 3.45
## 9 ENSG00000… untrt SPARCL1 hg38 TRUE 4.56 5.53 721. 1.00
## 10 ENSG00000… untrt KLF15 hg38 TRUE 4.48 4.69 479. 5.86
## # ℹ 15,916 more rows
## # ℹ 1 more variable: FDR <dbl>

Acknowledgments
Some material from this lesson was either taken directly or adapted from the Intro to R and
RStudio for Genomics lesson provided by datacarpentry.org (https://datacarpentry.org/
genomics-r-intro/01-introduction/index.html). Additional content was inspired by Chapter 3,
Wrangling Data in the Tidyverse, (https://jhudatascience.org/tidyversecourse/wrangle-
data.html#filtering-data) from Tidyverse Skills for Data Science and Suzan Baert's dplyr tutorials
(https://suzan.rbind.io/categories/tutorial/).

Bioinformatics Training and Education Program

87 dplyr: joining, tranforming, and summarizing data frames

dplyr: joining, tranforming, and

summarizing data frames
Objectives
Today we will continue to wrangle data using the tidyverse package, dplyr. We will learn:

1. how to join data frames using dplyr

2. how to transform and create new variables using mutate()
3. how to summarize variables using group_by() and summarize()

Loading dplyr
In this lesson, we are continuing with the package dplyr. We do not need to load the dplyr
package separately, as it is a core tidyverse package. Again, if you need to install and load
only dplyr, use install.packages("dplyr") and library(dplyr).

Load the package:

library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.

Data
Let's load in some data to work with. In this lesson, we will continue to use sample metadata,
raw count data, and differential expression results from the airway RNA-Seq project.

Load the data:

Bioinformatics Training and Education Program

88 dplyr: joining, tranforming, and summarizing data frames

#sample information
smeta<-read_delim("./data/airway_sampleinfo.txt")

smeta

#raw count data

acount<-read_csv("./data/airway_rawcount.csv") %>%
dplyr::rename("Feature" = "...1")

## New names:
## Rows: 64102 Columns: 9
## ── Column specification
## ──────────────────────────────────────────────────────── Delimiter: "," chr
## (1): ...1 dbl (8): SRR1039508, SRR1039509, SRR1039512, SRR1039513, SRR103951
## SRR1039...
## ℹ Use `spec()` to retrieve the full column specification for this data.
## Specify the column types or set `show_col_types = FALSE` to quiet this messa
## • `` -> `...1`

Bioinformatics Training and Education Program

89 dplyr: joining, tranforming, and summarizing data frames

acount

## # A tibble: 64,102 × 9
## Feature SRR1039508 SRR1039509 SRR1039512 SRR1039513 SRR1039516 SRR103
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <
## 1 ENSG000000… 679 448 873 408 1138
## 2 ENSG000000… 0 0 0 0 0
## 3 ENSG000000… 467 515 621 365 587
## 4 ENSG000000… 260 211 263 164 245
## 5 ENSG000000… 60 55 40 35 78
## 6 ENSG000000… 0 0 2 0 1
## 7 ENSG000000… 3251 3679 6177 4252 6721 1
## 8 ENSG000000… 1433 1062 1733 881 1424
## 9 ENSG000000… 519 380 595 493 820
## 10 ENSG000000… 394 236 464 175 658
## # ℹ 64,092 more rows
## # ℹ 2 more variables: SRR1039520 <dbl>, SRR1039521 <dbl>

#differential expression results

dexp<-read_delim("./data/diffexp_results_edger_airways.txt")

## Rows: 15926 Columns: 10

dexp

Bioinformatics Training and Education Program

90 dplyr: joining, tranforming, and summarizing data frames

## 5 ENSG000… untrt CFH hg38 TRUE 0.417 8.09 29.3 4.6

## 6 ENSG000… untrt FUCA2 hg38 TRUE -0.250 5.91 14.9 4.0
## 7 ENSG000… untrt GCLC hg38 TRUE -0.0581 4.84 0.167 6.9
## 8 ENSG000… untrt NFYA hg38 TRUE -0.509 4.13 44.9 1.0
## 9 ENSG000… untrt STPG1 hg38 TRUE -0.136 3.12 1.04 3.3
## 10 ENSG000… untrt NIPAL3 hg38 TRUE -0.0500 7.04 0.350 5.6
## # ℹ 15,916 more rows
## # ℹ 1 more variable: FDR <dbl>

Joining data frames

Often related data is stored across multiple data frames. In such cases, while each data frame
likely contains different types of data, an identifier column or key (e.g., sampleID) can be used
to unite or combine aspects of the data.

There are a series of functions from dplyr devoted to the purpose of joining data frames. There
are two types of joins: mutating joins (https://dplyr.tidyverse.org/reference/mutate-joins.html)
and filtering joins (https://dplyr.tidyverse.org/reference/filter-joins.html).

Mutating joins

Imagine we have two data frames x and y. A mutating join will keep all columns from x and y
by adding columns from y to x.

left_join() - Output contains all rows from x

return all rows from x, and all columns from x and y. Rows in x with no match in y
will have NA values in the new columns. If there are multiple matches between x
and y, all combinations of the matches are returned. --- R documentation, dplyr
(version 0.7.8) (https://www.rdocumentation.org/packages/dplyr/versions/0.7.8/
topics/join)

right_join() - Output contains all rows from y

return all rows from y, and all columns from x and y. Rows in y with no match in x will
have NA values in the new columns. If there are multiple matches between x and y,
all combinations of the matches are returned. --- R documentation, dplyr (version
0.7.8) (https://www.rdocumentation.org/packages/dplyr/versions/0.7.8/topics/join)

inner_join() - Output contains matched rows from x

return all rows from x where there are matching values in y, and all columns from x
and y. If there are multiple matches between x and y, all combination of the
matches are returned. --- R documentation, dplyr (version 0.7.8) (https://
www.rdocumentation.org/packages/dplyr/versions/0.7.8/topics/join)

Bioinformatics Training and Education Program

91 dplyr: joining, tranforming, and summarizing data frames

Unmatched values from x will be dropped.

full_join() - Output contains all rows from x and y

return all rows and all columns from both x and y. Where there are not matching
values, returns NA for the one missing. --- R documentation, dplyr (version 0.7.8)
(https://www.rdocumentation.org/packages/dplyr/versions/0.7.8/topics/join)

Note

The R documentation for dplyr has been updated with dplyr v1.0.9. However, these descriptions still stand and are
clearer (in my opinion) than the new documentation.

The most common type of join is the left_join(). Let's see this in action:

#reshape acount
acount_smeta<-acount %>% pivot_longer(where(is.numeric),names_to ="Sample",
values_to= "Count") %>% left_join(smeta, by=c("Sample"="Run
acount_smeta

## # A tibble: 512,816 × 11
## Feature Sample Count SampleName cell dex albut avgLength Experi
## <chr> <chr> <dbl> <chr> <chr> <chr> <chr> <dbl> <chr>
## 1 ENSG000000000… SRR10… 679 GSM1275862 N613… untrt untrt 126 SRX384
## 2 ENSG000000000… SRR10… 448 GSM1275863 N613… trt untrt 126 SRX384
## 3 ENSG000000000… SRR10… 873 GSM1275866 N052… untrt untrt 126 SRX384
## 4 ENSG000000000… SRR10… 408 GSM1275867 N052… trt untrt 87 SRX384
## 5 ENSG000000000… SRR10… 1138 GSM1275870 N080… untrt untrt 120 SRX384
## 6 ENSG000000000… SRR10… 1047 GSM1275871 N080… trt untrt 126 SRX384
## 7 ENSG000000000… SRR10… 770 GSM1275874 N061… untrt untrt 101 SRX384
## 8 ENSG000000000… SRR10… 572 GSM1275875 N061… trt untrt 98 SRX384
## 9 ENSG000000000… SRR10… 0 GSM1275862 N613… untrt untrt 126 SRX384
## 10 ENSG000000000… SRR10… 0 GSM1275863 N613… trt untrt 126 SRX384
## # ℹ 512,806 more rows
## # ℹ 2 more variables: Sample.y <chr>, BioSample <chr>

Notice the use of by in left_join. The argument by requires the column or columns that we
want to join by. If the column we want to join by has a different name, we can use the notation
above, which says to match Sample from acount to Run from smeta.

Filtering joins

Filtering joins result in filtered x data based on matching or non-matching with y. These joins do
not add columns from y to x.

Bioinformatics Training and Education Program

92 dplyr: joining, tranforming, and summarizing data frames

semi_join()

return all rows from x where there are matching values in y, keeping just columns
from x.
A semi join differs from an inner join because an inner join will return one row of x
for each matching row of y, where a semi join will never duplicate rows of x. --- R
documentation, dplyr (version 0.7.8) (https://www.rdocumentation.org/packages/
dplyr/versions/0.7.8/topics/join)

anti_join()

return all rows from x where there are not matching values in y, keeping just
columns from x. --- R documentation, dplyr (version 0.7.8) (https://
www.rdocumentation.org/packages/dplyr/versions/0.7.8/topics/join)

Let's see a brief example of semi-join:

#reshape acount
smeta_f<-smeta %>% filter(Run %in% c("SRR1039512","SRR1039508"))

acount_L<-acount %>% pivot_longer(where(is.numeric),names_to ="Sample",

values_to= "Count")

semi_join(acount_L,smeta_f, by=c("Sample"="Run"))

Note

This example does not use the %>%. This was simply to demonstrate the different "strategies" that can be used to set
up and run code.

## # A tibble: 128,204 × 3
## Feature Sample Count
## <chr> <chr> <dbl>
## 1 ENSG00000000003 SRR1039508 679
## 2 ENSG00000000003 SRR1039512 873
## 3 ENSG00000000005 SRR1039508 0
## 4 ENSG00000000005 SRR1039512 0
## 5 ENSG00000000419 SRR1039508 467
## 6 ENSG00000000419 SRR1039512 621
## 7 ENSG00000000457 SRR1039508 260
## 8 ENSG00000000457 SRR1039512 263
## 9 ENSG00000000460 SRR1039508 60
## 10 ENSG00000000460 SRR1039512 40
## # ℹ 128,194 more rows

Bioinformatics Training and Education Program

93 dplyr: joining, tranforming, and summarizing data frames

In this case, we could have used filter. However, if we were interested in filtering by multiple
variables simultaneously, it would be easier to employ a semi-join.

Transforming variables
Data wrangling often involves transforming one variable to another. For example, we may be
interested in log transforming a variable or adding two variables to create a third. In dplyr this
can be done with mutate() and transmute(). These functions allow us to create a new
variable from existing variables.

mutate()
mutate() adds new variables and preserves existing ones; transmute() adds new
variables and drops existing ones. New variables overwrite existing variables of the
same name. --- dplyr.tidyverse.org (https://dplyr.tidyverse.org/reference/
mutate.html)

Let's create a column in our original differential expression data frame denoting significant
transcripts (those with an FDR corrected p-value less than 0.05 and a log fold change greater
than or equal to 2).

dexp_sigtrnsc<-dexp %>% mutate(Significant= FDR<0.05 & abs(logFC) >=2)

head(dexp_sigtrnsc$Significant)

## [1] FALSE FALSE FALSE FALSE FALSE FALSE

This creates a column named Significant that contains TRUE values where the expression
above was true (meaning significant in this case) and FALSE where the expression was FALSE.

Let's look at another example. This time let's log transform our FDR corrected p-values.

exmut<-dexp %>% mutate(logFDR = log10(FDR))

exmut["logFDR"]

## # A tibble: 15,926 × 1
## logFDR
## <dbl>
## 1 -2.55
## 2 -1.11
## 3 -0.0735
## 4 -0.166

Bioinformatics Training and Education Program

94 dplyr: joining, tranforming, and summarizing data frames

## 5 -2.42
## 6 -1.73
## 7 -0.100
## 8 -2.90
## 9 -0.320
## 10 -0.158
## # ℹ 15,916 more rows

Mutating several variables at once

What if we want to transform all of our counts spread across multiple columns in acount using
scale(), which applies a z-score transformation? In this case we use across() within
mutate(), which has replaced the scoped verbs (mutate_if,mutate_at, and
mutate_all).

Z-score

A z score tells us the number of standard deviations from the mean of a given value. This can be achieved by
scale(x, center = TRUE, scale = TRUE).

Let's see this in action.

acount %>% mutate(across(where(is.numeric),scale))

## # A tibble: 64,102 × 9
## Feature SRR1039508[,1] SRR1039509[,1] SRR1039512[,1] SRR1039513[,
## <chr> <dbl> <dbl> <dbl> <db
## 1 ENSG00000000003 0.103 0.0527 0.0991 0.06
## 2 ENSG00000000005 -0.0929 -0.100 -0.0821 -0.08
## 3 ENSG00000000419 0.0418 0.0756 0.0468 0.04
## 4 ENSG00000000457 -0.0179 -0.0281 -0.0275 -0.02
## 5 ENSG00000000460 -0.0756 -0.0814 -0.0738 -0.07
## 6 ENSG00000000938 -0.0929 -0.100 -0.0817 -0.08
## 7 ENSG00000000971 0.845 1.16 1.20 1.51
## 8 ENSG00000001036 0.321 0.262 0.278 0.24
## 9 ENSG00000001084 0.0568 0.0295 0.0414 0.09
## 10 ENSG00000001167 0.0208 -0.0196 0.0142 -0.02
## # ℹ 64,092 more rows
## # ℹ 4 more variables: SRR1039516 <dbl[,1]>, SRR1039517 <dbl[,1]>,
## # SRR1039520 <dbl[,1]>, SRR1039521 <dbl[,1]>

For further information on across (https://dplyr.tidyverse.org/articles/colwise.html), check out

this great tutorial here (https://www.rebeccabarter.com/blog/2020-07-09-across/).

Bioinformatics Training and Education Program

95 dplyr: joining, tranforming, and summarizing data frames

Coercing variables with mutate

Mutate can also be used to coerce variables. Again, we need to use across() and where().
Let's also check out the difference between mutate() and transmute().

#compare transmute to mutate

ex_coerce<-acount_smeta %>% transmute(across(where(is.character),as.factor))
glimpse(ex_coerce)

## Rows: 512,816
## Columns: 9
## $ Feature <fct> ENSG00000000003, ENSG00000000003, ENSG00000000003, ENSG00
## $ Sample <fct> SRR1039508, SRR1039509, SRR1039512, SRR1039513, SRR103951
## $ SampleName <fct> GSM1275862, GSM1275863, GSM1275866, GSM1275867, GSM127587
## $ cell <fct> N61311, N61311, N052611, N052611, N080611, N080611, N0610
## $ dex <fct> untrt, trt, untrt, trt, untrt, trt, untrt, trt, untrt, tr
## $ albut <fct> untrt, untrt, untrt, untrt, untrt, untrt, untrt, untrt, u
## $ Experiment <fct> SRX384345, SRX384346, SRX384349, SRX384350, SRX384353, SR
## $ Sample.y <fct> SRS508568, SRS508567, SRS508571, SRS508572, SRS508575, SR
## $ BioSample <fct> SAMN02422669, SAMN02422675, SAMN02422678, SAMN02422670, S

#mutate
ex_coerce<-acount_smeta %>% mutate(across(where(is.character),as.factor))
glimpse(ex_coerce)

## Rows: 512,816
## Columns: 11
## $ Feature <fct> ENSG00000000003, ENSG00000000003, ENSG00000000003, ENSG00
## $ Sample <fct> SRR1039508, SRR1039509, SRR1039512, SRR1039513, SRR103951
## $ Count <dbl> 679, 448, 873, 408, 1138, 1047, 770, 572, 0, 0, 0, 0, 0,
## $ SampleName <fct> GSM1275862, GSM1275863, GSM1275866, GSM1275867, GSM127587
## $ cell <fct> N61311, N61311, N052611, N052611, N080611, N080611, N0610
## $ dex <fct> untrt, trt, untrt, trt, untrt, trt, untrt, trt, untrt, tr
## $ albut <fct> untrt, untrt, untrt, untrt, untrt, untrt, untrt, untrt, u
## $ avgLength <dbl> 126, 126, 126, 87, 120, 126, 101, 98, 126, 126, 126, 87,
## $ Experiment <fct> SRX384345, SRX384346, SRX384349, SRX384350, SRX384353, SR
## $ Sample.y <fct> SRS508568, SRS508567, SRS508571, SRS508572, SRS508575, SR
## $ BioSample <fct> SAMN02422669, SAMN02422675, SAMN02422678, SAMN02422670, S

Notice that transmute dropped all columns that were not mutated.

Bioinformatics Training and Education Program

96 dplyr: joining, tranforming, and summarizing data frames

Using rowwise() and mutate()

What if we wanted a new column that stored the mean of each row in our data frame?

Let's create a small data frame, and use mutate() to get the mean(). What happens when we
use mean as is?

df<-data.frame(A=c(1,2,3),B=c(4,5,6),C=c(7,8,9))
df

## A B C
## 1 1 4 7
## 2 2 5 8
## 3 3 6 9

df %>% mutate(D= mean(c(A,B,C)))

## A B C D
## 1 1 4 7 5
## 2 2 5 8 5
## 3 3 6 9 5

df %>% mutate(D = (A+B+C)/3)

## A B C D
## 1 1 4 7 4
## 2 2 5 8 5
## 3 3 6 9 6

The first example simply gives us the mean of A, B, and C (not row wise). The second example
gave us what we wanted, but was more complicated. The alternative is to first group by row
using rowwise() and then use mutate().

df %>% rowwise() %>% mutate(D= mean(c(A,B,C)))

## # A tibble: 3 × 4
## # Rowwise:
## A B C D

Bioinformatics Training and Education Program

97 dplyr: joining, tranforming, and summarizing data frames

## <dbl> <dbl> <dbl> <dbl>

## 1 1 4 7 4
## 2 2 5 8 5
## 3 3 6 9 6

See more uses of rowwise() operations here (https://dplyr.tidyverse.org/articles/rowwise.html).

Group_by and summarize

There is an approach to data analysis known as "split-apply-combine", in which the data is split
into smaller components, some type of analysis is applied to each component, and the results
are combined. The dplyr functions including group_by() and summarize() are key players
in this type of workflow.

group_by() allows us to group a data frame by a categorical variable so that a given

operation can be performed per group / category.

Let's get the top five genes with the greatest median raw counts by treatment.

#Call the data

acount_smeta %>%
# group_by dex and Feature (Feature nested within treatment)
group_by(dex,Feature) %>%
#for each group calculate the median value of raw counts
summarize(median_counts=median(Count)) %>%
#arrange in descending order
arrange(desc(median_counts),.by_group = TRUE) %>%
#return the top 5 values for each group
slice_head(n=5)

## `summarise()` has grouped output by 'dex'. You can override using the `.grou
## argument.

Bioinformatics Training and Education Program

98 dplyr: joining, tranforming, and summarizing data frames

## 7 untrt ENSG00000115414 302956.

## 8 untrt ENSG00000156508 294097
## 9 untrt ENSG00000164692 249846
## 10 untrt ENSG00000198804 249206

#can skip arrange and use slice_max

acount_smeta %>%
group_by(dex,Feature) %>%
summarize(median_counts=median(Count)) %>%
slice_max(n=5, order_by=median_counts) #notice use of slice_max

## `summarise()` has grouped output by 'dex'. You can override using the `.grou
## argument.

## # A tibble: 10 × 3
## # Groups: dex [2]
## dex Feature median_counts
## <chr> <chr> <dbl>
## 1 trt ENSG00000115414 322164
## 2 trt ENSG00000011465 263587
## 3 trt ENSG00000156508 239676.
## 4 trt ENSG00000198804 230992
## 5 trt ENSG00000116260 187288.
## 6 untrt ENSG00000011465 336076
## 7 untrt ENSG00000115414 302956.
## 8 untrt ENSG00000156508 294097
## 9 untrt ENSG00000164692 249846
## 10 untrt ENSG00000198804 249206

How many rows per sample are in the acount_smeta data frame? We can use count() or
summarize() paired with other functions (e.g., n(),tally()).

acount_smeta %>%
count(dex, Sample)

## # A tibble: 8 × 3
## dex Sample n
## <chr> <chr> <int>
## 1 trt SRR1039509 64102
## 2 trt SRR1039513 64102
## 3 trt SRR1039517 64102

Bioinformatics Training and Education Program

99 dplyr: joining, tranforming, and summarizing data frames

## 4 trt SRR1039521 64102

## 5 untrt SRR1039508 64102
## 6 untrt SRR1039512 64102
## 7 untrt SRR1039516 64102
## 8 untrt SRR1039520 64102

acount_smeta %>%
group_by(dex, Sample) %>%
summarize(n=n()) #there are multiple functions that can be used here

## `summarise()` has grouped output by 'dex'. You can override using the `.grou
## argument.

## # A tibble: 8 × 3
## # Groups: dex [2]
## dex Sample n
## <chr> <chr> <int>
## 1 trt SRR1039509 64102
## 2 trt SRR1039513 64102
## 3 trt SRR1039517 64102
## 4 trt SRR1039521 64102
## 5 untrt SRR1039508 64102
## 6 untrt SRR1039512 64102
## 7 untrt SRR1039516 64102
## 8 untrt SRR1039520 64102

This output makes sense, as there are 64,102 unique Ensembl ids
(n_distinct(acount_smeta$Feature))

Missing Values

By default, all [built in] R functions operating on vectors that contain missing data will return NA. It’s a
way to make sure that users know they have missing data, and make a conscious decision on how
to deal with it. When dealing with simple statistics like the mean, the easiest way to ignore NA (the
missing data) is to use na.rm = TRUE (rm stands for remove). ---datacarpentry.org (https://
datacarpentry.org/genomics-r-intro/05-dplyr/index.html)

Let's see this in practice

#This is used to get the same result

#with a pseudorandom number generator like sample()
set.seed(138)

Bioinformatics Training and Education Program

100 dplyr: joining, tranforming, and summarizing data frames

#make mock data frame

fun_df<-data.frame(genes=rep(c("A","B","C","D"), each=3),
counts=sample(1:500,12,TRUE)) %>%
#Assign NAs if the value is less than 100. This is arbitrary.
mutate(counts=replace(counts, counts<100, NA))

#let's view
fun_df

## genes counts
## 1 A NA
## 2 A 214
## 3 A NA
## 4 B 352
## 5 B 256
## 6 B NA
## 7 C 400
## 8 C 381
## 9 C 250
## 10 D 278
## 11 D NA
## 12 D 169

#Summarize mean, median, min, and max

fun_df %>%
group_by(genes) %>%
summarize(
mean_count = mean(counts),
median_count = median(counts),
min_count = min(counts),
max_count = max(counts))

## # A tibble: 4 × 5
## genes mean_count median_count min_count max_count
## <chr> <dbl> <int> <int> <int>
## 1 A NA NA NA NA
## 2 B NA NA NA NA
## 3 C 344. 381 250 400
## 4 D NA NA NA NA

#use na.rm

Bioinformatics Training and Education Program

101 dplyr: joining, tranforming, and summarizing data frames

fun_df %>%
group_by(genes) %>%
summarize(
mean_count = mean(counts, na.rm=TRUE),
median_count = median(counts, na.rm=TRUE),
min_count = min(counts, na.rm=TRUE),
max_count = max(counts, na.rm=TRUE))

## # A tibble: 4 × 5
## genes mean_count median_count min_count max_count
## <chr> <dbl> <dbl> <int> <int>
## 1 A 214 214 214 214
## 2 B 304 304 256 352
## 3 C 344. 381 250 400
## 4 D 224. 224. 169 278

Similar to mutate, we can summarize across multiple columns at once using across(). For
example, let's get the mean of logFC and logCPM.

dexp %>%
summarize(across(starts_with("Log"), mean))

## # A tibble: 1 × 2
## logFC logCPM
## <dbl> <dbl>
## 1 -0.00859 3.71

Exporting files using the write functions

We have learned how to import data using the read functions, but how can we export / write out
data? We can use a series of write functions. Some examples from readr include
write_csv(), write_delim(), write_tsv(). Some examples from base R include
write.csv(), write.table(), writeLines().

Let's export a tab delimited file containing acount_smeta.

write_tsv(acount_smeta, "countsANDmeta.txt", quote="none")

Bioinformatics Training and Education Program

Data Manipulation with dplyr
100% (1)
Data Manipulation with dplyr
39 pages
Lab 1 (With Answers)
No ratings yet
Lab 1 (With Answers)
44 pages
R Programming Cont..
No ratings yet
R Programming Cont..
24 pages
Basics of R: Installation & Data Types
No ratings yet
Basics of R: Installation & Data Types
43 pages
R File Code
No ratings yet
R File Code
16 pages
Data Wrangling Cheatsheet PDF
No ratings yet
Data Wrangling Cheatsheet PDF
2 pages
Data Wrangling Cheatsheet PDF
No ratings yet
Data Wrangling Cheatsheet PDF
2 pages
Dplyr Cheatsheet PDF
100% (1)
Dplyr Cheatsheet PDF
2 pages
05 Dplyr
No ratings yet
05 Dplyr
37 pages
Week 7
No ratings yet
Week 7
10 pages
R Programming Basics for Beginners
No ratings yet
R Programming Basics for Beginners
2 pages
Tidy Verse
No ratings yet
Tidy Verse
76 pages
R Packages Dplyr Sem-III 2021
No ratings yet
R Packages Dplyr Sem-III 2021
13 pages
R Tutorial #1: Applied Econometrics (Econ3005)
No ratings yet
R Tutorial #1: Applied Econometrics (Econ3005)
21 pages
Chapter 03 Wrangling
No ratings yet
Chapter 03 Wrangling
40 pages
Basic R Dplyr Session 4 Demonstration
No ratings yet
Basic R Dplyr Session 4 Demonstration
18 pages
R Guru Cheat Sheet
No ratings yet
R Guru Cheat Sheet
2 pages
R Exercises
No ratings yet
R Exercises
3 pages
R Practical
No ratings yet
R Practical
9 pages
R Basic and Advanced
No ratings yet
R Basic and Advanced
9 pages
BS730 Class 12
No ratings yet
BS730 Class 12
36 pages
R Lecture 2-1
No ratings yet
R Lecture 2-1
28 pages
MBA Sem 1 Unit 3 Fundamentals of R
No ratings yet
MBA Sem 1 Unit 3 Fundamentals of R
41 pages
Tidyr & Dplyr Functions Guide
No ratings yet
Tidyr & Dplyr Functions Guide
3 pages
BIO259 Note
No ratings yet
BIO259 Note
55 pages
DS-R Block 3-1 All
No ratings yet
DS-R Block 3-1 All
43 pages
Tutorial-Introduction To Dplyr
No ratings yet
Tutorial-Introduction To Dplyr
54 pages
R Data Subsetting & Manipulation Guide
No ratings yet
R Data Subsetting & Manipulation Guide
44 pages
Important R Codes and Notes
No ratings yet
Important R Codes and Notes
13 pages
R Reference Card
100% (4)
R Reference Card
4 pages
RSTUDIO
No ratings yet
RSTUDIO
44 pages
Tidy Data Principles and R Packages
No ratings yet
Tidy Data Principles and R Packages
14 pages
Data Wrangling
No ratings yet
Data Wrangling
12 pages
Advanced R Programming Tidyverse Packages Notes
No ratings yet
Advanced R Programming Tidyverse Packages Notes
12 pages
R Cheat Sheet: 1. Basics 4. Input and Export of Data
100% (1)
R Cheat Sheet: 1. Basics 4. Input and Export of Data
4 pages
R Cheat Sheet Merged
100% (2)
R Cheat Sheet Merged
35 pages
R Programming Cheat Sheet Guide
No ratings yet
R Programming Cheat Sheet Guide
4 pages
R Functions
No ratings yet
R Functions
8 pages
STATA - Subject Table of Contents
No ratings yet
STATA - Subject Table of Contents
15 pages
Unit No 4 Question 2024-1
No ratings yet
Unit No 4 Question 2024-1
3 pages
Do The Same For All The Questions
No ratings yet
Do The Same For All The Questions
6 pages
DSR LAB MANUAL - 10 Programs
No ratings yet
DSR LAB MANUAL - 10 Programs
34 pages
A Short List of Some Useful R Commands: Input and Display
No ratings yet
A Short List of Some Useful R Commands: Input and Display
2 pages
Campbell Biology Concepts and Connections 9th Edition Taylor Unlocked Test Bank
No ratings yet
Campbell Biology Concepts and Connections 9th Edition Taylor Unlocked Test Bank
307 pages
1st Lecture
No ratings yet
1st Lecture
48 pages
Module 4-5 Cells
No ratings yet
Module 4-5 Cells
36 pages
Microbiology Lab Equipment Guide
No ratings yet
Microbiology Lab Equipment Guide
3 pages
Assessment of Genetic Variability For Aluminum Tolerance in Cowpea Accessions Screened in Pots Under Field Conditions
No ratings yet
Assessment of Genetic Variability For Aluminum Tolerance in Cowpea Accessions Screened in Pots Under Field Conditions
10 pages
Drugs Acting at Muscular Junctions
No ratings yet
Drugs Acting at Muscular Junctions
24 pages
Testbank For Campbell Biology Canadian Edition 3rd Edition Reece Instant Download
No ratings yet
Testbank For Campbell Biology Canadian Edition 3rd Edition Reece Instant Download
18 pages
Syllabus of Class XI Science 2025 26
No ratings yet
Syllabus of Class XI Science 2025 26
14 pages
Pattern of Speciation of Felidae Zia Czarina A. Garcia
No ratings yet
Pattern of Speciation of Felidae Zia Czarina A. Garcia
14 pages
The Social Life of Painting in Ancient Rome and On The Bay of Naples Eleanor Winsor Leach Download PDF
No ratings yet
The Social Life of Painting in Ancient Rome and On The Bay of Naples Eleanor Winsor Leach Download PDF
51 pages
4 SP Class Ix Science
No ratings yet
4 SP Class Ix Science
19 pages
Life Sciences Content Manual
No ratings yet
Life Sciences Content Manual
55 pages
AQA BYA1 W MS Jan03
No ratings yet
AQA BYA1 W MS Jan03
5 pages
Positive Psychology The Science of WellBeing by Zelenski Official Test Bank
No ratings yet
Positive Psychology The Science of WellBeing by Zelenski Official Test Bank
328 pages
Fundamentals of Forensic Science 3rd Edition All-in-One Download
100% (18)
Fundamentals of Forensic Science 3rd Edition All-in-One Download
15 pages
Hans Eysenck's Biologically Based Factor Theory
No ratings yet
Hans Eysenck's Biologically Based Factor Theory
17 pages
Fluted Pumpkin Yield Optimization
No ratings yet
Fluted Pumpkin Yield Optimization
9 pages
Small Victories
No ratings yet
Small Victories
9 pages
Field Experimental Techniques in Entomology
No ratings yet
Field Experimental Techniques in Entomology
13 pages
Testbank For Biology The Essentials 4th Edition Hoefnagels
No ratings yet
Testbank For Biology The Essentials 4th Edition Hoefnagels
17 pages
Genetic Engineering in Biotechnology
No ratings yet
Genetic Engineering in Biotechnology
7 pages
Serra Et Al 2019
No ratings yet
Serra Et Al 2019
7 pages
Ubt Test 1S
No ratings yet
Ubt Test 1S
28 pages
Sist en 17980 2024
No ratings yet
Sist en 17980 2024
10 pages
A Theory of Everything PDF
No ratings yet
A Theory of Everything PDF
10 pages
Hybridoma Tech for Biotech Experts
No ratings yet
Hybridoma Tech for Biotech Experts
19 pages
NEET Prep: The Living World
No ratings yet
NEET Prep: The Living World
15 pages
(Ebook) Biodiesel Fuels: Science, Technology, Health, and Environment (Handbook of Biodiesel and Petrodiesel Fuels) Vol 1 by Ozcan Konur (Editor) ISBN 9780367456146, 0367456141 PDF Download
100% (4)
(Ebook) Biodiesel Fuels: Science, Technology, Health, and Environment (Handbook of Biodiesel and Petrodiesel Fuels) Vol 1 by Ozcan Konur (Editor) ISBN 9780367456146, 0367456141 PDF Download
76 pages
08 Translation in Prokaryotes and Eukaryotes
No ratings yet
08 Translation in Prokaryotes and Eukaryotes
30 pages
3.2 Flow Cytometry
No ratings yet
3.2 Flow Cytometry
4 pages

Data Wrangling

Uploaded by

Data Wrangling

Uploaded by

70 Introduction to dplyr and the %>%

Introduction to dplyr and the %>%

1. how to filter data frames using dplyr

Read more about dplyr at https://dplyr.tidyverse.org/articles/programming.html (https://

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.

Bioinformatics Training and Education Program

Let's begin by importing the data.

#let's use our differential expression results

## Rows: 15926 Columns: 10

Bioinformatics Training and Education Program

## dbl (5): logFC, logCPM, F, PValue, FDR

Bioinformatics Training and Education Program

Subsetting data in base R

## Sepal.Length Sepal.Width Petal.Length

Subsetting with dplyr

Bioinformatics Training and Education Program

Subsetting by column (select())

Select a few columns from our differential expression results (dexp).

#select the gene / transcript, logFC, and FDR corrected p-value

We can rename while selecting.

#rename using the syntax new_name = old_name

Bioinformatics Training and Education Program

## 2 DPM1 0.198 0.0770

Bioinformatics Training and Education Program

We can reorder using select().

For readability, let's move the transcript column to the front.

This also would have excluded the feature column.

Bioinformatics Training and Education Program

Selecting a range of columns

We can also include helper functions such as starts_with() and ends_with()

select(dexp, transcript, starts_with("log"), FDR)

Bioinformatics Training and Education Program

## 10 NIPAL3 -0.0500 7.04 0.695

Select columns of a particular type

select(dexp, where(is.numeric)) #or

select_if(dexp, is.numeric) #select_if is a scoped verb function

Bioinformatics Training and Education Program

## 10 -0.0500 7.04 0.350 0.569 0.695

Subsetting by row (filter())

To subset by row, we use the function filter().

filter(smeta, dex == "trt") #we've seen == notation before

Try the following:

Keep these comparison operators in mind for filtering.

>= greater than or equal to

< less than

<= less than or equal to

Bioinformatics Training and Education Program

Comparison Operator Description

filter(smeta, dex == "trt" & cell == "N61311")

A , is treated the same as & in the case of filter().

filter(smeta, dex == "trt", cell == "N61311")

We can also filter by one condition or another using the |.

filter(smeta,cell == "N080611" | cell == "N61311")

Bioinformatics Training and Education Program

The %in% operator

Used to match elements of a vector.

smeta$SampleName %in% c("GSM1275871","GSM1275863")

## [1] FALSE TRUE FALSE FALSE FALSE TRUE FALSE FALSE

c("GSM1275871","GSM1275863") %in% smeta$SampleName

## [1] TRUE TRUE

We can combine the %in% operator with filter().

#filter for two cell lines

Including multiple phrases

#use `|` operator

Bioinformatics Training and Education Program

abs(logFC) > 2, (PValue | FDR <= 0.01))

Filtering across columns

Let's briefly see this in action.

f<-filter(dexp, if_all(PValue:FDR, ~ . < 0.05))

Subsetting rows by position

Introducing the pipe

Bioinformatics Training and Education Program

#Run one step at a time with intermediate objects.

#Now filter for only the genes "TSPAN6" and DPM1

#Nested code example

Let's see how this works