Brings transcriptomics to the tidyverse!
The code is released under the version 3 of the GNU General Public License.
website: stemangiola.github.io/tidybulk/ Third party tutorials Please have a look also to
- tidySummarizedExperiment for bulk data tidy representation
- tidySingleCellExperiment for single-cell data tidy representation
- tidyseurat for single-cell data tidy representation
- tidyHeatmap for heatmaps produced with tidy principles analysis and manipulation
- tidygate for adding custom gate information to your tibble
Function | Description |
---|---|
aggregate_duplicates |
Aggregate abundance and annotation of duplicated transcripts in a robust way |
identify_abundant keep_abundant |
identify or keep the abundant genes |
keep_variable |
Filter for top variable features |
scale_abundance |
Scale (normalise) abundance for RNA sequencing depth |
reduce_dimensions |
Perform dimensionality reduction (PCA, MDS, tSNE, UMAP) |
cluster_elements |
Labels elements with cluster identity (kmeans, SNN) |
remove_redundancy |
Filter out elements with highly correlated features |
adjust_abundance |
Remove known unwanted variation (Combat) |
test_differential_abundance |
Differential transcript abundance testing (DESeq2, edgeR, voom) |
deconvolve_cellularity |
Estimated tissue composition (Cibersort, llsr, epic, xCell, mcp_counter, quantiseq |
test_differential_cellularity |
Differential cell-type abundance testing |
test_stratification_cellularity |
Estimate Kaplan-Meier survival differences |
test_gene_enrichment |
Gene enrichment analyses (EGSEA) |
test_gene_overrepresentation |
Gene enrichment on list of transcript names (no rank) |
test_gene_rank |
Gene enrichment on list of transcript (GSEA) |
impute_missing_abundance |
Impute abundance for missing data points using sample groupings |
Utilities | Description |
---|---|
get_bibliography |
Get the bibliography of your workflow |
tidybulk |
add tidybulk attributes to a tibble object |
tidybulk_SAM_BAM |
Convert SAM BAM files into tidybulk tibble |
pivot_sample |
Select sample-wise columns/information |
pivot_transcript |
Select transcript-wise columns/information |
rotate_dimensions |
Rotate two dimensions of a degree |
ensembl_to_symbol |
Add gene symbol from ensembl IDs |
symbol_to_entrez |
Add entrez ID from gene symbol |
describe_transcript |
Add gene description from gene symbol |
All functions are directly compatibble with SummarizedExperiment
object.
From Bioconductor
BiocManager::install("tidybulk")
From Github
devtools::install_github("stemangiola/tidybulk")
We will use a SummarizedExperiment
object
counts_SE
## # A SummarizedExperiment-tibble abstraction: 408,624 × 8
## # �[90mFeatures=8513 | Samples=48 | Assays=count�[0m
## .feature .sample count Cell.type time condition batch factor_of_interest
## <chr> <chr> <dbl> <fct> <fct> <lgl> <fct> <lgl>
## 1 A1BG SRR1740034 153 b_cell 0 d TRUE 0 TRUE
## 2 A1BG-AS1 SRR1740034 83 b_cell 0 d TRUE 0 TRUE
## 3 AAAS SRR1740034 868 b_cell 0 d TRUE 0 TRUE
## 4 AACS SRR1740034 222 b_cell 0 d TRUE 0 TRUE
## 5 AAGAB SRR1740034 590 b_cell 0 d TRUE 0 TRUE
## 6 AAMDC SRR1740034 48 b_cell 0 d TRUE 0 TRUE
## 7 AAMP SRR1740034 1257 b_cell 0 d TRUE 0 TRUE
## 8 AANAT SRR1740034 284 b_cell 0 d TRUE 0 TRUE
## 9 AAR2 SRR1740034 379 b_cell 0 d TRUE 0 TRUE
## 10 AARS2 SRR1740034 685 b_cell 0 d TRUE 0 TRUE
## # ℹ 40 more rows
Loading tidySummarizedExperiment
will automatically abstract this
object as tibble
, so we can display it and manipulate it with tidy
tools. Although it looks different, and more tools (tidyverse) are
available to us, this object is in fact a SummarizedExperiment
object.
class(counts_SE)
## [1] "SummarizedExperiment"
## attr(,"package")
## [1] "SummarizedExperiment"
First of all, you can cite all articles utilised within your workflow automatically from any tidybulk tibble
counts_SE |> get_bibliography()
tidybulk provide the aggregate_duplicates
function to aggregate
duplicated transcripts (e.g., isoforms, ensembl). For example, we often
have to convert ensembl symbols to gene/transcript symbol, but in doing
so we have to deal with duplicates. aggregate_duplicates
takes a
tibble and column names (as symbols; for sample
, transcript
and
count
) as arguments and returns a tibble with transcripts with the
same name aggregated. All the rest of the columns are appended, and
factors and boolean are appended as characters.
TidyTranscriptomics
rowData(counts_SE)$gene_name = rownames(counts_SE)
counts_SE.aggr = counts_SE |> aggregate_duplicates(.transcript = gene_name)
Standard procedure (comparative purpose)
temp = data.frame(
symbol = dge_list$genes$symbol,
dge_list$counts
)
dge_list.nr <- by(temp, temp$symbol,
function(df)
if(length(df[1,1])>0)
matrixStats:::colSums(as.matrix(df[,-1]))
)
dge_list.nr <- do.call("rbind", dge_list.nr)
colnames(dge_list.nr) <- colnames(dge_list)
We may want to compensate for sequencing depth, scaling the transcript
abundance (e.g., with TMM algorithm, Robinson and Oshlack
doi.org/10.1186/gb-2010-11-3-r25). scale_abundance
takes a tibble,
column names (as symbols; for sample
, transcript
and count
) and a
method as arguments and returns a tibble with additional columns with
scaled data as <NAME OF COUNT COLUMN>_scaled
.
TidyTranscriptomics
counts_SE.norm = counts_SE.aggr |> identify_abundant(factor_of_interest = condition) |> scale_abundance()
## tidybulk says: the sample with largest library size SRR1740080 was chosen as reference for scaling
Standard procedure (comparative purpose)
library(edgeR)
dgList <- DGEList(count_m=x,group=group)
keep <- filterByExpr(dgList)
dgList <- dgList[keep,,keep.lib.sizes=FALSE]
[...]
dgList <- calcNormFactors(dgList, method="TMM")
norm_counts.table <- cpm(dgList)
We can easily plot the scaled density to check the scaling outcome. On the x axis we have the log scaled counts, on the y axes we have the density, data is grouped by sample and coloured by cell type.
counts_SE.norm |>
ggplot(aes(count_scaled + 1, group=.sample, color=`Cell.type`)) +
geom_density() +
scale_x_log10() +
my_theme
We may want to identify and filter variable transcripts.
TidyTranscriptomics
counts_SE.norm.variable = counts_SE.norm |> keep_variable()
## Getting the 500 most variable genes
Standard procedure (comparative purpose)
library(edgeR)
x = norm_counts.table
s <- rowMeans((x-rowMeans(x))^2)
o <- order(s,decreasing=TRUE)
x <- x[o[1L:top],,drop=FALSE]
norm_counts.table = norm_counts.table[rownames(x)]
norm_counts.table$cell_type = tibble_counts[
match(
tibble_counts$sample,
rownames(norm_counts.table)
),
"Cell.type"
]
We may want to reduce the dimensions of our data, for example using PCA
or MDS algorithms. reduce_dimensions
takes a tibble, column names (as
symbols; for sample
, transcript
and count
) and a method (e.g., MDS
or PCA) as arguments and returns a tibble with additional columns for
the reduced dimensions.
MDS (Robinson et al., 10.1093/bioinformatics/btp616)
TidyTranscriptomics
counts_SE.norm.MDS =
counts_SE.norm |>
reduce_dimensions(method="MDS", .dims = 6)
## Getting the 500 most variable genes
## tidybulk says: to access the raw results do `attr(..., "internals")$MDS`
Standard procedure (comparative purpose)
library(limma)
count_m_log = log(count_m + 1)
cmds = limma::plotMDS(ndim = .dims, plot = FALSE)
cmds = cmds %$%
cmdscale.out |>
setNames(sprintf("Dim%s", 1:6))
cmds$cell_type = tibble_counts[
match(tibble_counts$sample, rownames(cmds)),
"Cell.type"
]
On the x and y axes axis we have the reduced dimensions 1 to 3, data is coloured by cell type.
counts_SE.norm.MDS |> pivot_sample() |> select(contains("Dim"), everything())