tidybulk - part of tidyTranscriptomics

Brings transcriptomics to the tidyverse!

The code is released under the version 3 of the GNU General Public License.

website: stemangiola.github.io/tidybulk/ Third party tutorials Please have a look also to

tidySummarizedExperiment for bulk data tidy representation
tidySingleCellExperiment for single-cell data tidy representation
tidyseurat for single-cell data tidy representation
tidyHeatmap for heatmaps produced with tidy principles analysis and manipulation
tidygate for adding custom gate information to your tibble

Functions/utilities available

Function	Description
`aggregate_duplicates`	Aggregate abundance and annotation of duplicated transcripts in a robust way
`identify_abundant` `keep_abundant`	identify or keep the abundant genes
`keep_variable`	Filter for top variable features
`scale_abundance`	Scale (normalise) abundance for RNA sequencing depth
`reduce_dimensions`	Perform dimensionality reduction (PCA, MDS, tSNE, UMAP)
`cluster_elements`	Labels elements with cluster identity (kmeans, SNN)
`remove_redundancy`	Filter out elements with highly correlated features
`adjust_abundance`	Remove known unwanted variation (Combat)
`test_differential_abundance`	Differential transcript abundance testing (DESeq2, edgeR, voom)
`deconvolve_cellularity`	Estimated tissue composition (Cibersort, llsr, epic, xCell, mcp_counter, quantiseq
`test_differential_cellularity`	Differential cell-type abundance testing
`test_stratification_cellularity`	Estimate Kaplan-Meier survival differences
`test_gene_enrichment`	Gene enrichment analyses (EGSEA)
`test_gene_overrepresentation`	Gene enrichment on list of transcript names (no rank)
`test_gene_rank`	Gene enrichment on list of transcript (GSEA)
`impute_missing_abundance`	Impute abundance for missing data points using sample groupings

Utilities	Description
`get_bibliography`	Get the bibliography of your workflow
`tidybulk`	add tidybulk attributes to a tibble object
`tidybulk_SAM_BAM`	Convert SAM BAM files into tidybulk tibble
`pivot_sample`	Select sample-wise columns/information
`pivot_transcript`	Select transcript-wise columns/information
`rotate_dimensions`	Rotate two dimensions of a degree
`ensembl_to_symbol`	Add gene symbol from ensembl IDs
`symbol_to_entrez`	Add entrez ID from gene symbol
`describe_transcript`	Add gene description from gene symbol

All functions are directly compatibble with SummarizedExperiment object.

Installation

From Bioconductor

BiocManager::install("tidybulk")

From Github

devtools::install_github("stemangiola/tidybulk")

Data

We will use a SummarizedExperiment object

counts_SE

## # A SummarizedExperiment-tibble abstraction: 408,624 × 8
## # �[90mFeatures=8513 | Samples=48 | Assays=count�[0m
##    .feature .sample    count Cell.type time  condition batch factor_of_interest
##    <chr>    <chr>      <dbl> <fct>     <fct> <lgl>     <fct> <lgl>             
##  1 A1BG     SRR1740034   153 b_cell    0 d   TRUE      0     TRUE              
##  2 A1BG-AS1 SRR1740034    83 b_cell    0 d   TRUE      0     TRUE              
##  3 AAAS     SRR1740034   868 b_cell    0 d   TRUE      0     TRUE              
##  4 AACS     SRR1740034   222 b_cell    0 d   TRUE      0     TRUE              
##  5 AAGAB    SRR1740034   590 b_cell    0 d   TRUE      0     TRUE              
##  6 AAMDC    SRR1740034    48 b_cell    0 d   TRUE      0     TRUE              
##  7 AAMP     SRR1740034  1257 b_cell    0 d   TRUE      0     TRUE              
##  8 AANAT    SRR1740034   284 b_cell    0 d   TRUE      0     TRUE              
##  9 AAR2     SRR1740034   379 b_cell    0 d   TRUE      0     TRUE              
## 10 AARS2    SRR1740034   685 b_cell    0 d   TRUE      0     TRUE              
## # ℹ 40 more rows

Loading tidySummarizedExperiment will automatically abstract this object as tibble, so we can display it and manipulate it with tidy tools. Although it looks different, and more tools (tidyverse) are available to us, this object is in fact a SummarizedExperiment object.

class(counts_SE)

## [1] "SummarizedExperiment"
## attr(,"package")
## [1] "SummarizedExperiment"

Get the bibliography of your workflow

First of all, you can cite all articles utilised within your workflow automatically from any tidybulk tibble

counts_SE |>    get_bibliography()

Aggregate duplicated `transcripts`

tidybulk provide the aggregate_duplicates function to aggregate duplicated transcripts (e.g., isoforms, ensembl). For example, we often have to convert ensembl symbols to gene/transcript symbol, but in doing so we have to deal with duplicates. aggregate_duplicates takes a tibble and column names (as symbols; for sample, transcript and count) as arguments and returns a tibble with transcripts with the same name aggregated. All the rest of the columns are appended, and factors and boolean are appended as characters.

TidyTranscriptomics

rowData(counts_SE)$gene_name = rownames(counts_SE)
counts_SE.aggr = counts_SE |> aggregate_duplicates(.transcript = gene_name)

Standard procedure (comparative purpose)

temp = data.frame(
    symbol = dge_list$genes$symbol,
    dge_list$counts
)
dge_list.nr <- by(temp, temp$symbol,
    function(df)
        if(length(df[1,1])>0)
            matrixStats:::colSums(as.matrix(df[,-1]))
)
dge_list.nr <- do.call("rbind", dge_list.nr)
colnames(dge_list.nr) <- colnames(dge_list)

Scale `counts`

We may want to compensate for sequencing depth, scaling the transcript abundance (e.g., with TMM algorithm, Robinson and Oshlack doi.org/10.1186/gb-2010-11-3-r25). scale_abundance takes a tibble, column names (as symbols; for sample, transcript and count) and a method as arguments and returns a tibble with additional columns with scaled data as <NAME OF COUNT COLUMN>_scaled.

TidyTranscriptomics

counts_SE.norm = counts_SE.aggr |> identify_abundant(factor_of_interest = condition) |> scale_abundance()

## tidybulk says: the sample with largest library size SRR1740080 was chosen as reference for scaling

Standard procedure (comparative purpose)

library(edgeR)

dgList <- DGEList(count_m=x,group=group)
keep <- filterByExpr(dgList)
dgList <- dgList[keep,,keep.lib.sizes=FALSE]
[...]
dgList <- calcNormFactors(dgList, method="TMM")
norm_counts.table <- cpm(dgList)

We can easily plot the scaled density to check the scaling outcome. On the x axis we have the log scaled counts, on the y axes we have the density, data is grouped by sample and coloured by cell type.

counts_SE.norm |>
    ggplot(aes(count_scaled + 1, group=.sample, color=`Cell.type`)) +
    geom_density() +
    scale_x_log10() +
    my_theme

Filter `variable transcripts`

We may want to identify and filter variable transcripts.

TidyTranscriptomics

counts_SE.norm.variable = counts_SE.norm |> keep_variable()

## Getting the 500 most variable genes

Standard procedure (comparative purpose)

library(edgeR)

x = norm_counts.table

s <- rowMeans((x-rowMeans(x))^2)
o <- order(s,decreasing=TRUE)
x <- x[o[1L:top],,drop=FALSE]

norm_counts.table = norm_counts.table[rownames(x)]

norm_counts.table$cell_type = tibble_counts[
    match(
        tibble_counts$sample,
        rownames(norm_counts.table)
    ),
    "Cell.type"
]

Reduce `dimensions`

We may want to reduce the dimensions of our data, for example using PCA or MDS algorithms. reduce_dimensions takes a tibble, column names (as symbols; for sample, transcript and count) and a method (e.g., MDS or PCA) as arguments and returns a tibble with additional columns for the reduced dimensions.

MDS (Robinson et al., 10.1093/bioinformatics/btp616)

TidyTranscriptomics

counts_SE.norm.MDS =
  counts_SE.norm |>
  reduce_dimensions(method="MDS", .dims = 6)

## Getting the 500 most variable genes

## tidybulk says: to access the raw results do `attr(..., "internals")$MDS`

Standard procedure (comparative purpose)

library(limma)

count_m_log = log(count_m + 1)
cmds = limma::plotMDS(ndim = .dims, plot = FALSE)

cmds = cmds %$% 
    cmdscale.out |>
    setNames(sprintf("Dim%s", 1:6))

cmds$cell_type = tibble_counts[
    match(tibble_counts$sample, rownames(cmds)),
    "Cell.type"
]

On the x and y axes axis we have the reduced dimensions 1 to 3, data is coloured by cell type.

counts_SE.norm.MDS |> pivot_sample()  |> select(contains("Dim"), everything())

Name		Name	Last commit message	Last commit date
Latest commit History 1,463 Commits
.github		.github
R		R
data		data
dev		dev
inst		inst
man		man
tests		tests
vignettes		vignettes
.Rbuildignore		.Rbuildignore
.coveralls.yml		.coveralls.yml
.gitignore		.gitignore
.travis.yml		.travis.yml
DESCRIPTION		DESCRIPTION
NAMESPACE		NAMESPACE
README.Rmd		README.Rmd
README.md		README.md
_pkgdown.yml		_pkgdown.yml
codecov.yml		codecov.yml

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

tidybulk - part of tidyTranscriptomics

Functions/utilities available

Installation

Data

Get the bibliography of your workflow

Aggregate duplicated `transcripts`

Scale `counts`

Filter `variable transcripts`

Reduce `dimensions`

Rotate `dimensions`

Test `differential abundance`

Adjust `counts`

Deconvolve `Cell type composition`

Test differential cell-type abundance

Cluster `samples`

Drop `redundant` transcripts

Other useful wrappers

From BAM/SAM to tibble of gene counts

From ensembl IDs to gene symbol IDs

From gene symbol to gene description (gene name in full)

About

Releases

Packages

Languages

william-hutchison/tidybulk

Folders and files

Latest commit

History

Repository files navigation

tidybulk - part of tidyTranscriptomics

Functions/utilities available

Installation

Data

Get the bibliography of your workflow

Aggregate duplicated transcripts

Scale counts

Filter variable transcripts

Reduce dimensions

Rotate dimensions

Test differential abundance

Adjust counts

Deconvolve Cell type composition

Test differential cell-type abundance

Cluster samples

Drop redundant transcripts

Other useful wrappers

From BAM/SAM to tibble of gene counts

From ensembl IDs to gene symbol IDs

From gene symbol to gene description (gene name in full)

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Aggregate duplicated `transcripts`

Scale `counts`

Filter `variable transcripts`

Reduce `dimensions`

Rotate `dimensions`

Test `differential abundance`

Adjust `counts`

Deconvolve `Cell type composition`

Cluster `samples`

Drop `redundant` transcripts

Packages