suppressPackageStartupMessages(library(tidyverse))
In this first post, we want to introduce the concept of tidy transcriptomics and its principles.
Tidy transcriptomics is a specific approach to transcriptomic data analysis in R that uses the "tidy" principles proposed by Wickham et al.. Tidy transcriptomics introduces both a tidy data representation and manipulation (for single-cell and bulk) with the packages tidySummarizedExperiment
, tidySingleCellExperiment
, and tidyseurat
and a tidy analysis workflow (for bulk data) with the package tidybulk
.
This manifesto lays down the principles of tidy transcriptomics that tidybulk
and tidySummarizedExperiment
, tidySingleCellExperiment
, and tidyseurat
are based on. These principles are in line with the tidy tools manifesto.
log10_reverse
scale instead of transforming the p-values in their negative-log form; for visualising (raw, scaled and/or adjusted) transcript abundance (in the form of read counts), use the log1p
scale instead of transforming the data in its log (or log count-per-million) form.All packages are under active development and are in a maturing lifecycle.
There are two parts to the tidy transcriptomics ecosystem: data and analysis framework. So far, for bulk RNA sequencing, both data (tidySummarizedExperiment) and analysis (tidybulk) frameworks are available. In contrast, for single-cell, only data frameworks are available (tidySingleCellExperiment and tidyseurat).
Data frameworks are not data containers. Data frameworks are data abstractions that display and manipulate your existing containers (i.e. SummarizedExperiment
, SingleCellExperiment
and Seurat
object) in a tidy manner. Therefore there is no such thing as a tidy*
object. This has the advantage of allowing you to use tidyverse on transcriptomics data without compromising your existing pipelines. That is if a method works for SummarizedExperiment
it works for its tidy representation.
Therefore, the question "can we go from tidySummarizedExperiment
to SummarizedExperiment
and vice versa" is not relevant, as we never leave SummarizedExperiment
in the first place.
With tidy transcriptomics, we differentiate the data container (tibble, S4 objects) from the user interface (tidy). When we display, manipulate and visualise transcriptomic data, we might not care how the data is stored (tibble or hierarchical-rectangular structures) as long as the interface is consistent.
Bulk RNA sequencing data
SummarizedExperiment
can interface with Bioconductor
and tidybulk
tibble
can interface with tidyverse
and tidybulk
tidySummarizedExperiment
can interface with all three: Bioconductor
, tidyverse
and tidybulk
Single-cell RNA sequencing data
SingleCellExperiment
can interface with Bioconductor
tidySingleCellExperiment
can interface with Bioconductor
and tidyverse
Seurat
object can interface with Seurat
tidyseurat
object can interface with Seurat
and tidyverse
Here we provide few examples of the differences in programming transcriptomics analyses with tidy transcriptomics and the base R alternative. An important aspect is that tidy transcriptomics is complementary to standard workflows. Everything that works with SummarizedExperiment
, SingleCellExperiment
and Seurat
works with their tidy representations.
This example is taken from the workshop BioC2021.
As tidybulk smoothly integrates with ggplot2 and other tidyverse packages it can save on typing and make plots easier to generate. Compare the code for creating density plots with tidybulk versus standard base R below (standard code adapted from (Law et al. 2016)).
base R using edgeR
# Example code, no need to run # Prepare data set library(edgeR) dgList <- SE2DGEList(airway) group <- factor(dgList$samples$dex) keep.exprs <- filterByExpr(dgList, group = group) dgList <- dgList[keep.exprs, , keep.lib.sizes = FALSE] nsamples <- ncol(dgList) logcounts <- log2(dgList$counts) # Setup graphics col <- RColorBrewer::brewer.pal(nsamples, "Paired") par(mfrow = c(1, 2)) # Plot raw counts plot(density(logcounts[, 1]), col = col[1], lwd = 2, ylim = c(0, 0.26), las = 2, main = "", xlab = "") title(main = "Counts") for (i in 2:nsamples) { den <- density(logcounts[, i]) lines(den$x, den$y, col = col[i], lwd = 2) } legend("topright", legend = dgList$samples$Run, text.col = col, bty = "n") # Plot scaled counts dgList_norm <- calcNormFactors(dgList) lcpm_n <- cpm(dgList_norm, log = TRUE) plot(density(lcpm_n[, 1]), col = col[1], lwd = 2, ylim = c(0, 0.26), las = 2, main = "", xlab = "") title("Counts scaled") for (i in 2:nsamples) { den <- density(lcpm_n[, i]) lines(den$x, den$y, col = col[i], lwd = 2) } legend("topright", legend = dgList_norm$samples$Run, text.col = col, bty = "n")
tidybulk
# tidybulk airway %>% keep_abundant(factor_of_interest = dex) %>% scale_abundance() %>% pivot_longer(cols = c("counts", "counts_scaled"), names_to = "source", values_to = "abundance") %>% ggplot(aes(x = abundance + 1, color = sample)) + geom_density() + facet_wrap(~source) + scale_x_log10() + custom_theme
suppressPackageStartupMessages(library(tidySingleCellExperiment))
These examples are taken from the article Interfacing Seurat with the R tidy universe.
Calculating a transcriptional signature.
Base
# Load data pbmc_SCE = tidySingleCellExperiment::pbmc_small signature_score_1 = pbmc_SCE[c("CD79A", "CD79B"),] %>% logcounts() %>% colSums() %>% scales::rescale(to = c(0,1)) signature_score_2 = pbmc_SCE[c("CD3D", "CCR7"),] %>% logcounts() %>% colSums() %>% scales::rescale(to = c(0,1)) signature_score_1 - signature_score_2
tidySingleCellExperiment
# Load data pbmc_SCE = tidySingleCellExperiment::pbmc_small pbmc_SCE %>% join_features(c("CD79A", "CD79B", "CD3D", "CCR7"), assay = "logcounts", shape = "wide") %>% mutate(signature_score = scales::rescale(CD79A + CD79B, to = c(0,1)) - scales::rescale(CD3D + CCR7, to = c(0,1)) ) %>% select(signature_score, everything())
Cell subsampling for mulisample balanced UMAP representation
Base
splits = colnames(pbmc_SCE) %>% split(pbmc_SCE$file ) min_size = splits %>% sapply(length) %>% min() cell_subset = splits %>% lapply(function(x) sample(x, min_size)) %>% unlist() pbmc_SCE[, cell_subset]
tidySingleCellExperiment
pbmc_SCE %>% add_count(file , name = "tot_cells") %>% mutate(min_cells = min(tot_cells)) %>% group_by(file ) %>% sample_n(min_cells)
detach(package:tidySingleCellExperiment, unload=TRUE)
Calculating a transcriptional signature.
Base
suppressPackageStartupMessages(library(Seurat)) # Load data pbmc_Seurat = tidySingleCellExperiment::pbmc_small %>% as.Seurat() %>% NormalizeData() signature_score_1 = pbmc_Seurat[c("CD79A", "CD79B"),] %>% Seurat::GetAssayData() %>% colSums() %>% scales::rescale(to = c(0,1)) signature_score_2 = pbmc_Seurat[c("CD3D", "CCR7"),] %>% Seurat::GetAssayData() %>% colSums() %>% scales::rescale(to = c(0,1)) signature_score_1 - signature_score_2
tidyseurat
It is the same as for tidySingleCellExperiment
suppressPackageStartupMessages(library(tidyseurat)) pbmc_Seurat %>% join_features(c("CD79A", "CD79B", "CD3D", "CCR7"), shape = "wide") %>% mutate(signature_score = scales::rescale(CD79A + CD79B, to = c(0,1)) - scales::rescale(CD3D + CCR7, to = c(0,1)) ) %>% select(signature_score, everything())
Cell subsampling for multisample balanced UMAP representation
Base
splits = colnames(pbmc_Seurat) %>% split(pbmc_Seurat$file ) min_size = splits %>% sapply(length) %>% min() cell_subset = splits %>% lapply(function(x) sample(x, min_size)) %>% unlist() pbmc_Seurat[, cell_subset]
tidyseurat
It is the same as for tidySingleCellExperiment
pbmc_Seurat %>% add_count(file , name = "tot_cells") %>% mutate(min_cells = min(tot_cells)) %>% group_by(file ) %>% sample_n(min_cells)
We delivered a series of workshops on tidy transcriptomics, which tracks the progress of the ecosystem.
Mangiola, S., Molania, R., Dong, R. et al. tidybulk: an R tidy framework for modular transcriptomic data analysis. Genome Biol 22, 42 (2021). https://doi.org/10.1186/s13059-020-02233-7
Stefano Mangiola, Maria A Doyle, Anthony T Papenfuss, Interfacing Seurat with the R tidy universe, Bioinformatics, 2021, https://doi.org/10.1093/bioinformatics/btab404
You can use tidyseurat
citation as introduces the concepts of data abstraction.
{=html}
<script src="https://utteranc.es/client.js"
repo="stemangiola/stemangiola.github.io"
issue-term="pathname"
theme="github-light"
crossorigin="anonymous"
async>
</script>
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.