In stemangiola/tidyTranscriptomics: The tidyomics blog

suppressPackageStartupMessages(library(tidyverse))

In this first post, we want to introduce the concept of tidy transcriptomics and its principles.

What is tidy transcriptomics?

Tidy transcriptomics is a specific approach to transcriptomic data analysis in R that uses the "tidy" principles proposed by Wickham et al.. Tidy transcriptomics introduces both a tidy data representation and manipulation (for single-cell and bulk) with the packages tidySummarizedExperiment, tidySingleCellExperiment, and tidyseurat and a tidy analysis workflow (for bulk data) with the package tidybulk.

The tidy transcriptomics manifesto

This manifesto lays down the principles of tidy transcriptomics that tidybulk and tidySummarizedExperiment, tidySingleCellExperiment, and tidyseurat are based on. These principles are in line with the tidy tools manifesto.

Use easy-to-understand, verbose, jargon- and acronym-free vocabulary. It is a bad sign if the English dictionary is not enough to understand the underlying meaning of a function or a variable name. _https://iubmb.onlinelibrary.wiley.com/doi/10.1002/bmb.20922_; and the cost of saving a few characters may be bigger than we think.
Present/visualise information in its raw form when possible. Modern visualisation tools (e.g. ggplot2) allow the use of custom scales to visualise data. For example, for visualising p-values apply the log10_reverse scale instead of transforming the p-values in their negative-log form; for visualising (raw, scaled and/or adjusted) transcript abundance (in the form of read counts), use the log1p scale instead of transforming the data in its log (or log count-per-million) form.
Avoid the creation of temporary variables when possible. Creating variables to store data is a bug-prone process when working interactively, especially if multiple assignments are done on the same variable through the workflow. The main utility of variables should be to store data that is used more than once. Tidyverse allows complex operations to be combined in a simple way, reducing the need to create temporary variables.

Integration map

All packages are under active development and are in a maturing lifecycle.

roadmap

There are two parts to the tidy transcriptomics ecosystem: data and analysis framework. So far, for bulk RNA sequencing, both data (tidySummarizedExperiment) and analysis (tidybulk) frameworks are available. In contrast, for single-cell, only data frameworks are available (tidySingleCellExperiment and tidyseurat).

What data frameworks are and what they are not

Data frameworks are not data containers. Data frameworks are data abstractions that display and manipulate your existing containers (i.e. SummarizedExperiment, SingleCellExperiment and Seurat object) in a tidy manner. Therefore there is no such thing as a tidy* object. This has the advantage of allowing you to use tidyverse on transcriptomics data without compromising your existing pipelines. That is if a method works for SummarizedExperiment it works for its tidy representation.

Therefore, the question "can we go from tidySummarizedExperiment to SummarizedExperiment and vice versa" is not relevant, as we never leave SummarizedExperiment in the first place.

Giving a consistent interface despite different data containers

With tidy transcriptomics, we differentiate the data container (tibble, S4 objects) from the user interface (tidy). When we display, manipulate and visualise transcriptomic data, we might not care how the data is stored (tibble or hierarchical-rectangular structures) as long as the interface is consistent.

Which analysis framework can interface with which data framework

Bulk RNA sequencing data

SummarizedExperiment can interface with Bioconductor and tidybulk
tibble can interface with tidyverse and tidybulk
tidySummarizedExperiment can interface with all three: Bioconductor, tidyverse and tidybulk

Single-cell RNA sequencing data

SingleCellExperiment can interface with Bioconductor
tidySingleCellExperiment can interface with Bioconductor and tidyverse
Seurat object can interface with Seurat
tidyseurat object can interface with Seurat and tidyverse

Differences between tidy and base transcriptomics syntax

Here we provide few examples of the differences in programming transcriptomics analyses with tidy transcriptomics and the base R alternative. An important aspect is that tidy transcriptomics is complementary to standard workflows. Everything that works with SummarizedExperiment, SingleCellExperiment and Seurat works with their tidy representations.

Tidybulk and tidySummarizedExperiment

This example is taken from the workshop BioC2021.

As tidybulk smoothly integrates with ggplot2 and other tidyverse packages it can save on typing and make plots easier to generate. Compare the code for creating density plots with tidybulk versus standard base R below (standard code adapted from (Law et al. 2016)).

base R using edgeR

# Example code, no need to run
# Prepare data set
library(edgeR)
dgList <- SE2DGEList(airway)
group <- factor(dgList$samples$dex)
keep.exprs <- filterByExpr(dgList, group = group)
dgList <- dgList[keep.exprs, , keep.lib.sizes = FALSE]
nsamples <- ncol(dgList)
logcounts <- log2(dgList$counts)
# Setup graphics
col <- RColorBrewer::brewer.pal(nsamples, "Paired")
par(mfrow = c(1, 2))
# Plot raw counts
plot(density(logcounts[, 1]), col = col[1], lwd = 2, ylim = c(0, 0.26), las = 2, main = "", xlab = "")
title(main = "Counts")
for (i in 2:nsamples) {
  den <- density(logcounts[, i])
  lines(den$x, den$y, col = col[i], lwd = 2)
}
legend("topright", legend = dgList$samples$Run, text.col = col, bty = "n")
# Plot scaled counts
dgList_norm <- calcNormFactors(dgList)
lcpm_n <- cpm(dgList_norm, log = TRUE)
plot(density(lcpm_n[, 1]), col = col[1], lwd = 2, ylim = c(0, 0.26), las = 2, main = "", xlab = "")
title("Counts scaled")
for (i in 2:nsamples) {
  den <- density(lcpm_n[, i])
  lines(den$x, den$y, col = col[i], lwd = 2)
}
legend("topright", legend = dgList_norm$samples$Run, text.col = col, bty = "n")

tidybulk

# tidybulk
airway %>%
  keep_abundant(factor_of_interest = dex) %>%
  scale_abundance() %>%
  pivot_longer(cols = c("counts", "counts_scaled"), names_to = "source", values_to = "abundance") %>%
  ggplot(aes(x = abundance + 1, color = sample)) +
  geom_density() +
  facet_wrap(~source) +
  scale_x_log10() +
  custom_theme

SingleCellExperiment and tidySingleCellExperiment

suppressPackageStartupMessages(library(tidySingleCellExperiment))

These examples are taken from the article Interfacing Seurat with the R tidy universe.

Calculating a transcriptional signature.

Base

# Load data
pbmc_SCE = tidySingleCellExperiment::pbmc_small

signature_score_1 =
  pbmc_SCE[c("CD79A", "CD79B"),] %>%
  logcounts() %>%
  colSums() %>%
  scales::rescale(to = c(0,1))

signature_score_2 = 
  pbmc_SCE[c("CD3D", "CCR7"),] %>%
  logcounts() %>%
  colSums() %>%
  scales::rescale(to = c(0,1))

signature_score_1 - signature_score_2

tidySingleCellExperiment

# Load data
pbmc_SCE = tidySingleCellExperiment::pbmc_small

pbmc_SCE %>%
  join_features(c("CD79A", "CD79B", "CD3D", "CCR7"), assay = "logcounts", shape = "wide") %>%
  mutate(signature_score = 
           scales::rescale(CD79A + CD79B, to = c(0,1)) - 
           scales::rescale(CD3D + CCR7, to = c(0,1))
  ) %>%
  select(signature_score, everything())

Cell subsampling for mulisample balanced UMAP representation

Base

splits = 
  colnames(pbmc_SCE) %>%
  split(pbmc_SCE$file  )

min_size = splits %>%
  sapply(length) %>% min()

cell_subset = splits %>%
  lapply(function(x) sample(x, min_size)) %>%
  unlist()

pbmc_SCE[, cell_subset]

tidySingleCellExperiment

pbmc_SCE %>%
  add_count(file  , name = "tot_cells") %>%
  mutate(min_cells = min(tot_cells)) %>%
  group_by(file  ) %>%
  sample_n(min_cells)

tidyseurat

 detach(package:tidySingleCellExperiment, unload=TRUE)

Calculating a transcriptional signature.

Base

suppressPackageStartupMessages(library(Seurat))

# Load data
pbmc_Seurat = tidySingleCellExperiment::pbmc_small %>% as.Seurat() %>% NormalizeData()

signature_score_1 =
  pbmc_Seurat[c("CD79A", "CD79B"),] %>%
  Seurat::GetAssayData() %>%
  colSums() %>%
  scales::rescale(to = c(0,1))

signature_score_2 = 
  pbmc_Seurat[c("CD3D", "CCR7"),] %>%
  Seurat::GetAssayData() %>%
  colSums() %>%
  scales::rescale(to = c(0,1))

signature_score_1 - signature_score_2

tidyseurat

It is the same as for tidySingleCellExperiment

suppressPackageStartupMessages(library(tidyseurat))

pbmc_Seurat %>%
  join_features(c("CD79A", "CD79B", "CD3D", "CCR7"), shape = "wide") %>%
  mutate(signature_score = 
           scales::rescale(CD79A + CD79B, to = c(0,1)) - 
           scales::rescale(CD3D + CCR7, to = c(0,1))
  ) %>%
  select(signature_score, everything())

Cell subsampling for multisample balanced UMAP representation

Base

splits = 
  colnames(pbmc_Seurat) %>%
  split(pbmc_Seurat$file  )

min_size = splits %>%
  sapply(length) %>% min()

cell_subset = splits %>%
  lapply(function(x) sample(x, min_size)) %>%
  unlist()

pbmc_Seurat[, cell_subset]

tidyseurat

It is the same as for tidySingleCellExperiment

pbmc_Seurat %>%
  add_count(file  , name = "tot_cells") %>%
  mutate(min_cells = min(tot_cells)) %>%
  group_by(file  ) %>%
  sample_n(min_cells)

Learning material

We delivered a series of workshops on tidy transcriptomics, which tracks the progress of the ecosystem.

How to cite tidy transcriptomics

tidybulk

Mangiola, S., Molania, R., Dong, R. et al. tidybulk: an R tidy framework for modular transcriptomic data analysis. Genome Biol 22, 42 (2021). https://doi.org/10.1186/s13059-020-02233-7

tidyseurat

Stefano Mangiola, Maria A Doyle, Anthony T Papenfuss, Interfacing Seurat with the R tidy universe, Bioinformatics, 2021, https://doi.org/10.1093/bioinformatics/btab404

tidySummarizedExperiment and tidySingleCellExperiment

You can use tidyseurat citation as introduces the concepts of data abstraction.

{=html} <script src="https://utteranc.es/client.js" repo="stemangiola/stemangiola.github.io" issue-term="pathname" theme="github-light" crossorigin="anonymous" async> </script>

stemangiola/tidyTranscriptomics documentation built on May 20, 2024, 1:35 a.m.

rdrr.io home R language documentation Run R code online

CRAN packages Bioconductor packages R-Forge packages GitHub packages

Note that we can't provide technical support on individual packages. You should contact the package authors for that.

stemangiola/tidyTranscriptomics
The tidyomics blog

In stemangiola/tidyTranscriptomics: The tidyomics blog

What is tidy transcriptomics?

The tidy transcriptomics manifesto

Integration map

What data frameworks are and what they are not

Giving a consistent interface despite different data containers

Which analysis framework can interface with which data framework

Differences between tidy and base transcriptomics syntax

Tidybulk and tidySummarizedExperiment

SingleCellExperiment and tidySingleCellExperiment

tidyseurat

Learning material

How to cite tidy transcriptomics

tidybulk

tidyseurat

tidySummarizedExperiment and tidySingleCellExperiment

R Package Documentation

Browse R Packages

We want your feedback!

stemangiola/tidyTranscriptomics The tidyomics blog

In stemangiola/tidyTranscriptomics: The tidyomics blog

What is tidy transcriptomics?

The tidy transcriptomics manifesto

Integration map

What data frameworks are and what they are not

Giving a consistent interface despite different data containers

Which analysis framework can interface with which data framework

Differences between tidy and base transcriptomics syntax

Tidybulk and tidySummarizedExperiment

SingleCellExperiment and tidySingleCellExperiment

tidyseurat

Learning material

How to cite tidy transcriptomics

tidybulk

tidyseurat

tidySummarizedExperiment and tidySingleCellExperiment

R Package Documentation

Browse R Packages

We want your feedback!

stemangiola/tidyTranscriptomics
The tidyomics blog