knitr::opts_chunk$set( collapse = TRUE, comment = "#>" )
```{css, echo = FALSE}
.challenge { background-color: #fff4d4; } .challenge code { background-color: #fff4d4; }
# Workshop introduction ```r knitr::include_graphics("../inst/vignettes/tidybulk_logo.png")
Dr. Stefano Mangiola is currently a Postdoctoral researcher in the laboratory of Prof. Tony Papenfuss at the Walter and Eliza Hall Institute in Melbourne, Australia. His background spans from biotechnology to bioinformatics and biostatistics. His research focuses on prostate and breast tumour microenvironment, the development of statistical models for the analysis of RNA sequencing data, and data analysis and visualisation interfaces.
Dr. Maria Doyle is the Application and Training Specialist for Research Computing at the Peter MacCallum Cancer Centre in Melbourne, Australia. She has a PhD in Molecular Biology and currently works in bioinformatics and data science education and training. She is passionate about supporting researchers, reproducible research, open source and tidy data.
This tutorial will present how to perform analysis of single-cell and bulk RNA sequencing data following the tidy data paradigm. The tidy data paradigm provides a standard way to organise data values within a dataset, where each variable is a column, each observation is a row, and data is manipulated using an easy-to-understand vocabulary. Most importantly, the data structure remains consistent across manipulation and analysis functions.
This can be achieved with the integration of packages present in the R CRAN and Bioconductor ecosystem, including tidyseurat, tidySingleCellExperiment, tidybulk, tidyHeatmap and tidyverse. These packages are part of the tidytranscriptomics suite that introduces a tidy approach to RNA sequencing data representation and analysis.
Recommended Background Reading Introduction to R for Biologists
This workshop is a 4 hour session.
Guide
| Activity | Time | |---------------------------------------------------------|------| | Part 1 Bulk RNA-seq Core | | | Hands-on Demos + Exercises | 90m | | Differential gene expression | | | Cell type composition analysis | | | Break | 30m | | Part 2 Single-cell RNA-seq | | | Hands-on Demos + Exercises | 90m | | Single-cell analysis | | | Pseudobulk analysis | | | Q&A | 30m | | Total | 240m |
Format: Hands on demos, challenges plus Q&A
Interact: Zoom chat, Menti quiz for challenges
In exploring and analysing RNA sequencing data, there are a number of key concepts, such as filtering, scaling, dimensionality reduction, hypothesis testing, clustering and visualisation, that need to be understood. These concepts can be intuitively explained to new users, however, (i) the use of a heterogeneous vocabulary and jargon by methodologies/algorithms/packages, (ii) the complexity of data wrangling, and (iii) the coding burden, impede effective learning of the statistics and biology underlying an informed RNA sequencing analysis.
The tidytranscriptomics approach to RNA sequencing data analysis abstracts out the coding-related complexity and provides tools that use an intuitive and jargon-free vocabulary, enabling focus on the statistical and biological challenges.
This Google doc has all the links you need for this workshop. As it contains logins to the Cloud servers, it will be accessible during the workshop only.
First we have a few questions for you, the audience.
Please either open a tab in your browser or use your phone. Go to Menti (www.menti.com) and type the code given in the Google doc above.
Then please rate the following on a scale of 1-5 (1=no/none, 5=yes/a lot)
“The transcriptome is the set of all RNA transcripts, including coding and non-coding, in an individual or a population of cells”
Wikipedia
knitr::include_graphics("../inst/vignettes/transcriptomics.jpg")
knitr::include_graphics("../inst/vignettes/ScreenShot2.png")
knitr::include_graphics("../inst/vignettes/ScreenShot3.png")
knitr::include_graphics("../inst/vignettes/bulk_RNAseq_pipeline.png")
knitr::include_graphics("../inst/vignettes/bulk_vs_single.jpg")
Shalek and Benson, 2017
knitr::include_graphics("../inst/vignettes/single_cell_RNAseq_pipeline.png")
This workshop demonstrates how to perform analysis of RNA sequencing data following the tidy data paradigm [@wickham2014tidy]. The tidy data paradigm provides a standard way to organise data values within a dataset, where each variable is a column, each observation is a row, and data is manipulated using an easy-to-understand vocabulary. Most importantly, the data structure remains consistent across manipulation and analysis functions. For more information, see the R for Data Science chapter on tidy data here.
knitr::include_graphics("../inst/vignettes/tidydata_1.jpg")
The tidyverse is a collection of packages that can be used to tidy, manipulate and visualise data. We'll use many functions from the tidyverse in this workshop, such as filter
, select
, mutate
, pivot_longer
and ggplot
.
knitr::include_graphics("../inst/vignettes/tidyverse.png")
Easiest way to run this material. Only available during workshop. Many thanks to the Australian Research Data Commons (ARDC) for providing RStudio in the Australian Nectar Research Cloud and Andy Botting from ARDC for helping to set up.
tidytranscriptomics.Rmd
in ismb2021_tidytranscriptomcs/vignettes
folderWe will use the Cloud during the ISMB/ECCB workshop and this method is available if you want to run the material after the workshop. If you want to install on your own computer, see instructions here.
Alternatively, you can view the material at the workshop webpage here.
tidybulk, tidySummarizedExperiment, tidySingleCellExperiment and tidyseurat are part of the tidytranscriptomics suite that introduces a tidy approach to RNA sequencing data representation and analysis. The roadmap for the development of the tidytranscriptomics packages is shown in the figure below.
knitr::include_graphics("../inst/vignettes/roadmap.png")
knitr::include_graphics("../inst/vignettes/tidybulk_logo.png")
Some of the material in Part 1 was adapted from an R RNA sequencing workshop first run here. The use of the airway dataset was inspired by the DESeq2 vignette.
Measuring gene expression on a genome-wide scale has become common practice over the last two decades or so, with microarrays predominantly used pre-2008. With the advent of next-generation sequencing technology in 2008, an increasing number of scientists use this technology to measure and understand changes in gene expression in often complex systems. As sequencing costs have decreased, using RNA sequencing to simultaneously measure the expression of tens of thousands of genes for multiple samples has never been easier. The cost of these experiments has now moved from generating the data to storing and analysing it.
There are many steps involved in analysing an RNA sequencing dataset. Sequenced reads are aligned to a reference genome, then the number of reads mapped to each gene can be counted. This results in a table of counts, which is what we perform statistical analyses on in R. While mapping and counting are important and necessary tasks, today, we will be starting from the count data and showing how differential expression analysis can be performed in a friendly way using the Bioconductor package, tidybulk.
First, let's load all the packages we will need to analyse the data.
Note: you should load the tidybulk and tidySummarizedExperiment libraries after the tidyverse core packages for best integration.
# dataset library(airway) # tidyverse core packages library(tibble) library(dplyr) library(tidyr) library(readr) library(stringr) library(ggplot2) library(purrr) # tidyverse-friendly packages library(plotly) library(ggrepel) library(GGally) library(tidyHeatmap) library(tidybulk) # library(tidySummarizedExperiment) # we'll load this below to show what it can do
Plot settings. Set the colours and theme we will use for our plots.
# Use colourblind-friendly colours friendly_cols <- dittoSeq::dittoColors() # Set theme custom_theme <- list( scale_fill_manual(values = friendly_cols), scale_color_manual(values = friendly_cols), theme_bw() + theme( panel.border = element_blank(), axis.line = element_line(), panel.grid.major = element_line(size = 0.2), panel.grid.minor = element_line(size = 0.1), text = element_text(size = 12), legend.position = "bottom", strip.background = element_blank(), axis.title.x = element_text(margin = margin(t = 10, r = 10, b = 10, l = 10)), axis.title.y = element_text(margin = margin(t = 10, r = 10, b = 10, l = 10)), axis.text.x = element_text(angle = 30, hjust = 1, vjust = 1) ) )
Here we will perform our analysis using the data from the airway package. The airway data comes from the paper by [@himes2014rna]; and it includes 8 samples from human airway smooth muscle cells, from 4 cell lines. For each cell line treated (with dexamethasone) and untreated (negative control) a sample has undergone RNA sequencing and gene counts have been generated.
# load airway RNA sequencing data data(airway) # take a look airway
The data in the airway package is a Bioconductor SummarizedExperiment object. For more information see here.
The tidySummarizedExperiment
package enables any SummarizedExperiment object to be displayed and manipulated according to tidy data principles, without affecting any SummarizedExperiment behaviour.
If we load the tidySummarizedExperiment package and then view the airway data it now displays as a tibble. A tibble is the tidyverse table format.
# load tidySummarizedExperiment package library(tidySummarizedExperiment) # take a look airway
Now we can more easily see the data. The airway object contains information about genes and samples, the first column has the Ensembl gene identifier, the second column has the sample identifier and the third column has the gene transcription abundance. The abundance is the number of reads aligning to the gene in each experimental sample. The remaining columns include sample information. The dex column tells us whether the samples are treated or untreated and the cell column tells us what cell line they are from.
We can still interact with the tidy SummarizedExperiment object using commands for SummarizedExperiment objects.
assays(airway)
And now we can also use tidyverse commands, such as filter
, select
, group_by
, summarise
and mutate
to explore the tidy SummarizedExperiment object. Some examples are shown below and more can be seen at the tidySummarizedExperiment website here. We can also use the tidyverse pipe %>%
. This 'pipes' the output from the command on the left into the command on the right/below. Using the pipe is not essential but it reduces the amount of code we need to write when we have multiple steps. It also can make the steps clearer and easier to see. For more details on the pipe see here.
We can use filter
to choose rows, for example, to see just the rows for the treated samples.
airway %>% filter(dex == "trt")
We can use select
to choose columns, for example, to see the sample, cell line and treatment columns.
airway %>% select(sample, cell, dex)
We can combine group_by
and summarise
to calculate the total counts for each sample.
airway %>% group_by(sample) %>% summarise(total_counts=sum(counts))
We can use mutate
to create a column. For example, we could create a new sample_name column that contains shorter sample names. We can remove the SRR1039 prefix that's present in all of the samples, as shorter names can fit better in some of the plots we will create. We can use mutate
together with str_replace
to remove the SRR1039 string from the sample column.
airway %>% mutate(sample_name=str_remove(sample, "SRR1039")) %>% # select columns to view select(sample, sample_name)
We'll set up the airway data for our RNA sequencing analysis. We'll create a column with shorter sample names and a column with gene symbols. We can get the gene symbols for these Ensembl gene ids using the Bioconductor annotation package for human, org.Hs.eg.db
.
# setup data workflow counts <- airway %>% mutate(sample_name = str_remove(sample, "SRR1039")) %>% mutate(symbol = AnnotationDbi::mapIds(org.Hs.eg.db::org.Hs.eg.db, keys = feature, keytype = "ENSEMBL", column = "SYMBOL", multiVals = "first" )) # take a look counts
Genes with very low counts across all libraries provide little evidence for differential expression and they can interfere with some of the statistical approximations that are used later in the pipeline. They also add to the multiple testing burden when estimating false discovery rates, reducing the power to detect differentially expressed genes. These genes should be filtered out prior to further analysis.
We can perform the filtering using tidybulk keep_abundant
or identify_abundant
. These functions can use the edgeR filterByExpr
function described in [@law2016rna] to automatically identify the genes with adequate abundance for differential expression testing. By default, this will keep genes with \~10 counts in a minimum number of samples, the number of the samples in the smallest group. In this dataset, the smallest group size is four (as we have four dex-treated samples versus four untreated). Alternatively, we could use identify_abundant
to identify which genes are abundant or not (TRUE/FALSE), rather than just keeping the abundant ones.
# Filtering counts counts_filtered <- counts %>% keep_abundant(factor_of_interest = dex) # take a look counts_filtered
After running keep_abundant
we have a column called .abundant
containing TRUE (identify_abundant
would have TRUE/FALSE).
Scaling of counts (normalisation) is performed to eliminate uninteresting differences between samples due to sequencing depth or composition. A more detailed explanation can be found here. In the tidybulk package, the function scale_abundance
generates scaled counts, with scaling factors calculated on abundant (filtered) transcripts and applied to all transcripts. We can choose from different normalisation methods. Here we will use the default edgeR's trimmed mean of M values (TMM), [@robinson2010scaling]. TMM normalisation (and most scaling normalisation methods) scale relative to one sample.
# Scaling counts counts_scaled <- counts_filtered %>% scale_abundance() # take a look counts_scaled
After we run, we should see some columns have been added at the end. The counts_scaled
column contains the scaled counts.
We can visualise the difference in abundance densities before and after scaling. As tidybulk output is compatible with tidyverse, we can simply pipe it into standard tidyverse functions such as filter
, pivot_longer
and ggplot
. We can also take advantage of ggplot's facet_wrap
to easily create multiple plots.
counts_scaled %>% # Reshaping pivot_longer(cols = c("counts", "counts_scaled"), names_to = "source", values_to = "abundance") %>% # Plotting ggplot(aes(x = abundance + 1, color = sample_name)) + geom_density() + facet_wrap(~source) + scale_x_log10() + custom_theme
In this dataset, the distributions of the counts are not very different to each other before scaling, but scaling does make the distributions more similar. If we saw a sample with a very different distribution, we may need to investigate it.
As tidybulk smoothly integrates with ggplot2 and other tidyverse packages it can save on typing and make plots easier to generate. Compare the code for creating density plots with tidybulk versus standard base R below (standard code adapted from [@law2016rna]).
tidybulk
# tidybulk airway %>% keep_abundant(factor_of_interest = dex) %>% scale_abundance() %>% pivot_longer(cols = c("counts", "counts_scaled"), names_to = "source", values_to = "abundance") %>% ggplot(aes(x = abundance + 1, color = sample)) + geom_density() + facet_wrap(~source) + scale_x_log10() + custom_theme
base R using edgeR
# Example code, no need to run # Prepare data set library(edgeR) dgList <- SE2DGEList(airway) group <- factor(dgList$samples$dex) keep.exprs <- filterByExpr(dgList, group = group) dgList <- dgList[keep.exprs, , keep.lib.sizes = FALSE] nsamples <- ncol(dgList) logcounts <- log2(dgList$counts) # Setup graphics col <- RColorBrewer::brewer.pal(nsamples, "Paired") par(mfrow = c(1, 2)) # Plot raw counts plot(density(logcounts[, 1]), col = col[1], lwd = 2, ylim = c(0, 0.26), las = 2, main = "", xlab = "") title(main = "Counts") for (i in 2:nsamples) { den <- density(logcounts[, i]) lines(den$x, den$y, col = col[i], lwd = 2) } legend("topright", legend = dgList$samples$Run, text.col = col, bty = "n") # Plot scaled counts dgList_norm <- calcNormFactors(dgList) lcpm_n <- cpm(dgList_norm, log = TRUE) plot(density(lcpm_n[, 1]), col = col[1], lwd = 2, ylim = c(0, 0.26), las = 2, main = "", xlab = "") title("Counts scaled") for (i in 2:nsamples) { den <- density(lcpm_n[, i]) lines(den$x, den$y, col = col[i], lwd = 2) } legend("topright", legend = dgList_norm$samples$Run, text.col = col, bty = "n")
By far, one of the most important plots we make when we analyse RNA sequencing data are principal-component analysis (PCA) or multi-dimensional scaling (MDS) plots. We reduce the dimensions of the data to identify the greatest sources of variation in the data. A principal components analysis is an example of an unsupervised analysis, where we don't need to specify the groups. If your experiment is well controlled and has worked well, what we hope to see is that the greatest sources of variation in the data are the treatments/groups we are interested in. It is also an incredibly useful tool for quality control and checking for outliers. We can use the reduce_dimensions
function to calculate the dimensions.
# Get principal components counts_scal_PCA <- counts_scaled %>% reduce_dimensions(method = "PCA")
```{challenge_1 class.source="challenge"} Challenge: What fraction of variance is explained by PC3? Select one of the multiple choice options in www.menti.com (code in Google doc).
The Principal component (PC) information is joined as columns to the counts object. ```r # Take a look counts_scal_PCA
For plotting, we can select just the sample-wise information with pivot_sample
.
# take a look counts_scal_PCA %>% pivot_sample()
We can now plot the reduced dimensions.
# PCA plot counts_scal_PCA %>% pivot_sample() %>% ggplot(aes(x = PC1, y = PC2, colour = dex, shape = cell)) + geom_point() + geom_text_repel(aes(label = sample_name), show.legend = FALSE) + custom_theme
The samples group by treatment on PC1 which is what we hope to see. PC2 separates the N080611 cell line from the other samples, indicating a greater difference between that cell line and the others.
An alternative to principal component analysis for examining relationships between samples is using hierarchical clustering. Heatmaps are a nice visualisation to examine hierarchical clustering of your samples. tidybulk has a simple function we can use, keep_variable
, to extract the most variable genes which we can then plot with tidyHeatmap.
counts_scal_PCA %>% # extract 500 most variable genes keep_variable(.abundance = counts_scaled, top = 500) %>% as_tibble() %>% # create heatmap heatmap( .column = sample_name, .row = feature, .value = counts_scaled, transform = log1p ) %>% add_tile(dex) %>% add_tile(cell)
In the heatmap, we can see the samples cluster into two groups, treated and untreated, for three of the cell lines, and the cell line (N080611) again is further away from the others.
Tidybulk enables a simplified way of generating a clustered heatmap of variable genes. Compare the code below for tidybulk versus a base R method.
base R using edgeR
# Example code, no need to run library(edgeR) dgList <- SE2DGEList(airway) group <- factor(dgList$samples$dex) keep.exprs <- filterByExpr(dgList, group = group) dgList <- dgList[keep.exprs, , keep.lib.sizes = FALSE] dgList <- calcNormFactors(dgList) logcounts <- cpm(dgList, log = TRUE) var_genes <- apply(logcounts, 1, var) select_var <- names(sort(var_genes, decreasing = TRUE))[1:500] highly_variable_lcpm <- logcounts[select_var, ] colours <- c("#440154FF", "#21908CFF", "#fefada") col.group <- c("red", "grey")[group] gplots::heatmap.2(highly_variable_lcpm, col = colours, trace = "none", ColSideColors = col.group, scale = "row")
tidybulk integrates several popular methods for differential transcript abundance testing: the edgeR quasi-likelihood [@chen2016reads] (tidybulk default method), edgeR likelihood ratio [@mccarthy2012differential], limma-voom [@law2014voom] and DESeq2 [@love2014moderated]. A common question researchers have is which method to choose. With tidybulk, we can easily run multiple methods and compare.
We give test_differential_abundance
our tidybulk counts object and a formula, specifying the column that contains our groups to be compared. If all our samples were from the same cell line, and there were no additional factors contributing variance, such as batch differences, we could use the formula ~ dex
. However, each treated and untreated sample is from a different cell line, so we add the cell line as an additional factor ~ dex + cell
.
de_all <- counts_scal_PCA %>% # edgeR QLT test_differential_abundance( ~ dex + cell, method = "edgeR_quasi_likelihood", prefix = "edgerQLT_" ) %>% # edgeR LRT test_differential_abundance( ~ dex + cell, method = "edgeR_likelihood_ratio", prefix = "edgerLR_" ) %>% # limma-voom test_differential_abundance( ~ dex + cell, method = "limma_voom", prefix = "voom_" ) %>% # DESeq2 test_differential_abundance( ~ dex + cell, method = "deseq2", prefix = "deseq2_" ) # take a look de_all
This outputs the columns from each method such as log-fold change (logFC), false-discovery rate (FDR) and probability value (p-value). logFC is log2(treated/untreated).
We can visually compare the significance for all methods. We will notice that there is some difference between the methods.
de_all %>% pivot_transcript() %>% select(edgerQLT_PValue, edgerLR_PValue, voom_P.Value, deseq2_pvalue, feature) %>% ggpairs(1:4)
```{challenge_2 class.source="challenge"} Challenge: Which method detects the largest no. of differentially abundant transcripts, p value adjusted for multiple testing < 0.05 (FDR, adj.P.Val, padj)? Select one of the multiple choice options in www.menti.com (code in Google doc).
### Single method If we just wanted to run one differential testing method we could do that. The default method is edgeR quasi-likelihood. ```r counts_de <- counts_scal_PCA %>% test_differential_abundance(~ dex + cell)
Tidybulk enables a simplified way of performing an RNA sequencing differential expression analysis (with the benefit of smoothly integrating with ggplot2 and other tidyverse packages). Compare the code for a tidybulk edgeR analysis versus standard edgeR below.
standard edgeR
# Example code, no need to run library(edgeR) dgList <- SE2DGEList(airway) group <- factor(dgList$samples$dex) keep.exprs <- filterByExpr(dgList, group = group) dgList <- dgList[keep.exprs, , keep.lib.sizes = FALSE] dgList <- calcNormFactors(dgList) cell <- factor(dgList$samples$cell) design <- model.matrix(~ group + cell) dgList <- estimateDisp(dgList, design) fit <- glmQLFit(dgList, design) qlf <- glmQLFTest(fit, coef=2)
Volcano plots are a useful genome-wide tool for checking that the analysis looks good. Volcano plots enable us to visualise the significance of change (p-value) versus the fold change (logFC). Highly significant genes are towards the top of the plot. We can also colour significant genes, e.g. genes with false-discovery rate \< 0.05. To decide which genes are differentially expressed, we usually take a cut-off of 0.05 on the FDR (or adjusted P-value), NOT the raw p-value. This is because we are testing many genes (multiple testing), and the chances of finding differentially expressed genes are very high when you do that many tests. Hence we need to control the false discovery rate, the adjusted p-value column in the results table. That is, if 100 genes are significant at a 5% false discovery rate, we are willing to accept that 5 will be false positives.
# volcano plot, minimal counts_de %>% ggplot(aes(x = logFC, y = PValue, colour = FDR < 0.05)) + geom_point() + scale_y_continuous(trans = "log10_reverse") + custom_theme
We'll extract the symbols for a few top genes (by P value) to use in a more informative volcano plot, integrating some of the packages in tidyverse.
topgenes_symbols <- counts_de %>% pivot_transcript() %>% arrange(PValue) %>% head(6) %>% pull(symbol) topgenes_symbols
counts_de %>% pivot_transcript() %>% # Subset data mutate(significant = FDR < 0.05 & abs(logFC) >= 2) %>% mutate(symbol = ifelse(symbol %in% topgenes_symbols, as.character(symbol), "")) %>% # Plot ggplot(aes(x = logFC, y = PValue, label = symbol)) + geom_point(aes(color = significant, size = significant, alpha = significant)) + geom_text_repel() + # Custom scales custom_theme + scale_y_continuous(trans = "log10_reverse") + scale_color_manual(values = c("black", "#e11f28")) + scale_size_discrete(range = c(0, 2))
Before following up on the differentially expressed genes with further lab work, it is also recommended to have a look at the expression levels of the individual samples for the genes of interest. We can use stripcharts to do this. These will help show if expression is consistent amongst replicates in the groups.
With stripcharts we can see if replicates tend to group together and how the expression compares to the other groups. We'll also add a box plot to show the distribution. Tidyverse faceting makes it easy to create a plot for each gene.
strip_chart <- counts_scaled %>% # extract counts for top differentially expressed genes filter(symbol %in% topgenes_symbols) %>% # make faceted stripchart ggplot(aes(x = dex, y = counts_scaled + 1, fill = dex, label = sample)) + geom_boxplot() + geom_jitter() + facet_wrap(~symbol) + scale_y_log10() + custom_theme strip_chart
A really nice feature of using tidyverse and ggplot2 is that we can make interactive plots quite easily using the plotly package. This can be very useful for exploring what genes or samples are in the plots. We can make interactive plots directly from our ggplot2 object (strip_chart). Having label
in the aes
is useful to visualise the identifier of the data point (here the sample id) or other variables when we hover over the plot.
We can also specify which parameters from the aes
we want to show up when we hover over the plot with tooltip
.
strip_chart %>% ggplotly(tooltip = c("label", "y"))
Tidybulk provides a handy function called get_bibliography
that keeps track of the references for the methods used in your tidybulk workflow. The references are in BibTeX format and can be imported into your reference manager.
get_bibliography(counts_de)
If we are sequencing tissue samples, we may want to know what cell types are present and if there are differences in expression between them. tidybulk
has a deconvolve_cellularity
function that can help us do this.
For this example, we will use a subset of the breast cancer dataset from The Cancer Genome Atlas (TCGA).
BRCA <- ismb2021tidytranscriptomics::BRCA BRCA
With tidybulk, we can easily infer the proportions of cell types within a tissue using one of several published methods (Cibersort [@newman2015robust], EPIC [@racle2017simultaneous] and llsr [@abbas2009deconvolution]). Here we will use Cibersort which provides a default signature called LM22 to define the cell types. LM22 contains 547 genes that identify 22 human immune cell types.
BRCA_cell_type <- BRCA %>% deconvolve_cellularity(prefix="cibersort: ", cores = 1) BRCA_cell_type
Cell type proportions are added to the tibble as new columns. The prefix makes it easy to reshape the data frame if needed, for visualisation or further analyses.
BRCA_cell_type_long <- BRCA_cell_type %>% pivot_sample() %>% # Reshape pivot_longer( contains("cibersort"), names_prefix = "cibersort: ", names_to = "cell_type", values_to = "proportion" ) BRCA_cell_type_long
We can plot the proportions of immune cell types for each patient.
BRCA_cell_type_long %>% # Plot proportions ggplot(aes(x = sample, y = proportion, fill = cell_type)) + geom_bar(stat = "identity") + custom_theme
```{challenge_3 class.source="challenge"} Challenge: What is the most abundant cell type overall in BRCA samples? Select one of the multiple choice options in www.menti.com (code in Google doc).
## Key Points - Bulk RNA sequencing data can be represented and analysed in a 'tidy' way using tidySummarizedExperiment, tidybulk and the tidyverse. - tidySummarizedExperiment enables us to visualise and manipulate a Bioconductor SummarizedExperiment object as if it were in tidy data format. - Some of the key steps in an RNA sequencing analysis are filtering lowly abundant transcripts, adjusting for differences in sequencing depth and composition, testing for differential expression, and visualising the data, which can all be performed in a tidy way with tidybulk. - `tidybulk` allows streamlined multi-method analyses - `tidybulk` allow easy analyses of cell-type composition # Part 2 Single-cell RNA sequencing with tidySingleCellExperiment A typical single-cell RNA sequencing workflow is shown in the [Workshop Introduction](https://tidytranscriptomics-workshops.github.io/ismb2021_tidytranscriptomics/articles/tidytranscriptomics.html#differences-between-bulk-and-single-cell-rna-sequencing-1) section. We don't have time in this workshop to go into depth on each step but you can read more about single-cell RNA sequencing workflows in the online book [Orchestrating Single-Cell Analysis with Bioconductor](http://bioconductor.org/books/release/OSCA/index.html). In Part 1, we showed how we can study the cell-type composition of a biological sample using bulk RNA sequencing. Single-cell sequencing enables a more direct estimation of cell-type composition and gives a greater resolution. For bulk RNA sequencing, we need to infer the cell types using the abundance of transcripts in the whole sample. With single-cell RNA sequencing, we can directly measure the transcripts in each cell and then classify the cells into cell types. ```r # load packages library(ggplot2) library(purrr) library(scater) library(scran) library(igraph) library(batchelor) library(SingleR) library(scuttle) library(EnsDb.Hsapiens.v86) library(celldex) library(ggbeeswarm) library(tidySingleCellExperiment) # set colours friendly_cols <- dittoSeq::dittoColors()
The single-cell RNA sequencing data used here is 3000 cells in total, subsetted from 20 samples from 10 peripheral blood mononuclear cell (PBMC) datasets. The datasets are from GSE115189/SRR7244582 [@Freytag2018], SRR11038995 [@Cai2020, SCP345 (singlecell.broadinstitute.org), SCP424 [@Ding2020], SCP591 [@Karagiannis2020] and 10x-derived 6K and 8K datasets (support.10xgenomics.com/). The data is in SingleCellExperiment format. SingleCellExperiment is a very popular container of single-cell RNA sequencing data.
Similar to what we saw with tidySummarizedExperiment, tidySingleCellExperiment
package enables any SingleCellExperiment object to be displayed and manipulated according to tidy data principles without affecting any SingleCellExperiment behaviour.
# load pbmc single cell RNA sequencing data pbmc <- ismb2021tidytranscriptomics::pbmc # take a look pbmc
This tidy SingleCellExperiment object can be interacted with using SingleCellExperiment commands such as assayNames
.
assayNames(pbmc)
We can also interact with our object as we do with any tidyverse tibble.
We can use tidyverse commands, such as filter
, select
and mutate
to explore the tidySingleCellExperiment object. Some examples are shown below and more can be seen at the tidySingleCellExperiment website here. We can also use the tidyverse pipe %>%
. This 'pipes' the output from the command on the left into the command on the right/below. Using the pipe is not essential but it reduces the amount of code we need to write when we have multiple steps. It also can make the steps clearer and easier to see. For more details on the pipe see here.
We can use filter
to choose rows, for example, to see just the rows for the cells in G1 cell-cycle stage.
pbmc %>% filter(ident == "G1")
We can use select
to choose columns, for example, to see the sample, cell, total cellular RNA
pbmc %>% select(cell, nCount_RNA , ident)
We can use mutate
to create a column. For example, we could create a new ident_l
column that contains a lower-case varsion of ident
.
pbmc %>% mutate(ident_l=tolower(ident)) %>% # select columns to view select(ident, ident_l)
We can join datasets as if they were tibbles
pbmc %>% bind_rows(pbmc)
In this case, we want to polish an annotation column. We will extract the sample, dataset and group information from the file name column into separate columns.
# First take a look at the file column pbmc %>% select(file)
# Create columns for sample, dataset and groups pbmc <- pbmc %>% # Extract sample and group extract(file, "sample", "../data/([a-zA-Z0-9_]+)/outs.+", remove = FALSE) %>% # Extract data source extract(file, c("dataset", "groups"), "../data/([a-zA-Z0-9_]+)_([0-9])/outs.+") # Take a look pbmc %>% select(sample, dataset, groups)
A key quality control step performed in single-cell analyses is the assessment of the proportion of mitochondrial transcripts. A high mitochondrial count indicates cell death, and it is useful for filtering cells in a dying state.
We'll first show the mitochondrial analysis for one of the 10 datasets.
one_dataset <- pbmc %>% filter(dataset =="GSE115189")
We get the chromosomal location for each gene in the dataset so we can identify the mitochondrial genes. We'll get a warning that some of the ids don't find a match, but it should be just a small proportion.
location <- mapIds( EnsDb.Hsapiens.v86, keys = rownames(one_dataset), column = "SEQNAME", keytype = "SYMBOL" )
Next we calculate the mitchondrial content for each cell in the dataset.
mito_info_one_dataset <- perCellQCMetrics(one_dataset, subsets = list(Mito = which(location == "MT"))) mito_info_one_dataset
We then label the cells with high mitochondrial content as outliers.
mito_info_one_dataset <- mito_info_one_dataset %>% # Converting to tibble as_tibble(rownames = "cell") %>% # Label cells with high mitochondrial content mutate(high_mitochondrion = isOutlier(subsets_Mito_percent, type = "higher")) mito_info_one_dataset
Finally, we join the mitochondrial information back to the original data so we will be able to filter out the cells with high mitochondrial content.
mito_info_one_dataset <- one_dataset %>% left_join(mito_info_one_dataset, by = "cell") mito_info_one_dataset
The steps above perform the mitochondrial QC analysis for one dataset. To analyse all 10 datasets, we could repeat the steps for each dataset (a lot of steps!) or create a function and apply it to each dataset. However, a more efficient way is to use nesting. We can nest our data by dataset.
A powerful tool tidySingleCellExperiment enables us to use with our single-cell data is tidyverse's nest. Nesting allows us to easily perform independent analyses on subsets of the data. For example, our data contains 10 single-cell datasets from different sources, we can nest by dataset and analyse the mitochondrial content for each.
To demonstrate the power of nesting
knitr::include_graphics("../inst/vignettes/nesting.png")
As a simple example of nesting. Imagine we had two groups, a and b that contained some values.
my_tibble <- tibble( label = c("a","a","a","a","a","b","b","b","b","b" ), value = 1:10 ) my_tibble
We could nest this data to make tables containing the values for each of our a and b groups. With nest
, data=
specifies the columns that will be in the tables and creates a list of these smaller tables.
# Nest my_tibble_nested = my_tibble %>% nest(data = value) my_tibble_nested
We can then iterate over these smaller tables. map
is often used in combination with nest
to perform commands on nested data. There are a few map commands that you can choose from depending on what you want to return. For example, map_dbl
will return a double (floating point number) and we can use it to calculate the average value for each group. You can use ?map
to see the options. The map syntax is map(.x, .f)
where .x
is the table to be iterated over and .f
is the function. So in this example, .x
is data
and .f
is ~ mean(.x$value)
. We get the mean of the value column for each of the nested tables in data.
# Summarise my_tibble_nested %>% mutate(average = map_dbl(data, ~ mean(.x$value)))
We can apply multiple commands to the nested data using the %>%
. We can use unnest
if we want to get back our large (unnested) table to view the results.
# Summarise + filter my_tibble_nested %>% mutate(average = map_dbl(data, ~ mean(.x$value))) %>% filter(average == max(average)) %>% unnest(data)
We can use map2
when we have two arguments that we are iterating over. In the example below we have the nested tables (data) and the average for each (average). The map2 syntax is map2(.x, .y, .f)
.
# Summarise + update my_tibble_nested %>% mutate(average = map_dbl(data, ~ mean(.x$value))) %>% mutate(data = map2(data, average, ~ .x - .y )) %>% unnest(data)
We will created nested tables for our 10 datasets.
pbmc_nested <- pbmc %>% nest(data = -dataset) pbmc_nested
```{challenge_4 class.source="challenge"} Challenge: Using the nest examples above, can you create a column that includes the mitochondrial information? This one is not a Menti quiz question.
We want to obtain this table
dataset data mitochondrion_info
1 GSE115189
We will create a table (tibble) with the mitochondrial QC information for each dataset. ```r mito_info_all_datasets = ismb2021tidytranscriptomics::mito_info_all_datasets
As before, we join the mitochondrial information to the original data.
# Join the mitochondrial information to the original SingleCellExperiment data mito_info_all_datasets <- mito_info_all_datasets %>% mutate(data = map2( data, mitochondrion_info, ~ left_join(.x, .y, by = "cell") )) %>% # Remove the separate mitochondrial information table select(-mitochondrion_info) mito_info_all_datasets
Remove the nesting to get back one table containing all datasets.
pbmc <- mito_info_all_datasets %>% unnest(data) pbmc
Nesting has allowed us to perform the QC on each dataset in an efficient way.
We can use tidyverse to reshape the data and create beeswarm plots to visualise the mitochondrial content.
pbmc %>% # Reshaping pivot_longer(c(detected, sum, subsets_Mito_percent)) %>% ggplot(aes( x = sample, y = value, color = high_mitochondrion, alpha = high_mitochondrion, size = high_mitochondrion )) + # Plotting geom_quasirandom() + facet_wrap(~name, scale = "free_y") + # Customisation scale_color_manual(values = c("black", "#e11f28")) + scale_size_discrete(range = c(0, 2)) + theme_bw() + theme(axis.text.x = element_text(angle = 50, hjust = 1, vjust = 1))
In the faceted plot, "detected" is the number of genes in each of the 10 datasets, "sum" is the total counts.
We then proceed to filter out cells with high mitochondrial content.
pbmc <- pbmc %>% filter(!high_mitochondrion)
As we are working with multiple datasets, we need to integrate them and adjust for technical variability between them. Here we'll nest by dataset (batch), normalise within each batch with multiBatchNorm
and correct for batch effects with fastMNN
.
# Scaling within each dataset pbmc <- pbmc %>% # Define batch nest(data = -dataset) %>% # Normalisation mutate(data = multiBatchNorm(data)) %>% # Integration pull(data) %>% fastMNN() %>% # Join old information left_join(pbmc %>% as_tibble())
We proceed with identifying cell clusters.
# Assign clusters to the 'colLabels' # of the SingleCellExperiment object colLabels(pbmc) <- # from SingleCellExperiment pbmc %>% buildSNNGraph(use.dimred="corrected") %>% # from scran - shared nearest neighbour cluster_walktrap() %$% # from igraph membership %>% as.factor() # Reorder columns pbmc %>% select(label, everything())
Thanks to tidySingleCellExperiment we can interrogate the object with tidyverse commands and use count
to count the number of cells in each cluster.
pbmc %>% count(label)
Besides PCA which is a linear dimensionality reduction, we can apply neighbour aware methods such as UMAP, to better define locally similar cells.
# Calculate UMAP with scater pbmc %>% runUMAP(ncomponents = 2, dimred="corrected") # from scater
```{challenge_5 class.source="challenge"} Challenge: Without creating any variable, check if the variability of the 1st UMAP dimension when calculating 2 components (ncomponents = 2) is equal/more/less than when calculating 3 components. Select one of the multiple choice options in www.menti.com (code in Google doc).
**Tip: your can use as_tibble() to convert the tibble abstraction to a simple (and light) cell-wise tibble** We can calculate the first 3 UMAP dimensions using the scater framework. ```r # Calculate UMAP with scater pbmc <- pbmc %>% runUMAP(ncomponents = 3, dimred="corrected") # from scater
And we can plot the clusters as a 3D plot using plotly. This time we are colouring by estimated cluster labels to visually check the cluster labels.
pbmc %>% plot_ly( x = ~`UMAP1`, y = ~`UMAP2`, z = ~`UMAP3`, colors = friendly_cols[1:10], color = ~label, size=0.5 )
knitr::include_graphics("../inst/vignettes/plotly_2.png")
We can identify cluster markers (genes). As example, we are selecting the top 10 for each cluster. We can then plot a heatmap of those gene markers across cells.
# Identify top 10 markers per cluster marker_genes <- pbmc %>% findMarkers(groups=pbmc$label, assay.type = "reconstructed") %>% # scran as.list() %>% map(~ head(.x, 10) %>% rownames()) %>% unlist() # Plot heatmap pbmc %>% plotHeatmap( # from scater features=marker_genes, columns=order(pbmc$label), colour_columns_by=c("label"), exprs_values = "reconstructed" )
We can infer cell type identities (e.g. T cell) using SingleR [@aran2019reference] and manipulate the output using tidyverse. SingleR accepts any log-normalised transcript abundance matrix
# Reference cell types blueprint <- BlueprintEncodeData() cell_type <- # extracting counts from SingleCellExperiment object assays(pbmc)$reconstructed %>% # SingleR SingleR( ref = blueprint, labels = blueprint$label.main, clusters = pbmc %>% pull(label) ) %>% # Formatting results as.data.frame() %>% as_tibble(rownames = "label") %>% select(label, first.labels) cell_type
We join the cell type information to our pbmc data.
# Join cell type info pbmc <- pbmc %>% left_join(cell_type, by = "label") # Reorder columns pbmc %>% select(cell, first.labels, everything())
```{challenge_6 class.source="challenge"} Challenge: Which cell type (first.label) has the largest no. of cells?
## Pseudobulk analyses It is sometime useful to aggregate cell-wise transcript abundance into pseudobulk samples. It is possible to explore data and perform hypothesis testing with tools and data-source that we are more familiar with. For example, we can use edgeR in tidybulk to perform differential expression testing. For more details on pseudobulk analysis see [here](https://hbctraining.github.io/scRNA-seq/lessons/pseudobulk_DESeq2_scrnaseq.html). ### Data exploration using pseudobulk samples To do this, we will use a helper function called `aggregate_cells`, available in this workshop package, to create a group for each sample. ```r # Aggregate pbmc_bulk <- ismb2021tidytranscriptomics::pbmc %>% ismb2021tidytranscriptomics::aggregate_cells(file) pbmc_bulk
Once we have aggregated the single cells for each sample, we can use the bulk RNA sequencing methods available through tidybulk.
pbmc_bulk %>% # Tidybulk operations tidybulk::identify_abundant() %>% tidybulk::scale_abundance()
Seurat is a very popular analysis toolkit for single cell RNA sequencing data [@butler2018integrating; @stuart2019comprehensive] .
tidyseurat provides a bridge between the Seurat single-cell package [@butler2018integrating; @stuart2019comprehensive] and the tidyverse [@wickham2019welcome]. It creates an invisible layer that enables viewing the Seurat object as a tidyverse tibble, and provides Seurat-compatible dplyr, tidyr, ggplot and plotly functions.
pbmc_seurat <- ismb2021tidytranscriptomics::pbmc_seurat
library(tidyseurat)
pbmc_seurat
It can be interacted with using Seurat commands such as Assays
.
Assays(pbmc_seurat)
We can also interact with our object as we do with any tidyverse tibble.
We can interact with this object in the same way we showed for tidySingleCellExperiment
, through the tidy paradigm.
# Filter pbmc_seurat %>% filter(groups == "g1") # Select pbmc_seurat %>% select(cell, nCount_RNA , groups) # Mutate pbmc_seurat %>% mutate(groups_l=tolower(groups)) %>% # select columns to view select(groups, groups_l) # Bind datasets pbmc_seurat %>% bind_rows(pbmc_seurat)
tidySingleCellExperiment
is an invisible layer that operates on a SingleCellExperiment
object and enables us to visualise and manipulate data as if it were a tidy data frame.tidySingleCellExperiment
object is a SingleCellExperiment object
so it can be used with any SingleCellExperiment
compatible methodtidyseurat
, similarly, enables us to visualise and manipulate a Seurat object as if it were a tidy data frame and can also be used with any Seurat
compatible methodWe and the ISMB Tutorial organisers would be very grateful if you could please complete the ISMB 2021 Tutorial Feedback survey so we can gather feedback on the effectiveness and usefulness of this ISMB Tutorial. The link to the survey is here: https://forms.gle/y5s1Lm5yjn5pGxHJ8.
If you want to suggest improvements for this workshop or ask questions, you can do so as described here.
Record package and version information with sessionInfo
sessionInfo()
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.