knitr::opts_chunk$set( collapse = TRUE, comment = "#>" )
```{css, echo = FALSE}
.poll { background-color: #fff4d4; } .poll code { background-color: #fff4d4; }
# Part 1 Bulk RNA-seq Core This workshop will present how to perform analysis of RNA sequencing data following the tidy data paradigm [@wickham2014tidy]. The tidy data paradigm provides a standard way to organise data values within a dataset, where each variable is a column, each observation is a row, and data is manipulated using an easy-to-understand vocabulary. Most importantly, the data structure remains consistent across manipulation and analysis functions. This can be achieved for RNA sequencing data with the [tidybulk](https://stemangiola.github.io/tidybulk/), [tidyHeatmap](https://stemangiola.github.io/tidyHeatmap) [@mangiola2020tidyheatmap] and [tidyverse](https://www.tidyverse.org/) [@wickham2019welcome] packages. The tidybulk package provides a tidy data structure and a modular framework for bulk transcriptional analyses. tidyHeatmap provides a tidy implementation of ComplexHeatmap. These packages are part of the tidytranscriptomics suite that introduces a tidy approach to RNA sequencing data representation and analysis ### Acknowledgements Some of the material in Part 1 was adapted from an R for RNA sequencing workshop first run [here](http://combine-australia.github.io/2016-05-11-RNAseq/). Use of the airway dataset was inspired by the [DESeq2 vignette](http://bioconductor.org/packages/devel/bioc/vignettes/DESeq2/inst/doc/DESeq2.html). ```r knitr::include_graphics("../inst/vignettes/tidybulk_logo.png")
Measuring gene expression on a genome-wide scale has become common practice over the last two decades or so, with microarrays predominantly used pre-2008. With the advent of next generation sequencing technology in 2008, an increasing number of scientists use this technology to measure and understand changes in gene expression in often complex systems. As sequencing costs have decreased, using RNA sequencing to simultaneously measure the expression of tens of thousands of genes for multiple samples has never been easier. The cost of these experiments has now moved from generating the data to storing and analysing it.
There are many steps involved in analysing an RNA sequencing dataset. Sequenced reads are aligned to a reference genome, then the number of reads mapped to each gene can be counted. This results in a table of counts, which is what we perform statistical analyses on in R. While mapping and counting are important and necessary tasks, today we will be starting from the count data and showing how differential expression analysis can be performed in a friendly way using the Bioconductor package, tidybulk.
knitr::include_graphics("../inst/vignettes/bulk_RNAseq_pipeline.png")
First, let’s load all the packages we will need to analyse the data.
Note: you should load the tidybulk library after the tidyverse core packages for best integration.
# load libraries # dataset library(airway) # tidyverse core packages library(tibble) library(dplyr) library(tidyr) library(readr) library(stringr) library(ggplot2) # tidyverse-friendly packages library(plotly) library(ggrepel) library(GGally) library(tidyHeatmap) library(tidybulk)
Plot settings. Set the colours and theme we will use for our plots.
# Use colourblind-friendly colours friendly_cols <- dittoSeq::dittoColors() # Set theme custom_theme <- list( scale_fill_manual(values = friendly_cols), scale_color_manual(values = friendly_cols), theme_bw() + theme( panel.border = element_blank(), axis.line = element_line(), panel.grid.major = element_line(size = 0.2), panel.grid.minor = element_line(size = 0.1), text = element_text(size = 12), legend.position = "bottom", strip.background = element_blank(), axis.title.x = element_text(margin = margin(t = 10, r = 10, b = 10, l = 10)), axis.title.y = element_text(margin = margin(t = 10, r = 10, b = 10, l = 10)), axis.text.x = element_text(angle = 30, hjust = 1, vjust = 1) ) )
Here we will perform our analysis using the data from the airway package. The airway data comes from the paper by [@himes2014rna]; and it includes 8 samples from human airway smooth muscle cells, from 4 cell lines. For each cell line treated (with dexamethasone) and untreated (negative control) a sample has undergone RNA sequencing and gene counts have been generated.
The airway data is stored as a Bioconductor RangedSummarizedExperiment object. We will convert this object into a tidybulk tibble. A tibble is the tidyverse table format.
In this workshop we will be using the tidyverse pipe %>%
. This 'pipes' the output from the command on the left into the command on the right/below. Using the pipe is not essential but it reduces the amount of code we need to write when we have multiple steps (as we'll see later). It also can make the steps clearer and easier to see. For more details on the pipe see here.
# load airway RNA sequencing data data(airway) # convert to tidybulk tibble counts_airway <- airway %>% tidybulk() # take a look counts_airway
The counts_airway
object contains information about genes and samples, the first column has the Ensembl gene identifier, the second column has the sample identifier and the third column has the gene transcription abundance. The abundance is the number of reads aligning to the gene in each experimental sample. The remaining columns include sample-wise information. The dex column tells us whether the samples are treated or untreated and the cell column tells us what cell line they are from.
We can shorten the sample names. We can remove the SRR1039 prefix that's present in all of them, as shorter names can fit better in some of the plots we will create. We can use mutate()
together with str_replace()
to remove the SRR1039 string from the sample column.
We can get the gene symbols for these Ensembl gene ids using the Bioconductor annotation package for human, org.Hs.eg.db
and add them as a column using mutate
again.
With tidyverse, all above steps can be linked with the %>%
, as shown below. This has the benefits that
# setup data workflow counts_tt <- airway %>% tidybulk() %>% mutate(sample=str_remove(sample, "SRR1039")) %>% mutate(symbol = AnnotationDbi::mapIds(org.Hs.eg.db::org.Hs.eg.db, keys = as.character(feature), keytype = "ENSEMBL", column="SYMBOL", multiVals = "first")) # take a look counts_tt
From this tidybulk tibble, we can perform differential expression analysis with the tidybulk package.
Genes with very low counts across all libraries provide little evidence for differential expression and they can interfere with some of the statistical approximations that are used later in the pipeline. They also add to the multiple testing burden when estimating false discovery rates, reducing power to detect differentially expressed genes. These genes should be filtered out prior to further analysis.
We can perform the filtering using tidybulk keep_abundant
or identify_abundant
. These functions can use the edgeR filterByExpr
function described in [@law2016rna] to automatically identify the genes with adequate abundance for differential expression testing. By default, this will keep genes with ~10 counts in a minimum number of samples, the number of the samples in the smallest group. In this dataset the smallest group size is four (as we have four dex-treated samples vs four untreated). Alternatively, we could use identify_abundant
to identify which genes are abundant or not (TRUE/FALSE), rather than just keeping the abundant ones.
# Filtering counts counts_filtered <- counts_tt %>% keep_abundant(factor_of_interest=dex) # take a look counts_filtered
After running keep_abundant
we have a column called .abundant
containing TRUE (identify_abundant
would have TRUE/FALSE).
Scaling of counts, normalisation, is performed to eliminate uninteresting differences between samples due to sequencing depth or composition. A more detailed explanation can be found here. In the tidybulk package the function scale_abundance
generates scaled counts, with scaling factors calculated on abundant (filtered) transcripts and applied to all transcripts. We can choose from different normalisation methods. Here we will use the default, edgeR's trimmed mean of M values (TMM), [@robinson2010scaling]. TMM normalisation (and most scaling normalisation methods) scale relative to one sample.
# Scaling counts counts_scaled <- counts_filtered %>% scale_abundance() # take a look counts_scaled
After we run scale_abundance
we should see some columns have been added at the end. The counts_scaled
column contains the scaled counts.
We can visualise the difference of abundance densities before and after scaling. As tidybulk output is compatible with tidyverse, we can simply pipe it into standard tidyverse functions such as filter
, pivot_longer
and ggplot
. We can also take advantage of ggplot's facet_wrap
to easily create multiple plots.
counts_scaled %>% # Reshaping pivot_longer(cols = c("counts", "counts_scaled"), names_to = "source", values_to = "abundance") %>% # Plotting ggplot(aes(x=abundance + 1, color=sample)) + geom_density() + facet_wrap(~source) + scale_x_log10() + custom_theme
In this dataset the distributions of the counts are not very different to each other before scaling but scaling does make the distributions more similar. If we saw a sample with a very different distribution we may need to investigate it.
As tidybulk smoothly integrates with ggplot2 and other tidyverse packages it can save on typing and make plots easier to generate. Compare the code for creating density plots with tidybulk versus standard base R below (standard code adapted from [@law2016rna]).
tidybulk
# tidybulk airway %>% tidybulk() %>% keep_abundant(factor_of_interest=dex) %>% scale_abundance() %>% pivot_longer(cols = c("counts", "counts_scaled"), names_to = "source", values_to = "abundance") %>% ggplot(aes(x=abundance + 1, color=sample)) + geom_density() + facet_wrap(~source) + scale_x_log10() + custom_theme
base R using edgeR
# Example code, no need to run # Prepare data set library(edgeR) dgList <- SE2DGEList(airway) group <- factor(dgList$samples$dex) keep.exprs <- filterByExpr(dgList, group=group) dgList <- dgList[keep.exprs,, keep.lib.sizes=FALSE] nsamples <- ncol(dgList) logcounts <- log2(dgList$counts) # Setup graphics col <- RColorBrewer::brewer.pal(nsamples, "Paired") par(mfrow=c(1,2)) # Plot raw counts plot(density(logcounts[,1]), col=col[1], lwd=2, ylim=c(0,0.26), las=2, main="", xlab="") title(main="Counts") for (i in 2:nsamples){ den <- density(logcounts[,i]) lines(den$x, den$y, col=col[i], lwd=2) } legend("topright", legend=dgList$samples$Run, text.col=col, bty="n") # Plot scaled counts dgList_norm <- calcNormFactors(dgList) lcpm_n <- cpm(dgList_norm, log=TRUE) plot(density(lcpm_n[,1]), col=col[1], lwd=2, ylim=c(0,0.26), las=2, main="", xlab="") title("Counts scaled") for (i in 2:nsamples){ den <- density(lcpm_n[,i]) lines(den$x, den$y, col=col[i], lwd=2) } legend("topright", legend=dgList_norm$samples$Run, text.col=col, bty="n")
By far, one of the most important plots we make when we analyse RNA sequencing data are principal-component analysis (PCA) or multi-dimensional scaling (MDS) plots. We reduce the dimensions of the data to identify the greatest sources of variation in the data. A principal components analysis is an example of an unsupervised analysis, where we don't need to specify the groups. If your experiment is well controlled and has worked well, what we hope to see is that the greatest sources of variation in the data are the treatments/groups we are interested in. It is also an incredibly useful tool for quality control and checking for outliers. We can use the reduce_dimensions
function to calculate the dimensions.
# Get principal components counts_scal_PCA <- counts_scaled %>% reduce_dimensions(method="PCA")
This joins the result to the counts object.
# Take a look
counts_scal_PCA
For plotting, we can select just the sample-wise information with pivot_sample
.
# take a look counts_scal_PCA %>% pivot_sample()
We can now plot the reduced dimensions.
# PCA plot counts_scal_PCA %>% pivot_sample() %>% ggplot(aes(x=PC1, y=PC2, colour=dex, shape=cell)) + geom_point() + geom_text_repel(aes(label=sample), show.legend = FALSE) + custom_theme
The samples separate by treatment on PC1 which is what we hope to see. PC2 separates the N080611 cell line from the other samples, indicating a greater difference between that cell line and the others.
An alternative to principal component analysis for examining relationships between samples is using hierarchical clustering. Heatmaps are a nice visualisation to examine hierarchical clustering of your samples. tidybulk has a simple function we can use, keep_variable
, to extract the most variable genes which we can then plot with tidyHeatmap.
counts_scal_PCA %>% # extract 500 most variable genes keep_variable(.abundance = counts_scaled, top = 500) %>% # create heatmap heatmap( .column = sample, .row = feature, .value = counts_scaled, transform = log1p ) %>% add_tile(dex) %>% add_tile(cell)
In the heatmap we can see the samples cluster into two groups, treated and untreated, for three of the cell lines, and the cell line (N080611) again is further away from the others.
Tidybulk enables a simplified way of generating a clustered heatmap of variable genes. Compare the code below for tidybulk versus a base R method.
base R using edgeR
# Example code, no need to run library(edgeR) dgList <- SE2DGEList(airway) group <- factor(dgList$samples$dex) keep.exprs <- filterByExpr(dgList, group=group) dgList <- dgList[keep.exprs,, keep.lib.sizes=FALSE] dgList <- calcNormFactors(dgList) logcounts <- cpm(dgList, log=TRUE) var_genes <- apply(logcounts, 1, var) select_var <- names(sort(var_genes, decreasing=TRUE))[1:500] highly_variable_lcpm <- logcounts[select_var,] colours <- c("#440154FF", "#21908CFF", "#fefada" ) col.group <- c("red","grey")[group] gplots::heatmap.2(highly_variable_lcpm, col=colours, trace="none", ColSideColors=col.group, scale="row")
tidybulk integrates several popular methods for differential transcript abundance testing: the edgeR quasi-likelihood [@chen2016reads] (tidybulk default method), edgeR likelihood ratio [@mccarthy2012differential], limma-voom [@law2014voom] and DESeq2 [@love2014moderated]. A common question researchers have is which method to choose. With tidybulk we can easily run multiple methods and compare.
We give test_differential_abundance
our tidybulk counts object and a formula, specifying the column that contains our groups to be compared. If all our samples were from the same cell line, and there were no additional factors contributing variance such as batch differences, we could use the formula ~ dex
. However, each treated and untreated sample is from a different cell line so we add the cell line as an additional factor ~ dex + cell
.
de_all <- counts_scal_PCA %>% # edgeR QLT test_differential_abundance( ~ dex + cell, method = "edger_quasi_likelihood", prefix = "edgerQLT_" ) %>% # edgeR LRT test_differential_abundance( ~ dex + cell, method = "edger_likelihood_ratio", prefix = "edgerLR_" ) %>% # limma-voom test_differential_abundance( ~ dex + cell, method = "limma_voom", prefix = "voom_" ) %>% # DESeq2 test_differential_abundance( ~ dex + cell, method = "deseq2", prefix = "deseq2_" ) # take a look de_all
This outputs the columns from each method such as log-fold change (logFC), false-discovery rate (FDR) and probability value (p-value). logFC is log2(treated/untreated).
We can visually compare the significance for all methods. We will notice that there is some difference between the methods.
de_all %>% pivot_transcript() %>% select(edgerQLT_PValue, edgerLR_PValue, voom_P.Value, deseq2_pvalue, feature) %>% ggpairs(1:4)
Note: You may notice that the methods produce columns with different names for similar outputs. If you wish to make these consistent you can do that with tidyverse rename
. For example, to rename the p value adjusted columns you could run below.
de_all %>% rename(deseq2_FDR = deseq2_padj, voom_FDR = voom_adj.P.Val)
Or if we just wanted to run one method we could do that. The default method is edgeR quasi-likelihood.
counts_de <- counts_scal_PCA %>% test_differential_abundance(~ dex + cell)
Tidybulk enables a simplified way of performing an RNA sequencing differential expression analysis (with the benefit of smoothly integrating with ggplot2 and other tidyverse packages). Compare the code for a tidybulk edgeR analysis versus standard edgeR below.
standard edgeR
# Example code, no need to run library(edgeR) dgList <- SE2DGEList(airway) group <- factor(dgList$samples$dex) keep.exprs <- filterByExpr(dgList, group=group) dgList <- dgList[keep.exprs,, keep.lib.sizes=FALSE] dgList <- calcNormFactors(dgList) cell <- factor(dgList$samples$cell) design <- model.matrix(~ 0 + group + cell) dgList <- estimateDisp(dgList, design) fit <- glmQLFit(dgList, design) TvsU <- makeContrasts(TvsU=grouptrt-groupuntrt, levels=design) qlf <- glmQLFTest(fit, contrast=TvsU)
We'll extract the symbols for a few top genes (by P value) to use in some of the plots we will make.
topgenes_symbols <- counts_de %>% pivot_transcript() %>% arrange(PValue) %>% head(6) %>% pull(symbol)
Volcano plots are a useful genome-wide plot for checking that the analysis looks good. Volcano plots enable us to visualise the significance of change (p-value) versus the fold change (logFC). Highly significant genes are towards the top of the plot. We can also colour significant genes (e.g. genes with false-discovery rate < 0.05)
# volcano plot, minimal counts_de %>% ggplot(aes(x=logFC, y=PValue, colour=FDR < 0.05)) + geom_point() + scale_y_continuous(trans = "log10_reverse") + custom_theme
A more informative plot, integrating some of the packages in tidyverse.
counts_de %>% pivot_transcript() %>% # Subset data mutate(significant = FDR<0.05 & abs(logFC) >=2) %>% mutate(symbol = ifelse(symbol %in% topgenes_symbols, as.character(symbol), "")) %>% # Plot ggplot(aes(x = logFC, y = PValue, label=symbol)) + geom_point(aes(color = significant, size = significant, alpha=significant)) + geom_text_repel() + # Custom scales custom_theme + scale_y_continuous(trans = "log10_reverse") + scale_color_manual(values=c("black", "#e11f28")) + scale_size_discrete(range = c(0, 2))
Before following up on the differentially expressed genes with further lab work, it is also recommended to have a look at the expression levels of the individual samples for the genes of interest. We can use stripcharts to do this. These will help show if expression is consistent amongst replicates in the groups.
With stripcharts we can see if replicates tend to group together and how the expression compares to the other groups. We'll also add a box plot to show the distribution. Tidyverse faceting makes it easy to create a plot for each gene.
strip_chart <- counts_scaled %>% # extract counts for top differentially expressed genes filter(symbol %in% topgenes_symbols) %>% # make faceted stripchart ggplot(aes(x = dex, y = counts_scaled + 1, fill = dex, label = sample)) + geom_boxplot() + geom_jitter() + facet_wrap(~symbol) + scale_y_log10()+ custom_theme strip_chart
A really nice feature of using tidyverse and ggplot2 is that we can make interactive plots quite easily using the plotly package. This can be very useful for exploring what genes or samples are in the plots. We can make interactive plots directly from our ggplot2 object (strip_chart). Having label
in the aes
is useful to visualise the identifier of the data point (here the sample id) or other variables when we hover over the plot.
We can also specify which parameters from the aes
we want to show up when we hover over the plot with tooltip
.
strip_chart %>% ggplotly(tooltip = c("label", "y"))
Tidybulk provides a handy function called get_bibliography
that keeps track of the references for the methods used in your tidybulk workflow. The references are in BibTeX format and can be imported into your reference manager.
get_bibliography(counts_de)
If we are sequencing tissue samples, we may want to know what cell types are present and if there are differences in expression between them. tidybulk
has a deconvolve_cellularity
function that can help us do this.
For this example we will use a subset of the breast cancer dataset from The Cancer Genome Atlas (TCGA).
BRCA_tidy <- biocasia2020tidytranscriptomics::BRCA %>% tidybulk(patient, transcript, count) BRCA_tidy
With tidybulk, we can easily infer the proportions of cell types within a tissue using one of several published methods (Cibersort [@newman2015robust], EPIC [@racle2017simultaneous] and llsr [@abbas2009deconvolution]). Here we will use Cibersort which provides a default signature called LM22 to define the cell types. LM22 contains 547 genes that identify 22 human immune cell types.
deconvolve_cellularity
adds cell type proportions to the tibble as new columns. The prefix makes it easy to reshape the data frame if needed, for visualisation or further analyses. For example, we can plot the proportions of immune cell types for each patient.
BRCA_tidy %>% # Identifying cell types deconvolve_cellularity(action="get", cores=2) %>% # Reshaping pivot_longer( contains("cibersort"), names_prefix = "cibersort: ", names_to = "cell_type", values_to = "proportion" ) %>% # Plotting ggplot(aes(x=patient, y=proportion, fill=cell_type)) + geom_bar(stat = "identity") + custom_theme
tidybulk
allows streamlined multi-method analysestidybulk
allow easy analyses of cell type compositionSome things we don't have time to cover in Part 1 of this workshop can be found in the Supplementary material.
knitr::include_graphics("../inst/vignettes/single_cell_RNAseq_pipeline.png")
In Part 1 we showed how we can study the cell-type composition of a biological sample using bulk RNA sequencing. Single cell sequencing enables a more direct estimation of cell-type composition and gives greater resolution. For bulk RNA sequencing we need to infer the cell types using the abundance of transcripts in the whole sample, with single-cell RNA sequencing we can directly measure the transcripts in each cell and then classify the cells into cell types.
SingleCellExperiment is a very popular container for single cell RNA sequencing data [@sce].
tidySCE provides a bridge between the SingleCellExperiment single-cell package [@sce] and the tidyverse [@wickham2019welcome]. It enables the display of the SingleCellExperiment object as a tidyverse tibble, and provides SingleCellExperiment-compatible dplyr, tidyr, ggplot and plotly functions.
# load additional libraries library(biocasia2020tidytranscriptomics) library(dplyr) library(purrr) library(stringr) library(SummarizedExperiment) library(SingleCellExperiment) library(scater) library(scran) library(igraph) library(SingleR) library(tidySCE)
tidySCE
This is a SingleCellExperiment object but it is evaluated as tibble. So it is fully compatible both with SingleCellExperiment and tidyverse APIs.
pbmc_small_tidy <- biocasia2020tidytranscriptomics::pbmc_small %>% tidy()
It looks like a tibble
pbmc_small_tidy
But it is a SingleCellExperiment object after all
assayNames(pbmc_small_tidy) # from SummarizedExperiment
We can interact with our object as we do with any tibble. In this case we want to polish an annotation column.
pbmc_small_polished <- pbmc_small_tidy %>% # Clean groups mutate(groups = groups %>% str_remove("^g")) %>% # Extract sample extract(file, "sample", "../data/sample([a-z0-9]+)/outs.+") pbmc_small_polished
We can treat pbmc_small_polished
as a SingleCellExperiment object
counts <- assay(pbmc_small_polished, "counts") # from SummarizedExperiment libsizes <- colSums(counts) size.factors <- libsizes/mean(libsizes) logcounts(pbmc_small_polished) <- log2(t(t(counts)/size.factors) + 1) # from SingleCellExperiment assayNames(pbmc_small_polished) # from SummarizedExperiment
Here we plot abundance of two transcripts for each group.
pbmc_small_polished_abundance = pbmc_small_polished %>% # Extract abundance join_transcripts(transcripts=c("HLA-DRA", "TCL1A")) pbmc_small_polished_abundance
We proceed to plot. Note: we don't need to create many of these temporary variables. They are created here for educational purposes only.
pbmc_small_polished_abundance %>% ggplot2::ggplot(aes(groups, abundance_counts + 1, fill=groups)) + geom_boxplot(outlier.shape=NA) + geom_jitter(alpha=0.5, width=0.2) + scale_y_log10() + facet_wrap(~transcript, scales="free_y") + custom_theme
We proceed with data processing with Bioconductor packages, such as scran [@lun2016pooling] and scater [@mccarthy2017scater].
variable_genes <- pbmc_small_polished %>% modelGeneVar() %>% # from scran getTopHVGs(prop=0.1) # from scran # Perform PCA with scater pbmc_small_pca <- pbmc_small_polished %>% runPCA(subset_row=variable_genes) # from scater pbmc_small_pca
If a tidyverse-compatible package is not included in the tidySCE collection, we can use as_tibble
to permanently convert tidySCE
into a tibble.
# Create pairs plot with GGally pbmc_small_pca %>% as_tibble() %>% GGally::ggpairs(columns=4:8, ggplot2::aes(colour=groups)) + custom_theme
Beside PCA which is a linear dimensionality reduction, we can apply neighbour aware methods such as UMAP, to better define locally similar cells. We can calculate the first 3 UMAP dimensions using the scater framework.
pbmc_small_UMAP <- pbmc_small_pca %>% runUMAP(ncomponents=3) # from scater pbmc_small_UMAP
We can proceed with cluster identification with scran.
pbmc_small_cluster <- pbmc_small_UMAP # Assign clusters to the 'colLabels' # of the SingleCellExperiment object colLabels(pbmc_small_cluster) <- # from SingleCellExperiment pbmc_small_pca %>% buildSNNGraph(use.dimred="PCA") %>% # from scran - shared nearest neighbor cluster_walktrap() %$% # from igraph membership %>% as.factor() # Reorder columns pbmc_small_cluster %>% select(label, everything())
Now we can interrogate the object as if it was a regular tibble data frame.
pbmc_small_cluster %>% count(groups, label)
And we can plot them using 3D plot using plotly.
pbmc_small_cluster %>% plot_ly( x = ~UMAP1, y = ~UMAP2, z = ~UMAP3, color = ~ label, colors = friendly_cols[1:5] )
knitr::include_graphics("../inst/vignettes/plotly.png")
plotHeatmap( # from scater pbmc_small_cluster, features=variable_genes, columns=order(pbmc_small_cluster$label), colour_columns_by=c("label") )
pbmc_small_cluster %>% # Get transcript abundance join_transcripts(transcripts=variable_genes) %>% # Plot heatmap heatmap( # from tidyHeatmap .row = transcript, .column = cell, .value = abundance_logcounts ) %>% # Add annotation add_tile(label) %>% add_point(PC1)
We can infer cell type identities using SingleR [@aran2019reference] and manipulate the output using tidyverse. SingleR accepts any log-normalised transcript abundance matrix.
# DO NOT EXECUTE - CELLDEX DEPENDENCY NOT IN BIOCONDUCTOR YET # Get cell type reference data hpca <- HumanPrimaryCellAtlasData() # Infer cell identities cell_type_df <- # extracting counts from SingleCellExperiment object assays(pbmc_small_cluster)$logcounts%>% # SingleR SingleR( ref = hpca, labels = hpca$label.main, method = "cluster", clusters = pbmc_small_cluster %>% pull(label) ) %>% # Formatting results as.data.frame() %>% as_tibble(rownames = "label") %>% select(label, first.labels)
We have pre-calculated the cluster classification.
cell_type_df <- biocasia2020tidytranscriptomics::cell_type_df cell_type_df
# Join UMAP and cell type info pbmc_small_cell_type <- pbmc_small_cluster %>% left_join( cell_type_df, by = "label" ) pbmc_small_cell_type
We can easily summarise the results. For example, we can see how cell type classification overlaps with cluster classification.
pbmc_small_cell_type %>% count(label, first.labels)
A powerful tool we can use with tidySCE is nest
. We can easily perform independent analyses on subsets of the dataset. First we classify cell types in lymphoid and myeloid; then, nest based on the new classification
pbmc_small_nested <- pbmc_small_cell_type %>% filter(first.labels != "Platelets") %>% mutate(cell_class = if_else( `first.labels` %in% c("Macrophage", "Monocyte"), "myeloid", "lymphoid" ) ) %>% nest(data = -cell_class) pbmc_small_nested
Now we can independently perform analyses for the lymphoid and myeloid cell subsets.
pbmc_small_nested %>% mutate(variable_genes = map_chr( data, ~ .x %>% modelGeneVar() %>% getTopHVGs(prop=0.05) %>% paste(collapse=", ") ))
tidySCE
is an invisible layer that operates on a SingleCellExperiment
object and enables us to visualise and manipulate data as if it were a tidy data frame.tidySCE
object is a SingleCellExperiment object
so it can be used with any SingleCellExperiment
compatible method```{poll class.source="poll"} Poll: What would be your preferred name for the package currently called tidySCE?
# Contributing If you want to suggest improvements for this workshop or ask questions, you can do so as described [here](https://github.com/stemangiola/biocasia2020_tidytranscriptomics/blob/master/CONTRIBUTING.md). # Reproducibility Record package and version information with `sessionInfo` ```r sessionInfo()
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.