knitr::opts_chunk$set(collapse = TRUE, comment = "#>")
"APL" is a package developed for computation of Association Plots, a method for visualization and analysis of single cell transcriptomics data. The main focus of "APL" is the identification of genes characteristic for individual clusters of cells from input data.
When working with r Rpackage("APL")
package please cite:
Association Plots: Visualizing associations in high-dimensional correspondence analysis biplots Elzbieta Gralinska, Martin Vingron bioRxiv 2020.10.23.352096; doi: https://doi.org/10.1101/2020.10.23.352096
For a mathematical description of the method, please refer to the manuscript.
The "APL" package requires R version >= 4.0.
In order to decrease the computation time of the singular value decomposition (SVD), we highly recommend the installation of pytorch
. More information on pytorch
installation are given below.
Instead, users can also opt to use the R native SVD. For this, please turn the argument python = FALSE
wherever applicable in this vignette.
library(reticulate) install_miniconda() conda_install(envname = "r-reticulate", packages = "numpy") conda_install(envname = "r-reticulate", packages = "pytorch")
To install pytorch
please download the appropriate Miniconda installer for your system from the conda website.
Follow the installation instructions on their website and make sure the R package reticulate
is also installed before proceeding.
Once installed, list all available conda environments via
conda info --envs
One of the environments should have r-reticulate
in its name. Depending on where
you installed it and your system, the exact path might be different.
Activate the environment and install pytorch
into it.
```{bash conda, eval=FALSE} conda activate ~/.local/share/r-miniconda/envs/r-reticulate # change path accordingly. conda install numpy conda install pytorch
# Preprocessing ## Setup In this vignette we will use the 3k Peripheral Blood Mononuclear Cell (PBMC) data from 10x Genomics as an example. The data necessary to follow the vignette can be downloaded from [here](https://cf.10xgenomics.com/samples/cell/pbmc3k/pbmc3k_filtered_gene_bc_matrices.tar.gz). Besides the package `r Rpackage("APL")` we will use the single-cell RNA-seq analysis suite `r CRANpkg("Seurat")` (V. 4.0.4) to preprocess the data, but the preprocessing could equally be done with `r Biocpkg("SingleCellExperiment")` and `r Biocpkg("scater")`\\`r Biocpkg("scran")`. For the preprocessing we follow the [Guided Clustering Tutorial](https://satijalab.org/seurat/articles/pbmc3k_tutorial.html) from the Seurat vignette. For details about the data preprocessing please refer to their website. ```r library(APL) library(Seurat) library(ggplot2) set.seed(1234)
We start with the loading and preprocessing of the 3k PBMC data:
CHANGE FILE PATH OR FIND OTHER SOLUTION
# Load the PBMC dataset pbmc.data <- Read10X(data.dir = "/project/markers/CA/manuscript_scdata/sc_data_PBMC3k/original_data/pbmc3k_filtered_gene_bc_matrices/filtered_gene_bc_matrices/hg19/") # Remove genes not expressed in any gene no_zeros_rows <- rowSums(pbmc.data) > 0 pbmc.data <- pbmc.data[no_zeros_rows,] # Initialize the Seurat object with the raw (non-normalized data) pbmc <- CreateSeuratObject(counts = pbmc.data, project = "pbmc3k", min.cells = 3, min.features = 200) pbmc[["percent.mt"]] <- PercentageFeatureSet(pbmc, pattern = "^MT-") # Filter data pbmc <- subset(pbmc, subset = nFeature_RNA > 200 & nFeature_RNA < 2500 & percent.mt < 5)
Association Plots from "APL" should be computed based on the normalized expression data. Therefore, we first normalize the counts from the 3k PBMC data. For now, "APL" requires the data to be clustered beforehand. That's why we also do the clustering.
# Normalization pbmc <- NormalizeData(pbmc, normalization.method = "LogNormalize", scale.factor = 10000, verbose = FALSE) pbmc <- FindVariableFeatures(pbmc, selection.method = "vst", nfeatures = 2000, verbose = FALSE) # Scaling all.genes <- rownames(pbmc) pbmc <- ScaleData(pbmc, features = all.genes, verbose = FALSE) # Run PCA pbmc <- RunPCA(pbmc, features = VariableFeatures(object = pbmc), verbose = FALSE) # Cell clustering pbmc <- FindNeighbors(pbmc, dims = 1:10, verbose = FALSE) pbmc <- FindClusters(pbmc, resolution = 0.5, verbose = FALSE) pbmc <- RunUMAP(pbmc, dims = 1:10, verbose = FALSE) new.cluster.ids <- c("0 - Naive CD4 T", "1 - CD14+ Mono", "2 - Memory CD4 T", "3 - B", "4 - CD8 T", "5 - FCGR3A+ Mono", "6 - NK", "7 - DC", "8 - Platelet") names(new.cluster.ids) <- levels(pbmc) pbmc <- RenameIdents(pbmc, new.cluster.ids) pbmc$cell_type <- Idents(pbmc) DimPlot(pbmc, reduction = "umap", label = F, pt.size = 0.5)
The fastest way to compute the Association Plot for a selected cluster of cells from the input data is by using a wrapper function runAPL()
. runAPL()
automates most of the analysis steps for ease of use.
For example, to generate the Association Plot for the B cells (cluster: "3 - B") we can use the following command:
runAPL(pbmc, assay = "RNA", slot = "data", group = which(pbmc$seurat_clusters == 3), score = TRUE, dims = 220, show_cols = FALSE)
The generated Association Plot is computed based on the log-normalized count matrix using 200 CA dimensions (parameter dims
). The cluster-specificity score ($S_\alpha$) for each gene is also calculated (score = TRUE
). More information on the choice of dimension number and the score $S_\alpha$ is present in the next section of the vignette.
Alternatively, Association Plots can be computed step-by-step. This allows to adjust the Association Plots to user's needs. Below we explain each step of the process of generating Association Plots.
The first step of Association Plot computations is correspondence analysis (CA). CA is a data dimensionality reduction similar to PCA, however it allows for a simultaneous embedding of both cells and genes from the input data in the same space. In this example we perform CA on the log-normalized count matrix from the 3k PBMC data.
# Computing CA on logcounts logcounts <- as.matrix(GetAssayData(pbmc, slot = "data")) ca <- cacomp(obj = logcounts, python = TRUE) # Above is equivalent to: ca <- cacomp(obj = pbmc, assay = "RNA", slot = "data", python = TRUE)
The function cacomp
accepts as an input any matrix with non-negative entries, be it a single-cell RNA-seq, bulk RNA-seq or other data. For ease of use, cacomp
accepts also r CRANpkg("Seurat")
and r Biocpkg("SingleCellExperiment")
objects, however for these we additionally have to specify via the assay
and/or slot
(for Seurat) parameter from where to extract the data.
When performing a feature selection before CA, we can set the argument top
to the desired number of genes with the higest variance across cells from the input data to retain for further analysis. By default, 5,000 genes are kept.
The output of cacomp
is an object of class cacomp
:
ca
As can be seen in the summarized output, by default both types of coordinates in the CA space (principal and standardized) are calculated.
Once the coordinates for the Association Plot are calculate, they will also be shown in the output of cacomp
. Slots are accessed similarly to other S4 classes:
ca@std_coords_cols[1:5,1:5]
In the case of r CRANpkg("Seurat")
and r Biocpkg("SingleCellExperiment")
objects, we can alternatively set return_input = TRUE
to get the input object back, with the CA results computed by "APL" and stored in the appropriate slot for dimension reduction. This also allows for using the plotting functions that come with these packages:
pbmc <- cacomp(obj = pbmc, assay = "RNA", slot = "data", return_input = TRUE, python = TRUE) DimPlot(pbmc, reduction = "CA", label = FALSE, pt.size = 0.5)
When working with high-dimensional data, after singular value decomposition there will be often many dimensions which will be representing the noise in the data. That's why generating Association Plots should be preceded by data dimension reduction step.
The number of dimensions to retain can be computed using the function pick_dims
. This function offers three standard methods which we implemented:
elbow rule (method = "elbow_rule"
) - the number of dimensions to retain is calculated based on scree plots generated for randomized data and corresponds to a point in the plot where the band of randomized singular values enters the band of the original singular values,
80% rule (method = "maj_inertia"
) - only those first dimensions are retained which in total account for >= 80% of total inertia,
* average rule (method = "avg_inertia"
) - only those dimensions are retained which account for more inertia than a single dimension on average.
Additionally, the user can compute a scree plot to choose the number of dimensions by themselves:
pick_dims(ca, method = "scree_plot") + xlim(c(0,75))
In the scree plot above we can see that the first dimension explains only ~0.75% of the total inertia and we observe the "jump" in the scree plot at roughly 10 dimensions. The first dimensions however explain only a small part of the total inertia.
Here we compute the number of dimensions using elbow rule. For demonstration, only five data permutations are computed:
pd <- pick_dims(ca, mat = GetAssayData(pbmc, slot = "data"), method = "elbow_rule", reps = 5, python = TRUE)
pd
In this case the elbow rule leads to a much higher number of dimensions.
# Compute the amount of inertia explained by each of the dimensions expl_inertia <- (ca@D^2/sum(ca@D^2))*100 # Compute the amount of intertia explained by number of dimensions defined by elbow rule sum(expl_inertia[seq_len(pd)])
In this example the elbow rule suggests to keep r pd
dimensions that explain r round(sum(expl_inertia[seq_len(pd)]),2)
% of the total inertia from the data.
Finally, we can reduce the data dimension to the desired number of dimensions:
ca <- subset_dims(ca, dims = pd)
When working with single-cell transcriptomics data we are often interested in which genes are associated to a cluster of cells. To reveal such genes we can compute an Association Plot for a selected cluster of cells. In the following example we want to generate an Association Plot for the cluster of platelets:
# Calculate Association Plot coordinates for platelets from 3k PBMC data ca <- apl_coords(ca, group = platelets)
After computing the coordinates of genes and cells in the Association Plot we are able to plot the results using the apl
function.
# Plot APL apl(ca, row_labs = TRUE, rows_idx = c("ITGA2B", "PF4", "GP1BA", "TUBB1"), #platelet marker genes type = "ggplot")
In the Association Plot all genes are represented by blue circles. The further to the right a gene is located the more associated it is with the chosen cluster of cells. Additionally, the lower the y-axis value, the more specific it is for the selected cluster. Additionally, it is possible to highlight in the Association Plot any set of genes. In the example above we additionally highlighted four genes ("ITGA2B", "PF4", "GP1BA", "TUBB1") which are known to be marker genes for platelets. As we can see in the plot, they are located in the right part of the plot, which confirms their specificity for platelets.
By default we plot only the genes in the Association Plot. To also display the cells in the Association Plot, use the argument show_cols = TRUE
. This way we can identify other cells which show similar expression profiles to the cells of interest. Cells that belong to the cluster of interest will be colored in red, and all remaining cells will be colored in green.
Computing the $S_\alpha$ scores, a measure of cluster-specificity of a gene, allows us to rank genes by their specificity for a selected cluster. The $S_\alpha$ scores can be computed using the apl_score
function. To show the $S_\alpha$ scores in the Association Plot use the argument show_score = TRUE
in the apl
function:
# Compute Salpha score # For the calculation the input matrix is also required. ca <- apl_score(ca, mat = as.matrix(GetAssayData(object = pbmc, slot = "data")), reps = 5, python = TRUE)
apl(ca, show_score = TRUE, show_cols = FALSE, type = "plotly")
By default, only genes that have a score larger than 0 are colored as these tend to be genes of interest.
However, for some datasets this cutoff is not suitable and can easily be changed through the score_cutoff
argument to apl()
.
The $S_\alpha$ scores are stored in ca@APL_score
:
head(ca@APL_score)
To see the expression of genes with the highest $S_\alpha$ scores (or any selected genes) across all cell types from the data we can use the functions provided by r CRANpkg("Seurat")
:
VlnPlot(pbmc, features = head(ca@APL_score$Rowname,3)) FeaturePlot(pbmc, features = head(ca@APL_score$Rowname,3))
As expected, the 3 most highly scored genes are over-expressed in the platelet cluster.
In addition to Association Plots "APL" produces also other forms of the output. For instance, we can use "APL" to generate a two- and three-dimensional correspondence analysis projection of the data. The so-called biplot visualizes then both cells and genes from the input data. To generate such biplots a cacomp
object is required. We can convert r CRANpkg("Seurat")
or r Biocpkg("SingleCellExperiment")
objects, which have the CA results stored, to a cacomp
object using the function as.cacomp()
:
# Converting the object pbmc ca <- as.cacomp(pbmc) # Specifying a cell cluster of interest platelets <- which(pbmc$cell_type == "8 - Platelet") # Creating a static plot # ca_biplot(ca, type = "ggplot", col_labels = platelets) # Creating an interactive plot ca_biplot(ca, col_labels = platelets)
A three-dimensional data projection plot can be generated using the function ca_3Dplot
:
ca_3Dplot(ca, col_labels = platelets)
The above described plots let us interactively explore the CA projection of the data. For a static version of the plot the paramter type = "ggplot"
should be used. This option is available for all plotting functions with interactive plots (except for the three-dimensional data projection).
After computing an Association Plot and identifying a set of genes specific for a selected cluster of cells we might be interested in conducting a Gene Ontology (GO) enrichment analysis of the identified gene set. The GO enrichment analysis of the B-cell specific genes (cluster 3 from the input data) idenitfied using Association Plot we first need to compute the coordinates of the genes in the Association Plot for B cells, as way as the $S_\alpha$ score for each gene:
# Get indices of cells in cluster 3 (B cells) c_three <- which(pbmc$seurat_clusters == 3) # Calculate Association Plot coordinates of the genes and the Salpha scores ca <- apl_coords(ca, group = c_three) ca <- apl_score(ca, mat = as.matrix(GetAssayData(object = pbmc, slot = "data")), reps = 5, python = TRUE)
Now we can conduct GO enrichment analysis as implemented in the package r Biocpkg("topGO")
using the most cluster-specific genes from the Association Plot. By default we use all genes with the positive $S_alpha$ score, or 1,000 genes with the highest $S_\alpha$ score if there are more than 1,000 genes with the $S_\alpha$ above 0.
enr <- apl_topGO(ca, ontology = "BP", organism = "hs", score_cutoff = 1)
plot_enrichment(enr)
Additionally, in the Association Plot we can highlight genes of interest identified for instance using the GO enrichment analysis. In the example below we highlight two B canonical B cell markers.
bcell_markers <- c("CD19", "MS4A1") apl(ca, show_score = TRUE, col_labs = FALSE, row_labs = TRUE, show_cols = FALSE, score_cutoff = 1, rows_idx = bcell_markers, type = "ggplot")
sessionInfo()
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.