knitr::opts_chunk$set( collapse = TRUE, comment = "#>" )
CellWalkR is an R package (described here) implementing the CellWalker method for combining scATAC-seq data with labels and other epigenetic data (see paper for algorithmic details). This vignette shows an example of running CellWalkR on a small set of scATAC-seq data to generate a cellWalk object which can then be used to assign labels to cells as well as cell-type specific labels to bulk data.
CellWalkR can work directly with scATAC-seq data pre-processed by many pipelines including CellRanger, ArchR, SnapATAC, and Cicero. We recommend using ArchR to pre-process data. ArchR processes BAM or Fragment files into project files that CellWalkR can use as input. Other common output types of of pre-processing pipelines such as Cell-by-Peak and Cell-by-Bin matrices can also be used as input.
Currently, CellWalkR must be installed useing devtools:
install.packages("devtools") devtools::install_github("PFPrzytycki/CellWalkR")
After CellWalkR is installed, load the package:
library(CellWalkR)
set.seed(1)
CellWalkR can work directly with scATAC-seq data pre-processed by many pipelines. For this vignette we provide sample data in the form of a cell-by-peak matrix (included) or an ArchR project (download here). This data is the subset of radial glia cells from Ziffra et al.
To work with scATAC-seq data in the form of a a cell-by-peak matrix, load the matrix and the corresponding peaks into a GRanges object as follows:
data("SampleCellWalkRData") ATACMat <- SampleCellWalkRData$ATACMat peaks <- SampleCellWalkRData$peaks #pathToMat <- system.file("extdata", "SamplePeakMat.mtx", package = "CellWalkR") #ATACMat <- Matrix::readMM(pathToMat) #pathToPeaks <- system.file("extdata", "SamplePeaks.txt", package = "CellWalkR") #peaks <- as(data.table::fread(pathToPeaks, header = FALSE)$V1, "GRanges")
To work with the ArchR project:
library('ArchR') ATACMat <- loadArchRProject(path = "Radial_Glia")
To work with SnapATAC, or Cicero data as input (not provided). See additional vignette for how to load these data types.
Now we can load our first set of labeling data. For this vignette we provide a set of labels for radial glia subtypes from Nowakowski et al.
pathToLabels <- system.file("extdata", "SampleMarkers1.txt", package = "CellWalkR") labelGenes <- data.table::fread(pathToLabels)
If no labeling data is available, findMarkers() can be run on a set of scRNA-seq data. This requires the Seurat package to be installed and a gene-by-cell matrix representing the scRNA-data.
labelGenes <- findMarkers(RNAMat, genes, barcodes)
The labeling data should consist of at least two columns, gene names (or other identifiers that match regions) and associated labels, with an optional third column for log-fold change in expression of that gene for that label.
head(labelGenes)
Next, we compute cell-to-cell similarity in order to build edges in the cell-to-cell portion of the graph.
cellEdges <- computeCellSim(ATACMat)
cellEdges is a cell-by-cell matrix of cell-to-cell similarity. Any matrix of cell similarity can be used.
cellEdges[1:5,1:5]
In order to generate label-to-cell edges, we need to define which genomic regions correspond to which genes. These could be promoters, gene bodies, or any other definition. If a specific set of regions associated with genes isn't already known, the getRegions() function can be used to generate a mapping. This function can retrieve hg38 and mm10 mappings. Other mappings can be generated using TxDb objects or biomaRt.
We'll use hg38 with Entrez identifiers in this example, using full gene bodies.
regions <- getRegions(geneBody = TRUE, genome = "hg38", names = "Entrez")
regions is a GRanges object with a gene_id field. This gene_id field needs to match the genes (or other identifiers) in the labeling data.
head(regions)
We then need to map between this data and the peaks in the scATAC-seq data.
ATACGenePeak <- mapPeaksToGenes(labelGenes, ATACMat, peaks, regions)
With this mapping we can compute label-to-cell edges, a matrix where the number of columns is the number of labels and the number of rows is the number of cells.
labelEdges <- computeLabelEdges(labelGenes, ATACMat, ATACGenePeak) head(labelEdges)
Although we now have cell-to-cell edges and label-to-cell edges, we don't know how to correctly weight the two relative to each other. The tuneEdgeWeights method will run CellWalker across a range of possible parameters and compute cell homogeneity for each. We make a list of labelEdges because there can be many of them, and sample down to 1000 cells for faster computation. Edge weights can optionally be tuned in parallel.
labelEdgesList <- list(labelEdges) edgeWeights <- tuneEdgeWeights(cellEdges, labelEdgesList, labelEdgeOpts = 10^seq(1,7,1), sampleDepth = 1000)
We can see which parameter had the highest cell homogeneity:
head(edgeWeights[order(edgeWeights$cellHomogeneity, decreasing = TRUE),])
We generate a cellWalk object with the above tuned edge weight parameter. This object stores the final influence matrix and can be used for downstream analysis.
cellWalk <- walkCells(cellEdges, labelEdgesList, labelEdgeWeights = 1e+07)
We may have some bulk epigenetic data that can help filter down which peaks are relevant to our analysis. We can tune weights on each filter to determine how significant it is to our data. For our example we have H3K4me3 data which indicates active promoters. Thus we apply this filter permissively (setting filterOut=FALSE) and at the whole gene level (filterGene=TRUE) rather than just to overlapping peaks.
pathToFilter <- system.file("extdata", "SampleFilter.bed", package = "CellWalkR") filter <- data.table::fread(pathToFilter) filter <- GRanges(filter$V1, IRanges(filter$V2, filter$V3))
filters <- list(filter) labelGenesList <- list(labelGenes) filterWeights <- tuneFilterWeights(cellEdges, labelGenesList, labelEdgesList, labelEdgeWeights = 1e+07, ATACMat, ATACGenePeak, filters = filters, filterOut = c(FALSE), filterGene = c(TRUE), regions=regions, sampleDepth = 1000) filterWeights
We see that adding this filter improves performance. We can make a new cellWalk object using this filter:
labelEdges <- computeLabelEdges(labelGenes, ATACMat, ATACGenePeak, filters = filters, filterWeights = c(1), filterOut = c(FALSE), filterGene = c(TRUE), regions = regions) labelEdgesList <- list(labelEdges) cellWalk <- walkCells(cellEdges, labelEdgesList, labelEdgeWeights = 1e+07)
Any number of filters can be applied this way. The filters parameters takes a list of GRanges describing the regions each filter applies to. Each GRange in the list must have an associated weight, permissiveness, and gene parameter.
Once we have created a cellWalk object, we can use it for downstream analysis.
Most directly, we can look at what labels are the most strongly linked to each cell. This is based on the maximum amount of label-to-cell influence in the cell walk. Cell labeling can be used for numerous further downstream analyses such as cell-type specific peak calling.
head(cellWalk$cellLabels)
However, in actuality, labels are "fuzzy" meaning each cell actually has a distribution of scores from each label. Thanks to this, we can examine how often labels are confused for each other. The label threshold determines the minimum influence score a cell must get to be considered labeled. Plotting requires the packages ggplot2 and reshape2.
install.packages("ggplot2") install.packages("reshape2")
cellWalk <- findUncertainLabels(cellWalk, threshold = .5, plot = TRUE)
We can also directly examine label similarity by considering label-to-label influence.
cellWalk <- clusterLabels(cellWalk, plot = TRUE)
Two-dimensional embeddings of cells can be a very helpful tool for understanding cell diversity. We can directly embed cell-to-cell influence scores. Importantly, this portion of the influence matrix is not used in establishing the labeling of cells. Thus this can serve as distinct way to explore how labels were distributed across cells. Because of this, unlike with most scATAC-seq analysis pipelines, clusters observed in this embedding may not directly correspond to labels. This can help understand cell diversity, as well as assist in identifying rare cell types. In addition to ggplot2, embedding also requires the package Rtsne. Cells can alternatively be embedding using a UMAP which requires the package uwot.
install.packages("Rtsne")
cellWalk <- plotCells(cellWalk, labelThreshold = 0, perplexity=3, seed = 1)
It is also possible to plot how strongly a single label influences each cell in the embedding.
cellWalk <- plotCells(cellWalk, cellTypes = c("RG-div2"), seed = 1)
Furthermore, to analyze rare cell types, it can be helpful to only plot a subset of of all labels.
cellWalk <- plotCells(cellWalk, cellTypes = c("RG-early","tRG","vRG"), labelThreshold = 0, seed = 1)
An alternative to two-dimensional embeddings of cells is to generate a Minimum Spanning Tree of the cell-to-cell influence scores. This way of plotting the underlying graph can help emphasize distinct cell types.
install.packages("igraph")
cellWalk <- computeMST(cellWalk, labelThreshold = 0, seed = 1)
A very powerful use for the cell walk is mapping data to labels via cell-to-label influence. For example, we can map enhancers to cell types. For this example we will map a set of 1,000 brain enhancers from the Vista Enhancer Browser.
pathToEnhancers <- system.file("extdata", "sampleEnhancers.bed", package = "CellWalkR") sampleEnhancers <- data.table::fread(pathToEnhancers) sampleEnhancers <- GRanges(sampleEnhancers$V1, IRanges(sampleEnhancers$V2, sampleEnhancers$V3))
labelScores <- labelBulk(cellWalk, sampleEnhancers[1:1000], ATACMat, peaks, allScores = TRUE) mappedLabel <- selectLabels(labelScores, z = 1.5) table(unlist(mappedLabel))
It can help to visualize how these enhancers map to the full hierarchy of cell types. We can see the total counts of enhancers:
install.packages("dendextend") install.packages("circlize")
p <- plotMultiLevelLabels(cellWalk, labelScores, z = 1.5)
Or look where a single enhancer is enriched or depleted:
p <- plotMultiLevelLabels(cellWalk, labelScores, z = 1.5, whichBulk = 45)
Once a cellWalk object has been created, downstream analysis can be performed using an interactive interface. Several additional R packages are required for the interface to function:
install.packages("shiny") install.packages("plotly") install.packages("ggplot2") install.packages("reshape2")
To be able to compute label scores for bulk data, the cellWalk object will need to raw ATAC data to be associated with it:
cellWalk <- storeMat(cellWalk, ATACMat, peaks)
Alternatively, if label scores for bulk data have already been computed, they can be added to the cellWalk object before launching the visualization:
cellWalk <- storeBulk(cellWalk, bulkPeaks=sampleEnhancers[1:1000], labelScores)
An interface can be launched as follows:
launchViz(cellWalk)
CellWalkR can be run on an arbitrary number of sets of labels and filters, each with it's own weight. Filters can selectively be applied to some sets of labels and not others. Here for example, we will add a second set of labels to which the above filter does not apply. These labels are for radial glia derived from the ganglionic eminence rather than the cortex.
pathToLabelsB <- system.file("extdata", "SampleMarkers2.txt", package = "CellWalkR") labelGenesB <- data.table::fread(pathToLabelsB)
ATACGenePeakB <- mapPeaksToGenes(labelGenesB, ATACMat, peaks, regions) labelEdgesB <- computeLabelEdges(labelGenesB, ATACMat, ATACGenePeakB)
Now simply tune edge weights as before with a list of all label edges.
labelEdgesListB <- list(labelEdges, labelEdgesB) edgeWeightsB <- tuneEdgeWeights(cellEdges, labelEdgesListB, labelEdgeOpts = 10^seq(4,7,1), sampleDepth = 1000) head(edgeWeightsB[order(edgeWeightsB$cellHomogeneity, decreasing = TRUE),])
We can then compute a new cell walk using the list of edges and a vector of optimal weights.
cellWalkB <- walkCells(cellEdges, labelEdgesListB, labelEdgeWeights = c(1e+04, 1e+06))
We can take a look at how many cells were assigned the old and new labels:
table(cellWalk$cellLabels, cellWalkB$cellLabels)
Like before, we can analyze the combined label set with a confusion matrix and with hierarchical clustering:
cellWalkB <- findUncertainLabels(cellWalkB, threshold = .5, plot = TRUE) cellWalkB <- clusterLabels(cellWalkB, plot = TRUE)
It is possible to detect potential doublets using CellWalkR by identifing cells that have high scores from two dissimilar cell types.
cellWalk <- scoreDoublets(cellWalk, plot=TRUE)
sessionInfo()
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.