require(BiocStyle)
The main purpose of the LandSCENT
package is to provide a means of estimating the differentiation potency of single cells without the need to assume prior biological knowledge (e.g. marker expression or timepoint). As such, it may provide a more unbiased means for assessing potency or pseudotime.
The package features:
SingleCellExperiment
class and CellDataSet
class for interoperability with a wide range of other Bioconductor packages, like r Biocpkg("scater")
and r Biocpkg("monocle")
;This document gives a detailed tutorial of the LandSCENT
package from data normalization to result visualization. LandSCENT
package requires two main sources of input:
How to prepare these inputs for using in LandSCENT
will be described in detail in the following sections.
LandSCENT
requires as input a user defined functional gene network, for instance, a protein-protein interaction(PPI) network including the main interactions that take place in a cell. Although these networks are mere caricatures of the underlying signaling networks, ignoring time, spatial and biological contexts, one of the discoveries made recently is that cell potency appears to be encoded by a subtle positive correlation between transcriptome and connectome, with hubs in these networks generally exhibit higher expression in more potent cells. For details we refer the reader to our publications given at the end of this vignette[@SCENT1] and [@SCENT4].
In this vignette we will use a previously defined PPI network to calculate all the example results. The specific PPI network we use here is derived from Pathway Commons, which is an integrated resource collating together PPIs from several distinct sources. In particular, the network is constructed by integrating the following sources: the Human Protein Reference Database (HPRD), the National Cancer Institute Nature Pathway Interaction Database (NCI-PID), the Interactome (Intact) and the Molecular Interaction Database (MINT).
We have stored the networks in the package. There are two versions of protein-protein interaction (PPI) network under filenames "net17Jan2016.m.RData" and "net13Jun12.m.RData"(early version). You can access these with the data
function. Here we use the early version network:
library(LandSCENT) data(net13Jun12.m)
Importantly, the nodes (genes) in this network are labeled with Entrez gene IDs, and entries take on values "0" and "1", with "0" indicating that there is no interaction or connection between the two genes, and ”1” indicating there an interaction has been reported. It is also important to note that the diagonal entries are set to "0".
We assume that you have a matrix/object containing expression count data summarised at the level of genes. You then need to do quality control and normalization on the data.
If you have a normalized data matrix already, you can directly go to the 4th section of this vignette.
Moreover, we provide input arguements for SingleCellExperiment
(scater) and CellDataSet
(monocle) class in DoIntegPPI
function. If you have objects of such two classes, you could also directly go to the 4th section.
Here we use a scRNA-Seq dataset from [@ChuData], generated with the Fluidigm C1 platform, as an example. Due to the package size restriction, we cannot store the whole data matrix in the package. However, the example is a sub-dataset of 100 cells with two cell types: pluripotent human embryonic stem cells (hESC) and non-pluripotent progenitors of endothelial cells (EC). You can access the raw data and phenotype information using data
function.
data(rawExample.m) data(phenoExample.v)
The full dataset can be downloaded from the GEO website under accession number GSE75748, and the specific file to download is one of the supplementary file under filename "GSE75748_sc_cell_type_ec.csv.gz". You can download the dataset with getGEOSuppFiles
function in r Biocpkg("GEOquery")
package and extract the raw expression matrix from the supplementary files.
require(GEOquery) require(Biobase) GSE75748 <- getGEOSuppFiles("GSE75748") gunzip(rownames(GSE75748)[3]) rawdata.m <- as.matrix(read.csv("GSE75748/GSE75748_sc_cell_type_ec.csv", row.names = 1))
We also provide the phenotype information of the full dataset in the package. It can be easily loaded into your session using data
function.
data(phenoscChu.v)
Here we use r Biocpkg("scater")
package to do quality control and normalizaion on the raw data. Since LandSCENT
is rather robust to different normalization methods, you can always choose more suitable workflow for your own dataset, just guaranteeing the normalized data meet the specific requirements of LandSCENT
package.
First, we create a SingleCellExperiment
object containing the rawdata. Rows of the object correspond to genes, while columns correspond to single cells.
require(scater) example.sce <- SingleCellExperiment(assay = list(counts = rawExample.m))
Then, we detect low-quality cells based on library size and number of expressed genes in each library. We also select cells with low proportion of mitochondrial genes and spike-in RNA. Using the isOutlier
function in r Biocpkg("scater")
package with default arrguments, we remove 4 cells and most of them (3 cells) are filtered out because of high spike-in gene content.
### Detect mitochondrial gene and spike-in RNA is.mito <- grepl("^MT", rownames(example.sce)) is.spike <- grepl("^ERCC", rownames(example.sce)) counts(example.sce) <- as(counts(example.sce), "dgCMatrix") example.QC <- perCellQCMetrics(example.sce, subsets=list(Spike=is.spike, Mito=is.mito), flatten=FALSE) ### Cell Filtering wih isOutlier function libsize.drop <- isOutlier(example.QC$total, nmads=5, type="lower", log=TRUE); mito.drop <- isOutlier(example.QC$subsets$Mito$sum, nmads=5, type="higher"); spike.drop <- isOutlier(example.QC$subsets$Spike$sum, nmads=10, type="higher"); filter_example.sce <- example.sce[, !(libsize.drop | mito.drop | spike.drop)] phenoExample.v <- phenoExample.v[!(libsize.drop | mito.drop | spike.drop)] data.frame(ByLibSize=sum(libsize.drop), ByMito=sum(mito.drop), BySpike=sum(spike.drop), Remaining=ncol(filter_example.sce)) example.sce <- filter_example.sce
After quality control, we then move to the normalization part. r Biocpkg("scater")
package defines the size factors from the scaled library sizes of all cells.
sizeFactors(example.sce) <- librarySizeFactors(example.sce)
Scaling normalization is then used to remove cell-specific biases, e.g. coverage or capture efficiency. Log-transformed normalized expression values can be simply computed with normalize
function.
Importantly, we need to add an offset value of 1.1 before log-transformation. The offset is added in order to ensure that the minimum value after log-transformation would not be 0, but a nonzero value (typically log2(1.1)~0.14). We do not want zeroes in our final matrix since the computation of signaling entropy rate involves ratios of gene expression values and zeros in the denominator are undefined.
example.sce <- logNormCounts(example.sce, pseudo_count = 1.1) example.m <- as.matrix(assay(example.sce, i = "logcounts")) min(example.m)
Because LandSCENT
will integrate the network with scRNA-Seq data, the row names and column names of the network must use the same gene identifier as used in the scRNA-Seq data.
In our example, rownames of the data matrix are all human gene symbols. So we need to use mapIds
function in r Biocpkg("AnnotationDbi")
package along with r Biocpkg("org.Hs.eg.db")
package to get corresponding human Entrez gene ID.
require(AnnotationDbi) require(org.Hs.eg.db) anno.v <- mapIds(org.Hs.eg.db, keys = rownames(example.m), keytype = "SYMBOL", column = "ENTREZID", multiVals = "first") unique_anno.v <- unique(anno.v) example_New.m <- matrix(0, nrow = length(unique_anno.v), ncol = dim(example.m)[2]) for (i in seq_len(length(unique_anno.v))) { tmp <- example.m[which(anno.v == unique_anno.v[i]) ,] if (!is.null(dim(tmp))) { tmp <- colSums(tmp) / dim(tmp)[1] } example_New.m[i ,] <- example_New.m[i ,] + tmp } rownames(example_New.m) <- unique_anno.v colnames(example_New.m) <- colnames(example.m) example_New.m <- example_New.m[-which(rownames(example_New.m) %in% NA) ,] Example.m <- example_New.m
Now the scRNA-seq data are ready for the calculation of signaling entropy rate.
LandSCENT
packageBefore start introducing functionalities of LandSCENT
package, we give several points for you to check, in case you skipped the one or two sections above:
And more notes for users with SingleCellExperiment
and CellDataSet
objects:
LandSCENT
accepts these two kinds of objects, and will do normalization inside the function process. So there is no need for you to normalize the data before implement LandSCENT
. But it still requires the scRNA-seq data using the same gene identifier with network matrix.The estimation of differentiation potency with LandSCENT
consists of two major steps:
A typical workflow starts from integration of the scRNA-Seq data with the user-defined gene functional network. Here you can simply apply DoIntegPPI
function on Example.m dataset and PPI network net13Jun12.m:
data(Example.m) Example.m <- Example.m[, !(libsize.drop | mito.drop | spike.drop)]
Integration.l <- DoIntegPPI(exp.m = Example.m, ppiA.m = net13Jun12.m) str(Integration.l)
DoIntegPPI
function takes these two arguments as input. The function finds the overlap between the gene identifiers labeling the network and those labeling the rows of the scRNA-seq matrix, and then extracts the maximally connected subnetwork, specified by the adjMC output argument. Also, the function constructs the reduced scRNA-Seq matrix, specified by the expMC output argument.
With the output object Integration.l, we can now proceed to compute the SR value for any given cell, using the CompSRana
function. It takes five objects as input:
DoIntegPPI
function;parallel
package in this function with a defult mc.cores value of 1.SR.o <- CompSRana(Integration.l, local = TRUE, mc.cores = 40)
Here we run the CompSRana
function with 40 cores, which means the computer should have at least 40 processing cores. In your case, you may set mc.cores value based on your own computer/server.
The output argument SR.o is a list that added four elements onto the input Integration.l:
More for users with SingleCellExperiment
and CellDataSet
classes: the SR values will also be added as a new phenotype information onto the original sce and cds objects with name SR
.
One note with the above step: the local gene-based entropies can be used in downstream analyses for ranking genes according to differential entropy, but only if appropriately normalized. For instance, they could be used to identify the main genes driving changes in the global signaling entropy rate of the network. However, if the user only wishes to estimate potency, specifying local=FALSE is fine, which will save some RAM on the output object, which is why we make it the default option.
As we mentioned above, we provide the example's phenotype infromation in the package, stored in the phenoExample.v vector. With SR values and the phenotype information, you can then check that the SR values do indeed correlate with potency:
data(SR.v) Integration.l$SR <- SR.v SR.o <- Integration.l
boxplot(SR.o$SR ~ phenoExample.v, main = "SR values against cell types", xlab = "Cell Types", ylab = "SR values")
Here hESC and EC refer to human embryonic stem cells and progenitors of endothelial cells, respectively.
Having estimated the cell potency values, we can then infer cell potency states with InferPotency
function. This function infers discrete potency states of single cells and its distribution across the single cell populations.
If you have no phenotype information, you can simply implement this method with above aforementioned object SR.o:
InferPotency.o <- InferPotency(SR.o)
Then the infered distinct potency states for every cell will be stored in the element named potencyState
.
InferPotency.o$potencyState
But if phenotype information is provided, InferPotency
will return the distribution of potency states in relation to the phenotype classes provided (e.g. cell-types):
InferPotency.o <- InferPotency(SR.o, pheno.v = phenoExample.v)
InferPotency.o$distPSPH
This result indicates that InferPotency
function inferred 2 potency states, with the first potency state being occupied only by human embryonic stem cells(hESC), while the second potency state is enriched for mesoderm progenitor cells(EC).
Such results will also be stored for SingleCellExperiment
and CellDataSet
objects.
Next step, we may want to explore the heterogeneity in potency within the cell population and infer lineage relationships. One way to approach this question is to use the InferLandmark
function.
This function identifies potency-coexpression clusters of single cells called landmarks, and finally infers the dependencies of these landmarks which can aid in recontructing lineage trajectories.
With aforementioned result InferPotency.o, one can easily implement this function:
InferLandmark.o <- InferLandmark(InferPotency.o, pheno.v = phenoExample.v, reduceMethod = "PCA", clusterMethod = "PAM", k_pam = 2)
InferLandmark
will do dimension reduction inside, so we provide two arguments, reduceMethod and clusterMethod, for you to decide the methods you want.
PCA
or tSNE
.PCA
, then InferLandmark
will return the coordinates with top two components of PCA result and store them in InferLandmark.o$coordinates.If you choose tSNE
, InferLandmark
will firstly implement RMT method on the data matrix to estimate the number of significant components. Then it will do PCA under the estimated number of components. With such PCA result, tSNE will then be performed to generate data coordinates in two-dimensional space. The coordinates will also be stored in InferLandmark.o$coordinates.
For clusterMethod, one can select PAM
or dbscan
.
PAM
: you need to specific the maximal number of the clustes, whose corresponding argument is k_pam
.dbscan
: we provide two arguments, eps_dbscan
and minPts_dbscan
, for you to optimize the clustering result. And you need to tune these two arguments based on your own dataset. The default values are 10 and 5, respectively. The implementation is also very straightforward:InferLandmark.o <- InferLandmark(InferPotency.o, pheno.v = phenoExample.v, reduceMethod = "PCA", clusterMethod = "dbscan", eps_dbscan = 10, minPts_dbscan = 5)
The other results, except coordinates, of this function will be stored in a list named InferLandmark.l inside the output object InferLandmark.o. For example, you can access the landmark index for every single cell with:
InferLandmark.o$InferLandmark.l$cl
You can also access the cell number distribution of phenotype against landmark with:
InferLandmark.o$InferLandmark.l$distPHLM
For more information, you can check the help pages.
It is important to note that the differentiation potency estimation step can also be applied on bulk RNA-seq data. The procedure is identical with single-cell RNA-Seq data, the only difference is in the specific preprocessing and normalization of the data.
data(tsne.o) data(potS.v) data(SR4.v) potS.v <- 4 - potS.v InferLandmark.o <- list(potencyState = potS.v, SR = SR4.v) library(marray)
Here, we used an example dataset contains 3473 human breast epithelial cells from [@BreData] to show our density based visualiazation function Plot_LandSR
and Plot_CellSR
.
Plot_LandSR
functionThis function generates figures which compares cell density across different distinct potency states, which are inftered by InferPotency
function.
We provide arguments for you to choose whether to inherent dimension reduction result from the InferLandmark.o object, or to input the reduced diemension coordinates yourself.
And you can specify the color between all cell density and distinct potency cell density.
The horizon lines of these density maps decrease from left to right, which indicates the cell potency states decrease from high (PS1) to low (PS3).
LandSR.o <- Plot_LandSR(InferLandmark.o, coordinates = tsne.o, colpersp = NULL, colimage = NULL, bty = "f", PDF = FALSE)
The output list LandSR.o will store the input coordinates as an element, in LandSR.o$coordinates. you can easily access it with
LandSR.o$coordinates
Plot_CellSR
functionThis function generates figures that shows cell density on top of cell SR value distribution.
We also provide arguments for you to choose whether to inherent dimension reduction result from the InferLandmark.o object, or to input the reduced diemension coordinates yourself.
And you can specify the color of cell density and SR values distribution.
Note that the input argument num-grid is sensitive to the data size, which indicates the number of grid points in each direction. So it should be well asigned bsed on your own dataset. See more details in package help page.
CellSR.o <- Plot_CellSR(InferLandmark.o, coordinates = tsne.o, num_grid = 35, theta = 40, colpersp = NULL, colimage = NULL, bty = "f", PDF = FALSE)
The output list CellSR.o will also store the input coordinates as an element, in CellSR.o$coordinates.
In LandSCENT
, we provide a function named DoLandSCENT
for users do not care much about the intermedium results but the final information.
It takes the scRNA-seq data and network matrix with some control arguments as its inputs. And return a list contains SR values, potency states and other results after running all the necessary functions. You can also control the plot via PLOT and PDF arguments.
DoLandSCENT.o <- DoLandSCENT(exp.m = Example.m, ppiA.m = net13Jun12.m, mc.cores = 30, pheno.v = NULL, coordinates = NULL, PLOT = FALSE, PDF = FALSE)
In the latest update, we have integrated diffusion maps from package destiny
with our potency estimation,i.e. SR values, to infer differentiation trajectory.
Here we provide two functions:
1. DoDiffusionMap
: This fuction gives utility of constructing diffusion map and selecting the root cell of the tajectory.
2. Plot_DiffusionMap
: Based on the result of DoDiffusionMap
, users can easily plot the diffusion maps with Plot_DiffusionMap
function.
DoDiffusionMap
functionTypically, the main input of DoDiffusionMap
would be the output from InferLandmark
function. It also requires you to specify several arguments based on your own dataset:
DoDiffusionMap.o <- DoDiffusionMap(Integration.l, mean_gap = 1, sd_gap = 1, root = c("cell", "state"), num_comp = 3, k = 30)
In the output object DoDiffusionMap.o
, three new elements will be added:
In case you do not like the way we generate the plot or you want to add some other stuff, these elements are easy to transfer.
Plot_DiffusionMap
functionThis function is quite straightfoward. With aforementioned result DoDiffusionMap.o
, one can easily run it with the following commands:
The default set of Plot_DiffusionMap
would give you a 3D version figure. This function estimates diffusion pseudotime from the diffusion maps constructed from DoDiffusionMap
. So you could choose to color the cells by its SR values or diffusion pseudotime(DPT) with color_by parameter.
Plot_DiffusionMap(DoDiffusionMap.o, dim = c(1, 2, 3), color_by = "SR", phi = 40, theta = 135, bty = "g", PDF = FALSE)
In the figure, Plot_DiffusionMap
automatically highlights the root cell and predicted terminal cells. And we provide two arguments phi and theta to tune the angel in order to have a better view.
There three NOTES need to be addressed:
DoDiffusionMap
function, the plot would fail.ggplot2
to generate the figure. So it will be convient for you to add some customized feature with ggplot2
library functions. As we have shown in the following:Plot_DiffusionMap(DoDiffusionMap.o, dim = c(1, 2), color_by = "DPT", TIPs = c(1, 2, 3), PDF = FALSE) + annotate("text", x = -0.02, y = -0.007, label = "Basal", size = 7) + annotate("text", x = 0.013, y = 0.053, label = "Lum2", size = 7) + annotate("text", x = 0.013, y = -0.023, label = "Lum1", size = 7)
If users provided SingleCellExperiment
or CellDataSet
objects as the input, it then can be easily to extract the sce or cds from the results. The sce or cds objects are stored in the list with name data.sce and data.cds, respectively. And SR values and potency states are already been added to their phenotype information, so you can get such information in the colData or pData, respectively.
Take SR.o as an example:
result.sce <- SR.o$data.sce result.cds <- SR.o$data.cds
With such object, you can then easily interact with monocle, scater and other packages.
sessionInfo()
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.