require(BiocStyle)

Introduction

The main purpose of the LandSCENT package is to provide a means of estimating the differentiation potency of single cells without the need to assume prior biological knowledge (e.g. marker expression or timepoint). As such, it may provide a more unbiased means for assessing potency or pseudotime. The package features:

This document gives a detailed tutorial of the LandSCENT package from data normalization to result visualization. LandSCENT package requires two main sources of input:

How to prepare these inputs for using in LandSCENT will be described in detail in the following sections.

User defined functional gene network

LandSCENT requires as input a user defined functional gene network, for instance, a protein-protein interaction(PPI) network including the main interactions that take place in a cell. Although these networks are mere caricatures of the underlying signaling networks, ignoring time, spatial and biological contexts, one of the discoveries made recently is that cell potency appears to be encoded by a subtle positive correlation between transcriptome and connectome, with hubs in these networks generally exhibit higher expression in more potent cells. For details we refer the reader to our publications given at the end of this vignette[@SCENT1] and [@SCENT4].

In this vignette we will use a previously defined PPI network to calculate all the example results. The specific PPI network we use here is derived from Pathway Commons, which is an integrated resource collating together PPIs from several distinct sources. In particular, the network is constructed by integrating the following sources: the Human Protein Reference Database (HPRD), the National Cancer Institute Nature Pathway Interaction Database (NCI-PID), the Interactome (Intact) and the Molecular Interaction Database (MINT).

We have stored the networks in the package. There are two versions of protein-protein interaction (PPI) network under filenames "net17Jan2016.m.RData" and "net13Jun12.m.RData"(early version). You can access these with the data function. Here we use the early version network:

library(LandSCENT)
data(net13Jun12.m)

Importantly, the nodes (genes) in this network are labeled with Entrez gene IDs, and entries take on values "0" and "1", with "0" indicating that there is no interaction or connection between the two genes, and ”1” indicating there an interaction has been reported. It is also important to note that the diagonal entries are set to "0".

Single cell RNA-seq data

We assume that you have a matrix/object containing expression count data summarised at the level of genes. You then need to do quality control and normalization on the data.

If you have a normalized data matrix already, you can directly go to the 4th section of this vignette.

Moreover, we provide input arguements for SingleCellExperiment (scater) and CellDataSet (monocle) class in DoIntegPPI function. If you have objects of such two classes, you could also directly go to the 4th section.

Here we use a scRNA-Seq dataset from [@ChuData], generated with the Fluidigm C1 platform, as an example. Due to the package size restriction, we cannot store the whole data matrix in the package. However, the example is a sub-dataset of 100 cells with two cell types: pluripotent human embryonic stem cells (hESC) and non-pluripotent progenitors of endothelial cells (EC). You can access the raw data and phenotype information using data function.

data(rawExample.m)
data(phenoExample.v)

The full dataset can be downloaded from the GEO website under accession number GSE75748, and the specific file to download is one of the supplementary file under filename "GSE75748_sc_cell_type_ec.csv.gz". You can download the dataset with getGEOSuppFiles function in r Biocpkg("GEOquery") package and extract the raw expression matrix from the supplementary files.

require(GEOquery)
require(Biobase)
GSE75748 <- getGEOSuppFiles("GSE75748")
gunzip(rownames(GSE75748)[3])
rawdata.m <- as.matrix(read.csv("GSE75748/GSE75748_sc_cell_type_ec.csv", row.names = 1))

We also provide the phenotype information of the full dataset in the package. It can be easily loaded into your session using data function.

data(phenoscChu.v)

Quality control

Here we use r Biocpkg("scater") package to do quality control and normalizaion on the raw data. Since LandSCENT is rather robust to different normalization methods, you can always choose more suitable workflow for your own dataset, just guaranteeing the normalized data meet the specific requirements of LandSCENT package. First, we create a SingleCellExperiment object containing the rawdata. Rows of the object correspond to genes, while columns correspond to single cells.

require(scater)
example.sce <- SingleCellExperiment(assay = list(counts = rawExample.m))

Then, we detect low-quality cells based on library size and number of expressed genes in each library. We also select cells with low proportion of mitochondrial genes and spike-in RNA. Using the isOutlier function in r Biocpkg("scater") package with default arrguments, we remove 4 cells and most of them (3 cells) are filtered out because of high spike-in gene content.

### Detect mitochondrial gene and spike-in RNA

is.mito <- grepl("^MT", rownames(example.sce))
is.spike <- grepl("^ERCC", rownames(example.sce))

counts(example.sce) <- as(counts(example.sce), "dgCMatrix")
example.QC <- perCellQCMetrics(example.sce, subsets=list(Spike=is.spike, Mito=is.mito), flatten=FALSE)

### Cell Filtering wih isOutlier function

libsize.drop <- isOutlier(example.QC$total, nmads=5, type="lower", log=TRUE);
mito.drop <- isOutlier(example.QC$subsets$Mito$sum, nmads=5, type="higher");
spike.drop <- isOutlier(example.QC$subsets$Spike$sum, nmads=10, type="higher");

filter_example.sce <- example.sce[, !(libsize.drop | mito.drop | spike.drop)]
phenoExample.v <- phenoExample.v[!(libsize.drop | mito.drop | spike.drop)]
data.frame(ByLibSize=sum(libsize.drop), ByMito=sum(mito.drop), 
           BySpike=sum(spike.drop), Remaining=ncol(filter_example.sce))
example.sce <- filter_example.sce

Normalization

After quality control, we then move to the normalization part. r Biocpkg("scater") package defines the size factors from the scaled library sizes of all cells.

sizeFactors(example.sce) <- librarySizeFactors(example.sce)

Scaling normalization is then used to remove cell-specific biases, e.g. coverage or capture efficiency. Log-transformed normalized expression values can be simply computed with normalize function.

Importantly, we need to add an offset value of 1.1 before log-transformation. The offset is added in order to ensure that the minimum value after log-transformation would not be 0, but a nonzero value (typically log2(1.1)~0.14). We do not want zeroes in our final matrix since the computation of signaling entropy rate involves ratios of gene expression values and zeros in the denominator are undefined.

example.sce <- logNormCounts(example.sce, pseudo_count = 1.1)
example.m <- as.matrix(assay(example.sce, i = "logcounts"))
min(example.m)

Check gene identifier

Because LandSCENT will integrate the network with scRNA-Seq data, the row names and column names of the network must use the same gene identifier as used in the scRNA-Seq data.

In our example, rownames of the data matrix are all human gene symbols. So we need to use mapIds function in r Biocpkg("AnnotationDbi") package along with r Biocpkg("org.Hs.eg.db") package to get corresponding human Entrez gene ID.

require(AnnotationDbi)
require(org.Hs.eg.db)
anno.v <- mapIds(org.Hs.eg.db, keys = rownames(example.m), keytype = "SYMBOL", 
                 column = "ENTREZID", multiVals = "first")
unique_anno.v <- unique(anno.v)
example_New.m <- matrix(0, nrow = length(unique_anno.v), ncol = dim(example.m)[2])
for (i in seq_len(length(unique_anno.v))) {
  tmp <- example.m[which(anno.v == unique_anno.v[i]) ,]
  if (!is.null(dim(tmp))) {
    tmp <- colSums(tmp) / dim(tmp)[1]
  }
  example_New.m[i ,] <- example_New.m[i ,] + tmp
}
rownames(example_New.m) <- unique_anno.v
colnames(example_New.m) <- colnames(example.m)
example_New.m <- example_New.m[-which(rownames(example_New.m) %in% NA) ,]
Example.m <- example_New.m

Now the scRNA-seq data are ready for the calculation of signaling entropy rate.

How to use LandSCENT package

Before start introducing functionalities of LandSCENT package, we give several points for you to check, in case you skipped the one or two sections above:

And more notes for users with SingleCellExperiment and CellDataSet objects:

Differentiation potency estimation

The estimation of differentiation potency with LandSCENT consists of two major steps:

A typical workflow starts from integration of the scRNA-Seq data with the user-defined gene functional network. Here you can simply apply DoIntegPPI function on Example.m dataset and PPI network net13Jun12.m:

data(Example.m)
Example.m <- Example.m[, !(libsize.drop | mito.drop | spike.drop)]
Integration.l <- DoIntegPPI(exp.m = Example.m, ppiA.m = net13Jun12.m)
str(Integration.l)

DoIntegPPI function takes these two arguments as input. The function finds the overlap between the gene identifiers labeling the network and those labeling the rows of the scRNA-seq matrix, and then extracts the maximally connected subnetwork, specified by the adjMC output argument. Also, the function constructs the reduced scRNA-Seq matrix, specified by the expMC output argument.

With the output object Integration.l, we can now proceed to compute the SR value for any given cell, using the CompSRana function. It takes five objects as input:

SR.o <- CompSRana(Integration.l, local = TRUE, mc.cores = 40)

Here we run the CompSRana function with 40 cores, which means the computer should have at least 40 processing cores. In your case, you may set mc.cores value based on your own computer/server.

The output argument SR.o is a list that added four elements onto the input Integration.l:

More for users with SingleCellExperiment and CellDataSet classes: the SR values will also be added as a new phenotype information onto the original sce and cds objects with name SR.

One note with the above step: the local gene-based entropies can be used in downstream analyses for ranking genes according to differential entropy, but only if appropriately normalized. For instance, they could be used to identify the main genes driving changes in the global signaling entropy rate of the network. However, if the user only wishes to estimate potency, specifying local=FALSE is fine, which will save some RAM on the output object, which is why we make it the default option.

As we mentioned above, we provide the example's phenotype infromation in the package, stored in the phenoExample.v vector. With SR values and the phenotype information, you can then check that the SR values do indeed correlate with potency:

data(SR.v)
Integration.l$SR <- SR.v
SR.o <- Integration.l
boxplot(SR.o$SR ~ phenoExample.v, main = "SR values against cell types", xlab = "Cell Types", ylab = "SR values")

Here hESC and EC refer to human embryonic stem cells and progenitors of endothelial cells, respectively.

Infer the potency states in a cell population

Having estimated the cell potency values, we can then infer cell potency states with InferPotency function. This function infers discrete potency states of single cells and its distribution across the single cell populations.

If you have no phenotype information, you can simply implement this method with above aforementioned object SR.o:

InferPotency.o <- InferPotency(SR.o)

Then the infered distinct potency states for every cell will be stored in the element named potencyState.

InferPotency.o$potencyState

But if phenotype information is provided, InferPotency will return the distribution of potency states in relation to the phenotype classes provided (e.g. cell-types):

InferPotency.o <- InferPotency(SR.o, pheno.v = phenoExample.v)
InferPotency.o$distPSPH

This result indicates that InferPotency function inferred 2 potency states, with the first potency state being occupied only by human embryonic stem cells(hESC), while the second potency state is enriched for mesoderm progenitor cells(EC).

Such results will also be stored for SingleCellExperiment and CellDataSet objects.

Infer potency-coexpression clusters (landmarks)

Next step, we may want to explore the heterogeneity in potency within the cell population and infer lineage relationships. One way to approach this question is to use the InferLandmark function.

This function identifies potency-coexpression clusters of single cells called landmarks, and finally infers the dependencies of these landmarks which can aid in recontructing lineage trajectories.

With aforementioned result InferPotency.o, one can easily implement this function:

InferLandmark.o <- InferLandmark(InferPotency.o, pheno.v = phenoExample.v,
                                 reduceMethod = "PCA", clusterMethod = "PAM",
                                 k_pam = 2)

InferLandmark will do dimension reduction inside, so we provide two arguments, reduceMethod and clusterMethod, for you to decide the methods you want.

  1. For reduceMethod, one can choose PCA or tSNE.
  2. If you choose PCA, then InferLandmark will return the coordinates with top two components of PCA result and store them in InferLandmark.o$coordinates.
  3. If you choose tSNE, InferLandmark will firstly implement RMT method on the data matrix to estimate the number of significant components. Then it will do PCA under the estimated number of components. With such PCA result, tSNE will then be performed to generate data coordinates in two-dimensional space. The coordinates will also be stored in InferLandmark.o$coordinates.

  4. For clusterMethod, one can select PAM or dbscan.

  5. Clustering via PAM: you need to specific the maximal number of the clustes, whose corresponding argument is k_pam.
  6. Clustering via dbscan: we provide two arguments, eps_dbscan and minPts_dbscan, for you to optimize the clustering result. And you need to tune these two arguments based on your own dataset. The default values are 10 and 5, respectively. The implementation is also very straightforward:
InferLandmark.o <- InferLandmark(InferPotency.o, pheno.v = phenoExample.v,
                                 reduceMethod = "PCA", clusterMethod = "dbscan",
                                 eps_dbscan = 10, minPts_dbscan = 5)

The other results, except coordinates, of this function will be stored in a list named InferLandmark.l inside the output object InferLandmark.o. For example, you can access the landmark index for every single cell with:

InferLandmark.o$InferLandmark.l$cl

You can also access the cell number distribution of phenotype against landmark with:

InferLandmark.o$InferLandmark.l$distPHLM

For more information, you can check the help pages.

Application on bulk samples

It is important to note that the differentiation potency estimation step can also be applied on bulk RNA-seq data. The procedure is identical with single-cell RNA-Seq data, the only difference is in the specific preprocessing and normalization of the data.

Density based visualization tool

data(tsne.o)
data(potS.v)
data(SR4.v)
potS.v <- 4 - potS.v
InferLandmark.o <- list(potencyState = potS.v, SR = SR4.v)
library(marray)

Here, we used an example dataset contains 3473 human breast epithelial cells from [@BreData] to show our density based visualiazation function Plot_LandSR and Plot_CellSR.

Plot_LandSR function

This function generates figures which compares cell density across different distinct potency states, which are inftered by InferPotency function.

We provide arguments for you to choose whether to inherent dimension reduction result from the InferLandmark.o object, or to input the reduced diemension coordinates yourself.

And you can specify the color between all cell density and distinct potency cell density.

The horizon lines of these density maps decrease from left to right, which indicates the cell potency states decrease from high (PS1) to low (PS3).

LandSR.o <- Plot_LandSR(InferLandmark.o, coordinates = tsne.o, colpersp = NULL, 
                        colimage = NULL, bty = "f", PDF = FALSE)

The output list LandSR.o will store the input coordinates as an element, in LandSR.o$coordinates. you can easily access it with

LandSR.o$coordinates

Plot_CellSR function

This function generates figures that shows cell density on top of cell SR value distribution.

We also provide arguments for you to choose whether to inherent dimension reduction result from the InferLandmark.o object, or to input the reduced diemension coordinates yourself.

And you can specify the color of cell density and SR values distribution.

Note that the input argument num-grid is sensitive to the data size, which indicates the number of grid points in each direction. So it should be well asigned bsed on your own dataset. See more details in package help page.

CellSR.o <- Plot_CellSR(InferLandmark.o, coordinates = tsne.o, 
                        num_grid = 35, theta = 40, colpersp = NULL, 
                        colimage = NULL, bty = "f", PDF = FALSE)

The output list CellSR.o will also store the input coordinates as an element, in CellSR.o$coordinates.

A "pipeline" function for easy use

In LandSCENT, we provide a function named DoLandSCENT for users do not care much about the intermedium results but the final information. It takes the scRNA-seq data and network matrix with some control arguments as its inputs. And return a list contains SR values, potency states and other results after running all the necessary functions. You can also control the plot via PLOT and PDF arguments.

DoLandSCENT.o <- DoLandSCENT(exp.m = Example.m, ppiA.m = net13Jun12.m,
                             mc.cores = 30, pheno.v = NULL, 
                             coordinates = NULL,
                             PLOT = FALSE, PDF = FALSE)

Employ diffusion maps to infer differentiation trajectory

In the latest update, we have integrated diffusion maps from package destiny with our potency estimation,i.e. SR values, to infer differentiation trajectory.

Here we provide two functions: 1. DoDiffusionMap: This fuction gives utility of constructing diffusion map and selecting the root cell of the tajectory. 2. Plot_DiffusionMap: Based on the result of DoDiffusionMap, users can easily plot the diffusion maps with Plot_DiffusionMap function.

DoDiffusionMap function

Typically, the main input of DoDiffusionMap would be the output from InferLandmark function. It also requires you to specify several arguments based on your own dataset:

  1. The mean_gap, sd_gap is the mean and standard deviation threshold for selecting highly variable genes(HVGs) inside function, respectively.
  2. The root argument has two type of choices: cell and state.
  3. cell means function will choose the root cell to be the one with the highest SR values among all the cells.
  4. state means the function will first cluster the high potency cells and choose a group of cells with highest cluster-median SR values. Then the root cell will be assigned to the highest SR valued cell inside the choosen cluster.
  5. The num_comp defines how many diffusion components you want to calculate. And this would influence the following plot procedure.
  6. k is the number of nearest neighbors to consider while constructing the diffusion maps.
DoDiffusionMap.o <- DoDiffusionMap(Integration.l,
                                   mean_gap = 1, sd_gap = 1,
                                   root = c("cell", "state"),
                                   num_comp = 3,
                                   k = 30)

In the output object DoDiffusionMap.o, three new elements will be added:

  1. A diffusion map object in DM
  2. The diffusion compoents in DMEigen
  3. The index of root cell is stored in root

In case you do not like the way we generate the plot or you want to add some other stuff, these elements are easy to transfer.

Plot_DiffusionMap function

This function is quite straightfoward. With aforementioned result DoDiffusionMap.o, one can easily run it with the following commands:

The default set of Plot_DiffusionMap would give you a 3D version figure. This function estimates diffusion pseudotime from the diffusion maps constructed from DoDiffusionMap. So you could choose to color the cells by its SR values or diffusion pseudotime(DPT) with color_by parameter.

Plot_DiffusionMap(DoDiffusionMap.o,
                  dim = c(1, 2, 3),
                  color_by = "SR",
                  phi = 40,
                  theta = 135,
                  bty = "g",
                  PDF = FALSE)

In the figure, Plot_DiffusionMap automatically highlights the root cell and predicted terminal cells. And we provide two arguments phi and theta to tune the angel in order to have a better view.

There three NOTES need to be addressed:

  1. For argument dim, this is a vector virable. It indicates which dimension to use and in what order. For example, dim = c(1,2,3) means use diffusion component 1,2 and 3 to generrate figures, and the x-axis relates to DC1, y-axis relates to DC2 and z-axis relates to DC3. While dim = c(2,3,5) means x-axis relates to DC2, y-axis relates to DC3 and z-axis relates to DC5. More Importantly, If you specify negative numbers in dim (e.g. dim = c(2,3,-5)), then the corresponding DC5 will be flipped. Moreover, if the maximum dimension you choose here is larger than num_comp in DoDiffusionMap function, the plot would fail.
  2. We provide an option in color_by to be DPT, which gives a diffusion pseudotime(DPT) estimation. In some cases, like the cell number is too little, or the cell clusters are seprated too far with each other, the estimation may fail. So before you generate the figure with DPT, you may need to look into the diffusion map itself, and make sure the DPT estimation could be carry out correctly.
  3. In the 2D version figure, we employed ggplot2 to generate the figure. So it will be convient for you to add some customized feature with ggplot2 library functions. As we have shown in the following:
Plot_DiffusionMap(DoDiffusionMap.o,
                  dim = c(1, 2),
                  color_by = "DPT",
                  TIPs = c(1, 2, 3),
                  PDF = FALSE) +
  annotate("text", x = -0.02, y = -0.007, label = "Basal", size = 7) +
  annotate("text", x = 0.013, y = 0.053, label = "Lum2", size = 7) +
  annotate("text", x = 0.013, y = -0.023, label = "Lum1", size = 7)

Extract objects from function results

If users provided SingleCellExperiment or CellDataSet objects as the input, it then can be easily to extract the sce or cds from the results. The sce or cds objects are stored in the list with name data.sce and data.cds, respectively. And SR values and potency states are already been added to their phenotype information, so you can get such information in the colData or pData, respectively.

Take SR.o as an example:

result.sce <- SR.o$data.sce
result.cds <- SR.o$data.cds

With such object, you can then easily interact with monocle, scater and other packages.

Session information

sessionInfo()

References



ChenWeiyan/LandSCENT documentation built on Aug. 28, 2020, 9:55 p.m.