suppressPackageStartupMessages({ library(ToPASeq) library(graphite) })
This package de-novo implements or adjusts the existing implementations of several different methods for topology-based pathway analysis of gene expression data from microarray and RNA-Seq technologies.
Traditionally, in pathway analysis differentially expressed genes are mapped to reference pathways derived from databases and relative enrichment is assessed. Methods of topology-based pathway analysis are the last generation of pathway analysis methods that take into account the topological structure of a pathway, which helps to increase specificity and sensitivity of the results.
This package implements eight topology-based pathway analysis methods that focus on identification of the pathways that are differentially affected between two condition. Each method is implemented as a single wrapper function which allows the user to call a method in a single command.
We start by loading the package.
library(ToPASeq)
The input data are either normalized (count) data or gene expression data as well as pathway topological structure.
For the sake of simplicity, our package offers in each wrapper function a pre-processing step for RNA-seq normalization as implemented in edgeR and DESeq packages. If necessary, the functions also performs differential gene expression analysis through calling limma and DESeq2 packages. Additionally, the user can prepare the data by his preffered method and skip the built-in normalization and/or differential expression analysis.
Pathways and their topological structures are an important input for the analysis. They are represented as graphs $G=(V,E)$, where $V$ denotes a set of vertices or nodes represented by genes and $E \subseteq V \times V$ is a set of edges between nodes (oriented or not, depending on the method) representing the interaction between genes. These structures can be downloaded from public databases such as KEGG or Biocarta or are available through other packages such as r Biocpkg("graphite")
.
ToPASeq is build upon r Biocpkg("graphite")
R-package were patways for multiple species and from multiple database can be obtained. In these pathways, protein complexes are expanded into cliques since it is assumed that all units from one complex interact with each other. A clique, from graph theory, is a subset of vertices such that every two vertices in the subset are connected by an edge. On the other hand, gene families are expanded into separate nodes with same incoming and/or outgoing edges, because they are believed to be interchangable. Additionaly, proteins and metabolites are differentiated in the pathway. Interactions between proteins are propagated through metabolites and viceversa.
For RNA-seq data, we consider transcriptome profiles of four primary human airway smooth muscle cell lines in two conditions: control and treatment with dexamethasone Himes et al., 2014.
We load the r Biocpkg("airway")
dataset
library(airway) data(airway)
For further analysis, we only keep genes that are annotated to an ENSEMBL gene ID.
airSE <- airway[grep("^ENSG", rownames(airway)),] dim(airSE) assay(airSE)[1:4,1:4]
We also remove genes with zero counts.
airSE <- airway[rowSums(assay(airSE))>0,] dim(airSE) assay(airSE)[1:4,1:4]
We define the grouping variable
group <- ifelse(airway$dex == "trt", 1, 0) table(group)
Here, we retrieve human KEGG pathways.
library(graphite) pwys <- pathways(species="hsapiens", database="kegg")[1:5] pwys
As the airway dataset uses ENSEMBL gene IDs, but the nodes of the pathways are based on NCBI Entrez Gene IDs,
nodes(pwys[[1]])
we first map the gene IDs in the pathways from ENTREZ IDs to ENSEMBL.
pwys<-graphite::convertIdentifiers(pwys,"ENSEMBL") nodes(pwys[[1]])
The methods described below can be applied on microarry data by setting argument type
to "MA"
. All methods include also differential expression analysis which is perfomed by r Biocpkg("limma")
package. For RNA-Seq data, voom transformation is applied first and the user can change the method with arguments norm.method
and test.method
.
TopologyGSA represents a multivariable method in which the expression of genes is modelled with Gausian Graphical Models with covariance matrix reflecting the pathway topology. It uses the the Iterative Proportional Scaling algorithm to estimate the covariance matrices. The vectors of means are compared by Hotelling's T2 test. The method has requirements on the minimal number of samples for each pathway. This method was first implemented in the r Biocpkg("TopologyGSA")
package. In here, we have optimalized its performance by using different function for obtaining cliques from each pathway.
The method can be used with a single command
top<-TopologyGSA(assay(airSE), group, pwys, type="RNASeq", nperm=10, norm.method="edgeR") res(top)
Apart from the expected arguments: a gene expression data matrix, a vector of class labels and a list of pathways, the user needs to specify the type
argument which decides on the type of the data ("RNASeq"
for RNA-Seq data). The others arguments are optional. The perms
argument sets the number of permutations to be used in the statistical tests. By default both mean and variance tests are run, this can be changed to only variance test by setting method="var"
. Arguments convertTo
and convertBy
control the conversion of the node labels in the pathways. The default setting is convertTo="none"
which performs no conversion. Please note, that the node labels should be the same as the rownames of gene expression data matrix. The threshold for variance test is specified with alpha
argument. The implementation allows also testing of all the cliques present in the graph by setting testCliques=TRUE
. Please note that these tests may take quite a long time. The implementation returns also a gene-level statistics of the differential expression of genes performed via 'limma' package.
Another multivariable method implemented in the package is r Biocpkg("DEGraph")
. This method assumes the same direction in the differential expresion of genes belonging to a pathway. It performs the regular Hotelling's T2 test in the graph-Fourier space restricted to its first $k$ components which is more powerful than test in the full graph-Fourier space or in the original space.
We apply the method with
deg<-DEGraph(assay(airSE), group, pwys, type="RNASeq", norm.method="edgeR") res(deg)
Apart from the expected arguments: a gene expression data matrix, a vector of class labels and a list of pathways, the user needs to specify the type
argument which decides on the type of the data ("RNASeq"
for RNA-Seq data). The others arguments are optional. Arguments convertTo
and convertBy
control the conversion of the node labels in the pathways. The default setting is convertTo="none"
which performs no conversion. Please note, that the node labels should be the same as the rownames of gene expression data matrix. Since, the DEGraph method runs a statistical test for each connected component of a pathway, a method for assigning a global p-value for whole pathway is needed. The user can select from three approaches: the minimum, the mean and the p-value of the biggest component. This is specified via overall
argument. The implementation returns also a gene-level statistics of the differential expression of genes performed via `limma package.
The last multivariable method available within this package is called clipper. This method is similar to the topologyGSA as it uses the same two-step approach. However, the Iterative Proportional Scaling algorithm was subsituted with a shrinkage procedure of James-Stein-type which additionally allows proper estimates also in the situation when number of samples is smaller than the number of genes in a pathway. The tests on a pathway-level are follwed with a search for the most affected path in the graph.
The method can be applied with
cli<-clipper(assay(airSE), group, pwys,type="RNASeq", method="mean", nperm=10, norm.method="edgeR") res(cli)$results[[1]]
Apart from the expected arguments: a gene expression data matrix, a vector of class labels and a list of pathways, the user needs to specify the type
argument which decides on the type of the data ("RNASeq"
for RNA-Seq data). The others arguments are optional. Arguments convertTo
and convertBy
control the conversion of the node labels in the pathways. The default setting is convertTo="none"
which performs no conversion. Please note, that the node labels should be the same as the rownames of gene expression data matrix. Also, both mean and variance tests are run, this can be changed to only variance test by setting method="var"
. The nperm
controls the number of permutations in the statistical tests. Similarly as in topologyGSA, the implementation allows testing of all the cliques present in the graph by setting testCliques=TRUE
. Please note that these tests may take quite a long time. The implementation returns also a gene-level statistics of the differential expression of genes performed via limma
package.
The function returns two types of the results on pathway-level. The first (printed above), is a table of p-values and q-values related to the differential expression and concentration of the pathways.
The most well-known topology-based pathway analysis method is SPIA. In there, two evidences of differential expression of a pathway are combined. The first evidence is a regular so called overrepresentation analysis in which the statistical significance of the number of differentially expressed genes belonging to a pathway is assessed. The second evidence reflects the pathway topology and it is called the pertubation factor. The authors assume that a differentially expressed gene at the begining of a pathway topology (e.g. a receptor in a signaling pathway) has a stronger effect on the functionality of a pathway than a differentially expressed gene at the end of a pathway (e.g. a transcription factor in a signaling pathway). The pertubation factors of all genes are calculated from a system of linear equations and then combined within a pathway. The two evidences in a form of p-values are finally combined into a global p-value, which is used to rank the pathways.
spi<-SPIA(assay(airSE), group,pwys , type="RNASeq", logFC.th=-1, test.method="edgeR") res(spi)
Apart from the expected arguments: a gene expression data matrix, a vector of class labels and a list of pathways, the user needs to specify the type
argument which decides on the type of the data ("RNASeq"
for RNA-Seq data). Alternatively, the user can supply the results of the differential expression analysis of genes instead of gene expression data matrix in two forms:
type
to "DEtable"
type
to "DElist"
The others arguments are optional. Arguments convertTo
and convertBy
control the conversion of the node labels in the pathways. The default setting is convertTo="none"
which performs no conversion. Please note, that the node labels should be the same as the rownames of gene expression data matrix. The default thresholds for the differential expression analysis of genes are set with arguments logFC.th
and p.val.th
. The user can omit one of these criteria by setting the agrument negative value, as is shown also in the example. The implementation returns also a gene-level statistics of the differential expression of genes. These statistics are later used in the visualization of a selected pathway.
TAPPA was among the first topology-based pathway analysis methods. It was inspired in chemointformatics and their models for predicting the structure of molecules. In TAPPA, the gene expression values are standardized and sigma-transformed within a samples. Then, a pathway is seen a molecule, individual genes as atoms and the energy of a molecule is a score defined for one sample. This score is called Pathway Connectivity Index. The difference of expression is assessed via a common univariable two sample test - Mann-Whitney in our implemetation.
tap<-TAPPA(assay(airSE), group, pwys, type="RNASeq", norm.method = "edgeR") res(tap)
Apart from the expected arguments: a gene expression data matrix, a vector of class labels and a list of pathways, the user needs to specify the type
argument which decides on the type of the data ("RNASeq"
for RNA-Seq data). The others arguments are optional. Arguments convertTo
and convertBy
control the conversion of the node labels in the pathways. The default setting is convertTo="none"
which performs no conversion. Please note, that the node labels should be the same as the rownames of gene expression data matrix. The user can also specified whether the normalization step (standardization and sigma-transformation) should be perfomed (normalize=TRUE
). If verbose=TRUE
, function prints out the titles of pathways as their are analysed. The implementation returns also a gene-level statistics of the differential expression of genes.
PRS is another method that works with gene-level statistics and a list of differentially expresed genes. The pathway topology is incorporated as the number of downstream differentially expressed genes. The gene-level log fold-changes are weigted by this number and sumed up into a pathway-level score. A statistical significance is assessed by a permutations of genes.
Prs<-PRS_wrapper( assay(airSE), group, pwys, type="RNASeq", logFC.th=-1, nperm=10, test.method="edgeR") res(Prs)
Arguments of this functions are almost the same as in SPIA
. Apart from the expected arguments: a gene expression data matrix, a vector of class labels and a list of pathways, the user needs to specify the type
argument which decides on the type of the data ("RNASeq"
for RNA-Seq data). Alternatively, the user can supply the results of the differential expression analysis of genes instead of gene expression data matrix in two forms:
type
to "DEtable"
type
to "DElist"
The others arguments are optional. Arguments convertTo
and convertBy
control the conversion of the node labels in the pathways. The default setting is convertTo="none"
which performs no conversion. Please note, that the node labels should be the same as the rownames of gene expression data matrix. The default thresholds for the differential expression analysis of genes are set with arguments logFC.th
and p.val.th
. The user can omit one of these criteria by setting the agrument negative value, as is shown also in the example. The implementation returns also a gene-level statistics of the differential expression of genes. There is one extra argument nperm
which controls the number of permutations.
\section{PWEA} The last method available in this package is called PathWay Enrichment Analysis (PWEA). This is actually a weigthed form of common Gene Set Enrichment Analysis (GSEA). The weights are called Topological Influence Factor (TIF) and are defined as a geometic mean of ratios of Pearson's correlation coefficient and the distance of two genes in a pathway. The weights of genes outside a pathway are assigned randomly from normal distribution with parameters estimated from the weights of genes in all pathways. A statistical significance of a pathway is assessed via Kolmogorov-Simirnov-like test statistic comparing two cumulative distribution functions with class label permutations.
pwe<-PWEA(assay(airSE), group, pwys, type="RNASeq", nperm=10, test.method="edgeR") res(pwe)
Apart from the expected arguments: a gene expression data matrix, a vector of class labels and a list of pathways, the user needs to specify the type
argument which decides on the type of the data ("RNASeq"
for RNA-Seq data). Alternatively, instead of gene expression data matrix the user can supply a list of observed and random gene-level statistics and set type
to "DEtable"
. The observed gene-level statistics are expected as data frame with columns: ID gene identifiers (they must match with the node labels), logFC log fold-changes, t t-statistics, pval p-values, padj** adjusted p-values.. A data.frame of similar data.frames is expected for random statistics (it is an output from sapply function when the applied function returns a data frame). Columns should refer to the results from individual analyses after class label permutation. The others arguments are optional.Arguments convertTo
and convertBy
control the conversion of the node labels in the pathways. The default setting is convertTo="none"
which performs no conversion. Please note, that the node labels should be the same as the rownames of gene expression data matrix. The alpha
parameter sets a threshold for gene weights. The purpose of this filtering is to reduce the possiblity that a weight of a gene that is tighly correlated with a few genes are lowered by the weak correlation with other genes in a pathway. The implementation returns also a gene-level statistics of the differential expression of genes. The nperm
argument controls the number of permutations.
Any scripts or data that you put into this service are public.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.