library(knitr) opts_chunk$set(collapse = TRUE)
Pathway Expression Profiles (PEPs) are based on the expression of all
the pathways (or generic gene sets) belonging to a collection under a
given experimental condition, as opposed to individual
genes. gep2pep
supports the conversion of gene expression profiles
(GEPs) to PEPs and performs enrichment analysis of both pathways and
conditions.
gep2pep
creates a local repository of gene sets, which can also be
imported from the MSigDB database [1]. The local repository is in the
repo
format. When a GEP, defined as a ranked list of genes, is
passed to buildPEPs
, the stored database of pathways is used to
convert the GEP to a PEP and permanently store the latter.
One type of analysis that can be performed on PEPs and that is
directly supported by gep2pep
is the Drug-Set Enrichment Analysis
(DSEA, see reference below). It finds pathways that are consistently
dysregulated by a set of drugs, as opposed to a background of other
drugs. Of course PEPs may refer to non-pharmacological conditions
(genetic perturbations, disease states, etc.) for analogous
analyses. See the CondSEA
function.
A complementary approach is that of finding conditions under which a
set of pathways is consistently UP- or DOWN-regulated. This is the
pathway-based version of the Gene Set Enrichment Analysis (GSEA). As
an application example, this approach can be used to find drugs
mimicking the dysregulation of a gene by looking for drugs
dysregulating the pathways involving the gene (this has been published
as the gene2drug
tool [3]). See the PathSEA
function.
This vignette uses toy data, as real data can be computationally
expensive. Connectivity Map data [4] (drug induced gene expression
profiles) pre-converted to PEPs can be downloaded from
http://dsea.tigem.it in the gep2pep
format. At the end of this
vignette, a precomputed example is reported in which that data is
used.
In order to use the package, it must be loaded as follows (the GSEABase package will also be used in this vignette to access gene set data):
library(gep2pep) suppressMessages(library(GSEABase))
The MSigDB is a
curated database of gene set collections. The entire database can be
downloaded as a single XML file and used by gep2pep
. The following
commented code would import the database once downloaded (gep2pep uses
a slight variation of the BroadCollection
type used by MSigDB, named
CategorizedCollection
, thus a conversion is necessary):
## db <- importMSigDB.xml("msigdb_v6.1.xml") ## db <- as.CategorizedCollection(db)
However, for this vignette a small excerpt will be used.
db <- loadSamplePWS()
The database is in GSEABase::GeneSetCollection
format and
includes 30 pathways, each of which is included in one of 3 different
collections. Following MSigDB conventions, each collection is
identified by a "category" and "subCategory" fields. But the
CategorizedCollection type allows them to be arbitrary strings, as
opposed to MSigDB categories. This allows for the creation of custom
collection with categories. gep2pep
puts together into a
single collection identifier using makeCollectionIDs
.
colltypes <- sapply(db, collectionType) cats <- sapply(colltypes, attr, "category") subcats <- sapply(colltypes, attr, "subCategory") print(cats) print(subcats) makeCollectionIDs(db)
In order to build a local gep2pep
repository containing pathway
data, createRepository
is used:
repoRoot <- file.path(tempdir(), "gep2pep_data") rp <- createRepository(repoRoot, db)
The repository is in repo
format (see later in this document how to
access the repository directly). However, knowing repo
is not
necessary to use gep2pep
. The following lists the contents of the
repository, loads the GeneSetCollection
object containing all the
TFT database gene sets and finally shows the description of the
pathway "E47_01":
rp TFTsets <- loadCollection(rp, "c3_TFT") TFTsets description(TFTsets[["E47_01"]])
rp$get
is a repo
command, see later in this vignette.
Pathway Expression Profiles (PEPs) are created from Gene Expression Profiles (GEPs) using pathway information from the repository. GEPs must be provided as a matrix with rows corresponding to genes and columns corresponding to conditions (conditions). Genes and conditions must be specified through row and column names respectively. The values must be ranks: for each condition, the genes must be ranked from that being most UP-regulated (rank 1) to that being most DOWN-regulated (rank equal to the number of rows of the matrix).
One well known database that can be obtained in this format is for
example the Connectivty Map. A small excerpt (after further
processing) is included with the gep2pep
. The excerpt must be
considered as a dummy example, as it only includes 500 genes for 5
conditions. It can be loaded as follows:
geps <- loadSampleGEP() dim(geps) geps[1:5, 1:3]
The GEPs can be converted to PEPs using the buildPEPs
function. They
are stored as repository items by the names "category_subcategory".
buildPEPs(rp, geps)
Each PEP is composed of an Enrichment Score (ES) -- p-value (PV) pair associated to each pathway. ESs and PVs are stored in two separated matrices. For each condition, the p-value reports wether a pathway is significantly dysregulated and the sign of the corresponding ES indicates the direction (UP- or DOWN-regulation).
loadESmatrix(rp, "c3_TFT")[1:3, 1:3] loadPVmatrix(rp, "c3_TFT")[1:3, 1:3]
Suppose the stored PEPs correspond to pharmacological
perturbations. Then gep2pep
can perform Drug-Set Enrichment Analysis
(DSEA, see Napolitano et al., 2016, Bioinformatics). It finds pathways
that are consistently dysregulated by a set of drugs, as opposed to a
background of other drugs. Of course PEPs may refer to
non-pharmacological conditions (genetic perturbations, disease states,
etc.) for analogous analyses (Condition-Set Enrichment Analysis,
CondSEA). Given a set pgset
of drugs of interest, CondSEA (which in
this case is a DSEA) is performed as follows:
pgset <- c("(+)_chelidonine", "(+/_)_catechin") psea <- CondSEA(rp, pgset)
The result is a list of of 2 elements, named "CondSEA" and "details", the most important of which is the former. Per-collection results can be accessed as follows:
getResults(psea, "c3_TFT")
In this dummy example the statistical background is made of only 3 GEPs (we added 5 in total), thus, as expected, there are no significant p-values. For the c3_MIR collection, the pathway most UP-regulated by the chosen set of two drugs is M5012, while the most DOWN-regulated is M18759. They are respectively described as:
sets <- loadCollection(rp, "c3_MIR") wM5012 <- which(sapply(sets, setIdentifier)=="M5012") wM18759 <- which(sapply(sets, setIdentifier)=="M18759") description(sets[[wM5012]]) description(sets[[wM18759]])
The analysis can be exported in XLS format as follows:
exportSEA(rp, psea)
Performing CondSEA
using Pathway Expression Profiles derived from
drug-induced gene expression profiles yields Drug Set Enrichment
Analysis (DSEA [2]).
A complementary approach to CondSEA is Pathway-Set Enrichment Analysis (PathSEA). PathSEA searches for conditions that consistently dysregulate a set of pathways. It can be seen as a pathway-based version of the popular Gene Set Enrichment Analysis (GSEA). The PathSEA is run independently in each pathway collection.
pathways <- c("M11607", "M10817", "M16694", ## from c3_TFT "M19723", "M5038", "M13419", "M1094") ## from c4_CGN w <- sapply(db, setIdentifier) %in% pathways subdb <- db[w] psea <- PathSEA(rp, subdb)
getResults(psea, "c3_TFT")
PathSEA results are analogous to those of CondSEA, but condition-wise. A set of pathways con also be obtained starting from a gene of interest, for example:
pathways <- gene2pathways(rp, "FAM126A") pathways
Using a gene to obtain the pathways and performing PathSEA
with
drug-induced Pathway Expression Profiles yields "gene2drug" analysis,
see the following reference:
Precomputed Pathway Expression Profiles of the Connectivity Map data in the gep2pep format can be downloaded, unpacked and opened as follows:
download.file("http://dsea.tigem.it/data/Cmap_MSigDB_v6.1_PEPs.tar.gz", "Cmap_MSigDB_v6.1_PEPs.tar.gz") untar("Cmap_MSigDB_v6.1_PEPs.tar.gz") rpBig <- openRepository("Cmap_MSigDB_v6.1_PEPs")
Using these data, two kinds of analysis can be performed:
Drug Set Enrichment Analysis, which looks for commond pathways shared by a set of drugs.
Gene2drug analysis, which looks for drugs dysregulating a gene of interest.
The analyses below are not built at runtime with this document and could become outdated.
Drug Set Enrichment Analysis for a set of HDAC inhibitors using the Gene Ontology collections can be performed as follows:
csea <- CondSEA(rpBig, c("scriptaid", "trichostatin_a", "valproic_acid", "vorinostat", "hc_toxin", "bufexamac"), collections=c("C5_BP", "C5_MF", "C5_CC")) ## [16:41:40] Working on collection: C5_BP ## [16:41:42] Common conditions removed from bgset ## [16:41:42] Row-ranking collection ## [16:41:48] Computing enrichments ## [16:41:58] done ## [16:41:58] Working on collection: C5_MF ## [16:41:58] Row-ranking collection ## [16:42:00] Computing enrichments ## [16:42:02] done ## [16:42:02] Working on collection: C5_CC ## [16:42:02] Row-ranking collection ## [16:42:03] Computing enrichments ## [16:42:04] done
The following code retrieves information about the top 10 pathways ranked by CondSEA in GO-MF.
library(GSEABase) setids <- sapply(loadCollection(rpBig, "C5_MF"), setIdentifier) MFresults <- getResults(csea, "C5_MF") w <- match(rownames(MFresults)[1:10], setids) top10 <- loadCollection(rpBig, "C5_MF")[w] sapply(top10, setName) ## [1] "GO_TRANSCRIPTION_FACTOR_ACTIVITY_PROTEIN_BINDING" ## [2] "GO_TRANSCRIPTION_COACTIVATOR_ACTIVITY" ## [3] "GO_PHOSPHATIDYLCHOLINE_1_ACYLHYDROLASE_ACTIVITY" ## [4] "GO_RETINOIC_ACID_RECEPTOR_BINDING" ## [5] "GO_PRE_MRNA_BINDING" ## [6] "GO_N_ACETYLTRANSFERASE_ACTIVITY" ## [7] "GO_CYTOSKELETAL_PROTEIN_BINDING" ## [8] "GO_PEPTIDE_N_ACETYLTRANSFERASE_ACTIVITY" ## [9] "GO_ACETYLTRANSFERASE_ACTIVITY" ## [10] "GO_HYDROGEN_EXPORTING_ATPASE_ACTIVITY"
Note that 2 main effects of HDAC inhibitors have been correctly identified: regulation of transcription, and alteration of the acetylation/deacetylation homeostasis. The full analysis can be exported to the Excel format with:
exportSEA(rpBig, csea)
A Gene2drug analysis can be performed starting by a gene of interest, for example the TFEB gene. Pathways including the gene are found as follows:
pws <- gene2pathways(rpBig, "TFEB")
The following code runs the PathSEA analysis on the pathways involving TFEB. Also in this case the analysis is performed on Gene Ontology collections. Note that a warning is thrown as the GO-CC category has no annotation for TFEB (this is ok).
psea <- PathSEA(rpBig, pws, collections=c("C5_BP", "C5_MF", "C5_CC")) ## Warning: [17:17:13] There is at least one selected collections for ## which no pathway has been provided ## [17:17:13] Removing pathways not in specified collections ## [17:17:13] Working on collection: C5_BP ## [17:17:13] Common pathway sets removed from bgset ## [17:17:15] Column-ranking collection ## [17:17:22] Computing enrichments ## [17:17:29] done ## [17:17:29] Working on collection: C5_MF ## [17:17:29] Common pathway sets removed from bgset ## [17:17:29] Column-ranking collection ## [17:17:30] Computing enrichments ## [17:17:32] done
Thus the top 10 drugs causing (or mimicking) TFEB upregulation are:
getResults(psea, "C5_BP")[1:10,] ## ES PV ## loperamide 0.7324720 1.504075e-11 ## proadifen 0.7278256 2.079448e-11 ## hydroquinine 0.7220082 3.110434e-11 ## bepridil 0.6904276 2.616027e-10 ## clomipramine 0.6891810 2.839879e-10 ## alexidine 0.6741085 7.574142e-10 ## digitoxigenin 0.6737685 7.741670e-10 ## lanatoside_c 0.6651556 1.342553e-09 ## helveticoside 0.6642112 1.425479e-09 ## ouabain 0.6631157 1.527949e-09
Note that the top drug, loperamide, has been demonstrated to induce TFEB translocation at very low concentrations [3]. As before, the full analysis can be exported to the Excel format with:
exportSEA(rpBig, psea)
gep2pep
users don't need to understand how to interact with a
gep2pep
repository, however it can be useful in some
cases. gep2pep
repositories are in the repo
format (see the repo
package), so they can be accessed as any other repo
repository. However item tags should not be changed, as they are used
by gep2pep
to identify data types. Each gep2pep
repository always
contains a special project item including repository and session
information, which can be shown as follows:
rp$info("gep2pep repository")
Poject name and description can be provided when creating the
repository (createRepository
), or edited with rp$set("gep2pep
repository", newname="my project name", description="my project
description")
. The repository will also contain an item for each
pathway collection, and possibly an item for each corresponding PEP
collection, as in this example:
rp
In order to look at the space that the repository is using, the following command can be used:
rp$info()
put
, set
and other repo
commands can be used to alter repository
contents directly, however this could leave the repository in an
inconsistent state. The following code checks a repository for
possible problems:
checkRepository(rp)
The last check (summary of commond conditions) ensures that the same conditions have been computed for all the pathway collections, which is however not mandatory.
Additional methodological help can be found at:
unlink(repoRoot, TRUE)
[1] Subramanian A. et al. Gene set enrichment analysis: A knowledge-based approach for interpreting genome-wide expression profiles. PNAS 102, 15545-15550 (2005).
[2] Napolitano F. et al, Drug-set enrichment analysis: a novel tool to investigate drug mode of action. Bioinformatics 32, 235-241 (2016).
[3] Napolitano F. et al, gene2drug: a Computational Tool for Pathway-based Rational Drug Repositioning, bioRxiv (2017) 192005; doi: https://doi.org/10.1101/192005
[4] Lamb, J. et al. The Connectivity Map: Using Gene-Expression Signatures to Connect Small Molecules, Genes, and Disease. Science 313, 1929-1935 (2006).
gep2pep
version r packageVersion("gep2pep")
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.