dsea_GSEA | R Documentation |
The Drug Set Enrichment Analysis (DSEA) with GSEA algorithm
(dsea_GSEA
function) performs DSEA
with the GSEA algorithm from Subramanian et al. (2005). In case of DSEA, drug
identifiers combined with their ranking scores of an upstream GESS method are
used, such as the NCS values from the LINCS method. To use drug instead of
gene labels for GSEA, the former are mapped to functional categories,
including GO or KEGG, based on drug-target interaction annotations provided
by databases such as DrugBank, ChEMBL, CLUE or STITCH.
The DSEA with Hypergeometric Test (dsea_hyperG
) performs DSEA
based on the hypergeometric distribution. In case of DSEA, the identifiers of
the top ranking drugs from a GESS result table are used. To use drug
instead of gene labels for this test, the former are mapped to functional
categories, including GO, KEGG or Mode of Action (MOA) categories, based on
drug-target interaction annotations provided by databases such as DrugBank,
ChEMBL, CLUE or STITCH. Currently, the MOA annotation used by this function
are from the CLUE website (https://clue.io).
Compared to the related Target Set Enrichment Analysis (TSEA), the DSEA approach has the advantage that the drugs in the query test sets are usually unique allowing to use them without major modifications to the underlying statistical method(s).
The Target Set Enrichment Analysis (TSEA) with hypergeometric test
(tsea_dup_hyperG
function) performs TSEA based on a modified
hypergeometric test that supports test sets with duplications. This is
achieved by maintaining the frequency information of
duplicated items in form of weighting values.
The TSEA with mGSEA algorithm (tsea_mGSEA
function) performs a
Modified Gene Set Enrichment Analysis (mGSEA) that supports test sets
(e.g. genes or protein IDs) with duplications. The duplication support is
achieved by a weighting method for duplicated items, where the weighting is
proportional to the frequency of the items in the test set.
The TSEA with meanAbs (tsea_mabs
) method is a simple but effective
functional enrichment statistic (Fang et al., 2012). As required for TSEA,
it supports query label sets (here for target proteins/genes) with
duplications by transforming them to score ranked label lists and then
calculating mean absolute scores of
labels in label set S
.
dsea_GSEA(
drugList,
type = "GO",
ont = "BP",
exponent = 1,
nPerm = 1000,
minGSSize = 10,
maxGSSize = 500,
pvalueCutoff = 0.05,
pAdjustMethod = "BH"
)
dsea_hyperG(
drugs,
type = "GO",
ont = "BP",
pvalueCutoff = 0.05,
pAdjustMethod = "BH",
qvalueCutoff = 0.2,
minGSSize = 10,
maxGSSize = 500
)
tsea_dup_hyperG(
drugs,
universe = "Default",
type = "GO",
ont = "MF",
pAdjustMethod = "BH",
pvalueCutoff = 0.05,
qvalueCutoff = 0.05,
minGSSize = 5,
maxGSSize = 500,
dt_anno = "all",
readable = FALSE
)
tsea_mGSEA(
drugs,
type = "GO",
ont = "MF",
nPerm = 1000,
exponent = 1,
pAdjustMethod = "BH",
pvalueCutoff = 0.05,
minGSSize = 5,
maxGSSize = 500,
verbose = FALSE,
dt_anno = "all",
readable = FALSE
)
tsea_mabs(
drugs,
type = "GO",
ont = "MF",
nPerm = 1000,
pAdjustMethod = "BH",
pvalueCutoff = 0.05,
minGSSize = 5,
maxGSSize = 500,
dt_anno = "all",
readable = FALSE
)
drugList |
named numeric vector, where the names represent drug labels and the numeric component scores. This can be all drugs of a GESS result that are ranked by GESS scores, such as NCS scores from the LINCS method. Note, drugs with scores of zero are ignored by this method. |
type |
one of 'GO', 'KEGG' or 'Reactome' if TSEA methods. |
ont |
character(1). If type is 'GO', assign |
exponent |
integer value used as exponent in GSEA algorithm. It defines
the weight of the items in the item set |
nPerm |
integer defining the number of permutation iterations for calculating p-values |
minGSSize |
integer, minimum size of each gene set in annotation system.
Annotation categories with less than |
maxGSSize |
integer, maximum size of each gene set in annotation system.
Annotation categories with more genes/drugs annotated than |
pvalueCutoff |
double, p-value cutoff to return only enrichment results for functional categories meeting a user definable confidence threshold |
pAdjustMethod |
p-value adjustment method, one of 'holm', 'hochberg', 'hommel', 'bonferroni', 'BH', 'BY', 'fdr' |
drugs |
character vector containing drug identifiers used for functional
enrichment testing. This can be the top ranking drugs from a GESS result.
Internally, drug test sets are translated to the corresponding target protein
test sets based on the drug-target annotations provided under the
|
qvalueCutoff |
double, qvalue cutoff, similar to |
universe |
character vector defining the universe of genes/proteins. If set as 'Default', it uses all genes/proteins present in the corresponding annotation system (e.g. GO, KEGG or Reactome). If 'type' is 'GO', it can be assigned a custom vector of gene SYMBOL IDs. If 'type' is 'KEGG' or 'Reactome', the vector needs to contain Entrez gene IDs. |
dt_anno |
drug-target annotation source. It is the same argument as the
|
readable |
TRUE or FALSE, it applies when type is 'KEGG' or 'Reactome' indicating whether to convert gene Entrez ids to gene Symbols in the 'itemID' column in the result table. |
verbose |
TRUE or FALSE, print message or not |
The classical hypergeometric test assumes uniqueness in its test sets. To maintain the duplication information in the test sets used for TSEA, the values of the total number of genes/proteins in the test set and the number of genes/proteins in the test set annotated at a functional category are adjusted by maintaining their frequency information in the test set rather than counting each entry only once. Removing duplications in TSEA would be inappropriate since it would erase one of the most important pieces of information of this approach.
The original GSEA method proposed by Subramanian et at., 2005 uses
predefined gene sets S
defined by functional annotation systems
such as GO and KEGG. The goal is to determine whether the genes in S
are randomly distributed throughout a ranked test gene list L
(e.g. all genes ranked by log2 fold changes) or enriched at the top or
bottom of the test list. This is expressed by an
Enrichment Score (ES
) reflecting the degree to which a set S
is overrepresented at the extremes of L
.
For TSEA, the query is a target protein set where duplicated entries need to
be maintained. To perform GSEA with duplication support, here referred to as
mGSEA, the target set is transformed to a score ranked target list
L_tar
of all targets provided by the
corresponding annotation system. For each target in the query target set,
its frequency is divided by the number of targets in the target set,
which is the weight of that target.
For targets present in the annotation system but absent in the
target set, their scores are set to 0. Thus, every target in the annotation
system will be assigned a score and then sorted decreasingly to obtain
L_tar
.
In case of TSEA, the original GSEA method cannot be used directly since a
large portion of zeros exists in L_tar
. If the scores of the genes in
set S
are all zeros, N_R
(sum of scores of genes in set
S
) will be zero, which cannot be used as the denominator.
In this case, ES
is set to -1. If only some genes in set S
have scores of zeros then N_R
is set to a larger number to decrease
the weight of the genes in S
that have non-zero scores.
The reason for this modification is that if only one gene in gene set
S
has a non-zero score and this gene ranks high in L_tar
,
the weight of this gene will be 1 resulting in an ES(S)
close to 1.
Thus, the original GSEA method will score the gene set S
as
significantly enriched. However, this is undesirable because in this
example only one gene is shared among the target set and the gene set
S
. Therefore, giving small weights (lowest non-zero score in L_tar
)
to genes in S
that have zero scores could decrease the weight of the
genes in S
that have non-zero scores, thereby decreasing the false
positive rate. To favor truly enriched functional categories (gene set S
)
at the top of L_tar
, only gene sets with positive ES
are selected.
The input for the mabs method is L_tar
, the same as for mGSEA. In this
enrichment statistic, mabs(S)
, of a label (e.g. gene/protein) set
S
is calculated as mean absolute scores of the labels in S
. In
order to adjust for size variations in label set S
, 1000 random
permutations of L_tar
are performed to determine mabs(S,pi)
.
Subsequently, mabs(S)
is normalized by subtracting the median of the
mabs(S,pi)
and then dividing by the standard deviation of
mabs(S,pi)
yielding the normalized scores Nmabs(S)
. Finally, the
portion of mabs(S,pi)
that is greater than mabs(S)
is used as
nominal p-value (Fang et al., 2012). The resulting nominal p-values are
adjusted for multiple hypothesis testing using the Benjamini-Hochberg method.
feaResult
object, the result table contains the
enriched functional categories (e.g. GO terms or KEGG pathways) ranked by
the corresponding enrichment statistic.
Descriptions of the columns in FEA result tables stored in the
feaResult
object that can be accessed with the result
method in
tabular format, here tibble
.
ont: in case of GO, one of BP, MF, CC, or ALL
ID: GO or KEGG IDs
Description: description of functional category
GeneRatio: ratio of genes in the test set that are annotated at a specific GO node or KEGG pathway
BgRatio: ratio of background genes that are annotated at a specific GO node or KEGG pathway
itemID: IDs of items (genes for TSEA, drugs for DSEA) overlapping among test and annotation sets.
setSize: size of the functional category
pvalue from tsea_dup_hyperG
: raw p-value of enrichment test
p.adjust: p-value adjusted for multiple hypothesis testing based on method specified under pAdjustMethod argument
qvalue: q value calculated with R's qvalue function to control FDR
enrichmentScore: ES from the GSEA algorithm (Subramanian et al., 2005). The score is calculated by walking down the gene list L, increasing a running-sum statistic when we encounter a gene in S and decreasing when it is not. The magnitude of the increment depends on the gene scores. The ES is the maximum deviation from zero encountered in the random walk. It corresponds to a weighted Kolmogorov-Smirnov-like statistic.
NES: Normalized enrichment score. The positive and negative
enrichment scores are normalized separately by permutating the
composition of the gene list L nPerm
times, and dividing the
enrichment score by the mean of the permutation ES with the same sign.
pvalue from tsea_mGSEA
: The nominal p-value of the ES is
calculated using a permutation test. Specifically, the composition of
the gene list L is permuted and the ES of the gene set is recomputed for
the permutated data generating a null distribution for the ES.
The p-value of the observed ES is then calculated relative to this
null distribution.
leadingEdge: Genes in the gene set S (functional category) that appear in the ranked list L at, or before, the point where the running sum reaches its maximum deviation from zero. It can be interpreted as the core of a gene set that accounts for the enrichment signal.
ledge_rank: Ranks of genes in 'leadingEdge' in gene list L.
mabs: given a scored ranked gene list L
, mabs(S)
represents the mean absolute scores of the genes in set S
.
Nmabs: normalized mabs(S)
GSEA algorithm: Subramanian, A., Tamayo, P., Mootha, V. K., Mukherjee, S., Ebert, B. L., Gillette, M. A., Mesirov, J. P. (2005). Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles. Proceedings of the National Academy of Sciences of the United States of America, 102(43), 15545-15550. URL: https://doi.org/10.1073/pnas.0506580102
MeanAbs algorithm: Fang, Z., Tian, W., & Ji, H. (2012). A network-based gene-weighting approach for pathway analysis. Cell Research, 22(3), 565-580. URL: https://doi.org/10.1038/cr.2011.149
feaResult
,
GO_DATA_drug
data(drugs10)
############ DSEA GSEA method ############
dl <- c(rev(seq(0.1, 0.5, by=0.05)), 0)
names(dl)=drugs10
## KEGG annotation system
# gsea_k_res <- dsea_GSEA(drugList=dl, type="KEGG", exponent=1, nPerm=100,
# pvalueCutoff=0.5, minGSSize=2)
# result(gsea_k_res)
############### DSEA Hypergeometric Test ###########
## GO annotation system
# hyperG_res <- dsea_hyperG(drugs=drugs10, type="GO", ont="MF")
# result(hyperG_res)
## KEGG annotation system
# hyperG_k_res <- dsea_hyperG(drugs=drugs10, type="KEGG",
# pvalueCutoff=1, qvalueCutoff=1,
# minGSSize=10, maxGSSize=500)
# result(hyperG_k_res)
############### TSEA dup_hyperG method ########
## GO annotation system
# res1 <- tsea_dup_hyperG(drugs=drugs10, universe="Default",
# type="GO", ont="MF", pvalueCutoff=0.05,
# pAdjustMethod="BH", qvalueCutoff=0.1,
# minGSSize=5, maxGSSize=500)
# result(res1)
#
## KEGG annotation system
# res2 <- tsea_dup_hyperG(drugs=drugs10, type="KEGG",
# pvalueCutoff=0.1, qvalueCutoff=0.2,
# minGSSize=10, maxGSSize=500)
#
## Reactome annotation system
# res3 <- tsea_dup_hyperG(drugs=drugs10, type="Reactome",
# pvalueCutoff=1, qvalueCutoff=1)
############# TSEA mGSEA method ############
## GO annotation system
# res1 <- tsea_mGSEA(drugs=drugs10, type="GO", ont="MF", exponent=1,
# nPerm=1000, pvalueCutoff=1, minGSSize=5)
# result(res1)
# res2 <- tsea_mGSEA(drugs=drugs10, type="KEGG", exponent=1,
# nPerm=100, pvalueCutoff=1, minGSSize=5)
# result(res2)
## Reactome annotation system
# res3 <- tsea_mGSEA(drugs=drugs10, type="Reactome", pvalueCutoff=1)
# result(res3)
############# MeanAbs method ##############
## GO annotation system
# res1 <- tsea_mabs(drugs=drugs10, type="GO", ont="MF", nPerm=1000,
# pvalueCutoff=0.05, minGSSize=5)
# result(res1)
## KEGG annotation system
# res2 <- tsea_mabs(drugs=drugs10, type="KEGG", nPerm=1000,
# pvalueCutoff=0.05, minGSSize=5)
# result(res2)
## Reactome annotation system
# res3 <- tsea_mabs(drugs=drugs10, type="Reactome", pvalueCutoff=1)
# result(res3)
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.