fea: FEA Methods
In yduan004/signatureSearch: Environment for Gene Expression Searching Combined with Functional Enrichment Analysis

dsea_GSEA

R Documentation

FEA Methods

Description

The Drug Set Enrichment Analysis (DSEA) with GSEA algorithm (dsea_GSEA function) performs DSEA with the GSEA algorithm from Subramanian et al. (2005). In case of DSEA, drug identifiers combined with their ranking scores of an upstream GESS method are used, such as the NCS values from the LINCS method. To use drug instead of gene labels for GSEA, the former are mapped to functional categories, including GO or KEGG, based on drug-target interaction annotations provided by databases such as DrugBank, ChEMBL, CLUE or STITCH.

The DSEA with Hypergeometric Test (dsea_hyperG) performs DSEA based on the hypergeometric distribution. In case of DSEA, the identifiers of the top ranking drugs from a GESS result table are used. To use drug instead of gene labels for this test, the former are mapped to functional categories, including GO, KEGG or Mode of Action (MOA) categories, based on drug-target interaction annotations provided by databases such as DrugBank, ChEMBL, CLUE or STITCH. Currently, the MOA annotation used by this function are from the CLUE website (https://clue.io).

Compared to the related Target Set Enrichment Analysis (TSEA), the DSEA approach has the advantage that the drugs in the query test sets are usually unique allowing to use them without major modifications to the underlying statistical method(s).

The Target Set Enrichment Analysis (TSEA) with hypergeometric test (tsea_dup_hyperG function) performs TSEA based on a modified hypergeometric test that supports test sets with duplications. This is achieved by maintaining the frequency information of duplicated items in form of weighting values.

The TSEA with mGSEA algorithm (tsea_mGSEA function) performs a Modified Gene Set Enrichment Analysis (mGSEA) that supports test sets (e.g. genes or protein IDs) with duplications. The duplication support is achieved by a weighting method for duplicated items, where the weighting is proportional to the frequency of the items in the test set.

The TSEA with meanAbs (tsea_mabs) method is a simple but effective functional enrichment statistic (Fang et al., 2012). As required for TSEA, it supports query label sets (here for target proteins/genes) with duplications by transforming them to score ranked label lists and then calculating mean absolute scores of labels in label set S.

Usage

dsea_GSEA(
  drugList,
  type = "GO",
  ont = "BP",
  exponent = 1,
  nPerm = 1000,
  minGSSize = 10,
  maxGSSize = 500,
  pvalueCutoff = 0.05,
  pAdjustMethod = "BH"
)

dsea_hyperG(
  drugs,
  type = "GO",
  ont = "BP",
  pvalueCutoff = 0.05,
  pAdjustMethod = "BH",
  qvalueCutoff = 0.2,
  minGSSize = 10,
  maxGSSize = 500
)

tsea_dup_hyperG(
  drugs,
  universe = "Default",
  type = "GO",
  ont = "MF",
  pAdjustMethod = "BH",
  pvalueCutoff = 0.05,
  qvalueCutoff = 0.05,
  minGSSize = 5,
  maxGSSize = 500,
  dt_anno = "all",
  readable = FALSE
)

tsea_mGSEA(
  drugs,
  type = "GO",
  ont = "MF",
  nPerm = 1000,
  exponent = 1,
  pAdjustMethod = "BH",
  pvalueCutoff = 0.05,
  minGSSize = 5,
  maxGSSize = 500,
  verbose = FALSE,
  dt_anno = "all",
  readable = FALSE
)

tsea_mabs(
  drugs,
  type = "GO",
  ont = "MF",
  nPerm = 1000,
  pAdjustMethod = "BH",
  pvalueCutoff = 0.05,
  minGSSize = 5,
  maxGSSize = 500,
  dt_anno = "all",
  readable = FALSE
)

Arguments

`drugList`	named numeric vector, where the names represent drug labels and the numeric component scores. This can be all drugs of a GESS result that are ranked by GESS scores, such as NCS scores from the LINCS method. Note, drugs with scores of zero are ignored by this method.
`type`	one of 'GO', 'KEGG' or 'Reactome' if TSEA methods. `type` can also be set as 'MOA' is DSEA methods are used.
`ont`	character(1). If type is 'GO', assign `ont` (ontology) one of 'BP','MF', 'CC' or 'ALL'. If type is 'KEGG' or 'Reactome', `ont` is ignored.
`exponent`	integer value used as exponent in GSEA algorithm. It defines the weight of the items in the item set `S`.
`nPerm`	integer defining the number of permutation iterations for calculating p-values
`minGSSize`	integer, minimum size of each gene set in annotation system. Annotation categories with less than `minGSSize` genes/drugs will be ignored by enrichment test. If `type` is 'MOA', it may be beneficial to set `minGSSize` to lower values (e.g. 2) than for other functional annotation systems. This is because certain MOA categories contain only 2 drugs.
`maxGSSize`	integer, maximum size of each gene set in annotation system. Annotation categories with more genes/drugs annotated than `maxGSSize` will be ignored by enrichment test.
`pvalueCutoff`	double, p-value cutoff to return only enrichment results for functional categories meeting a user definable confidence threshold
`pAdjustMethod`	p-value adjustment method, one of 'holm', 'hochberg', 'hommel', 'bonferroni', 'BH', 'BY', 'fdr'
`drugs`	character vector containing drug identifiers used for functional enrichment testing. This can be the top ranking drugs from a GESS result. Internally, drug test sets are translated to the corresponding target protein test sets based on the drug-target annotations provided under the `dt_anno` argument.
`qvalueCutoff`	double, qvalue cutoff, similar to `pvalueCutoff`
`universe`	character vector defining the universe of genes/proteins. If set as 'Default', it uses all genes/proteins present in the corresponding annotation system (e.g. GO, KEGG or Reactome). If 'type' is 'GO', it can be assigned a custom vector of gene SYMBOL IDs. If 'type' is 'KEGG' or 'Reactome', the vector needs to contain Entrez gene IDs.
`dt_anno`	drug-target annotation source. It is the same argument as the `database` argument of the `get_targets` function. Usually, it is recommended to set the 'dt_anno' to 'all' since it provides the most complete drug-target annotations. Choosing a single annotation source results in sparser drug-target annotations (particularly CLUE), and thus less complete enrichment results.
`readable`	TRUE or FALSE, it applies when type is 'KEGG' or 'Reactome' indicating whether to convert gene Entrez ids to gene Symbols in the 'itemID' column in the result table.
`verbose`	TRUE or FALSE, print message or not

Details

The classical hypergeometric test assumes uniqueness in its test sets. To maintain the duplication information in the test sets used for TSEA, the values of the total number of genes/proteins in the test set and the number of genes/proteins in the test set annotated at a functional category are adjusted by maintaining their frequency information in the test set rather than counting each entry only once. Removing duplications in TSEA would be inappropriate since it would erase one of the most important pieces of information of this approach.

The original GSEA method proposed by Subramanian et at., 2005 uses predefined gene sets S defined by functional annotation systems such as GO and KEGG. The goal is to determine whether the genes in S are randomly distributed throughout a ranked test gene list L (e.g. all genes ranked by log2 fold changes) or enriched at the top or bottom of the test list. This is expressed by an Enrichment Score (ES) reflecting the degree to which a set S is overrepresented at the extremes of L.

For TSEA, the query is a target protein set where duplicated entries need to be maintained. To perform GSEA with duplication support, here referred to as mGSEA, the target set is transformed to a score ranked target list L_tar of all targets provided by the corresponding annotation system. For each target in the query target set, its frequency is divided by the number of targets in the target set, which is the weight of that target. For targets present in the annotation system but absent in the target set, their scores are set to 0. Thus, every target in the annotation system will be assigned a score and then sorted decreasingly to obtain L_tar.

In case of TSEA, the original GSEA method cannot be used directly since a large portion of zeros exists in L_tar. If the scores of the genes in set S are all zeros, N_R (sum of scores of genes in set S) will be zero, which cannot be used as the denominator. In this case, ES is set to -1. If only some genes in set S have scores of zeros then N_R is set to a larger number to decrease the weight of the genes in S that have non-zero scores.

The reason for this modification is that if only one gene in gene set S has a non-zero score and this gene ranks high in L_tar, the weight of this gene will be 1 resulting in an ES(S) close to 1. Thus, the original GSEA method will score the gene set S as significantly enriched. However, this is undesirable because in this example only one gene is shared among the target set and the gene set S. Therefore, giving small weights (lowest non-zero score in L_tar) to genes in S that have zero scores could decrease the weight of the genes in S that have non-zero scores, thereby decreasing the false positive rate. To favor truly enriched functional categories (gene set S) at the top of L_tar, only gene sets with positive ES are selected.

The input for the mabs method is L_tar, the same as for mGSEA. In this enrichment statistic, mabs(S), of a label (e.g. gene/protein) set S is calculated as mean absolute scores of the labels in S. In order to adjust for size variations in label set S, 1000 random permutations of L_tar are performed to determine mabs(S,pi). Subsequently, mabs(S) is normalized by subtracting the median of the mabs(S,pi) and then dividing by the standard deviation of mabs(S,pi) yielding the normalized scores Nmabs(S). Finally, the portion of mabs(S,pi) that is greater than mabs(S) is used as nominal p-value (Fang et al., 2012). The resulting nominal p-values are adjusted for multiple hypothesis testing using the Benjamini-Hochberg method.

Value

feaResult object, the result table contains the enriched functional categories (e.g. GO terms or KEGG pathways) ranked by the corresponding enrichment statistic.

Column description

Descriptions of the columns in FEA result tables stored in the feaResult object that can be accessed with the result method in tabular format, here tibble.

ont: in case of GO, one of BP, MF, CC, or ALL
ID: GO or KEGG IDs
Description: description of functional category
GeneRatio: ratio of genes in the test set that are annotated at a specific GO node or KEGG pathway
BgRatio: ratio of background genes that are annotated at a specific GO node or KEGG pathway
itemID: IDs of items (genes for TSEA, drugs for DSEA) overlapping among test and annotation sets.
setSize: size of the functional category
pvalue from tsea_dup_hyperG: raw p-value of enrichment test
p.adjust: p-value adjusted for multiple hypothesis testing based on method specified under pAdjustMethod argument
qvalue: q value calculated with R's qvalue function to control FDR
enrichmentScore: ES from the GSEA algorithm (Subramanian et al., 2005). The score is calculated by walking down the gene list L, increasing a running-sum statistic when we encounter a gene in S and decreasing when it is not. The magnitude of the increment depends on the gene scores. The ES is the maximum deviation from zero encountered in the random walk. It corresponds to a weighted Kolmogorov-Smirnov-like statistic.
NES: Normalized enrichment score. The positive and negative enrichment scores are normalized separately by permutating the composition of the gene list L nPerm times, and dividing the enrichment score by the mean of the permutation ES with the same sign.
pvalue from tsea_mGSEA: The nominal p-value of the ES is calculated using a permutation test. Specifically, the composition of the gene list L is permuted and the ES of the gene set is recomputed for the permutated data generating a null distribution for the ES. The p-value of the observed ES is then calculated relative to this null distribution.
leadingEdge: Genes in the gene set S (functional category) that appear in the ranked list L at, or before, the point where the running sum reaches its maximum deviation from zero. It can be interpreted as the core of a gene set that accounts for the enrichment signal.
ledge_rank: Ranks of genes in 'leadingEdge' in gene list L.
mabs: given a scored ranked gene list L, mabs(S) represents the mean absolute scores of the genes in set S.
Nmabs: normalized mabs(S)

References

GSEA algorithm: Subramanian, A., Tamayo, P., Mootha, V. K., Mukherjee, S., Ebert, B. L., Gillette, M. A., Mesirov, J. P. (2005). Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles. Proceedings of the National Academy of Sciences of the United States of America, 102(43), 15545-15550. URL: https://doi.org/10.1073/pnas.0506580102

MeanAbs algorithm: Fang, Z., Tian, W., & Ji, H. (2012). A network-based gene-weighting approach for pathway analysis. Cell Research, 22(3), 565-580. URL: https://doi.org/10.1038/cr.2011.149

Examples

data(drugs10)

############ DSEA GSEA method ############
dl <- c(rev(seq(0.1, 0.5, by=0.05)), 0)
names(dl)=drugs10
## KEGG annotation system
# gsea_k_res <- dsea_GSEA(drugList=dl, type="KEGG", exponent=1, nPerm=100, 
#                         pvalueCutoff=0.5, minGSSize=2)
# result(gsea_k_res)

############### DSEA Hypergeometric Test ###########
## GO annotation system
# hyperG_res <- dsea_hyperG(drugs=drugs10, type="GO", ont="MF")
# result(hyperG_res)
## KEGG annotation system
# hyperG_k_res <- dsea_hyperG(drugs=drugs10, type="KEGG", 
#                             pvalueCutoff=1, qvalueCutoff=1, 
#                             minGSSize=10, maxGSSize=500)
# result(hyperG_k_res) 

############### TSEA dup_hyperG method ########
## GO annotation system
# res1 <- tsea_dup_hyperG(drugs=drugs10, universe="Default", 
#                         type="GO", ont="MF", pvalueCutoff=0.05,
#                         pAdjustMethod="BH", qvalueCutoff=0.1, 
#                         minGSSize=5, maxGSSize=500)
# result(res1)
#
## KEGG annotation system
# res2 <- tsea_dup_hyperG(drugs=drugs10, type="KEGG", 
#                         pvalueCutoff=0.1, qvalueCutoff=0.2, 
#                         minGSSize=10, maxGSSize=500)
#
## Reactome annotation system
# res3 <- tsea_dup_hyperG(drugs=drugs10, type="Reactome", 
#                         pvalueCutoff=1, qvalueCutoff=1)

############# TSEA mGSEA method ############
## GO annotation system
# res1 <- tsea_mGSEA(drugs=drugs10, type="GO", ont="MF", exponent=1, 
#                    nPerm=1000, pvalueCutoff=1, minGSSize=5)
# result(res1)
# res2 <- tsea_mGSEA(drugs=drugs10, type="KEGG", exponent=1, 
#                    nPerm=100, pvalueCutoff=1, minGSSize=5)
# result(res2)
## Reactome annotation system
# res3 <- tsea_mGSEA(drugs=drugs10, type="Reactome", pvalueCutoff=1)
# result(res3)

############# MeanAbs method ##############
## GO annotation system
# res1 <- tsea_mabs(drugs=drugs10, type="GO", ont="MF", nPerm=1000, 
#                   pvalueCutoff=0.05, minGSSize=5)
# result(res1)
## KEGG annotation system
# res2 <- tsea_mabs(drugs=drugs10, type="KEGG", nPerm=1000, 
#                   pvalueCutoff=0.05, minGSSize=5)
# result(res2)
## Reactome annotation system
# res3 <- tsea_mabs(drugs=drugs10, type="Reactome", pvalueCutoff=1)
# result(res3)

yduan004/signatureSearch documentation built on Feb. 19, 2024, 9:30 a.m.