R/signatureSearch-package.r

#' Environment for Gene Expression Signature Searching Combined with 
#' Functional Enrichment Analysis
#'
#' @name signatureSearch-package
#' @aliases signatureSearch-package signatureSearch 
#' @docType package
#' @useDynLib signatureSearch
#' @import Rcpp
#' @description 
#' Welcome to the signatureSearch package! This package implements algorithms 
#' and data structures for performing gene expression signature (GES) searches, 
#' and subsequently interpreting the results functionally with specialized 
#' enrichment methods. These utilities are useful for studying the effects of 
#' genetic, chemical and environmental perturbations on biological systems. 
#' Specifically, in drug discovery they can be used for identifying novel modes 
#' of action (MOA) of bioactive compounds from reference databases such as 
#' LINCS containing the genome-wide GESs from tens of thousands of drug and 
#' genetic perturbations (Subramanian et al. 2017).
#' 
#' A typical GES search (GESS) workflow can be divided into two major steps.
#' First, GESS methods are used to identify perturbagens such as drugs that 
#' induce GESs similar to a query GES of interest. The queries can be drug-, 
#' disease- or phenotype-related GESs. Since the MOAs of most drugs in the 
#' corresponding reference databases are known, the resulting associations are 
#' useful to gain insights into pharmacological and/or disease mechanisms, and 
#' to develop novel drug repurposing approaches.
#' 
#' Second, specialized functional enrichment analysis (FEA) methods using 
#' annotation systems, such as Gene Ontologies (GO), KEGG and Reactome pathways,
#' have been developed and implemented in this package to 
#' efficiently interpret GESS results. The latter are usually composed of lists 
#' of perturbagens (e.g. drugs) ranked by the similarity metric of the 
#' corresponding GESS method.
#' 
#' Finally, network reconstruction functionalities are integrated for 
#' visualizing the final results, e.g. in the form of drug-target networks.
#' 
#' @section Terminology:
#' The term Gene Expression Signatures (GESs) can refer to at least four 
#' different situations of pre-processed gene expression data: (1) normalized 
#' gene expression intensity values (or counts for RNA-Seq); (2) log2 fold 
#' changes (LFC), z-scores or p-values obtained from analysis routines of 
#' differentially expressed genes (DEGs); (3) rank transformed versions of the 
#' expression values obtained under (1) and (2); and (4) gene identifier sets 
#' extracted from the top and lowest ranks under (3), such as n top up/down 
#' regulated DEGs.
#' 
#' @details 
#' The GESS methods include \code{CMAP}, \code{LINCS}, \code{gCMAP}, 
#' \code{Fisher} and \code{Cor}. For detailed descriptions, please see the 
#' help file of each method. Most methods can be easily parallelized for 
#' multiple query signatures.
#' 
#' GESS results are lists of perturbagens (here drugs) ranked by their 
#' signature similarity to a query signature of interest. Interpreting these 
#' search results with respect to the cellular networks and pathways affected 
#' by the top ranking drugs is difficult. To overcome this challenge, the 
#' knowledge of the target proteins of the top ranking drugs can be used to 
#' perform functional enrichment analysis (FEA) based on community annotation 
#' systems, such as Gene Ontologies (GO), pathways (e.g. KEGG, Reactome), drug 
#' MOAs or Pfam domains. For this, the ranked drug sets are converted into 
#' target gene/protein sets to perform Target Set Enrichment Analysis (TSEA) 
#' based on a chosen annotation system. Alternatively, the functional 
#' annotation categories of the targets can be assigned to the drugs directly 
#' to perform Drug Set Enrichment Analysis (DSEA). Although TSEA and DSEA are 
#' related, their enrichment results can be distinct. This is mainly due to 
#' duplicated targets present in the test sets of the TSEA methods, whereas 
#' the drugs in the test sets of DSEA are usually unique. Additional reasons 
#' include differences in the universe sizes used for TSEA and DSEA.
#' 
#' Importantly, the duplications in the test sets of the TSEA are due to the 
#' fact that many drugs share the same target proteins. Standard enrichment 
#' methods would eliminate these duplications since they assume uniqueness 
#' in the test sets. Removing duplications in TSEA would be inappropriate 
#' since it would erase one of the most important pieces of information of 
#' this approach. To solve this problem, we have developed and implemented in 
#' this package weighting methods (\code{dup_hyperG}, \code{mGSEA} and 
#' \code{meanAbs}) for duplicated targets, where the weighting 
#' is proportional to the frequency of the targets in the test set.
#' 
#' Instead of translating ranked lists of drugs into target sets, as for TSEA, 
#' the functional annotation categories of the targets can be assigned to the 
#' drugs directly to perform DSEA instead. Since the drug lists from GESS 
#' results are usually unique, this strategy overcomes the duplication problem 
#' of the TSEA approach. This way classical enrichment methods, such as GSEA or 
#' tests based on the hypergeometric distribution, can be readily applied 
#' without major modifications to the underlying statistical methods. As 
#' explained above, TSEA and DSEA performed with the same enrichment statistics 
#' are not expected to generate identical results. Rather they often complement 
#' each other's strengths and weaknesses.
#' 
#' To perform TSEA and DSEA, drug-target annotations are essential. They can be
#' obtained from several sources, including DrugBank, ChEMBL, STITCH, and the 
#' Touchstone dataset from the LINCS project (https://clue.io/). Most 
#' drug-target annotations provide UniProt identifiers for the target proteins. 
#' They can be mapped, if necessary via their encoding genes, to the chosen 
#' functional annotation categories, such as GO or KEGG. To minimize bias in 
#' TSEA or DSEA, often caused by promiscuous binders, it can be beneficial to 
#' remove drugs or targets that bind to large numbers of distinct proteins or 
#' drugs, respectively.
#' 
#' Note, most FEA tests involving proteins in their test sets are performed on 
#' the gene level in \code{signatureSearch}. This way one can avoid additional 
#' duplications due to many-to-one relationships among proteins and their 
#' encoding genes. For this, the corresponding functions in signatureSearch 
#' will usually translate target protein sets into their encoding gene sets 
#' using identifier mapping resources from R/Bioconductor such as the 
#' \code{org.Hs.eg.db} annotation package. Because of this as well as 
#' simplicity, the text in the vignette and help files of this package will 
#' refer to the targets of drugs almost interchangeably as proteins or genes, 
#' even though the former are the direct targets and the latter only the 
#' indirect targets of drugs.  
#' 
#' @seealso 
#' Methods for GESS:
#'   \itemize{
#'     \item \code{\link{gess_cmap}}, \code{\link{gess_lincs}}, 
#'           \code{\link{gess_gcmap}}, \code{\link{gess_fisher}}, 
#'           \code{\link{gess_cor}}
#'   }
#'    
#' Methods for FEA:
#'   \itemize{
#'      \item TSEA methods:    
#'          \code{\link{tsea_dup_hyperG}}, \code{\link{tsea_mGSEA}},
#'          \code{\link{tsea_mabs}}
#'      
#'      \item DSEA methods:
#'      \code{\link{dsea_hyperG}}, \code{\link{dsea_GSEA}}
#'   }
#' @author
#' \itemize{
#'   \item Yuzhu Duan (yduan004@ucr.edu)
#'   \item Brendan Gongol (bgong001@ucr.edu)
#'   \item Thomas Girke (thomas.girke@ucr.edu)
#' } 
#' 
#' @references 
#' Subramanian, Aravind, Rajiv Narayan, Steven M Corsello, David D Peck, Ted E 
#' Natoli, Xiaodong Lu, Joshua Gould, et al. 2017. A Next Generation 
#' Connectivity Map: L1000 Platform and the First 1,000,000 Profiles. Cell 171 
#' (6): 1437-1452.e17. http://dx.doi.org/10.1016/j.cell.2017.10.049
#' 
#' Lamb, Justin, Emily D Crawford, David Peck, Joshua W Modell, Irene C Blat, 
#' Matthew J Wrobel, Jim Lerner, et al. 2006. The Connectivity Map: Using 
#' Gene-Expression Signatures to Connect Small Molecules, Genes, and Disease. 
#' Science 313 (5795): 1929-35. http://dx.doi.org/10.1126/science.1132939
#' 
#' Sandmann, Thomas, Sarah K Kummerfeld, Robert Gentleman, and Richard Bourgon. 
#' 2014. gCMAP: User-Friendly Connectivity Mapping with R. Bioinformatics 30 
#' (1): 127-28. http://dx.doi.org/10.1093/bioinformatics/btt592
#' 
#' Subramanian, Aravind, Pablo Tamayo, Vamsi K Mootha, Sayan Mukherjee, 
#' Benjamin L Ebert, Michael A Gillette, Amanda Paulovich, et al. 2005. 
#' Gene Set Enrichment Analysis: A Knowledge-Based Approach for Interpreting 
#' Genome-Wide Expression Profiles. Proc. Natl. Acad. Sci. U. S. A. 102 (43): 
#' 15545-50. http://dx.doi.org/10.1073/pnas.0506580102
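#' 
#' @examples
#' # Minimal base-R sketch of the duplicate-weighting idea described in the
#' # details section. The drug-target table below is hypothetical toy data,
#' # not an annotation shipped with this package, and the code only
#' # illustrates the counting step, not the package's own implementation.
#' dt <- data.frame(
#'   drug   = c("drugA", "drugA", "drugB", "drugB", "drugC"),
#'   target = c("HDAC1", "HDAC2", "HDAC1", "HDAC3", "HDAC1"),
#'   stringsAsFactors = FALSE
#' )
#' # TSEA view: targets shared by several drugs stay duplicated, so the
#' # frequency of each target can serve as its weight.
#' target_freq <- table(dt$target)
#' target_freq[["HDAC1"]]
#' # A standard enrichment test would collapse the set to unique targets
#' # and thereby lose this frequency information.
#' unique(dt$target)
#' # Optional filter for promiscuous binders: keep drugs with at most
#' # 'max_targets' distinct targets (the cutoff is an arbitrary assumption).
#' max_targets <- 1
#' n_targets <- tapply(dt$target, dt$drug, function(x) length(unique(x)))
#' names(n_targets)[n_targets <= max_targets]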

NULL


#' Drug Names Used in Examples
#' 
#' A character vector containing the names of the top 10 drugs in the GESS 
#' result from the \code{\link{gess_lincs}} method used in the vignette of 
#' signatureSearch.
#'
#' @name drugs10
#' @aliases drugs10
#' @docType data
#' @examples
#' # Load drugs object
#' data(drugs10)
#' drugs10
#' @keywords datasets
"drugs10"

#' Target Sample Data Set
#' 
#' A named numeric vector with gene symbols as names. It contains the first 
#' 1000 elements from the 'targets' slot of the 'mgsea_res' result object 
#' introduced in the vignette of this package. The scores represent the weights 
#' of the target genes/proteins in the target set of the selected top 10 drugs.
#'
#' @name targetList
#' @aliases targetList
#' @docType data
#' @examples 
#' # Load object
#' data(targetList)
#' head(targetList)
#' tail(targetList)
#' @keywords datasets
"targetList"

#' LINCS 2017 Cell Type Information
#' 
#' It contains cell type (tumor or normal), primary site and subtype 
#' annotations of cells in the LINCS 2017 database. 
#'
#' @name cell_info
#' @aliases cell_info
#' @docType data
#' @format A \code{tibble} object with 30 rows and 4 columns.
#' @examples 
#' # Load object
#' data(cell_info)
#' head(cell_info)
#' @keywords datasets
"cell_info"

#' LINCS 2020 Cell Type Information
#' 
#' It contains cell type (tumor or normal), primary site, subtype and other
#' annotations of cells in the LINCS 2020 database. 
#'
#' @name cell_info2
#' @aliases cell_info2
#' @docType data
#' @format A \code{tibble} object with 240 rows and 21 columns.
#' @examples 
#' # Load object
#' data(cell_info2)
#' head(cell_info2)
#' @keywords datasets
"cell_info2"

#' MOA to Gene Mappings
#' 
#' It is a list containing mappings from MOA terms to Entrez gene IDs obtained 
#' from the ChEMBL database.
#'
#' @name chembl_moa_list
#' @aliases chembl_moa_list
#' @docType data
#' @examples 
#' # Load object
#' data(chembl_moa_list)
#' head(chembl_moa_list)
#' @keywords datasets
"chembl_moa_list"

#' MOA to Drug Name Mappings
#' 
#' It is a list containing mappings from MOA terms to drug names obtained from 
#' the Touchstone database at the CLUE website (https://clue.io/).
#'
#' @name clue_moa_list
#' @aliases clue_moa_list
#' @docType data
#' @examples 
#' # Load object
#' data(clue_moa_list)
#' head(clue_moa_list)
#' @keywords datasets
"clue_moa_list"

#' LINCS Signature Information
#' 
#' It is a tibble of 3 columns containing treatment information of GESs in the
#' LINCS database. The columns contain the perturbation name, cell
#' type and perturbation type (all of them are compound treatment, trt_cp).
#'
#' @name lincs_sig_info
#' @aliases lincs_sig_info
#' @docType data
#' @format A \code{tibble} object with 45,956 rows and 3 columns.
#' @examples 
#' # Load object
#' data(lincs_sig_info)
#' head(lincs_sig_info)
#' @keywords datasets
"lincs_sig_info"

#' LINCS 2017 Perturbation Information
#' 
#' It is a tibble containing annotation information of compounds in the LINCS 
#' 2017 database, including perturbation name, type, whether the compound is in 
#' the Touchstone database, InChI key, canonical SMILES, PubChem CID, as well as 
#' annotations from the ChEMBL database, including ChEMBL ID, DrugBank ID, max 
#' FDA phase, therapeutic flag, first approval, indication class, mechanism of 
#' action, disease efficacy, etc.
#'
#' @name lincs_pert_info
#' @aliases lincs_pert_info
#' @docType data
#' @format A \code{tibble} object with 8,140 rows and 40 columns.
#' @examples 
#' # Load object
#' data(lincs_pert_info)
#' lincs_pert_info
#' @keywords datasets
"lincs_pert_info"

#' LINCS 2020 Perturbation Information
#' 
#' It is a tibble containing annotation information of compounds in the LINCS 
#' 2020 beta database, including perturbation ID, perturbation name, canonical 
#' SMILES, InChI key, compound aliases, target and MOA. The PubChem CID and many 
#' other annotations from the ChEMBL database were obtained from the LINCS 2017 
#' perturbation information by left joining on pert_iname.
#'
#' @name lincs_pert_info2
#' @aliases lincs_pert_info2
#' @docType data
#' @format A \code{tibble} object with 34,419 rows and 48 columns.
#' @examples 
#' # Load object
#' data(lincs_pert_info2)
#' lincs_pert_info2
#' @keywords datasets
"lincs_pert_info2"

#' Instance Information of LINCS Expression Database
#' 
#' It is a tibble of 3 columns containing compound treatment information of 
#' gene expression profile (GEP) instances in the LINCS expression database. 
#' The columns contain the compound name, cell type and perturbation type 
#' (all of them are compound treatments, trt_cp).
#'
#' @name lincs_expr_inst_info
#' @aliases lincs_expr_inst_info
#' @docType data
#' @format A \code{tibble} object with 38,824 rows and 3 columns.
#' @examples 
#' # Load object
#' data(lincs_expr_inst_info)
#' head(lincs_expr_inst_info)
#' @keywords datasets
"lincs_expr_inst_info"