knitr::opts_chunk$set( collapse = TRUE, warning = FALSE, message = FALSE, comment = "#>" )
The molecular signatures database (MSigDB) is one of the largest collections of molecular signatures or gene expression signatures. A variety of gene expression signatures are hosted on this database including experimentally derived signatures and signatures representing pathways and ontologies from other curated databases. This rich collection of gene expression signatures (>25,000) can facilitate a wide variety of signature-based analyses, the most popular being gene set enrichment analyses. These signatures can be used to perform enrichment analysis in a DE experiment using tools such as GSEA, fry (from limma) and camera (from limma). Alternatively, they can be used to perform single-sample gene-set analysis of individual transcriptomic profiles using approaches such as singscore, ssGSEA and GSVA.
This package provides the gene sets in the MSigDB in the form of GeneSet
objects. This data structure is specifically designed to store information about gene sets, including their member genes and metadata. Other packages, such as msigdbr
and EGSEAdata
provide these gene sets too, however, they do so by storing them as lists or tibbles. These structures are not specific to gene sets therefore do not allow storage of important metadata associated with each gene set, for example, their short and long descriptions. Additionally, the lack of structure allows creation of invalid gene sets. Accessory functions implemented in the GSEABase
package provide a neat interface to interact with GeneSet
objects.
This package can be installed using the code below:
if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager") BiocManager::install("msigdb")
This ExperimentHub package processes the latest version of the MSigDB database into R objects that can be queried using the GSEABase R/Bioconductor package. The entire database is stored in a GeneSetCollection
object which in turn stores each signature as a GeneSet
object. All empty gene expression signatures (i.e. no genes formed the signature) have been dropped. Data in this package can be downloaded using the ExperimentHub
interface as shown below.
To download the data, we first need to get a list of the data available in the msigdb
package and determine the unique identifiers for each data. The query()
function assists in getting this list.
library(msigdb) library(ExperimentHub) library(GSEABase)
eh = ExperimentHub() query(eh , 'msigdb')
Data can then be downloaded using the unique identifier.
eh[['EH5421']]
Data can also be downloaded using the custom accessor `msigdb::getMsigdb()`:
#use the custom accessor to select a specific version of MSigDB msigdb.hs = getMsigdb(org = 'hs', id = 'SYM', version = '7.4') msigdb.hs
KEGG gene sets cannot be integrated within this ExperimentHub package due to licensing limitations. However, users can download, process and integrate the data directly from the MSigDB when needed. This can be done using the code that follows.
msigdb.hs = appendKEGG(msigdb.hs) msigdb.hs
A GeneSetCollection object is effectively a list therefore all list processing functions such as length
and lapply
can be used to process its constituents
length(msigdb.hs)
Each signature is stored in a GeneSet
object and can be processed using functions in the GSEABase
R/Bioconductor package.
gs = msigdb.hs[[1000]] gs #get genes in the signature geneIds(gs) #get collection type collectionType(gs) #get MSigDB category bcCategory(collectionType(gs)) #get MSigDB subcategory bcSubCategory(collectionType(gs)) #get description description(gs) #get details details(gs)
We can also summarise some of these values across the entire database. Description of these codes can be found at the MSigDB website (https://www.gsea-msigdb.org/gsea/msigdb).
#calculate the number of signatures in each category table(sapply(lapply(msigdb.hs, collectionType), bcCategory)) #calculate the number of signatures in each subcategory table(sapply(lapply(msigdb.hs, collectionType), bcSubCategory)) #plot the distribution of sizes hist(sapply(lapply(msigdb.hs, geneIds), length), main = 'MSigDB signature size distribution', xlab = 'Signature size')
Most gene set analysis is performed within specific collections rather than across the entire database. This package comes with functions to subset specific collections. The list of all collections and sub-collections present within a GeneSetCollection object can be listed using the functions below:
listCollections(msigdb.hs) listSubCollections(msigdb.hs)
Specific collections can be retrieved using the code below:
#retrieeve the hallmarks gene sets subsetCollection(msigdb.hs, 'h') #retrieve the biological processes category of gene ontology subsetCollection(msigdb.hs, 'c5', 'GO:BP')
Any gene-set collection can be easily transformed for usage with limma::fry
by first transforming it into a list of gene IDs and following that with a transformation to indices as shown below.
library(limma) #create expression data allg = unique(unlist(geneIds(msigdb.hs))) emat = matrix(0, nrow = length(allg), ncol = 6) rownames(emat) = allg colnames(emat) = paste0('sample', 1:6) head(emat)
#retrieve collections hallmarks = subsetCollection(msigdb.hs, 'h') msigdb_ids = geneIds(hallmarks) #convert gene sets into a list of gene indices fry_indices = ids2indices(msigdb_ids, rownames(emat)) fry_indices[1:2]
The mouse MSigDB has been created in collaboration with Gordon K. Smyth and Alex Garnham from WEHI. The code they use to generate the mouse MSigDB has been used in this package. Detailed description of the steps conducted to convert human gene expression signatures to mouse can be found at http://bioinf.wehi.edu.au/MSigDB. Mouse homologs for human genes were obtained using the HCOP database (as of 18/03/2021).
All the above functions apply to the mouse MSigDB and can be used to interact with the collection.
msigdb.mm = getMsigdb(org = 'mm', id = 'SYM', version = '7.4') msigdb.mm
sessionInfo()
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.