GIGSEAdata: Gene set collections for the GIGSEA package

Abstract

GIGSEAdata is the gene set collection used by GIGSEA (Genotype Imputed Gene Set Enrichment Analysis), a SNP enrichment method that uses GWAS-and-eQTL-imputed differential gene expression to test gene set enrichment for trait-associated SNPs. The gene sets are stored as matrices. Because these matrices are largely sparse, we built them as sparse matrices with the R package "Matrix" to save space, and saved them in the GIGSEAdata package.
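To illustrate the storage choice described above, here is a minimal sketch of building a gene-by-gene-set membership matrix in sparse form with the "Matrix" package. The gene and set names are hypothetical toy values, and this is not necessarily how the GIGSEAdata objects were actually constructed.

```r
library(Matrix)

# Hypothetical toy data: one row per (gene, gene set) membership pair.
genes <- c("GENE_A", "GENE_B", "GENE_C", "GENE_D")
sets  <- c("SET_1", "SET_2", "SET_3")
pairs <- data.frame(
  gene = c("GENE_A", "GENE_B", "GENE_B", "GENE_D"),
  set  = c("SET_1",  "SET_1",  "SET_3",  "SET_2")
)

# Only the non-zero entries are stored; every other cell is an implicit 0.
net <- sparseMatrix(
  i = match(pairs$gene, genes),   # row index of each membership
  j = match(pairs$set, sets),     # column index of each membership
  x = 1,                          # value recycled for every stored entry
  dims = c(length(genes), length(sets)),
  dimnames = list(genes, sets)
)

# The dense equivalent holds the same values but stores every cell.
dense <- as.matrix(net)
```

For a matrix with thousands of genes and gene sets and few memberships per set, comparing `object.size(net)` with `object.size(dense)` shows the saving.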

1. Description of gene sets

GIGSEA is built on a weighted linear regression model, so it accepts both discrete-valued and continuous-valued gene sets. The GIGSEA package already includes four categories of gene sets: "MSigDB.KEGG.Pathway", "MSigDB.miRNA", "MSigDB.TF", and "TargetScan.miRNA". The GIGSEAdata package adds two more categories:

1) discrete-valued gene sets:
- org.Hs.eg.GO: Gene sets containing genes annotated with the same Gene Ontology (GO) term. For each GO term, we include not only the genes annotated directly to it, but also the genes annotated to its offspring terms. See the databases "org.Hs.eg.GO.db" and "GO.db" in R.
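The offspring rule for GO terms can be sketched on a toy hierarchy. The term IDs, gene names, and the `children` and `direct` lists below are hypothetical, and the recursion is only an illustration; the actual gene sets were derived from "org.Hs.eg.GO.db" and "GO.db".

```r
# Hypothetical toy ontology: each term maps to its direct child terms.
children <- list(
  "TERM:1" = c("TERM:2", "TERM:3"),
  "TERM:2" = "TERM:4",
  "TERM:3" = character(0),
  "TERM:4" = character(0)
)

# Hypothetical direct annotations: genes annotated to each term itself.
direct <- list(
  "TERM:1" = "GENE_A",
  "TERM:2" = "GENE_B",
  "TERM:3" = c("GENE_B", "GENE_C"),
  "TERM:4" = "GENE_D"
)

# All offspring of a term: its children plus their offspring, recursively.
offspring <- function(term) {
  kids <- children[[term]]
  unique(c(kids, unlist(lapply(kids, offspring))))
}

# Gene set of a term = its own genes plus the genes of all its offspring.
propagated <- lapply(names(direct), function(term) {
  sort(unique(unlist(direct[c(term, offspring(term))])))
})
names(propagated) <- names(direct)

# Encode as a 0/1 matrix with genes in rows and terms in columns.
genes <- sort(unique(unlist(direct)))
net <- sapply(propagated, function(g) as.integer(genes %in% g))
rownames(net) <- genes
```

In this toy example "TERM:1" ends up containing all four genes, because every other term is one of its offspring.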

2) continuous-valued gene sets:
- Fantom5.TF: The human transcript promoter locations were obtained from Fantom5. Based on the promoter locations, the tool MotEvo was used to predict human transcription factor (TF) target sites. The dataset contains 500 Position Weight Matrices (PWMs) and 21964 genes. For each PWM, there is a list of associated human TFs, ordered by percent identity of TFs known to bind sites of the PWM. The list of associations was checked manually. The entire set of PWMs and the mapping to associated TFs are available from the SwissRegulon website http://www.swissregulon.unibas.ch.
- TargetScan.miRNA: Gene sets of predicted human miRNA target sites were downloaded from TargetScan. TargetScan groups miRNAs that have identical subsequences at positions 2 through 8 of the miRNA, i.e. the 2-7 seed region plus the 8th nucleotide, and provides predictions for each such seed motif. It provides a score for each seed motif and each RefSeq transcript, called preferential conservation scoring (aggregate Pct), which shows consistently high performance in various benchmark tests. To obtain a gene-level score, we average the TargetScan Pct scores of all RefSeq transcripts associated with each gene. The resulting dataset comprises 87 miRNA seed motifs and 9861 genes. See http://www.targetscan.org.
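The averaging of transcript-level Pct scores per gene can be sketched as follows. The data frame, its column names, and the values are hypothetical, and filling gene-motif pairs that have no score with 0 is an assumption made for this illustration, not something stated above.

```r
# Hypothetical transcript-level scores: one row per (transcript, seed motif).
scores <- data.frame(
  gene       = c("GENE_A", "GENE_A", "GENE_B", "GENE_B", "GENE_B"),
  transcript = c("TX_1",   "TX_2",   "TX_3",   "TX_4",   "TX_3"),
  seed       = c("SEED_1", "SEED_1", "SEED_1", "SEED_1", "SEED_2"),
  pct        = c(0.2,      0.6,      0.9,      0.5,      0.4)
)

# Average the scores over the listed transcripts of each gene, per seed motif.
# The result is a gene-by-motif matrix of continuous values.
net <- tapply(scores$pct, list(scores$gene, scores$seed), mean)

# Assumption: a gene-motif pair with no score is treated as 0.
net[is.na(net)] <- 0
```

Here GENE_A receives (0.2 + 0.6) / 2 for SEED_1. Note that this sketch averages only over transcripts that have a row for the motif; whether transcripts without a predicted site should count as zeros in the average is not specified in the text.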

2. Load data of gene sets

We first take the gene set "org.Hs.eg.GO" as an example, where each row represents a gene and each column represents a GO term. Each entry is either 0 or 1, where 1 indicates that the gene (row) belongs to the GO term (column) and 0 indicates that it does not.

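library(GIGSEAdata)
data(org.Hs.eg.GO)                 # load the GO gene set collection
class(org.Hs.eg.GO)                # class of the loaded object
names(org.Hs.eg.GO)                # components of the object
dim(org.Hs.eg.GO$net)              # number of genes (rows) and GO terms (columns)
head(colnames(org.Hs.eg.GO$net))   # first few GO terms
head(rownames(org.Hs.eg.GO$net))   # first few genes
head(org.Hs.eg.GO$annot)           # first rows of the annotation component
head(org.Hs.eg.GO$net[,1:30])      # membership of the first genes in the first 30 GO terms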
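Once such an object is loaded, the membership matrix can be queried directly. The sketch below uses a small hypothetical stand-in (`toy`) with the same `net` layout, genes in rows and terms in columns, so that it runs without GIGSEAdata installed; the names and values are invented, and the `annot` component is omitted.

```r
library(Matrix)

# Hypothetical stand-in for a loaded gene set object (only the net component).
toy <- list(
  net = sparseMatrix(
    i = c(1, 2, 2, 3),
    j = c(1, 1, 2, 2),
    x = 1,
    dimnames = list(c("GENE_A", "GENE_B", "GENE_C"), c("TERM:1", "TERM:2"))
  )
)

# Number of genes in each gene set (column sums of the 0/1 matrix).
set_sizes <- colSums(toy$net)

# Names of the genes that belong to one particular term.
members <- rownames(toy$net)[toy$net[, "TERM:2"] != 0]
```

The same two expressions should apply to `org.Hs.eg.GO$net` with a real GO term ID in place of "TERM:2".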