gsaseq: Generalised gene set testing for RNA-seq data
In missMethyl: Analysing Illumina HumanMethylation BeadChip Data

Description Usage Arguments Details Value Author(s) See Also Examples

Given a user defined list of gene sets, gsaseq will test whether significantly differentially expressed genes are enriched in these gene sets.

gsaseq(
  sig.de,
  universe,
  collection,
  plot.bias = FALSE,
  gene.length = NULL,
  sort = TRUE
)

`sig.de`	Character vector of significant differentially expressed genes to test for gene set enrichment. Must be Entrez Gene ID format.
`universe`	Character vector of all genes analysed in the experiment. Must be Entrez Gene ID format.
`collection`	A list of user specified gene sets to test. Can also be a single character vector gene set. Gene identifiers must be Entrez Gene IDs.
`plot.bias`	Logical, if true a plot showing gene length bias related to differential expression will be displayed.
`gene.length`	A vector containing the gene lengths for each gene in the same order as `universe`.
`sort`	Logical, if TRUE then the output dataframe is sorted by p-value.

This function is a generalised version of goana and kegga from the limma package in that it can take a user-defined list of differentially expressed genes and perform gene set enrichment analysis, and is not limited to only testing GO and KEGG categories. It is not as flexible as goana and kegga. Please note the vector of differentially expressed genes and list of gene sets must be Entrez Gene IDs.

The gsaseq function will test for enrichment using a hypergeometric test if the gene.length parameter is NULL. If the gene.length parameter is supplied then the p-values are derived from Walllenius' noncentral hypergeometric distribution from the BiasedUrn package. Please note that the gene.length parameter must be in the same order and of the same length as the universe parameter.

A data frame with a row for each gene set and the following columns:

`N`	number of genes in the gene set
`DE`	number of genes that are differentially expressed
`P.DE`	p-value for over-representation of the gene set
`FDR`	False discovery rate, calculated using the method of Benjamini and Hochberg (1995).

Belinda Phipson

goana,kegga,camera,roast

## Not run:  # to avoid timeout on Bioconductor build
library(org.Hs.eg.db)
# Use org.Hs.eg.db to extract GO terms
GOtoID <- suppressMessages(select(org.Hs.eg.db, keys=keys(org.Hs.eg.db), 
                                  columns=c("ENTREZID","GO"), 
                                  keytype="ENTREZID"))
head(GOtoID)

# Define the universe as random sample of 20000 genes in the annotation
universe <- sample(unique(GOtoID$ENTREZID),20000)

# Randomly sample 500 genes as DE
de.genes <- sample(universe, 500)

# Generate random gene lengths for genes in universe
# This is based on the true distribution of log(gene length) of genes in the
# hg19 genome
logGL <- rnorm(length(universe),mean=7.9, sd=1.154)
genelength <- exp(logGL)

# Define a list of gene sets containing two GO terms
setname1 <- GOtoID$GO[1]
setname1
keep.set1 <- GOtoID$GO %in% setname1
set1 <- GOtoID$ENTREZID[keep.set1]
setname2 <- GOtoID$GO[2]
setname2
keep.set2 <- GOtoID$GO %in% setname2
set2 <- GOtoID$ENTREZID[keep.set2]
# Make the gene sets into a list
sets <- list(set1, set2)
names(sets) <- c(setname1,setname2)

# Test for enrichment of gene sets with no gene length bias
# The genes are randomly selected so we don't expect significant results
gsaseq(sig.de = de.genes, universe = universe, collection = sets)

# Test for enrichment of gene sets taking into account gene length bias
# Since the gene lengths are randomly generated this shouldn't make much 
# difference to the results
# Using log(gene length) or gene length doesn't make a difference to the 
# p-values because the probability weighting function is transformation
# invariant
gsaseq(sig.de = de.genes, univers = universe, collection = sets, 
gene.length = genelength)

## End(Not run)