missMethyl: Analysing Illumina HumanMethylation BeadChip Data

Documented in gsameth topGSA

#' Generalised gene set testing for Illumina's methylation array data
#' 
#' Given a user specified list of gene sets to test, \code{gsameth} tests
#' whether significantly differentially methylated CpG sites are enriched in
#' these gene sets.
#' 
#' This function extends \code{gometh}, which only tests GO and KEGG pathways.
#' \code{gsameth} can take a list of user specified gene sets and test whether
#' the significant CpG sites are enriched in these pathways. \code{gsameth}
#' maps the CpG sites to Entrez Gene IDs and tests for pathway enrichment using
#' Wallenius' concentral hypergeometric test, taking into account the number of 
#' CpG sites per gene on the 450K/EPIC arrays. Please note the gene ids for the 
#' collection of gene,sets must be Entrez Gene IDs. If \code{prior.prob} is set 
#' to FALSE, then prior probabilities are not used and it is assumed that each 
#' gene is equally likely to have a significant CpG site associated with it. 
#' 
#' The testing now also takes into account that some CpGs map to multiple genes. 
#' For a small number of gene families, this previously caused their associated 
#' GO categories/gene sets to be erroneously overrepresented and thus highly 
#' significant. If \code{fract.counts=FALSE} then CpGs are allowed to map to 
#' multiple genes (this is NOT recommended). 
#' 
#' A new feature of \code{gometh} and \code{gsameth} is the ability to restrict 
#' the input CpGs by genomic feature with the argument \code{genomic.features}. 
#' The possible options include "ALL", "TSS200", "TSS1500", "Body", "1stExon", 
#' "3'UTR", "5'UTR" and "ExonBnd", and the user may specify any combination. 
#' Please not that "ExonBnd" is not an annotatedfeature on 450K arrays. For 
#' example if you are interested in the promoter region only, you could specify 
#' \code{genomic.features = c("TSS1500","TSS200","1stExon")}. The default 
#' behaviour is to test all input CpGs \code{sig.cpg} even if the user specifies
#'  "ALL" and one or more other features.
#' 
#' Genes associated with each CpG site are obtained from the annotation 
#' package \code{IlluminaHumanMethylation450kanno.ilmn12.hg19} if the array 
#' type is "450K". For the EPIC array, the annotation package
#' \code{IlluminaHumanMethylationEPICanno.ilm10b4.hg19} is used. To use a
#' different annotation package, please supply it using the \code{anno}
#' argument. 
#' 
#' In order to get a list which contains the mapped Entrez gene IDS,
#' please use the \code{getMappedEntrezIDs} function. 
#' 
#' If you are interested in which genes overlap with the genes in the gene set, 
#' setting \code{sig.genes} to TRUE will output an additional column in the 
#' results data frame that contains all the significant differentially 
#' methylated gene symbols, comma separated. The default is FALSE.
#' 
#' @param sig.cpg Character vector of significant CpG sites to test for gene
#' set enrichment.
#' @param all.cpg Character vector of all CpG sites tested. Defaults to all CpG
#' sites on the array.
#' @param collection A list of user specified gene sets to test. Can also be a
#' single character vector gene set. Gene identifiers must be Entrez Gene IDs.
#' @param array.type The Illumina methylation array used. Options are "450K" or
#' "EPIC". Defaults to "450K".
#' @param plot.bias Logical, if true a plot showing the bias due to the
#' differing numbers of probes per gene will be displayed
#' @param prior.prob Logical, if true will take into account the probability of
#' significant differentially methylation due to numbers of probes per gene. If
#' false, a hypergeometric test is performed ignoring any bias in the data.
#' @param anno Optional. A \code{DataFrame} object containing the complete
#' array annotation as generated by the \code{\link{minfi}}
#' \code{\link{getAnnotation}} function. Speeds up execution, if provided.
#' @param equiv.cpg Logical, if true then equivalent numbers of cpgs are used
#' for odds calculation rather than total number cpgs. Only used if 
#' \code{prior.prob=TRUE}.
#' @param fract.counts Logical, if true then fractional counting of cpgs is used
#' to account for cgps that map to multiple genes. Only used if 
#' \code{prior.prob=TRUE}.
#' @param genomic.features Character vector or scalar indicating whether the 
#' gene set enrichment analysis should be restricted to CpGs from specific 
#' genomic locations. Options are "ALL", "TSS200","TSS1500","Body","1stExon",
#' "3'UTR","5'UTR","ExonBnd"; and the user can select any combination. Defaults
#' to "ALL".
#' @param sig.genes Logical, if true then the significant differentially 
#' methylated genes that overlap with the gene set of interest is outputted 
#' as the final column in the results table. Default is FALSE.
#' 
#' @return A data frame with a row for each gene set and the following columns:
#' \item{N}{ number of genes in the gene set } \item{DE}{ number of genes that
#' are differentially methylated } \item{P.DE}{ p-value for over-representation
#' of the gene set } \item{FDR}{ False discovery rate, calculated using the
#' method of Benjamini and Hochberg (1995).  } \item{SigGenesInSet}{ Significant 
#' differentially methylated genes overlapping with the gene set of interest. }
#' @author Belinda Phipson
#' @seealso \code{\link{gometh},\link{getMappedEntrezIDs}}
#' @references Phipson, B., Maksimovic, J., and Oshlack, A. (2016). missMethyl:
#' an R package for analysing methylation data from Illuminas
#' HumanMethylation450 platform. \emph{Bioinformatics}, \bold{15};32(2),
#' 286--8. 
#' 
#' Geeleher, P., Hartnett, L., Egan, L. J., Golden, A., Ali, R. A. R.,
#' and Seoighe, C. (2013). Gene-set analysis is severely biased when applied to
#' genome-wide methylation data. \emph{Bioinformatics}, \bold{29}(15),
#' 1851--1857. 
#' 
#' Young, M. D., Wakefield, M. J., Smyth, G. K., and Oshlack, A.
#' (2010). Gene ontology analysis for RNA-seq: accounting for selection bias.
#' \emph{Genome Biology}, 11, R14. 
#' 
#' Ritchie, M. E., Phipson, B., Wu, D., Hu, Y.,
#' Law, C. W., Shi, W., and Smyth, G. K. (2015). limma powers differential
#' expression analyses for RNA-sequencing and microarray studies. \emph{Nucleic
#' Acids Research}, gkv007. 
#' 
#' Benjamini, Y., and Hochberg, Y. (1995). Controlling
#' the false discovery rate: a practical and powerful approach to multiple
#' testing. \emph{Journal of the Royal Statistical Society Series}, B,
#' \bold{57}, 289-300.
#' @examples
#' 
#' \dontrun{ # to avoid timeout on Bioconductor build
#' library(IlluminaHumanMethylation450kanno.ilmn12.hg19)
#' library(org.Hs.eg.db)
#' library(limma)
#' ann <- getAnnotation(IlluminaHumanMethylation450kanno.ilmn12.hg19)
#' # Randomly select 1000 CpGs to be significantly differentially methylated
#' sigcpgs <- sample(rownames(ann),1000,replace=FALSE)
#' # All CpG sites tested
#' allcpgs <- rownames(ann)
#' # Use org.Hs.eg.db to extract a GO term
#' GOtoID <- suppressMessages(select(org.Hs.eg.db, keys=keys(org.Hs.eg.db), 
#'                                   columns=c("ENTREZID","GO"), 
#'                                   keytype="ENTREZID"))
#' setname1 <- GOtoID$GO[1]
#' setname1
#' keep.set1 <- GOtoID$GO %in% setname1
#' set1 <- GOtoID$ENTREZID[keep.set1]
#' setname2 <- GOtoID$GO[2]
#' setname2
#' keep.set2 <- GOtoID$GO %in% setname2
#' set2 <- GOtoID$ENTREZID[keep.set2]
#' # Make the gene sets into a list
#' sets <- list(set1, set2)
#' names(sets) <- c(setname1,setname2)
#' # Testing with prior probabilities taken into account
#' # Plot of bias due to differing numbers of CpG sites per gene
#' gst <- gsameth(sig.cpg = sigcpgs, all.cpg = allcpgs, collection = sets, 
#'                 plot.bias = TRUE, prior.prob = TRUE)
#' topGSA(gst)
#' 
#' # Add significant gene symbols in each set to output
#' gst <- gsameth(sig.cpg = sigcpgs, all.cpg = allcpgs, collection = sets, 
#'                 plot.bias = TRUE, prior.prob = TRUE, sig.genes = TRUE)
#' topGSA(gst)
#' 
#' # Testing ignoring bias
#' gst.bias <- gsameth(sig.cpg = sigcpgs, all.cpg = allcpgs, collection = sets, 
#'                     prior.prob = FALSE)
#' topGSA(gst.bias)
#' 
#' # Restrict to CpGs in gene bodies
#' gst.body <- gsameth(sig.cpg = sigcpgs, all.cpg = allcpgs, collection = sets,
#'                     genomic.features = "Body")
#' topGSA(gst.body)
#' 
#' }
#' 
#' @import IlluminaHumanMethylation450kanno.ilmn12.hg19 IlluminaHumanMethylationEPICanno.ilm10b4.hg19
#' @export gsameth
gsameth <- function(sig.cpg, all.cpg=NULL, collection, 
                    array.type = c("450K","EPIC"), plot.bias=FALSE, 
                    prior.prob=TRUE, anno=NULL, equiv.cpg = TRUE,
                    fract.counts = TRUE, 
                    genomic.features = c("ALL", "TSS200","TSS1500","Body",
                                         "1stExon","3'UTR","5'UTR","ExonBnd"),
                    sig.genes = FALSE)
  # Generalised version of gometh with user-specified gene sets 
  # Gene sets collections must be Entrez Gene ID
  # Can take into account probability of differential methylation 
  # based on numbers of probes on array per gene.
  # Belinda Phipson
  # 10 February 2016
  # Updated 21 March 2019
{
  
  if(!is.vector(sig.cpg))
    stop("Input CpG list is not a character vector")
  array.type <- match.arg(toupper(array.type),c("450K","EPIC"))
  genomic.features <- match.arg(genomic.features, c("ALL", "TSS200","TSS1500",
                                                    "Body", "1stExon","3'UTR",
                                                    "5'UTR","ExonBnd"), 
                                several.ok = TRUE)
  
  if(length(genomic.features) > 1 & any(grepl("ALL", genomic.features))){
    message("All input CpGs are used for testing.") 
    genomic.features <- "ALL"   
  } 
  
  if(array.type == "450K" & any(("ExonBnd" %in% genomic.features))){
      stop("'ExonBnd' is not an annotated feature on 450K arrays,
           please remove it from your genomic.feature parameter
           specification.") 
  }
  
  # Get mapped entrez gene IDs from CpG probe names
  if(!is.null(anno)){
    out <- getMappedEntrezIDs(sig.cpg=sig.cpg,all.cpg=all.cpg,
                              array.type=array.type, anno, 
                              genomic.features = genomic.features)
  } else {
    out <- getMappedEntrezIDs(sig.cpg=sig.cpg,all.cpg=all.cpg,
                              array.type=array.type, 
                              genomic.features = genomic.features)
  }
  sorted.eg.sig <- out$sig.eg
  eg.universe <- out$universe
  freq_genes <- out$freq
  test.de <- out$de
  frac <- out$fract.counts
  equiv <- out$equiv
  
  # Check collection is a list with character vectors
  if(!is.list(collection))
    collection <- list(collection=collection)
  collection <- lapply(collection, as.character)
  # Make sure gene set collections don't have any NAs
  collection <- lapply(collection, function(x) x[!is.na(x)])
  # Remove genes that are NOT in the universe from collections
  collection <- lapply(collection, function(x) x[x %in% eg.universe])
  # Remove collections with no genes left after universe filter
  inUniv <- sapply(collection, function(x) length(x) > 0)
  collection <- collection[inUniv]

  # Estimate prior probabilities
  if(prior.prob){
    if(equiv.cpg){ 
        # use "equivalent" no. of cpgs in odds calculation
        pwf <- .estimatePWF(D=test.de,bias=as.vector(equiv))
        if(plot.bias)
            .plotBias(D=test.de,bias=as.vector(equiv))
    } else {
        pwf <- .estimatePWF(D=test.de,bias=as.vector(freq_genes))
        if(plot.bias)
            .plotBias(D=test.de,bias=as.vector(freq_genes))
    }
  }
  
  results <- matrix(NA,ncol=4,nrow=length(collection))
  colnames(results) <- c("N","DE","P.DE","FDR")
  rownames(results) <- names(collection)
  results[,"N"] <- unlist(lapply(collection,length))
  if(sig.genes) SigGenesInSet <- rep(NA,length(collection))
  
  if(prior.prob & fract.counts){ 
      # use fractional counting to account for cpgs that map to multiple genes
      results[,"DE"] <- unlist(lapply(collection, function(x) 
          sum((sorted.eg.sig %in% x) * frac$frac)))
  } else {
      results[,"DE"] <- unlist(lapply(collection, function(x) 
          sum((sorted.eg.sig %in% x))))
  } 
  
  Nuniverse <- length(eg.universe)
  m <- length(sorted.eg.sig)
  
  # Hypergeometric test with prior probabilities
  if(prior.prob){
    for(i in 1:length(collection)){
      InSet <- eg.universe %in% collection[[i]]
      pw.red <- sum(pwf[InSet])/results[i,"N"]
      pw.white <- sum(pwf[!InSet])/(Nuniverse-results[i,"N"])
      odds <- pw.red/pw.white
      results[i,"P.DE"] <- BiasedUrn::pWNCHypergeo(results[i,"DE"],
                                                   results[i,"N"],
                                                   Nuniverse-results[i,"N"],
                                                   m,odds,lower.tail=FALSE) + 
          BiasedUrn::dWNCHypergeo(results[i,"DE"],
                                  results[i,"N"],
                                  Nuniverse-results[i,"N"],
                                  m,odds)
      if(sig.genes){
        # Get gene symbols of significant genes
        SigGenesEntrezID <- sorted.eg.sig[sorted.eg.sig %in% collection[[i]]]
        SigGenesSymbol <- suppressMessages(AnnotationDbi::select(org.Hs.eg.db, 
                                             keys = SigGenesEntrezID,
                                             columns = "SYMBOL"))
        SigGenesInSet[i] <- paste(SigGenesSymbol$SYMBOL,collapse=",")
      }
      if(results[i,"P.DE"]==0){
        message("Pvalue of exactly zero detected. Performing hypergeometric 
                test for gene set ", rownames(results)[i])
        results[i,"P.DE"] <- stats::phyper(q=results[i,"DE"]-0.5,m=m,
                                           n=Nuniverse-m,k=results[i,"N"],
                                           lower.tail=FALSE)
      }
    }
  }
  # Hypergeometric test without prior probabilities
  else{
    for(i in 1:length(collection)){
      results[i,"P.DE"] <- stats::phyper(q=results[i,"DE"]-0.5,m=m,
                                         n=Nuniverse-m,k=results[i,"N"],
                                         lower.tail=FALSE)
      if(sig.genes){
        # Get gene symbols of significant genes
        SigGenesEntrezID <- sorted.eg.sig[sorted.eg.sig %in% collection[[i]]]
        SigGenesSymbol <- suppressMessages(AnnotationDbi::select(org.Hs.eg.db, 
                                                               keys = SigGenesEntrezID,
                                                               columns = "SYMBOL"))
        SigGenesInSet[i] <- paste(SigGenesSymbol$SYMBOL,collapse=",")
      }
    }
  }
  results[,"FDR"] <- stats::p.adjust(results[,"P.DE"],method="BH")
  results[,"DE"] <- floor(results[,"DE"])
  if(sig.genes) data.frame(results, SigGenesInSet)
  else data.frame(results)
}



#' Get table of top 20 enriched pathways
#' 
#' After using \code{gsameth}, calling topGSA will output the top 20 most
#' significantly enriched pathways.
#' 
#' This function will output the top 20 most significant pathways from a
#' pathway analysis using the \code{gsameth} function. The output is ordered by
#' p-value.
#' 
#' @param gsa Matrix, from output of \code{gsameth}.
#' @param number Scalar, number of pathway results to output. Default is 20.
#' @param sort Logical, should the table be ordered by p-value. Default is
#' TRUE.
#' @return A matrix ordered by P.DE, with a row for each gene set and the
#' following columns: \item{N}{ number of genes in the gene set } \item{DE}{
#' number of genes that are differentially methylated } \item{P.DE}{ p-value
#' for over-representation of the gene set } \item{FDR}{ False discovery rate,
#' calculated using the method of Benjamini and Hochberg (1995) }.  
#' \item{SigGenesInSet}{ Significant differentially methylated genes overlapping
#' with the gene set of interest. }
#' @author Belinda Phipson
#' @seealso \code{\link{gsameth}}
#' @examples
#' 
#' library(IlluminaHumanMethylation450kanno.ilmn12.hg19)
#' library(org.Hs.eg.db)
#' library(limma)
#' ann <- getAnnotation(IlluminaHumanMethylation450kanno.ilmn12.hg19)
#' 
#' # Randomly select 1000 CpGs to be significantly differentially methylated
#' sigcpgs <- sample(rownames(ann),1000,replace=FALSE)
#' 
#' # All CpG sites tested
#' allcpgs <- rownames(ann)
#' 
#' # Use org.Hs.eg.db to extract a GO term
#' GOtoID <- toTable(org.Hs.egGO2EG)
#' setname1 <- GOtoID$go_id[1]
#' setname1
#' keep.set1 <- GOtoID$go_id %in% setname1
#' set1 <- GOtoID$gene_id[keep.set1]
#' setname2 <- GOtoID$go_id[2]
#' setname2
#' keep.set2 <- GOtoID$go_id %in% setname2
#' set2 <- GOtoID$gene_id[keep.set2]
#' 
#' # Make the gene sets into a list
#' sets <- list(set1, set2)
#' names(sets) <- c(setname1,setname2)
#' 
#' # Testing with prior probabilities taken into account
#' # Plot of bias due to differing numbers of CpG sites per gene
#' gst <- gsameth(sig.cpg = sigcpgs, all.cpg = allcpgs, collection = sets, 
#' plot.bias = TRUE, prior.prob = TRUE)
#' topGSA(gst)
#' 
#' # Testing ignoring bias
#' gst.bias <- gsameth(sig.cpg = sigcpgs, all.cpg = allcpgs, collection = sets, 
#'                     prior.prob = FALSE)
#' topGSA(gst.bias)
#' 
#' @export topGSA
topGSA <- function(gsa,number=20,sort=TRUE)
  # Function to output the top 20 enriched gene sets from a gene set 
  # enrichment analysis using gsameth.
  # Belinda Phipson
  # 12 February
{
  if(nrow(gsa) < number)
    number <- nrow(gsa)
  if(sort){
    o <- order(gsa[,"P.DE"])
    out <- gsa[o,]
    out[1:number,]
  }
  else
    gsa[1:number,]
}