R/polyenrich.R
In chipenrich: Gene Set Enrichment For ChIP-seq Peak Data

Documented in polyenrich

#' Run Poly-Enrich on narrow genomic regions
#'
#' Poly-Enrich is also designed for narrow peaks, for experiments with 100,000s of
#' peaks, or in cases where the number of binding sites per gene affects its regulation.
#' If unsure whether to use chipenrich or polyenrich, then we recommend hybridenrich.
#' For more details, see the 'Poly-Enrich Method' section below. For help choosing a method,
#' see the 'Choosing A Method' section below, or see the vignette.
#'
#' @section Poly-Enrich Method:
#' The Poly-Enrich method uses the number of peaks in genes in its model for
#' enrichment: \code{num_peaks ~ GO + s(log10_length)}. Here, \code{GO} is a
#' binary vector indicating whether a gene is in the gene set being tested,
#' \code{num_peaks} is a numeric vector indicating the number of peaks in each
#' gene, and \code{s(log10_length)} is a negative binomial cubic smoothing spline
#' which adjusts for the relationship between the number of peaks in a gene and
#' locus length.
#'
#' @section Poly-Enrich Weighting Options:
#' Poly-Enrich also allows weighting of individual peaks. Currently the options are:
#' \describe{
#'  \item{'signalValue:'}{ weighs each peak based on the Signal Value given in the
#' narrowPeak format or a user-supplied column, normalized to have mean 1.}
#'  \item{'logsignalValue:'}{ weighs each peak based on the log Signal Value given in the
#' narrowPeak format or a user-supplied column, normalized to have mean 1.}
#'  \item{'multiAssign:'}{ weighs each peak by the inverse of the number of genes
#' it is assigned to.}
#' }
#'
#' @section Choosing A Method:
#' The following guidelines are intended to help select an enrichment function:
#' \describe{
#'	\item{broadenrich():}{ is designed for use with broad peaks that may intersect
#' multiple gene loci, and cumulatively cover greater than 5\% of the genome. For
#' example, ChIP-seq experiments for histone modifications.}
#'	\item{chipenrich():}{ is designed for use with 1,000s or 10,000s of narrow
#' peaks which results in fewer gene loci containing a peak overall. For example,
#' ChIP-seq experiments for transcription factors.}
#'	\item{polyenrich():}{ is also designed for narrow peaks, for experiments with
#' 100,000s of peaks, or in cases where the number of binding sites per gene affects
#' its regulation. If unsure whether to use chipenrich or polyenrich, then we recommend
#' hybridenrich.}
#'  \item{hybridenrich():}{ is a combination of chipenrich and polyenrich, to be
#' used when one is unsure which is the optimal method. }
#' }
#'
#' @section Randomizations:
#' Randomization of locus definitions allows for the assessment of Type I Error
#' under the null hypothesis. The randomization codes are:
#' \describe{
#'	\item{\code{NULL}:}{ No randomizations, the default.}
#' 	\item{'complete':}{ Shuffle the \code{gene_id} and \code{symbol} columns of the
#' \code{locusdef} together, without regard for the chromosome location, or locus length.
#' The null hypothesis is that there is no true gene set enrichment.}
#' 	\item{'bylength':}{ Shuffle the \code{gene_id} and \code{symbol} columns of the
#' \code{locusdef} together within bins of 100 genes sorted by locus length. The null
#' hypothesis is that there is no true gene set enrichment, but with preserved locus
#' length relationship.}
#' 	\item{'bylocation':}{ Shuffle the \code{gene_id} and \code{symbol} columns of the
#' \code{locusdef} together within bins of 50 genes sorted by genomic location. The null
#' hypothesis is that there is no true gene set enrichment, but with preserved
#' genomic location.}
#' }
#' The return value with a selected randomization is the same list as without.
#' To assess the Type I error, the \code{alpha} level for the particular data set
#' can be calculated by dividing the total number of gene sets with p-value < \code{alpha}
#' by the total number of tests. Users may want to perform multiple randomizations
#' for a set of peaks and take the median of the \code{alpha} values.
#'
#' @param peaks Either a file path or a \code{data.frame} of peaks in BED-like
#' format. If a file path, the following formats are fully supported via their
#' file extensions: .bed, .broadPeak, .narrowPeak, .gff3, .gff2, .gff, and .bedGraph
#' or .bdg. BED3 through BED6 files are supported under the .bed extension. Files
#' without these extensions are supported under the conditions that the first 3
#' columns correspond to 'chr', 'start', and 'end' and that there is either no
#' header column, or it is commented out. If a \code{data.frame} A BEDX+Y style
#' \code{data.frame}. See \code{GenomicRanges::makeGRangesFromDataFrame} for
#' acceptable column names.
#' @param out_name Prefix string to use for naming output files. This should not
#' contain any characters that would be illegal for the system being used (Unix,
#' Windows, etc.) The default value is "polyenrich", and a file "polyenrich_results.tab"
#' is produced. If \code{qc_plots} is set, then a file "polyenrich_qcplots.png"
#' is produced containing a number of quality control plots. If \code{out_name}
#' is set to NULL, no files are written, and results then must be retrieved from
#' the list returned by \code{polyenrich}.
#' @param out_path Directory to which results files will be written out. Defaults
#' to the current working directory as returned by \code{\link{getwd}}.
#' @param genome One of the \code{supported_genomes()}.
#' @param genesets A character vector of geneset databases to be tested for
#' enrichment. See \code{supported_genesets()}. Alternately, a file path to a
#' a tab-delimited text file with header and first column being the geneset ID
#' or name, and the second column being Entrez Gene IDs. For an example custom
#' gene set file, see the vignette.
#' @param locusdef One of: 'nearest_tss', 'nearest_gene', 'exon', 'intron', '1kb',
#' '1kb_outside', '1kb_outside_upstream', '5kb', '5kb_outside', '5kb_outside_upstream',
#' '10kb', '10kb_outside', '10kb_outside_upstream'. For a description of each,
#' see the vignette or \code{\link{supported_locusdefs}}. Alternately, a file path for
#' a custom locus definition. NOTE: Must be for a \code{supported_genome()}, and
#' must have columns 'chr', 'start', 'end', and 'gene_id' or 'geneid'. For an
#' example custom locus definition file, see the vignette.
#' @param method A character string specifying the method to use for enrichment
#' testing. Current options are \code{polyenrich} and \code{polyenrich_weighted}.
#' @param weighting A character string specifying the weighting method if method is
#' chosen to be 'polyenrich_weighted'. Current options are: 'signalValue',
#' 'logsignalValue', and 'multiAssign'.
#' @param mappability One of \code{NULL}, a file path to a custom mappability file,
#' or an \code{integer} for a valid read length given by \code{supported_read_lengths}.
#' If a file, it should contain a header with two column named 'gene_id' and 'mappa'.
#' Gene IDs should be Entrez IDs, and mappability values should range from 0 and 1.
#' For an example custom mappability file, see the vignette. Default value is NULL.
#' @param qc_plots A logical variable that enables the automatic generation of
#' plots for quality control.
#' @param min_geneset_size Sets the minimum number of genes a gene set may have
#' to be considered for enrichment testing.
#' @param max_geneset_size Sets the maximum number of genes a gene set may have
#' to be considered for enrichment testing.
#' @param randomization One of \code{NULL}, 'complete', 'bylength', or 'bylocation'.
#' See the Randomizations section above.
#' @param n_cores The number of cores to use for enrichment testing. We recommend
#' using only up to the maximum number of \emph{physical} cores present, as
#' virtual cores do not significantly decrease runtime. Default number of cores
#' is set to 1. NOTE: Windows does not support multicore enrichment.
#'
#' @return A list, containing the following items:
#'
#' \item{opts }{A data frame containing the arguments/values passed to \code{polyenrich}.}
#'
#' \item{peaks }{
#' A data frame containing peak assignments to genes. Peaks which do not overlap
#' a gene locus are not included. Each peak that was assigned to a gene is listed,
#' along with the peak midpoint or peak interval coordinates (depending on which
#' was used), the gene to which the peak was assigned, the locus start and end
#' position of the gene, and the distance from the peak to the TSS.
#'
#' The columns are:
#'
#' \describe{
#'   \item{peak_id}{ an ID given to unique combinations of chromosome, peak start, and peak end. }
#'   \item{chr}{ the chromosome the peak originated from. }
#'   \item{peak_start}{ start position of the peak. }
#'   \item{peak_end}{ end position of the peak. }
#'   \item{peak_midpoint}{ the midpoint of the peak. }
#'   \item{gene_id}{ the Entrez ID of the gene to which the peak was assigned. }
#'   \item{gene_symbol}{ the official gene symbol for the gene_id (above). }
#'   \item{gene_locus_start}{ the start position of the locus for the gene to which the peak was assigned (specified by the locus definition used.) }
#'   \item{gene_locus_end}{ the end position of the locus for the gene to which the peak was assigned (specified by the locus definition used.) }
#'   \item{nearest_tss}{ the closest TSS to this peak (for any gene, not necessarily the gene this peak was assigned to.) }
#'   \item{nearest_tss_gene}{ the gene having the closest TSS to the peak (should be the same as gene_id when using the nearest TSS locus definition.) }
#'   \item{nearest_tss_gene_strand}{ the strand of the gene with the closest TSS. }
#' }}
#'
#' \item{peaks_per_gene }{
#' A data frame of the count of peaks per gene. The columns are:
#'
#' \describe{
#'   \item{gene_id}{ the Entrez Gene ID. }
#'   \item{length}{ the length of the gene's locus (depending on which locus definition you chose.)}
#'   \item{log10_length}{ the log10(locus length) for the gene.}
#'   \item{num_peaks}{ the number of peaks that were assigned to the gene, given the current locus definition. }
#'   \item{peak}{ whether or not the gene has a peak. }
#' }}
#'
#' \item{results }{
#' A data frame of the results from performing the gene set enrichment test on
#' each geneset that was requested (all genesets are merged into one final data
#' frame.) The columns are:
#'
#' \describe{
#'   \item{Geneset.ID}{ is the identifier for a given gene set from the selected database.  For example, GO:0000003. }
#'   \item{Geneset.Type}{ specifies from which database the Geneset.ID originates.  For example, "Gene Ontology Biological Process."}
#'   \item{Description}{ gives a definition of the geneset. For example, "reproduction."}
#'   \item{P.Value}{ is the probability of observing the degree of enrichment of the gene set given the null hypothesis that peaks are not associated with any gene sets.}
#'   \item{FDR}{ is the false discovery rate proposed by Bejamini \& Hochberg for adjusting the p-value to control for family-wise error rate.}
#'   \item{Odds.Ratio}{ is the estimated odds that peaks are associated with a given gene set compared to the odds that peaks are associated with other gene sets, after controlling for locus length and/or mappability.  An odds ratio greater than 1 indicates enrichment, and less than 1 indicates depletion.}
#'   \item{N.Geneset.Genes}{ is the number of genes in the gene set.}
#'   \item{N.Geneset.Peak.Genes}{ is the number of genes in the genes set that were assigned at least one peak.}
#'   \item{Geneset.Avg.Gene.Length}{ is the average length of the genes in the gene set.}
#'   \item{Geneset.Peak.Genes}{ is the list of genes from the gene set that had at least one peak assigned.}
#'
#' }}
#'
#' @examples
#'
#' # Run Poly-Enrich using an example dataset, assigning peaks to the nearest TSS,
#' # and on a small custom geneset
#' data(peaks_E2F4, package = 'chipenrich.data')
#' peaks_E2F4 = subset(peaks_E2F4, peaks_E2F4$chrom == 'chr1')
#' gs_path = system.file('extdata','vignette_genesets.txt', package='chipenrich')
#' results = polyenrich(peaks_E2F4, method='polyenrich', locusdef='nearest_tss',
#' 			genome = 'hg19', genesets=gs_path, out_name=NULL)
#'
#' # Get the list of peaks that were assigned to genes.
#' assigned_peaks = results$peaks
#'
#' # Get the results of enrichment testing.
#' enrich = results$results
#'
#' @export
#' @include constants.R utils.R supported.R setup.R randomize.R
#' @include read.R assign_peaks.R peaks_per_gene.R
#' @include plot_dist_to_tss.R plot_polyenrich_spline.R
#' @include test_polyenrich.R test_polyenrich_slow.R test_polyenrich_weighted.R test_polyapprox.R
polyenrich = function(
	peaks,
	out_name = "polyenrich",
	out_path = getwd(),
	genome = supported_genomes(),
	genesets = c(
		'GOBP',
		'GOCC',
		'GOMF'),
	locusdef = "nearest_tss",
	method = 'polyenrich',
    weighting = NULL,
	mappability = NULL,
	qc_plots = TRUE,
	min_geneset_size = 15,
	max_geneset_size = 2000,
	randomization = NULL,
	n_cores = 1
) {
	genome = match.arg(genome)

	n_cores = reset_ncores_for_windows(n_cores)

	############################################################################
	# Collect options for opts output
	opts_list = as.list(sys.call())
	opts_list = opts_list[2:length(opts_list)]

	opts = data.frame(
		parameters = names(opts_list),
		values = as.character(opts_list),
		stringsAsFactors = FALSE
	)

	############################################################################
	# Setup locus definitions, genesets, and mappability

	ldef_list = setup_locusdef(locusdef, genome, randomization)
	ldef = ldef_list[['ldef']]
	tss = ldef_list[['tss']]

	geneset_list = setup_genesets(gs_codes = genesets, ldef_obj = ldef, genome = genome, min_geneset_size = min_geneset_size, max_geneset_size = max_geneset_size)

	mappa = setup_mappa(mappa_code = mappability, genome = genome, ldef_code = locusdef, ldef_obj = ldef)

	############################################################################
	# CHECK method and get() it if okay
	testf = get_test_method(method)
	test_func = get(testf)
	method_name = METHOD_NAMES[[method]]

	############################################################################
	############################################################################
    # Start enrichment process
	############################################################################
	############################################################################

	######################################################
	# Read in and format peaks (from data.frame or file)
	if (class(peaks) == "data.frame") {
		message('Reading peaks from data.frame...')
		peakobj = load_peaks(peaks)
	} else if (class(peaks) == "character") {
		peakobj = read_bed(peaks)
	}

	# Number of peaks in data.
	num_peaks = length(peakobj)

	######################################################
	# Assign peaks to genes.
	message("Assigning peaks to genes with assign_peaks(...) ..")
	assigned_peaks = assign_peaks(peakobj, ldef, tss, weighting)

	######################################################
	# Compute peaks per gene table
	ppg = num_peaks_per_gene(assigned_peaks, ldef, mappa)

    ######################################################
    # If using weights but not polyenrich_weighted, stop
    if (!is.null(weighting) & method != "polyenrich_weighted") {
        stop("You need to use method = polyenrich_weighted if choosing weighting!")
    }

    ######################################################
    # If using the weighted method, add the weights column
    if (method == "polyenrich_weighted") {
        if (is.null(weighting)) {
            # No weights specified
            stop("No weight options selected!")
        } else if (!all(weighting %in% c("logsignalValue","signalValue","multiAssign"))) {
            # Unsupported weights
            stop(sprintf("Unsupported weights: %s",
                paste(weighting[which(!(weighting %in% c("logsignalValue","signalValue","multiAssign")))],collapse=", ")))
        }

        assigned_peaks = calc_peak_weights(assigned_peaks, weighting)
        ppg = calc_genes_peak_weight(assigned_peaks, ppg)
    }

	######################################################
	# Enrichment
	results = list()
	for (gobj in geneset_list) {
		message(sprintf("Test: %s",method_name))
		message(sprintf("Genesets: %s",gobj@type))
		message("Running tests..")
		if (testf == "test_polyenrich_slow") {
			rtemp = test_func(gobj,ppg,n_cores)
		}
		if (testf == "test_polyenrich") {
			rtemp = test_func(gobj,ppg,n_cores)
		}
        if (testf == "test_polyenrich_weighted") {
            # Note that there is an extra input for the name of the count column
            # In the future when we make a function purely for enrichment, this will be where you
            # give the name of the count column.
            rtemp = test_func(gobj,ppg,n_cores, "sum_peak_weight")
        }
        if (testf == "test_polyapprox") {
            rtemp = test_func(gobj,ppg,n_cores)
        }

		# Annotate with geneset descriptions.
		rtemp$"Description" = as.character(mget(rtemp$Geneset.ID, gobj@set.name, ifnotfound=NA))
		rtemp$"Geneset.Type" = gobj@type

		results[[gobj@type]] = rtemp
	}
	enrich = Reduce(rbind,results)

	######################################################
	# Post-process enrichment
	# Order columns, add enriched/depleted column as needed, remove bad tests,
	# sort by p-value, rename rownames to integers
	enrich = post_process_enrichments(enrich)

	######################################################
	# Write result objects to files
	if (!is.null(out_name)) {
		filename_analysis = file.path(out_path, sprintf("%s_results.tab", out_name))
		write.table(enrich, file = filename_analysis, row.names = FALSE, quote = FALSE, sep = "\t")
		message("Wrote results to: ", filename_analysis)

		filename_peaks = file.path(out_path, sprintf("%s_peaks.tab", out_name))
		write.table(assigned_peaks, file = filename_peaks, row.names = FALSE, quote = FALSE, sep = "\t")
		message("Wrote peak-to-gene assignments to: ", filename_peaks)

		filename_opts = file.path(out_path, sprintf("%s_opts.tab", out_name))
		write.table(opts, file = filename_opts, row.names = FALSE, quote = FALSE, sep = "\t")
		message("Wrote run options/arguments to: ", filename_opts)

		filename_ppg = file.path(out_path, sprintf("%s_peaks-per-gene.tab", out_name))
		write.table(ppg, file = filename_ppg, row.names = FALSE, quote = FALSE, sep = "\t")
		message("Wrote count of peaks per gene to: ", filename_ppg)

		if (qc_plots) {
			filename_qcplots = file.path(out_path, sprintf("%s_qcplots.png", out_name))
			grDevices::png(filename_qcplots)
				print(..plot_polyenrich_spline(gpw = ppg, mappability = mappability, num_peaks = num_peaks))
				print(..plot_dist_to_tss(peakobj, tss))
			grDevices::dev.off()
			message("Wrote QC plots to: ",filename_qcplots)
		}
	}

	######################################################
	# Return objects as list
	return(list(
		peaks = assigned_peaks,
		results = enrich,
		opts = opts,
		peaks_per_gene = ppg
	))
}