Nothing
#' @title Peak Calling and Peak Statistics Quantification on MeRIP-seq Dataset.
#'
#' @description \code{exomePeak2} conducts peak calling and peak statistics calculation from \strong{BAM} files of a MeRIP-seq experiment.
#' The function integrates the following steps of a standard MeRIP-seq data analysis pipeline.
#'
#' \enumerate{
#' \item Check and index the BAM files with \code{\link{scanMeripBAM}}.
#' \item Call modification peaks on exons with \code{\link{exomePeakCalling}}.
#' \item Calculate offset factors of GC content biases with \code{\link{normalizeGC}}.
#' \item Calculate (differential) modification statistics with the generalized linear model (GLM) using \code{\link{glmM}} or \code{\link{glmDM}}.
#' \item Export the peaks/sites statistics with user defined format by \code{\link{exportResults}}.
#' }
#'
#' See the help pages of the corresponding functions for the complete documentation.
#'
#' @details \code{\link{exomePeak2}} call RNA modification peaks and calculate peak statistics from \strong{BAM} files of a MeRIP-seq experiment.
#'
#' The transcript annotation (from either the \code{\link{TxDb}} object or the \strong{GFF} file) should be provided to perform analysis on exons.
#'
#' The \code{\link{BSgenome}} object is also required to perform the GC content bias adjustment.
#' If the \code{bsgenome} and the \code{genome} arguments are not provided (\code{= NULL}), the downstream analysis will proceed without GC content bias corrections.
#'
#' If the \strong{BAM} files in treated samples are provided at the arguments \code{bam_treated_ip} and \code{bam_treated_input}, the statistics of differential modification analysis on peaks/sites will be reported.
#'
#' Under the default setting, \code{\link{exomePeak2}} will save the results of (differential) modification analysis under a folder named \code{'exomePeak2_output'}.
#' The results generated include a \strong{BED} file and a \strong{CSV} table that stores the locations and statistics of the (differential) modified peaks/sites.
#'
#' @param bam_ip a \code{character} vector for the BAM file directories of the (control) IP samples.
#' @param bam_input a \code{character} vector for the BAM file directories of the (control) input samples.
#' @param bam_treated_ip a \code{character} vector for the BAM file directories of the treated IP samples.
#' @param bam_treated_input a \code{character} vector for the BAM file directories of the treated input samples.
#'
#' If the bam files do not contain treatment group, user should only fill the arguments of \code{BAM_ip} and \code{BAM_input}.
#'
#' @param paired_end a \code{logical} of whether the data comes from the Paired-End Library, \code{TRUE} if the data is Paired-End sequencing; default \code{FALSE}.
#' @param library_type a \code{character} specifying the protocal type of the RNA-seq library, can be one in \code{c("unstranded", "1st_strand", "2nd_strand")}; default \code{= "unstranded"}.
#'
#' \describe{
#' \item{\strong{unstranded}}{The randomly primed RNA-seq library type, i.e. both the strands generated during the first and the second strand sythesis are sequenced; example: Standard Illumina.}
#' \item{\strong{1st_strand}}{The first strand-specific RNA-seq library, only the strand generated during the first strand sythesis is sequenced; examples: dUTP, NSR, NNSR.}
#' \item{\strong{2nd_strand}}{The second strand-specific RNA-seq library, only the strand generated during the second strand sythesis is sequenced; examples: Ligation, Standard SOLiD.}
#' }
#'
#' @param txdb a \code{\link{TxDb}} object for the transcript annotation.
#'
#' If the \code{TxDb} object is not available, it could be a \code{character} string of the UCSC genome name which is acceptable by \code{\link{makeTxDbFromUCSC}}. For example: \code{"hg19"}.
#'
#' @param bsgenome a \code{\link{BSgenome}} object for the genome sequence information.
#'
#' If the \code{BSgenome} object is not available, it could be a \code{character} string of the UCSC genome name which is acceptable by \code{\link{getBSgenome}}. For example: \code{"hg19"}.
#'
#' @param genome a \code{character} string of the UCSC genome name which is acceptable by \code{\link{getBSgenome}} or/and \code{\link{makeTxDbFromUCSC}}. For example: \code{"hg19"}.
#'
#' By default, the argument = NA, it should be provided when the \code{BSgenome} or/and the \code{TxDb} object are not available.
#'
#' @param gff_dir optional, a \code{character} which specifies the directory toward a gene annotation GFF/GTF file, it is applied when the \code{TxDb} object is not available, default \code{= NULL}.
#'
#' @param fragment_length a positive integer number for the expected fragment length in nucleotides; default \code{= 100}.
#'
#' @param binding_length a positive integer number for the expected binding length of the anti-modification antibody in IP samples; default \code{= 25}.
#'
#' @param step_length a positive integer number for the shift distances of the sliding window; default \code{= binding_length}.
#'
#' @param peak_width a \code{numeric} value for the minimum width of the merged peaks; default \code{= fragment_length} .
#'
#' @param glm_type a \code{character} speciefies the type of Generalized Linear Model (GLM) fitted for the purpose of statistical inference during peak calling, which can be one of the \code{c("DESeq2", "NB", "Poisson")}.
#'
#' \describe{
#' \item{\strong{\code{DESeq2}}}{Fit the GLM defined in the function \code{\link{DESeq}}, which is the NB GLM with regulated estimation of the overdispersion parameters.}
#'
#' \item{\strong{\code{NB}}}{Fit the ordinary Negative Binomial (NB) GLM.}
#'
#' \item{\strong{\code{Poisson}}}{Fit the Poisson GLM.}
#' }
#'
#' By default, the DESeq2 GLMs are fitted on the data set with > 1 biological replicates for both the IP and input samples, the Poisson GLM will be fitted otherwise.
#'
#' @param pc_count_cutoff a \code{numeric} value for the cutoff on average window's reads count in peak calling; default \code{= 5}.
#'
#' @param bg_count_cutoff a \code{numeric} value for the cutoff on average window's reads count in background identification; default \code{= 50}.
#'
#' @param p_cutoff a \code{numeric} value for the cutoff on p values in peak calling; default \code{= 1e-05}.
#'
#' @param p_adj_cutoff a \code{numeric} value for the cutoff on Benjamini Hochberg adjusted p values in peak calling; default \code{= NULL}.
#'
#' @param log2FC_cutoff a \code{numeric} value for the cutoff on log2 IP over input fold changes in peak calling; default \code{= 1}.
#'
#' @param consistent_peak a \code{logical} of whether the positive peaks returned should be consistent among all the replicates; default \code{= FALSE}.
#'
#' @param consistent_log2FC_cutoff a \code{numeric} for the modification log2 fold changes cutoff in the peak consisency calculation; default = 1.
#'
#' @param consistent_fdr_cutoff a \code{numeric} for the BH adjusted C-test p values cutoff in the peak consistency calculation; default { = 0.05}. Check \code{\link{ctest}}.
#'
#' @param alpha a \code{numeric} for the binomial quantile used in the consitent peak filter; default\code{ = 0.05}.
#'
#' @param p0 a \code{numeric} for the binomial proportion parameter used in the consistent peak filter; default \code{= 0.8}.
#'
#' For a peak to be consistently methylated, the minimum number of significant enriched replicate pairs is defined as the 1 - alpha quantile of a binomial distribution with p = p0 and N = number of possible pairs between replicates.
#'
#' The consistency defined in this way is equivalent to the rejection of an exact binomial test with null hypothesis of p < p0 and N = replicates number of IP * replicates number of input.
#'
#' @param parallel a \code{logical} value indicating whether to use parallel computation, it will require > 16GB memory if \code{parallel = TRUE}; default \code{= FALSE}.
#'
#' @param mod_annot a \code{\link{GRanges}} object for user provided single based RNA modification annotation.
#'
#' If user provides the single based RNA modification annotation, exomePeak2 will perform reads count and (differential) modification quantification on the provided annotation.
#'
#' The single base annotation will be flanked by length = floor(fragment_length - binding_length/2) to account for the fragment length of the sequencing library.
#'
#' @param manual_background a \code{\link{GRanges}} object for the user provided unmodified background; default \code{= NULL}.
#'
#' @param correct_GC_bg a \code{logical} value of whether to estimate the GC content linear effect on background regions during modification level quantification; default \code{= TRUE}.
#'
#' If \code{= TRUE}, it could lead to a more accurate estimation of GC content bias for the RNA modifications that are highly biologically related to GC content.
#'
#' @param qtnorm a \code{logical} of whether to perform subset quantile normalization after the GC content linear effect correction; default \code{= FALSE}.
#'
#' If \code{qtnorm = TRUE}, subset quantile normalization will be applied within the IP and input samples seperately to account for the inherent differences between the marginal distributions of IP and input samples.
#'
#' @param background_method a \code{character} specifies the method of finding background regions for peak detection, i.e. to identify the windows without modification signal. It could be one of "Gaussian_mixture", "m6Aseq_prior", "manual", and "all"; default \code{= "all"}.
#'
#' In order to accurately account for the technical variations, it is often important to estimate the sequencing depth and GC content linear effects on windows without modification signals.
#'
#' The following methods are supported in \code{ExomePeak2} to differentiate the no modification background windows from the modification containig windows.
#'
#' \describe{
#'
#' \item{\strong{\code{all}}}{Use all windows as the background.
#' This choise assumes no differences in the effects of technical features between the background and the modification regions.}
#'
#' \item{\strong{\code{Gaussian_mixture}}}{The background is identified by Multivariate Gaussian Mixture Model (MGMM) with 2 mixing components on the vectors containing methylation level estimates and GC content, the background regions are predicted by the Bayes Classifier on the learned GMM.}
#'
#' \item{\strong{\code{m6Aseq_prior}}}{The background is identified by the prior knowledge of m6A topology, the windows that are not overlapped with long exons (exon length >= 400bp) and 5'UTR are treated as the background windows.
#'
#' This type of background should not be used if the MeRIP-seq data is not targetting on m6A methylation.
#'
#' }
#'
#' \item{\strong{\code{manual}}}{The background regions are defined by the user manually at the argument \code{manual_background}.}
#'
#' }
#'
#'
#' @param LFC_shrinkage a \code{character} for the method of emperical bayes shrinkage on log2FC, could be one of \code{c("apeglm", "ashr", "Gaussian", "none")}; Default \code{= "apeglm"}.
#'
#' see \code{\link{lfcShrink}} for more details; if "none" is selected, only the MLE will be returned.
#'
#' @param table_style a \code{character} for the style of the table being exported, could be one of \code{c("bed", "granges")}; Default \code{= "bed"}.
#'
#' @param export_results a \code{logical} of whether to save the results on disk; default \code{= TRUE}.
#'
#' @param export_format a \code{character} vector for the format(s) of the result being exported, could be the subset of \code{c("CSV","BED","RDS")}; Default \code{= c("CSV","BED","RDS")}.
#'
#' @param save_plot_GC a \code{logical} of whether to generate the plots for GC content bias assessment; default \code{= TRUE}.
#'
#' @param save_plot_analysis a \code{logical} of whether to generate the plots for genomic analysis on modification sites; default \code{= FALSE}.
#'
#' @param save_plot_name a \code{character} for the name of the plots being saved; Default \code{= "Plot"}.
#'
#' @param save_dir a \code{character} for the name of the directory being saved; Default \code{= "exomePeak2_output"}.
#'
#' @param peak_calling_mode a \code{character} specifies the scope of peak calling on genome, can be one of \code{c("exon", "full_transcript", "whole_genome")}; Default \code{= "exon"}.
#'
#' \describe{
#' \item{\strong{\code{exon}}}{generate sliding windows on exon regions.}
#'
#' \item{\strong{\code{full_transcript}}}{generate sliding windows on the full transcripts (include both introns and exons).}
#'
#' \item{\strong{\code{whole_genome}}}{generate sliding windows on the whole genome (include introns, exons, and the intergenic regions).}
#' }
#'
#' P.S. The full transcript mode and the whole genome mode demand big memory size (> 4GB) for large genomes.
#'
#' @return
#' a \code{\link{SummarizedExomePeak}} object.
#'
#' @examples
#'
#' #Specify File Directories
#'
#' GENE_ANNO_GTF = system.file("extdata", "example.gtf", package="exomePeak2")
#'
#' f1 = system.file("extdata", "IP1.bam", package="exomePeak2")
#' f2 = system.file("extdata", "IP2.bam", package="exomePeak2")
#' f3 = system.file("extdata", "IP3.bam", package="exomePeak2")
#' f4 = system.file("extdata", "IP4.bam", package="exomePeak2")
#' IP_BAM = c(f1,f2,f3,f4)
#' f1 = system.file("extdata", "Input1.bam", package="exomePeak2")
#' f2 = system.file("extdata", "Input2.bam", package="exomePeak2")
#' f3 = system.file("extdata", "Input3.bam", package="exomePeak2")
#' INPUT_BAM = c(f1,f2,f3)
#'
#' # Peak Calling
#'
#' sep <- exomePeak2(bam_ip = IP_BAM,
#' bam_input = INPUT_BAM,
#' gff_dir = GENE_ANNO_GTF,
#' genome = "hg19",
#' paired_end = FALSE)
#' sep
#'
#' # Differential Modification Analysis on Modification Peaks (Comparison of Two Conditions)
#'
#' f1 = system.file("extdata", "treated_IP1.bam", package="exomePeak2")
#' TREATED_IP_BAM = c(f1)
#' f1 = system.file("extdata", "treated_Input1.bam", package="exomePeak2")
#' TREATED_INPUT_BAM = c(f1)
#'
#' sep <- exomePeak2(bam_ip = IP_BAM,
#' bam_input = INPUT_BAM,
#' bam_treated_input = TREATED_INPUT_BAM,
#' bam_treated_ip = TREATED_IP_BAM,
#' gff_dir = GENE_ANNO_GTF,
#' genome = "hg19",
#' paired_end = FALSE)
#' sep
#'
#' # Modification Level Quantification on Single Based Modification Annotation
#'
#' f2 = system.file("extdata", "mod_annot.rds", package="exomePeak2")
#'
#' MOD_ANNO_GRANGE <- readRDS(f2)
#'
#' sep <- exomePeak2(bam_ip = IP_BAM,
#' bam_input = INPUT_BAM,
#' gff_dir = GENE_ANNO_GTF,
#' genome = "hg19",
#' paired_end = FALSE,
#' mod_annot = MOD_ANNO_GRANGE)
#' sep
#'
#' # Differential Modification Analysis on Single Based Modification Annotation
#'
#' sep <- exomePeak2(bam_ip = IP_BAM,
#' bam_input = INPUT_BAM,
#' bam_treated_input = TREATED_INPUT_BAM,
#' bam_treated_ip = TREATED_IP_BAM,
#' gff_dir = GENE_ANNO_GTF,
#' genome = "hg19",
#' paired_end = FALSE,
#' mod_annot = MOD_ANNO_GRANGE)
#' sep
#'
#'
#' @seealso \code{\link{exomePeakCalling}}, \code{\link{glmM}}, \code{\link{glmDM}}, \code{\link{normalizeGC}}, \code{\link{exportResults}}, \code{\link{plotLfcGC}}
#' @importFrom GenomicAlignments summarizeOverlaps
#' @importFrom Rsamtools asMates
#' @importFrom utils capture.output
#' @import GenomicRanges
#' @import SummarizedExperiment
#'
#' @docType methods
#'
#' @name exomePeak2
#'
#' @rdname exomePeak2
#'
#' @export
#'
exomePeak2 <- function(bam_ip = NULL,
bam_input = NULL,
bam_treated_ip = NULL,
bam_treated_input = NULL,
txdb = NULL,
bsgenome = NULL,
genome = NA,
gff_dir = NULL,
mod_annot = NULL,
paired_end = FALSE,
library_type = c("unstranded",
"1st_strand",
"2nd_strand"),
fragment_length = 100,
binding_length = 25,
step_length = binding_length,
peak_width = fragment_length/2,
pc_count_cutoff = 5,
bg_count_cutoff = 50,
p_cutoff = 1e-05,
p_adj_cutoff = NULL,
log2FC_cutoff = 1,
consistent_peak = FALSE,
consistent_log2FC_cutoff = 1,
consistent_fdr_cutoff = 0.05,
alpha = 0.05,
p0 = 0.8,
parallel = FALSE,
background_method = c("all",
"Gaussian_mixture",
"m6Aseq_prior",
"manual"),
manual_background = NULL,
correct_GC_bg = TRUE,
qtnorm = FALSE,
glm_type = c("DESeq2",
"Poisson",
"NB"),
LFC_shrinkage = c("apeglm",
"ashr",
"Gaussian",
"none"),
export_results = TRUE,
export_format = c("CSV",
"BED",
"RDS"),
table_style = c("bed",
"granges"),
save_plot_GC = TRUE,
save_plot_analysis = FALSE,
save_plot_name = "",
save_dir = "exomePeak2_output",
peak_calling_mode = c("exon",
"full_tx",
"whole_genome")
) {
LFC_shrinkage <- match.arg(LFC_shrinkage)
stopifnot(all(export_format %in% c("CSV", "BED", "RDS")))
table_style <- match.arg(table_style)
background_method <- match.arg(background_method)
glm_type <- match.arg(glm_type)
peak_calling_mode <- match.arg(peak_calling_mode)
if(!is.null(bam_treated_ip) & LFC_shrinkage == "Gaussian"){
stop("Gaussian prior is not applicable for differential modification analysis.")
}
stopifnot(fragment_length > 0)
stopifnot(step_length > 0)
stopifnot(peak_width > 0)
stopifnot(log2FC_cutoff >= 0)
stopifnot(pc_count_cutoff >= 0)
if(!is.na(genome)) {
if(!is(bsgenome,"BSgenome")) bsgenome = genome
if(!is(txdb,"TxDb") & is.null(gff_dir)) txdb = genome
}
if(!is.null(bsgenome)) {
bsgenome <- getBSgenome(bsgenome)
}
if(!is.null(gff_dir) & is.null(txdb)) {
message("Make the TxDb object ... ", appendLF = FALSE)
txdb <- suppressMessages( makeTxDbFromGFF(gff_dir) )
message("OK")
} else {
if (is.null(txdb)) {
stop("Require argument of txdb or gff_dir for transcript annotation.")
}
if (!is(txdb, "TxDb")) {
message("Make the TxDb object ... ", appendLF = FALSE)
txdb <- suppressMessages( makeTxDbFromUCSC(txdb) )
message("OK")
}
}
if( peak_calling_mode != "exon") {
txdb <- convertTxDb(txdb, type = peak_calling_mode)
}
argg <- as.list(environment()) #Get arguments information
if(length(argg$library_type) > 1) argg$library_type = argg$library_type[1]
if(length(argg$export_format) > 1) argg$export_format = argg$export_format[1]
argg <- lapply(argg, capture.output)
#Check the completeness of the genome annotation
merip_bam_lst <- scanMeripBAM(
bam_ip = bam_ip,
bam_input = bam_input,
bam_treated_ip = bam_treated_ip,
bam_treated_input = bam_treated_input,
paired_end = paired_end,
library_type = library_type
)
sep <- exomePeakCalling(merip_bams = merip_bam_lst,
txdb = txdb,
bsgenome = bsgenome,
gff_dir = gff_dir,
fragment_length = fragment_length,
binding_length = binding_length,
step_length = binding_length,
peak_width = fragment_length/2,
pc_count_cutoff = pc_count_cutoff,
bg_count_cutoff = bg_count_cutoff,
p_cutoff = p_cutoff,
p_adj_cutoff = p_adj_cutoff,
log2FC_cutoff = log2FC_cutoff,
consistent_peak = consistent_peak,
consistent_log2FC_cutoff = consistent_log2FC_cutoff,
consistent_fdr_cutoff = consistent_fdr_cutoff,
alpha = alpha,
p0 = p0,
parallel = parallel,
mod_annot = mod_annot,
background_method = background_method,
manual_background = manual_background,
correct_GC_bg = correct_GC_bg,
glm_type = glm_type,
qtnorm = qtnorm
)
sep <- estimateSeqDepth(sep)
if(!is.null(bsgenome)) {
message("Estimate offsets of GC content biases on modification peaks/sites ... ", appendLF = FALSE)
sep <- normalizeGC(sep,
feature = ifelse(correct_GC_bg,"Background","All"),
qtnorm = qtnorm)
message("OK")
}
if(any(sep$design_Treatment)){
message("Differential modification analysis with interactive GLM ... ", appendLF = FALSE)
sep <- suppressMessages( glmDM(sep, LFC_shrinkage = LFC_shrinkage, glm_type = glm_type) )
message("OK")
} else {
sep <- glmM(sep, LFC_shrinkage = LFC_shrinkage, glm_type = glm_type)
}
if( export_results ) {
exportResults(sep,
format = export_format,
inhibit_filter = (!is.null( mod_annot ))|(any(grepl("Diff",colnames(exomePeak2Results( sep ))))),
table_style = table_style,
save_dir = save_dir)
writeLines( format_argg(argg) , file.path(save_dir, "RunInfo.txt") )
}
if( !is.null(bsgenome) & save_plot_GC ) {
suppressMessages(
plotLfcGC(sep = sep,
save_pdf_prefix = save_plot_name,
save_dir = save_dir)
)
}
if (save_plot_analysis) {
# The function of Guitar plot is currently inhibited in exomePeak2, it will be recovered when the Guitar package has fixed some critical issues.
#
# if (!require(Guitar)) {
# warning(
# "the 'Guitar' package is not installed, skipping the distribution plot on travis coordinate."
# )
# } else {
# message("Generating the distribution plot on travis coordinate ...")
# plotGuitar(
# sep,
# txdb = txdb,
# save_pdf_prefix = save_plot_name,
# save_dir = save_dir
# )
# }
plotExonLength(sep,
txdb = txdb,
save_pdf_prefix = save_plot_name,
save_dir = save_dir)
plotReadsGC(sep = sep,
save_pdf_prefix = save_plot_name,
save_dir = save_dir)
}
return(sep)
}
Any scripts or data that you put into this service are public.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.