generateVcfReport | R Documentation |
This function reports base frequencies at particular genomic positions and tests their association with the methylation status of the sequencing reads.
generateVcfReport(
bam,
vcf,
vcf.style = NULL,
bed = NULL,
report.file = NULL,
zero.based.bed = FALSE,
threshold.reads = TRUE,
threshold.context = c("CG", "CHG", "CHH", "CxG", "CX"),
min.context.sites = 2,
min.context.beta = 0.5,
max.outofcontext.beta = 0.1,
...,
gzip = FALSE,
verbose = TRUE
)
bam |
BAM file location string OR preprocessed output of
|
vcf |
Variant Call Format (VCF) file location string OR a VCF object
returned by |
vcf.style |
string for the seqlevels style of the VCF file, if different from BED file/object. Only has effect when 'vcf' parameter points to the VCF file location and 'bed' is not NULL. Possible values:
|
bed |
Browser Extensible Data (BED) file location string OR object of
class |
report.file |
file location string to write the VCF report. If NULL
(the default) then report is returned as a
|
zero.based.bed |
boolean defining if BED coordinates are zero based (default: FALSE). |
threshold.reads |
boolean defining if sequence reads should be thresholded before counting bases in reference and variant epialleles (default: TRUE). Disabling thresholding is possible but makes no sense in the context of this function, because all the reads will be assigned to the variant epiallele, which will result in Fisher's Exact test p-value of 1 (in columns 'FEp+' and 'FEP-'). As thresholding is not recommended for long-read sequencing data, this function is not recommended for such data either. |
threshold.context |
string defining cytosine methylation context used for thresholding the reads:
This option has no effect when read thresholding is disabled. |
min.context.sites |
non-negative integer for minimum number of cytosines within the 'threshold.context' (default: 2). Reads containing fewer within-the-context cytosines are considered completely unmethylated (thus belonging to the reference epiallele). This option has no effect when read thresholding is disabled. |
min.context.beta |
real number in the range [0;1] (default: 0.5). Reads with average beta value for within-the-context cytosines below this threshold are considered completely unmethylated (thus belonging to the reference epiallele). This option has no effect when read thresholding is disabled. |
max.outofcontext.beta |
real number in the range [0;1] (default: 0.1). Reads with average beta value for out-of-context cytosines above this threshold are considered completely unmethylated (thus belonging to the reference epiallele). This option has no effect when read thresholding is disabled. |
... |
other parameters to pass to the
|
gzip |
boolean to compress the report (default: FALSE). |
verbose |
boolean to report progress and timings (default: TRUE). |
Using BAM reads and sequence variation information as an input, 'generateVcfReport' function thresholds the reads (for paired-end sequencing alignment files - read pairs as a single entity) according to supplied parameters and calculates the occurrence of Reference and Alternative bases within reads, taking into the account DNA strand the read mapped to and average methylation level (epiallele status) of the read.
The information on sequence variation can be supplied as a Variant Call
Format (VCF) file location or an object of class VCF, returned by the
readVcf
function call. As whole-genome VCF
files can be extremely large, it is strongly advised to use only relevant
subset of their data, prefiltering the VCF object manually before calling
'generateVcfReport' or specifying 'bed' parameter when 'vcf' points to the
location of such large VCF file. Please note that all the BAM, BED and VCF
files must use the same style for seqlevels (i.e. chromosome names).
After counting, function checks if certain bases occur more often within reads belonging to certain epialleles using Fisher Exact test (HTSlib's own implementation) and reports separate p-values for reads mapped to "+" (forward) and "-" (reverse) DNA strands.
Please note that the final report currently includes only the VCF entries with single-base REF and ALT alleles. Also, the default ('min.baseq=0') output of 'generateVcfReport' is equivalent to the one of 'samtools mplieup -Q 0 ...', and therefore may result in false SNVs caused by misalignments. Remember to increase 'min.baseq' ('samtools mplieup -Q' default value is 13) to obtain higher-quality results.
Read thresholding by an average methylation level used in this function
makes little sense for long-read sequencing alignments,
as such reads can cover multiple regions with very different DNA methylation
properties. Instead, please use extractPatterns
,
limiting pattern output to the region of interest only.
data.table
object containing VCF report or
NULL if report.file was specified. The report columns are:
name – variation identifier (e.g. "rs123456789")
seqnames – reference sequence name
range – genomic coordinates of the variation
REF – base at the reference allele
ALT – base at the alternative allele
[M|U][+|-][Ref|Alt] – number of Reference or Alternative bases that were found at this particular position within Methylated (above threshold) or Unmethylated (below threshold) reads that were mapped to "+" (forward) or "-" (reverse) DNA strand. NA values mean that it is not possible to determine the number of bases due to the bisulfite conversion-related limitations (C->T variants on "+" and G->A on "-" strands)
SumRef – sum of all Reference base counts
SumAlt – sum of all Alternative base counts
FEp+ – Fisher Exact test p-value for association of a variation with methylation status of the reads that map to the "+" (forward) DNA strand. Calculated using following contingency table:
M+Ref | M+Alt |
U+Ref | U+Alt |
FEp- – Fisher Exact test p-value for association of a variation with methylation status of the reads that map to the "-" (reverse) DNA strand. Calculated using following contingency table:
M-Ref | M-Alt |
U-Ref | U-Alt |
preprocessBam
for preloading BAM data,
generateCytosineReport
for methylation statistics at the level
of individual cytosines, generateBedReport
for genomic
region-based statistics, extractPatterns
for exploring
methylation patterns and plotPatterns
for pretty plotting
of its output, generateBedEcdf
for analysing the
distribution of per-read beta values, and 'epialleleR' vignettes for the
description of usage and sample data.
GRanges
class for working with genomic ranges,
readVcf
function for loading VCF data,
seqlevelsStyle
function for getting or setting
the seqlevels style.
capture.bam <- system.file("extdata", "capture.bam", package="epialleleR")
capture.bed <- system.file("extdata", "capture.bed", package="epialleleR")
capture.vcf <- system.file("extdata", "capture.vcf.gz",
package="epialleleR")
# VCF report
vcf.report <- generateVcfReport(bam=capture.bam, bed=capture.bed,
vcf=capture.vcf)
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.