callGenotypes | R Documentation |
Calls genotypes from a set of tallies (such as a VRanges
or VCF
file) and the coverage (currently as a BigWigFile
). We call the
genotype with the highest likelihood, where the likelihood is based on
a binomial model of the variant frequency.
## S4 method for signature 'VRanges'
callGenotypes(variants, cov,
param = CallGenotypesParam(variants),
BPPARAM = defaultBPPARAM())
## S4 method for signature 'TabixFile'
callGenotypes(variants, cov,
param = CallGenotypesParam(variants),
BPPARAM = defaultBPPARAM())
CallGenotypesParam(genome,
gq.breaks = c(0, 5, 20, 60, Inf),
p.error = 0.05,
which = tileGenome(seqinfo(genome), ntile=ntile),
ntile = 100L)
variants |
Either |
cov |
The coverage, as an RleList or a |
param |
Parameters controlling the genotyping, constructed by
|
genome |
An object with a |
gq.breaks |
A numeric vector representing an increasing sequence of genotype quality breaks to segment the wildtype runs. |
p.error |
The binomial probability for an error. This is used to calculate the expected frequency of hom-ref and hom-alt variants. |
which |
A |
ntile |
When |
BPPARAM |
A |
In general, the behavior is very similar to that of the GATK
UnifiedGenotyper (see references). For every position in the tallies,
we compute a binomial likelihood for each of wildtype (0/0), het (0/1)
and hom-alt (1/1), assuming the alt allele frequency to be
p.error
, 0.5
and 1 - p.error
, respectively. The
genotype with the maximum likelihood is chosen as the genotype, and
the genotype quality is computed by taking the fraction of the maximum
likelihood over the sum of the three likelihoods.
We assume that any position not present in the input tallies is
wildtype (0/0) and compute the quality for every such position, using
the provided coverage data. For scalability reasons, we segment runs
of these positions according to user-specified breaks on the genotype
quality. The segments become special records in the returned
VRanges
, where the range represents the segment, the ref
is the first reference base, alt
is <NON_REF>
and the
totalDepth
is the mean of the coverage.
The genotype information is recorded as metadata columns named according to gVCF conventions:
GT
The genotype call string: 0/0
, 0/1
,
1/1
.
GQ
The numeric genotype quality, phred scaled. For wildtype runs, this is minimum over the run.
PL
A 3 column matrix with the likelihood for each genotype, phred scaled. We take the minimum over wildtype runs.
MIN_DP
The minimum coverage over a wildtype
run; NA
for single positions.
For callGenotypes
, a VRanges
annotated with the genotype
call information, as described in the details.
Michael Lawrence
The Genome Analysis Toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data McKenna A, Hanna M, Banks E, Sivachenko A, Cibulskis K, Kernytsky A, Garimella K, Altshuler D, Gabriel S, Daly M, DePristo MA, 2010 GENOME RESEARCH 20:1297-303.
bams <- LungCancerLines::LungCancerBamFiles()
data(vignette)
tallies <- tallies_H1993
sampleNames(tallies) <- "H1993"
mcols(tallies) <- NULL
cov <- coverage(bams$H1993)
## simple usage
## (need gmapR to find the genome in the GMAP database, otherwise,
## provide sequence directly as shown later)
if (requireNamespace("gmapR", quietly=TRUE)) {
genotypes <- callGenotypes(tallies, cov,
BPPARAM=BiocParallel::SerialParam())
}
## customize
params <- CallGenotypesParam(genome_p53, p.error = 1/1000)
genotypes <- callGenotypes(tallies, cov, params)
## write to gVCF
writeVcf(genotypes, tempfile("genotypes", fileext="vcf"), index=TRUE)
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.