stopifnot( requireNamespace("pander") ) library(knitr) opts_chunk$set( # Set these first, to make them default message = FALSE, warning=FALSE, error=FALSE ) optChunkDefault <- opts_chunk$get()
The VCF Tool Box (r Biocpkg("TVTB")
) offers S4 classes and methods
to filter, summarise and visualise genetic variation data
stored in VCF files pre-processed by the
Ensembl Variant Effect Predictor
(VEP) [@RN1].
An RStudio/Shiny web-application,
the Shiny Variant Explorer (tSVE),
provides a convenient interface to demonstrate those
functionalities integrated in a programming-free environment.
Currently, major functionalities in the r Biocpkg("TVTB")
package include:
A class to store recurrent parameters of genetic analyses
r Biocpkg("BiocParallel")
)Genotype counts and allele frequencies
ExpandedVCF
objects
(i.e. bi-allelic records)Classes of VCF filter rules
fixed
slot of an VCF
objectinfo
slot of an VCF
objectVCF
objects using the above filter rulesInstructions to install the VCF Tool Box are available here.
Once installed, the package can be loaded and attached as follows:
library(TVTB)
Most functionalities in r Biocpkg("TVTB")
require recurrent information
such as:
<phenotype>_<level>_<suffix>
,<suffix>
.\bioccomment{
INFO key suffixes should be considered carefully to avoid conflicts
with INFO keys already present in the VCF
object.
}
To reduce the burden of repetition during programming, and to
facilitate analyses using consistent sets of parameters,
r Biocpkg("TVTB")
implements the TVTBparam
class.
The TVTBparam
class offer a container for parameters recurrently used
across the package.
A TVTBparam
object may be initialised as follows:
tparam <- TVTBparam(Genotypes( ref = "0|0", het = c("0|1", "1|0", "0|2", "2|0", "1|2", "2|1"), alt = c("1|1", "2|2")), ranges = GenomicRanges::GRangesList( SLC24A5 = GenomicRanges::GRanges( seqnames = "15", IRanges::IRanges( start = 48413170, end = 48434757) ) ) )
TVTBparam
objects have a convenient summary view and accessor methods:
tparam
In this example:
genos(x)
Genotypes
"0|0"
."REF"
."0|1"
, "1|0"
, "0|2"
, "2|0"
, "1|2"
, and "2|1"
."HET"
."1|1"
."ALT"
.ranges(x)
GRangesList
"15"
.aaf(x)
"AAF"
.maf(x)
"MAF"
.vep(x)
"CSQ"
.bp(x)
BiocParallelParam
svp(x)
ScanVcfParam
which
slot automatically populated with reduce(unlist(ranges(x)))
Default values are provided for all slots except genotypes, as these may vary more frequently from one data set to another (e.g. phased, unphased, imputed).
Functionalities in r Biocpkg("TVTB")
support
CollapsedVCF
and ExpandedVCF
objects
(both extending the virtual class VCF
) of the
r Biocpkg("VariantAnnotation")
package.
Typically, CollapsedVCF
objects are produced by the
r Biocpkg("VariantAnnotation")
readVcf
method after parsing a VCF file,
and ExpandedVCF
objects result of the
r Biocpkg("VariantAnnotation")
expand
method applied to
a CollapsedVCF
object.
Any information that users deem relevant for the analysis may be
imported from VCF files and stored in
VCF
objects passed to r Biocpkg("TVTB")
methods.
However, to enable the key functionalities of the package,
the slots of a VCF
object should include
at least the following information:
fixed(x)
"REF"
and "ALT"
.info(x)
<vep>
: where <vep>
stands for the INFO key where
the Ensembl VEP predictions are stored in the VCF
object.geno(x)
GT
: genotypes.colData(x)
: phenotypes.
In the near future, r Biocpkg("TVTB")
functionalities are expected to
produce summary statistics and plots faceted by meta-features,
each potentially composed of multiple genomic ranges.
For instance, burden tests may be performed on a set of transcripts,
considering only variants in their respective sets of exons.
The r Biocpkg("GenomicRanges")
GRangesList
class is an ideal container
in example,
as each GRanges
in the GRangesList
would represent a transcript,
and each element in the GRanges
would represent an exon.
Furthermore, TVTBparam
objects may be supplied as the param
argument of the r Biocpkg("VariantAnnotation")
readVcf
. In this case, the
TVTBparam
object is used to import only variants overlapping
the relevant genomic regions.
Moreover, the readVcf
method also ensured that the vep
slot of the
TVTBparam
object is present in the header of the VCF file.
svp <- as(tparam, "ScanVcfParam") svp
Although VCF
objects may be constructed without attached phenotype data,
phenotype information is critical to interpret and compare genetic variants
between groups of samples
(e.g. burden of damaging variants in different phenotype levels).
VCF
objects accept phenotype information
(as r Biocpkg("S4Vectors")
DataFrame
) in the colData
slot.
This practice has the key advantage of keeping phenotype and genetic
information synchronised through operation such as subsetting and re-ordering,
limiting workspace entropy
and confusion.
An ExpandedVCF
object that contains the minimal data necessary for the rest
of the vignette can be created as follows:
Step 1: Import phenotypes
phenoFile <- system.file( "extdata", "integrated_samples.txt", package = "TVTB") phenotypes <- S4Vectors::DataFrame( read.table(file = phenoFile, header = TRUE, row.names = 1))
Step 2: Define the VCF file to parse
vcfFile <- system.file( "extdata", "chr15.phase3_integrated.vcf.gz", package = "TVTB") tabixVcf <- Rsamtools::TabixFile(file = vcfFile)
Step 3: Define VCF import parameters
VariantAnnotation::vcfInfo(svp(tparam)) <- vep(tparam) VariantAnnotation::vcfGeno(svp(tparam)) <- "GT" svp(tparam)
Of particular interest in the above chunk of code:
TVTBparam
constructor previously populated the which
slot of svp
with "reduced" (i.e. non-overlapping)
genomic ranges defined in the ranges
slot.vep
slot will be importedStep 4: Import and pre-process variants
# Import variants as a CollapsedVCF object vcf <- VariantAnnotation::readVcf( tabixVcf, param = tparam, colData = phenotypes) # Expand into a ExpandedVCF object (bi-allelic records) vcf <- VariantAnnotation::expand(x = vcf, row.names = TRUE)
Of particular interest in the above chunk of code, the readVcf
method is
given:
TVTBparam
parameters, invoking the corresponding method signaturerownames
of those phenotypes defines the sample identifiers that
are queried from the VCF file.colData
slot of the resulting
VCF
object.The result is an ExpandedVCF
object that includes variants
in the targeted genomic range(s) and samples:
vcf
Although interesting figures and summary tables may be obtained
as soon as the first ExpandedVCF
object is created
(see section Summarising Ensembl VEP predictions),
those methods may benefit from information added to additional INFO keys
after data import, either manually by the user, or through various
methods implemented in the r Biocpkg("TVTB")
package.
For instance, the method addOverallFrequencies
uses the
reference homozoygote (REF), heterozygote (HET),
and homozygote alternate (ALT) genotypes defined in the TVTBparam
object
stored in the VCF
metadata
to obtain the count of each genotype in an ExpandedVCF
object.
Immediately thereafter, the method uses those counts to calculate
alternate allele frequency (AAF) and minor allele frequency (MAF).
Finally, the method stores the five calculated values
(REF, HET, ALT, AAF, and MAF)
in INFO keys defined by suffixes also declared in the TVTBparam
object.
initialInfo <- colnames(info(vcf)) vcf <- addOverallFrequencies(vcf = vcf) setdiff(colnames(info(vcf)), initialInfo)
Notably, the addOverallFrequencies
method is synonym to the addFrequencies
method missing the argument phenos
:
vcf <- addFrequencies(vcf = vcf, force = TRUE)
\bioccomment{
The optional argument force = TRUE
is used here
to overwrite INFO keys created in the previous chunk of code.
}
Similarly, the method addPhenoLevelFrequencies
obtains the count of each
genotype in samples associated with given level(s) of given phenotype(s),
and stores the calculated values in INFO keys defined as
<pheno>_<level>_<suffix>
, with suffixes defined in the TVTBparam
object
stored in the VCF
metadata.
initialInfo <- colnames(info(vcf)) vcf <- addPhenoLevelFrequencies( vcf = vcf, pheno = "super_pop", level = "AFR") setdiff(colnames(info(vcf)), initialInfo)
Notably, the addPhenoLevelFrequencies
method is synonym
to the addFrequencies
method called with the argument phenos
given as a list where names
are
phenotypes, and values are character
vectors of levels to
process within each phenotype:
initialInfo <- colnames(info(vcf)) vcf <- addFrequencies( vcf, list(super_pop = c("EUR", "SAS", "EAS", "AMR")) ) setdiff(colnames(info(vcf)), initialInfo)
In addition, the addFrequencies
method can be given a character
vector
of phenotypes as the phenos
argument, in which case frequencies are
calculated for all levels of the given phenotypes:
vcf <- addFrequencies(vcf, "pop") head(grep("^pop_[[:alpha:]]+_REF", colnames(info(vcf)), value = TRUE))
Although VCF
objects are straightforward to subset
using either indices and row names
(as they inherit from the r Biocpkg("SummarizedExperiment")
RangedSummarizedExperiment
class),
users may wish to identify variants that pass combinations of criteria based on
information in their fixed
slot, info
slot, and Ensembl VEP predictions,
a non-trivial task due to those pieces of information being stored in
different slots of the VCF
object, and the 1:N relationship
between variants and EnsemblVEP predictions.
To facilitate the definition of VCF filter rules, and their application to
VCF
objects, r Biocpkg("TVTB")
extends the r Biocpkg("S4Vectors")
FilterRules
class in four new classes of filter rules:
motivations <- readRDS( system.file("misc", "motivations_rules.rds", package = "TVTB") ) pander::pandoc.table( motivations, paste( "Motivation for each of the new classes extending `FilterRules`, to define VCF filter rules." ), style = "multiline", split.cells = c(15, 45), split.tables = c(Inf) )
Note that FilterRules
objects themselves are applicable to VCF
objects,
with two important difference from the above specialised classes:
VCF
slotsVCF
slots, for instance:S4Vectors::FilterRules(list( mixed = function(x){ VariantAnnotation::fixed(x)[,"FILTER"] == "PASS" & VariantAnnotation::info(x)[,"MAF"] >= 0.05 } ))
Instances of those classes may be initialised as follows:
VcfFixedRules
fixedR <- VcfFixedRules(list( pass = expression(FILTER == "PASS"), qual = expression(QUAL > 20) )) fixedR
VcfInfoRules
infoR <- VcfInfoRules( exprs = list( rare = expression(MAF < 0.01 & MAF > 0), common = expression(MAF > 0.05), mac_ge3 = expression(HET + 2*ALT >= 3)), active = c(TRUE, TRUE, FALSE) ) infoR
The above code chunk illustrates useful features of FilterRules
:
FilterRules
are initialised in an active state by default
(evaluating an inactive rule returns TRUE
for all items)
The active
argument may be used to initialise specific filter rules in
an inactive state.expression
(or function
) may refer to multiple columns of
the relevant slot in the VCF
object.VCF
object.VcfVepRules
vepR <- VcfVepRules(exprs = list( missense = expression(Consequence %in% c("missense_variant")), CADD_gt15 = expression(CADD_PHRED > 15) )) vepR
VcfFilterRules
VcfFilterRules
combine VCF filter rules of different types
in a single object.
vcfRules <- VcfFilterRules(fixedR, infoR, vepR) vcfRules
This vignette offers only a brief peek into the utility and flexibility of
VCF filter rules. More (complex) examples are given in a separate
vignette, including filter rules using functions and pattern matching.
The documentation of the r Biocpkg("S4Vectors")
package---where the parent
class FilterRules
is defined---can also be a source of inspiration.
As the above classes of VCF filter rules inherit
from the r Biocpkg("S4Vectors")
FilterRules
class,
they also benefit from its accessors and methods.
For instance, VCF filter rules can easily be toggled
between active and inactive states:
active(vcfRules)["CADD_gt15"] <- FALSE active(vcfRules)
A separate vignette describes in greater detail the use of classes that contain VCF filter rules.
Once defined, the above filter rules can be applied to ExpandedVCF
objects,
in the same way as FilterRules
are evaluated in a given environment
(see the r Biocpkg("S4Vectors")
documentation):
summary(eval(expr = infoR, envir = vcf)) summary(eval(expr = vcfRules, envir = vcf)) summary(evalSeparately(expr = vcfRules, envir = vcf))
\bioccomment{
Note how the inactive rules (i.e. CADD
, pass
) returns TRUE
for all
variants, irrespective of the filter result.
}
Let us show the alternate allele frequency (AAF) of common variants, estimated in each super-population, in the context of the transcripts ovelapping the region of interest.
In the MAF
track:
plotInfo( subsetByFilter(vcf, vcfRules["common"]), "AAF", range(GenomicRanges::granges(vcf)), EnsDb.Hsapiens.v75::EnsDb.Hsapiens.v75, "super_pop", zero.rm = FALSE )
Alternatively, the minor allele frequency (MAF) of missense variants
(as estimated from the entire data set) may be visualised in the same manner.
However, due to the nature of those variants, the zero.rm
argument may be
set to TRUE
to hide all data points showing a MAF of 0
; thereby
variants actually detected in each super-population are emphasised
even at low frequencies.
plotInfo( subsetByFilter(vcf, vcfRules["missense"]), "MAF", range(GenomicRanges::granges(vcf)), EnsDb.Hsapiens.v75::EnsDb.Hsapiens.v75, "super_pop", zero.rm = TRUE )
Using the r CRANpkg("GGally")
ggpairs
method,
let us make a matrix of plots for common variants, showing:
[^1]: Pearson, by default
pairsInfo(subsetByFilter(vcf, vcfRules["common"]), "AAF", "super_pop")
Note that the ellipsis ...
allows a high degree of customisation,
as it passes additional arguments to the underlying ggpairs
method.
This section presents upcoming features.
As soon as genetic and phenotypic information are imported
into an ExpandedVCF
object,
or after the object was extended with additional information,
the scientific value of the data may be revealed by
a variety of summary statistics and graphical representations.
This section will soon present several ideas being implemented in
r Biocpkg("TVTB")
, for instance:
Dr. Stefan Gräf and Mr. Matthias Haimel for advice on the VCF file format and the Ensembl VEP script. Prof. Martin Wilkins for his trust and support. Dr. Michael Lawrence for his helpful code review and suggestions.
Last but not least, the amazing collaborative effort of the rep("many",n)
Bioconductor developers whose hard work
appears through the dependencies of this package.
Here is the output of sessionInfo()
on the system on which this document was
compiled:
sessionInfo()
Any scripts or data that you put into this service are public.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.