library(knitr) opts_chunk$set(concordance=FALSE) knitr::opts_chunk$set(fig.width = 8) knitr::opts_chunk$set(fig.height = 5) set.seed(21666641)
Many tools for germline copy number variant (CNV) detection from NGS data have been developed. Usually, these tools were designed for different input data like WGS, WES or panel data, and their performance may depend on the CNV size. Available benchmarks show that all these tools produce false positives, sometimes reaching a very high number of them.
With the aim of reducing the number of false positives,
r BiocStyle::Biocpkg("CNVfilteR")
identifies those germline CNVs that can
be discarded. This task is performed by using the germline
single nucleotide variant (SNV) calls that are usually obtained in
common NGS pipelines. As VCF field interpretation is key when working
with these files, r BiocStyle::Biocpkg("CNVfilteR")
specifically supports
VCFs produced by VarScan2, Strelka/Strelka2, freeBayes, HaplotypeCaller (GATK),
and UnifiedGenotyper (GATK). Additionally, results can be plotted using the
functions provided by the R/Bioconductor packages
karyoploteR and
CopyNumberPlots.
r BiocStyle::Biocpkg("CNVfilteR")
is a
Bioconductor package and to install it we have
to use r BiocStyle::Biocpkg("BiocManager")
.
if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager") BiocManager::install("CNVfilteR")
We can also install the package from github to get the latest devel version.
```` {r, eval = FALSE} BiocManager::install("jpuntomarcos/CNVfilteR")
# Quick Start Below we show a full example that covers the usual steps: CNVs data loading, SNVs loading, identifying false postives and plotting the results. First, we can load some CNV tool results: ```r library(CNVfilteR) cnvs.file <- system.file("extdata", "DECoN.CNVcalls.csv", package = "CNVfilteR", mustWork = TRUE) cnvs.gr <- loadCNVcalls(cnvs.file = cnvs.file, chr.column = "Chromosome", start.column = "Start", end.column = "End", cnv.column = "CNV.type", sample.column = "Sample", genome = "hg19")
Then, we load the SNVs stored in a couple of VCF files.
vcf.files <- c(system.file("extdata", "variants.sample1.vcf.gz", package = "CNVfilteR", mustWork = TRUE), system.file("extdata", "variants.sample2.vcf.gz", package = "CNVfilteR", mustWork = TRUE)) vcfs <- loadVCFs(vcf.files, cnvs.gr = cnvs.gr)
We observe that the function recognized VarScan2 as the source, so fields
were selected and allele frequency consequently. Now we can call
filterCNVs()
to identify those CNVs that can be discarded.
results <- filterCNVs(cnvs.gr, vcfs) names(results)
And we can check those CNVs that can be filtered out:
results$cnvs[results$cnvs$filter == TRUE]
As an example, we can observe that the CNV with cnv.id
=3 contains 4 variants:
2 in favor of discarding it, two against discarding it. If we want
to know more about the variants falling in a certain CNV we can do:
results$variantsForEachCNV[["3"]]
Two variants are close to the default expected heterozygous frequency,
0.5, so they obtain a positive score. The other two variants are not
so clearly close to the default expected duplication value, 0.33, so
they obtain a low negative score. All these default values and others can
be modified in the filterCNVs()
function.
Finally, we may be interested in plotting the results. For example, we can
plot easily the duplication CNV with cnv.id
=3 and all the variants falling in
it.
plotVariantsForCNV(results, "3")
We can do the same to plot the deletion CNV with cnv.id
=19, where all
variants discard the CNV except one homozygous variant that does not give us any
information for supporting or discarding the CNV:
plotVariantsForCNV(results, "19")
On the opposite side, we can observe those CNVs that cannot be discarded:
results$cnvs[results$cnvs$filter != TRUE]
For example, the CNV with cnv.id
=14 contains one variant. If we get
the variant info, we see that the variant has an allele frequency very close to
the default expected duplication value, 0.66.
results$variantsForEachCNV[["14"]]
So, no evidence was found for discarding the CNV. We can also plot the CNV and the variant:
plotVariantsForCNV(results, "3")
r BiocStyle::Biocpkg("CNVfilteR")
functions expect germline CNVs calls to be a
GRanges
object with a few specificic metadata columns:
You can create this object yourself, but
r BiocStyle::Biocpkg("CNVfilter")
provides the proper function to do this,
loadCNVcalls()
. This function can interpret any tsv o csv file by indicating
which columns store the information. For example, in the following code, the
chr.column
column is stored in the "Chromosome" column of the cnvs.file
.
cnvs.file <- system.file("extdata", "DECoN.CNVcalls.csv", package = "CNVfilteR", mustWork = TRUE) cnvs.gr <- loadCNVcalls(cnvs.file = cnvs.file, chr.column = "Chromosome", start.column = "Start", end.column = "End", cnv.column = "CNV.type", sample.column = "Sample", genome = "hg19")
loadCNVcalls()
can interpret different types of CNVs. Among other options,
separator can be selected using the sep
parameter (defaults to \t),
and first lines can be skipped using the skip
parameter (defaults to 0). Also,
the value used in cnv.column
to store the CNV type can be modified
using the deletion
and duplication
parameters (defaults to "deletion" and
"duplication", respectively). If, for example, our cnv.column
uses "CN1" and
"CN3" for deletion and duplication (respectively), we should indicate
the function to use these values:
cnvs.gr.2 <- loadCNVcalls(cnvs.file = cnvs.file.2, deletion = "CN1", duplication = "CN3", chr.column = "Chromosome", start.column = "Start", end.column = "End", cnv.column = "CNV.type", sample.column = "Sample")
Some CNV tools generate results where the CNV location is stored in a single
column with the format chr:start-end (i.e. 1:538001-540000). In this
case, we can call loadCNVcalls()
using the coord.column
instead of the
chr.column
, start.column
and end.column
columns.
Common NGS pipelines produce germline variant calling (SNVs or INDELs)
in a VCF file. However, VCF interpretation is
challenging due to the flexibility provided by the VCF format definition.
It is not straightforward to interpret correctly the fields in the VCF file
and compute the allele frequency. loadVCFs()
interprets automatically
VCFs produced by VarScan2, Strelka/Strelka2, freeBayes, HaplotypeCaller (GATK),
and UnifiedGenotyper (GATK).
In the following example the function recognizes VarScan2 as the source.
vcf.files <- c(system.file("extdata", "variants.sample1.vcf.gz", package = "CNVfilteR", mustWork = TRUE), system.file("extdata", "variants.sample2.vcf.gz", package = "CNVfilteR", mustWork = TRUE)) vcfs <- loadVCFs(vcf.files, cnvs.gr = cnvs.gr)
We can also load the VCF file spicifying how to interpret it, which can be
useful if the VCF was generated by a caller not supported by
r BiocStyle::Biocpkg("CNVfilteR")
. For example we can specify the ref/alt
fields:
vcfs <- loadVCFs(vcf.files, cnvs.gr = cnvs.gr, vcf.source = "MyCaller", ref.support.field = "RD", alt.support.field = "AD")
Alternatively, we can set the list.support.field
parameter so that field
will be loaded assuming that it is a list in this order: reference allele,
alternative allele. As an example:
vcf.file3 <- c(system.file("extdata", "variants.sample3.vcf", package = "CNVfilteR", mustWork = TRUE)) vcfs3 <- loadVCFs(vcf.file3, cnvs.gr = cnvs.gr, vcf.source = "MyCaller", list.support.field = "AD")
CNVfilteR uses SNVs to identify false-positive CNV calls. Therefore, its performance depends on the SNV calls quality. We recommend using VCF files free of false-positive SNVs (as possible) to improve CNVfilteR accuracy. Some considerations can be followed in order to provide reliable SNVs to CNVfilteR.
Use the min.total.depth
parameter to discard SNVs with low depth coverage in
the loadVCFs
function. The default
value is 10, which may be appropriate in many WGS samples, but
this value should be adapted to your experiment conditions. For
example, we used a min.total.depth
of 30 when using CNVfilteR on panel (targeted-enrinched) samples with high
coverage and VarScan2 as SNV caller.
Low complexity and repetitive regions are genome areas where SNV callers
(also CNV callers) perform poorly. If possible, ignore these regions when using
CNVfilteR. We can exclude
those complex regions that have already known alignement artifacts
with the parameter regions.to.exclude
.
In this example, we exclude PMS2, PRSS1, and
FANCD2 genes because they are pseudogenes with alignments artifacts:
regions.to.exclude <- GRanges(seqnames = c("chr3","chr7", "chr7"), ranges = IRanges(c(10068098, 6012870, 142457319), c(10143614, 6048756, 142460923))) vcfs4 <- loadVCFs(vcf.files, cnvs.gr = cnvs.gr, regions.to.exclude = regions.to.exclude)
Also, the parameter exclude.indels
indicates whether to exclude INDELs when
loading the variants. TRUE is the default and recommended value given
that INDELs allele frequency varies differently than SNVs. Including
INDELs may allow the algorithm to identify more CNVs to discard with a greater
risk of identifying them wrongly. Additionally, any SNV overlapping an INDEL
will be ignored because the SNV allele frequency may be affected in that region.
The function loadVCFs()
also adapts to different needs. If sample.names
parameter is
not provided, the sample names included in the VCF itself will be used.
Both single-sample and multi-sample VCFs are accepted, but when
multi-sample VCFs are used, sample.names
parameter must be NULL.
If VCF is not compressed with bgzip, the function compresses it and generates the .gz file. If .tbi file does not exist for a given VCF file, the function also generates it. All files are generated in a temporary folder.
See loadVCFs()
documentation to see other parameters info.
The task of identifying false positives is performed by the filterCNVs()
function. It checks all the variants (SNVs and optionally INDELs) falling in
each CNV present in cnvs.gr
to identify those CNVs that can be filtered out. It
returns an S3 object with 3 elements: cnvs
, variantsForEachCNV
and
filterParameters
:
results <- filterCNVs(cnvs.gr, vcfs) tail(results$cnvs)
Observe that those CNVs that can be filtered out have the value TRUE in the
column filter
. r BiocStyle::Biocpkg("CNVfilteR")
employs two
different strategies for identifying those CNVs:
ht.deletions.threshold
% of heterozygous variants in
the CNV. Default ht.deletions.threshold
value is 30, so 30% is required.score
is >=
dup.threshold.score
after computing all
heterozygous variants falling in that CNV. Default dup.threshold.score
value
is 0.5. How the score is computed for each variant is explained in
the next section.The scoring model for determining whether a certain duplication CNV can be discarded is based on the allele frequency for each heterozygous variant falling in that CNV:
expected.ht.mean
). So, a variant with an allele frequency close to 50%
gives us evidence of the non-existence of a duplication CNV, so the CNV could
be discarded.expected.dup.ht.mean1
) when the variant is not in the same allele
than the duplication CNV, and 66.6% (expected.dup.ht.mean2
) when the variant
is in the same allele than
the duplication CNV call. So, a variant with an allele frequency close to
33.3% or 66.6% gives us evidence of the existence of duplication CNV.Although we can expect that most of the variants are close to the expected means
(33.3%, 50%, and 66.6%), many of them can be far from any expected mean. The
scoring model implemented in the filterCNVs()
function measures
the evidence - for discarding a certain CNV - using
a scoring model. The scoring model is based on the fuzzy logic, where elements
can have any value between 1 (True) and 0 (False). Following this idea,
each variant will be scored with a value between 0 and 1 depending on
how close is the allele frequency to the nearest expected mean.
The total score
is computed among all the variants falling in the CNV. If the
score
is greater than the dup.threshold.score
, the CNV can be discarded.
A common way of applying the fuzzy logic is using the sigmoid function. CNVfilteR uses the sigmoid function implemented in the pracma package, which is defined as $$ \begin{aligned} y = 1/(1 + e^{-c1(x−c2)}) \end{aligned} $$
The scoring model is built on 6 sigmoids defined on 6 different intervals. The
c1 parameter is 2 by default (sigmoid.c1
), and the c2 parameter
is defined for the 6 sigmoids (sigmoid.c2.vector
).
expected.dup.ht.mean1
], c2=28expected.dup.ht.mean1
, sigmoid.int1
], c2=38.3sigmoid.int1
, expected.ht.mean
], c2=44.7expected.ht.mean
, sigmoid.int2
], c2=55.3sigmoid.int2
, expected.dup.ht.mean2
], c`=61.3expected.dup.ht.mean2
, 80], c2=71.3Where sigmoid.int1
is the mean between expected.dup.ht.mean1
and
expected.ht.mean
, and sigmoid.int2
is the mean between
expected.dup.ht.mean2
and expected.ht.mean
.
The scoring model can be plotted using the plotScoringModel()
function.
p <- results$filterParameters plotScoringModel(expected.ht.mean = p$expected.ht.mean, expected.dup.ht.mean1 = p$expected.dup.ht.mean1, expected.dup.ht.mean2 = p$expected.dup.ht.mean2, sigmoid.c1 = p$sigmoid.c1, sigmoid.c2.vector = p$sigmoid.c2.vector)
And the scoring model can be modified when calling the filterCNVs()
function.
Let's see how the model changes when we modify the parameter sigmoid.c1
(1 instead of 2):
plotScoringModel(expected.ht.mean = p$expected.ht.mean, expected.dup.ht.mean1 = p$expected.dup.ht.mean1, expected.dup.ht.mean2 = p$expected.dup.ht.mean2, sigmoid.c1 = 1, sigmoid.c2.vector = p$sigmoid.c2.vector)
We can also modify the sigmoid.c2.vector
parameter for each sigmoid function. For example, to
make the central sigmoids narrower:
plotScoringModel(expected.ht.mean = p$expected.ht.mean, expected.dup.ht.mean1 = p$expected.dup.ht.mean1, expected.dup.ht.mean2 = p$expected.dup.ht.mean2, sigmoid.c1 = p$sigmoid.c1, sigmoid.c2.vector = c(28, 38.3, 46.7, 53.3, 61.3, 71.3))
Many CNV callers produce inaccurate CNV calls. These inaccurate CNV calls are
more likely to be true (to overlap the real CNV) in the middle of the CNV than
in the extremes. So, the margin.pct
parameter defines the percentage of CNV
(from each CNV limit) where SNVs will be ignored. By default, only 10% of SNVs
from each CNV extreme will be ignored. This margin.pct
parameter can be
modified to better adapt it to your CNV caller. For example, we observed that
DECoN produced very accurate CNV calls in our genes panel dataset, so
margin.pct
value was updated to 0 in this context.
Summarizing, variants in the CNV call but close to the ends of the CNV call will
be ignored. margin.pct
defines the percentage of CNV length, located at
each CNV limit, where variants will be ignored. For example, for a CNV
chr1:1000-2000 and a margin.pct
value of 10,
variants within chr1:1000-1100 and chr1:1900-2000 will be ignored.
We can plot easily a certain CNV and the variants in it. For example, the
duplication CNV with cnv.id
=17 can be plotted as follows:
plotVariantsForCNV(results, "16")
Some parameters can be customized, like points.cex
and points.pch
. It is
also possible to plot all CNVs in a global schema where all the chromosomes are
plotted:
cnvs.file <- system.file("extdata", "DECoN.CNVcalls.2.csv", package = "CNVfilteR", mustWork = TRUE) cnvs.gr.2 <- loadCNVcalls(cnvs.file = cnvs.file, chr.column = "Chromosome", start.column = "Start", end.column = "End", cnv.column = "CNV.type", sample.column = "Sample", genome = "hg19") plotAllCNVs(cnvs.gr.2)
Note that if a CNV is too small, it will not be visible when calling
plotAllCNVs()
.
sessionInfo()
Any scripts or data that you put into this service are public.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.