readVariantInfo-methods: Read information about variants from VCF file


Description

A fast, lightweight function that determines information on variants occurring in a VCF file and returns the result as a VariantInfo object.

Usage

## S4 method for signature 'TabixFile,GRanges'
readVariantInfo(file, regions, subset,
                noIndels=TRUE, onlyPass=TRUE,
                na.limit=1, MAF.limit=1,
                na.action=c("impute.major", "omit", "fail"),
                MAF.action=c("ignore", "omit", "invert", "fail"),
                omitZeroMAF=TRUE, refAlt=FALSE, sex=NULL)
## S4 method for signature 'TabixFile,missing'
readVariantInfo(file, regions, ...)
## S4 method for signature 'character,GRanges'
readVariantInfo(file, regions, ...)
## S4 method for signature 'character,missing'
readVariantInfo(file, regions, ...)

Arguments

file

a TabixFile object or a character string with a file name of the VCF file to read from; if file is a file name, the method internally creates a TabixFile object for this file name.

regions

a GRanges object that specifies which genomic regions to read from the VCF file; if missing, the entire VCF file is read.

subset

a numeric vector with indices or a character vector with names of samples to restrict to; if specified, only these samples' genotypes are considered when determining the minor allele frequencies (MAFs) of variants.

noIndels

if TRUE (default), only single-nucleotide variants (SNVs) are considered and indels are skipped.

onlyPass

if TRUE (default), only variants whose value in the FILTER column is “PASS” are considered.

na.limit

all variants with a missing value ratio above this threshold will be omitted from the output object.

MAF.limit

all variants with an MAF above this threshold will be omitted from the output object.

na.action

if “impute.major”, all missing values are considered as major alleles when computing MAFs. If “omit”, all variants containing missing values will be omitted in the output object. If “fail”, the function stops with an error if a variant contains any missing values.

MAF.action

if “ignore” (default), no action is taken for variants with an MAF greater than 0.5; these variants are kept and included in the output object as they are. If “omit”, all variants with an MAF greater than 0.5 are omitted from the output object. If “fail”, the function stops with an error if any variant has an MAF greater than 0.5. If “invert”, all variants with an MAF exceeding 0.5 are inverted in the sense that all minor alleles are replaced by major alleles and vice versa. Note: if this setting is used in conjunction with refAlt=TRUE, the MAFs of the inverted variants no longer correspond to the true alternate allele.

omitZeroMAF

if TRUE (default), variants with an MAF of 0 are not considered and omitted from the output object.

refAlt

if TRUE, two metadata columns named “ref” and “alt” are added to the output object that contain reference and alternate alleles. Note that these sequences can be quite long for indels, which may result in large memory consumption. The default is FALSE.

sex

if NULL, all samples are treated the same without any modifications; if sex is a factor with levels F (female) and M (male) whose length equals that of subset or the number of samples in the VCF file, this argument is interpreted as the sex of the samples. In this case, the genotypes corresponding to male samples are doubled before computing MAFs. The option to supply the sex argument is meant to allow for a correct estimate of MAFs as readGenotypeMatrix and assocTest compute them. Note, however, that the MAFs computed in this way do not correspond to the true MAFs contained in the data.

...

for the latter three methods above, all other parameters are passed on to the method with signature TabixFile,GRanges.
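The effect of the sex argument on MAFs can be illustrated with a small base-R sketch. The helper mafWithSex is hypothetical and not part of podkat; it assumes genotypes are coded as minor allele counts and that the frequency is the sum of these counts divided by twice the number of samples, which may differ from what the package does internally.

```r
## Hypothetical helper, not part of podkat: frequency from allele counts,
## with the genotypes of male samples doubled first (as described for 'sex')
mafWithSex <- function(genotypes, sex = NULL) {
    if (!is.null(sex)) {
        stopifnot(is.factor(sex), length(sex) == length(genotypes))
        males <- sex == "M"
        genotypes[males] <- 2 * genotypes[males]
    }
    ## assumption: two alleles per sample in the denominator
    sum(genotypes) / (2 * length(genotypes))
}

g   <- c(0, 1, 1, 0, 2, 1)
sex <- factor(c("F", "M", "F", "M", "F", "M"), levels = c("F", "M"))

mafWithSex(g)        # all samples treated the same
mafWithSex(g, sex)   # male genotypes doubled before averaging
```

The second value is larger than the first because every non-zero male genotype counts twice, which is why the result no longer reflects the allele frequency actually present in the data.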

Details

This method uses the “tabix” API provided by the Rsamtools package to parse a VCF file. The readVariantInfo method considers each variant and determines its minor allele frequency (MAF) and the type of the variant. The result is returned as a VariantInfo object, i.e. a GRanges object with two metadata columns “MAF” and “type”. The former contains the MAF of each variant, while the latter is a factor column that contains information about the type of the variant. Possible values in this column are “INDEL” (insertion or deletion), “MULTIPLE” (single-nucleotide variant with multiple alternate alleles), “TRANSITION” (single-nucleotide variation A/G or C/T), “TRANSVERSION” (single-nucleotide variation A/C, A/T, C/G, or G/T), or “UNKNOWN” (anything else). If refAlt is TRUE, two further metadata columns “ref” and “alt” are included which contain reference and alternate alleles of each variant.
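The type categories listed above can be illustrated with a short base-R sketch. The function classifyVariant is a hypothetical helper written for this illustration only; it is not the code podkat uses, and the order in which the checks are applied is an assumption.

```r
## Hypothetical helper, not part of podkat: classify one variant from its
## REF allele and its (possibly comma-separated) ALT field
classifyVariant <- function(ref, alt) {
    alts <- strsplit(alt, ",", fixed = TRUE)[[1]]
    alleles <- c(ref, alts)
    ## anything that is not a plain nucleotide sequence (e.g. ".", "<DEL>")
    if (!all(grepl("^[ACGT]+$", alleles))) return("UNKNOWN")
    ## any allele longer than one base: insertion or deletion
    if (any(nchar(alleles) != 1)) return("INDEL")
    ## single-nucleotide variant with more than one alternate allele
    if (length(alts) > 1) return("MULTIPLE")
    pair <- paste(sort(alleles), collapse = "/")
    if (pair %in% c("A/G", "C/T")) return("TRANSITION")
    if (pair %in% c("A/C", "A/T", "C/G", "G/T")) return("TRANSVERSION")
    "UNKNOWN"
}

ref <- c("A", "T", "G",  "A",   "A")
alt <- c("G", "A", "GA", "C,T", ".")
types <- mapply(classifyVariant, ref, alt, USE.NAMES = FALSE)
factor(types, levels = c("INDEL", "MULTIPLE", "TRANSITION",
                         "TRANSVERSION", "UNKNOWN"))
```

Sorting the two alleles before comparing them means that A/G and G/A are treated alike, matching the unordered pairs given in the description.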

For all variants, filters in terms of missing values and MAFs can be applied. Moreover, variants with MAFs greater than 0.5 can be filtered out or inverted. For details, see the descriptions of parameters na.limit, MAF.limit, na.action, and MAF.action above.
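How these filters interact can be sketched in base R on a toy genotype matrix. The function filterVariants below is hypothetical and only mimics the behavior described for na.action="impute.major" and MAF.action="invert"; the coding of genotypes as allele counts (0 = major allele) and the order in which the filters are applied are assumptions, not a description of podkat's internals.

```r
## Toy data: rows are samples, columns are variants, entries are allele counts
Z <- matrix(c(0, 1, NA, 0,    # v1: one missing value
              2, 2, 1,  2,    # v2: frequency above 0.5
              0, 0, 0,  0,    # v3: frequency of 0
              1, 0, 0,  0),   # v4: rare variant
            nrow = 4,
            dimnames = list(NULL, c("v1", "v2", "v3", "v4")))

## Hypothetical helper, not part of podkat
filterVariants <- function(Z, na.limit = 1, MAF.limit = 1, omitZeroMAF = TRUE) {
    naRatio <- colMeans(is.na(Z))        # missing value ratio per variant
    Z[is.na(Z)] <- 0                     # "impute.major": missing = major allele
    maf <- colSums(Z) / (2 * nrow(Z))
    flip <- maf > 0.5                    # "invert": swap minor and major alleles
    maf[flip] <- 1 - maf[flip]
    keep <- naRatio <= na.limit & maf <= MAF.limit
    if (omitZeroMAF) keep <- keep & maf > 0
    maf[keep]
}

filterVariants(Z)                  # v3 is dropped because its MAF is 0
filterVariants(Z, na.limit = 0.1)  # v1 is dropped as well (25% missing values)
```

With the defaults na.limit=1 and MAF.limit=1, neither threshold removes anything, so only omitZeroMAF has an effect in the first call.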

Value

returns an object of class VariantInfo

Author(s)

Ulrich Bodenhofer bodenhofer@bioinf.jku.at

References

http://www.bioinf.jku.at/software/podkat

http://www.1000genomes.org/wiki/analysis/variant-call-format/vcf-variant-call-format-version-42

Li, H., Handsaker, B., Wysoker, A., Fennell, T., Ruan, J., Homer, N., Marth, G., Abecasis, G., Durbin, R., and 1000 Genome Project Data Processing Subgroup (2009) The Sequence Alignment/Map format and SAMtools. Bioinformatics 25, 2078-2079.

See Also

GenotypeMatrix

Examples

vcfFile <- system.file("examples/example1.vcf.gz", package="podkat")

## default parameters
vInfo <- readVariantInfo(vcfFile)
vInfo
summary(vInfo)

## including zero MAF variants and reference/alternate alleles
vInfo <- readVariantInfo(vcfFile, omitZeroMAF=FALSE, refAlt=TRUE)
vInfo
summary(vInfo)

podkat documentation built on Nov. 8, 2020, 6:55 p.m.