VCF-class: VCF class objects

VCF-class R Documentation

VCF class objects

Description

The VCF class is a virtual class extended from RangedSummarizedExperiment. The subclasses, CollapsedVCF and ExpandedVCF, are containers for holding data from Variant Call Format files.

Details

The VCF class is a virtual class with two concrete subclasses, CollapsedVCF and ExpandedVCF.

Slots unique to VCF and subclasses:

  • fixed: A DataFrame containing the REF, ALT, QUAL and FILTER fields from a VCF file.

  • info: A DataFrame containing the INFO fields from a VCF file.

Slots inherited from RangedSummarizedExperiment:

  • metadata: A list containing the file header or other information about the overall experiment.

  • rowRanges: A GRanges-class instance defining the variant ranges and associated metadata columns of REF, ALT, QUAL and FILTER. While the REF, ALT, QUAL and FILTER fields can be displayed as metadata columns they cannot be modified with rowRanges<-. To modify these fields use fixed<-.

  • colData: A DataFrame-class instance describing the samples and associated metadata.

  • geno: The assays slot from RangedSummarizedExperiment has been renamed as geno for the VCF class. This slot contains the genotype information immediately following the FORMAT field in a VCF file. Each element of the list or SimpleList is a matrix or array.

It is expected that users will not create instances of the VCF class but instead one of the concrete subclasses, CollapsedVCF or ExpandedVCF. CollapsedVCF contains the ALT data as a DNAStringSetList, allowing for multiple alleles per variant. The ExpandedVCF ALT data is a DNAStringSet where the ALT column has been expanded to create a flat form of the data with one row per variant-allele combination. In the case of structural variants, ALT will be a CompressedCharacterList (collapsed form) or a character vector (expanded form).

Constructors

readVcf(file, genome, param, ..., row.names=TRUE)

VCF(rowRanges = GRanges(), colData = DataFrame(), exptData = list(header = VCFHeader()), fixed = DataFrame(), info = DataFrame(), geno = SimpleList(), ..., collapsed=TRUE, verbose = FALSE): Creates a CollapsedVCF when collapsed = TRUE and an ExpandedVCF when collapsed = FALSE.

This is a low-level constructor used internally. Most instances of the VCF class are created with readVcf.
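As a minimal sketch of the usual route through readVcf, the following reads the ex2.vcf file shipped with the package (the same file used in the Examples section) and restricts the import with a ScanVcfParam. The choice of the NS and AF INFO fields and the GT genotype field is only an illustration and assumes those fields are declared in that file's header.

```r
library(VariantAnnotation)

fl <- system.file("extdata", "ex2.vcf", package = "VariantAnnotation")

## Import only two INFO fields and one genotype field
## (field names are assumed to exist in the ex2.vcf header).
param <- ScanVcfParam(info = c("NS", "AF"), geno = "GT")
vcf <- readVcf(fl, genome = "hg19", param = param)

class(vcf)            # concrete subclass created by readVcf()
colnames(info(vcf))   # INFO fields that were imported
names(geno(vcf))      # genotype fields that were imported
```

Restricting the import this way keeps the resulting object small when only a few fields are needed.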

Accessors

In the following code snippets x is a CollapsedVCF or ExpandedVCF object.

rowRanges(x, ..., fixed = TRUE), rowRanges(x) <- value: Gets or sets the rowRanges. The CHROM, POS, ID and REF fields are used to create a GRanges object. The start of each range is defined by POS and the width is equal to the width of the reference allele REF. The IDs become the rownames. If they are missing (i.e., ‘.’) a string of the form CHROM:POS_REF/ALT is used instead. The genome argument is stored in the seqinfo of the GRanges and can be accessed with genome(<VCF>).

When fixed = TRUE, the REF, ALT, QUAL and FILTER fields are displayed as metadata columns. To modify the fixed fields, use the fixed<- setter.

One metadata column, paramRangeID, is included with the rowRanges. This ID is meaningful when multiple ranges are specified in the ScanVcfParam and distinguishes which records match each range.

The metadata columns of a VCF object are accessed with the following:

  • ref(x), ref(x) <- value: Gets or sets the reference allele (REF). value must be a DNAStringSet.

  • alt(x), alt(x) <- value: Gets or sets the alternate allele data (ALT). When x is a CollapsedVCF, value must be a DNAStringSetList or CompressedCharacterList. For ExpandedVCF, value must be a DNAStringSet or character.

  • qual(x), qual(x) <- value: Returns or sets the quality scores (QUAL). value must be a numeric(1L).

  • filt(x), filt(x) <- value: Returns or sets the filter data. value must be a character(1L). Names must be one of 'REF', 'ALT', 'QUAL' or 'FILTER'.

mcols(x), mcols(x) <- value: These methods behave the same as mcols(rowRanges(x)) and mcols(rowRanges(x)) <- value. This method does not manage the fixed fields, 'REF', 'ALT', 'QUAL' or 'FILTER'. To modify those columns use fixed<-.

fixed(x), fixed(x) <- value: Gets or sets a DataFrame of REF, ALT, QUAL and FILTER only. Note that these fields are also displayed as metadata columns with the rowRanges() data (set fixed = FALSE in rowRanges() to suppress them).
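The relationship between fixed() and the metadata columns shown by rowRanges() can be sketched as follows. This is an illustration only: it assumes the ex2.vcf example file from the package, and the cap of 50 applied to QUAL is an arbitrary value chosen for the demonstration.

```r
library(VariantAnnotation)

fl <- system.file("extdata", "ex2.vcf", package = "VariantAnnotation")
vcf <- readVcf(fl, genome = "hg19")

## fixed() holds REF, ALT, QUAL and FILTER.
colnames(fixed(vcf))

## The same fields appear as metadata columns of rowRanges() unless
## fixed = FALSE is given.
with_fixed    <- colnames(mcols(rowRanges(vcf)))
without_fixed <- colnames(mcols(rowRanges(vcf, fixed = FALSE)))

## Changes to these fields go through fixed<-, not rowRanges<-.
fx <- fixed(vcf)
fx$QUAL <- pmin(fx$QUAL, 50)   # arbitrary cap, for illustration only
fixed(vcf) <- fx
qual(vcf)
```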

info(x, ..., row.names = TRUE), info(x) <- value: Gets or sets a DataFrame of INFO variables. Row names are added if unique and row.names=TRUE.

geno(x, withDimnames=TRUE), geno(x) <- value: Gets or sets a SimpleList of genotype data. value is a SimpleList. To replace a single variable in the SimpleList use geno(x)$variable <- value; in this case value must be a matrix or array. By default row names are returned; to override specify geno(x, withDimnames=FALSE).
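Replacing a single genotype variable might look like the sketch below. It assumes the ex2.vcf example file and that it carries a DP genotype field; the depth threshold of 2 is hypothetical and only serves to show the replacement pattern.

```r
library(VariantAnnotation)

fl <- system.file("extdata", "ex2.vcf", package = "VariantAnnotation")
vcf <- readVcf(fl, genome = "hg19")

names(geno(vcf))               # available genotype variables

## Extract one variable, modify the matrix, and put it back.
dp <- geno(vcf)$DP             # variants x samples matrix
dp[!is.na(dp) & dp < 2L] <- NA # mask low depth (hypothetical threshold)
geno(vcf)$DP <- dp

dim(geno(vcf)$DP)
```

Working on a copy of the matrix and assigning it back in one step keeps the dimensions consistent with the rest of the object.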

metadata(x): Gets a list of experiment-related data. By default this list includes the ‘header’ information from the VCF file. See header() for details on extracting header information.

colData(x), colData(x) <- value: Gets or sets a DataFrame of sample-specific information. Each row represents a sample in the VCF file. value must be a DataFrame with rownames representing the samples in the VCF file.

genome(x): Extract the genome information from the GRanges object returned by the rowRanges accessor.

seqlevels(x): Extract the seqlevels from the GRanges object returned by the rowRanges accessor.

strand(x): Extract the strand from the GRanges object returned by the rowRanges accessor.

header(x), header(x)<- value: Get or set the VCF header information. Replacement value must be a VCFHeader object. To modify individual elements use info<-, geno<- or meta<- on a ‘VCFHeader’ object. See ?VCFHeader man page for details.

  • info(header(x))

  • geno(header(x))

  • meta(header(x))

  • samples(header(x))

vcfFields(x): Returns a CharacterList of all available VCF fields, with names of fixed, info, geno and samples indicating the four categories. Each element is a character() vector of available VCF field names within each category.

Subsetting and combining

In the following code x is a VCF object, and ... is a list of VCF objects.

x[i, j], x[i, j] <- value: Gets or sets rows and columns. i and j can be integer or logical vectors. value is a replacement VCF object.
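A short sketch of row and column subsetting, assuming the ex2.vcf example file from the package. Rows are selected with a logical vector built from filt() and columns with an integer vector; the particular selection is arbitrary.

```r
library(VariantAnnotation)

fl <- system.file("extdata", "ex2.vcf", package = "VariantAnnotation")
vcf <- readVcf(fl, genome = "hg19")
dim(vcf)                        # variants x samples

## Logical row index: variants whose FILTER value is "PASS".
pass <- filt(vcf) == "PASS"

## Integer column index: the first two samples.
vcf_sub <- vcf[pass, 1:2]
dim(vcf_sub)
colnames(vcf_sub)
```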

subset(x, subset, select, ...): Restricts x by evaluating the subset argument in the scope of rowData(x) and info(x), and select in the context of colData(x). The subset argument restricts by rows, while the select argument restricts by column. The ... are passed to the underlying subset() calls.

cbind(...), rbind(...): cbind combines objects with identical ranges (rowRanges) but different samples (columns in assays). The colnames in colData must match or an error is thrown. Columns with duplicate names in fixed, info and mcols(rowRanges(VCF)) must contain the same data.

rbind combines objects with different ranges (rowRanges) and the same subjects (columns in assays). Columns with duplicate names in colData must contain the same data. The ‘Samples’ columns in colData (created by readVcf) are renamed with a numeric extension ordered as they were input to rbind, e.g., “Samples.1, Samples.2, ...”.

Metadata from all objects is combined into a list with no name checking.

expand

In the following code snippets x is a CollapsedVCF object.

expand(x, ..., row.names = FALSE): Expand (unlist) the ALT column of a CollapsedVCF object to one row per ALT value. Variables with Number='A' have one value per alternate allele and are expanded accordingly. The 'AD' genotype field (and any variables with 'Number' set to 'R') is expanded into REF/ALT pairs. For all other fields, the rows are replicated to match the elementNROWS of ALT.

The output is an ExpandedVCF with ALT as a DNAStringSet or character (structural variants). By default rownames are NULL. When row.names=TRUE the expanded output has duplicated rownames corresponding to the original x.
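The effect of expand() on the number of rows can be sketched with the ex2.vcf example file, which is assumed here to contain at least one variant with several ALT alleles. The comparison against sum(elementNROWS(alt(x))) follows from the description above.

```r
library(VariantAnnotation)

fl <- system.file("extdata", "ex2.vcf", package = "VariantAnnotation")
vcf <- readVcf(fl, genome = "hg19")

## Number of ALT values per variant in the collapsed form.
n_alt <- elementNROWS(alt(vcf))

vcf_long <- expand(vcf)

class(alt(vcf))        # list-like ALT in the CollapsedVCF
class(alt(vcf_long))   # flat ALT in the ExpandedVCF

## One row per ALT value after expansion.
c(collapsed = nrow(vcf), expanded = nrow(vcf_long), alt_total = sum(n_alt))

## Keep the original row names (duplicated across alleles).
rownames(expand(vcf, row.names = TRUE))
```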

genotypeCodesToNucleotides(vcf, ...)

This function converts the 'GT' genotype codes in a VCF object to nucleotides. See also ?readGT to read in only 'GT' data as codes or nucleotides.
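As a rough illustration, the conversion can be applied to the ex2.vcf example file and the GT matrix compared before and after. This assumes the file has a GT genotype field and DNA-based ALT alleles; the printed values are not reproduced here.

```r
library(VariantAnnotation)

fl <- system.file("extdata", "ex2.vcf", package = "VariantAnnotation")
vcf <- readVcf(fl, genome = "hg19")

## Genotypes as codes, e.g. allele indices separated by '/' or '|'.
geno(vcf)$GT

## Genotypes with the codes replaced by the REF/ALT nucleotides.
vcf_nt <- genotypeCodesToNucleotides(vcf)
geno(vcf_nt)$GT
```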

SnpMatrixToVCF(from, seqSource)

This function converts the output from the read.plink function to a VCF class. from must be a list of length 3 with named elements "map", "fam" and "genotypes". seqSource can be a BSgenome or an FaFile used for reference sequence extraction.

Variant Type

Functions to identify variant type include isSNV, isInsertion, isDeletion, isIndel, isSubstitution and isTransition. See the ?isSNV man page for details.

Arguments

geno

A list or SimpleList of matrix elements, or a matrix containing the genotype information from a VCF file. If present, these data immediately follow the FORMAT field in the VCF.

Each element of the list must have the same dimensions, and dimension names (if present) must be consistent across elements and with the row names of rowRanges and colData.

info

A DataFrame of data from the INFO field of a VCF file. The number of rows must match that in the rowRanges object.

fixed

A DataFrame of REF, ALT, QUAL and FILTER fields from a VCF file. The number of rows must match that of the rowRanges object.

rowRanges

A GRanges instance describing the ranges of interest. Row names, if present, become the row names of the VCF. The length of the GRanges must equal the number of rows of the matrices in geno.

colData

A DataFrame describing the samples. Row names, if present, become the column names of the VCF.

metadata

A list describing the header of the VCF file or additional information for the overall experiment.

...

For cbind and rbind a list of VCF objects. For all other methods ... are additional arguments passed to methods.

collapsed

A logical(1) indicating whether a CollapsedVCF or ExpandedVCF should be created. The ALT in a CollapsedVCF is a DNAStringSetList while in an ExpandedVCF it is a DNAStringSet.

verbose

A logical(1) indicating whether messages about data coercion during construction should be printed.

Author(s)

Valerie Obenchain

See Also

GRanges, DataFrame, SimpleList, RangedSummarizedExperiment, readVcf, writeVcf, isSNV

Examples


## readVcf() parses data into a VCF object: 

fl <- system.file("extdata", "structural.vcf", package="VariantAnnotation")
vcf <- readVcf(fl, genome="hg19")

## ----------------------------------------------------------------
## Accessors 
## ----------------------------------------------------------------
## Variant locations are stored in the GRanges object returned by
## the rowRanges() accessor.
rowRanges(vcf)

## Suppress fixed fields:
rowRanges(vcf, fixed=FALSE)

## Individual fields can be extracted with ref(), alt(), qual(), filt() etc.
qual(vcf)
ref(vcf)
head(info(vcf))

## All available VCF field names can be extracted with vcfFields(). 
vcfFields(vcf)

## Extract genotype fields with geno(). Access specific fields with 
## '$' or '[['.
geno(vcf)
identical(geno(vcf)$GQ, geno(vcf)[[2]])

## ----------------------------------------------------------------
## Renaming seqlevels and subsetting 
## ----------------------------------------------------------------
## Overlap and matching operations require that the objects
## being compared have the same seqlevels (chromosome names).
## It is often the case that the seqlevels in one of the objects
## needs to be modified to match the other. In this VCF, the 
## seqlevels are numbers instead of preceded by "chr" or "ch". 

seqlevels(vcf)

## Rename the seqlevels to start with 'chr'.
vcf2 <- renameSeqlevels(vcf, paste0("chr", seqlevels(vcf))) 
seqlevels(vcf2)

## The VCF can also be subset by seqlevel using 'keepSeqlevels'
## or 'dropSeqlevels'. See ?keepSeqlevels for details. 
vcf3 <- keepSeqlevels(vcf2, "chr2", pruning.mode="coarse")
seqlevels(vcf3)

## ----------------------------------------------------------------
## Header information 
## ----------------------------------------------------------------

## Header data can be modified in the 'meta', 'info' and 'geno'
## slots of the VCFHeader object. See ?VCFHeader for details.

## Current 'info' fields.
rownames(info(header(vcf)))

## Add a new field to the header.
newInfo <- DataFrame(Number=1, Type="Integer",
                     Description="Number of Samples With Data",
                     row.names="NS")
info(header(vcf)) <- rbind(info(header(vcf)), newInfo)
rownames(info(header(vcf)))

## ----------------------------------------------------------------
## Collapsed and Expanded VCF 
## ----------------------------------------------------------------
## readVcf() produces a CollapsedVCF object.
fl <- system.file("extdata", "ex2.vcf", 
                  package="VariantAnnotation")
vcf <- readVcf(fl, genome="hg19")
vcf

## The ALT column is a DNAStringSetList to allow for more
## than one alternate allele per variant.
alt(vcf)

## For structural variants ALT is a CharacterList.
fl <- system.file("extdata", "structural.vcf", 
                  package="VariantAnnotation")
vcf <- readVcf(fl, genome="hg19")
alt(vcf)

## ExpandedVCF is the 'flattened' counterpart of CollapsedVCF.
## The ALT and all variables with Number='A' in the header are
## expanded to one row per alternate allele.
vcfLong <- expand(vcf)
alt(vcfLong)

## Also see the ?VRanges class for an alternative form of
## 'flattened' VCF data.

## ----------------------------------------------------------------
## isSNV()
## ----------------------------------------------------------------
## isSNV() returns a logical vector flagging SNVs; use it to subset
## the VCF to SNVs only.

vcf <- VCF(rowRanges = GRanges("chr1", IRanges(1:4*3, width=c(1, 2, 1, 1))))
alt(vcf) <- DNAStringSetList("A", c("TT"), c("G", "A"), c("TT", "C"))
ref(vcf) <- DNAStringSet(c("G", c("AA"), "T", "G"))
fixed(vcf)[c("REF", "ALT")]

## SNVs are present in rows 1 (single ALT value), 3 (both ALT values) 
## and 4 (1 of the 2 ALT values).
vcf[isSNV(vcf, singleAltOnly=TRUE)] 
vcf[isSNV(vcf, singleAltOnly=FALSE)] ## all 3 SNVs

Bioconductor/VariantAnnotation documentation built on Jan. 9, 2025, 12:03 a.m.