scanVcf-methods: Import VCF files

scanVcf    R Documentation

Import VCF files


Description

Import Variant Call Format (VCF) files in text or binary format


Usage

scanVcfHeader(file, ...)
## S4 method for signature 'character'
scanVcfHeader(file, ...)

scanVcf(file, ..., param)
## S4 method for signature 'character,ScanVcfParam'
scanVcf(file, ..., param)
## S4 method for signature 'character,missing'
scanVcf(file, ..., param)
## S4 method for signature 'connection,missing'
scanVcf(file, ..., param)

## S4 method for signature 'TabixFile'
scanVcfHeader(file, ...)
## S4 method for signature 'TabixFile,missing'
scanVcf(file, ..., param)
## S4 method for signature 'TabixFile,ScanVcfParam'
scanVcf(file, ..., param)
## S4 method for signature 'TabixFile,GRanges'
scanVcf(file, ..., param)
## S4 method for signature 'TabixFile,IntegerRangesList'
scanVcf(file, ..., param)



Arguments

file
For scanVcf and scanVcfHeader, the character() file name, TabixFile, or connection (file() or bgzip()) of the ‘VCF’ file to be processed.

param
An instance of ScanVcfParam influencing which records are parsed and the ‘INFO’ and ‘GENO’ information returned.

...
Additional arguments for methods.


Details

The argument param allows portions of the file to be input, but requires that the file be bgzip'd and indexed as a TabixFile.

scanVcf with param missing and file a character or a connection scans the entire file. With file a connection, an argument n indicates the number of lines of the VCF file to input; a connection that is open at the beginning of the call remains open and is advanced by n lines at the end of the call, providing a convenient way to stream through large VCF files.
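The requirement that param be used with a bgzip'd, indexed file can be illustrated with a short sketch. It assumes the ex2.vcf file shipped with VariantAnnotation and uses bgzip(), indexTabix(), and TabixFile() from Rsamtools to prepare it; the two ranges on chromosome "20" are chosen only for illustration.

```r
library(VariantAnnotation)
library(Rsamtools)       # bgzip(), indexTabix(), TabixFile()
library(GenomicRanges)   # GRanges()

fl <- system.file("extdata", "ex2.vcf", package="VariantAnnotation")

## param needs a bgzip-compressed, tabix-indexed copy of the file
bgz <- bgzip(fl, tempfile(fileext=".vcf.gz"))
idx <- indexTabix(bgz, format="vcf")
tbx <- TabixFile(bgz, idx)

## two illustrative ranges; only records overlapping them are scanned
rngs <- GRanges("20", IRanges(c(14370, 1110000), c(17330, 1234600)))
res <- scanVcf(tbx, param=rngs)

length(res)       # one list element per range
names(res[[1]])   # elements returned for the first range
```

Passing a ScanVcfParam instead of a GRanges additionally allows the ‘INFO’ and ‘GENO’ fields to be restricted.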

The INFO field of the scanned VCF file is returned as a single ‘packed’ vector, as in the VCF file. The GENO field is a list of matrices, each corresponding to a field defined in the FORMAT lines of the VCF header. Each matrix has as many rows as records scanned from the VCF file and as many columns as there are samples. As with the INFO field, the elements of each matrix are ‘packed’. INFO and GENO are returned packed to facilitate manipulation, e.g., selecting particular rows or samples in a consistent manner across elements.
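As a rough illustration of what ‘packed’ means for INFO, the following base R sketch splits packed strings into their fields. It assumes the strings follow the VCF text convention of semicolon-separated KEY=VALUE pairs and bare flags; unpack_info is a hypothetical helper written for this example, not a function of VariantAnnotation, and the input strings are made up.

```r
## Hypothetical helper (not part of VariantAnnotation): split one packed
## INFO string such as "NS=3;DP=14;DB" into a named list.
unpack_info <- function(x) {
    fields <- strsplit(x, ";", fixed=TRUE)[[1]]
    keys <- sub("=.*$", "", fields)
    has_value <- grepl("=", fields, fixed=TRUE)
    values <- sub("^[^=]*=", "", fields)
    out <- lapply(seq_along(fields), function(i) {
        if (!has_value[i]) {
            TRUE                                      # bare flag, e.g. "DB"
        } else {
            strsplit(values[i], ",", fixed=TRUE)[[1]] # one entry per value
        }
    })
    setNames(out, keys)
}

## made-up packed INFO strings, one per record
info <- c("NS=3;DP=14;AF=0.5;DB;H2", "NS=2;DP=10;AF=0.333,0.667;AA=T;DB")
unpacked <- lapply(info, unpack_info)
unpacked[[2]]$AF   # values are kept as character strings
```

Because the packed vector has one element per record, subsetting it (e.g., info[2]) before unpacking keeps rows aligned with the other elements.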


Value

scanVcfHeader returns a VCFHeader object with header information parsed into five categories: samples, meta, fixed, info, and geno. Each can be accessed with a ‘getter’ of the same name (e.g., info(<VCFHeader>)). If the file header has multiple rows with the same name (e.g., 'source'), the row names of the DataFrame are made unique in the usual way: 'source', 'source.1', etc.
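The getters named above can be tried on the example file shipped with the package. This is a minimal sketch that assumes the ex2.vcf file in VariantAnnotation's extdata directory.

```r
library(VariantAnnotation)

fl <- system.file("extdata", "ex2.vcf", package="VariantAnnotation")
hdr <- scanVcfHeader(fl)

samples(hdr)   # sample names
meta(hdr)      # general header lines
fixed(hdr)     # header lines describing fixed fields
info(hdr)      # one row per INFO field declared in the header
geno(hdr)      # one row per FORMAT field declared in the header
```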

scanVcf returns a list, with one element per range. Each element is itself a list of 7 elements, obtained from the columns of the VCF specification:

rowRanges
GRanges instance derived from CHROM, POS, ID, and the width of REF

REF
reference allele

ALT
alternate allele

QUAL
phred-scaled quality score for the assertion made in ALT

FILTER
indicator of whether or not the position passed all filters applied

INFO
additional information

GENO
genotype information immediately following the FORMAT field in the VCF

The GENO element is itself a list, with elements corresponding to those defined in the VCF file header. For scanVcf, elements of GENO are returned as a matrix of records x samples; if the description of the element in the file header indicated multiplicity other than 1 (e.g., variable number for “A”, “G”, or “.”), then each entry in the matrix is a character string with sub-entries comma-delimited.
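Comma-delimited matrix entries of this kind can be split with base R. The sketch below uses a made-up records x samples matrix rather than real scanVcf output, and split_cells is a hypothetical helper written for this example.

```r
## Made-up records x samples matrix with comma-delimited sub-entries
gl <- matrix(c("0,10,100", "5,0,50", "20,0,200", "0,30,300"),
             nrow=2, dimnames=list(c("rec1", "rec2"), c("S1", "S2")))

## Hypothetical helper: split every cell on "," and convert to numeric.
## The result is a list with the same dim and dimnames as the input,
## holding one numeric vector per cell.
split_cells <- function(m) {
    parts <- lapply(strsplit(as.vector(m), ",", fixed=TRUE), as.numeric)
    dim(parts) <- dim(m)
    dimnames(parts) <- dimnames(m)
    parts
}

gl_list <- split_cells(gl)
gl_list[["rec1", "S2"]]   # sub-entries for record rec1, sample S2
```

Keeping the records x samples layout means rows and columns can still be selected with ordinary matrix indexing after splitting.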


Author(s)

Martin Morgan and Valerie Obenchain

References

The VCF specification; the description of the portion of that specification implemented by bcftools; and the samtools documentation.

See Also

readVcf, BcfFile, TabixFile


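Examples

  library(VariantAnnotation)
  ## path to an example VCF file shipped with the package
  fl <- system.file("extdata", "ex2.vcf", package="VariantAnnotation")
  ## no param: the entire file is scanned
  vcf <- scanVcf(fl)
  ## value: list-of-lists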

Bioconductor/VariantAnnotation documentation built on Jan. 9, 2025, 12:03 a.m.