View source: R/imputedDosageFile.R
imputedDosageFile | R Documentation |
These functions create or check a GDS or NetCDF file and corresponding annotation for imputed dosages from IMPUTE2, BEAGLE, or MaCH.
imputedDosageFile(input.files, filename, chromosome,
input.type=c("IMPUTE2", "BEAGLE", "MaCH"),
input.dosage=FALSE,
output.type = c("dosage", "genotype"),
file.type=c("gds", "ncdf"),
snp.annot.filename="dosage.snp.RData",
scan.annot.filename="dosage.scan.RData",
compress="LZMA_RA:1M",
compress.annot="LZMA_RA",
genotypeDim="scan,snp",
scan.df=NULL,
snp.exclude=NULL,
snp.id.start=1,
block.size=5000,
prob.threshold=0.9, verbose=TRUE)
checkImputedDosageFile(genoData, snpAnnot, scanAnnot,
input.files, chromosome,
input.type=c("IMPUTE2", "BEAGLE", "MaCH"),
input.dosage=FALSE,
output.type=c("dosage", "genotype"),
snp.exclude=NULL, snp.id.start=1,
tolerance=1e-4, na.logfile=NULL,
block.size=5000,
prob.threshold=0.9,
verbose=TRUE)
input.files |
A character vector of input files. The first file should always be genotypes (either probabilities or dosages). Files for each input type should be as follows:
|
filename |
Character string with name of output GDS or NetCDF file. |
chromosome |
Chromosome corresponding to the SNPs in the genotype file. Character codes will be mapped to integer values as follows: "X"->23, "XY"->24, "Y"-> 25, "M","MT"->26. |
input.type |
Format of input files. Accepted file types are "IMPUTE2", "BEAGLE", and "MaCH". |
input.dosage |
Logical for whether the genotype file
( |
file.type |
The type of file to create ("gds" or "ncdf") |
snp.annot.filename |
Output .RData file for storing a |
scan.annot.filename |
Output .RData file for storing a
|
compress |
The compression level for dosage variables in a GDS file (see
|
compress.annot |
The compression level for annotation variables in a GDS file (see
|
genotypeDim |
character string specifying genotype dimensions of output file. Either "snp,scan" or "scan,snp". "scan,snp" is usually much faster to create for GDS files, but "snp,scan" is required for NetCDF files. |
scan.df |
data frame specifying which samples to include in the output GDS files, with optional scanIDs already assigned. See details. |
snp.exclude |
vector of integers specifying which SNPs to exclude from the GDS file. |
snp.id.start |
Starting index for snpID. |
block.size |
Number of lines to read at once. |
verbose |
Logical for whether to print progress messages. |
genoData |
A |
snpAnnot |
The |
scanAnnot |
The |
tolerance |
tolerance for checking differences against input files |
na.logfile |
filename for recording snpID and scanID of missing dosages |
output.type |
output to record in gds file (either |
prob.threshold |
if |
Input files can contain either imputed dosages or genotype probabilities, specified by the input.dosage
flag.
In either case, the GDS/NetCDF file will store dosage of the A allele in the "genotype" variable.
All SNPs are assumed to be on the same chromosome, which is indicated by the chromosome
argument.
If the input file contains genotype probabilities for all three genotypes, the dosage is set to missing if the genotype probability strings (before numerical conversion) are equal (e.g., (0,0,0), (0.33, 0.33, 0.33), or (-1, -1, -1)). The dosage is also normalized by the sum of all three genotype probabilities.
The scan.df
argument allows the user to specify what samples should be included in the GDS files and an optional sampleID-scanID mapping. scan.df
is a data frame with required column sampleID
. The function attempts to match the given sampleID in the scan.df
data frame with a unique sampleID in the input files. The format of sampleID is different for different input types:
IMPUTE2: "ID_1 ID_2" as given in the sample file, where IDs are separated by a space
BEAGLE: Column header names corresponding to that sample in .dose or .gprobs file
MaCH: The first column of the .mlprob or .mlprob file
The snp.names
argument allows the user to specify the which SNPs should be included in the GDS files. However, snp.names
must be in the same order as SNPs occur in the imputation files; this option therefore only allows selection of SNPs, not reordering of SNPs. The ordering is checked and an error is thrown if the SNP names are not in order, but due to the design of imputation files, this may not occur until well into the GDS file population. The user can specify the starting snpID by setting snp.id.start
, and included SNPs are numbered sequentially starting with snp.id.start
. For IMPUTE2 data, snp.names must correspond to the second column of the .gprobs file.
Minimal SNP and scan annotation are created from the input files and stored in
RData format in snp.annot.filename
and scan.annot.filename
.
If requested with na.logfile
, checkImputedDosageFile
will output a file with scanIDs and snpIDs of missing genotype calls.
Either dosage or genotypes can be output using output.type
. If dosage is requested, the dosages will be 2*AAprob + ABprob. If genotype is requested, the value will be set to the genotype with the maximum probability, unless all probabilities are less than prob.threshold
. In that case, the genotype will be set to missing. SNPs with max probabilities that are the same for two genotypes (ie, AA=0.5, AB=0.5, BB=0) will also be set to missing.
Currently supported input file types are IMPUTE2, BEAGLE, and MaCH.
Adrienne Stilp, Stephanie Gogarten
IMPUTE2: http://mathgen.stats.ox.ac.uk/impute/impute_v2.html
BEAGLE: http://faculty.washington.edu/browning/beagle/beagle.html
MaCH: http://www.sph.umich.edu/csg/abecasis/MACH/tour/imputation.html
createDataFile
, GdsGenotypeReader
, NcdfGenotypeReader
, GenotypeData
, assocRegression
gdsfile <- tempfile()
snpfile <- tempfile()
scanfile <- tempfile()
logfile <- tempfile()
# IMPUTE2
probfile <- system.file("extdata", "imputation", "IMPUTE2", "example.chr22.study.gens",
package="GWASdata")
sampfile <- system.file("extdata", "imputation", "IMPUTE2", "example.study.samples",
package="GWASdata")
imputedDosageFile(input.files=c(probfile, sampfile), filename=gdsfile, chromosome=22,
input.type="IMPUTE2", input.dosage=FALSE,
snp.annot.filename=snpfile, scan.annot.filename=scanfile)
gds <- GdsGenotypeReader(gdsfile)
scanAnnot <- getobj(scanfile)
snpAnnot <- getobj(snpfile)
genoData <- GenotypeData(gds, scanAnnot=scanAnnot, snpAnnot=snpAnnot)
checkImputedDosageFile(genoData, snpAnnot, scanAnnot,
input.files=c(probfile, sampfile), chromosome=22,
input.type="IMPUTE2", input.dosage=FALSE, na.logfile=logfile)
geno <- getGenotype(genoData)
getAlleleA(genoData)
getAlleleB(genoData)
log <- read.table(logfile)
head(log)
# association test with imputed dosages
scanAnnot$status <- sample(0:1, nrow(scanAnnot), replace=TRUE)
genoData <- GenotypeData(gds, scanAnnot=scanAnnot, snpAnnot=snpAnnot)
assoc <- assocRegression(genoData, outcome="status", model.type="logistic")
head(assoc)
close(genoData)
# BEAGLE - genotype probabilities
probfile <- system.file("extdata", "imputation", "BEAGLE", "example.hapmap.unphased.bgl.gprobs",
package="GWASdata")
markfile <- system.file("extdata", "imputation", "BEAGLE", "hapmap.markers",
package="GWASdata")
imputedDosageFile(input.files=c(probfile, markfile), filename=gdsfile, chromosome=22,
input.type="BEAGLE", input.dosage=FALSE, file.type="gds",
snp.annot.filename=snpfile, scan.annot.filename=scanfile)
# BEAGLE - dosage
dosefile <- system.file("extdata", "imputation", "BEAGLE", "example.hapmap.unphased.bgl.dose",
package="GWASdata")
imputedDosageFile(input.files=c(dosefile, markfile), filename=gdsfile, chromosome=22,
input.type="BEAGLE", input.dosage=TRUE, file.type="gds",
snp.annot.filename=snpfile, scan.annot.filename=scanfile)
# MaCH - genotype probabilities
probfile <- system.file("extdata", "imputation", "MaCH", "mach1.out.mlprob",
package="GWASdata")
markfile <- system.file("extdata", "imputation", "MaCH", "mach1.out.mlinfo",
package="GWASdata")
posfile <- system.file("extdata", "imputation", "MaCH", "mach1.snp.position",
package="GWASdata")
imputedDosageFile(input.files=c(probfile, markfile, posfile), filename=gdsfile, chromosome=22,
input.type="MaCH", input.dosage=FALSE, file.type="gds",
snp.annot.filename=snpfile, scan.annot.filename=scanfile)
# MaCH - dosage
dosefile <- system.file("extdata", "imputation", "MaCH", "mach1.out.mldose",
package="GWASdata")
imputedDosageFile(input.files=c(dosefile, markfile, posfile), filename=gdsfile, chromosome=22,
input.type="MaCH", input.dosage=TRUE, file.type="gds",
snp.annot.filename=snpfile, scan.annot.filename=scanfile)
unlink(c(gdsfile, snpfile, scanfile))
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.