View source: R/duplicateDiscordanceAcrossDatasets.R
duplicateDiscordanceAcrossDatasets | R Documentation |
These functions compare genotypes in pairs of duplicate scans of the same sample across multiple datasets. 'duplicateDiscordanceAcrossDatasets' finds the number of discordant genotypes both by scan and by SNP. 'dupDosageCorAcrossDatasets' calculates correlations between allelic dosages both by scan and by SNP, allowing for comparison between imputed datasets or between imputed and observed - i.e., where one or more of the datasets contains continuous dosage [0,2] rather than discrete allele counts {0,1,2}.
duplicateDiscordanceAcrossDatasets(genoData1, genoData2,
match.snps.on=c("position", "alleles"),
subjName.cols, snpName.cols=NULL,
one.pair.per.subj=TRUE, minor.allele.only=FALSE,
missing.fail=c(FALSE, FALSE),
scan.exclude1=NULL, scan.exclude2=NULL,
snp.exclude1=NULL, snp.exclude2=NULL,
snp.include=NULL,
verbose=TRUE)
minorAlleleDetectionAccuracy(genoData1, genoData2,
match.snps.on=c("position", "alleles"),
subjName.cols, snpName.cols=NULL,
missing.fail=TRUE,
scan.exclude1=NULL, scan.exclude2=NULL,
snp.exclude1=NULL, snp.exclude2=NULL,
snp.include=NULL,
verbose=TRUE)
dupDosageCorAcrossDatasets(genoData1, genoData2,
match.snps.on=c("position", "alleles"),
subjName.cols="subjectID", snpName.cols=NULL,
scan.exclude1=NULL, scan.exclude2=NULL,
snp.exclude1=NULL, snp.exclude2=NULL,
snp.include=NULL,
snp.block.size=5000, scan.block.size=100,
verbose=TRUE)
genoData1 |
|
genoData2 |
|
match.snps.on |
One or more of ("position", "alleles", "name") indicating how to match SNPs. "position" will match SNPs on chromosome and position, "alleles" will also require the same alleles (but A/B designations need not be the same), and "name" will match on the columns given in |
subjName.cols |
2-element character vector indicating the names of the annotation variables that will be identical for duplicate scans in the two datasets. (Alternatively, one character value that will be recycled). |
snpName.cols |
2-element character vector indicating the names of the annotation variables that will be identical for the same SNPs in the two datasets. (Alternatively, one character value that will be recycled). |
one.pair.per.subj |
A logical indicating whether a single pair of scans should be randomly selected for each subject with more than 2 scans. |
minor.allele.only |
A logical indicating whether discordance should be calculated only between pairs of scans in which at least one scan has a genotype with the minor allele (i.e., exclude major allele homozygotes). |
missing.fail |
For |
scan.exclude1 |
An integer vector containing the ids of scans to be excluded from the first dataset. |
scan.exclude2 |
An integer vector containing the ids of scans to be excluded from the second dataset. |
snp.exclude1 |
An integer vector containing the ids of SNPs to be excluded from the first dataset. |
snp.exclude2 |
An integer vector containing the ids of SNPs to be excluded from the second dataset. |
snp.include |
List of SNPs to include in the comparison. Should
match the contents of the columns referred to by |
snp.block.size |
Block size for SNPs |
scan.block.size |
Block size for scans |
verbose |
Logical value specifying whether to show progress information. |
duplicateDiscordanceAcrossDatasets calculates discordance metrics both by scan and by SNP. If one.pair.per.subj=TRUE (the default), each subject with more than two duplicate genotyping instances will have one scan from each dataset randomly selected for computing discordance. If one.pair.per.subj=FALSE, discordances will be calculated pair-wise for all possible cross-dataset pairs for each subject.
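As a rough illustration of the by-SNP and by-scan summaries, the following base R sketch computes discordance on two small toy matrices. The matrices and the variable names are hypothetical, and the sketch simply skips any pair with a missing call, which is an assumption rather than a description of how the package handles missing data.

```r
# Toy genotype matrices (rows = SNPs, columns = duplicate scan pairs).
# Values are allele counts 0/1/2; NA is a missing call. Hypothetical data.
geno1 <- matrix(c(0, 1, 2,
                  1, 1, 0,
                  2, NA, 2), nrow = 3, byrow = TRUE)
geno2 <- matrix(c(0, 2, 2,
                  1, 1, 1,
                  2, 0, NA), nrow = 3, byrow = TRUE)

# A pair is comparable only when both calls are non-missing (assumption).
comparable <- !is.na(geno1) & !is.na(geno2)
discordant <- comparable & (geno1 != geno2)

# By-SNP summary: count discordant pairs and pairs examined per row.
by.snp <- data.frame(discordant = rowSums(discordant),
                     npair = rowSums(comparable))
by.snp$discord.rate <- by.snp$discordant / by.snp$npair

# By-scan summary: fraction of comparable SNPs that disagree per column.
by.scan <- colSums(discordant) / colSums(comparable)

print(by.snp)
print(by.scan)
```

The by-SNP rate here is discordant/npair, matching the "discord.rate" column described in the return value.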
dupDosageCorAcrossDatasets calculates dosage correlation (Pearson correlation coefficient) both by scan and by SNP. Note it only allows for one pair of duplicate scans per sample. For this function only, genoData1 and genoData2 must have been created with GdsGenotypeReader objects.
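The idea of a by-SNP and by-scan Pearson correlation can be sketched in base R with cor(). The dosage matrices below are hypothetical toy data, and the sketch works on in-memory matrices rather than on GdsGenotypeReader-backed objects, so it only illustrates the calculation, not the package's block-wise implementation.

```r
# Toy data: observed allele counts and imputed dosages for the same
# 3 SNPs (rows) in the same 4 samples (columns). Hypothetical values.
dos1 <- matrix(c(0, 1, 2, 1,
                 2, 2, 1, 0,
                 0, 0, 1, 2), nrow = 3, byrow = TRUE)
dos2 <- matrix(c(0.1, 0.9, 1.8, 1.2,
                 1.9, 2.0, 0.7, 0.2,
                 0.0, 0.3, 1.1, NA), nrow = 3, byrow = TRUE)

# By-SNP correlation: correlate each row across samples,
# using only samples that are non-missing in both datasets.
r.by.snp <- sapply(seq_len(nrow(dos1)), function(i)
  cor(dos1[i, ], dos2[i, ], use = "complete.obs"))

# By-scan correlation: correlate each column across SNPs.
r.by.scan <- sapply(seq_len(ncol(dos1)), function(j)
  cor(dos1[, j], dos2[, j], use = "complete.obs"))

# Number of samples contributing to each by-SNP correlation.
nsamp <- rowSums(!is.na(dos1) & !is.na(dos2))

print(data.frame(nsamp.dosageR = nsamp, dosageR = r.by.snp))
print(r.by.scan)
```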
By default, overlapping variants are identified based on position and alleles. Alleles are determined via 'getAlleleA' and 'getAlleleB' accessors, so users should ensure these variables are referring to the same strand orientation in both datasets (e.g., both plus strand alleles). It is not necessary for the A/B ordering to be consistent across datasets. For example, two variants at the same position with alleleA="C" and alleleB="T" in genoData1 and alleleA="T" and alleleB="C" in genoData2 will still be identified as overlapping.
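One way to picture order-independent allele matching is to sort each allele pair alphabetically before comparing. This is a minimal sketch with hypothetical data frames and a hypothetical helper, allele.key; it is not the package's own matching code.

```r
# Hypothetical SNP annotation for two datasets.
snp1 <- data.frame(position = c(101L, 103L, 105L),
                   alleleA = c("C", "A", "A"),
                   alleleB = c("T", "G", "G"),
                   stringsAsFactors = FALSE)
snp2 <- data.frame(position = c(101L, 103L, 105L),
                   alleleA = c("T", "A", "C"),
                   alleleB = c("C", "G", "T"),
                   stringsAsFactors = FALSE)

# Hypothetical helper: build an allele key that ignores A/B ordering
# by putting the alphabetically smaller allele first.
allele.key <- function(a, b) {
  ifelse(a <= b, paste(a, b, sep = "/"), paste(b, a, sep = "/"))
}

key1 <- paste(snp1$position, allele.key(snp1$alleleA, snp1$alleleB))
key2 <- paste(snp2$position, allele.key(snp2$alleleA, snp2$alleleB))

# Positions 101 (C/T vs T/C) and 103 (A/G vs A/G) match;
# position 105 does not because the alleles differ (A/G vs C/T).
matched <- snp1$position[key1 %in% key2]
print(matched)
```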
If minor.allele.only=TRUE, the allele frequency will be calculated in genoData1, using only samples common to both datasets.
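The effect of restricting to minor-allele pairs can be illustrated on a single SNP. The sketch below uses hypothetical A-allele counts and assumes the minor allele is whichever allele has frequency below 0.5 in the first dataset; the package's own handling of ties and missing data may differ.

```r
# Toy A-allele counts for one SNP in four duplicate pairs (hypothetical).
geno1 <- c(2, 2, 1, 2)
geno2 <- c(2, 1, 1, 2)

# Frequency of the A allele, calculated in the first dataset only.
freqA <- mean(geno1, na.rm = TRUE) / 2

# Convert to minor allele counts: flip the coding when A is the major allele.
to.minor <- function(g) if (freqA > 0.5) 2 - g else g
minor1 <- to.minor(geno1)
minor2 <- to.minor(geno2)

# Keep pairs in which at least one scan carries the minor allele,
# i.e. drop pairs where both scans are major allele homozygotes.
keep <- minor1 > 0 | minor2 > 0

disc.all <- mean(geno1 != geno2)                 # all pairs
disc.minor <- mean(geno1[keep] != geno2[keep])   # minor-allele pairs only
print(c(all = disc.all, minor.only = disc.minor))
```

Dropping the concordant major allele homozygotes raises the discordance rate in this toy case, since those pairs no longer contribute to the denominator.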
If snp.include=NULL (the default), discordances will be found for all SNPs common to both datasets.
genoData1 and genoData2 should each have "alleleA" and "alleleB" defined in their SNP annotation. If allele coding cannot be found, the two datasets are assumed to have identical coding. Note that 'dupDosageCorAcrossDatasets' cannot detect where strand-ambiguous (A/T or C/G) SNPs are annotated on different strands, although the r2 in these instances would be unaffected: r may be negative but r2 will be positive.
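The remark about r and r2 can be checked numerically. In this sketch (hypothetical dosages), an undetected strand flip is modelled as the second dataset counting the opposite allele, so each dosage d becomes 2 - d; that model is an assumption made for illustration.

```r
# Hypothetical dosages for one strand-ambiguous SNP in six samples.
dosage1 <- c(0, 1, 2, 1, 0, 2)
dosage2 <- c(0.1, 1.0, 1.9, 1.2, 0.0, 1.8)

# Assumed effect of an undetected strand flip: the second dataset
# counts the other allele, so each dosage d becomes 2 - d.
flipped <- 2 - dosage2

r <- cor(dosage1, dosage2)
r.flipped <- cor(dosage1, flipped)

# The sign of r changes, but r squared does not.
print(c(r = r, r.flipped = r.flipped))
print(c(r2 = r^2, r2.flipped = r.flipped^2))
```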
minorAlleleDetectionAccuracy summarizes the accuracy of minor allele detection in genoData2 with respect to genoData1 (the "gold standard").
TP = number of true positives, TN = number of true negatives, FP = number of false positives, and FN = number of false negatives.
Accuracy is represented by four metrics:
sensitivity for each SNP as TP/(TP+FN)
specificity for each SNP as TN/(TN+FP)
positive predictive value for each SNP as TP/(TP+FP)
negative predictive value for each SNP as TN/(TN+FN)
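The four metrics are simple ratios of the per-SNP counts. A minimal base R sketch follows, with a hypothetical helper name (accuracy.metrics) and made-up counts for two SNPs.

```r
# Hypothetical helper: compute the four accuracy metrics from
# per-SNP counts of true/false positives and negatives.
accuracy.metrics <- function(TP, TN, FP, FN) {
  data.frame(sensitivity = TP / (TP + FN),
             specificity = TN / (TN + FP),
             positivePredictiveValue = TP / (TP + FP),
             negativePredictiveValue = TN / (TN + FN))
}

# Made-up counts for two SNPs. The second SNP has no minor alleles in
# either dataset, so sensitivity and PPV are 0/0, which R returns as NaN.
m <- accuracy.metrics(TP = c(8, 0), TN = c(30, 40),
                      FP = c(2, 0), FN = c(1, 0))
print(m)
```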
TP, TN, FP, and FN are calculated as follows (columns are genoData1 genotypes, rows are genoData2 genotypes):
genoData2 \ genoData1 | mm | Mm | MM |
mm | 2TP | 1TP + 1FP | 2FP |
Mm | 1TP + 1FN | 1TN + 1TP | 1TN + 1FP |
MM | 2FN | 1FN + 1TN | 2TN |
-- | 2FN | 1FN | |
"M" is the major allele and "m" is the minor allele (as calculated in genoData1). "--" is a missing call in genoData2. Missing calls in genoData1 are ignored. If missing.fail=FALSE, missing calls in genoData2 (the last row of the table) are also ignored.
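The counting scheme in the table can be written compactly if genotypes are coded as minor allele counts (mm = 2, Mm = 1, MM = 0). The sketch below is a base R illustration with a hypothetical function name (count.calls); the coding and the function are assumptions, not the package's internals.

```r
# Hypothetical helper: tally TP/TN/FP/FN for one SNP.
# g1 = minor allele counts in the gold standard (genoData1),
# g2 = minor allele counts in the test dataset (genoData2); NA = missing.
count.calls <- function(g1, g2, missing.fail = TRUE) {
  keep <- !is.na(g1)                            # missing gold-standard calls are ignored
  if (!missing.fail) keep <- keep & !is.na(g2)  # optionally ignore missing test calls
  g1 <- g1[keep]
  g2 <- g2[keep]
  miss2 <- is.na(g2)
  m2 <- ifelse(miss2, 0, g2)
  c(TP = sum(ifelse(miss2, 0, pmin(g1, m2))),         # minor alleles found in both
    TN = sum(ifelse(miss2, 0, pmin(2 - g1, 2 - m2))), # major alleles found in both
    FP = sum(ifelse(miss2, 0, pmax(m2 - g1, 0))),     # extra minor alleles in g2
    FN = sum(ifelse(miss2, g1, pmax(g1 - m2, 0))))    # minor alleles missed by g2
}

g1 <- c(2, 1, 0, 1, 2, NA)
g2 <- c(2, 1, 1, 0, NA, 1)
print(count.calls(g1, g2, missing.fail = TRUE))
print(count.calls(g1, g2, missing.fail = FALSE))
```

With missing.fail=TRUE the missing genoData2 call paired with an mm gold-standard genotype adds two false negatives, as in the last row of the table; with missing.fail=FALSE that pair is dropped.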
SNP annotation columns returned by all functions are:
chromosome |
chromosome |
position |
base pair position |
snpID1 |
snpID from genoData1 |
snpID2 |
snpID from genoData2 |
If matching on "alleles":
alleles |
alleles sorted alphabetically |
alleleA1 |
allele A from genoData1 |
alleleB1 |
allele B from genoData1 |
alleleA2 |
allele A from genoData2 |
alleleB2 |
allele B from genoData2 |
If matching on "name":
name |
the common SNP name given in |
duplicateDiscordanceAcrossDatasets returns a list with two elements:
The data.frame "discordance.by.snp" contains the SNP annotation columns listed above as well as:
discordant |
number of discordant pairs |
npair |
number of pairs examined |
n.disc.subj |
number of subjects with at least one discordance |
discord.rate |
discordance rate i.e. discordant/npair |
The element "discordance.by.subject" is a list of matrices (one for each subject) with the pair-wise discordance between the different genotyping instances of the subject.
minorAlleleDetectionAccuracy returns a data.frame with the SNP annotation columns listed above as well as:
npair |
number of sample pairs compared (non-missing in |
sensitivity |
sensitivity |
specificity |
specificity |
positivePredictiveValue |
Positive predictive value |
negativePredictiveValue |
Negative predictive value |
dupDosageCorAcrossDatasets returns a list with two data frames:
The data.frame "snps" contains the by-SNP correlation (r) values with the SNP annotation columns listed above as well as:
nsamp.dosageR |
number of samples in r calculation (i.e., non missing data in both genoData1 and genoData2) |
dosageR |
dosage correlation |
The data.frame "samps" contains the by-sample r values with the following columns:
subjectID |
subject-level identifier for duplicate sample pair |
scanID1 |
scanID from genoData1 |
scanID2 |
scanID from genoData2 |
nsnp.dosageR |
number of SNPs in r calculation (i.e., non missing data in both genoData1 and genoData2) |
dosageR |
dosage correlation |
If no duplicate scans or no common SNPs are found, these functions issue a warning message and return NULL.
Stephanie Gogarten, Jess Shen, Sarah Nelson
GenotypeData, duplicateDiscordance, duplicateDiscordanceProbability
library(GWASTools)

# first set: 10 SNPs, 3 scans (subjects A, B, C)
snp1 <- data.frame(snpID=1:10, chromosome=1L, position=101:110,
                   rsID=paste("rs", 101:110, sep=""),
                   alleleA="A", alleleB="G", stringsAsFactors=FALSE)
scan1 <- data.frame(scanID=1:3, subjectID=c("A","B","C"), sex="F",
                    stringsAsFactors=FALSE)
mgr <- MatrixGenotypeReader(genotype=matrix(c(0,1,2), ncol=3, nrow=10),
                            snpID=snp1$snpID, chromosome=snp1$chromosome,
                            position=snp1$position, scanID=1:3)
genoData1 <- GenotypeData(mgr, snpAnnot=SnpAnnotationDataFrame(snp1),
                          scanAnnot=ScanAnnotationDataFrame(scan1))

# second set: 5 SNPs, 3 scans (subject C is genotyped twice)
snp2 <- data.frame(snpID=1:5, chromosome=1L,
                   position=as.integer(c(101,103,105,107,107)),
                   rsID=c("rs101", "rs103", "rs105", "rs107", "rsXXX"),
                   alleleA=c("A","C","G","A","A"),
                   alleleB=c("G","T","A","G","G"),
                   stringsAsFactors=FALSE)
scan2 <- data.frame(scanID=1:3, subjectID=c("A","C","C"), sex="F",
                    stringsAsFactors=FALSE)
mgr <- MatrixGenotypeReader(genotype=matrix(c(1,2,0), ncol=3, nrow=5),
                            snpID=snp2$snpID, chromosome=snp2$chromosome,
                            position=snp2$position, scanID=1:3)
genoData2 <- GenotypeData(mgr, snpAnnot=SnpAnnotationDataFrame(snp2),
                          scanAnnot=ScanAnnotationDataFrame(scan2))

# match SNPs on position only
duplicateDiscordanceAcrossDatasets(genoData1, genoData2,
                                   match.snps.on="position",
                                   subjName.cols="subjectID")

# match SNPs on position and alleles (the default)
duplicateDiscordanceAcrossDatasets(genoData1, genoData2,
                                   match.snps.on=c("position", "alleles"),
                                   subjName.cols="subjectID")

# additionally require matching SNP names, taken from the rsID column
duplicateDiscordanceAcrossDatasets(genoData1, genoData2,
                                   match.snps.on=c("position", "alleles", "name"),
                                   subjName.cols="subjectID",
                                   snpName.cols="rsID")

# compare all cross-dataset pairs for each subject
duplicateDiscordanceAcrossDatasets(genoData1, genoData2,
                                   subjName.cols="subjectID",
                                   one.pair.per.subj=FALSE)

# accuracy of minor allele detection in genoData2 relative to genoData1
minorAlleleDetectionAccuracy(genoData1, genoData2,
                             subjName.cols="subjectID")

# dosage correlation, excluding the second scan of subject C.
# Per the details above, this function expects objects created with
# GdsGenotypeReader, so this call may not run with the matrix-backed
# objects built here.
dupDosageCorAcrossDatasets(genoData1, genoData2,
                           scan.exclude2=scan2$scanID[duplicated(scan2$subjectID)])