seqfile <- params$seqfile interactive <- params$interactive
sampleanno <- seqfile@QCresult$sample.annot ind.study <- sampleanno[,5] == "study" anno.study <- sampleanno[ind.study, ] ## table(anno.study$population) ## table(anno.study$gender)
We have a total of r nrow(anno.study)
study samples from the population of r unique(anno.study$population)
, with r unname(table(anno.study$gender)[names(table(anno.study$gender)) %in% "female"])
female samples and r unname(table(anno.study$gender)[names(table(anno.study$gender)) %in% "male"])
male samples.
Here is the list of all QC steps we have done.
(qcresults <- names(QCresult(seqfile))[names(QCresult(seqfile)) %in% c("MissingRate", "SexCheck", "Inbreeding", "IBD", "PCA")])
Samples with high missing rate (>0.1) are marked as "Yes" in the column of outlier
and should be removed. The value of the outlier
column is set to "NA" for benchmark samples.
if("MissingRate" %in% qcresults){ res.mr <- QCresult(seqfile)$MissingRate plotQC(seqfile, QCstep = "MissingRate", interactive = interactive) }else{ print("MissingRate check is not done yet!") res.mr <- NULL }
There are r sum(res.mr$outlier %in% "Yes")
problematic samples due to high missing rate.
Samples are predicted to be female or male if the inbreeding coefficient is below 0.2, or greater than 0.8. The samples with discordant reported gender and predicted gender are considered as problematic. When the inbreeding coefficient is within the range of [0.2, 0.8], β0β will be shown in the column of pred.sex
to indicate ambiguous gender, which would not be considered as problematic.
if("SexCheck" %in% qcresults){ res.sex <- QCresult(seqfile)$SexCheck tail(res.sex) table(res.sex$sex, res.sex$pred.sex) ## sample with gender mismatch (prob.sex <- res.sex[res.sex$sex == "female" & res.sex$pred.sex == "male" | res.sex$sex == "male" & res.sex$pred.sex == "female", ]) plotQC(seqfile, QCstep = "SexCheck", interactive = interactive) }else{ print("Sex check is not done yet!") prob.sex <- NULL }
There are r nrow(prob.sex)
problematic samples due to gender mismatch.
By using LD-pruned variants (by default), we calculate the inbreeding coefficients for each sample in the study cohort and for benchmark samples of the same population as the study cohort. Samples with inbreeding coefficients that are five standard deviations beyond the mean are considered problematic and shows "Yes" in the column of outlier.5sd
. Benchmark samples in this column are set to be βNAβ.
The estimate of inbreeding coefficient could be negative due to random sampling error, but a strongly negative value could reflect other factors, e.g. sample contamination events.
if("Inbreeding" %in% qcresults){ res.inb <- QCresult(seqfile)$Inbreeding tail(res.inb) table(res.inb$outlier.5sd) ## sample with inbreeding outlier (prob.inb <- res.inb[res.inb$outlier.5sd == "Yes" & !is.na(res.inb$outlier.5sd), ]) plotQC(seqfile, QCstep = "Inbreeding", interactive = interactive) }else{ print("Inbreeding outlier check is not done yet!") prob.inb <- NULL }
There are r nrow(prob.inb)
problematic samples that are inbreeding outliers.
By using LD-pruned variants (by default), we calculate the IBD coefficients for all sample pairs, and then predict related sample pairs in study cohort using the support vector machine (SVM) method with linear kernel and the known relatedness embedded in benchmark data as training set.
Sample pairs with discordant self-reported and predicted relationship are considered as problematic. All predicted related pairs are also required to have coefficient of kinship >= 0.08. The sample with higher missing rate in each related pair is selected for removal from further analysis by function of IBDRemove
.
if("IBD" %in% qcresults){ res.ibd <- QCresult(seqfile)$IBD head(res.ibd) table(res.ibd$label, res.ibd$pred.label) ## sample with cryptic relatedness (prob.ibd <- IBDRemove(seqfile)) plotQC(seqfile, QCstep = "IBD", interactive = interactive) }else{ print("IBD check is not done yet!") prob.ibd <- NULL }
There are r nrow(prob.ibd$ibd.pairs)
problematic sample pairs with cryptic relationship.
By using LD-pruned variants (by default), and benchmark samples as training dataset, we predict the ancestry for each sample in the study cohort using the top four principle components by support machine (SVM) with linear kernel.
Samples with discordant predicted and self-reported population groups are considered problematic. The function PCA
performs the PCA analysis and identifies population outliers in study cohort.
if("PCA" %in% qcresults){ res.pca <- QCresult(seqfile)$PCA tail(res.pca) studyid <- sampleanno[ind.study, 1] res.study <- res.pca[res.pca$sample %in% studyid, ] (prob.pca <- res.study[res.study$pop != res.study$pred.pop, ]) plotQC(seqfile, QCstep = "PCA", interactive = interactive) }else{ print("Population outlier check is not done yet!") prob.pca <- NULL }
There are r nrow(prob.pca)
problematic samples that are detected to be population outliers.
Here is the summary list of problematic samples. Problematic samples are identified by SeqSQC due to different reasons of high missing rate, gender mismatch, inbreeding outlier, cryptic relationship or population outlier.
if ("problem.list" %in% names(QCresult(seqfile))){ (probs <- QCresult(seqfile)$problem.list) (prob.remove <- QCresult(seqfile)$remove.list) }else{ (probs <- problemList(seqfile)) }
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.