1. summary of samples

seqfile <- params$seqfile
interactive <- params$interactive
sampleanno <- seqfile@QCresult$sample.annot
ind.study <- sampleanno[,5] == "study"
anno.study <- sampleanno[ind.study, ]
## table(anno.study$population)
## table(anno.study$gender)

We have a total of r nrow(anno.study) study samples from the population of r unique(anno.study$population), with r unname(table(anno.study$gender)[names(table(anno.study$gender)) %in% "female"]) female samples and r unname(table(anno.study$gender)[names(table(anno.study$gender)) %in% "male"]) male samples.

2. Sample quality check

Here is the list of all QC steps we have done.

(qcresults <- names(QCresult(seqfile))[names(QCresult(seqfile)) %in% c("MissingRate", "SexCheck", "Inbreeding", "IBD", "PCA")])

2.1 sample missing rate check

Samples with high missing rate (>0.1) are marked as "Yes" in the column of outlier and should be removed. The value of the outlier column is set to "NA" for benchmark samples.

if("MissingRate" %in% qcresults){
    res.mr <- QCresult(seqfile)$MissingRate
    plotQC(seqfile, QCstep = "MissingRate", interactive = interactive)    
}else{
    print("MissingRate check is not done yet!")
    res.mr <- NULL
}

There are r sum(res.mr$outlier %in% "Yes") problematic samples due to high missing rate.

2.2 Sex check

Samples are predicted to be female or male if the inbreeding coefficient is below 0.2, or greater than 0.8. The samples with discordant reported gender and predicted gender are considered as problematic. When the inbreeding coefficient is within the range of [0.2, 0.8], β€œ0” will be shown in the column of pred.sex to indicate ambiguous gender, which would not be considered as problematic.

if("SexCheck" %in% qcresults){
    res.sex <- QCresult(seqfile)$SexCheck
    tail(res.sex)
    table(res.sex$sex, res.sex$pred.sex)
    ## sample with gender mismatch
    (prob.sex <- res.sex[res.sex$sex == "female" & res.sex$pred.sex == "male" | res.sex$sex == "male" & res.sex$pred.sex == "female", ])
    plotQC(seqfile, QCstep = "SexCheck", interactive = interactive)
}else{
    print("Sex check is not done yet!")
    prob.sex <- NULL
}

There are r nrow(prob.sex) problematic samples due to gender mismatch.

2.3 Inbreeding check

By using LD-pruned variants (by default), we calculate the inbreeding coefficients for each sample in the study cohort and for benchmark samples of the same population as the study cohort. Samples with inbreeding coefficients that are five standard deviations beyond the mean are considered problematic and shows "Yes" in the column of outlier.5sd. Benchmark samples in this column are set to be β€œNA”.

The estimate of inbreeding coefficient could be negative due to random sampling error, but a strongly negative value could reflect other factors, e.g. sample contamination events.

if("Inbreeding" %in% qcresults){
    res.inb <- QCresult(seqfile)$Inbreeding
    tail(res.inb)
    table(res.inb$outlier.5sd)
    ## sample with inbreeding outlier
    (prob.inb <- res.inb[res.inb$outlier.5sd == "Yes" & !is.na(res.inb$outlier.5sd), ])
    plotQC(seqfile, QCstep = "Inbreeding", interactive = interactive)
}else{
    print("Inbreeding outlier check is not done yet!")
    prob.inb <- NULL
}

There are r nrow(prob.inb) problematic samples that are inbreeding outliers.

2.4 IBD check

By using LD-pruned variants (by default), we calculate the IBD coefficients for all sample pairs, and then predict related sample pairs in study cohort using the support vector machine (SVM) method with linear kernel and the known relatedness embedded in benchmark data as training set.

Sample pairs with discordant self-reported and predicted relationship are considered as problematic. All predicted related pairs are also required to have coefficient of kinship >= 0.08. The sample with higher missing rate in each related pair is selected for removal from further analysis by function of IBDRemove.

if("IBD" %in% qcresults){
    res.ibd <- QCresult(seqfile)$IBD
    head(res.ibd)
    table(res.ibd$label, res.ibd$pred.label)
    ## sample with cryptic relatedness
    (prob.ibd <- IBDRemove(seqfile))
    plotQC(seqfile, QCstep = "IBD", interactive = interactive)
}else{
    print("IBD check is not done yet!")
    prob.ibd <- NULL
}

There are r nrow(prob.ibd$ibd.pairs) problematic sample pairs with cryptic relationship.

2.5 Population outlier check

By using LD-pruned variants (by default), and benchmark samples as training dataset, we predict the ancestry for each sample in the study cohort using the top four principle components by support machine (SVM) with linear kernel.

Samples with discordant predicted and self-reported population groups are considered problematic. The function PCA performs the PCA analysis and identifies population outliers in study cohort.

if("PCA" %in% qcresults){
    res.pca <- QCresult(seqfile)$PCA
    tail(res.pca)
    studyid <- sampleanno[ind.study, 1]
    res.study <- res.pca[res.pca$sample %in% studyid, ]
    (prob.pca <- res.study[res.study$pop != res.study$pred.pop, ])
    plotQC(seqfile, QCstep = "PCA", interactive = interactive)
}else{
    print("Population outlier check is not done yet!")
    prob.pca <- NULL
}

There are r nrow(prob.pca) problematic samples that are detected to be population outliers.

3. Summary of problematic samples.

Here is the summary list of problematic samples. Problematic samples are identified by SeqSQC due to different reasons of high missing rate, gender mismatch, inbreeding outlier, cryptic relationship or population outlier.

if ("problem.list" %in% names(QCresult(seqfile))){
    (probs <- QCresult(seqfile)$problem.list)
    (prob.remove <- QCresult(seqfile)$remove.list)
}else{
    (probs <- problemList(seqfile))
}


Liubuntu/SeqSQC documentation built on April 12, 2024, 6:39 p.m.