Why do I see a high sample no-call rate in the smaller callsets?


The high sample no-call rate is a result of our filtering process for the smaller callsets, not of QC issues. The smaller callsets are the srWGS SNP and Indel callset restricted to limited genomic regions, provided in VCF, Hail MT, BGEN, and PLINK bed formats.

During joint callset QC, we filter at the genotype level: any genotype whose FT field is neither "PASS" nor missing is set to a no-call (./.). These additional no-calls artificially deflate the call rate of every sample. The effect is more pronounced in the split Hail MT, BGEN, and PLINK bed files because multiallelic sites are split in these formats.
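
As a rough illustration of why genotype-level filtering lowers the sample call rate, here is a minimal Python sketch on made-up data for a single sample. It is not the pipeline used to build the callsets. The genotype values, the "LowGQ" filter label, and the helper names apply_genotype_filter and call_rate are all hypothetical, and None stands in for a missing FT field.

```python
# Toy (GT, FT) records for one sample. Illustrative values only, not real data.
# FT of None stands in for a missing FT field; "LowGQ" is a hypothetical label.
genotypes = [
    ("0/1", "PASS"),
    ("0/0", None),
    ("1/1", "PASS"),
    ("0/1", "LowGQ"),
    ("./.", None),      # already a no-call before filtering
    ("0/0", "PASS"),
    ("0/1", "LowGQ"),
    ("0/0", None),
]

NO_CALL = "./."


def apply_genotype_filter(records):
    """Set GT to a no-call when FT is neither 'PASS' nor missing."""
    return [
        (NO_CALL if ft not in ("PASS", None) else gt, ft)
        for gt, ft in records
    ]


def call_rate(records):
    """Fraction of genotypes that are not no-calls."""
    called = sum(1 for gt, _ in records if gt != NO_CALL)
    return called / len(records)


rate_before = call_rate(genotypes)
rate_after = call_rate(apply_genotype_filter(genotypes))

print(f"call rate before genotype filtering: {rate_before:.3f}")
print(f"call rate after genotype filtering:  {rate_after:.3f}")
```

In this toy example the two filtered genotypes become no-calls, so the sample's call rate drops even though nothing about the sample itself changed.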

Because we have already performed sample and genotype QC, we do not recommend using sample call rate to filter samples in the smaller callsets. If you apply a call rate QC check as part of your analyses, you will arbitrarily remove too many samples. For similar reasons, we do not recommend using variant call rate to filter variants in the smaller callsets.

We believe that the QC steps detailed in the All of Us Genomic Quality Report already prune or flag potentially problematic samples. 

If you have any further questions about the quality of the variant data in the All of Us genomic dataset, please reach out to support@researchallofus.org.
