The All of Us Researcher Workbench team released updates to short-read WGS (srWGS) smaller callsets to improve usability and to remedy issues related to missing variants and inflated AN and AF fields.
All formats (BGEN, Hail MatrixTables, PLINK bed, and VCF) of the three smaller callsets (ACAF Threshold, ClinVar, and Exome) were updated on 9/26/23.
Recovered Missing Variants in ACAF Threshold (All Formats)
The srWGS ACAF Threshold callset was missing approximately 8% of the variants that should have been included according to our documentation in “How the All of Us Genomic Data are Organized” and “Smaller Callsets for Analyzing Short Read WGS SNP & Indel Data”. We added the missing variants to the updated ACAF Threshold callset.
Applied Genotype Filtering (All Formats)
To make the smaller callsets easier to use, we created a process to filter out genotypes when they fail genotype filtering (FT). The failed genotypes now appear as missing calls (“no calls”).
This process reduces the number of quality control steps researchers need to perform before analysis. The original calls, including genotypes that fail genotype filtering, can be recovered from the complete callset VDS if needed.
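The effect of this filtering can be illustrated with a minimal Python sketch. It is not the pipeline used to build the callsets. It assumes that an FT value of "PASS", ".", or a missing value means the genotype passed, and the filter label "low_qual" is hypothetical.

```python
from typing import Optional, Tuple

# A diploid genotype as a pair of allele indices; None stands for a no call (./.).
Genotype = Optional[Tuple[int, int]]


def apply_ft_filter(gt: Genotype, ft: Optional[str]) -> Genotype:
    """Return the genotype unchanged if FT passes, otherwise a no call.

    Assumption: an FT of None, "." or "PASS" means the genotype passed.
    """
    if ft in (None, ".", "PASS"):
        return gt
    return None


# (genotype, FT) pairs; "low_qual" is a hypothetical filter label.
calls = [((0, 1), "PASS"), ((1, 1), "low_qual"), ((0, 0), None)]
filtered = [apply_ft_filter(gt, ft) for gt, ft in calls]
```

In this sketch the second genotype becomes a no call, while the other two are kept as they were.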
Populated rsid Field (BGEN Format)
The “rsid” field was empty in the BGEN format, which caused issues with some software, including PLINK and Regenie. We added a unique identifier to the “rsid” field based on the position, REF, and ALT alleles, so that researchers no longer need to work around this issue manually.
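As an illustration of how an identifier can be derived from position and alleles, here is a short Python sketch. The exact format written to the BGEN files is not stated in this article, so the colon-separated layout and the function name `make_variant_id` are assumptions made for the example only.

```python
def make_variant_id(chrom: str, pos: int, ref: str, alt: str) -> str:
    """Build an identifier from chromosome, position, REF, and ALT.

    Assumption: a colon-separated layout; the real files may use another format.
    """
    return f"{chrom}:{pos}:{ref}:{alt}"


# Two variants at the same position still get distinct identifiers
# because the ALT allele is part of the identifier.
ids = [
    make_variant_id("chr1", 12345, "A", "G"),
    make_variant_id("chr1", 12345, "A", "T"),
]
```

Including REF and ALT, and not only the position, is what keeps identifiers distinct for different alleles at the same site.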
Corrected Allele Number (AN) and Allele Fraction (AF) Fields (VCF and Hail MT Formats)
The smaller callset VCF and Hail MT formats were updated to correct for a Hail bug. In these formats, some positions of the genome had variants with inflated allele number (AN) and allele fraction (AF) fields. These incorrect values resulted from genotypes at variants at the beginning of chromosome calling regions being erroneously called as homref (GT = 0/0) instead of as no calls (GT = ./.).
This affected a small number of variants on the autosomes and on chromosome X (less than 0.03% of variants). Chromosome Y was more severely affected (less than 4.10% of variant sites).
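To see why the distinction between a homref call and a no call matters for AN, consider the sketch below. It assumes the conventional definition of AN as the number of alleles in called genotypes; the genotype lists are made-up illustrations, not data from the callsets.

```python
def allele_number(genotypes):
    """Count alleles in called genotypes; None is a no call and adds nothing."""
    return sum(len(gt) for gt in genotypes if gt is not None)


def allele_count(genotypes, allele=1):
    """Count occurrences of one allele index across called genotypes."""
    return sum(gt.count(allele) for gt in genotypes if gt is not None)


# Same four samples: two of them should be no calls (./.).
expected = [(0, 1), None, None, (1, 1)]
# The same site with the two no calls written as homref (0/0).
erroneous = [(0, 1), (0, 0), (0, 0), (1, 1)]

an_expected = allele_number(expected)
an_erroneous = allele_number(erroneous)
```

Here the erroneous homref genotypes double AN while the alternate allele count stays the same, so any statistic derived from AN changes with it.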
If you use the AN or AF fields in your analyses, we recommend that you upgrade to the new version of these smaller callsets and re-run any completed analyses.
File Path Updates
We updated the Controlled CDR Directory page on the User Support Hub to include the new srWGS smaller callsets. Table 1 summarizes the original, new, and deprecated locations of the srWGS smaller callsets. The original versions of the smaller callsets were renamed to “*_deprecated” and will be deleted in 60 days (on 11/7/23).
We recommend that you update your bucket path locations to the new location in all analyses. If you cannot upgrade immediately, we recommend that you use the deprecated location until you are able to update to the new location. Please note: you will need to transition to the new location by 11/7/23.
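If bucket paths are hard-coded in notebooks, a small helper like the following Python sketch can rewrite them to the new locations listed in Table 1. The function name `to_new_location` is hypothetical, and the sketch assumes that the layout beneath each callset directory is unchanged in the new version; the `vcf/example.vcf.bgz` suffix is only a placeholder.

```python
BASE = "gs://fc-aou-datasets-controlled/v7/wgs/short_read/snpindel"

# Original directory name -> new directory name, taken from Table 1.
NEW_DIR = {
    "acaf_threshold": "acaf_threshold_v7.1",
    "clinvar": "clinvar_v7.1",
    "exome": "exome_v7.1",
}


def to_new_location(path: str) -> str:
    """Rewrite an original or deprecated callset path to the new location.

    Paths that do not point at an original or deprecated callset
    directory are returned unchanged.
    """
    for old, new in NEW_DIR.items():
        for prefix in (f"{BASE}/{old}_deprecated", f"{BASE}/{old}"):
            if path == prefix or path.startswith(prefix + "/"):
                return f"{BASE}/{new}" + path[len(prefix):]
    return path


# Placeholder file suffix, for illustration only.
updated = to_new_location(f"{BASE}/exome_deprecated/vcf/example.vcf.bgz")
```

Paths that already point at a new location pass through unchanged, so the helper can be applied repeatedly without harm.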
If you have already completed a project with the ACAF callset, you may want to re-run your analysis to include the missing variants. If you have published with these callsets, no changes are needed because your analysis relies on the data as it was at the time of publication.
If you have any questions regarding the update, please contact us at support@researchallofus.org.
Table 1: Original, New, and Deprecated Locations for the srWGS Smaller Callsets.
srWGS Smaller Callsets | Status | Location |
ACAF Threshold | New | gs://fc-aou-datasets-controlled/v7/wgs/short_read/snpindel/acaf_threshold_v7.1 |
ACAF Threshold | Deprecated (deleted in 60 days) | gs://fc-aou-datasets-controlled/v7/wgs/short_read/snpindel/acaf_threshold_deprecated |
ACAF Threshold | Original (deleted) | gs://fc-aou-datasets-controlled/v7/wgs/short_read/snpindel/acaf_threshold |
ClinVar | New | gs://fc-aou-datasets-controlled/v7/wgs/short_read/snpindel/clinvar_v7.1 |
ClinVar | Deprecated (deleted in 60 days) | gs://fc-aou-datasets-controlled/v7/wgs/short_read/snpindel/clinvar_deprecated |
ClinVar | Original (deleted) | gs://fc-aou-datasets-controlled/v7/wgs/short_read/snpindel/clinvar |
Exome | New | gs://fc-aou-datasets-controlled/v7/wgs/short_read/snpindel/exome_v7.1 |
Exome | Deprecated (deleted in 60 days) | gs://fc-aou-datasets-controlled/v7/wgs/short_read/snpindel/exome_deprecated |
Exome | Original (deleted) | gs://fc-aou-datasets-controlled/v7/wgs/short_read/snpindel/exome |