The All of Us Researcher Workbench team published an incremental release of the v7 Curated Data Repository (CDR) genotyping array (“array”) and short-read whole genome sequencing (srWGS) data to ensure consistency with data quality procedures.
Two new issues were discovered in the dataset regarding data processing and quality procedures. These issues affected 1493 array samples and 12 srWGS samples.
We have updated the v7 merged array dataset Hail MatrixTable and PLINK bed files to remove samples that were affected by these issues. We have also deprecated single sample files for samples affected by these issues. The deprecated data will be removed after 60 days. Lists of affected samples are available on the Researcher Workbench.
The new array dataset is available in PLINK bed and Hail MT formats as version 7.1. We recommend that researchers upgrade their analyses to use the v7.1 array data or remove affected srWGS samples within 60 days.
Background
The issues have been documented in the All of Us Genomic Quality Report and are referred to as Known Issues #14 and #15.
Known Issue #14 involves a sample processing issue affecting array data from 63 participants.
Known Issue #15 involves 1430 participants who were discovered to have a history of a bone marrow transplant that is allogeneic or an unknown type. All 1430 participants have array data and 12 of those participants also have srWGS data.
For both of these issues, we recommend removing any affected participants from your dataset.
- If you use the array merged Hail MatrixTable or PLINK bed files, upgrade to the new version 7.1. Current environment variables have been updated to link to the 7.1 dataset.
- If you use single sample array data, such as IDATs or VCFs, please remove affected samples from your analysis. List files of affected samples are available on the Researcher Workbench. Affected files have been renamed and now carry the suffix *_deprecated. After 60 days, these files will be deleted.
- If you use srWGS data, remove any affected samples from your dataset and remove affected single sample files from your workflow. List files of affected samples are available on the Researcher Workbench. Affected files have been renamed and now carry the suffix *_deprecated. After 60 days, these files will be deleted.
Note: Please take action within 60 days.
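The removal step described above can be sketched in plain Python. This is a minimal sketch under stated assumptions, not the Workbench team's procedure: it assumes each list file is tab-separated with the research ID in the first column and no header row (the actual layout is not described here, so check the files first), and the helper names `load_affected_ids` and `drop_affected` and all IDs are made up for illustration.

```python
import csv
import io


def load_affected_ids(tsv_text, id_column=0, has_header=False):
    """Parse the text of a list file into a set of research IDs (as strings).

    Assumption: tab-separated text with one research ID per row.
    """
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t")
    rows = [row for row in reader
            if len(row) > id_column and row[id_column].strip()]
    if has_header:
        rows = rows[1:]
    return {row[id_column].strip() for row in rows}


def drop_affected(sample_ids, affected_ids):
    """Return the sample IDs that are not in the affected set, keeping order."""
    return [s for s in sample_ids if str(s) not in affected_ids]


# Made-up contents standing in for the Known Issue #14 and #15 list files.
issue_14 = load_affected_ids("1000001\n1000002\n")
issue_15 = load_affected_ids("1000003\n")
affected = issue_14 | issue_15

# Made-up cohort; IDs are compared as strings to avoid int/str mismatches.
cohort = ["1000001", "1000004", "1000003", "1000005"]
kept = drop_affected(cohort, affected)
print(kept)
```

In a notebook, the text passed to `load_affected_ids` would come from reading the list files given in Table 1, and the same set of IDs can then be used to filter whichever sample table or manifest your analysis uses.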
Where to find updated data
Version 7.1, an update of the v7 array merged variant data in PLINK bed and Hail MT formats, is now available. In this data update, we remediated the two new known issues, Known Issue #14 and Known Issue #15. Additionally, we have remediated three known issues that were present in the original v7 dataset and are described in the All of Us Genomic Quality Report.
We removed 31 additional array samples that were affected by sample processing issues (Known Issue #1 and Known Issue #2). We added one sample that was missing from the merged array dataset (Known Issue #4). In total, 1525 samples were updated in the version 7.1 array PLINK bed and Hail MT formats.
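As a quick check, the sample counts quoted in this notice add up as follows. The tally uses only the figures stated above; the variable names are illustrative, and the removed-sample total assumes that the groups do not overlap, which is consistent with the stated total of 1525.

```python
# Counts quoted in this notice.
known_issue_14 = 63        # array samples with a sample processing issue
known_issue_15 = 1430      # array samples tied to bone marrow transplant history
removed_issues_1_2 = 31    # additional array samples removed (Known Issues #1 and #2)
added_issue_4 = 1          # sample restored to the merged dataset (Known Issue #4)

newly_affected = known_issue_14 + known_issue_15
removed_total = newly_affected + removed_issues_1_2   # assumes no overlap between groups
total_updated = removed_total + added_issue_4

print(newly_affected)   # 1493
print(removed_total)    # 1524
print(total_updated)    # 1525
```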
We have provided list files containing the research IDs of the affected samples so that you can remove affected samples from your analysis (Table 1).
Single sample files for affected array and srWGS samples have been renamed with the suffix *_deprecated. These files will be removed after 60 days (Table 2).
Environment variables now refer to the updated data.
Table 1: Controlled Tier directory updated data
Asset | Location | Environment Variable |
Array: all samples Hail MT | gs://fc-aou-datasets-controlled/v7/microarray/hail_v7.1.mt | MICROARRAY_HAIL_STORAGE_PATH |
Array: all samples PLINK files | gs://fc-aou-datasets-controlled/v7/microarray/plink_v7.1/arrays.* | |
Array: IDAT files | gs://fc-aou-datasets-controlled/v7/microarray/idat/manifest.csv | MICROARRAY_IDAT_MANIFEST_PATH |
Array: single sample VCFs | gs://fc-aou-datasets-controlled/v7/microarray/vcf/manifest.csv | MICROARRAY_VCF_MANIFEST_PATH |
srWGS: CRAM files | gs://fc-aou-datasets-controlled/v7/wgs/cram/manifest.csv | WGS_CRAM_MANIFEST_PATH |
Array: Known Issue #14 | gs://fc-aou-datasets-controlled/v7/known_issues/research_id_v7_array_known_issue_14.tsv | |
Array: Known Issue #15 | gs://fc-aou-datasets-controlled/v7/known_issues/research_id_v7_array_known_issue_15.tsv | |
srWGS: Known Issue #15 | gs://fc-aou-datasets-controlled/v7/known_issues/research_id_v7_wgs_known_issue_15.tsv | |
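To confirm that a running environment has picked up the updated variables, the values can be compared against Table 1. The following is a minimal sketch: the expected paths are transcribed from Table 1, while the function name `find_mismatches` is hypothetical and the comparison is an exact string match, so a harmless difference such as a trailing slash would also be reported.

```python
import os

# Expected values transcribed from Table 1.
EXPECTED_PATHS = {
    "MICROARRAY_HAIL_STORAGE_PATH":
        "gs://fc-aou-datasets-controlled/v7/microarray/hail_v7.1.mt",
    "MICROARRAY_IDAT_MANIFEST_PATH":
        "gs://fc-aou-datasets-controlled/v7/microarray/idat/manifest.csv",
    "MICROARRAY_VCF_MANIFEST_PATH":
        "gs://fc-aou-datasets-controlled/v7/microarray/vcf/manifest.csv",
    "WGS_CRAM_MANIFEST_PATH":
        "gs://fc-aou-datasets-controlled/v7/wgs/cram/manifest.csv",
}


def find_mismatches(environ=None):
    """Return {variable: actual value} for variables that differ from Table 1.

    An unset variable is reported with the value None.
    """
    environ = os.environ if environ is None else environ
    mismatches = {}
    for variable, expected in EXPECTED_PATHS.items():
        actual = environ.get(variable)
        if actual != expected:
            mismatches[variable] = actual
    return mismatches


# Report anything in the current process environment that differs from Table 1.
for variable, actual in find_mismatches().items():
    print(f"{variable} is {actual!r}; Table 1 lists {EXPECTED_PATHS[variable]!r}")
```

An empty result means every listed variable matches Table 1. Outside the Workbench the variables are typically unset, so all four would be reported.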
Table 2: Deprecated datasets
Asset | Original Path | Deprecated Dataset (available for 60 days) |
Array: all samples Hail MT | gs://fc-aou-datasets-controlled/v7/microarray/hail.mt | gs://fc-aou-datasets-controlled/v7/microarray/hail_deprecated.mt |
Array: all samples PLINK files | <filepath>/arrays.* | <filepath>/arrays_deprecated.* |
Array: IDAT files | <filepath>/*_Grn.idat | <filepath>/*_Grn_deprecated.idat |
Array: single sample VCFs | <filepath>/*.sorted.vcf.gz | <filepath>/*.sorted_deprecated.vcf.gz |
srWGS: CRAM files | <filepath>/wgs_*.cram | <filepath>/wgs_*_deprecated.cram |
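The single sample renaming patterns in Table 2 can be expressed as a small helper, which may be useful for spotting references to deprecated files in an existing workflow. This is a sketch under stated assumptions: it covers only the three single sample patterns listed in Table 2, the function names `deprecated_name` and `is_deprecated` are hypothetical, and the bucket and file names in the example are made up.

```python
import re

# (pattern on the original file name, replacement) pairs transcribed from Table 2.
_RENAME_RULES = [
    (re.compile(r"_Grn\.idat$"), "_Grn_deprecated.idat"),
    (re.compile(r"\.sorted\.vcf\.gz$"), ".sorted_deprecated.vcf.gz"),
    (re.compile(r"^(wgs_.*)\.cram$"), r"\1_deprecated.cram"),
]


def deprecated_name(path):
    """Return the deprecated form of a single sample file path, per Table 2.

    Returns None when the file name matches none of the listed patterns.
    """
    directory, sep, name = path.rpartition("/")
    for pattern, replacement in _RENAME_RULES:
        new_name, count = pattern.subn(replacement, name)
        if count:
            return directory + sep + new_name
    return None


def is_deprecated(path):
    """Return True if the file name carries the _deprecated suffix marker."""
    name = path.rpartition("/")[2]
    return "_deprecated" in name


# Made-up paths for illustration only.
paths = [
    "gs://example-bucket/idat/204000000000_R01C01_Grn.idat",
    "gs://example-bucket/vcf/1000001.sorted_deprecated.vcf.gz",
    "gs://example-bucket/cram/wgs_1000002.cram",
]
for p in paths:
    print(p, "->", deprecated_name(p), "| deprecated:", is_deprecated(p))
```

Filtering a manifest with `is_deprecated` gives a quick way to find rows that still point at files scheduled for deletion.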
If you have any questions regarding the hotfix, please contact us at support@researchallofus.org.