Incremental Data Release for v7 Genotyping Array and Short-Read Genomic Data


The All of Us Researcher Workbench team has published an incremental release of the v7 Curated Data Repository (CDR) genotyping array (“array”) and short-read whole genome sequencing (srWGS) data to ensure consistency with data quality procedures.

Two new issues were discovered in the dataset regarding data processing and quality procedures. These issues affected 1493 array samples and 12 srWGS samples. 

We have updated the v7 merged array dataset Hail MatrixTable and PLINK bed files to remove samples that were affected by these issues. We have also deprecated single sample files for samples affected by these issues. The deprecated data will be removed after 60 days. Lists of affected samples are available on the Researcher Workbench.  

The new array dataset is available in PLINK bed and Hail MT formats as version 7.1. We recommend that researchers upgrade their analyses to use the v7.1 array data or remove affected srWGS samples within 60 days.

Background

The issues have been documented in the All of Us Genomic Quality Report and are referred to as Known Issues #14 and #15.  

Known Issue #14 involves a sample processing issue affecting array data from 63 participants. 

Known Issue #15 involves 1430 participants who were discovered to have a history of a bone marrow transplant that was either allogeneic or of unknown type. All 1430 participants have array data, and 12 of those participants also have srWGS data.

For both of these issues, we recommend removing any affected participants from your dataset. 

  • If you use the array merged Hail MatrixTable or PLINK bed files, upgrade to the new version 7.1. Current environment variables have been updated to link to the 7.1 dataset.
  • If you use single sample array data, such as IDATs or VCFs, remove affected samples from your analysis. List files of affected samples are available on the Researcher Workbench. Affected files have been renamed and now have the suffix *_deprecated. After 60 days, these files will be deleted.
  • If you use srWGS data, remove any affected samples from your dataset and remove affected single sample files from your workflow. List files of affected samples are available on the Researcher Workbench. Affected files have been renamed and now have the suffix *_deprecated. After 60 days, these files will be deleted.

Note: Please take action within 60 days. 
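The removal step described above can be sketched in Python. This is a minimal illustration, not an official procedure: it assumes each list file holds one research ID per line in the first tab-separated column, optionally under a header row. The file layout is not described in this article, so check it before relying on this. The function names, the header name, and the sample IDs are all hypothetical.

```python
def load_affected_ids(lines, header="research_id"):
    """Collect research IDs from the lines of a list file.

    Assumption: one ID per line in the first tab-separated column,
    with an optional header row named `header` (hypothetical name).
    """
    ids = set()
    for line in lines:
        first = line.strip().split("\t")[0]
        if first and first != header:
            ids.add(first)
    return ids


def drop_affected(sample_ids, affected):
    """Return the sample IDs, in order, that are not in the affected set."""
    return [s for s in sample_ids if str(s) not in affected]


# Illustrative data only; real IDs come from the known-issue list files.
issue_14 = load_affected_ids(["research_id", "1000001", "1000002"])
issue_15 = load_affected_ids(["research_id", "1000003"])
kept = drop_affected(["1000001", "1000003", "1000004"], issue_14 | issue_15)
print(kept)
```

Taking the union of both known-issue lists before filtering means a sample flagged by either issue is dropped in a single pass.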

Where to find updated data

An update to the v7 array merged variant data, version 7.1, is now available in PLINK bed and Hail MT formats. In this data update, we remediated the two new known issues, Known Issue #14 and Known Issue #15. Additionally, we have remediated three known issues that were present in the original v7 dataset and are described in the All of Us Genomic Quality Report.

We removed 31 additional array samples that were affected by sample processing issues (Known Issue #1 and Known Issue #2). We added one sample that was missing from the merged array dataset (Known Issue #4). In total, 1525 samples were updated in the version 7.1 array PLINK bed and Hail MT formats. 
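The sample counts quoted in this article can be reconciled with simple arithmetic. The sketch below assumes the affected groups do not overlap; the article does not say so explicitly, but the sums match the stated totals.

```python
# Counts stated in this article.
known_issue_14 = 63     # array samples with a sample processing issue
known_issue_15 = 1430   # array samples from participants with a bone marrow transplant history
known_issues_1_2 = 31   # additional array samples removed
known_issue_4 = 1       # sample added to the merged array dataset

# Array samples affected by the two new issues.
new_issue_samples = known_issue_14 + known_issue_15

# All samples updated in the version 7.1 merged array data.
total_updated = new_issue_samples + known_issues_1_2 + known_issue_4

print(new_issue_samples)  # 1493
print(total_updated)      # 1525
```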

We have provided list files containing the research IDs of the affected samples so that you can remove affected samples from your analysis (Table 1). 

Single sample files for affected array and srWGS samples have been renamed with the suffix *_deprecated. These files will be removed after 60 days (Table 2).

Environment variables now refer to the updated data.

 

Table 1: Controlled Tier directory updated data

| Asset | Location | Environment Variable |
| --- | --- | --- |
| Array: all samples Hail MT | gs://fc-aou-datasets-controlled/v7/microarray/hail_v7.1.mt | MICROARRAY_HAIL_STORAGE_PATH |
| Array: all samples PLINK files | gs://fc-aou-datasets-controlled/v7/microarray/plink_v7.1/arrays.* | |
| Array: IDAT files | gs://fc-aou-datasets-controlled/v7/microarray/idat/manifest.csv | MICROARRAY_IDAT_MANIFEST_PATH |
| Array: single sample VCFs | gs://fc-aou-datasets-controlled/v7/microarray/vcf/manifest.csv | MICROARRAY_VCF_MANIFEST_PATH |
| srWGS: CRAM files | gs://fc-aou-datasets-controlled/v7/wgs/cram/manifest.csv | WGS_CRAM_MANIFEST_PATH |
| Array: Known Issue #14 | gs://fc-aou-datasets-controlled/v7/known_issues/research_id_v7_array_known_issue_14.tsv | |
| Array: Known Issue #15 | gs://fc-aou-datasets-controlled/v7/known_issues/research_id_v7_array_known_issue_15.tsv | |
| srWGS: Known Issue #15 | gs://fc-aou-datasets-controlled/v7/known_issues/research_id_v7_wgs_known_issue_15.tsv | |
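Because the environment variables now refer to the updated data, a quick check in a notebook can confirm which version an analysis is reading. The sketch below assumes that version 7.1 merged array paths contain the marker "_v7.1", as the Hail MT and PLINK locations in Table 1 do. The helper name is hypothetical.

```python
import os


def points_to_v71(path):
    """True if a merged array path looks like a version 7.1 asset.

    Assumption: 7.1 paths contain "_v7.1" (for example hail_v7.1.mt
    or plink_v7.1/), while the original v7 paths do not.
    """
    return "_v7.1" in path


hail_path = os.environ.get("MICROARRAY_HAIL_STORAGE_PATH", "")
if not hail_path:
    print("MICROARRAY_HAIL_STORAGE_PATH is not set in this environment")
elif points_to_v71(hail_path):
    print("Using v7.1 merged array data:", hail_path)
else:
    print("Path does not look like v7.1; check your workspace:", hail_path)
```

Hard-coded paths in older notebooks will not pick up the change automatically, so the same check is worth applying to any path string copied from a previous analysis.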

 

Table 2: Deprecated datasets

| Asset | Original Path | Deprecated Dataset (available for 60 days) |
| --- | --- | --- |
| Array: all samples Hail MT | gs://fc-aou-datasets-controlled/v7/microarray/hail.mt | gs://fc-aou-datasets-controlled/v7/microarray/hail_deprecated.mt |
| Array: all samples PLINK files | <filepath>/arrays.* | <filepath>/arrays_deprecated.* |
| Array: IDAT files | <filepath>/*_Grn.idat | <filepath>/*_Grn_deprecated.idat |
| Array: single sample VCFs | <filepath>/*.sorted.vcf.gz | <filepath>/*.sorted_deprecated.vcf.gz |
| srWGS: CRAM files | <filepath>/wgs_*.cram | <filepath>/wgs_*_deprecated.cram |
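If a workflow builds its input list from file paths, the renaming pattern in Table 2 can be used to skip deprecated files. The sketch below assumes every deprecated file has "_deprecated" immediately before its first extension, as in the Table 2 patterns. The helper name and the example paths are hypothetical.

```python
def is_deprecated(path):
    """True if the file name carries the _deprecated suffix before its extension.

    Assumption: deprecated names follow the Table 2 patterns, such as
    wgs_*_deprecated.cram or *.sorted_deprecated.vcf.gz.
    """
    name = path.rstrip("/").split("/")[-1]
    return "_deprecated." in name


# Hypothetical paths for illustration only.
paths = [
    "example/wgs_0001.cram",
    "example/wgs_0002_deprecated.cram",
    "example/0003.sorted_deprecated.vcf.gz",
]
usable = [p for p in paths if not is_deprecated(p)]
print(usable)
```

Filtering on the file name is only a safeguard; removing the research IDs in the known-issue list files remains the primary way to exclude affected samples.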

 

If you have any questions regarding the hotfix, please contact us at support@researchallofus.org.

 
