The assets currently available in the Researcher Workbench Controlled Tier dataset are listed below, organized by where and how each is stored.
Summary of changes since Beta
- 32 WGS samples removed
- 81 microarray samples removed
- CRAM files added
- IDAT files added
See All of Us Genomic Quality Report for more information.
Added: manifest files
Some genomic data file types are now shared across Curated Data Repository (CDR) releases. For these file types, you will find a manifest CSV file in place of a simple directory of files. In this release, this pattern applies to CRAM files, microarray VCF files, and IDAT files. To comply with the Workbench data use policy, use these manifests to determine the correct set of storage asset paths associated with your workspace / CDR version. Data files that are not shared across releases (e.g., merged WGS VCFs and Hail Matrix Tables) continue to reside in a simple versioned directory. Read on for the manifest CSV documentation for each file type.
Note that microarray VCF files now use the manifest pattern, changing from the simple directory structure used in the prior release.
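As a rough illustration of the manifest pattern, the following sketch parses a manifest CSV and looks up the file URIs for a chosen set of participants. It assumes the CRAM manifest columns listed in the Cloud Storage table (person_id, cram_uri, cram_index_uri). The helper name `select_manifest_rows` and the sample rows, including the bucket name, are hypothetical and stand in for real manifest content.

```python
import csv
import io


def select_manifest_rows(manifest_text, person_ids):
    """Return the manifest rows whose person_id is in person_ids.

    manifest_text: full text of a manifest CSV (header row included).
    person_ids: iterable of IDs; compared as strings because CSV values are text.
    """
    wanted = {str(pid) for pid in person_ids}
    reader = csv.DictReader(io.StringIO(manifest_text))
    return [row for row in reader if row["person_id"] in wanted]


# Hypothetical manifest content for illustration only; real URIs come from
# the manifest file for your workspace / CDR version.
SAMPLE_MANIFEST = """person_id,cram_uri,cram_index_uri
1000001,gs://example-bucket/cram/wgs_1000001.cram,gs://example-bucket/cram/wgs_1000001.cram.crai
1000002,gs://example-bucket/cram/wgs_1000002.cram,gs://example-bucket/cram/wgs_1000002.cram.crai
"""

rows = select_manifest_rows(SAMPLE_MANIFEST, [1000002])
for row in rows:
    print(row["person_id"], row["cram_uri"], row["cram_index_uri"])
```

In a workspace, the manifest text would presumably be read from the path held in the relevant manifest environment variable (for example WGS_CRAM_MANIFEST_PATH). That step is omitted here because it depends on the tooling available in your environment.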
BigQuery
Included within the Controlled Tier is a “mainline” curated data repository (CDR) that contains data types (e.g., Survey, Electronic Health Record (EHR), Wearable Device (Fitbit), Physical Measurement) also represented within the Registered Tier, but with different privacy protections applied. Like the Registered Tier, the Controlled Tier includes a default CDR, which is queryable through the Researcher Workbench dataset tools or BigQuery directly, and a base CDR that is queryable only through BigQuery. Below is information about the mainline CDR location within BigQuery.
For more information about mainline CDR creation and the differences between the default and base instances, please see here. For information about the specific tables, fields, and privacy protections applied to the Registered and Controlled Tier CDRs, please see the CDR Data Dictionary.
Asset |
Location |
Env var |
Description |
Default Mainline CDR |
fc-aou-cdr-prod.C2022Q2R6 |
WORKSPACE_CDR |
Main BigQuery CDR containing the same data types (e.g., Survey, Electronic Health Record (EHR), Wearable Device (Fitbit), Physical Measurement) included in the Registered Tier, but with different privacy protections applied. |
Base Mainline CDR |
fc-aou-cdr-prod.C2022Q2R6_base |
Same CDR representation as noted above, but with fewer processing/convenience transformations applied. |
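A minimal sketch of referencing the mainline CDR from a notebook: it builds a fully qualified table name from the WORKSPACE_CDR environment variable and composes a query string. The fallback dataset is the Default Mainline CDR location from the table above. The `person` table name and the helper names are assumptions for illustration (consult the CDR Data Dictionary for the actual tables), and running the query requires a BigQuery client, which is not shown.

```python
import os

# Default Mainline CDR location from the table above, used only when
# WORKSPACE_CDR is not set (e.g., outside the Researcher Workbench).
DEFAULT_CDR = "fc-aou-cdr-prod.C2022Q2R6"


def cdr_table(table_name, environ=os.environ):
    """Return a backtick-quoted, fully qualified BigQuery table name."""
    dataset = environ.get("WORKSPACE_CDR", DEFAULT_CDR)
    return f"`{dataset}.{table_name}`"


def count_rows_query(table_name, environ=os.environ):
    """Compose a standard SQL query that counts the rows of one CDR table."""
    return f"SELECT COUNT(*) AS n FROM {cdr_table(table_name, environ)}"


# "person" is an assumed table name used purely as an example.
query = count_rows_query("person")
print(query)
```

Reading the dataset name from the environment rather than hard-coding it keeps a notebook tied to the CDR version of the workspace it runs in.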
Cloud Storage
Also included in the Controlled Tier is a genomics CDR that is stored separately from the BigQuery assets referenced above. Note that the genomics CDR is available only through the Controlled Tier (i.e., genomic data are not available within the Registered Tier).
For more information on the specifics of the genomic data assets (including the auxiliary files) and the associated file formats referenced below, please see the Genomic Data section of How the All of Us genomic data are organized. Before using the genomic data, we highly recommend reading the Known Issues section of the All Of Us Research Program Genomic Research Data Quality Report. For help displaying HTML files in the Researcher Workbench, please see our featured notebook.
Asset |
Location |
Env var |
Description |
All CDR assets path |
gs://fc-aou-datasets-controlled/v6 |
CDR_STORAGE_PATH |
All Cloud Storage assets for this CDR version are under this path |
WGS: joint callset VCFs |
gs://fc-aou-datasets-controlled/v6/wgs/vcf/merged/* |
WGS_VCF_MERGED_STORAGE_PATH |
Joint callset VCF of all samples in the CDR with genomic data available (98,590 samples). This callset is sharded, by genomic interval, into multiple files. The files can be used with a wide range of command-line tools and libraries. Please see WGS VCFs for more information. |
WGS auxiliary files |
gs://fc-aou-datasets-controlled/v6/wgs/vcf/aux/ |
Auxiliary files related to the “all samples” WGS data. |
|
|
…/vcf_shard_intervals/interval_lists/*.vcf.gz.interval_list …/vcf_shard_intervals/shard_begin_and_end_bucket.txt |
Interval list files corresponding to the sharded VCFs (.interval_list). Please see Interval list files for more details on the format. We also provide the extents of each interval_list file (txt). Please see the WGS VCF shard interval extent file format (Appendix B) for details on the format of the file. |
|
|
…/vat/vat_merged.tsv.bgz …/vat/chr*/*.tsv.gz |
Variant-level metadata / functional annotations about the variants contained in the all samples VCF. Please see Variant Annotation Table for more details, including the provided fields. |
|
|
…/ancestry/ancestry_preds.tsv …/ancestry/preds_oth.html …/ancestry/merged_sites_only_intersection.vcf.bgz |
Computed ancestry predictions for all samples in the WGS joint callset (tsv). We also provide a plot of the ancestry predictions and the sites-only VCF (vcf.bgz) of the locations we used for training the ancestry classifier. Please see Genetic predicted ancestry for more information on these files. For more information on how we predict ancestry, please see the All Of Us Beta Release Genomic Quality Report (Appendix A). |
|
|
…/qc/flagged_samples.tsv …/qc/metrics.html …/qc/pc1vspc2.html …/qc/wgs_without_array_rids.tsv …/qc/wgs_not_in_cdr.tsv …/qc/array_not_in_cdr.tsv …/qc/genomic_metrics.tsv …/qc/wgs_siteID.tsv |
The list of WGS samples flagged in the joint callset QC process (tsv). We also include plots of the metric residuals and the first two principal components (*.html). Please see Flagged WGS Samples for details on the tsv format, and the All Of Us Beta Release Genomic Quality Report (Joint Callset QC) for details on how we flag samples. In this release, 2,994 WGS samples do not have corresponding arrays; their WGS IDs are listed in wgs_without_array_rids.tsv. For more information, please see the All Of Us Beta Release Genomic Quality Report (Known Issue #1). |
|
|
…/relatedness/relatedness.tsv …/relatedness/relatedness_flagged_samples.tsv |
The relatedness of the WGS samples, expressed as kinship scores. One file lists all pairs of samples with a kinship score greater than 0.1 (relatedness.tsv). A second file lists the samples that would need to be removed to eliminate related samples from the full cohort (relatedness_flagged_samples.tsv). Please see Relatedness for more information. |
|
WGS: joint callset Hail Matrix Table |
gs://fc-aou-datasets-controlled/v6/wgs/hail.mt |
WGS_HAIL_STORAGE_PATH |
Hail MatrixTable for the WGS joint callset. Please note that all shards have been merged into this file. When using this file in Hail, read directly from the bucket location. Do not attempt to copy it locally. |
WGS: CRAM files |
gs://fc-aou-datasets-controlled/v6/wgs/cram/manifest.csv |
WGS_CRAM_MANIFEST_PATH |
Path to the manifest CSV file, which contains one row per sample with the columns: person_id,cram_uri,cram_index_uri. We provide CRAM files and CRAM index files with the research ID in the name of the file. There is one CRAM file for each WGS sample. See CRAM files for more information. |
Microarray: single sample VCFs |
gs://fc-aou-datasets-controlled/v6/microarray/vcf/single_sample/manifest.csv |
MICROARRAY_VCF_MANIFEST_PATH |
Path to the manifest CSV file, which contains one row per sample with the columns: person_id,vcf_uri,vcf_index_uri. There is one VCF per participant sample. Please see Array VCFs for more information. |
Microarray: all samples Hail Matrix Table |
gs://fc-aou-datasets-controlled/v6/microarray/hail.mt |
MICROARRAY_HAIL_STORAGE_PATH |
Hail MatrixTable of the microarray samples in this release. All of the samples have been merged into a single MatrixTable. Please see Array MatrixTable for more information. |
Microarray: all samples Plink files |
gs://fc-aou-datasets-controlled/v6/microarray/plink/arrays.* |
PLINK binary merged representation of the microarray samples in this release (.bed). The accompanying .fam and .bim files are included for use with the plink tool. Please see Array PLINK 1.9 data for more information. |
|
Microarray: IDAT files |
gs://fc-aou-datasets-controlled/v6/microarray/idat/manifest.csv |
MICROARRAY_IDAT_MANIFEST_PATH |
Path to the manifest CSV file, which contains one row per sample with the columns: person_id,green_idat_uri,red_idat_uri. There are two IDAT files per array sample, with the research ID in the name of each file. Please see IDAT files for more information. |
FASTA hg38/GRCh38 public reference file |
gs://genomics-public-data/references/hg38/v0/Homo_sapiens_assembly38.fasta |
|
All variants are called against the hg38/GRCh38 reference. This is the location of the FASTA-formatted file. |
FAI hg38/GRCh38 public reference file |
gs://genomics-public-data/references/hg38/v0/Homo_sapiens_assembly38.fasta.fai |
|
All variants are called against the hg38/GRCh38 reference. This is the location of the FAI-formatted file. |
DICT hg38/GRCh38 public reference file |
gs://genomics-public-data/references/hg38/v0/Homo_sapiens_assembly38.dict |
|
All variants are called against the hg38/GRCh38 reference. This is the location of the DICT-formatted file. |
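The environment variables in the table above can be resolved in one place. The sketch below maps each documented variable name to its documented location and prefers the value set in the environment, falling back to the documented path otherwise. The function name `resolve_asset_paths` is hypothetical, and the fallback behavior is an assumption made so the example also runs outside the Workbench; within a workspace, the environment values should be treated as authoritative.

```python
import os

# Env var name -> location, both copied from the table above.
DOCUMENTED_PATHS = {
    "CDR_STORAGE_PATH": "gs://fc-aou-datasets-controlled/v6",
    "WGS_VCF_MERGED_STORAGE_PATH": "gs://fc-aou-datasets-controlled/v6/wgs/vcf/merged/*",
    "WGS_HAIL_STORAGE_PATH": "gs://fc-aou-datasets-controlled/v6/wgs/hail.mt",
    "WGS_CRAM_MANIFEST_PATH": "gs://fc-aou-datasets-controlled/v6/wgs/cram/manifest.csv",
    "MICROARRAY_VCF_MANIFEST_PATH": "gs://fc-aou-datasets-controlled/v6/microarray/vcf/single_sample/manifest.csv",
    "MICROARRAY_HAIL_STORAGE_PATH": "gs://fc-aou-datasets-controlled/v6/microarray/hail.mt",
    "MICROARRAY_IDAT_MANIFEST_PATH": "gs://fc-aou-datasets-controlled/v6/microarray/idat/manifest.csv",
}


def resolve_asset_paths(environ=os.environ):
    """Return {env var name: path}, preferring values set in environ."""
    return {
        name: environ.get(name, default)
        for name, default in DOCUMENTED_PATHS.items()
    }


paths = resolve_asset_paths()
for name, path in sorted(paths.items()):
    print(f"{name} = {path}")
```

The resolved WGS_HAIL_STORAGE_PATH could then be passed to Hail's `hl.read_matrix_table`, reading directly from the bucket location as the table recommends rather than copying the data locally.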