Controlled CDR Directory (ARCHIVED C2022Q2R2 CDR CT Dataset v6)

Described below is the registry of assets currently made available within the Researcher Workbench Controlled Tier dataset, organized according to where and how they are individually stored. 

Summary of changes since Beta

  • 32 WGS samples removed
  • 81 microarray samples removed
  • CRAM files added
  • IDAT files added

See All of Us Genomic Quality Report for more information.

Added: manifest files

Some genomic data file types are now shared across Curated Data Repository (CDR) releases. For these file types, you will find a manifest CSV file in place of a simple directory of files. In this release, the pattern applies to CRAM files, microarray VCF files, and IDAT files. To comply with the Workbench data use policy, use these manifests to determine the correct set of storage asset paths for your workspace and CDR version. Data files that are not shared across releases (e.g., merged WGS VCFs and Hail Matrix Tables) continue to reside in a simple versioned directory. The Cloud Storage section below documents the manifest CSV columns for each file type.

Note that microarray VCF files now use the manifest pattern, changing from the simple directory structure used in the prior release.
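
As a minimal sketch of the manifest pattern described above, the Python below indexes manifest rows by person_id using only the standard library. The column names are those listed for the CRAM manifest in the Cloud Storage section. The sample rows and the example-bucket URIs are hypothetical, and the sketch does not cover how you open the manifest at its gs:// path.

```python
import csv
import io
import os


def index_manifest(lines, key="person_id"):
    """Return {key value: row dict} from an iterable of CSV lines with a header row."""
    reader = csv.DictReader(lines)
    return {row[key]: row for row in reader}


# Hypothetical rows for illustration only. A real manifest lives at the path
# held in the WGS_CRAM_MANIFEST_PATH environment variable.
sample = io.StringIO(
    "person_id,cram_uri,cram_index_uri\n"
    "1000001,gs://example-bucket/a.cram,gs://example-bucket/a.cram.crai\n"
    "1000002,gs://example-bucket/b.cram,gs://example-bucket/b.cram.crai\n"
)
manifest = index_manifest(sample)

# Where the real manifest would be found inside a workspace (None if unset).
manifest_path = os.environ.get("WGS_CRAM_MANIFEST_PATH")

# Look up the storage paths for one participant.
row = manifest["1000002"]
print(row["cram_uri"], row["cram_index_uri"])
```

The same helper should work for the microarray VCF and IDAT manifests, since each is described as one row per sample keyed by person_id. Only the URI column names differ.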

BigQuery

Included within the Controlled Tier is a “mainline” curated data repository (CDR) that contains data types (e.g., Survey, Electronic Health Record (EHR), Wearable Device (Fitbit), Physical Measurement) also represented within the Registered Tier, but with different privacy protections applied. Like the Registered Tier, the Controlled Tier includes a default CDR, which is queryable through the Researcher Workbench dataset tools or directly through BigQuery, and a base CDR, which is queryable only through BigQuery. Below is information about the mainline CDR location within BigQuery.

For more information about mainline CDR creation and the differences between the default and base instances, please see here. For information about the specific tables, fields, and privacy protections applied to the Registered and Controlled Tier CDRs, please see the CDR Data Dictionary.

 

Asset

Location

Env var

Description

Default Mainline CDR 

fc-aou-cdr-prod.C2022Q2R6

WORKSPACE_CDR

Main BigQuery CDR containing the same data types (e.g., Survey, Electronic Health Record (EHR), Wearable Device (Fitbit), Physical Measurement) included in the Registered Tier, but with fewer privacy protections applied.

Base Mainline CDR 

fc-aou-cdr-prod.C2022Q2R6_base

 

Same CDR representation as noted above, but with fewer processing/convenience transformations applied. 
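
The sketch below shows one way to build fully qualified BigQuery table references from the locations above. It assumes the WORKSPACE_CDR environment variable holds the default dataset in project.dataset form, as in the table. The `person` table name is an assumption that this page does not document, and the helper names are hypothetical.

```python
import os


def qualified_table(table, dataset=None):
    """Return a backtick-quoted `project.dataset.table` reference.

    Backticks are needed because the project ID contains hyphens.
    """
    dataset = dataset or os.environ["WORKSPACE_CDR"]
    return f"`{dataset}.{table}`"


def base_dataset(default_dataset):
    """Return the base CDR dataset name, which carries a `_base` suffix."""
    return default_dataset + "_base"


# Fall back to the location documented above when the variable is not set.
default_cdr = os.environ.get("WORKSPACE_CDR", "fc-aou-cdr-prod.C2022Q2R6")

# `person` is an assumed table name used only for illustration.
query = f"SELECT COUNT(*) AS n FROM {qualified_table('person', default_cdr)}"
base_query = f"SELECT COUNT(*) AS n FROM {qualified_table('person', base_dataset(default_cdr))}"
print(query)
print(base_query)
```

The resulting strings can be passed to whichever BigQuery client you use in your notebook.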

Cloud Storage

Also included in the Controlled Tier is a genomics CDR that is stored separately from the BigQuery assets referenced above. Note that the genomics CDR is available only through the Controlled Tier (i.e., genomic data are not available within the Registered Tier).

For more information on the genomic data assets (including the auxiliary files) and the file formats referenced below, please see the Genomic Data section of How the All of Us genomic data are organized. Before using the genomic data, we highly recommend reading the Known Issues section of the All Of Us Research Program Genomic Research Data Quality Report. For help displaying HTML files in the Researcher Workbench, please see our featured notebook.

Asset

Location

Env var

Description

All CDR assets path

gs://fc-aou-datasets-controlled/v6

CDR_STORAGE_PATH

All Cloud Storage assets for this CDR version are under this path

WGS: joint callset VCFs

gs://fc-aou-datasets-controlled/v6/wgs/vcf/merged/*

WGS_VCF_MERGED_STORAGE_PATH

Joint callset VCF of all samples in the CDR with genomic data available (98,590 samples). This callset is sharded by genomic interval into multiple files, which can be used with a wide range of command-line tools and libraries.


Please see WGS VCFs for more information.

WGS auxiliary files

gs://fc-aou-datasets-controlled/v6/wgs/vcf/aux/

 

Auxiliary files related to the “all samples” WGS data.

  • VCF shard intervals

…/vcf_shard_intervals/interval_lists/*.vcf.gz.interval_list

…/vcf_shard_intervals/shard_begin_and_end_bucket.txt

 

Interval list files corresponding to the sharded VCFs (.interval_list).  Please see  Interval list files for more details on the format.


We also provide the extents of each interval_list file (txt).  Please see the WGS VCF shard interval extent file format (Appendix B) for details on the format of the file. 

  • Variant Annotation Table

…/vat/vat_merged.tsv.bgz

…/vat/chr*/*.tsv.gz

 

Variant-level metadata / functional annotations about the variants contained in the all samples VCF.


Please see Variant Annotation Table for more details, including the provided fields.

  • Ancestry information

…/ancestry/ancestry_preds.tsv

…/ancestry/preds_oth.html

…/ancestry/merged_sites_only_intersection.vcf.bgz

 

Computed ancestry predictions for all samples in the WGS joint callset (tsv).  We also provide a plot of the ancestry predictions and the sites-only VCF (vcf.bgz) of the locations we used for training the ancestry classifier.


Please see Genetic predicted ancestry for more information on these files.


For more information on how we predict ancestry, please see the All Of Us Beta Release Genomic Quality Report (Appendix A).

  • QC information

…/qc/flagged_samples.tsv

…/qc/metrics.html

…/qc/pc1vspc2.html

…/qc/wgs_without_array_rids.tsv

…/qc/wgs_not_in_cdr.tsv

…/qc/array_not_in_cdr.tsv

…/qc/genomic_metrics.tsv

…/qc/wgs_siteID.tsv

 

The list of WGS samples flagged in the joint callset QC process (tsv).  We also include plots of the metric residuals and the first two principal components (*.html).


Please see Flagged WGS Samples for details on the tsv format.


Please see the All Of Us Beta Release Genomic Quality Report (Joint Callset QC) for details on how we flag samples.


2,994 WGS samples do not have corresponding arrays in this release. The list of WGS IDs can be found in wgs_without_array_rids.tsv. For more information, please see All Of Us Beta Release Genomic Quality Report (Known Issue #1).

  • Relatedness

…/relatedness/relatedness.tsv

…/relatedness/relatedness_flagged_samples.tsv

 

The relatedness of the WGS samples, expressed as kinship scores. One file lists all pairs of samples with a kinship score greater than 0.1 (relatedness.tsv). A second file lists the samples that would need to be removed to eliminate related samples from the full cohort (relatedness_flagged_samples.tsv).


Please see Relatedness for more information.

WGS: joint callset Hail Matrix Table

gs://fc-aou-datasets-controlled/v6/wgs/hail.mt

WGS_HAIL_STORAGE_PATH

Hail MatrixTable for the WGS joint callset. Please note that all shards have been merged into this file.


When using this file in Hail, read directly from the bucket location.  Do not attempt to copy it locally.

WGS: CRAM files

gs://fc-aou-datasets-controlled/v6/wgs/cram/manifest.csv

WGS_CRAM_MANIFEST_PATH

Path to manifest CSV file that contains a row per sample of: person_id,cram_uri,cram_index_uri


We provide one CRAM file and one CRAM index file for each WGS sample, with the research ID in the file name. See CRAM files for more information.

Microarray: single sample VCFs

gs://fc-aou-datasets-controlled/v6/microarray/vcf/single_sample/manifest.csv

MICROARRAY_VCF_MANIFEST_PATH

Path to manifest CSV file that contains a row per sample of: person_id,vcf_uri,vcf_index_uri


One VCF per participant sample.


Please see Array VCFs for more information.

Microarray: all samples Hail Matrix Table

gs://fc-aou-datasets-controlled/v6/microarray/hail.mt

MICROARRAY_HAIL_STORAGE_PATH

Hail matrix table of the microarray samples in this release. All of the samples have been merged into a single matrix table.


Please see Array MatrixTable for more information. 

Microarray: all samples Plink files

gs://fc-aou-datasets-controlled/v6/microarray/plink/arrays.*

 

Merged PLINK binary representation of the microarray samples in this release (.bed), together with the accompanying .fam and .bim files for use with the plink tool.


Please see Array PLINK 1.9 data for more information.

Microarray: IDAT files

gs://fc-aou-datasets-controlled/v6/microarray/idat/manifest.csv

MICROARRAY_IDAT_MANIFEST_PATH

Path to manifest CSV file that contains a row per sample of: person_id,green_idat_uri,red_idat_uri


Two IDAT files per array sample, with the research ID in the file name. Please see IDAT files for more information.

FASTA hg38/GRCh38 public reference file

gs://genomics-public-data/references/hg38/v0/Homo_sapiens_assembly38.fasta

 

All variants are called against the hg38/GRCh38 reference. This is the location of the FASTA-formatted file.

FAI hg38/GRCh38 public reference file

gs://genomics-public-data/references/hg38/v0/Homo_sapiens_assembly38.fasta.fai

 

All variants are called against the hg38/GRCh38 reference. This is the location of the FAI-formatted file.

DICT hg38/GRCh38 public reference file

gs://genomics-public-data/references/hg38/v0/Homo_sapiens_assembly38.dict

 

All variants are called against the hg38/GRCh38 reference. This is the location of the DICT-formatted file.
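
The environment variables listed in the table above can be resolved in a notebook roughly as follows. This is a sketch under stated assumptions, not an official helper: the function names are hypothetical, and the Hail call is imported lazily so the rest of the block runs without Hail installed. In keeping with the note on the Hail MatrixTable, the table is read directly from its bucket location rather than copied locally.

```python
import os


def asset_path(env_var, environ=None):
    """Return the storage path held in env_var, or raise if it is not set."""
    environ = os.environ if environ is None else environ
    value = environ.get(env_var)
    if not value:
        raise KeyError(f"{env_var} is not set in this environment")
    return value


def under_cdr_root(path, environ=None):
    """Check that a path sits under the CDR_STORAGE_PATH for this CDR version."""
    root = asset_path("CDR_STORAGE_PATH", environ).rstrip("/")
    return path == root or path.startswith(root + "/")


def read_wgs_matrix_table(environ=None):
    """Open the WGS joint callset MatrixTable in place (requires Hail)."""
    import hail as hl  # imported lazily; only needed when this is called

    return hl.read_matrix_table(asset_path("WGS_HAIL_STORAGE_PATH", environ))


# Illustration using the locations documented above instead of the live environment.
example_env = {
    "CDR_STORAGE_PATH": "gs://fc-aou-datasets-controlled/v6",
    "WGS_HAIL_STORAGE_PATH": "gs://fc-aou-datasets-controlled/v6/wgs/hail.mt",
}
print(asset_path("WGS_HAIL_STORAGE_PATH", example_env))
print(under_cdr_root(example_env["WGS_HAIL_STORAGE_PATH"], example_env))
```

Raising on a missing variable, rather than falling back to a hard-coded path, avoids silently reading assets from a CDR version other than the one attached to your workspace.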

 
