Controlled CDR Directory (ARCHIVED C2021Q3R6 CDR CT Dataset v5)

Described below is the registry of assets currently available within the Researcher Workbench Controlled Tier dataset, organized by where and how each asset is stored.

Included within the Controlled Tier is a “mainline” curated data repository (CDR) that contains the data types (e.g., Survey, Electronic Health Record (EHR), Wearable Device (Fitbit), Physical Measurement) also represented within the Registered Tier, but with different privacy protections applied. Like the Registered Tier, the Controlled Tier includes a default CDR, which is queryable through the Researcher Workbench dataset tools or directly through BigQuery, and a base CDR, which is queryable only through BigQuery. The location of each mainline CDR within BigQuery is given below.

For more information about mainline CDR creation and the differences between the default and base instances, please see here. For information about the specific tables, fields, and privacy protections applied to the Registered and Controlled Tier CDRs, please see the CDR Data Dictionary.

BigQuery

Each entry below lists the asset, its location, its environment variable (where one is defined), and a description.

Asset: Default Mainline CDR
Location: fc-aou-cdr-prod.C2021Q3R6
Env var: WORKSPACE_CDR
Description: Main BigQuery CDR representing the same data types (e.g., Survey, Electronic Health Record (EHR), Wearable Device (Fitbit), Physical Measurement) included in the Registered Tier, but with fewer privacy protections applied.

Asset: Base Mainline CDR
Location: fc-aou-cdr-prod.C2021Q3R6_base
Env var: (none)
Description: Same CDR representation as the default mainline CDR, but with fewer processing/convenience transformations applied.
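As a rough illustration of how the two BigQuery locations above might be used from a notebook, the sketch below builds fully qualified table names from the WORKSPACE_CDR environment variable. The helper names (cdr_table, person_count_query, run_query) are hypothetical, the `person` table is assumed to exist because the CDR follows the OMOP model (please confirm in the CDR Data Dictionary), and appending "_base" simply mirrors the two locations listed for this release rather than a documented rule.

```python
import os

# Fallback copied from the "Default Mainline CDR" entry; inside the
# Researcher Workbench the WORKSPACE_CDR variable is normally already set.
DEFAULT_CDR = "fc-aou-cdr-prod.C2021Q3R6"


def cdr_table(table, base=False):
    """Return a backtick-quoted BigQuery table name in the default or base CDR."""
    dataset = os.environ.get("WORKSPACE_CDR", DEFAULT_CDR)
    if base:
        # Assumption: for this release the base CDR location is the default
        # location with a "_base" suffix, as listed above.
        dataset += "_base"
    return f"`{dataset}.{table}`"


def person_count_query(base=False):
    """Build a small sanity-check query (assumes an OMOP `person` table)."""
    return f"SELECT COUNT(*) AS n_participants FROM {cdr_table('person', base)}"


def run_query(sql):
    """Run the query with the google-cloud-bigquery client and return a DataFrame."""
    from google.cloud import bigquery  # imported lazily; needs credentials

    return bigquery.Client().query(sql).to_dataframe()
```

In a notebook, `run_query(person_count_query())` would then query the default CDR, and `run_query(person_count_query(base=True))` the base CDR.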

 

Cloud Storage

Also included in the Controlled Tier is a genomics CDR that is stored separately from the BigQuery assets referenced above. Note that the genomics CDR is available only through the Controlled Tier (i.e., genomic data are not available within the Registered Tier).

For more information on the genomic data assets (including the auxiliary files) and the associated file formats referenced below, please see the Genomic Data section of How the All of Us genomic data are organized. Before using the genomic data, we highly recommend reading the Known Issues section of the All Of Us Research Program Genomic Research Data Quality Report. For help displaying HTML files in the Researcher Workbench, please see our featured notebook.

 

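The Cloud Storage assets are listed with the same fields: asset, location, environment variable (where one is defined), and description.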

Asset: All CDR assets path
Location: gs://fc-aou-datasets-controlled/v5
Env var: CDR_STORAGE_PATH
Description: All Cloud Storage assets for this CDR version are under this path.

Asset: WGS: joint callset VCFs
Location: gs://fc-aou-datasets-controlled/v5/wgs/vcf/merged/*
Env var: WGS_VCF_MERGED_STORAGE_PATH
Description: Joint callset VCF of all samples in the CDR with genomic data available (98,622). The callset is sharded by genomic interval into multiple files, which can be used with a wide range of command-line tools and libraries. Please see WGS VCFs for more information.

Asset: WGS auxiliary files
Location: gs://fc-aou-datasets-controlled/v5/wgs/vcf/aux/
Env var: (none)
Description: Auxiliary files related to the “all samples” WGS data.

  • VCF shard intervals
    Locations:
      …/vcf_shard_intervals/interval_lists/*-scattered.interval_list
      …/vcf_shard_intervals/shard_begin_and_end_bucket.txt
    Description: Interval list files corresponding to the sharded VCFs (.interval_list). Please see Interval list files for more details on the format. We also provide the extents of each interval_list file (.txt). Please see the WGS VCF shard interval extent file format (Appendix B) for details on the format of that file.

  • Variant Annotation Table
    Locations:
      …/vat_v2/merged.tsv.bgz
      …/vat_v2/chr*/
    Description: Variant-level metadata and functional annotations for the variants contained in the all samples VCF. Please see Variant Annotation Table for more details, including the provided fields.

  • Ancestry information
    Locations:
      …/ancestry/ancestry_preds.tsv
      …/ancestry/preds_oth.html
      …/ancestry/merged_sites_only_intersection.vcf.bgz
    Description: Computed ancestry predictions for all samples in the WGS joint callset (.tsv). We also provide a plot of the ancestry predictions (.html) and the sites-only VCF (.vcf.bgz) of the locations used to train the ancestry classifier. Please see Genetic predicted ancestry for more information on these files. For more information on how we predict ancestry, please see the All Of Us Beta Release Genomic Quality Report (Appendix A).

  • QC information
    Locations:
      …/qc/flagged_samples.tsv
      …/qc/metrics.html
      …/qc/pc1vspc2.html
      …/qc/wgs_without_array_rids.tsv
    Description: The list of WGS samples flagged in the joint callset QC process (.tsv). We also include plots of the metric residuals and the first two principal components (*.html). Please see Flagged WGS Samples for details on the tsv format. Please see the All Of Us Beta Release Genomic Quality Report (Joint Callset QC) for details on how we flag samples. 2,997 WGS samples do not have corresponding arrays in this release. The list of WGS IDs can be found in beta_v4_99k_wgs_without_array_rids.tsv. For more information, please see the All Of Us Beta Release Genomic Quality Report (Known Issue #1).

  • Relatedness
    Locations:
      …/relatedness/relatedness.tsv
      …/relatedness/relatedness_flagged_samples.tsv
    Description: The relatedness of the WGS samples, expressed as kinship scores. One file lists all pairs of samples with a kinship score greater than 0.1 (relatedness.tsv). A second file lists the samples that would need to be removed to eliminate related samples from the full cohort (relatedness_flagged_samples.tsv). Please see Relatedness for more information.

Asset: WGS: joint callset Hail Matrix Table
Location: gs://fc-aou-datasets-controlled/v5/wgs/hail.mt
Env var: WGS_HAIL_STORAGE_PATH
Description: Hail MatrixTable for the WGS joint callset. Please note that all shards have been merged into this single MatrixTable. When using it in Hail, read directly from the bucket location. Do not attempt to copy it locally.

Asset: Microarray: single sample VCFs
Location: gs://fc-aou-datasets-controlled/v5/microarray/vcf/single_sample/manifest.csv
Env var: MICROARRAY_VCF_SINGLE_SAMPLE_STORAGE_PATH
Description: Path to a manifest CSV file that contains one row per sample with the fields person_id,vcf_uri,vcf_index_uri. There is one VCF per participant sample. Please see Array VCFs for more information.

Asset: Microarray: all samples Hail Matrix Table
Location: gs://fc-aou-datasets-controlled/v5/microarray/hail.mt
Env var: MICROARRAY_HAIL_STORAGE_PATH
Description: Hail MatrixTable of the microarray samples in this release. All of the samples have been merged into a single MatrixTable. Please see Array MatrixTable for more information.

Asset: Microarray: all samples Plink files
Location: gs://fc-aou-datasets-controlled/v5/microarray/plink/arrays.*
Env var: (none)
Description: PLINK binary merged representation of the microarray samples in this release (.bed). The accompanying .fam and .bim files are also included for use with the plink tool. Please see Array PLINK 1.9 data for more information.
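The Cloud Storage entries above can be tied together with a few small helpers. The following is only a sketch under stated assumptions: the function names are hypothetical, the fallback root path is copied from the "All CDR assets path" entry, the manifest is assumed to start with a header row naming the three fields person_id, vcf_uri and vcf_index_uri, and the sample manifest row is made-up data rather than a real participant or bucket.

```python
import csv
import io
import os

# Fallback copied from the "All CDR assets path" entry; in the Researcher
# Workbench the CDR_STORAGE_PATH variable is normally already set.
DEFAULT_CDR_STORAGE_PATH = "gs://fc-aou-datasets-controlled/v5"


def asset_path(*parts):
    """Join path components under CDR_STORAGE_PATH (for assets with no env var)."""
    root = os.environ.get("CDR_STORAGE_PATH", DEFAULT_CDR_STORAGE_PATH).rstrip("/")
    return "/".join([root, *(p.strip("/") for p in parts)])


def plink_files():
    """Return the three microarray PLINK file paths (.bed, .bim, .fam)."""
    return [asset_path("microarray", "plink", f"arrays.{ext}")
            for ext in ("bed", "bim", "fam")]


def index_manifest(manifest_text):
    """Map person_id -> (vcf_uri, vcf_index_uri) from the manifest CSV text.

    Assumption: the first line is a header row with exactly these field names.
    """
    reader = csv.DictReader(io.StringIO(manifest_text))
    return {row["person_id"]: (row["vcf_uri"], row["vcf_index_uri"])
            for row in reader}


def read_wgs_matrix_table():
    """Open the WGS Hail MatrixTable in place, without copying it locally."""
    import hail as hl  # imported lazily; only available where Hail is installed

    return hl.read_matrix_table(os.environ["WGS_HAIL_STORAGE_PATH"])


# Made-up manifest content, for illustration only.
EXAMPLE_MANIFEST = (
    "person_id,vcf_uri,vcf_index_uri\n"
    "1000001,gs://example-bucket/1000001.vcf.gz,gs://example-bucket/1000001.vcf.gz.tbi\n"
)

if __name__ == "__main__":
    print(plink_files())
    print(index_manifest(EXAMPLE_MANIFEST)["1000001"])
```

Building paths from the environment variables rather than hard-coding them keeps a notebook portable across CDR versions, since only the variable values change.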
