Controlled CDR Directory (ARCHIVED C2021Q3R6 CDR CT Dataset v5)

This directory lists the assets available in the Researcher Workbench Controlled Tier dataset, organized by where and how each is stored.

The Controlled Tier includes a “mainline” curated data repository (CDR) that contains the same data types as the Registered Tier (e.g., Survey, Electronic Health Record (EHR), Wearable Device (Fitbit), Physical Measurement), but with different privacy protections applied. Like the Registered Tier, the Controlled Tier includes a default CDR, which is queryable through the Researcher Workbench dataset tools or directly through BigQuery, and a base CDR, which is queryable only through BigQuery. The mainline CDR locations within BigQuery are listed below.

For more information about mainline CDR creation and the differences between the default and base instances, please see here. For information about the specific tables, fields, and privacy protections applied to the Registered and Controlled Tier CDRs, please see the CDR Data Dictionary.

BigQuery

Asset	Location	Env var	Description
Default Mainline CDR	fc-aou-cdr-prod.C2021Q3R6	WORKSPACE_CDR	Main BigQuery CDR representing the same data types (e.g., Survey, Electronic Health Record (EHR), Wearable Device (Fitbit), Physical Measurement) included in the Registered Tier, but with different privacy protections applied.
Base Mainline CDR	fc-aou-cdr-prod.C2021Q3R6_base		Same CDR representation as noted above, but with fewer processing/convenience transformations applied.
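
As a minimal sketch of how the locations above might be used from a notebook, the following Python builds a fully qualified BigQuery table reference from the WORKSPACE_CDR environment variable. It assumes that the base CDR can be reached by appending "_base" to the default dataset ID, as the two rows above suggest, and it falls back to the documented location when the variable is not set. The helper names and the "person" table are illustrative assumptions, not part of this article; please check the CDR Data Dictionary for the actual tables.

```python
import os

# Documented default location for this archived CDR version (from the table above).
DEFAULT_CDR = "fc-aou-cdr-prod.C2021Q3R6"


def cdr_dataset(base=False, env=None):
    """Return the BigQuery dataset ID for the default or base mainline CDR."""
    env = os.environ if env is None else env
    dataset = env.get("WORKSPACE_CDR", DEFAULT_CDR)
    # Assumption: the base CDR is the default dataset ID plus a "_base" suffix,
    # mirroring the two locations listed in the table.
    return dataset + "_base" if base else dataset


def count_rows_sql(table, base=False, env=None):
    """Build a row-count query against one table in the chosen CDR."""
    # Reject anything that is not a plain table identifier.
    if not table.replace("_", "").isalnum():
        raise ValueError(f"unexpected table name: {table!r}")
    return f"SELECT COUNT(*) AS n FROM `{cdr_dataset(base, env)}.{table}`"


if __name__ == "__main__":
    # "person" is an assumed table name used for illustration only.
    print(count_rows_sql("person"))
    print(count_rows_sql("person", base=True))
```

The resulting string can be passed to whichever BigQuery client is available in your environment. Because the base CDR is queryable only through BigQuery, a helper of this kind is one way to switch between the two instances without hard-coding dataset IDs.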

Cloud Storage

The Controlled Tier also includes a genomics CDR, which is stored separately from the BigQuery assets referenced above. Note that the genomics CDR is available only through the Controlled Tier (i.e., genomics data are not available within the Registered Tier).

For more information on the specifics of the genomic data assets (including the auxiliary files) and the associated file formats referenced below, please see the Genomic Data section of How the All of Us genomic data are organized. Before using the genomic data, we highly recommend reading the Known Issues section of the All Of Us Research Program Genomic Research Data Quality Report. For help displaying html files in the Researcher Workbench, please see our featured notebook.

Asset	Location	Env var	Description
All CDR assets path	gs://fc-aou-datasets-controlled/v5	CDR_STORAGE_PATH	All Cloud Storage assets for this CDR version are under this path
WGS: joint callset VCFs	gs://fc-aou-datasets-controlled/v5/wgs/vcf/merged/*	WGS_VCF_MERGED_STORAGE_PATH	Joint callset VCF of all samples in the CDR with genomic data available (98,622). This callset is sharded, by genomic interval, into multiple files. The files can be used with a wide range of command-line tools and libraries. Please see WGS VCFs for more information.
WGS auxiliary files	gs://fc-aou-datasets-controlled/v5/wgs/vcf/aux/		Auxiliary files related to the “all samples” WGS data.
VCF shard intervals	…/vcf_shard_intervals/interval_lists/*-scattered.interval_list …/vcf_shard_intervals/shard_begin_and_end_bucket.txt		Interval list files corresponding to the sharded VCFs (.interval_list). Please see Interval list files for more details on the format. We also provide the extents of each interval_list file (txt). Please see the WGS VCF shard interval extent file format (Appendix B) for details on the format of the file.
Variant Annotation Table	…/vat_v2/merged.tsv.bgz …/vat_v2/chr*/		Variant-level metadata / functional annotations about the variants contained in the all samples VCF. Please see Variant Annotation Table for more details, including the provided fields.
Ancestry information	…/ancestry/ancestry_preds.tsv …/ancestry/preds_oth.html …/ancestry/merged_sites_only_intersection.vcf.bgz		Computed ancestry predictions for all samples in the WGS joint callset (tsv). We also provide a plot of the ancestry predictions and the sites-only VCF (vcf.bgz) of the locations we used for training the ancestry classifier. Please see Genetic predicted ancestry for more information on these files. For more information on how we predict ancestry, please see the All Of Us Beta Release Genomic Quality Report (Appendix A)
QC information	…/qc/flagged_samples.tsv …/qc/metrics.html …/qc/pc1vspc2.html …/qc/wgs_without_array_rids.tsv		The list of WGS samples flagged in the joint callset QC process (tsv). We also include plots of the metric residuals and the first two principal components (.html). Please see Flagged WGS Samples for details on the tsv format. Please see the All Of Us Beta Release Genomic Quality Report (Joint Callset QC) for details on how we flag samples. 2,997 WGS samples do not have corresponding arrays in this release. The list of WGS IDs can be found in beta_v4_99k_wgs_without_array_rids.tsv. For more information, please see the All Of Us Beta Release Genomic Quality Report (Known Issue #1).
Relatedness	…/relatedness/relatedness.tsv …/relatedness/relatedness_flagged_samples.tsv		The relatedness of the WGS samples, expressed as kinship scores. We provide one file that lists all pairs of samples with a kinship score greater than 0.1 (relatedness.tsv). We also provide a list of the samples that would need to be removed to eliminate related samples from the full cohort (relatedness_flagged_samples.tsv). Please see Relatedness for more information.
WGS: joint callset Hail Matrix Table	gs://fc-aou-datasets-controlled/v5/wgs/hail.mt	WGS_HAIL_STORAGE_PATH	Hail MatrixTable for the WGS joint callset. Please note that all shards have been merged into this file. When using this file in Hail, read directly from the bucket location. Do not attempt to copy it locally.
Microarray: single sample VCFs	gs://fc-aou-datasets-controlled/v5/microarray/vcf/single_sample/manifest.csv	MICROARRAY_VCF_SINGLE_SAMPLE_STORAGE_PATH	Path to the manifest CSV file, which contains one row per sample with the fields person_id, vcf_uri, and vcf_index_uri. There is one VCF per participant sample. Please see Array VCFs for more information.
Microarray: all samples Hail Matrix Table	gs://fc-aou-datasets-controlled/v5/microarray/hail.mt	MICROARRAY_HAIL_STORAGE_PATH	Hail MatrixTable of the microarray samples in this release. All of the samples have been merged into a single MatrixTable. Please see Array MatrixTable for more information.
Microarray: all samples Plink files	gs://fc-aou-datasets-controlled/v5/microarray/plink/arrays.*		PLINK binary merged representation of the microarray samples in this release (.bed). Includes .fam, .bim files for usage with the plink tool as well. Please see Array PLINK 1.9 data for more information
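
To illustrate how the single sample VCF manifest described above might be consumed, here is a minimal Python sketch that indexes manifest rows by person_id. The column names come from the table; everything else is an assumption made for illustration, including the helper names, the handling of an optional header row, and the example URIs. The sketch reads from any iterable of text lines, so it assumes the manifest has already been made available locally or opened as a text stream.

```python
import csv
import io
import os

# Documented manifest location for this CDR version (from the table above).
DEFAULT_MANIFEST = (
    "gs://fc-aou-datasets-controlled/v5/microarray/vcf/single_sample/manifest.csv"
)
# Column order as described in the table.
MANIFEST_COLUMNS = ("person_id", "vcf_uri", "vcf_index_uri")


def manifest_path(env=None):
    """Return the manifest location, preferring the environment variable."""
    env = os.environ if env is None else env
    return env.get("MICROARRAY_VCF_SINGLE_SAMPLE_STORAGE_PATH", DEFAULT_MANIFEST)


def index_manifest(lines):
    """Map person_id -> (vcf_uri, vcf_index_uri) from manifest rows."""
    reader = csv.DictReader(lines, fieldnames=MANIFEST_COLUMNS)
    index = {}
    for row in reader:
        # Assumption: the file may or may not start with a header row.
        if row["person_id"] == "person_id":
            continue
        index[row["person_id"]] = (row["vcf_uri"], row["vcf_index_uri"])
    return index


if __name__ == "__main__":
    # Hypothetical rows; real URIs come from the manifest itself.
    sample = io.StringIO(
        "person_id,vcf_uri,vcf_index_uri\n"
        "1000001,gs://example-bucket/1000001.vcf.gz,gs://example-bucket/1000001.vcf.gz.tbi\n"
    )
    print(manifest_path())
    print(index_manifest(sample))
```

Keeping the lookup keyed by person_id makes it straightforward to fetch only the VCFs for participants in a cohort selected from the BigQuery CDR, rather than listing the whole bucket.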
