Controlled CDR Directory

This article lists the assets currently available in the Researcher Workbench Controlled Tier dataset, organized by where and how each asset is stored.

Summary of changes since CDRv7

169,436 short-read WGS (srWGS) samples added with single nucleotide polymorphism, insertion, and deletion variant calls (SNPs and Indels) (2,684 Withdrawn, 172,120 New) - Total: 414,830
137,898 genotyping array (“array”) samples added (365 Withdrawn, 139,776 New) - Total: 447,278
85,671 srWGS samples with structural variant (SV) calls added (123 Withdrawn, 85,794 New) - Total: 97,061
Long read WGS (lrWGS) samples with SNP and Indel variants and SVs added (1,773 New) - Total: 2,800

BigQuery

Included within the Controlled Tier is a “mainline” curated data repository (CDR) that contains data types (e.g., Survey, Electronic Health Record (EHR), Wearable Device (Fitbit), Physical Measurement) also represented within the Registered Tier, but with different privacy protections applied. Like the Registered Tier, the Controlled Tier includes a default CDR, which is queryable through the Researcher Workbench dataset tools or BigQuery directly, and a base CDR that is queryable only through BigQuery. Below is information about the mainline CDR location within BigQuery.

For more information about mainline CDR creation and the differences between the default and base instances, please see here. For information about the specific tables, fields, and privacy protections applied to the Registered and Controlled Tier CDRs, please see the CDR Data Dictionary.

Asset	Location	Env var	Description
Default Mainline CDR	fc-aou-cdr-prod.C2024Q3R3	WORKSPACE_CDR	Main BigQuery CDR representative of the same data types (e.g., Survey, Electronic Health Record (EHR), Wearable Device (Fitbit), Physical Measurement) included in the Registered Tier, but with fewer privacy protections applied.
Base Mainline CDR	fc-aou-cdr-prod.C2024Q3R3_base		Same CDR representation as noted above, but with fewer processing/convenience transformations applied.
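
As a minimal sketch of how the dataset locations above can be turned into fully qualified BigQuery table references, the Python below reads WORKSPACE_CDR and falls back to the default location listed in the table when the variable is not set. The helper names cdr_dataset and qualified_table and the example table name person are illustrative assumptions, not part of any Workbench API. The base dataset name simply follows the _base suffix shown in the table.

```python
import os

# Default location copied from the table above; used only when the
# WORKSPACE_CDR environment variable is not set (e.g. outside the Workbench).
DEFAULT_CDR = "fc-aou-cdr-prod.C2024Q3R3"


def cdr_dataset(base: bool = False) -> str:
    """Return the mainline CDR dataset, or its base instance if requested."""
    dataset = os.environ.get("WORKSPACE_CDR", DEFAULT_CDR)
    # Assumption: the base instance is the default name plus "_base",
    # matching the two locations listed in the table above.
    return f"{dataset}_base" if base else dataset


def qualified_table(table: str, base: bool = False) -> str:
    """Build a backtick-quoted project.dataset.table reference for SQL."""
    return f"`{cdr_dataset(base)}.{table}`"


# "person" is a hypothetical example table name; consult the CDR Data
# Dictionary for the tables that actually exist.
query = f"SELECT COUNT(*) AS n FROM {qualified_table('person')}"
print(query)
```

The resulting string can be passed to whichever BigQuery client you already use. Building the reference from the environment variable rather than hard-coding it should make the same notebook easier to reuse when the CDR version changes.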

Cloud Storage

Also included in the Controlled Tier is a genomics CDR that is stored separately from the BigQuery assets referenced above. Note that the genomics CDR is available only through the Controlled Tier (i.e., genomic data is not available within the Registered Tier).

Complete descriptions of the genomic data file types (including auxiliary files) are available in the article How the All of Us genomic data are organized. Before using the genomic data, we highly recommend reading the Known Issues section of the All of Us Research Program Genomic Research Data Quality Report. For help displaying HTML files in the Researcher Workbench, please see our featured notebook.

The environment variables (Env var) are predefined variables in the Researcher Workbench and can be used to reference the full path.
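
To illustrate how these variables might be read from a notebook, here is a small sketch. It assumes the variables are exposed as ordinary process environment variables. The helper asset_path and the fallback dictionary are hypothetical conveniences, populated with a few locations copied from the table below so the snippet also runs outside the Workbench.

```python
import os

# Documented locations copied from the Cloud Storage table in this article.
# They serve only as fallbacks when a variable is not defined, for example
# when the snippet runs outside the Researcher Workbench.
DOCUMENTED_PATHS = {
    "CDR_STORAGE_PATH": "gs://fc-aou-datasets-controlled/v8",
    "WGS_CRAM_MANIFEST_PATH": "gs://fc-aou-datasets-controlled/v8/wgs/cram/manifest.csv",
    "MICROARRAY_HAIL_STORAGE_PATH": "gs://fc-aou-datasets-controlled/v8/microarray/hail.mt",
}


def asset_path(env_var: str) -> str:
    """Return the path held by env_var, or the documented fallback."""
    value = os.environ.get(env_var) or DOCUMENTED_PATHS.get(env_var)
    if value is None:
        raise KeyError(f"{env_var} is not set and has no documented fallback")
    return value


print(asset_path("CDR_STORAGE_PATH"))
```

Preferring the environment variable over the literal path means the value defined by the workspace takes precedence whenever it is available.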

If you use gsutil to access the CDR bucket, you will need to pass an additional flag (-u followed by your Google project) in the command, as described in more detail here:

!gsutil -u $GOOGLE_PROJECT ls gs://fc-aou-datasets-controlled
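
The same listing can be assembled from Python, which may be convenient when the result needs further processing. This is a sketch only: the function name gsutil_ls_command is hypothetical, and it assumes GOOGLE_PROJECT is available as an environment variable, as the command above suggests. The snippet only builds and prints the command. Running it requires gsutil and Workbench credentials.

```python
import os
import shlex
from typing import List, Optional


def gsutil_ls_command(uri: str, project: Optional[str] = None) -> List[str]:
    """Build a gsutil listing command that includes the -u project flag."""
    project = project or os.environ.get("GOOGLE_PROJECT")
    if not project:
        raise ValueError("Set GOOGLE_PROJECT or pass project explicitly")
    # -u precedes the subcommand, mirroring the command shown above.
    return ["gsutil", "-u", project, "ls", uri]


# "my-project" is a placeholder; inside the Workbench, omit the argument
# so that GOOGLE_PROJECT is used instead.
cmd = gsutil_ls_command("gs://fc-aou-datasets-controlled", project="my-project")
print(shlex.join(cmd))

# To execute it where gsutil is installed:
#   import subprocess
#   listing = subprocess.run(cmd, check=True, capture_output=True, text=True).stdout
```

Passing the command as a list rather than a single string avoids shell quoting problems if a path ever contains unusual characters.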

Asset	Location	Env var	Description
All CDR assets path	gs://fc-aou-datasets-controlled/v8	CDR_STORAGE_PATH	All Cloud Storage assets for this CDR version are under this path

short-read whole genome sequencing (srWGS) data	gs://fc-aou-datasets-controlled/v8/wgs/short_read/snpindel/
srWGS: Variant Dataset (VDS)	.../vds/hail.vds	WGS_VDS_PATH	The srWGS joint callset in VDS format.
srWGS: Exome callsets	.../exome/		The srWGS SNP and Indel variants within the exon regions of the Gencode v42 basic transcripts.
Exome Hail MatrixTable (MT)	.../exome/multiMT/hail.mt	WGS_EXOME_MULTI_HAIL_PATH	Multiallelic sites are not split.
Exome Hail MT multiallelic split	.../exome/splitMT/hail.mt	WGS_EXOME_SPLIT_HAIL_PATH	Multiallelic sites are split into separate records.
Exome VCF	.../exome/vcf/	WGS_EXOME_VCF_PATH	Sharded by chromosome into multiple files.
Exome PLINK BED	.../exome/plink_bed/		PLINK binary biallelic genotype table (.bed), along with .fam and .bim files.
Exome BGEN	.../exome/bgen/		Binary GEN (BGEN) files containing sample, Hail index, and bgenix index.
Exome PGEN	.../exome/pgen/		PLINK 2 binary genotype table (PGEN) files containing pgen, pvar and psam files.
Exome UCSC BED	.../exome/bed/		UCSC BED file with the regions used to generate the Exome callset.
srWGS: ACAF threshold callsets	.../acaf_threshold/		The srWGS SNP and Indel variants that are frequent in the All of Us genetic ancestry groups.
ACAF threshold Hail MT	.../acaf_threshold/multiMT/hail.mt	WGS_ACAF_THRESHOLD_MULTI_HAIL_PATH	Multiallelic sites are not split.
ACAF threshold Hail MT multiallelic split	.../acaf_threshold/splitMT/hail.mt	WGS_ACAF_THRESHOLD_SPLIT_HAIL_PATH	Multiallelic sites are split into separate records.
ACAF threshold VCF	.../acaf_threshold/vcf/	WGS_ACAF_THRESHOLD_VCF_PATH	Sharded by chromosome into multiple files.
ACAF threshold PLINK BED	.../acaf_threshold/plink_bed/		PLINK binary biallelic genotype table (.bed), along with .fam and .bim files.
ACAF threshold BGEN	.../acaf_threshold/bgen/		Binary GEN (BGEN) files containing sample, Hail index, and bgenix index.
ACAF threshold PGEN	.../acaf_threshold/pgen/		PLINK 2 binary genotype table (PGEN) files containing pgen, pvar and psam files.
ACAF threshold UCSC BED	.../acaf_threshold/bed/		UCSC BED file with the regions used to generate the ACAF threshold callset.
srWGS: ClinVar variant callsets	.../clinvar/		The srWGS SNP and Indel variants that are in ClinVar, not limited to pathogenic or likely pathogenic variants.
ClinVar Hail MT	.../clinvar/multiMT/hail.mt	WGS_CLINVAR_MULTI_HAIL_PATH	Multiallelic sites are not split.
ClinVar Hail MT multiallelic split	.../clinvar/splitMT/hail.mt	WGS_CLINVAR_SPLIT_HAIL_PATH	Multiallelic sites are split into separate records.
ClinVar VCF	.../clinvar/vcf/	WGS_CLINVAR_VCF_PATH	Sharded by chromosome into multiple files.
ClinVar PLINK BED	.../clinvar/plink_bed/		PLINK binary biallelic genotype table (.bed), along with .fam and .bim files.
ClinVar BGEN	.../clinvar/bgen/		Binary GEN (BGEN) files containing sample, Hail index, and bgenix index.
ClinVar PGEN	.../clinvar/pgen/		PLINK 2 binary genotype table (PGEN) files containing pgen, pvar and psam files.
ClinVar UCSC BED	.../clinvar/bed/		UCSC BED file with the regions used to generate the ClinVar variants callset.
srWGS: Challenging Medically Relevant Gene (CMRG) callset	gs://fc-aou-datasets-controlled/v8/wgs/short_read/snpindel/cmrg/	WGS_CMRG_VCF_PATH	The CMRG callset is a VCF for 33 genes that are impacted by false duplication and collapse errors in the GRCh38 reference genome. Called with a masked hg38 reference.
srWGS: CRAM files	gs://fc-aou-datasets-controlled/v8/wgs/cram/manifest.csv	WGS_CRAM_MANIFEST_PATH	The manifest CSV contains one row per sample with: person_id,cram_uri,cram_index_uri

srWGS: auxiliary files	gs://fc-aou-datasets-controlled/v8/wgs/short_read/snpindel/aux		Auxiliary files for the SNP and Indel variants for srWGS data.
Variant Annotation Table (VAT)	…/vat/vat_complete.bgz.tsv.gz		Variant functional annotations and metadata
Genetic ancestry	…/ancestry/ancestry_preds.tsv …/ancestry/preds_oth.html …/ancestry/loadings.ht/ …/ancestry/eigenvalues.txt …/ancestry/rf_classifier.pkl …/ancestry/training_pca.tsv …/ancestry/merged_sites_only_intersection.vcf.bgz …/ancestry/merged_sites_only_intersection.vcf.bgz.tbi		The genetic ancestry groupings for each participant with srWGS data along with additional files, described in How the All of Us genomic data are organized. The sites-only VCFs are also known as HQ sites in the QC process.
Admixture estimates	…/admixture_estimates/aou_admixture_estimates_rye_v8.Q …/admixture_estimates/aou_admixture_estimates_rye_v8.fam …/admixture_estimates/reference_admixture_estimates_rye_v8.Q …/admixture_estimates/reference_admixture_estimates_rye_v8.fam		Computed admixture estimates in .Q and .fam file formats for all samples in the srWGS joint callset and reference samples.
Pharmacogenomics (PGx) calls	.../pgx/high_concordance/ .../pgx/low_concordance/		Haplotype calls and predicted phenotypes for over 22 genes relevant to human drug metabolism.
Statistical phasing	…/phasing/		Phasing data for all srWGS samples, provided as multi-sample VCFs, sharded by chromosome.
Relatedness	…/relatedness/relatedness.tsv …/relatedness/relatedness_flagged_samples.tsv		List of sample pairs with kinship scores > 0.1, and samples to be removed to have a fully unrelated cohort.
QC information	…/qc/flagged_samples.tsv …/qc/all_samples.tsv …/qc/metrics.html …/qc/pc1vspc2.html …/qc/genomic_metrics.tsv …/qc/control_samples/		The list of srWGS samples flagged in the joint callset QC process (tsv), the QC metrics for all samples (tsv), the genomic metrics file, plots of the metric residuals and the first two principal components (*.html) used in the genetic ancestry and joint callset QC, and control sample GVCFs from the sensitivity and precision analysis.

srWGS Structural Variants (SVs)	gs://fc-aou-datasets-controlled/v8/wgs/short_read/structural_variants		SV calls are available for 97,061 srWGS samples.
srWGS SV VCF	.../vcf
srWGS SV sites-only VCF	.../vcf/sites-only/		The sites-only VCF has all variant sites but no genotype information.
srWGS SV maximal set of unrelated samples	.../aux/relatedness		The maximal set of unrelated samples in the srWGS SV cohort.
srWGS SV unrelated sites-only VCF	…/vcf/unrelated-sites-only		The sites-only VCF containing annotations from the maximal set of unrelated samples.
srWGS SV samples with probable aneuploidies	.../aux/aneuploidies		There are three separate files for samples with probable aneuploidies: mosaic autosomal aneuploidy, mosaic allosomal aneuploidy, and germline allosomal aneuploidy.
srWGS SV sample list	.../aux/sample_list/AoU_srWGS_SV.v8.research_ids.txt		All research_ids that have srWGS SV data.

Long read whole genome sequencing (lrWGS)	gs://fc-aou-datasets-controlled/v8/wgs/long_read/
lrWGS single sample file manifest	.../manifest.csv		The lrWGS manifest contains file locations for all single sample lrWGS raw data, variant data, and auxiliary data. See How the All of Us genomic data are organized for a description of the manifest and each file.
Joint-called Hail MT (grch38_noalt)	…/BCM/ont/joint_call/GRCh38/v8.BCM_ONT_high.QualFT34.mt …/BCM/revio/joint_call/GRCh38/v8.BCM_Rev_high.QualFT40.mt …/BCM/sequel2e/joint_call/GRCh38/v8.BCM_Seq_high.QualFT40.mt …/BI/revio/joint_call/GRCh38/v8.BI_Rev_mid.QualFT40.mt …/BI/sequel2e/joint_call/GRCh38/v8.BI_Seq_high.QualFT40.mt …/BI/sequel2e/joint_call/GRCh38/v8.BI_Seq_mid.QualFT40.mt …/HA/revio/joint_call/GRCh38/v8.HA_Rev_mid.QualFT40.mt …/JHU/ont/joint_call/GRCh38/v8.JHU_ONT_high.QualFT34.mt …/UW/revio/joint_call/GRCh38/v8.UW_Rev_high.QualFT40.mt …/UW/sequel2e/joint_call/GRCh38/v8.UW_Seq_high.QualFT40.mt		lrWGS joint SNP & Indel callset for the grch38_noalt reference. Provided for each cohort.
Joint-called Hail MT (T2Tv2.0)	.../BCM/ont/joint_call/T2T/v8.BCM_ONT_high.QualFT34.mt …/BCM/revio/joint_call/T2T/v8.BCM_Rev_high.QualFT40.mt …/BCM/sequel2e/joint_call/T2T/v8.BCM_Seq_high.QualFT40.mt …/BI/revio/joint_call/T2T/v8.BI_Rev_mid.QualFT40.mt …/BI/sequel2e/joint_call/T2T/v8.BI_Seq_high.QualFT40.mt …/BI/sequel2e/joint_call/T2T/v8.BI_Seq_mid.QualFT40.mt …/HA/revio/joint_call/T2T/v8.HA_Rev_mid.QualFT40.mt …/JHU/ont/joint_call/T2T/v8.JHU_ONT_high.QualFT34.mt …/UW/revio/joint_call/T2T/v8.UW_Rev_high.QualFT40.mt …/UW/sequel2e/joint_call/T2T/v8.UW_Seq_high.QualFT40.mt		lrWGS joint SNP and Indel callset for the T2Tv2.0 reference. Provided for each cohort.
Joint-called GVCF (grch38_noalt)	…/BCM/ont/joint_call/GRCh38/ …/BCM/revio/joint_call/GRCh38/ …/BCM/sequel2e/joint_call/GRCh38/ …/BI/revio/joint_call/GRCh38/ …/BI/sequel2e/joint_call/GRCh38/ …/HA/revio/joint_call/GRCh38/ …/JHU/ont/joint_call/GRCh38/ …/UW/revio/joint_call/GRCh38/ …/UW/sequel2e/joint_call/GRCh38/		lrWGS joint SNP & Indel GVCF for the grch38_noalt reference with TBI index. Provided for each cohort.
Joint-called GVCF (T2Tv2.0)	…/BCM/ont/joint_call/T2T/ …/BCM/revio/joint_call/T2T/ …/BCM/sequel2e/joint_call/T2T/ …/BI/revio/joint_call/T2T/ …/BI/sequel2e/joint_call/T2T/ …/HA/revio/joint_call/T2T/ …/JHU/ont/joint_call/T2T/ …/UW/revio/joint_call/T2T/ …/UW/sequel2e/joint_call/T2T/		lrWGS joint SNP & Indel GVCF for the T2Tv2.0 reference with TBI index. Provided for each cohort.
Aux (grch38_noalt)	…/BCM/aux/v8.BCM.auxiliary_metrics.GRCh38.tsv …/BI/aux/v8.BI.auxiliary_metrics.GRCh38.tsv …/HA/aux/v8.HA.auxiliary_metrics.GRCh38.tsv …/JHU/aux/v8.JHU.auxiliary_metrics.GRCh38.tsv …/UW/aux/v8.UW.auxiliary_metrics.GRCh38.tsv		Auxiliary file holding metric values for each sample for the grch38_noalt reference. Provided in a file for each sequencing facility.
Aux (T2Tv2.0)	…/BCM/aux/v8.BCM.auxiliary_metrics.T2T.tsv …/BI/aux/v8.BI.auxiliary_metrics.T2T.tsv …/HA/aux/v8.HA.auxiliary_metrics.T2T.tsv …/JHU/aux/v8.JHU.auxiliary_metrics.T2T.tsv …/UW/aux/v8.UW.auxiliary_metrics.T2T.tsv		Auxiliary file holding metric values for each sample for the T2Tv2.0 reference. Provided in a file for each sequencing facility.
lrWGS flagged samples	.../samples_flagged_by_qc.tsv		The list of lrWGS samples flagged in the joint callset QC process with reasons for flagging, in tsv format.

Array: single sample VCF manifest	gs://fc-aou-datasets-controlled/v8/microarray/vcf/manifest.csv	MICROARRAY_VCF_MANIFEST_PATH	Path to manifest CSV file that contains a row per sample of: person_id,vcf_uri,vcf_index_uri
Array: Hail MT	gs://fc-aou-datasets-controlled/v8/microarray/hail.mt	MICROARRAY_HAIL_STORAGE_PATH	All of the samples have been merged into a single MT.
Array: PLINK BED files	gs://fc-aou-datasets-controlled/v8/microarray/plink/arrays.*		PLINK .bed, .fam, and .bim files.
Array: IDAT files	gs://fc-aou-datasets-controlled/v8/microarray/idat/manifest.csv	MICROARRAY_IDAT_MANIFEST_PATH	Path to manifest.csv file that contains a row per sample of: person_id,green_idat_uri,red_idat_uri

Known Issues: sample lists associated with known issues	gs://fc-aou-datasets-controlled/v8/known_issues/rids_with_low_coverage_all_failures_wgs_v8_known_issue_10.tsv .../wgs_v8_known_issue_1.txt …/wgs_v8_known_issue_6.txt		Each file contains a list of sample IDs associated with known issues. For more information, please see the All of Us Genomic Data Quality Report.
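
Many rows in the table above abbreviate their location with a leading ".../" or "…/". The sketch below shows one way to expand such shorthand, under the assumption that the elision stands for the location given in the first row of the same group. That reading is an interpretation of the table, so the expanded paths should be checked against the corresponding environment variable or a gsutil listing before use. The names GROUP_PREFIXES and expand_location are hypothetical.

```python
# Group locations copied from the first row of each table group above.
GROUP_PREFIXES = {
    "srwgs_snpindel": "gs://fc-aou-datasets-controlled/v8/wgs/short_read/snpindel/",
    "srwgs_aux": "gs://fc-aou-datasets-controlled/v8/wgs/short_read/snpindel/aux",
    "srwgs_sv": "gs://fc-aou-datasets-controlled/v8/wgs/short_read/structural_variants",
    "lrwgs": "gs://fc-aou-datasets-controlled/v8/wgs/long_read/",
}

# The table writes the elision both as three dots and as a single
# ellipsis character, so both forms are recognised.
ELISION_MARKERS = (".../", "…/")


def expand_location(shorthand: str, group: str) -> str:
    """Replace a leading elision marker with the prefix of the given group."""
    for marker in ELISION_MARKERS:
        if shorthand.startswith(marker):
            prefix = GROUP_PREFIXES[group].rstrip("/")
            return f"{prefix}/{shorthand[len(marker):]}"
    # No marker found: assume the location is already a full path.
    return shorthand


print(expand_location(".../exome/vcf/", "srwgs_snpindel"))
print(expand_location("…/vat/vat_complete.bgz.tsv.gz", "srwgs_aux"))
```

Where an environment variable exists for an asset (for example WGS_EXOME_VCF_PATH), reading that variable is likely the more reliable option. The expansion is mainly useful for rows that have no variable listed.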
