Controlled CDR Directory


This article lists the assets currently available in the Researcher Workbench Controlled Tier dataset, organized by where and how each is stored.

Summary of changes since CDRv7

  • 169,436 short-read WGS (srWGS) samples added with single nucleotide polymorphism, insertion, and deletion variant calls (SNPs and Indels) (2,684 Withdrawn, 172,120 New) - Total: 414,830
  • 137,898 genotyping array (“array”) samples added (365 Withdrawn, 139,776 New) - Total: 447,278
  • 85,671 srWGS samples with structural variant (SV) calls added (123 Withdrawn, 85,794 New) - Total: 97,061
  • Long-read WGS (lrWGS) samples with SNP and Indel variants and SVs added (1,773 New) - Total: 2,800
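
As a quick sanity check on the figures above, the sketch below recomputes the net number of samples added, assuming that "added" means New minus Withdrawn. That reading is consistent with the srWGS SNP/Indel and srWGS SV bullets as listed. The function name is illustrative only.

```python
def net_added(new: int, withdrawn: int) -> int:
    """Net change in sample count: newly released samples minus withdrawn ones."""
    return new - withdrawn


# Counts copied from the srWGS SNP/Indel and srWGS SV bullets above.
changes = {
    "srWGS SNP/Indel": {"new": 172_120, "withdrawn": 2_684},
    "srWGS SV": {"new": 85_794, "withdrawn": 123},
}

for name, counts in changes.items():
    net = net_added(counts["new"], counts["withdrawn"])
    print(f"{name}: {net:,} net samples added")
```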

 

BigQuery

Included within the Controlled Tier is a “mainline” curated data repository (CDR) that contains data types (e.g., Survey, Electronic Health Record (EHR), Wearable Device (Fitbit), Physical Measurement) also represented within the Registered Tier, but with different privacy protections applied. Like the Registered Tier, the Controlled Tier includes a default CDR, which is queryable through the Researcher Workbench dataset tools or BigQuery directly, and a base CDR that is queryable only through BigQuery. Below is information about the mainline CDR location within BigQuery.

For more information about mainline CDR creation and the differences between the default and base instances, please see here. For information about the specific tables, fields, and privacy protections applied to the Registered and Controlled Tier CDRs, please see the CDR Data Dictionary.

 

Asset | Location | Env var | Description
Default Mainline CDR | fc-aou-cdr-prod.C2024Q3R3 | WORKSPACE_CDR | Main BigQuery CDR representative of the same data types (e.g., Survey, Electronic Health Record (EHR), Wearable Device (Fitbit), Physical Measurement) included in the Registered Tier, but with fewer privacy protections applied.
Base Mainline CDR | fc-aou-cdr-prod.C2024Q3R3_base | | Same CDR representation as noted above, but with fewer processing/convenience transformations applied.
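
The sketch below shows one way to build a fully qualified BigQuery table reference from the WORKSPACE_CDR environment variable. It makes several assumptions: the helper names are hypothetical, the "_base" suffix is inferred from the two locations listed above rather than from a documented rule, and "person" is used only as an example table name. The documented default dataset serves as a fallback when the variable is not set.

```python
import os

# Documented default dataset, used here only as a fallback outside the Workbench.
DEFAULT_CDR = "fc-aou-cdr-prod.C2024Q3R3"


def cdr_dataset(base: bool = False) -> str:
    """Return the default CDR dataset, or the base CDR when base=True."""
    dataset = os.environ.get("WORKSPACE_CDR", DEFAULT_CDR)
    # Assumption: the base CDR is the default dataset name plus "_base".
    return f"{dataset}_base" if base else dataset


def count_rows_sql(table: str, base: bool = False) -> str:
    """Build a simple row-count query against one CDR table."""
    return f"SELECT COUNT(*) AS n_rows FROM `{cdr_dataset(base)}.{table}`"


print(count_rows_sql("person"))
print(count_rows_sql("person", base=True))
```

The resulting string can be passed to whichever BigQuery client is available in your environment.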

Cloud Storage

Also included in the Controlled Tier is a genomics CDR that is stored separately from the BigQuery assets referenced above. Note that the genomics CDR is available only through the Controlled Tier (i.e., genomic data is not available within the Registered Tier).

Complete descriptions of the genomic data file types (including auxiliary files) are available in the article How the All of Us genomic data are organized. Before using the genomic data, we highly recommend reading the Known Issues section of the All of Us Research Program Genomic Research Data Quality Report. For help displaying HTML files in the Researcher Workbench, please see our featured notebook.

The environment variables (Env var) are predefined variables in the Researcher Workbench and can be used to reference the full path.
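
A minimal sketch of resolving an asset path follows. It prefers the predefined environment variable and otherwise expands the ".../" shorthand used in the table below. It assumes that the shorthand stands for the base path given in the first row of each table section. The function name and the fallback behavior are illustrative, not part of the Workbench.

```python
import os


def resolve_path(env_var: str, shorthand: str, base: str) -> str:
    """Prefer the predefined environment variable; otherwise expand the shorthand."""
    value = os.environ.get(env_var)
    if value:
        return value
    # Assumption: a leading ".../" (or "…/") abbreviates the section's base path.
    for prefix in (".../", "…/"):
        if shorthand.startswith(prefix):
            return base.rstrip("/") + "/" + shorthand[len(prefix):]
    raise ValueError(f"expected a path starting with '.../': {shorthand!r}")


SRWGS_BASE = "gs://fc-aou-datasets-controlled/v8/wgs/short_read/snpindel/"
print(resolve_path("WGS_EXOME_VCF_PATH", ".../exome/vcf/", SRWGS_BASE))
```

Inside the Researcher Workbench the environment variable should already be set, so the shorthand expansion only matters for assets that have no Env var entry.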

 

If you use gsutil to access the CDR bucket, you will need to pass the additional -u flag with your Google project in the command, as shown here:
!gsutil -u $GOOGLE_PROJECT ls gs://fc-aou-datasets-controlled
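
The same command can be assembled in Python, for example to run it with subprocess instead of notebook shell syntax. This is a sketch only: the helper name is hypothetical, and the fallback values are placeholders for running outside the Workbench, where GOOGLE_PROJECT and CDR_STORAGE_PATH are not predefined.

```python
import os
import shlex


def gsutil_ls_command(path: str, billing_project: str) -> list[str]:
    """Return the argument list for `gsutil -u <project> ls <path>`."""
    return ["gsutil", "-u", billing_project, "ls", path]


# Placeholder fallbacks; inside the Workbench both variables are expected to be set.
project = os.environ.get("GOOGLE_PROJECT", "my-billing-project")
path = os.environ.get("CDR_STORAGE_PATH", "gs://fc-aou-datasets-controlled/v8")

cmd = gsutil_ls_command(path, project)
print(shlex.join(cmd))
# To execute it from a notebook: subprocess.run(cmd, check=True)
```

Building the command as a list avoids shell quoting problems if a path ever contains special characters.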

 

Asset Location Env var Description
All CDR assets path gs://fc-aou-datasets-controlled/v8 CDR_STORAGE_PATH All Cloud Storage assets for this CDR version are under this path
       
short-read whole genome sequencing (srWGS) data gs://fc-aou-datasets-controlled/v8/wgs/short_read/snpindel/    
srWGS: Variant Dataset (VDS) .../vds/hail.vds WGS_VDS_PATH The srWGS joint callset in VDS format.
srWGS: Exome callsets .../exome/   The srWGS SNP and Indel variants within the exon regions of the Gencode v42 basic transcripts.
  • Exome Hail MatrixTable (MT)
.../exome/multiMT/hail.mt WGS_EXOME_MULTI_HAIL_PATH Multiallelic sites are not split.
  • Exome Hail MT multiallelic split
.../exome/splitMT/hail.mt WGS_EXOME_SPLIT_HAIL_PATH Multiallelic sites are split into separate records.
  • Exome VCF
.../exome/vcf/ WGS_EXOME_VCF_PATH Sharded by chromosome into multiple files.
  • Exome PLINK BED
.../exome/plink_bed/  

PLINK binary biallelic genotype table (.bed), along with .fam and .bim files.

  • Exome BGEN
.../exome/bgen/  

Binary GEN (BGEN) files, along with sample, Hail index, and bgenix index files.

  • Exome PGEN
.../exome/pgen/  

PLINK 2 binary genotype table (PGEN) files containing pgen, pvar and psam files.

  • Exome UCSC BED
.../exome/bed/   UCSC BED file with the regions used to generate the Exome callset.
srWGS: ACAF threshold callsets .../acaf_threshold/   The srWGS SNP and Indel variants that are frequent in the All of Us genetic ancestry groups.
  • ACAF threshold Hail MT
.../acaf_threshold/multiMT/hail.mt WGS_ACAF_THRESHOLD_MULTI_HAIL_PATH Multiallelic sites are not split.
  • ACAF threshold Hail MT multiallelic split
.../acaf_threshold/splitMT/hail.mt WGS_ACAF_THRESHOLD_SPLIT_HAIL_PATH Multiallelic sites are split into separate records.
  • ACAF threshold VCF
.../acaf_threshold/vcf/ WGS_ACAF_THRESHOLD_VCF_PATH Sharded by chromosome into multiple files.
  • ACAF threshold PLINK BED
.../acaf_threshold/plink_bed/  

PLINK binary biallelic genotype table (.bed), along with .fam and .bim files.

  • ACAF threshold BGEN
.../acaf_threshold/bgen/  

Binary GEN (BGEN) files, along with sample, Hail index, and bgenix index files.

  • ACAF threshold PGEN
.../acaf_threshold/pgen/   PLINK 2 binary genotype table (PGEN) files containing pgen, pvar and psam files.
  • ACAF threshold UCSC BED
.../acaf_threshold/bed/   UCSC BED file with the regions used to generate the ACAF threshold callset.
srWGS: ClinVar variant callsets .../clinvar/   The srWGS SNP and Indel variants that are in ClinVar, not limited to pathogenic or likely pathogenic variants.
  • ClinVar Hail MT
.../clinvar/multiMT/hail.mt WGS_CLINVAR_MULTI_HAIL_PATH Multiallelic sites are not split.
  • ClinVar Hail MT multiallelic split
.../clinvar/splitMT/hail.mt WGS_CLINVAR_SPLIT_HAIL_PATH Multiallelic sites are split into separate records.
  • ClinVar VCF
.../clinvar/vcf/ WGS_CLINVAR_VCF_PATH Sharded by chromosome into multiple files.
  • ClinVar PLINK BED
.../clinvar/plink_bed/  

PLINK binary biallelic genotype table (.bed), along with .fam and .bim files.

  • ClinVar BGEN
.../clinvar/bgen/  

Binary GEN (BGEN) files, along with sample, Hail index, and bgenix index files.

  • ClinVar PGEN
.../clinvar/pgen/   PLINK 2 binary genotype table (PGEN) files containing pgen, pvar and psam files.
  • ClinVar UCSC BED
.../clinvar/bed/   UCSC BED file with the regions used to generate the ClinVar variants callset.
       
srWGS: CRAM files gs://fc-aou-datasets-controlled/v8/wgs/cram/manifest.csv WGS_CRAM_MANIFEST_PATH The manifest CSV contains one row per sample with: person_id,cram_uri,cram_index_uri
       
srWGS: auxiliary files gs://fc-aou-datasets-controlled/v8/wgs/short_read/snpindel/aux   Auxiliary files for the SNP and Indel variants for srWGS data.
  • Variant Annotation Table (VAT)
…/vat/vat_complete.bgz.tsv.gz   Variant functional annotations and metadata
  • Genetic ancestry

…/ancestry/ancestry_preds.tsv

…/ancestry/eigenvalues.txt

…/ancestry/loadings.ht.tar.gz

…/ancestry/rf_classifier.pkl

…/ancestry/training_pca.tsv

…/ancestry/preds_oth.html

…/ancestry/merged_sites_only_intersection.vcf.bgz

…/ancestry/merged_sites_only_intersection.vcf.bgz.tbi

  The genetic ancestry groupings for each participant with srWGS data along with additional files, described in How the All of Us genomic data are organized.
  • Admixture estimates

…/admixture_estimates/aou_admixture_estimates_rye_v8.Q

…/admixture_estimates/aou_admixture_estimates_rye_v8.fam

…/admixture_estimates/reference_admixture_estimates_rye_v8.Q

…/admixture_estimates/reference_admixture_estimates_rye_v8.fam

 

Computed admixture estimates in .Q and .fam file formats for all samples in the srWGS joint callset and reference samples. 

 
  • Pharmacogenomics (PGx) calls

…/pgx/high_concordance/

…/pgx/low_concordance/

  Haplotype calls and predicted phenotypes for over 22 genes relevant to human drug metabolism.
  • Relatedness

…/relatedness/relatedness.tsv

…/relatedness/relatedness_flagged_samples.tsv

  List of sample pairs with kinship scores > 0.1, and samples to be removed to have a fully unrelated cohort.
  • QC information

…/qc/flagged_samples.tsv

…/qc/metrics.html

…/qc/pc1vspc2.html

…/qc/genomic_metrics.tsv

  The list of srWGS samples flagged in the joint callset QC process (tsv), the genomic metrics file, and plots of the metric residuals and the first two principal components (*.html) used in the genetic ancestry analysis and joint callset QC.

 

     

srWGS Structural Variants (SVs)

gs://fc-aou-datasets-controlled/v8/wgs/short_read/structural_variants   SV calls are available for 97,061 srWGS samples.
  • srWGS SV VCF
.../vcf    
  • srWGS SV sites-only VCF
.../vcf/sites-only/   The sites-only VCF has all variant sites but no genotype information.
  • srWGS SV maximal set of unrelated samples
.../aux/relatedness   The maximal set of unrelated samples in the srWGS SV cohort.
  • srWGS SV unrelated sites-only VCF
…/vcf/unrelated-sites-only   The sites-only VCF containing annotations from the maximal set of unrelated samples.
  • srWGS SV samples with probable aneuploidies
.../aux/aneuploidies   There are three separate files for samples with probable aneuploidies: mosaic autosomal aneuploidy, mosaic allosomal aneuploidy, and germline allosomal aneuploidy.
  • srWGS SV sample list
.../aux/sample_list/AoU_srWGS_SV.v8.research_ids.txt   All research_ids that have srWGS SV data.
       
Long read whole genome sequencing (lrWGS)  gs://fc-aou-datasets-controlled/v8/wgs/long_read/    
  • lrWGS single sample file manifest
   

The lrWGS manifest contains file locations for all single sample lrWGS raw data, variant data, and auxiliary data. See How the All of Us genomic data are organized for a description of the manifest and each file.

  • Joint-called Hail MT (grch38_noalt)

…/BCM/ont/joint_call/GRCh38/v8.BCM_ONT_high.QualFT34.mt

…/BCM/revio/joint_call/GRCh38/v8.BCM_Rev_high.QualFT40.mt

…/BCM/sequel2e/joint_call/GRCh38/v8.BCM_Seq_high.QualFT40.mt

…/BI/revio/joint_call/GRCh38/v8.BI_Rev_mid.QualFT40.mt

…/BI/sequel2e/joint_call/GRCh38/v8.BI_Seq_high.QualFT40.mt

…/BI/sequel2e/joint_call/GRCh38/v8.BI_Seq_mid.QualFT40.mt

…/HA/revio/joint_call/GRCh38/v8.HA_Rev_mid.QualFT40.mt

…/JHU/ont/joint_call/GRCh38/v8.JHU_ONT_high.QualFT34.mt

…/UW/revio/joint_call/GRCh38/v8.UW_Rev_high.QualFT40.mt

…/UW/sequel2e/joint_call/GRCh38/v8.UW_Seq_high.QualFT40.mt

  lrWGS joint SNP and Indel callset for the grch38_noalt reference. Provided for each cohort.
  • Joint-called Hail MT (T2Tv2.0)

.../BCM/ont/joint_call/T2T/v8.BCM_ONT_high.QualFT34.mt

…/BCM/revio/joint_call/T2T/v8.BCM_Rev_high.QualFT40.mt

…/BCM/sequel2e/joint_call/T2T/v8.BCM_Seq_high.QualFT40.mt

…/BI/revio/joint_call/T2T/v8.BI_Rev_mid.QualFT40.mt

…/BI/sequel2e/joint_call/T2T/v8.BI_Seq_high.QualFT40.mt

…/BI/sequel2e/joint_call/T2T/v8.BI_Seq_mid.QualFT40.mt

…/HA/revio/joint_call/T2T/v8.HA_Rev_mid.QualFT40.mt

…/JHU/ont/joint_call/T2T/v8.JHU_ONT_high.QualFT34.mt

…/UW/revio/joint_call/T2T/v8.UW_Rev_high.QualFT40.mt

…/UW/sequel2e/joint_call/T2T/v8.UW_Seq_high.QualFT40.mt

  lrWGS joint SNP and Indel callset for the T2Tv2.0 reference. Provided for each cohort.
  • Joint-called GVCF (grch38_noalt)

…/BCM/ont/joint_call/GRCh38/

…/BCM/revio/joint_call/GRCh38/

…/BCM/sequel2e/joint_call/GRCh38/

…/BI/revio/joint_call/GRCh38/

…/BI/sequel2e/joint_call/GRCh38/

…/HA/revio/joint_call/GRCh38/

…/JHU/ont/joint_call/GRCh38/

…/UW/revio/joint_call/GRCh38/

…/UW/sequel2e/joint_call/GRCh38/

  lrWGS joint SNP and Indel GVCF for the grch38_noalt reference with TBI index. Provided for each cohort.
  • Joint-called GVCF (T2Tv2.0)

…/BCM/ont/joint_call/T2T/

…/BCM/revio/joint_call/T2T/

…/BCM/sequel2e/joint_call/T2T/

…/BI/revio/joint_call/T2T/

…/BI/sequel2e/joint_call/T2T/

…/HA/revio/joint_call/T2T/

…/JHU/ont/joint_call/T2T/

…/UW/revio/joint_call/T2T/

…/UW/sequel2e/joint_call/T2T/

  lrWGS joint SNP and Indel GVCF for the T2Tv2.0 reference with TBI index. Provided for each cohort.
  • Aux (grch38_noalt)

…/BCM/aux/v8.BCM.auxiliary_metrics.GRCh38.tsv

…/BI/aux/v8.BI.auxiliary_metrics.GRCh38.tsv

…/HA/aux/v8.HA.auxiliary_metrics.GRCh38.tsv

…/JHU/aux/v8.JHU.auxiliary_metrics.GRCh38.tsv

…/UW/aux/v8.UW.auxiliary_metrics.GRCh38.tsv

  Auxiliary file holding metric values for each sample for the grch38_noalt reference. Provided in a file for each sequencing facility.
  • Aux (T2Tv2.0)

…/BCM/aux/v8.BCM.auxiliary_metrics.T2T.tsv

…/BI/aux/v8.BI.auxiliary_metrics.T2T.tsv

…/HA/aux/v8.HA.auxiliary_metrics.T2T.tsv

…/JHU/aux/v8.JHU.auxiliary_metrics.T2T.tsv

…/UW/aux/v8.UW.auxiliary_metrics.T2T.tsv

  Auxiliary file holding metric values for each sample for the T2Tv2.0 reference. Provided in a file for each sequencing facility.
lrWGS flagged samples .../samples_flagged_by_qc.tsv   The list of lrWGS samples flagged in the joint callset QC process with reasons for flagging, in tsv format.
       
Array: single sample VCF manifest gs://fc-aou-datasets-controlled/v8/microarray/vcf/manifest.csv MICROARRAY_VCF_MANIFEST_PATH

Path to manifest CSV file that contains a row per sample of: person_id,vcf_uri,vcf_index_uri

Array: Hail MT gs://fc-aou-datasets-controlled/v8/microarray/hail.mt MICROARRAY_HAIL_STORAGE_PATH

 All of the samples have been merged into a single MT.

Array: PLINK BED files

gs://fc-aou-datasets-controlled/v8/microarray/plink/arrays.*

 

PLINK .bed, .fam, and .bim files.

Array: IDAT files gs://fc-aou-datasets-controlled/v8/microarray/idat/manifest.csv MICROARRAY_IDAT_MANIFEST_PATH

Path to manifest.csv file that contains a row per sample of: person_id,green_idat_uri,red_idat_uri

       
Known Issues: sample lists associated with known issues

gs://fc-aou-datasets-controlled-ingest/v8/known_issues/wgs_v8_known_issue_1.txt

…/wgs_v8_known_issue_6.txt

 

Each file contains a list of sample IDs associated with known issues.

For more information, please see the All of Us Genomic Data Quality Report.
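
The manifest CSVs listed in the table above (CRAM, array VCF, and IDAT) each hold one row per sample. The sketch below indexes a CRAM-style manifest by person_id. It assumes the file begins with a header row naming the columns person_id, cram_uri, and cram_index_uri. The sample rows and the example-bucket URIs are invented for illustration, and the function name is hypothetical.

```python
import csv
import io

# Hypothetical manifest content; real rows come from the manifest.csv file.
SAMPLE_MANIFEST = """person_id,cram_uri,cram_index_uri
1000001,gs://example-bucket/1000001.cram,gs://example-bucket/1000001.cram.crai
1000002,gs://example-bucket/1000002.cram,gs://example-bucket/1000002.cram.crai
"""


def index_manifest(handle) -> dict[str, tuple[str, str]]:
    """Map each person_id to its (cram_uri, cram_index_uri) pair."""
    reader = csv.DictReader(handle)
    return {
        row["person_id"]: (row["cram_uri"], row["cram_index_uri"])
        for row in reader
    }


manifest = index_manifest(io.StringIO(SAMPLE_MANIFEST))
cram_uri, crai_uri = manifest["1000001"]
print(cram_uri)
print(crai_uri)
```

The same pattern should carry over to the array manifests by swapping in their column names (vcf_uri and vcf_index_uri, or green_idat_uri and red_idat_uri).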
