This article lists the assets currently available in the Researcher Workbench Controlled Tier dataset, organized by where and how each is stored.
Summary of changes since CDRv7
- 169,436 short-read WGS (srWGS) samples added with single nucleotide polymorphism, insertion, and deletion variant calls (SNPs and Indels) (2,684 Withdrawn, 172,120 New) - Total: 414,830
- 137,898 genotyping array (“array”) samples added (365 Withdrawn, 139,776 New) - Total: 447,278
- 85,671 srWGS samples with structural variant (SV) calls added (123 Withdrawn, 85,794 New) - Total: 97,061
- Long-read WGS (lrWGS) samples with SNP and Indel variants and SVs added (1,773 New) - Total: 2,800
BigQuery
Included within the Controlled Tier is a “mainline” curated data repository (CDR) that contains data types (e.g., Survey, Electronic Health Record (EHR), Wearable Device (Fitbit), Physical Measurement) also represented within the Registered Tier, but with different privacy protections applied. Like the Registered Tier, the Controlled Tier includes a default CDR, which is queryable through the Researcher Workbench dataset tools or BigQuery directly, and a base CDR that is queryable only through BigQuery. Below is information about the mainline CDR location within BigQuery.
For more information about mainline CDR creation and the differences between the default and base instances, please see here. For information about the specific tables, fields, and privacy protections applied to the Registered and Controlled Tier CDRs, please see the CDR Data Dictionary.
Asset | Location | Env var | Description |
Default Mainline CDR | fc-aou-cdr-prod.C2024Q3R3 | WORKSPACE_CDR | Main BigQuery CDR representative of the same data types (e.g., Survey, Electronic Health Record (EHR), Wearable Device (Fitbit), Physical Measurement) included in the Registered Tier, but with fewer privacy protections applied. |
Base Mainline CDR | fc-aou-cdr-prod.C2024Q3R3_base | | Same CDR representation as noted above, but with fewer processing/convenience transformations applied. |
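As an illustration of how the WORKSPACE_CDR variable from the table above might be used, the following is a minimal Python sketch that builds a fully qualified BigQuery table reference and a row-count query. The helper names `qualified_table` and `count_rows_sql` are hypothetical, and `person` is an assumed table name used only as an example; consult the CDR Data Dictionary for the actual tables.

```python
import os


def qualified_table(table_name, env=None):
    """Return a backtick-quoted project.dataset.table reference for BigQuery."""
    env = os.environ if env is None else env
    dataset = env.get("WORKSPACE_CDR")
    if not dataset:
        raise RuntimeError("WORKSPACE_CDR is not set in this environment")
    return f"`{dataset}.{table_name}`"


def count_rows_sql(table_name, env=None):
    """Build a simple row-count query against one CDR table."""
    return f"SELECT COUNT(*) AS n FROM {qualified_table(table_name, env)}"


# Example using the default mainline CDR location listed in the table above.
# "person" is an assumed table name, used here only for illustration.
example_env = {"WORKSPACE_CDR": "fc-aou-cdr-prod.C2024Q3R3"}
print(count_rows_sql("person", example_env))
```

Inside a workspace, the `env` argument can be omitted so that the value is read from the predefined environment variable. The resulting string can then be passed to whichever BigQuery client you already use.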
Cloud Storage
Also included in the Controlled Tier is a genomics CDR that is stored separately from the BigQuery assets referenced above. Note that the genomics CDR is available only through the Controlled Tier (i.e., genomic data is not available within the Registered Tier).
Complete descriptions of the genomic data file types (including auxiliary files) are available in the article How the All of Us genomic data are organized. Before using the genomic data, we highly recommend reading the Known Issues section of the All of Us Research Program Genomic Research Data Quality Report. For help displaying HTML files in the Researcher Workbench, please see our featured notebook.
The environment variables (Env var) are predefined variables in the Researcher Workbench and can be used to reference the full path.
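For example, a notebook might join a sub-path onto the location held in one of these variables. The sketch below is illustrative only: `asset_path` is a hypothetical helper, and the example value is the CDR_STORAGE_PATH location listed in the table that follows.

```python
import os


def asset_path(env_var, *parts, env=None):
    """Join sub-paths onto the Cloud Storage location held in an environment variable."""
    env = os.environ if env is None else env
    base = env.get(env_var)
    if not base:
        raise KeyError(f"{env_var} is not set in this environment")
    # Strip stray slashes so the pieces join with exactly one "/" between them.
    pieces = [base.rstrip("/")] + [p.strip("/") for p in parts]
    return "/".join(pieces)


# Example value taken from the asset table in this article.
example_env = {"CDR_STORAGE_PATH": "gs://fc-aou-datasets-controlled/v8"}
print(asset_path("CDR_STORAGE_PATH", "wgs", "short_read", "snpindel", "aux", env=example_env))
```

In a workspace, omitting `env` makes the helper read the predefined variable directly, and a missing variable raises an error early rather than producing a malformed path.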
If you use gsutil to access the CDR bucket, you will need to pass an additional flag in the command, as described in more detail here:
!gsutil -u $GOOGLE_PROJECT ls gs://fc-aou-datasets-controlled
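The same command can also be assembled in Python, which may help when the listing is one step in a larger script. This is a sketch under the assumption that GOOGLE_PROJECT is set in the environment, as in the command above; `gsutil_ls_command` is a hypothetical helper, and the project name in the example is a placeholder.

```python
import os
import shlex


def gsutil_ls_command(uri, env=None):
    """Build the gsutil argument list, including the -u flag with the Google project."""
    env = os.environ if env is None else env
    project = env.get("GOOGLE_PROJECT")
    if not project:
        raise KeyError("GOOGLE_PROJECT is not set in this environment")
    return ["gsutil", "-u", project, "ls", uri]


# "my-workspace-project" is a placeholder project ID, not a real value.
cmd = gsutil_ls_command(
    "gs://fc-aou-datasets-controlled",
    {"GOOGLE_PROJECT": "my-workspace-project"},
)
print(shlex.join(cmd))
```

The returned list can be handed to `subprocess.run(cmd, check=True)` to execute the listing; keeping the arguments as a list avoids shell quoting problems.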
Asset | Location | Env var | Description |
All CDR assets path | gs://fc-aou-datasets-controlled/v8 | CDR_STORAGE_PATH | All Cloud Storage assets for this CDR version are under this path |
short-read whole genome sequencing (srWGS) data | gs://fc-aou-datasets-controlled/v8/wgs/short_read/snpindel/ | | |
srWGS: Variant Dataset (VDS) | .../vds/hail.vds | WGS_VDS_PATH | The srWGS joint callset in VDS format. |
srWGS: Exome callsets | .../exome/ | | The srWGS SNP and Indel variants within the exon regions of the Gencode v42 basic transcripts. |
| .../exome/multiMT/hail.mt | WGS_EXOME_MULTI_HAIL_PATH | Multiallelic sites are not split. |
| .../exome/splitMT/hail.mt | WGS_EXOME_SPLIT_HAIL_PATH | Multiallelic sites are split into separate records. |
| .../exome/vcf/ | WGS_EXOME_VCF_PATH | Sharded by chromosome into multiple files. |
| .../exome/plink_bed/ | | PLINK binary biallelic genotype table (.bed), along with .fam and .bim files. |
| .../exome/bgen/ | | Binary GEN (BGEN) files containing sample, Hail index, and bgenix index. |
| .../exome/pgen/ | | PLINK 2 binary genotype table (PGEN) files containing pgen, pvar and psam files. |
| .../exome/bed/ | | UCSC BED file with the regions used to generate the Exome callset. |
srWGS: ACAF threshold callsets | .../acaf_threshold/ | | The srWGS SNP and Indel variants that are frequent in the All of Us genetic ancestry groups. |
| .../acaf_threshold/multiMT/hail.mt | WGS_ACAF_THRESHOLD_MULTI_HAIL_PATH | Multiallelic sites are not split. |
| .../acaf_threshold/splitMT/hail.mt | WGS_ACAF_THRESHOLD_SPLIT_HAIL_PATH | Multiallelic sites are split into separate records. |
| .../acaf_threshold/vcf/ | WGS_ACAF_THRESHOLD_VCF_PATH | Sharded by chromosome into multiple files. |
| .../acaf_threshold/plink_bed/ | | PLINK binary biallelic genotype table (.bed), along with .fam and .bim files. |
| .../acaf_threshold/bgen/ | | Binary GEN (BGEN) files containing sample, Hail index, and bgenix index. |
| .../acaf_threshold/pgen/ | | PLINK 2 binary genotype table (PGEN) files containing pgen, pvar and psam files. |
| .../acaf_threshold/bed/ | | UCSC BED file with the regions used to generate the ACAF threshold callset. |
srWGS: ClinVar variant callsets | .../clinvar/ | | The srWGS SNP and Indel variants that are in ClinVar, not limited to pathogenic or likely pathogenic variants. |
| .../clinvar/multiMT/hail.mt | WGS_CLINVAR_MULTI_HAIL_PATH | Multiallelic sites are not split. |
| .../clinvar/splitMT/hail.mt | WGS_CLINVAR_SPLIT_HAIL_PATH | Multiallelic sites are split into separate records. |
| .../clinvar/vcf/ | WGS_CLINVAR_VCF_PATH | Sharded by chromosome into multiple files. |
| .../clinvar/plink_bed/ | | PLINK binary biallelic genotype table (.bed), along with .fam and .bim files. |
| .../clinvar/bgen/ | | Binary GEN (BGEN) files containing sample, Hail index, and bgenix index. |
| .../clinvar/pgen/ | | PLINK 2 binary genotype table (PGEN) files containing pgen, pvar and psam files. |
| .../clinvar/bed/ | | UCSC BED file with the regions used to generate the ClinVar variants callset. |
srWGS: CRAM files | gs://fc-aou-datasets-controlled/v8/wgs/cram/manifest.csv | WGS_CRAM_MANIFEST_PATH | The manifest CSV contains one row per sample with: person_id,cram_uri,cram_index_uri |
srWGS: auxiliary files | gs://fc-aou-datasets-controlled/v8/wgs/short_read/snpindel/aux | | Auxiliary files for the SNP and Indel variants for srWGS data. |
| …/vat/vat_complete.bgz.tsv.gz | | Variant functional annotations and metadata |
| …/ancestry/ancestry_preds.tsv …/ancestry/eigenvalues.txt …/ancestry/loadings.ht.tar.gz …/ancestry/rf_classifier.pkl …/ancestry/training_pca.tsv …/ancestry/preds_oth.html …/ancestry/merged_sites_only_intersection.vcf.bgz …/ancestry/merged_sites_only_intersection.vcf.bgz.tbi | | The genetic ancestry groupings for each participant with srWGS data along with additional files, described in How the All of Us genomic data are organized. |
| …/admixture_estimates/aou_admixture_estimates_rye_v8.Q …/admixture_estimates/aou_admixture_estimates_rye_v8.fam …/admixture_estimates/reference_admixture_estimates_rye_v8.Q …/admixture_estimates/reference_admixture_estimates_rye_v8.fam | | Computed admixture estimates in .Q and .fam file formats for all samples in the srWGS joint callset and reference samples. |
| ../pgx/high_concordance/ ../pgx/low_concordance/ | | Haplotype calls and predicted phenotypes for over 22 genes relevant to human drug metabolism. |
| …/relatedness/relatedness.tsv …/relatedness/relatedness_flagged_samples.tsv | | List of sample pairs with kinship scores > 0.1, and samples to be removed to have a fully unrelated cohort. |
| …/qc/flagged_samples.tsv …/qc/metrics.html …/qc/pc1vspc2.html …/qc/genomic_metrics.tsv | | The list of srWGS samples flagged in the joint callset QC process (tsv), the genomic metrics file, and plots of the metric residuals and the first two principal components (*.html) used in the genetic ancestry analysis and joint callset QC. |
srWGS Structural Variants (SVs) | gs://fc-aou-datasets-controlled/v8/wgs/short_read/structural_variants | | SV calls are available for 97,061 srWGS samples. |
| .../vcf | | |
| .../vcf/sites-only/ | | The sites-only VCF has all variant sites but no genotype information. |
| .../aux/relatedness | | The maximal set of unrelated samples in the srWGS SV cohort. |
| …/vcf/unrelated-sites-only | | The sites-only VCF containing annotations from the maximal set of unrelated samples. |
| .../aux/aneuploidies | | There are three separate files for samples with probable aneuploidies: mosaic autosomal aneuploidy, mosaic allosomal aneuploidy, and germline allosomal aneuploidy. |
| .../aux/sample_list/AoU_srWGS_SV.v8.research_ids.txt | | All research_ids that have srWGS SV data. |
Long read whole genome sequencing (lrWGS) | gs://fc-aou-datasets-controlled/v8/wgs/long_read/ | | |
| | | The lrWGS manifest contains file locations for all single sample lrWGS raw data, variant data, and auxiliary data. See How the All of Us genomic data are organized for a description of the manifest and each file. |
| …/BCM/ont/joint_call/GRCh38/v8.BCM_ONT_high.QualFT34.mt …/BCM/revio/joint_call/GRCh38/v8.BCM_Rev_high.QualFT40.mt …/BCM/sequel2e/joint_call/GRCh38/v8.BCM_Seq_high.QualFT40.mt …/BI/revio/joint_call/GRCh38/v8.BI_Rev_mid.QualFT40.mt …/BI/sequel2e/joint_call/GRCh38/v8.BI_Seq_high.QualFT40.mt …/BI/sequel2e/joint_call/GRCh38/v8.BI_Seq_mid.QualFT40.mt …/HA/revio/joint_call/GRCh38/v8.HA_Rev_mid.QualFT40.mt …/JHU/ont/joint_call/GRCh38/v8.JHU_ONT_high.QualFT34.mt …/UW/revio/joint_call/GRCh38/v8.UW_Rev_high.QualFT40.mt …/UW/sequel2e/joint_call/GRCh38/v8.UW_Seq_high.QualFT40.mt | | lrWGS joint SNP and Indel callset for the grch38_noalt reference. Provided for each cohort. |
| .../BCM/ont/joint_call/T2T/v8.BCM_ONT_high.QualFT34.mt …/BCM/revio/joint_call/T2T/v8.BCM_Rev_high.QualFT40.mt …/BCM/sequel2e/joint_call/T2T/v8.BCM_Seq_high.QualFT40.mt …/BI/revio/joint_call/T2T/v8.BI_Rev_mid.QualFT40.mt …/BI/sequel2e/joint_call/T2T/v8.BI_Seq_high.QualFT40.mt …/BI/sequel2e/joint_call/T2T/v8.BI_Seq_mid.QualFT40.mt …/HA/revio/joint_call/T2T/v8.HA_Rev_mid.QualFT40.mt …/JHU/ont/joint_call/T2T/v8.JHU_ONT_high.QualFT34.mt …/UW/revio/joint_call/T2T/v8.UW_Rev_high.QualFT40.mt …/UW/sequel2e/joint_call/T2T/v8.UW_Seq_high.QualFT40.mt | | lrWGS joint SNP and Indel callset for the T2Tv2.0 reference. Provided for each cohort. |
| …/BCM/ont/joint_call/GRCh38/ …/BCM/revio/joint_call/GRCh38/ …/BCM/sequel2e/joint_call/GRCh38/ …/BI/revio/joint_call/GRCh38/ …/BI/sequel2e/joint_call/GRCh38/ …/HA/revio/joint_call/GRCh38/ …/JHU/ont/joint_call/GRCh38/ …/UW/revio/joint_call/GRCh38/ …/UW/sequel2e/joint_call/GRCh38/ | | lrWGS joint SNP and Indel GVCF for the grch38_noalt reference with TBI index. Provided for each cohort. |
| …/BCM/ont/joint_call/T2T/ …/BCM/revio/joint_call/T2T/ …/BCM/sequel2e/joint_call/T2T/ …/BI/revio/joint_call/T2T/ …/BI/sequel2e/joint_call/T2T/ …/HA/revio/joint_call/T2T/ …/JHU/ont/joint_call/T2T/ …/UW/revio/joint_call/T2T/ …/UW/sequel2e/joint_call/T2T/ | | lrWGS joint SNP and Indel GVCF for the T2Tv2.0 reference with TBI index. Provided for each cohort. |
| …/BCM/aux/v8.BCM.auxiliary_metrics.GRCh38.tsv …/BI/aux/v8.BI.auxiliary_metrics.GRCh38.tsv …/HA/aux/v8.HA.auxiliary_metrics.GRCh38.tsv …/JHU/aux/v8.JHU.auxiliary_metrics.GRCh38.tsv …/UW/aux/v8.UW.auxiliary_metrics.GRCh38.tsv | | Auxiliary file holding metric values for each sample for the grch38_noalt reference. Provided in a file for each sequencing facility. |
| …/BCM/aux/v8.BCM.auxiliary_metrics.T2T.tsv …/BI/aux/v8.BI.auxiliary_metrics.T2T.tsv …/HA/aux/v8.HA.auxiliary_metrics.T2T.tsv …/JHU/aux/v8.JHU.auxiliary_metrics.T2T.tsv …/UW/aux/v8.UW.auxiliary_metrics.T2T.tsv | | Auxiliary file holding metric values for each sample for the T2Tv2.0 reference. Provided in a file for each sequencing facility. |
lrWGS flagged samples | .../samples_flagged_by_qc.tsv | | The list of lrWGS samples flagged in the joint callset QC process with reasons for flagging, in tsv format. |
Array: single sample VCF manifest | gs://fc-aou-datasets-controlled/v8/microarray/vcf/manifest.csv | MICROARRAY_VCF_MANIFEST_PATH | Path to manifest CSV file that contains a row per sample of: person_id,vcf_uri,vcf_index_uri |
Array: Hail MT | gs://fc-aou-datasets-controlled/v8/microarray/hail.mt | MICROARRAY_HAIL_STORAGE_PATH | All of the samples have been merged into a single MT. |
Array: PLINK BED files | gs://fc-aou-datasets-controlled/v8/microarray/plink/arrays.* | | PLINK .bed, .fam, and .bim files. |
Array: IDAT files | gs://fc-aou-datasets-controlled/v8/microarray/idat/manifest.csv | MICROARRAY_IDAT_MANIFEST_PATH | Path to manifest CSV file that contains a row per sample of: person_id,green_idat_uri,red_idat_uri |
Known Issues: lists of samples associated with known issues | gs://fc-aou-datasets-controlled-ingest/v8/known_issues/wgs_v8_known_issue_1.txt …/wgs_v8_known_issue_6.txt | | Each file contains a list of sample IDs associated with known issues. For more information, please see the All of Us Genomic Data Quality Report. |
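Several of the assets above are manifest CSV files with one row per sample. As a rough illustration of how such a manifest might be read once it has been copied locally, the sketch below parses the srWGS CRAM manifest columns named in the table (person_id, cram_uri, cram_index_uri). The function name `read_manifest` is hypothetical, and the sample rows are invented placeholders rather than real IDs or URIs.

```python
import csv
import io


def read_manifest(handle, required=("person_id", "cram_uri", "cram_index_uri")):
    """Parse a manifest CSV into a dict of rows keyed by person_id."""
    reader = csv.DictReader(handle)
    # Fail early if the header does not contain the columns we expect.
    missing = [c for c in required if c not in (reader.fieldnames or [])]
    if missing:
        raise ValueError(f"manifest is missing columns: {missing}")
    return {row["person_id"]: row for row in reader}


# Hypothetical two-row manifest; the IDs and URIs are placeholders, not real data.
sample = io.StringIO(
    "person_id,cram_uri,cram_index_uri\n"
    "1000001,gs://example-bucket/a.cram,gs://example-bucket/a.cram.crai\n"
    "1000002,gs://example-bucket/b.cram,gs://example-bucket/b.cram.crai\n"
)
manifest = read_manifest(sample)
print(manifest["1000002"]["cram_uri"])
```

The same helper can be pointed at the array manifests by passing their column names, for example `required=("person_id", "green_idat_uri", "red_idat_uri")` for the IDAT manifest.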