Controlled CDR Directory (ARCHIVED C2022Q2R2 CDR CT Dataset v6)

This page lists the assets currently available in the Researcher Workbench Controlled Tier dataset, organized by where and how each is stored.

Summary of changes since Beta

32 WGS samples removed
81 microarray samples removed
CRAM files added
IDAT files added

See All of Us Genomic Quality Report for more information.

Added: manifest files

Some genomic data file types are now shared across Curated Data Repository (CDR) releases. For these file types, you will find a manifest CSV file in place of a simple directory of files. In this release, the pattern applies to CRAM files, microarray VCF files, and IDAT files. To comply with the Workbench data use policy, use these manifests to determine the correct set of storage asset paths for your workspace / CDR version. Data files that are not shared across releases, e.g., merged WGS VCFs and Hail Matrix Tables, continue to reside in a simple versioned directory. The manifest CSV format for each file type is documented below.

Note that microarray VCF files now use the manifest pattern, changing from the simple directory structure used in the prior release.
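As a rough illustration of the manifest pattern, the sketch below parses manifest text and returns the storage paths for a chosen set of participants. The column names follow the CRAM manifest described in the Cloud Storage table (person_id, cram_uri, cram_index_uri). The function name select_manifest_rows, the sample IDs, and the gs://example-bucket paths are hypothetical and used for illustration only; real manifest contents may differ.

```python
import csv
import io


def select_manifest_rows(manifest_text, person_ids):
    """Return manifest rows (as dicts) whose person_id is in person_ids."""
    wanted = {str(pid) for pid in person_ids}
    reader = csv.DictReader(io.StringIO(manifest_text))
    return [row for row in reader if row["person_id"] in wanted]


# Hypothetical manifest content for illustration. A real manifest is read
# from its documented path in the bucket, not hard-coded like this.
SAMPLE_MANIFEST = """person_id,cram_uri,cram_index_uri
1000001,gs://example-bucket/wgs_1000001.cram,gs://example-bucket/wgs_1000001.cram.crai
1000002,gs://example-bucket/wgs_1000002.cram,gs://example-bucket/wgs_1000002.cram.crai
"""

rows = select_manifest_rows(SAMPLE_MANIFEST, [1000002])
for row in rows:
    print(row["person_id"], row["cram_uri"], row["cram_index_uri"])
```

In a workspace, the manifest text would come from the file at the manifest path (for example, the value of WGS_CRAM_MANIFEST_PATH) rather than from a string literal.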

BigQuery

Included within the Controlled Tier is a “mainline” curated data repository (CDR) that contains data types (e.g., Survey, Electronic Health Record (EHR), Wearable Device (Fitbit), Physical Measurement) also represented within the Registered Tier, but with different privacy protections applied. Like the Registered Tier, the Controlled Tier includes a default CDR, which is queryable through the Researcher Workbench dataset tools or BigQuery directly, and a base CDR that is queryable only through BigQuery. Below is information about the mainline CDR location within BigQuery.

For more information about mainline CDR creation and the differences between the default and base instances, please see here. For information about the specific tables, fields, and privacy protections applied to the Registered and Controlled Tier CDRs, please see the CDR Data Dictionary.

Asset	Location	Env var	Description
Default Mainline CDR	fc-aou-cdr-prod.C2022Q2R6	WORKSPACE_CDR	Main BigQuery CDR representative of the same data types (e.g., Survey, Electronic Health Record (EHR), Wearable Device (Fitbit), Physical Measurement) included in the Registered Tier, but with different privacy protections applied.
Base Mainline CDR	fc-aou-cdr-prod.C2022Q2R6_base		Same CDR representation as noted above, but with fewer processing/convenience transformations applied.
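A minimal sketch of how the dataset IDs in the table above might be turned into a BigQuery query string. It assumes the WORKSPACE_CDR environment variable holds the default CDR dataset ID and that the base CDR is that ID with a _base suffix, as the two rows above suggest for this release. The helper names and the person table name are assumptions for illustration; consult the CDR Data Dictionary for the actual tables.

```python
import os

# Dataset ID documented for this release; used only as a fallback when the
# WORKSPACE_CDR environment variable is not set.
DOCUMENTED_DEFAULT_CDR = "fc-aou-cdr-prod.C2022Q2R6"


def cdr_dataset(base=False, env=None):
    """Return the default or base mainline CDR dataset ID."""
    env = os.environ if env is None else env
    dataset = env.get("WORKSPACE_CDR", DOCUMENTED_DEFAULT_CDR)
    # Assumption: the base CDR is the default dataset ID plus "_base".
    return dataset + "_base" if base else dataset


def count_rows_query(table, base=False, env=None):
    """Build a row-count query for one table in the chosen CDR."""
    return f"SELECT COUNT(*) AS n FROM `{cdr_dataset(base, env)}.{table}`"


# "person" is an assumed table name used for illustration only.
# env={} forces the documented fallback instead of the real environment.
print(count_rows_query("person", env={}))
print(count_rows_query("person", base=True, env={}))
```

The resulting string can be passed to whichever BigQuery client you already use in your workspace; the sketch deliberately stops short of running the query.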

Cloud Storage

Also included in the Controlled Tier is a genomics CDR that is stored separately from the BigQuery assets referenced above. Note that the genomics CDR is only available through the Controlled Tier (i.e., genomics data are not available within the Registered Tier).

For more information on the specifics of the genomic data assets (including the auxiliary files) and the associated file formats referenced below, please see the Genomic Data section of How the All of Us genomic data are organized. Before using the genomic data, we strongly recommend reading the Known Issues section of the All Of Us Research Program Genomic Research Data Quality Report. For help displaying HTML files in the Researcher Workbench, please see our featured notebook.

Asset	Location	Env var	Description
All CDR assets path	gs://fc-aou-datasets-controlled/v6	CDR_STORAGE_PATH	All Cloud Storage assets for this CDR version are under this path
WGS: joint callset VCFs	gs://fc-aou-datasets-controlled/v6/wgs/vcf/merged/*	WGS_VCF_MERGED_STORAGE_PATH	Joint callset VCF of all samples in the CDR with genomic data available (98,590). This callset is sharded, by genomic interval, into multiple files. The files can be used with a wide range of command line tools and libraries. Please see WGS VCFs for more information.
WGS auxiliary files	gs://fc-aou-datasets-controlled/v6/wgs/vcf/aux/		Auxiliary files related to the “all samples” WGS data.
VCF shard intervals	…/vcf_shard_intervals/interval_lists/*.vcf.gz.interval_list …/vcf_shard_intervals/shard_begin_and_end_bucket.txt		Interval list files corresponding to the sharded VCFs (.interval_list). Please see Interval list files for more details on the format. We also provide the extents of each interval_list file (txt). Please see the WGS VCF shard interval extent file format (Appendix B) for details on the format of the file.
Variant Annotation Table	…/vat/vat_merged.tsv.bgz …/vat/chr/.tsv.gz		Variant-level metadata / functional annotations about the variants contained in the all samples VCF. Please see Variant Annotation Table for more details, including the provided fields.
Ancestry information	…/ancestry/ancestry_preds.tsv …/ancestry/preds_oth.html …/ancestry/merged_sites_only_intersection.vcf.bgz		Computed ancestry predictions for all samples in the WGS joint callset (tsv). We also provide a plot of the ancestry predictions and the sites-only VCF (vcf.bgz) of the locations we used for training the ancestry classifier. Please see Genetic predicted ancestry for more information on these files. For more information on how we predict ancestry, please see the All Of Us Beta Release Genomic Quality Report (Appendix A)
QC information	…/qc/flagged_samples.tsv …/qc/metrics.html …/qc/pc1vspc2.html …/qc/wgs_without_array_rids.tsv …/qc/wgs_not_in_cdr.tsv …/qc/array_not_in_cdr.tsv …/qc/genomic_metrics.tsv …/qc/wgs_siteID.tsv		The list of WGS samples flagged in the joint callset QC process (tsv). We also include plots of the metric residuals and the first two principal components (*.html). Please see Flagged WGS Samples for details on the tsv format. Please see the All Of Us Beta Release Genomic Quality Report (Joint Callset QC) for details on how we flag samples. 2,994 WGS samples do not have corresponding arrays in this release. The list of WGS IDs can be found in wgs_without_array_rids.tsv. For more information, please see All Of Us Beta Release Genomic Quality Report (Known Issue #1)
Relatedness	…/relatedness/relatedness.tsv …/relatedness/relatedness_flagged_samples.tsv		The relatedness of the WGS samples as kinship scores. We provide one file that lists all pairs of samples with a kinship score greater than 0.1 (relatedness.tsv). We also provide a list of samples that would need to be removed in order to remove related samples from the full cohort (relatedness_flagged_samples.tsv). Please see Relatedness for more information.
WGS: joint callset Hail Matrix Table	gs://fc-aou-datasets-controlled/v6/wgs/hail.mt	WGS_HAIL_STORAGE_PATH	Hail MatrixTable for the WGS joint callset. Please note that all shards have been merged into this file. When using this file in Hail, read directly from the bucket location. Do not attempt to copy it locally.
WGS: CRAM files	gs://fc-aou-datasets-controlled/v6/wgs/cram/manifest.csv	WGS_CRAM_MANIFEST_PATH	Path to the manifest CSV file, which contains one row per sample with the columns person_id,cram_uri,cram_index_uri. We provide CRAM files and CRAM index files with the research ID in the name of the file, one CRAM file for each WGS sample. See CRAM files for more information.
Microarray: single sample VCFs	gs://fc-aou-datasets-controlled/v6/microarray/vcf/single_sample/manifest.csv	MICROARRAY_VCF_MANIFEST_PATH	Path to the manifest CSV file, which contains one row per sample with the columns person_id,vcf_uri,vcf_index_uri. One VCF per participant sample. Please see Array VCFs for more information.
Microarray: all samples Hail Matrix Table	gs://fc-aou-datasets-controlled/v6/microarray/hail.mt	MICROARRAY_HAIL_STORAGE_PATH	Hail MatrixTable of the microarray samples in this release. All of the samples have been merged into a single MatrixTable. Please see Array MatrixTable for more information.
Microarray: all samples Plink files	gs://fc-aou-datasets-controlled/v6/microarray/plink/arrays.*		PLINK binary merged representation of the microarray samples in this release (.bed). Includes .fam, .bim files for usage with the plink tool as well. Please see Array PLINK 1.9 data for more information
Microarray: IDAT files	gs://fc-aou-datasets-controlled/v6/microarray/idat/manifest.csv	MICROARRAY_IDAT_MANIFEST_PATH	Path to the manifest CSV file, which contains one row per sample with the columns person_id,green_idat_uri,red_idat_uri. Two IDAT files per array sample, with the research ID in the name of each file. Please see IDAT files for more information.
FASTA hg38/GRCh38 public reference file	gs://genomics-public-data/references/hg38/v0/Homo_sapiens_assembly38.fasta		All variants are called against the hg38/GRCh38 reference. This is the location to the FASTA formatted file.
FAI hg38/GRCh38 public reference file	gs://genomics-public-data/references/hg38/v0/Homo_sapiens_assembly38.fasta.fai		All variants are called against the hg38/GRCh38 reference. This is the location to the FAI formatted file.
DICT hg38/GRCh38 public reference file	gs://genomics-public-data/references/hg38/v0/Homo_sapiens_assembly38.dict		All variants are called against the hg38/GRCh38 reference. This is the location to the DICT formatted file.
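The sketch below shows one way to resolve an asset location from the environment variables listed in the table above, falling back to the path documented for this CDR version when a variable is not set. The helper names asset_path and split_gcs_uri are hypothetical, and the fallback behavior is an assumption for illustration; inside a workspace the environment variables should be treated as the source of truth.

```python
import os

# Env var name -> path documented in the table above for this CDR version.
DOCUMENTED_PATHS = {
    "CDR_STORAGE_PATH": "gs://fc-aou-datasets-controlled/v6",
    "WGS_HAIL_STORAGE_PATH": "gs://fc-aou-datasets-controlled/v6/wgs/hail.mt",
    "WGS_CRAM_MANIFEST_PATH": "gs://fc-aou-datasets-controlled/v6/wgs/cram/manifest.csv",
    "MICROARRAY_VCF_MANIFEST_PATH": "gs://fc-aou-datasets-controlled/v6/microarray/vcf/single_sample/manifest.csv",
    "MICROARRAY_HAIL_STORAGE_PATH": "gs://fc-aou-datasets-controlled/v6/microarray/hail.mt",
    "MICROARRAY_IDAT_MANIFEST_PATH": "gs://fc-aou-datasets-controlled/v6/microarray/idat/manifest.csv",
}


def asset_path(env_var, env=None):
    """Return the env var's value if set, else the documented v6 path."""
    env = os.environ if env is None else env
    if env_var not in DOCUMENTED_PATHS:
        raise KeyError(f"unknown asset env var: {env_var}")
    return env.get(env_var, DOCUMENTED_PATHS[env_var])


def split_gcs_uri(uri):
    """Split 'gs://bucket/a/b' into ('bucket', 'a/b')."""
    if not uri.startswith("gs://"):
        raise ValueError(f"not a gs:// URI: {uri}")
    bucket, _, blob = uri[len("gs://"):].partition("/")
    return bucket, blob


# env={} forces the documented fallback instead of the real environment.
manifest_uri = asset_path("WGS_CRAM_MANIFEST_PATH", env={})
print(manifest_uri)
print(split_gcs_uri(manifest_uri))
```

Splitting the URI into bucket and object name is useful for storage clients that take the two parts separately. The Hail MatrixTable paths, by contrast, are meant to be read directly from the bucket location rather than copied locally.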
