Controlled CDR Directory (Archived C2022Q4R13 CDRv7)

Summary of changes since v6

146804 short-read WGS (srWGS) samples added with single nucleotide polymorphism, insertion, and deletion variant calls (SNPs and Indels) (278 Withdrawn, 147082 New) - Total: 245394
147818 genotyping array (“array”) samples added (416 Withdrawn, 148234 New) - Total: 312945
srWGS samples with structural variant (SV) calls added (97940 samples)
Long-read WGS (lrWGS) samples with SNP and Indel variants and SVs added (1027 samples)
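
The net additions quoted above follow from the new and withdrawn counts. A quick arithmetic check, using only the figures in this summary (the variable names are illustrative):

```python
# Net samples added = newly released samples - withdrawn samples.
# All figures are copied from the summary of changes above.
changes = {
    "srWGS": {"new": 147082, "withdrawn": 278},
    "array": {"new": 148234, "withdrawn": 416},
}

net_added = {name: c["new"] - c["withdrawn"] for name, c in changes.items()}

for name, net in net_added.items():
    print(f"{name}: {net} net samples added")
```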

BigQuery

Included within the Controlled Tier is a “mainline” curated data repository (CDR) that contains data types (e.g., Survey, Electronic Health Record (EHR), Wearable Device (Fitbit), Physical Measurement) also represented within the Registered Tier, but with different privacy protections applied. Like the Registered Tier, the Controlled Tier includes a default CDR, which is queryable through the Researcher Workbench dataset tools or BigQuery directly, and a base CDR that is queryable only through BigQuery. Below is information about the mainline CDR location within BigQuery.

For more information about mainline CDR creation and the differences between the default and base instances, please see here. For information about the specific tables, fields, and privacy protections applied to the Registered and Controlled Tier CDRs, please see the CDR Data Dictionary.

Asset	Location	Env var	Description
Default Mainline CDR	fc-aou-cdr-prod.C2022Q4R11	WORKSPACE_CDR	Main BigQuery CDR representative of the same data types (e.g., Survey, Electronic Health Record (EHR), Wearable Device (Fitbit), Physical Measurement) included in the Registered Tier, but with different privacy protections applied.
Base Mainline CDR	fc-aou-cdr-prod.C2022Q4R11_base		Same CDR representation as noted above, but with fewer processing/convenience transformations applied.
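
As a minimal sketch of how the WORKSPACE_CDR variable in the table above might be used, the snippet below builds a fully qualified BigQuery table reference and a simple query string. The helper names and the `person` table name are assumptions for illustration; the fallback dataset is the default mainline CDR location listed above. The snippet only builds SQL text and does not contact BigQuery.

```python
import os

# Fallback is the default mainline CDR location from the table above;
# inside the Researcher Workbench, WORKSPACE_CDR is expected to be set.
DEFAULT_CDR = "fc-aou-cdr-prod.C2022Q4R11"


def cdr_table(table_name, dataset=None):
    """Return a backtick-quoted, fully qualified BigQuery table name."""
    dataset = dataset or os.environ.get("WORKSPACE_CDR", DEFAULT_CDR)
    # Backticks are needed because the project ID contains hyphens.
    return f"`{dataset}.{table_name}`"


def count_rows_sql(table_name, dataset=None):
    """Build a row-count query for one CDR table (hypothetical helper)."""
    return f"SELECT COUNT(*) AS n FROM {cdr_table(table_name, dataset)}"


# "person" is an assumed table name used only for illustration.
print(count_rows_sql("person", dataset=DEFAULT_CDR))
```

To query the base CDR instead, pass its location from the table above as the `dataset` argument.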

Cloud Storage

Also included in the Controlled Tier is a genomics CDR that is stored separately from the BigQuery assets referenced above. Note that the genomics CDR is available only through the Controlled Tier (i.e., genomic data is not available within the Registered Tier).

For more information on the specifics of the genomic data assets (including the auxiliary files) and the associated file formats referenced below, please see How the All of Us genomic data are organized. Before using the genomic data, we highly recommend reading the Known Issues section of the All of Us Research Program Genomic Research Data Quality Report. For help displaying html files in the Researcher Workbench, please see our featured notebook.

If you use gsutil to access the CDR bucket, you will need to pass an additional flag (-u, followed by your Google project) in the command, for example:

!gsutil -u $GOOGLE_PROJECT ls gs://fc-aou-datasets-controlled
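
The same command can be assembled from Python, for example to pass to subprocess. This is a sketch only: the helper name and the `my-billing-project` placeholder are hypothetical, and the snippet prints the command rather than running it.

```python
import os
import shlex


def gsutil_ls_command(path, project=None):
    """Build the argument list for `gsutil -u <project> ls <path>`."""
    # "my-billing-project" is a placeholder used when GOOGLE_PROJECT is unset.
    project = project or os.environ.get("GOOGLE_PROJECT", "my-billing-project")
    return ["gsutil", "-u", project, "ls", path]


cmd = gsutil_ls_command(
    "gs://fc-aou-datasets-controlled/v7", project="my-billing-project"
)
# Show the shell-quoted form of the command.
print(shlex.join(cmd))
```

The list in `cmd` can be handed to `subprocess.run(cmd, check=True)` in an environment where gsutil is installed and authorized.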

Asset	Location	Env var	Description
All CDR assets path	gs://fc-aou-datasets-controlled/v7	CDR_STORAGE_PATH	All Cloud Storage assets for this CDR version are under this path
srWGS: VDS	gs://fc-aou-datasets-controlled/v7/wgs/short_read/snpindel/vds/hail.vds	WGS_VDS_PATH	Joint SNP/Indel callset of all available samples on the entire called genomic territory. This is stored as a VariantDataset (VDS). This callset has also been released, subsetted to smaller genomic territories, in other formats (such as VCF). See below for more information.

srWGS: auxiliary files	gs://fc-aou-datasets-controlled/v7/wgs/short_read/snpindel/aux		Auxiliary files for the SNP and Indel variants for srWGS data.
Variant Annotation Table (VAT)	…vat/vat_complete_v7.1.bgz.tsv.gz		Variant-level metadata and functional annotations for the SNP and Indel variants contained in the srWGS data. Please see Variant Annotation Table for more details, including the provided fields.
Ancestry information	…/ancestry/ancestry_preds.tsv …/ancestry/preds_oth.html …/ancestry/merged_sites_only_intersection.vcf.bgz …/ancestry/merged_sites_only_intersection.vcf.bgz.tbi …/ancestry/loadings.ht		Computed ancestry predictions for all samples in the srWGS joint callset (tsv). We also provide a plot of the ancestry predictions and the sites-only VCF (vcf.bgz) of the locations we used for training the ancestry classifier. Furthermore, we supply the coefficients of PCs utilized by the ancestry predictions classifier. Please see Genetic predicted ancestry for more information on these files. For more information on how we predict ancestry, please see the All Of Us Release Genomic Quality Report (Appendix A)
QC information	…/qc/flagged_samples.tsv …/qc/metrics.html …/qc/pc1vspc2.html …/qc/genomic_metrics.tsv		The list of srWGS samples flagged in the joint callset QC process (tsv) and metrics related to the genomic sequencing of the WGS samples. We also include plots of the metric residuals and the first two principal components (*.html) used in the Ancestry and Joint Callset QC. Please see Flagged srWGS Samples for details on the tsv format. Please see the All Of Us Release Genomic Quality Report (Joint Callset QC) for details on how we flag samples. One srWGS sample (person_id: 3518297) does not have its corresponding array in the array Hail MT and PLINK files. Please note that the array VCF for this sample will be available. For more information, please see All Of Us Release Genomic Quality Report (Known Issue #4)
Relatedness	…/relatedness/relatedness.tsv …/relatedness/relatedness_flagged_samples.tsv		The relatedness of the srWGS samples as kinship scores. We provide one file that lists all pairs of samples with a kinship score greater than 0.1 (relatedness.tsv). We also provide a list of samples that would need to be removed to eliminate related samples from the full cohort (relatedness_flagged_samples.tsv). Please see Relatedness for more information.

srWGS: Exome			These subset files include srWGS SNP and Indel variants that are within exon regions of the Gencode v42 basic transcripts, for all samples. The exome srWGS callset is provided in VCF, Hail MT, BGEN, and PLINK bed formats. For more information on how we determine the exome, please see Smaller callsets for analyzing srWGS SNP & Indel data with Hail MT, VCF, and PLINK. Note: The exome smaller callsets were updated on 9/22/23 with an update to all three smaller callsets; more details can be found in this article. The new versions are referred to as version 7.1.
Hail MT	gs://fc-aou-datasets-controlled/v7/wgs/short_read/snpindel/exome_v7.1/multiMT/hail.mt	WGS_EXOME_MULTI_HAIL_PATH	Hail multiallelic MatrixTable (MT) for the exome srWGS joint callset. Multiallelic sites are not split. When using this file in Hail, read directly from the bucket location. Do not attempt to copy it locally. Please see srWGS SNP & Indel Hail MT for more information.
VCF	gs://fc-aou-datasets-controlled/v7/wgs/short_read/snpindel/exome_v7.1/vcf/	WGS_EXOME_VCF_PATH	Variant Call Format (VCF) for the exome srWGS joint callset. This callset is converted from the exome Hail MT and sharded by chromosome into multiple files. Please see srWGS SNP & Indel VCFs for more information
Hail MT multiallelic split	gs://fc-aou-datasets-controlled/v7/wgs/short_read/snpindel/exome_v7.1/splitMT/hail.mt	WGS_EXOME_SPLIT_HAIL_PATH	Hail split MatrixTable (MT) for the exome srWGS joint callset. Multiallelic sites are split into separate records.
PLINK BED	gs://fc-aou-datasets-controlled/v7/wgs/short_read/snpindel/exome_v7.1/plink_bed/		PLINK binary biallelic genotype table (.bed) for the exome srWGS joint callset. Includes .fam and .bim files for usage with the PLINK tool as well. These PLINK triplets are converted from the exome Hail MT.
BGEN	gs://fc-aou-datasets-controlled/v7/wgs/short_read/snpindel/exome_v7.1/bgen/		Binary GEN (BGEN) files for the exome srWGS joint callset. Contains sample, Hail index, and bgenix index.
UCSC BED	gs://fc-aou-datasets-controlled/v7/wgs/short_read/snpindel/exome_v7.1/bed/		UCSC BED file used to generate the Exome Variants callset.
srWGS: ACAF Threshold			These subset files include srWGS SNP and Indel variants that are frequent in the All of Us computed ancestry subpopulations. A variant is included if its population-specific allele frequency (AF) is > 1% or its population-specific allele count (AC) is > 100 in any computed ancestry subpopulation. The ACAF threshold srWGS callset is provided in VCF, Hail MT, BGEN, and PLINK bed formats. For more information on these subset files, please see Smaller callsets for analyzing srWGS SNP & Indel data with Hail MT, VCF, and PLINK. Note: The ACAF Threshold smaller callsets were updated on 9/22/23 with an update to all three smaller callsets; more details can be found in this article. The new versions are referred to as version 7.1.
Hail MT	gs://fc-aou-datasets-controlled/v7/wgs/short_read/snpindel/acaf_threshold_v7.1/multiMT/hail.mt	WGS_ACAF_THRESHOLD_MULTI_HAIL_PATH	Hail multiallelic MT for the ACAF threshold srWGS joint callset. Multiallelic sites are not split. When using this file in Hail, read directly from the bucket location. Do not attempt to copy it locally. Please see srWGS SNP & Indel Hail MT for more information.
VCF	gs://fc-aou-datasets-controlled/v7/wgs/short_read/snpindel/acaf_threshold_v7.1/vcf/	WGS_ACAF_THRESHOLD_VCF_PATH	VCF for the ACAF threshold srWGS joint callset. This callset is converted from the ACAF Threshold Hail MT and sharded by chromosome into multiple files. Please see srWGS SNP & Indel VCFs for more information.
Hail MT multiallelic split	gs://fc-aou-datasets-controlled/v7/wgs/short_read/snpindel/acaf_threshold_v7.1/splitMT/hail.mt	WGS_ACAF_THRESHOLD_SPLIT_HAIL_PATH	Hail split MT for the ACAF threshold srWGS joint callset. Multiallelic sites are split into separate records.
PLINK BED	gs://fc-aou-datasets-controlled/v7/wgs/short_read/snpindel/acaf_threshold_v7.1/plink_bed/		PLINK binary biallelic genotype table (.bed) for the ACAF threshold srWGS joint callset. Includes .fam and .bim files for usage with the PLINK tool as well. These PLINK triplets are converted from the ACAF threshold Hail MT.
BGEN	gs://fc-aou-datasets-controlled/v7/wgs/short_read/snpindel/acaf_threshold_v7.1/bgen/		BGEN files for the ACAF threshold srWGS joint callset. Contains sample, Hail index, and bgenix index.
UCSC BED	gs://fc-aou-datasets-controlled/v7/wgs/short_read/snpindel/acaf_threshold_v7.1/bed/		UCSC BED file used to generate the ACAF threshold Variants callset.
srWGS: ClinVar Variants			These subset files include srWGS SNP and Indel variants that are in ClinVar, not limited to pathogenic or likely pathogenic variants. The ClinVar srWGS callset is provided in VCF, Hail MT, BGEN, and PLINK bed formats. For more information on these subset files, please see Smaller callsets for analyzing srWGS SNP & Indel data with Hail MT, VCF, and PLINK. Note: The ClinVar smaller callsets were updated on 9/22/23 with an update to all three smaller callsets; more details can be found in this article. The new versions are referred to as version 7.1.
Hail MT	gs://fc-aou-datasets-controlled/v7/wgs/short_read/snpindel/clinvar_v7.1/multiMT/hail.mt	WGS_CLINVAR_MULTI_HAIL_PATH	Hail MT for the ClinVar srWGS joint callset. Multiallelic sites are not split. When using this file in Hail, read directly from the bucket location. Do not attempt to copy it locally. Please see srWGS SNP & Indel Hail MT for more information.
VCF	gs://fc-aou-datasets-controlled/v7/wgs/short_read/snpindel/clinvar_v7.1/vcf/	WGS_CLINVAR_VCF_PATH	VCF for the ClinVar srWGS joint callset. This callset is converted from the ClinVar Hail MT and sharded by chromosome into multiple files. Please see srWGS SNP & Indel VCFs for more information.
Hail MT multiallelic split	gs://fc-aou-datasets-controlled/v7/wgs/short_read/snpindel/clinvar_v7.1/splitMT/hail.mt	WGS_CLINVAR_SPLIT_HAIL_PATH	Hail split MT for the ClinVar srWGS joint callset. Multiallelic sites are split into separate records.
PLINK BED	gs://fc-aou-datasets-controlled/v7/wgs/short_read/snpindel/clinvar_v7.1/plink_bed/		PLINK binary biallelic genotype table (.bed) for the ClinVar srWGS joint callset. Includes .fam, .bim files for usage with the PLINK tool as well. These PLINK triplets are converted from the ClinVar Hail MT.
BGEN	gs://fc-aou-datasets-controlled/v7/wgs/short_read/snpindel/clinvar_v7.1/bgen/		BGEN files for the ClinVar srWGS joint callset. Contains sample file, Hail index, and bgenix index.
UCSC BED	gs://fc-aou-datasets-controlled/v7/wgs/short_read/snpindel/clinvar_v7.1/bed/		UCSC BED file used to generate the ClinVar srWGS joint callset.
srWGS: CRAM files	gs://fc-aou-datasets-controlled/v7/wgs/cram/manifest.csv	WGS_CRAM_MANIFEST_PATH	Path to manifest CSV file that contains a row per sample of: person_id,cram_uri,cram_index_uri. We provide CRAM files and CRAM index files with the research ID in the name of the file. There is one CRAM file for each WGS sample. See CRAM files for more information.
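
The ACAF threshold rule described in the srWGS: ACAF Threshold row above (population-specific AF > 1% or AC > 100 in any computed ancestry subpopulation) can be sketched as a small predicate. The function name, the subpopulation labels, and the example values are made up for illustration; this is not the pipeline's actual implementation.

```python
def passes_acaf_threshold(per_population, af_cutoff=0.01, ac_cutoff=100):
    """Return True if any subpopulation has AF > 1% or AC > 100.

    per_population maps a subpopulation label to a tuple of
    (allele_frequency, allele_count).
    """
    return any(
        af > af_cutoff or ac > ac_cutoff
        for af, ac in per_population.values()
    )


# Hypothetical variants with made-up labels and values.
common_in_one = {"pop_a": (0.02, 40), "pop_b": (0.001, 3)}
rare_everywhere = {"pop_a": (0.004, 12), "pop_b": (0.0005, 2)}
high_count = {"pop_a": (0.0008, 150), "pop_b": (0.0, 0)}

for name, variant in [
    ("common_in_one", common_in_one),
    ("rare_everywhere", rare_everywhere),
    ("high_count", high_count),
]:
    print(name, passes_acaf_threshold(variant))
```

A variant qualifies as soon as one subpopulation clears either cutoff, so a variant that is rare overall can still be included.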

srWGS: Structural Variants (SVs)
srWGS SV VCF	gs://fc-aou-datasets-controlled/v7/wgs/short_read/structural_variants/v7_offcycle/vcf/full/		SVs for 97,940 srWGS samples.
srWGS SV sites-only VCF	gs://fc-aou-datasets-controlled/v7/wgs/short_read/structural_variants/v7_offcycle/vcf/sites-only/		The sites-only VCF has all variant sites from the 97,940 samples but no genotype information.
aux			Auxiliary information from the srWGS structural variant calling:
srWGS SV samples with probable aneuploidies	gs://fc-aou-datasets-controlled/v7/wgs/short_read/structural_variants/v7_offcycle/aux/aneuploidies/		We provide lists of samples with probable aneuploidies identified during srWGS SV ploidy estimation as tsv files. There are three separate files for samples with probable aneuploidies: mosaic autosomal aneuploidy, mosaic allosomal aneuploidy, and germline allosomal aneuploidy.
srWGS SV maximal set of unrelated samples	gs://fc-aou-datasets-controlled/v7/wgs/short_read/structural_variants/v7_offcycle/aux/relatedness		We provide a list of the maximal set of unrelated samples in the srWGS SV cohort. The samples are reported in a txt file as a list of research IDs. One research ID is listed per line and there is no header in the file.
srWGS SV sample list	gs://fc-aou-datasets-controlled/v7/wgs/short_read/structural_variants/v7_offcycle/aux/sample_list/AoU_srWGS_SV.v7_offcycle.research_ids.txt		We provide a list file of all research_ids that have srWGS SV data. The file is a text file containing one research_id per line.
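
Because the srWGS SV sample list holds one research_id per line, it can be turned into a set and used to restrict a cohort to participants with SV data. A minimal sketch, using made-up IDs and an in-memory string in place of the downloaded file:

```python
def parse_sample_list(text):
    """Parse a one-ID-per-line text file (no header) into a set of IDs."""
    return {line.strip() for line in text.splitlines() if line.strip()}


# Stand-in for the contents of the sample list file; the IDs are made up.
sample_text = "1000001\n1000002\n1000003\n"
sv_ids = parse_sample_list(sample_text)

# Keep only the cohort members that have srWGS SV data.
cohort = ["1000002", "1000003", "1000004"]
cohort_with_sv = [pid for pid in cohort if pid in sv_ids]
print(cohort_with_sv)
```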

srWGS PGx Haplotype Calls			The following file paths are related to work done in the Featured Workspace "Demo - Pharmacogenomics (PGx) variant frequency and medication exposures"
srWGS PGx Haplotype v7	gs://fc-aou-datasets-controlled/v7/demo-project-files/pgx-calls/		PGx haplotype calls completed using srWGS for 15 genes and gene regions. Haplotype calls are available for Stargazer v2.0.1 and PharmCAT v2.4.0.
srWGS PGx Haplotype v6	gs://fc-aou-datasets-controlled/v6/demo-project-files/pgx-calls/		PGx haplotype calls completed using srWGS for 15 genes and gene regions. Haplotype calls are available for Stargazer v2.0.0 and PharmCAT v2.2.1.

Long read whole genome sequencing (lrWGS)			Single sample and joint called raw data, variant data, and auxiliary files. All single sample files are listed in the manifest, described in Appendix 1. All other joint called and joint auxiliary files are listed below.
lrWGS single sample file manifest	gs://fc-aou-datasets-controlled/v7/wgs/long_read/manifest.csv	LONG_READS_MANIFEST_PATH	The lrWGS manifest is a CSV file with one row per sample that lists file locations for all single sample lrWGS raw data, variant data, and auxiliary data. See Appendix 1 for a description of each column. The columns are: Research_id,hifiasm-primary-asm-fasta,hifiasm-hap1-asm-fasta,hifiasm-hap2-asm-fasta,hifiasm-primary-asm-gfa,hifiasm-hap1-asm-gfa,hifiasm-hap2-asm-gfa,hifiasm-quast-report-html,hifiasm-quast-report-summary,chm13v2.0-pav-vcf,chm13v2.0-pav-tbi,chm13v2.0-bam,chm13v2.0-bai,chm13v2.0-pbi,chm13v2.0-haplotagged-bam,chm13v2.0-haplotagged-bai,chm13v2.0-deepvariant-vcf,chm13v2.0-deepvariant-tbi,chm13v2.0-deepvariant-phased-vcf,chm13v2.0-deepvariant-phased-tbi,chm13v2.0-pbsv-vcf,chm13v2.0-pbsv-tbi,chm13v2.0-sniffles-vcf,chm13v2.0-sniffles-tbi,chm13v2.0-sniffles-snf,grch38-pav-vcf,grch38-pav-tbi,grch38-bam,grch38-bai,grch38-pbi,grch38-haplotagged-bam,grch38-haplotagged-bai,grch38-deepvariant-vcf,grch38-deepvariant-tbi,grch38-deepvariant-phased-vcf,grch38-deepvariant-phased-tbi,grch38-pbsv-vcf,grch38-pbsv-tbi,grch38-sniffles-vcf,grch38-sniffles-tbi,grch38-sniffles-snf
Joint called Hail MT (grch38_noalt)	gs://fc-aou-datasets-controlled/v7/wgs/long_read/hail.mt/GRCh38/	WGS_LONGREADS_HAIL_GRCH38_PATH	Hail MT for the lrWGS joint callset for SNPs and Indels called to the grch38_noalt reference. When using this file in Hail, read directly from the bucket location. Do not attempt to copy it locally.
Joint called Hail MT (T2Tv2.0)	gs://fc-aou-datasets-controlled/v7/wgs/long_read/hail.mt/T2T/	WGS_LONGREADS_HAIL_T2T_PATH	Hail MT for the lrWGS joint callset for SNPs and Indels called to the T2Tv2.0 reference. When using this file in Hail, read directly from the bucket location. Do not attempt to copy it locally. Note: The joint-called Hail MT (T2Tv2.0) was updated to version 7.1 on 2/8/24.
Joint called VCF (grch38_noalt)	gs://fc-aou-datasets-controlled/v7/wgs/long_read/joint_vcf/GRCh38/	WGS_LONGREADS_JOINT_SNP_INDEL_VCF_GRCH38_PATH	Joint called lrWGS SNP and Indel VCF against the grch38_noalt reference. TBI index file accompanying the VCF is also provided.
Joint called VCF (T2Tv2.0)	gs://fc-aou-datasets-controlled/v7/wgs/long_read/joint_vcf/T2T/	WGS_LONGREADS_JOINT_SNP_INDEL_VCF_T2T_PATH	Joint called lrWGS SNP and Indel VCF against the T2T-CHM13-v2.0 reference. TBI index file accompanying the VCF is also provided.
Joint-called SV VCF (grch38_noalt)	gs://fc-aou-datasets-controlled/v7/wgs/long_read/joint_sv/GRCh38/		Joint-called lrWGS SV VCFs against the grch38_noalt reference. Provided as strict and lenient versions, which differ in sensitivity. In gzipped format with accompanying index.
Joint-called SV VCF (T2Tv2.0)	gs://fc-aou-datasets-controlled/v7/wgs/long_read/joint_sv/T2T/		Joint-called lrWGS SV VCFs against the T2T-CHM13-v2.0 reference. Provided as strict and lenient versions, which differ in sensitivity. In gzipped format with accompanying index.
Aux (grch38_noalt)	gs://fc-aou-datasets-controlled/v7/wgs/long_read/aux/auxiliary_metrics.GRCh38.tsv		Auxiliary file holding metric values for each sample, for the grch38_noalt reference. Described in the lrWGS variant metrics.
Aux (T2Tv2.0)	gs://fc-aou-datasets-controlled/v7/wgs/long_read/aux/auxiliary_metrics.T2T.tsv		Auxiliary file holding metric values for each sample, for the T2Tv2.0 reference; each metric value is a floating point value. Described in the lrWGS variant metrics.

Array: single sample VCFs	gs://fc-aou-datasets-controlled/v7/microarray/vcf/manifest.csv	MICROARRAY_VCF_MANIFEST_PATH	Path to manifest CSV file that contains a row per sample of: person_id,vcf_uri,vcf_index_uri. One VCF per participant sample. Please see Array VCFs for more information.
Array: all samples Hail MT	gs://fc-aou-datasets-controlled/v7/microarray/hail.mt_v7.1	MICROARRAY_HAIL_STORAGE_PATH	Hail MT of the array samples in this release. All of the samples have been merged into a single matrix table. Please see Array MatrixTable for more information.
Array: all samples PLINK files	gs://fc-aou-datasets-controlled/v7/microarray/plink_v7.1/arrays.*		PLINK binary merged representation of the microarray samples in this release (.bed). Includes .fam and .bim files for usage with the PLINK tool as well. Please see Array PLINK 1.9 data for more information.
Array: IDAT files	gs://fc-aou-datasets-controlled/v7/microarray/idat/manifest.csv	MICROARRAY_IDAT_MANIFEST_PATH	Path to manifest.csv file that contains a row per sample of: person_id,green_idat_uri,red_idat_uri. Two IDAT files per array sample with the research ID in the name of the file. Please see IDAT files for more information.
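
The CRAM, array VCF, and IDAT manifests listed above are CSV files with one row per sample, keyed by person_id. The sketch below indexes such a manifest by person_id so the file URIs for one participant can be looked up. The helper name, the IDs, and the `gs://example-bucket` URIs are made up; an in-memory string stands in for the downloaded manifest.

```python
import csv
import io


def index_manifest(csv_text, key="person_id"):
    """Read a one-row-per-sample manifest CSV into a dict keyed by `key`."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {row[key]: row for row in reader}


# Stand-in for an array VCF manifest; IDs and URIs are invented.
sample_manifest = (
    "person_id,vcf_uri,vcf_index_uri\n"
    "1000001,gs://example-bucket/1000001.vcf.gz,gs://example-bucket/1000001.vcf.gz.tbi\n"
    "1000002,gs://example-bucket/1000002.vcf.gz,gs://example-bucket/1000002.vcf.gz.tbi\n"
)

manifest = index_manifest(sample_manifest)
print(manifest["1000002"]["vcf_uri"])
```

For the lrWGS manifest, whose first column is listed as Research_id, the `key` argument would need to be changed accordingly.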

Known Issues: sample lists associated with known issues	gs://fc-aou-datasets-controlled/v7/known_issues/wgs_v7_not_in_cdr_known_issue_1.tsv gs://fc-aou-datasets-controlled/v7/known_issues/array_v7_not_in_cdr_known_issue_1.tsv gs://fc-aou-datasets-controlled/v7/wgs/short_read/structural_variants/v7_offcycle/aux/known_issues/AoU_srWGS_SV.v7_offcycle.not_in_cdr_known_issue_1.txt gs://fc-aou-datasets-controlled/v7/known_issues/research_id_v7_array_known_issue_2.tsv gs://fc-aou-datasets-controlled/v7/known_issues/research_id_v7_wgs_known_issue_2.tsv gs://fc-aou-datasets-controlled/v7/known_issues/array_rids_v6_not_in_v7_known_issue_3.tsv gs://fc-aou-datasets-controlled/v7/known_issues/research_id_v7_array_known_issue_14.tsv gs://fc-aou-datasets-controlled/v7/known_issues/research_id_v7_array_known_issue_15.tsv gs://fc-aou-datasets-controlled/v7/known_issues/research_id_v7_wgs_known_issue_15.tsv		Each file contains a list of sample IDs associated with a known issue. For more information, please see All Of Us Release Genomic Quality Report (Known Issues #1, 2, 3, 14, and 15).

Appendix 1. lrWGS manifest column descriptions

Reference-free genomic files

We have provided three de novo assemblies, each in two formats (FASTA and GFA), for each sample.

column_name	note
hifiasm-primary-asm-fasta	Hifiasm primary assembly, in FASTA format.
hifiasm-hap1-asm-fasta	Hifiasm haplotype-resolved assembly for haplotype-1 (in no particular order), in FASTA format.
hifiasm-hap2-asm-fasta	Hifiasm haplotype-resolved assembly for haplotype-2 (in no particular order), in FASTA format.
hifiasm-primary-asm-gfa	Hifiasm primary assembly, in GFA format.
hifiasm-hap1-asm-gfa	Hifiasm haplotype-resolved assembly for haplotype-1 (in no particular order), in GFA format.
hifiasm-hap2-asm-gfa	Hifiasm haplotype-resolved assembly for haplotype-2 (in no particular order), in GFA format.
hifiasm-quast-report-html	An HTML-formatted report on the quality of the three assembly FASTA files; produced by the tool QUAST.
hifiasm-quast-report-summary	A summary of the QUAST-reported metrics for the three assembly FASTA files.

Reference-specific genomic files

References used

The DRC pipelines align sequences to two references.

The grch38_noalt reference contains a subset of contigs from the full GRCh38 reference. Specifically, only primary assembly autosomes (1-22), sex chromosomes (X and Y), mitochondria, human EBV, and random and unplaced contigs are included.
The T2Tv2.0 reference is the T2T CHM13v2.0 reference, retrieved from the T2T consortium's AWS bucket, with the human EBV contig appended.

We have provided two sets of all downstream genomic files for each sample, one for each reference.

Genomic files based on grch38_noalt

Unless otherwise specified, all VCF files are gzipped into .vcf.gz.

column_name	note
grch38-bam	lrWGS BAM for the sample, aligned to grch38_noalt.
grch38-bai	The accompanying index for the BAM.
grch38-pbi	The accompanying PBI index for the BAM.
grch38-haplotagged-bam	Haplotagged BAM.
grch38-haplotagged-bai	The accompanying index for the haplotagged BAM.
grch38-pav-vcf	PAV-generated VCF.
grch38-pav-tbi	TBI index for the PAV-generated VCF.
grch38-deepvariant-vcf	PEPPER-Margin-DeepVariant-generated (DV-generated) single sample small variant VCF; a filter of QUAL<40 has been applied.
grch38-deepvariant-tbi	TBI index for the DV-generated VCF.
grch38-deepvariant-phased-vcf	DV-generated single sample small variant VCF, phased; a filter of QUAL<40 has been applied.
grch38-deepvariant-phased-tbi	TBI index for the DV-generated phased VCF.
grch38-pbsv-vcf	PBSV-generated single-sample SV VCF.
grch38-pbsv-tbi	TBI index for the PBSV-generated VCF.
grch38-sniffles-vcf	Sniffles-generated single sample SV VCF.
grch38-sniffles-tbi	TBI index for the Sniffles-generated VCF.
grch38-sniffles-snf	Sniffles-2 SNF file for the single sample.

Genomic files based on T2Tv2.0

column_name	note
chm13v2.0-bam	lrWGS BAM for the sample, aligned to T2Tv2.0.
chm13v2.0-bai	The accompanying index for the BAM.
chm13v2.0-pbi	The accompanying PBI index for the BAM.
chm13v2.0-haplotagged-bam	Haplotagged BAM.
chm13v2.0-haplotagged-bai	The accompanying index for the haplotagged BAM.
chm13v2.0-pav-vcf	PAV-generated VCF.
chm13v2.0-pav-tbi	TBI index for the PAV-generated VCF.
chm13v2.0-deepvariant-vcf	PEPPER-Margin-DeepVariant-generated (DV-generated) single sample small variant VCF; a filter of QUAL<40 has been applied.
chm13v2.0-deepvariant-tbi	TBI index for the DV-generated VCF.
chm13v2.0-deepvariant-phased-vcf	DV-generated single-sample small variant VCF, phased; a filter of QUAL<40 has been applied.
chm13v2.0-deepvariant-phased-tbi	TBI index for the DV-generated phased VCF.
chm13v2.0-pbsv-vcf	PBSV-generated single-sample SV VCF.
chm13v2.0-pbsv-tbi	TBI index for the PBSV-generated VCF.
chm13v2.0-sniffles-vcf	Sniffles-generated single-sample SV VCF.
chm13v2.0-sniffles-tbi	TBI index for the Sniffles-generated VCF.
chm13v2.0-sniffles-snf	Sniffles-2 SNF file for the single sample.
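
Since the reference-specific manifest columns share a prefix per reference (grch38- for grch38_noalt and chm13v2.0- for T2Tv2.0), the columns for one reference can be selected by prefix. A small sketch over a shortened, illustrative subset of the manifest columns; the helper and constant names are hypothetical:

```python
# Shortened subset of the lrWGS manifest columns, for illustration only.
MANIFEST_COLUMNS = [
    "Research_id",
    "hifiasm-primary-asm-fasta",
    "chm13v2.0-bam",
    "chm13v2.0-deepvariant-vcf",
    "grch38-bam",
    "grch38-deepvariant-vcf",
]

# Column-name prefix used for each reference in the tables above.
REFERENCE_PREFIXES = {"grch38_noalt": "grch38-", "T2Tv2.0": "chm13v2.0-"}


def columns_for_reference(columns, reference):
    """Return the manifest columns that belong to one reference."""
    prefix = REFERENCE_PREFIXES[reference]
    return [name for name in columns if name.startswith(prefix)]


grch38_columns = columns_for_reference(MANIFEST_COLUMNS, "grch38_noalt")
t2t_columns = columns_for_reference(MANIFEST_COLUMNS, "T2Tv2.0")
print(grch38_columns)
print(t2t_columns)
```

The reference-free hifiasm columns match neither prefix and are left out of both selections.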

Auxiliary files

For each of the two references, we also release one auxiliary file (TSV) that describes each sample in one row. Each entry in the columns is either a string or a numerical value.

column_name	note
mosdepth_cov	A floating point value describing the mean coverage, computed with mosdepth.
aligned_frac_bases	A floating point value describing the fraction of bases that are aligned to the reference.
aligned_num_bases	An integer describing the number of bases aligned to the reference.
aligned_num_reads	An integer describing the number of aligned reads.
aligned_read_length_N50	An integer describing the N50 of aligned reads.
aligned_read_length_median	A floating point value describing the median length of aligned reads.
aligned_read_length_mean	A floating point value describing the mean length of aligned reads.
aligned_read_length_stdev	A floating point value describing the standard deviation of aligned read length distribution.
average_identity	A floating point value describing the mean identity between the reads and reference.
median_identity	A floating point value describing the median identity between the reads and reference.
dvp_ft_pass_snp_cnt	The count of SNPs in the DV-generated VCF whose Filter column is PASS.
pbsv_nonBND_50bpSV_cnt	The count of SVs in the PBSV VCF whose Filter column is PASS, not a BND type, and size >= 50bp.
snf2_nonBND_50bpSV_cnt	The count of SVs in the Sniffles-2 VCF whose Filter column is PASS, not a BND type, and size >= 50bp.

In addition, the grch38_noalt version has one extra column from the QC procedures:

column_name	note
contamination_est	An estimate of the level of cross-individual contamination, as reported by VerifyBamID2.
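
As a rough sketch of reading one of the auxiliary TSV files, the snippet below converts the columns that the tables above describe as integers or floating point values and leaves every other column as a string. The sample ID column name (`research_id`), the example row, and the helper name are assumptions; integer columns are parsed via float first in case the file stores them with a decimal point.

```python
import csv
import io

# Columns described above as integers or floating point values.
INT_COLUMNS = {"aligned_num_bases", "aligned_num_reads", "aligned_read_length_N50"}
FLOAT_COLUMNS = {
    "mosdepth_cov",
    "aligned_frac_bases",
    "aligned_read_length_median",
    "aligned_read_length_mean",
    "aligned_read_length_stdev",
    "average_identity",
    "median_identity",
}


def parse_metrics(tsv_text):
    """Parse auxiliary metrics TSV text into a list of typed row dicts."""
    rows = []
    for row in csv.DictReader(io.StringIO(tsv_text), delimiter="\t"):
        typed = {}
        for name, value in row.items():
            if name in INT_COLUMNS:
                typed[name] = int(float(value))
            elif name in FLOAT_COLUMNS:
                typed[name] = float(value)
            else:
                typed[name] = value  # left as a string
        rows.append(typed)
    return rows


# Made-up example row; "research_id" is an assumed column name.
sample_tsv = "research_id\tmosdepth_cov\taligned_num_reads\n1000001\t8.5\t1200000\n"
rows = parse_metrics(sample_tsv)
print(rows[0])
```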
