Controlled CDR Directory


Summary of changes since v6

  • 146804 short-read WGS (srWGS) samples added with single nucleotide polymorphism, insertion, and deletion variant calls (SNPs and Indels) (278 Withdrawn, 147082 New) - Total: 245394
  • 147818 genotyping array (“array”) samples added (416 Withdrawn, 148234 New) - Total: 312945
  • srWGS samples with structural variant (SV) calls added (97940 samples)
  • Long-read WGS (lrWGS) samples with SNP and Indel variants and SVs added (1027 samples)

 

BigQuery

Included within the Controlled Tier is a “mainline” curated data repository (CDR) that contains data types (e.g., Survey, Electronic Health Record (EHR), Wearable Device (Fitbit), Physical Measurement) also represented within the Registered Tier, but with different privacy protections applied. Like the Registered Tier, the Controlled Tier includes a default CDR, which is queryable through the Researcher Workbench dataset tools or BigQuery directly, and a base CDR that is queryable only through BigQuery. Below is information about the mainline CDR location within BigQuery. 

 

For more information about mainline CDR creation and the differences between the default and base instances, please see here. For information about the specific tables, fields, and privacy protections applied to the Registered and Controlled Tier CDRs, please see the CDR Data Dictionary.

 

Asset Location Env var Description
Default Mainline CDR  fc-aou-cdr-prod.C2022Q4R11 WORKSPACE_CDR Main BigQuery CDR representative of the same data types (e.g., Survey, Electronic Health Record (EHR), Wearable Device (Fitbit), Physical Measurement) included in the Registered Tier, but with different privacy protections applied.
Base Mainline CDR  fc-aou-cdr-prod.C2022Q4R11_base   Same CDR representation as noted above, but with fewer processing/convenience transformations applied. 
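
As a minimal sketch of how the WORKSPACE_CDR environment variable listed above might be used, the following Python builds a fully qualified BigQuery table reference and a simple query string. The table name person is an assumption used only for illustration (check the CDR Data Dictionary for the actual tables), and the helper names are hypothetical.

```python
import os


def cdr_table(table_name, env=None):
    """Return a backtick-quoted BigQuery table reference inside the workspace CDR."""
    env = os.environ if env is None else env
    cdr = env["WORKSPACE_CDR"]  # for example "fc-aou-cdr-prod.C2022Q4R11"
    return f"`{cdr}.{table_name}`"


def count_rows_sql(table_name, env=None):
    """Build a row-count query for one table of the CDR."""
    return f"SELECT COUNT(*) AS n FROM {cdr_table(table_name, env)}"


# Stand-in for the real environment; "person" is an assumed table name.
example_env = {"WORKSPACE_CDR": "fc-aou-cdr-prod.C2022Q4R11"}
print(count_rows_sql("person", example_env))
```

The same helper would point at the base CDR if the dataset name passed in the environment ended in _base.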

Cloud Storage

Also included in the Controlled Tier is a genomics CDR that is stored separate from the above referenced BigQuery assets. Note, the genomics CDR is only available through the Controlled Tier (e.g., genomic data is not available within the Registered Tier). 

 

For more information on the specifics of the genomic data assets (including the auxiliary files) and the associated file formats referenced below, please see How the All of Us genomic data are organized.  Before using the genomic data, we highly recommend reading the Known Issues section of the All of Us Research Program Genomic Research Data Quality Report. For help displaying html files in the Researcher Workbench, please see our featured notebook.

 

If you use gsutil to access the CDR bucket, you will need to pass an additional flag, -u $GOOGLE_PROJECT, in the command, as in the following example:
!gsutil -u $GOOGLE_PROJECT ls gs://fc-aou-datasets-controlled
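
For use outside a notebook cell, the same command can be assembled in Python. This is a sketch only: the helper names are hypothetical, and it assumes gsutil is installed and that GOOGLE_PROJECT is set in the environment, as it is in the command above.

```python
import os
import subprocess


def gsutil_ls_command(path, project):
    """Build the gsutil argument list, including the extra -u <project> flag."""
    return ["gsutil", "-u", project, "ls", path]


def list_bucket(path, env=None):
    """Run the listing and return one entry per line (requires gsutil on PATH)."""
    env = os.environ if env is None else env
    cmd = gsutil_ls_command(path, env["GOOGLE_PROJECT"])
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return result.stdout.splitlines()


# "my-project" is a placeholder, not a real project ID.
print(gsutil_ls_command("gs://fc-aou-datasets-controlled/v7", "my-project"))
```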

 

Asset Location Env var Description
All CDR assets path gs://fc-aou-datasets-controlled/v7 CDR_STORAGE_PATH All Cloud Storage assets for this CDR version are under this path
srWGS: VDS gs://fc-aou-datasets-controlled/v7/wgs/short_read/snpindel/vds/hail.vds WGS_VDS_PATH Joint SNP/Indel callset of all available samples on the entire called genomic territory.  This is stored as a VariantDataset (VDS).

This callset has also been released, subsetted to smaller genomic territories, in other formats (such as VCF). See below for more information.
       
srWGS: auxiliary files gs://fc-aou-datasets-controlled/v7/wgs/short_read/snpindel/aux   Auxiliary files for the SNP and Indel variants for srWGS data.
  • Variant Annotation Table (VAT)
…vat/vat_complete_v7.1.bgz.tsv.gz  

Variant-level metadata and functional annotations for  the SNP and Indel variants contained in the srWGS data.


Please see Variant Annotation Table for more details, including the provided fields.

  • Ancestry information

…/ancestry/ancestry_preds.tsv

…/ancestry/preds_oth.html

…/ancestry/merged_sites_only_intersection.vcf.bgz

…/ancestry/merged_sites_only_intersection.vcf.bgz.tbi

…/ancestry/loadings.ht

 

Computed ancestry predictions for all samples in the srWGS joint callset (tsv).  We also provide a plot of the ancestry predictions and the sites-only VCF (vcf.bgz) of the locations we used for training the ancestry classifier. Furthermore, we supply the coefficients of PCs utilized by the ancestry predictions classifier.


Please see Genetic predicted ancestry for more information on these files.


For more information on how we predict ancestry, please see the All Of Us Release Genomic Quality Report (Appendix A)

  • QC information

…/qc/flagged_samples.tsv

…/qc/metrics.html

…/qc/pc1vspc2.html

…/qc/genomic_metrics.tsv

 

The list of srWGS samples flagged in the joint callset QC process (tsv) and metrics related to the genomic sequencing of the WGS samples.  We also include plots of the metric residuals and the first two principal components (*.html) used in the Ancestry and Joint Callset QC.


Please see Flagged srWGS Samples for details on the tsv format.


Please see the All Of Us Release Genomic Quality Report (Joint Callset QC) for details on how we flag samples.


One srWGS sample (person_id: 3518297) does not have its corresponding array in the array Hail MT and PLINK files.  Please note that the array VCF for this sample will be available.  For more information, please see All Of Us Release Genomic Quality Report (Known Issue #4)

  • Relatedness

…/relatedness/relatedness.tsv

…/relatedness/relatedness_flagged_samples.tsv

 

The relatedness of the srWGS samples as kinship scores.  We provide one file that lists all pairs of samples with a kinship score greater than 0.1 (relatedness.tsv).  We also provide a list of samples that would need to be removed to exclude related samples from the full cohort (relatedness_flagged_samples.tsv).


Please see Relatedness for more information.
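
The two relatedness files could be used along the following lines. This sketch assumes the pairs file is tab-separated with columns named i.s, j.s, and kin; those column names are an assumption, so please confirm them in the Relatedness article. The sample IDs are made up.

```python
import csv
import io


def related_pairs(tsv_text, threshold=0.1):
    """Return (sample_a, sample_b, kinship) for pairs above the threshold."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    pairs = []
    for row in reader:
        kin = float(row["kin"])  # assumed column name
        if kin > threshold:
            pairs.append((row["i.s"], row["j.s"], kin))
    return pairs


def drop_flagged(sample_ids, flagged_ids):
    """Remove the samples listed in the flagged-samples file, keeping order."""
    flagged = set(flagged_ids)
    return [s for s in sample_ids if s not in flagged]


# Made-up rows standing in for relatedness.tsv.
sample_tsv = "i.s\tj.s\tkin\n1001\t1002\t0.25\n1003\t1004\t0.12\n"
pairs = related_pairs(sample_tsv, threshold=0.2)
unrelated = drop_flagged(["1001", "1002", "1003"], ["1002"])
```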

       
srWGS: Exome     

These subset files include srWGS SNP and Indel variants that are within exon regions of the Gencode v42 basic transcripts, for all samples. The exome srWGS callset is provided in VCF, Hail MT, BGEN, and PLINK bed formats.



For more information on how we determine the exome, please see Smaller callsets for analyzing srWGS SNP & Indel data with Hail MT, VCF, and PLINK.

 

Note: The exome smaller callsets were updated on 9/22/23 as part of an update to all three smaller callsets; more details can be found in this article. The new versions are referred to as version 7.1.

  • Hail MT

gs://fc-aou-datasets-controlled/v7/wgs/short_read/snpindel/exome_v7.1/multiMT/hail.mt

WGS_EXOME_MULTI_HAIL_PATH

Hail multiallelic MatrixTable (MT) for the exome srWGS joint callset. Multiallelic sites are not split.


When using this file in Hail, read directly from the bucket location.  Do not attempt to copy it locally. Please see srWGS SNP & Indel Hail MT for more information.

  • VCF
gs://fc-aou-datasets-controlled/v7/wgs/short_read/snpindel/exome_v7.1/vcf/ WGS_EXOME_VCF_PATH

Variant Call Format (VCF) for the exome srWGS joint callset.  This callset is converted from the exome Hail MT and sharded by chromosome into multiple files. 


Please see srWGS SNP & Indel VCFs for more information

  • Hail MT multiallelic split
gs://fc-aou-datasets-controlled/v7/wgs/short_read/snpindel/exome_v7.1/splitMT/hail.mt WGS_EXOME_SPLIT_HAIL_PATH Hail MatrixTable (MT) for the exome srWGS joint callset.  Multiallelic sites are split into separate records.
  • PLINK BED
gs://fc-aou-datasets-controlled/v7/wgs/short_read/snpindel/exome_v7.1/plink_bed/   PLINK binary biallelic genotype table (.bed)  for the exome srWGS joint callset. Includes .fam and  .bim files for usage with the PLINK tool as well. These PLINK triplets are converted from the exome Hail MT.
  • BGEN
gs://fc-aou-datasets-controlled/v7/wgs/short_read/snpindel/exome_v7.1/bgen/   Binary GEN (BGEN) files  for the exome srWGS joint callset. Contains  sample, Hail index, and bgenix index.
  • UCSC BED
gs://fc-aou-datasets-controlled/v7/wgs/short_read/snpindel/exome_v7.1/bed/   UCSC BED file used to generate the Exome Variants callset.
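
The Hail MT entries above ask that the file be read directly from its bucket location. One way to follow that advice is to resolve the path from the listed environment variable and check that it is a gs:// URI before handing it to Hail. This is a minimal sketch; the helper name is hypothetical.

```python
import os


def bucket_path_from_env(var_name, env=None):
    """Look up a storage path variable and confirm it points at a bucket."""
    env = os.environ if env is None else env
    path = env.get(var_name)
    if not path:
        raise KeyError(f"{var_name} is not set in this environment")
    if not path.startswith("gs://"):
        raise ValueError(f"{var_name} should be a gs:// URI, got {path!r}")
    return path


# Stand-in for the workbench environment.
example_env = {
    "WGS_EXOME_MULTI_HAIL_PATH": (
        "gs://fc-aou-datasets-controlled/v7/wgs/short_read/"
        "snpindel/exome_v7.1/multiMT/hail.mt"
    )
}
mt_path = bucket_path_from_env("WGS_EXOME_MULTI_HAIL_PATH", example_env)
# With Hail initialized, this URI could be passed to hl.read_matrix_table(mt_path)
# so that the table is read in place rather than copied locally.
```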
srWGS: ACAF Threshold    

These subset files include srWGS SNP and Indel variants that are frequent in the All of Us computed ancestry subpopulations. We use population-specific allele frequency (AF) > 1% or population-specific allele count (AC) > 100, in any computed ancestry subpopulations. The ACAF threshold srWGS callset is provided in VCF, Hail MT, BGEN, and PLINK bed formats.


For more information on these subset files, please see Smaller callsets for analyzing srWGS SNP & Indel data with Hail MT, VCF, and PLINK.
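
The ACAF rule described above can be written as a small predicate. This is an illustrative sketch, not the pipeline's implementation: the input layout (a mapping from subpopulation label to an (AC, AF) pair) and the labels pop_a and pop_b are assumptions.

```python
def passes_acaf_threshold(pop_stats, af_cutoff=0.01, ac_cutoff=100):
    """Return True if any subpopulation has AF > 1% or AC > 100.

    pop_stats maps a subpopulation label to an (allele_count, allele_frequency) pair.
    Both comparisons are strict, matching the ">" in the description.
    """
    return any(
        af > af_cutoff or ac > ac_cutoff
        for ac, af in pop_stats.values()
    )


# Made-up values for illustration.
kept_by_count = passes_acaf_threshold({"pop_a": (5, 0.0005), "pop_b": (150, 0.004)})
kept_by_freq = passes_acaf_threshold({"pop_a": (30, 0.02)})
not_kept = passes_acaf_threshold({"pop_a": (5, 0.0005), "pop_b": (100, 0.01)})
```

A variant is kept as soon as one subpopulation satisfies either condition, so a variant that is rare overall can still be included.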

 

Note: The ACAF Threshold smaller callsets were updated on 9/22/23 as part of an update to all three smaller callsets; more details can be found in this article. The new versions are referred to as version 7.1.

  • Hail MT
gs://fc-aou-datasets-controlled/v7/wgs/short_read/snpindel/acaf_threshold_v7.1/multiMT/hail.mt

WGS_ACAF_THRESHOLD_MULTI_HAIL_PATH

Hail multiallelic MT for the ACAF threshold srWGS joint callset. Multiallelic sites are not split.



When using this file in Hail, read directly from the bucket location.  Do not attempt to copy it locally. Please see srWGS SNP & Indel Hail MT for more information.

  • VCF
gs://fc-aou-datasets-controlled/v7/wgs/short_read/snpindel/acaf_threshold_v7.1/vcf/ WGS_ACAF_THRESHOLD_VCF_PATH

VCF for the ACAF threshold srWGS joint callset. This callset is converted from the ACAF Threshold Hail MT and sharded by chromosome into multiple files. 


Please see srWGS SNP & Indel VCFs for more information.

  • Hail MT multiallelic split
gs://fc-aou-datasets-controlled/v7/wgs/short_read/snpindel/acaf_threshold_v7.1/splitMT/hail.mt

WGS_ACAF_THRESHOLD_SPLIT_HAIL_PATH

Hail MT for the ACAF threshold srWGS joint callset. Multiallelic sites are split into separate records.
  • PLINK BED

gs://fc-aou-datasets-controlled/v7/wgs/short_read/snpindel/acaf_threshold_v7.1/plink_bed/

 

PLINK binary biallelic genotype table (.bed)  for the ACAF threshold srWGS joint callset. 

Includes .fam, .bim files for usage with the PLINK tool as well. These PLINK triplets are converted from the ACAF threshold  Hail MT.

  • BGEN

gs://fc-aou-datasets-controlled/v7/wgs/short_read/snpindel/acaf_threshold_v7.1/bgen/

  BGEN files for the ACAF threshold srWGS joint callset. Contains sample, Hail index, and bgenix index.
  • UCSC BED
gs://fc-aou-datasets-controlled/v7/wgs/short_read/snpindel/acaf_threshold_v7.1/bed/   UCSC BED file used to generate the ACAF threshold Variants callset.
srWGS: ClinVar Variants    

These subset files include srWGS SNP and Indel variants that are in ClinVar, not limited to pathogenic or likely pathogenic variants. The ClinVar srWGS callset is provided in VCF, Hail MT, BGEN, and PLINK bed formats.


For more information on these subset files, please see Smaller callsets for analyzing srWGS SNP & Indel data with Hail MT, VCF, and PLINK.

 

Note: The ClinVar smaller callsets were updated on 9/22/23 as part of an update to all three smaller callsets; more details can be found in this article. The new versions are referred to as version 7.1.

  • Hail MT
gs://fc-aou-datasets-controlled/v7/wgs/short_read/snpindel/clinvar_v7.1/multiMT/hail.mt WGS_CLINVAR_MULTI_HAIL_PATH

Hail MT for the ClinVar srWGS joint callset. Multiallelic sites are not split.



When using this file in Hail, read directly from the bucket location.  Do not attempt to copy it locally. Please see srWGS SNP & Indel Hail MT for more information.

  • VCF
gs://fc-aou-datasets-controlled/v7/wgs/short_read/snpindel/clinvar_v7.1/vcf/ WGS_CLINVAR_VCF_PATH

VCF for the ClinVar srWGS joint callset.


This callset is converted from the ClinVar Hail MT and sharded by chromosome into multiple files.


Please see srWGS SNP & Indel VCFs for more information.

  • Hail MT multiallelic split
gs://fc-aou-datasets-controlled/v7/wgs/short_read/snpindel/clinvar_v7.1/splitMT/hail.mt WGS_CLINVAR_SPLIT_HAIL_PATH Hail MT for the ClinVar srWGS joint callset. Multiallelic sites are split into separate records.
  • PLINK BED
gs://fc-aou-datasets-controlled/v7/wgs/short_read/snpindel/clinvar_v7.1/plink_bed/  

PLINK binary biallelic genotype table (.bed)  for the ClinVar srWGS joint callset. Includes .fam, .bim files for usage with the PLINK tool as well. 


These PLINK triplets are converted from the ClinVar Hail MT.

  • BGEN
gs://fc-aou-datasets-controlled/v7/wgs/short_read/snpindel/clinvar_v7.1/bgen/   BGEN files for the ClinVar srWGS joint callset. Contains sample file, Hail index, and bgenix index.
  • UCSC BED
gs://fc-aou-datasets-controlled/v7/wgs/short_read/snpindel/clinvar_v7.1/bed/   UCSC BED file used to generate the ClinVar srWGS joint callset.
srWGS: CRAM files gs://fc-aou-datasets-controlled/v7/wgs/cram/manifest.csv WGS_CRAM_MANIFEST_PATH

Path to manifest CSV file that contains a row per sample of: person_id,cram_uri,cram_index_uri


We provide CRAM files and CRAM index files with the research ID in the name of the file, one CRAM file for each WGS sample. See CRAM files for more information.
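
Because the manifest has one row per sample with the columns person_id, cram_uri, and cram_index_uri, a lookup table can be built with the standard csv module. The sketch below uses made-up rows and placeholder URIs; in practice the text would come from the file at WGS_CRAM_MANIFEST_PATH.

```python
import csv
import io


def cram_uris_by_person(manifest_text):
    """Map each person_id to its (cram_uri, cram_index_uri) pair."""
    reader = csv.DictReader(io.StringIO(manifest_text))
    return {
        row["person_id"]: (row["cram_uri"], row["cram_index_uri"])
        for row in reader
    }


# Placeholder content; the real URIs live in the CDR bucket.
example_manifest = (
    "person_id,cram_uri,cram_index_uri\n"
    "1001,gs://example-bucket/wgs_1001.cram,gs://example-bucket/wgs_1001.cram.crai\n"
    "1002,gs://example-bucket/wgs_1002.cram,gs://example-bucket/wgs_1002.cram.crai\n"
)
lookup = cram_uris_by_person(example_manifest)
cram_uri, index_uri = lookup["1002"]
```

The same pattern should carry over to the other manifests in this directory, with the column names changed to match each manifest's header.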

        
srWGS: Structural Variants (SVs)      
  • srWGS SV VCF 
gs://fc-aou-datasets-controlled/v7/wgs/short_read/structural_variants/v7_offcycle/vcf/full/   SVs for 97,940 srWGS samples.
  • srWGS SV sites-only VCF 
gs://fc-aou-datasets-controlled/v7/wgs/short_read/structural_variants/v7_offcycle/vcf/sites-only/   The sites-only VCF has all variant sites from the 97,940 samples but no genotype information.
  • aux
    Auxiliary information from the srWGS structural variant calling:
  • srWGS SV samples with probable aneuploidies
gs://fc-aou-datasets-controlled/v7/wgs/short_read/structural_variants/v7_offcycle/aux/aneuploidies/   We provide lists of samples with probable aneuploidies identified during srWGS SV ploidy estimation as tsv files. There are three separate files for samples with probable aneuploidies: mosaic autosomal aneuploidy, mosaic allosomal aneuploidy, and germline allosomal aneuploidy.
  • srWGS SV maximal set of unrelated samples
gs://fc-aou-datasets-controlled/v7/wgs/short_read/structural_variants/v7_offcycle/aux/relatedness   We provide a list of the maximal set of unrelated samples in the srWGS SV cohort. The samples are reported in a txt file as a list of research IDs. One research ID is listed per line and there is no header in the file. 
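
Since the unrelated-samples file is a headerless list with one research ID per line, it can be read into a set and used to restrict a cohort. A minimal sketch with made-up IDs and hypothetical helper names:

```python
def read_research_ids(text):
    """Parse a headerless, one-ID-per-line list into a set, skipping blank lines."""
    return {line.strip() for line in text.splitlines() if line.strip()}


def keep_unrelated(sample_ids, unrelated_ids):
    """Keep only the samples that appear in the unrelated set, preserving order."""
    return [s for s in sample_ids if s in unrelated_ids]


# Made-up file content for illustration.
unrelated_set = read_research_ids("1001\n1003\n\n1007\n")
cohort = keep_unrelated(["1001", "1002", "1003"], unrelated_set)
```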
       
srWGS PGx Haplotype Calls     The following file paths are related to work done in the Featured Workspace "Demo - Pharmacogenomics (PGx) variant frequency and medication exposures"
  • srWGS PGx Haplotype v7
gs://fc-aou-datasets-controlled/v7/demo-project-files/pgx-calls/   PGx haplotype calls completed using srWGS for 15 genes and gene regions. Haplotype calls are available for Stargazer v2.0.1 and PharmCAT v2.4.0
  • srWGS PGx Haplotype v6
gs://fc-aou-datasets-controlled/v6/demo-project-files/pgx-calls/   PGx haplotype calls completed using srWGS for 15 genes and gene regions. Haplotype calls are available for Stargazer v2.0.0, PharmCAT v2.2.1
       
Long read whole genome sequencing (lrWGS)     

Single sample and joint called raw data, variant data, and auxiliary files. All single sample files are listed in the manifest, described in Appendix 1. All other joint called and joint auxiliary files are listed below.

 

  • lrWGS single sample file manifest
gs://fc-aou-datasets-controlled/v7/wgs/long_read/manifest.csv LONG_READS_MANIFEST_PATH

The lrWGS manifest contains file locations for all single sample lrWGS raw data, variant data, and auxiliary data. See Appendix 1 for the description of each file.

 

Path to manifest CSV file that contains a row per sample of:


Research_id,hifiasm-primary-asm-fasta,hifiasm-hap1-asm-fasta,hifiasm-hap2-asm-fasta,hifiasm-primary-asm-gfa,hifiasm-hap1-asm-gfa,hifiasm-hap2-asm-gfa,hifiasm-quast-report-html,hifiasm-quast-report-summary,chm13v2.0-pav-vcf,chm13v2.0-pav-tbi,chm13v2.0-bam,chm13v2.0-bai,chm13v2.0-pbi,chm13v2.0-haplotagged-bam,chm13v2.0-haplotagged-bai,chm13v2.0-deepvariant-vcf,chm13v2.0-deepvariant-tbi,chm13v2.0-deepvariant-phased-vcf,chm13v2.0-deepvariant-phased-tbi,chm13v2.0-pbsv-vcf,chm13v2.0-pbsv-tbi,chm13v2.0-sniffles-vcf,chm13v2.0-sniffles-tbi,chm13v2.0-sniffles-snf,grch38-pav-vcf,grch38-pav-tbi,grch38-bam,grch38-bai,grch38-pbi,grch38-haplotagged-bam,grch38-haplotagged-bai,grch38-deepvariant-vcf,grch38-deepvariant-tbi,grch38-deepvariant-phased-vcf,grch38-deepvariant-phased-tbi,grch38-pbsv-vcf,grch38-pbsv-tbi,grch38-sniffles-vcf,grch38-sniffles-tbi,grch38-sniffles-snf

  • Joint called Hail MT (grch38_noalt)
gs://fc-aou-datasets-controlled/v7/wgs/long_read/hail.mt/GRCh38/

WGS_LONGREADS_HAIL_GRCH38_PATH

Hail MT for the lrWGS joint callset for SNPs and Indels called to the grch38_noalt reference 


When using this file in Hail, read directly from the bucket location.  Do not attempt to copy it locally.

  • Joint called Hail MT (T2Tv2.0)
gs://fc-aou-datasets-controlled/v7/wgs/long_read/hail.mt/T2T/ WGS_LONGREADS_HAIL_T2T_PATH

Hail MT for the lrWGS joint callset for SNPs and Indels called to the T2Tv2.0 reference. 


When using this file in Hail, read directly from the bucket location.  Do not attempt to copy it locally.

 

Note: The joint-called Hail MT (T2Tv2.0) was updated to version 7.1 on 2/8/24.

  • Joint called VCF (grch38_noalt)
gs://fc-aou-datasets-controlled/v7/wgs/long_read/joint_vcf/GRCh38/

WGS_LONGREADS_JOINT_SNP_INDEL_VCF_GRCH38_PATH

Joint called lrWGS SNP and Indel VCF against the grch38_noalt reference.


TBI index file accompanying the VCF is also provided.

  • Joint called VCF (T2Tv2.0)
gs://fc-aou-datasets-controlled/v7/wgs/long_read/joint_vcf/T2T/

WGS_LONGREADS_JOINT_SNP_INDEL_VCF_T2T_PATH

Joint called lrWGS SNP and  Indel VCF against the T2T-CHM13-v2.0 reference.

TBI index file accompanying the VCF is also provided.

  • Aux (grch38_noalt)
gs://fc-aou-datasets-controlled/v7/wgs/long_read/aux/auxiliary_metrics.GRCh38.tsv   Auxiliary file holding metric values for each sample, for the grch38_noalt reference. Described in the lrWGS variant metrics.
  • Aux (T2Tv2.0)
gs://fc-aou-datasets-controlled/v7/wgs/long_read/aux/auxiliary_metrics.T2T.tsv   Auxiliary file holding metric values for each sample, for the T2Tv2.0 reference; each metric value is a floating point value. Described in the lrWGS variant metrics.
       
Array: single sample VCFs gs://fc-aou-datasets-controlled/v7/microarray/vcf/manifest.csv

MICROARRAY_VCF_MANIFEST_PATH

Path to manifest CSV file that contains a row per sample of: person_id,vcf_uri,vcf_index_uri


One VCF per participant sample.


Please see Array VCFs for more information.

Array: all samples Hail MT gs://fc-aou-datasets-controlled/v7/microarray/hail.mt_v7.1 MICROARRAY_HAIL_STORAGE_PATH

Hail MT of the array samples in this release.  All of the samples have been merged into a single matrix table.  


Please see Array MatrixTable for more information. 

Array: all samples PLINK files gs://fc-aou-datasets-controlled/v7/microarray/plink_v7.1/arrays.*  

PLINK binary merged representation of the microarray samples in this release (.bed). Includes .fam, .bim files for usage with the plink tool as well. 


Please see Array PLINK 1.9 data for more information

Array: IDAT files

gs://fc-aou-datasets-controlled/v7/microarray/idat/manifest.csv

MICROARRAY_IDAT_MANIFEST_PATH

Path to manifest.csv file that contains a row per sample of: person_id,green_idat_uri,red_idat_uri


Two IDAT files per array sample with the research id in the name of the file. Please see IDAT files for more information.

       
Known Issues: sample lists associated with known issues

gs://fc-aou-datasets-controlled/v7/known_issues/wgs_v7_not_in_cdr_known_issue_1.tsv

gs://fc-aou-datasets-controlled/v7/known_issues/array_v7_not_in_cdr_known_issue_1.tsv

 

gs://fc-aou-datasets-controlled/v7/wgs/short_read/structural_variants/v7_offcycle/aux/known_issues/AoU_srWGS_SV.v7_offcycle.not_in_cdr_known_issue_1.txt


gs://fc-aou-datasets-controlled/v7/known_issues/research_id_v7_array_known_issue_2.tsv


gs://fc-aou-datasets-controlled/v7/known_issues/research_id_v7_wgs_known_issue_2.tsv

gs://fc-aou-datasets-controlled/v7/known_issues/array_rids_v6_not_in_v7_known_issue_3.tsv

 

gs://fc-aou-datasets-controlled/v7/known_issues/research_id_v7_array_known_issue_14.tsv

 

gs://fc-aou-datasets-controlled/v7/known_issues/research_id_v7_array_known_issue_15.tsv


gs://fc-aou-datasets-controlled/v7/known_issues/research_id_v7_wgs_known_issue_15.tsv
 

Each file contains a list of sample IDs associated with known issues.


For more information, please see All Of Us Release Genomic Quality Report (Known Issue #1,2,3,14,15)

 

 

Appendix 1. lrWGS manifest column descriptions 

Reference-free genomic files 

We have provided three de novo assemblies, each in two formats (FASTA and GFA), for each sample.

 

column_name note
hifiasm-primary-asm-fasta Hifiasm primary assembly, in FASTA format.
hifiasm-hap1-asm-fasta Hifiasm haplotype-resolved assembly for haplotype-1 (in no particular order), in FASTA format.
hifiasm-hap2-asm-fasta Hifiasm haplotype-resolved assembly for haplotype-2 (in no particular order), in FASTA format.
hifiasm-primary-asm-gfa Hifiasm primary assembly, in GFA format.
hifiasm-hap1-asm-gfa Hifiasm haplotype-resolved assembly for haplotype-1 (in no particular order), in GFA format.
hifiasm-hap2-asm-gfa Hifiasm haplotype-resolved assembly for haplotype-2 (in no particular order), in GFA format.
hifiasm-quast-report-html An HTML-formatted report on the quality of the three assembly FASTA files; produced by the tool QUAST.
hifiasm-quast-report-summary A summary of the QUAST-reported metrics of the three assembly FASTA files.

 

Reference-specific genomic files

References used

The DRC pipelines align sequences to two references. 

  1. The grch38_noalt reference contains a subset of contigs from the full GRCh38 reference. Specifically, only primary assembly autosomes (1-22), sex chromosomes (X and Y), mitochondria, human EBV, and random and unplaced contigs are included.
  2. T2Tv2.0 refers to the T2T CHM13v2.0 reference, retrieved from the T2T consortium's AWS bucket, with the human EBV contig appended.

 

We have provided two sets of all downstream genomic files for each sample, one for each reference. 

 

Genomic files based on grch38_noalt

Unless otherwise specified, all VCF files are gzipped into .vcf.gz.

 

column_name note
grch38-bam WGS BAM for the sample, aligned to grch38_noalt.
grch38-bai The accompanying index for the BAM.
grch38-pbi The accompanying PBI index for the BAM.
grch38-haplotagged-bam Haplotagged BAM.
grch38-haplotagged-bai The accompanying index for the haplotagged BAM.
grch38-pav-vcf PAV-generated VCF.
grch38-pav-tbi TBI index for the PAV-generated VCF.
grch38-deepvariant-vcf PEPPER-Margin-DeepVariant-generated (DV-generated) single sample small variant VCF; a filter of QUAL<40 has been applied.
grch38-deepvariant-tbi TBI index for the DV-generated VCF.
grch38-deepvariant-phased-vcf DV-generated single sample small variant VCF, phased; a filter of QUAL<40 has been applied.
grch38-deepvariant-phased-tbi TBI index for the DV-generated phased VCF.
grch38-pbsv-vcf PBSV-generated single-sample SV VCF.
grch38-pbsv-tbi TBI index for the PBSV-generated VCF.
grch38-sniffles-vcf Sniffles-generated single sample SV VCF.
grch38-sniffles-tbi TBI index for the Sniffles-generated VCF.
grch38-sniffles-snf Sniffles-2 SNF file for the single sample.

 

Genomic files based on T2Tv2.0

 

column_name note
chm13v2.0-bam lrWGS BAM for the sample, aligned to T2Tv2.0.
chm13v2.0-bai The accompanying index for the BAM.
chm13v2.0-pbi The accompanying PBI index for the BAM.
chm13v2.0-haplotagged-bam Haplotagged BAM.
chm13v2.0-haplotagged-bai The accompanying index for the haplotagged BAM.
chm13v2.0-pav-vcf PAV-generated VCF.
chm13v2.0-pav-tbi TBI index for the PAV-generated VCF.
chm13v2.0-deepvariant-vcf PEPPER-Margin-DeepVariant-generated (DV-generated) single sample small variant VCF; a filter of QUAL<40 has been applied.
chm13v2.0-deepvariant-tbi TBI index for the DV-generated VCF.
chm13v2.0-deepvariant-phased-vcf DV-generated single-sample small variant VCF, phased; a filter of QUAL<40 has been applied.
chm13v2.0-deepvariant-phased-tbi TBI index for the DV-generated phased VCF.
chm13v2.0-pbsv-vcf PBSV-generated single-sample SV VCF.
chm13v2.0-pbsv-tbi TBI index for the PBSV-generated VCF.
chm13v2.0-sniffles-vcf Sniffles-generated single-sample SV VCF.
chm13v2.0-sniffles-tbi TBI index for the Sniffles-generated VCF.
chm13v2.0-sniffles-snf Sniffles-2 SNF file for the single sample.
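
The manifest column names in the tables above share a prefix per group: hifiasm- for the reference-free files, grch38- for grch38_noalt, and chm13v2.0- for T2Tv2.0. As a sketch, the header could be split into those groups as follows; the function name and the group labels are hypothetical.

```python
def group_manifest_columns(header_line):
    """Group lrWGS manifest columns by their reference prefix."""
    groups = {"reference_free": [], "grch38_noalt": [], "T2Tv2.0": [], "other": []}
    for name in header_line.strip().split(","):
        lowered = name.lower()  # tolerate inconsistent capitalization
        if lowered.startswith("hifiasm-"):
            groups["reference_free"].append(name)
        elif lowered.startswith("grch38-"):
            groups["grch38_noalt"].append(name)
        elif lowered.startswith("chm13v2.0-"):
            groups["T2Tv2.0"].append(name)
        else:
            groups["other"].append(name)  # for example the sample ID column
    return groups


# Shortened header for illustration; the full header is listed with the manifest.
example_header = "research_id,hifiasm-primary-asm-fasta,chm13v2.0-bam,grch38-bam,grch38-bai"
groups = group_manifest_columns(example_header)
```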

 

Auxiliary files

 

For each of the two references, we also release one auxiliary file (TSV) that describes each sample in one row. Each entry in the columns is either a string or a numerical value.

 

column_name note
mosdepth_cov A floating point value describing the mean coverage, computed with mosdepth.
aligned_frac_bases A floating point value describing the fraction of bases that are aligned to the reference.
aligned_num_bases An integer describing the number of bases aligned to the reference.
aligned_num_reads An integer describing the number of aligned reads.
aligned_read_length_N50 An integer describing the N50 of aligned reads.
aligned_read_length_median A floating point value describing the median length of aligned reads.
aligned_read_length_mean A floating point value describing the mean length of aligned reads.
aligned_read_length_stdev A floating point value describing the standard deviation of aligned read length distribution.
average_identity A floating point value describing the mean identity between the reads and reference.
median_identity A floating point value describing the median identity between the reads and reference.
dvp_ft_pass_snp_cnt The count of SNPs in the DV-generated VCF whose Filter column is PASS.
pbsv_nonBND_50bpSV_cnt The count of SVs in the PBSV VCF whose Filter column is PASS, not a BND type, and size >= 50bp.
snf2_nonBND_50bpSV_cnt The count of SVs in the Sniffles-2 VCF whose Filter column is PASS, not a BND type, and size >= 50bp.

 

In addition, the grch38_noalt version has one extra column from the QC procedures:

 

column_name note
contamination_est An estimate of the level of cross-individual contamination, as reported by VerifyBAMID2.
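
When loading an auxiliary file, the column descriptions above suggest which values to convert to integers and which to floating point. The sketch below assumes the counts are stored as plain numbers, that contamination_est is numeric, and that the sample identifier column is called sample_id; that last name is hypothetical and used only for illustration.

```python
import csv
import io

INT_COLUMNS = {
    "aligned_num_bases", "aligned_num_reads", "aligned_read_length_N50",
    "dvp_ft_pass_snp_cnt", "pbsv_nonBND_50bpSV_cnt", "snf2_nonBND_50bpSV_cnt",
}
FLOAT_COLUMNS = {
    "mosdepth_cov", "aligned_frac_bases", "aligned_read_length_median",
    "aligned_read_length_mean", "aligned_read_length_stdev",
    "average_identity", "median_identity", "contamination_est",
}


def parse_metrics(tsv_text):
    """Read the TSV and convert known metric columns; other columns stay strings."""
    rows = []
    for raw in csv.DictReader(io.StringIO(tsv_text), delimiter="\t"):
        row = {}
        for key, value in raw.items():
            if key in INT_COLUMNS:
                row[key] = int(float(value))  # tolerates values written like 12.0
            elif key in FLOAT_COLUMNS:
                row[key] = float(value)
            else:
                row[key] = value
        rows.append(row)
    return rows


# Made-up row for illustration.
example_tsv = "sample_id\tmosdepth_cov\taligned_num_reads\n1001\t8.5\t120000\n"
metrics = parse_metrics(example_tsv)
```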

 

 

 

 

 

 

 

 

 

 
