Summary of changes since v6
- 146804 short-read WGS (srWGS) samples added with single nucleotide polymorphism, insertion, and deletion variant calls (SNPs and Indels) (278 Withdrawn, 147082 New) - Total: 245394
- 147818 genotyping array (“array”) samples added (416 Withdrawn, 148234 New) - Total: 312945
- srWGS samples with structural variant (SV) calls added (97940 samples)
- Long read WGS (lrWGS) samples with SNP and Indel variants and SVs added (1027 samples)
BigQuery
Included within the Controlled Tier is a “mainline” curated data repository (CDR) that contains data types (e.g., Survey, Electronic Health Record (EHR), Wearable Device (Fitbit), Physical Measurement) also represented within the Registered Tier, but with different privacy protections applied. Like the Registered Tier, the Controlled Tier includes a default CDR, which is queryable through the Researcher Workbench dataset tools or BigQuery directly, and a base CDR that is queryable only through BigQuery. Below is information about the mainline CDR location within BigQuery.
For more information about mainline CDR creation and the differences between the default and base instances, please see here. For information about the specific tables, fields, and privacy protections applied to the Registered and Controlled Tier CDRs, please see the CDR Data Dictionary.
Asset | Location | Env var | Description |
Default Mainline CDR | fc-aou-cdr-prod.C2022Q4R11 | WORKSPACE_CDR | Main BigQuery CDR representative of the same data types (e.g., Survey, Electronic Health Record (EHR), Wearable Device (Fitbit), Physical Measurement) included in the Registered Tier, but with fewer privacy protections applied. |
Base Mainline CDR | fc-aou-cdr-prod.C2022Q4R11_base | | Same CDR representation as noted above, but with fewer processing/convenience transformations applied. |
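As a rough illustration of how the locations in the table above might be used from a notebook, the sketch below builds a fully qualified BigQuery table reference from the WORKSPACE_CDR environment variable, falling back to the documented default location. The helper names are hypothetical, and the `person` table is an assumption made for illustration only; consult the CDR Data Dictionary for the tables that actually exist.

```python
import os

# Documented location of the default mainline CDR (from the table above).
DEFAULT_CDR = "fc-aou-cdr-prod.C2022Q4R11"


def cdr_table(table_name, cdr=None):
    """Return a backtick-quoted, fully qualified BigQuery table reference."""
    # Prefer an explicit argument, then the environment variable, then the default.
    dataset = cdr or os.environ.get("WORKSPACE_CDR", DEFAULT_CDR)
    return f"`{dataset}.{table_name}`"


def participant_count_query(cdr=None):
    """Build a simple counting query against an assumed `person` table."""
    # `person` is a hypothetical table name used only for illustration.
    return f"SELECT COUNT(*) AS n_participants FROM {cdr_table('person', cdr)}"


if __name__ == "__main__":
    # Query the base CDR instead of the default one.
    print(participant_count_query("fc-aou-cdr-prod.C2022Q4R11_base"))
```

The resulting string can be passed to whichever BigQuery client you normally use. The backticks matter because the project name contains hyphens.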
Cloud Storage
Also included in the Controlled Tier is a genomics CDR that is stored separately from the BigQuery assets referenced above. Note that the genomics CDR is only available through the Controlled Tier (i.e., genomic data is not available within the Registered Tier).
For more information on the specifics of the genomic data assets (including the auxiliary files) and the associated file formats referenced below, please see How the All of Us genomic data are organized. Before using the genomic data, we highly recommend reading the Known Issues section of the All of Us Research Program Genomic Research Data Quality Report. For help displaying html files in the Researcher Workbench, please see our featured notebook.
If you use gsutil to access the CDR bucket, you will need to pass an additional flag (-u with your workspace's Google project) in the command, as described in more detail here:
!gsutil -u $GOOGLE_PROJECT ls gs://fc-aou-datasets-controlled
|
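For readers who prefer to call gsutil from Python rather than with the notebook `!` syntax, a minimal sketch follows. It only assembles and runs the same `gsutil -u <project> ls <path>` command shown above; the function names and the placeholder project ID are hypothetical, and `GOOGLE_PROJECT` is assumed to be set in the environment, as in the command above.

```python
import os
import shlex
import subprocess


def gsutil_ls_command(path, project):
    """Build the argument list for `gsutil -u <project> ls <path>`."""
    return ["gsutil", "-u", project, "ls", path]


def list_bucket(path, project=None):
    """Run the listing and return one entry per output line."""
    # Assumes GOOGLE_PROJECT is defined when no project is passed explicitly.
    project = project or os.environ["GOOGLE_PROJECT"]
    result = subprocess.run(
        gsutil_ls_command(path, project),
        capture_output=True,
        text=True,
        check=True,  # raise if gsutil exits with an error
    )
    return result.stdout.splitlines()


if __name__ == "__main__":
    # "my-billing-project" is a placeholder, not a real project ID.
    cmd = gsutil_ls_command("gs://fc-aou-datasets-controlled/v7", "my-billing-project")
    print(shlex.join(cmd))
```

Passing the arguments as a list avoids shell quoting problems if a path contains unusual characters.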
Asset | Location | Env var | Description |
All CDR assets path | gs://fc-aou-datasets-controlled/v7 | CDR_STORAGE_PATH | All Cloud Storage assets for this CDR version are under this path |
srWGS: VDS | gs://fc-aou-datasets-controlled/v7/wgs/short_read/snpindel/vds/hail.vds | WGS_VDS_PATH | Joint SNP/Indel callset of all available samples on the entire called genomic territory. This is stored as a VariantDataset (VDS). This callset has also been released, subsetted to smaller genomic territories, in other formats (such as VCF). See below for more information. |
srWGS: auxiliary files | gs://fc-aou-datasets-controlled/v7/wgs/short_read/snpindel/aux | Auxiliary files for the SNP and Indel variants for srWGS data. | |
|
…vat/vat_complete_v7.1.bgz.tsv.gz |
Variant-level metadata and functional annotations for the SNP and Indel variants contained in the srWGS data. Please see Variant Annotation Table for more details, including the provided fields. |
|
|
…/ancestry/ancestry_preds.tsv …/ancestry/preds_oth.html …/ancestry/merged_sites_only_intersection.vcf.bgz …/ancestry/merged_sites_only_intersection.vcf.bgz.tbi …/ancestry/loadings.ht |
Computed ancestry predictions for all samples in the srWGS joint callset (tsv). We also provide a plot of the ancestry predictions and the sites-only VCF (vcf.bgz) of the locations we used for training the ancestry classifier. Furthermore, we supply the coefficients of the PCs used by the ancestry prediction classifier. Please see Genetic predicted ancestry for more information on these files. For more information on how we predict ancestry, please see the All Of Us Release Genomic Quality Report (Appendix A). |
|
|
…/qc/flagged_samples.tsv …/qc/metrics.html …/qc/pc1vspc2.html …/qc/genomic_metrics.tsv |
The list of srWGS samples flagged in the joint callset QC process (tsv) and metrics related to the genomic sequencing of the WGS samples. We also include plots of the metric residuals and the first two principal components (*.html) used in the Ancestry and Joint Callset QC. Please see Flagged srWGS Samples for details on the tsv format. Please see the All Of Us Release Genomic Quality Report (Joint Callset QC) for details on how we flag samples. One srWGS sample (person_id: 3518297) does not have its corresponding array in the array Hail MT and PLINK files. Please note that the array VCF for this sample will be available. For more information, please see All Of Us Release Genomic Quality Report (Known Issue #4) |
|
|
…/relatedness/relatedness.tsv …/relatedness/relatedness_flagged_samples.tsv |
The relatedness of the srWGS samples as kinship scores. We provide one file that lists all pairs of samples with a kinship score greater than 0.1 (relatedness.tsv). We also provide a list of samples that would need to be removed to eliminate related samples from the full cohort (relatedness_flagged_samples.tsv). Please see Relatedness for more information. |
|
srWGS: Exome |
These subset files include srWGS SNP and Indel variants that are within exon regions of the Gencode v42 basic transcripts, for all samples. The exome srWGS callset is provided in VCF, Hail MT, BGEN, and PLINK bed formats. For more information on how we determine the exome, please see Smaller callsets for analyzing srWGS SNP & Indel data with Hail MT, VCF, and PLINK.
Note: The exome smaller callsets were updated on 9/22/23 as part of an update to all three smaller callsets; more details can be found in this article. The new versions are referred to as version 7.1. |
||
|
gs://fc-aou-datasets-controlled/v7/wgs/short_read/snpindel/exome_v7.1/multiMT/hail.mt |
WGS_EXOME_MULTI_HAIL_PATH |
Hail multiallelic MatrixTable (MT) for the exome srWGS joint callset. Multiallelic sites are not split. When using this file in Hail, read directly from the bucket location. Do not attempt to copy it locally. Please see srWGS SNP & Indel Hail MT for more information. |
|
gs://fc-aou-datasets-controlled/v7/wgs/short_read/snpindel/exome_v7.1/vcf/ | WGS_EXOME_VCF_PATH |
Variant Call Format (VCF) for the exome srWGS joint callset. This callset is converted from the exome Hail MT and sharded by chromosome into multiple files. Please see srWGS SNP & Indel VCFs for more information |
|
gs://fc-aou-datasets-controlled/v7/wgs/short_read/snpindel/exome_v7.1/splitMT/hail.mt | WGS_EXOME_SPLIT_HAIL_PATH | Hail split MatrixTable (MT) for the exome srWGS joint callset. Multiallelic sites are split into separate records. |
|
gs://fc-aou-datasets-controlled/v7/wgs/short_read/snpindel/exome_v7.1/plink_bed/ | PLINK binary biallelic genotype table (.bed) for the exome srWGS joint callset. Includes .fam and .bim files for usage with the PLINK tool as well. These PLINK triplets are converted from the exome Hail MT. | |
|
gs://fc-aou-datasets-controlled/v7/wgs/short_read/snpindel/exome_v7.1/bgen/ | Binary GEN (BGEN) files for the exome srWGS joint callset. Contains sample, Hail index, and bgenix index. | |
|
gs://fc-aou-datasets-controlled/v7/wgs/short_read/snpindel/exome_v7.1/bed/ | UCSC BED file used to generate the Exome Variants callset. | |
srWGS: ACAF Threshold |
These subset files include srWGS SNP and Indel variants that are frequent in the All of Us computed ancestry subpopulations. We use population-specific allele frequency (AF) > 1% or population-specific allele count (AC) > 100, in any computed ancestry subpopulations. The ACAF threshold srWGS callset is provided in VCF, Hail MT, BGEN, and PLINK bed formats. For more information on these subset files, please see Smaller callsets for analyzing srWGS SNP & Indel data with Hail MT, VCF, and PLINK.
Note: The ACAF Threshold smaller callsets were updated on 9/22/23 as part of an update to all three smaller callsets; more details can be found in this article. The new versions are referred to as version 7.1. |
||
|
gs://fc-aou-datasets-controlled/v7/wgs/short_read/snpindel/acaf_threshold_v7.1/multiMT/hail.mt |
WGS_ACAF_THRESHOLD_MULTI_HAIL_PATH |
Hail multiallelic MT for the ACAF threshold srWGS joint callset. Multiallelic sites are not split. When using this file in Hail, read directly from the bucket location. Do not attempt to copy it locally. Please see srWGS SNP & Indel Hail MT for more information. |
|
gs://fc-aou-datasets-controlled/v7/wgs/short_read/snpindel/acaf_threshold_v7.1/vcf/ | WGS_ACAF_THRESHOLD_VCF_PATH |
VCF for the ACAF threshold srWGS joint callset. This callset is converted from the ACAF Threshold Hail MT and sharded by chromosome into multiple files. Please see srWGS SNP & Indel VCFs for more information. |
|
gs://fc-aou-datasets-controlled/v7/wgs/short_read/snpindel/acaf_threshold_v7.1/splitMT/hail.mt |
WGS_ACAF_THRESHOLD_SPLIT_HAIL_PATH |
Hail split MT for the ACAF threshold srWGS joint callset. Multiallelic sites are split into separate records. |
|
gs://fc-aou-datasets-controlled/v7/wgs/short_read/snpindel/acaf_threshold_v7.1/plink_bed/ |
PLINK binary biallelic genotype table (.bed) for the ACAF threshold srWGS joint callset. Includes .fam and .bim files for usage with the PLINK tool as well. These PLINK triplets are converted from the ACAF threshold Hail MT. |
|
|
gs://fc-aou-datasets-controlled/v7/wgs/short_read/snpindel/acaf_threshold_v7.1/bgen/ |
BGEN files for the ACAF threshold srWGS joint callset. Contains sample, Hail index, and bgenix index. | |
|
gs://fc-aou-datasets-controlled/v7/wgs/short_read/snpindel/acaf_threshold_v7.1/bed/ | UCSC BED file used to generate the ACAF threshold Variants callset. | |
srWGS: ClinVar Variants |
These subset files include srWGS SNP and Indel variants that are in ClinVar, not limited to pathogenic or likely pathogenic variants. The ClinVar srWGS callset is provided in VCF, Hail MT, BGEN, and PLINK bed formats. For more information on these subset files, please see Smaller callsets for analyzing srWGS SNP & Indel data with Hail MT, VCF, and PLINK.
Note: The ClinVar smaller callsets were updated on 9/22/23 as part of an update to all three smaller callsets; more details can be found in this article. The new versions are referred to as version 7.1. |
||
|
gs://fc-aou-datasets-controlled/v7/wgs/short_read/snpindel/clinvar_v7.1/multiMT/hail.mt | WGS_CLINVAR_MULTI_HAIL_PATH |
Hail MT for the ClinVar srWGS joint callset. Multiallelic sites are not split. When using this file in Hail, read directly from the bucket location. Do not attempt to copy it locally. Please see srWGS SNP & Indel Hail MT for more information. |
|
gs://fc-aou-datasets-controlled/v7/wgs/short_read/snpindel/clinvar_v7.1/vcf/ | WGS_CLINVAR_VCF_PATH |
VCF for the ClinVar srWGS joint callset. This callset is converted from the ClinVar Hail MT and sharded by chromosome into multiple files. Please see srWGS SNP & Indel VCFs for more information. |
|
gs://fc-aou-datasets-controlled/v7/wgs/short_read/snpindel/clinvar_v7.1/splitMT/hail.mt | WGS_CLINVAR_SPLIT_HAIL_PATH | Hail split MT for the ClinVar srWGS joint callset. Multiallelic sites are split into separate records. |
|
gs://fc-aou-datasets-controlled/v7/wgs/short_read/snpindel/clinvar_v7.1/plink_bed/ |
PLINK binary biallelic genotype table (.bed) for the ClinVar srWGS joint callset. Includes .fam, .bim files for usage with the PLINK tool as well. These PLINK triplets are converted from the ClinVar Hail MT. |
|
|
gs://fc-aou-datasets-controlled/v7/wgs/short_read/snpindel/clinvar_v7.1/bgen/ | BGEN files for the ClinVar srWGS joint callset. Contains sample file, Hail index, and bgenix index. | |
|
gs://fc-aou-datasets-controlled/v7/wgs/short_read/snpindel/clinvar_v7.1/bed/ | UCSC BED file used to generate the ClinVar srWGS joint callset. | |
srWGS: CRAM files | gs://fc-aou-datasets-controlled/v7/wgs/cram/manifest.csv | WGS_CRAM_MANIFEST_PATH |
Path to manifest CSV file that contains a row per sample of: person_id,cram_uri,cram_index_uri. We provide CRAM files and CRAM index files with the research ID in the name of the file, with one CRAM file for each WGS sample. See CRAM files for more information. |
srWGS: Structural Variants (SVs) | |||
|
gs://fc-aou-datasets-controlled/v7/wgs/short_read/structural_variants/v7_offcycle/vcf/full/ | SVs for 97,940 srWGS samples. | |
|
gs://fc-aou-datasets-controlled/v7/wgs/short_read/structural_variants/v7_offcycle/vcf/sites-only/ | The sites-only VCF has all variant sites from the 97,940 samples but no genotype information. | |
|
Auxiliary information from the srWGS structural variant calling: | ||
|
gs://fc-aou-datasets-controlled/v7/wgs/short_read/structural_variants/v7_offcycle/aux/aneuploidies/ | We provide lists of samples with probable aneuploidies identified during srWGS SV ploidy estimation as tsv files. There are three separate files for samples with probable aneuploidies: mosaic autosomal aneuploidy, mosaic allosomal aneuploidy, and germline allosomal aneuploidy. | |
|
gs://fc-aou-datasets-controlled/v7/wgs/short_read/structural_variants/v7_offcycle/aux/relatedness | We provide a list of the maximal set of unrelated samples in the srWGS SV cohort. The samples are reported in a txt file as a list of research IDs. One research ID is listed per line and there is no header in the file. | |
srWGS PGx Haplotype Calls | The following file paths are related to work done in the Featured Workspace "Demo - Pharmacogenomics (PGx) variant frequency and medication exposures" | ||
|
gs://fc-aou-datasets-controlled/v7/demo-project-files/pgx-calls/ | PGx haplotype calls completed using srWGS for 15 genes and gene regions. Haplotype calls are available for Stargazer v2.0.1 and PharmCAT v2.4.0. | |
|
gs://fc-aou-datasets-controlled/v6/demo-project-files/pgx-calls/ | PGx haplotype calls completed using srWGS for 15 genes and gene regions. Haplotype calls are available for Stargazer v2.0.0 and PharmCAT v2.2.1. | |
Long read whole genome sequencing (lrWGS) |
Single sample and joint called raw data, variant data, and auxiliary files. All single sample files are listed in the manifest, described in Appendix 1. All other joint called and joint auxiliary files are listed below.
|
||
|
gs://fc-aou-datasets-controlled/v7/wgs/long_read/manifest.csv | LONG_READS_MANIFEST_PATH |
The lrWGS manifest contains file locations for all single sample lrWGS raw data, variant data, and auxiliary data. See Appendix 1 for the description of each file.
Path to manifest CSV file that contains a row per sample of: Research_id,hifiasm-primary-asm-fasta,hifiasm-hap1-asm-fasta,hifiasm-hap2-asm-fasta,hifiasm-primary-asm-gfa,hifiasm-hap1-asm-gfa,hifiasm-hap2-asm-gfa,hifiasm-quast-report-html,hifiasm-quast-report-summary,chm13v2.0-pav-vcf,chm13v2.0-pav-tbi,chm13v2.0-bam,chm13v2.0-bai,chm13v2.0-pbi,chm13v2.0-haplotagged-bam,chm13v2.0-haplotagged-bai,chm13v2.0-deepvariant-vcf,chm13v2.0-deepvariant-tbi,chm13v2.0-deepvariant-phased-vcf,chm13v2.0-deepvariant-phased-tbi,chm13v2.0-pbsv-vcf,chm13v2.0-pbsv-tbi,chm13v2.0-sniffles-vcf,chm13v2.0-sniffles-tbi,chm13v2.0-sniffles-snf,grch38-pav-vcf,grch38-pav-tbi,grch38-bam,grch38-bai,grch38-pbi,grch38-haplotagged-bam,grch38-haplotagged-bai,grch38-deepvariant-vcf,grch38-deepvariant-tbi,grch38-deepvariant-phased-vcf,grch38-deepvariant-phased-tbi,grch38-pbsv-vcf,grch38-pbsv-tbi,grch38-sniffles-vcf,grch38-sniffles-tbi,grch38-sniffles-snf |
|
gs://fc-aou-datasets-controlled/v7/wgs/long_read/hail.mt/GRCh38/ |
WGS_LONGREADS_HAIL_GRCH38_PATH |
Hail MT for the lrWGS joint callset for SNPs and Indels called to the grch38_noalt reference. When using this file in Hail, read directly from the bucket location. Do not attempt to copy it locally. |
|
gs://fc-aou-datasets-controlled/v7/wgs/long_read/hail.mt/T2T/ | WGS_LONGREADS_HAIL_T2T_PATH |
Hail MT for the lrWGS joint callset for SNPs and Indels called to the T2Tv2.0 reference. When using this file in Hail, read directly from the bucket location. Do not attempt to copy it locally.
Note: The joint-called Hail MT (T2Tv2.0) was updated to version 7.1 on 2/8/24. |
|
gs://fc-aou-datasets-controlled/v7/wgs/long_read/joint_vcf/GRCh38/ |
WGS_LONGREADS_JOINT_SNP_INDEL_VCF_GRCH38_PATH |
Joint called lrWGS SNP and Indel VCF against the grch38_noalt reference. TBI index file accompanying the VCF is also provided. |
|
gs://fc-aou-datasets-controlled/v7/wgs/long_read/joint_vcf/T2T/ |
WGS_LONGREADS_JOINT_SNP_INDEL_VCF_T2T_PATH |
Joint called lrWGS SNP and Indel VCF against the T2T-CHM13-v2.0 reference. TBI index file accompanying the VCF is also provided. |
|
gs://fc-aou-datasets-controlled/v7/wgs/long_read/aux/auxiliary_metrics.GRCh38.tsv | Auxiliary file holding metric values for each sample, for the grch38_noalt reference. Described in the lrWGS variant metrics. | |
|
gs://fc-aou-datasets-controlled/v7/wgs/long_read/aux/auxiliary_metrics.T2T.tsv | Auxiliary file holding metric values for each sample, for the T2Tv2.0 reference; each metric value is a floating point value. Described in the lrWGS variant metrics. | |
Array: single sample VCFs | gs://fc-aou-datasets-controlled/v7/microarray/vcf/manifest.csv |
MICROARRAY_VCF_MANIFEST_PATH |
Path to manifest CSV file that contains a row per sample of: person_id,vcf_uri,vcf_index_uri. One VCF per participant sample. Please see Array VCFs for more information. |
Array: all samples Hail MT | gs://fc-aou-datasets-controlled/v7/microarray/hail.mt_v7.1 | MICROARRAY_HAIL_STORAGE_PATH |
Hail MT of the array samples in this release. All of the samples have been merged into a single matrix table. Please see Array MatrixTable for more information. |
Array: all samples PLINK files | gs://fc-aou-datasets-controlled/v7/microarray/plink_v7.1/arrays.* |
PLINK binary merged representation of the microarray samples in this release (.bed). Includes .fam and .bim files for usage with the PLINK tool as well. Please see Array PLINK 1.9 data for more information. |
|
Array: IDAT files |
gs://fc-aou-datasets-controlled/v7/microarray/idat/manifest.csv |
MICROARRAY_IDAT_MANIFEST_PATH |
Path to manifest.csv file that contains a row per sample of: person_id,green_idat_uri,red_idat_uri. Two IDAT files per array sample with the research ID in the name of the file. Please see IDAT files for more information. |
Known Issues: lists of samples associated with known issues |
gs://fc-aou-datasets-controlled/v7/known_issues/wgs_v7_not_in_cdr_known_issue_1.tsv gs://fc-aou-datasets-controlled/v7/known_issues/array_v7_not_in_cdr_known_issue_1.tsv
gs://fc-aou-datasets-controlled/v7/wgs/short_read/structural_variants/v7_offcycle/aux/known_issues/AoU_srWGS_SV.v7_offcycle.not_in_cdr_known_issue_1.txt gs://fc-aou-datasets-controlled/v7/known_issues/research_id_v7_array_known_issue_2.tsv gs://fc-aou-datasets-controlled/v7/known_issues/research_id_v7_wgs_known_issue_2.tsv
gs://fc-aou-datasets-controlled/v7/known_issues/research_id_v7_array_known_issue_14.tsv
gs://fc-aou-datasets-controlled/v7/known_issues/research_id_v7_array_known_issue_15.tsv gs://fc-aou-datasets-controlled/v7/known_issues/research_id_v7_wgs_known_issue_15.tsv |
Each file contains a list of sample IDs associated with known issues. For more information, please see All Of Us Release Genomic Quality Report (Known Issues #1, 2, 3, 14, and 15). |
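Several assets in the table above are manifest CSV files with one row per sample. The sketch below shows one possible way to resolve the srWGS CRAM manifest location and turn manifest text into a lookup keyed by person_id. It assumes the manifest has a header row with exactly the column names listed above. The sample rows and bucket names are invented purely to exercise the parser, and the function names are hypothetical.

```python
import csv
import io
import os

# Documented v7 location of the srWGS CRAM manifest (from the table above).
DEFAULT_CRAM_MANIFEST = "gs://fc-aou-datasets-controlled/v7/wgs/cram/manifest.csv"


def manifest_path():
    """Prefer the environment variable, else fall back to the documented path."""
    return os.environ.get("WGS_CRAM_MANIFEST_PATH", DEFAULT_CRAM_MANIFEST)


def cram_uris_by_person(manifest_text):
    """Map person_id -> (cram_uri, cram_index_uri) from manifest CSV text."""
    # Assumes a header row naming the three documented columns.
    reader = csv.DictReader(io.StringIO(manifest_text))
    return {
        row["person_id"]: (row["cram_uri"], row["cram_index_uri"])
        for row in reader
    }


# Hypothetical two-row manifest; IDs and URIs are made up for illustration.
SAMPLE = (
    "person_id,cram_uri,cram_index_uri\n"
    "1000001,gs://example-bucket/wgs_1000001.cram,gs://example-bucket/wgs_1000001.cram.crai\n"
    "1000002,gs://example-bucket/wgs_1000002.cram,gs://example-bucket/wgs_1000002.cram.crai\n"
)

if __name__ == "__main__":
    lookup = cram_uris_by_person(SAMPLE)
    print(manifest_path())
    print(lookup["1000001"][0])
```

The same pattern should carry over to the array VCF and IDAT manifests by swapping in their environment variables and column names.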
Appendix 1. lrWGS manifest column descriptions
Reference-free genomic files
We have provided three de novo assemblies, each in two formats (FASTA and GFA), for each sample.
column_name | note |
hifiasm-primary-asm-fasta | Hifiasm primary assembly, in FASTA format. |
hifiasm-hap1-asm-fasta | Hifiasm haplotype-resolved assembly for haplotype-1 (in no particular order), in FASTA format. |
hifiasm-hap2-asm-fasta | Hifiasm haplotype-resolved assembly for haplotype-2 (in no particular order), in FASTA format. |
hifiasm-primary-asm-gfa | Hifiasm primary assembly, in GFA format. |
hifiasm-hap1-asm-gfa | Hifiasm haplotype-resolved assembly for haplotype-1 (in no particular order), in GFA format. |
hifiasm-hap2-asm-gfa | Hifiasm haplotype-resolved assembly for haplotype-2 (in no particular order), in GFA format. |
hifiasm-quast-report-html | An HTML-formatted report on the quality of the three assembly FASTA files; produced by the tool QUAST. |
hifiasm-quast-report-summary | A summary of the QUAST-reported metrics for the three assembly FASTA files. |
Reference-specific genomic files
References used
The DRC pipelines align sequences to two references.
- The grch38_noalt reference contains a subset of contigs from the full GRCh38 reference. Specifically, only primary assembly autosomes (1-22), sex chromosomes (X and Y), mitochondria, human EBV, and random and unplaced contigs are included.
- T2Tv2.0 refers to the T2T CHM13v2.0 reference, retrieved from the T2T consortium's AWS bucket, with the human EBV contig appended.
We have provided two sets of all downstream genomic files for each sample, one for each reference.
Genomic files based on grch38_noalt
Unless otherwise specified, all VCF files are gzipped into .vcf.gz.
column_name | note |
grch38-bam | lrWGS BAM for the sample, aligned to grch38_noalt. |
grch38-bai | The accompanying index for the BAM. |
grch38-pbi | The accompanying PBI index for the BAM. |
grch38-haplotagged-bam | Haplotagged BAM. |
grch38-haplotagged-bai | The accompanying index for the haplotagged BAM. |
grch38-pav-vcf | PAV-generated VCF. |
grch38-pav-tbi | TBI index for the PAV-generated VCF. |
grch38-deepvariant-vcf | PEPPER-Margin-DeepVariant-generated (DV-generated) single sample small variant VCF; a filter of QUAL<40 has been applied. |
grch38-deepvariant-tbi | TBI index for the DV-generated VCF. |
grch38-deepvariant-phased-vcf | DV-generated single sample small variant VCF, phased; a filter of QUAL<40 has been applied. |
grch38-deepvariant-phased-tbi | TBI index for the DV-generated phased VCF. |
grch38-pbsv-vcf | PBSV-generated single-sample SV VCF. |
grch38-pbsv-tbi | TBI index for the PBSV-generated VCF. |
grch38-sniffles-vcf | Sniffles-generated single sample SV VCF. |
grch38-sniffles-tbi | TBI index for the Sniffles-generated VCF. |
grch38-sniffles-snf | Sniffles-2 SNF file for the single sample. |
Genomic files based on T2Tv2.0
column_name | note |
chm13v2.0-bam | lrWGS BAM for the sample, aligned to T2Tv2.0. |
chm13v2.0-bai | The accompanying index for the BAM. |
chm13v2.0-pbi | The accompanying PBI index for the BAM. |
chm13v2.0-haplotagged-bam | Haplotagged BAM. |
chm13v2.0-haplotagged-bai | The accompanying index for the haplotagged BAM. |
chm13v2.0-pav-vcf | PAV-generated VCF. |
chm13v2.0-pav-tbi | TBI index for the PAV-generated VCF. |
chm13v2.0-deepvariant-vcf | PEPPER-Margin-DeepVariant-generated (DV-generated) single sample small variant VCF; a filter of QUAL<40 has been applied. |
chm13v2.0-deepvariant-tbi | TBI index for the DV-generated VCF. |
chm13v2.0-deepvariant-phased-vcf | DV-generated single-sample small variant VCF, phased; a filter of QUAL<40 has been applied. |
chm13v2.0-deepvariant-phased-tbi | TBI index for the DV-generated phased VCF. |
chm13v2.0-pbsv-vcf | PBSV-generated single-sample SV VCF. |
chm13v2.0-pbsv-tbi | TBI index for the PBSV-generated VCF. |
chm13v2.0-sniffles-vcf | Sniffles-generated single-sample SV VCF. |
chm13v2.0-sniffles-tbi | TBI index for the Sniffles-generated VCF. |
chm13v2.0-sniffles-snf | Sniffles-2 SNF file for the single sample. |
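The manifest column names described in this appendix follow a prefix convention: `hifiasm-` for reference-free files, `grch38-` for files based on grch38_noalt, and `chm13v2.0-` for files based on T2Tv2.0. As a small sketch of how that convention might be used when working with the manifest header, the hypothetical helpers below group column names by prefix and find the matching column for the other reference.

```python
from collections import defaultdict

# Prefixes taken from the column names listed in this appendix.
PREFIXES = ("hifiasm-", "grch38-", "chm13v2.0-")


def group_columns(columns):
    """Group manifest column names by their reference or assembly prefix."""
    groups = defaultdict(list)
    for name in columns:
        for prefix in PREFIXES:
            if name.startswith(prefix):
                groups[prefix.rstrip("-")].append(name)
                break
        else:
            # Columns without a known prefix, such as the sample ID column.
            groups["other"].append(name)
    return dict(groups)


def counterpart(column):
    """Return the same file type's column for the other reference, or None."""
    if column.startswith("grch38-"):
        return "chm13v2.0-" + column[len("grch38-"):]
    if column.startswith("chm13v2.0-"):
        return "grch38-" + column[len("chm13v2.0-"):]
    return None


if __name__ == "__main__":
    header = ["Research_id", "hifiasm-primary-asm-fasta", "grch38-bam", "chm13v2.0-bam"]
    print(group_columns(header))
    print(counterpart("grch38-pbsv-vcf"))
```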
Auxiliary files
For each of the two references, we also release one auxiliary file (TSV) that describes each sample in one row. Each entry in the columns is either a string or a numerical value.
column_name | note |
mosdepth_cov | A floating point value describing the mean coverage, computed with mosdepth. |
aligned_frac_bases | A floating point value describing the fraction of bases that are aligned to the reference. |
aligned_num_bases | An integer describing the number of bases aligned to the reference. |
aligned_num_reads | An integer describing the number of aligned reads. |
aligned_read_length_N50 | An integer describing the N50 of aligned read lengths. |
aligned_read_length_median | A floating point value describing the median length of aligned reads. |
aligned_read_length_mean | A floating point value describing the mean length of aligned reads. |
aligned_read_length_stdev | A floating point value describing the standard deviation of aligned read length distribution. |
average_identity | A floating point value describing the mean identity between the reads and reference. |
median_identity | A floating point value describing the median identity between the reads and reference. |
dvp_ft_pass_snp_cnt | The count of SNPs in the DV-generated VCF whose Filter column is PASS. |
pbsv_nonBND_50bpSV_cnt | The count of SVs in the PBSV VCF whose Filter column is PASS, not a BND type, and size >= 50bp. |
snf2_nonBND_50bpSV_cnt | The count of SVs in the Sniffles-2 VCF whose Filter column is PASS, not a BND type, and size >= 50bp. |
In addition, the grch38_noalt version has one extra column from the QC procedures:
column_name | note |
contamination_est | An estimate of the level of cross-individual contamination, as reported by VerifyBAMID2. |
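To make the column types above concrete, here is a minimal sketch of parsing an auxiliary TSV and casting only those columns whose type is stated in the tables. Everything else is left as a string. The sample ID column name (`research_id`) and the values in the sample row are assumptions for illustration, not taken from the released files.

```python
import csv
import io

# Columns documented above as floating point values.
FLOAT_COLUMNS = {
    "mosdepth_cov", "aligned_frac_bases", "aligned_read_length_median",
    "aligned_read_length_mean", "aligned_read_length_stdev",
    "average_identity", "median_identity",
}
# Columns documented above as integers.
INT_COLUMNS = {"aligned_num_bases", "aligned_num_reads", "aligned_read_length_N50"}


def parse_row(row):
    """Cast documented numeric columns; leave all other columns as strings."""
    parsed = {}
    for key, value in row.items():
        if key in FLOAT_COLUMNS:
            parsed[key] = float(value)
        elif key in INT_COLUMNS:
            # Go through float first in case the file stores "123.0".
            parsed[key] = int(float(value))
        else:
            parsed[key] = value
    return parsed


def read_metrics(tsv_text):
    """Parse tab-separated text with a header row into a list of dicts."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return [parse_row(row) for row in reader]


# Hypothetical one-row file; the ID column name and values are invented.
SAMPLE = "research_id\tmosdepth_cov\taligned_num_reads\n1000001\t8.5\t1200000\n"

if __name__ == "__main__":
    print(read_metrics(SAMPLE))
```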