Introduction
The All of Us genomic data includes short read whole genome sequencing (srWGS) data, long read whole genome sequencing (lrWGS) data, and genotyping array (“array”) data. Researchers access this genomic data through the Researcher Workbench (RW) Controlled Tier dataset (i.e. genomic data is not available through the Registered Tier). Bucket locations for accessing the data in analysis notebooks can be found in the Controlled CDR Directory.
Short variants - Single Nucleotide Polymorphisms (SNPs) and Insertions & Deletions (Indels) - are available for srWGS data, lrWGS data, and arrays. Structural variants (SVs) are available for srWGS and lrWGS data. We provide variant data in VariantDataset (VDS), Hail MatrixTable (MT), Variant Call Format (VCF), Binary GEN format (BGEN), and PLINK 1.9 bed/bim/fam triplets. Raw data is available in compressed CRAM or BAM format for the WGS reads and IDAT files for array data. We also provide auxiliary tabular data, such as the joint callset QC flagged samples or related pairs. A summary of the file formats for each data type can be found in Table 1.
In this article, we will summarize the genomic data formats and what information is available in each data type. In some cases, we will refer to other documentation when it describes the data format we deliver. This article assumes a general knowledge of genomics and bioinformatics. For a workspace on getting started with genomic data on the Researcher Workbench, please see the How to Work with All of Us Genomic Data Featured Workspace. We also provide a detailed report on the quality of the genomic data with each release in the All of Us Genomic Data Quality Report available on the User Support Hub.
List of All of Us genomic data
Short-read whole genome sequencing (srWGS) - 414,830
- Sequencing reads in CRAM format, aligned to hg38/GRCh38
- SNP & Indel Variant data
- Hail Variant Dataset (VDS): joint callset across the entire genome
- Exome callset: Hail MT, VCF, PLINK bed, PGEN, BGEN
- ClinVar callset: Hail MT, VCF, PLINK bed, PGEN, BGEN
- ACAF threshold callset: Hail MT, VCF, PLINK bed, PGEN, BGEN - variants that have population-specific AF > 1% or population-specific AC > 100
- Annotated variants: Variant Annotation Table
- Auxiliary Data
Short-read whole genome sequencing structural variants (SVs) - 97,061
- Joint-called SV VCF
- Sites-only SV VCF
- srWGS SV maximal set of unrelated samples
- Unrelated sites-only VCF
- srWGS SV samples with probable aneuploidies
- Sample list
Genotyping array - 447,278
Long-read whole genome sequencing (lrWGS) - 2,800
- 11 cohorts grouped by sequencing facility and platform
- Sequencing reads in BAM format, aligned to grch38_noalt & T2Tv2.0
- De novo assembly in GFA and FASTA format for cohorts with PacBio data
- Variant data for each grch38_noalt & T2Tv2.0 assembly
- Joint SNP & Indel variants for each cohort in GVCF & Hail MT formats
- Single-sample SNP & Indel variants in GVCF format
- Single-sample SVs from PBSV & Sniffles2
- Single-sample PAV variants in VCF format for samples with PacBio data
- Auxiliary sample metrics for both reference versions
Overview of the Genomic Data
The main deliverables of interest in the All of Us Research Program are the genomic variants, which are delivered in multiple data formats in order to meet researchers' various needs.
Table 1 – Deliverables for each genomic data type
Deliverable | srWGS SNP & Indel | Array | srWGS SVs | lrWGS |
Reference version | hg38/GRCh38 reference |
Note: variants are called originally with hg19 reference but they are lifted over before release on RW |
hg38/GRCh38 reference | |
Raw data | CRAM files | IDAT files | CRAM files (same deliverable as srWGS SNP & Indels) |
BAM files for each reference version De novo assembly for PacBio cohorts: primary, alternate, and two chromosome copies in Graphical Fragment Assembly (GFA) format and FASTA format |
Variant data |
Joint-callset variant data for all samples: VDS Smaller callsets: ACAF threshold, exome, ClinVar - in VCF, Hail MT, BGEN, PGEN, PLINK bed formats |
Single sample VCFs (all VCFs have the same variants) Hail MT (merged) PLINK bed files (merged) |
Joint SNP & Indel variants (GVCF & Hail MT format) Single-sample SNP & Indel variants (GVCF) Single-sample PBSV SVs (VCF) Single-sample Sniffles2 SVs (VCF & SNF) Single-sample PAV variants (VCF) - for PacBio cohorts |
|
Auxiliary files |
Annotated variants: (Variant Annotation Table) Pharmacogenomics variant calls (star alleles) |
Genetic ancestry, admixture estimates, pharmacogenomics, and relatedness available for array samples that have srWGS data |
Genetic ancestry, admixture estimates, pharmacogenomics, and relatedness available for srWGS samples based on the srWGS SNP & Indel deliverables Maximal set of unrelated samples |
Genetic ancestry, admixture estimates, pharmacogenomics, and relatedness available for lrWGS samples based on the srWGS SNP & Indel deliverables |
Short-Read Whole Genome Sequencing (srWGS) Data
srWGS CRAM files
We provide raw data for srWGS samples in CRAM format, otherwise known as compressed SAM (sequence alignment map) format. The data are mapped to the hg38/GRCh38 reference. Refer to the All of Us Genomic Data Quality Report for more information on how variant calling was performed on these raw data files.
There is one CRAM file and one CRAM index file for each srWGS sample, and the research ID appears in the file name. The path to each CRAM file is found in the manifest CSV file, which contains one row per sample with the columns person_id, cram_uri, and cram_index_uri.
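As a minimal sketch of how the manifest might be used, the snippet below parses manifest-shaped text with Python's standard csv module and looks up the CRAM and index paths for one participant. The research IDs and gs:// paths are made-up placeholders, not real bucket locations; only the three column names come from the description above.

```python
import csv
import io

# Hypothetical manifest content. Real paths are listed in the Controlled CDR Directory.
MANIFEST_TEXT = """person_id,cram_uri,cram_index_uri
1000001,gs://example-bucket/cram/wgs_1000001.cram,gs://example-bucket/cram/wgs_1000001.cram.crai
1000002,gs://example-bucket/cram/wgs_1000002.cram,gs://example-bucket/cram/wgs_1000002.cram.crai
"""


def load_manifest(handle):
    """Return {person_id: (cram_uri, cram_index_uri)} from a manifest CSV."""
    reader = csv.DictReader(handle)
    return {
        row["person_id"]: (row["cram_uri"], row["cram_index_uri"])
        for row in reader
    }


manifest = load_manifest(io.StringIO(MANIFEST_TEXT))
cram_uri, cram_index_uri = manifest["1000001"]
print(cram_uri)
print(cram_index_uri)
```

In a notebook, the same function could be pointed at an open file handle for the real manifest instead of the in-memory placeholder text.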
The raw data is more expensive to use than the variant data because you must pay egress charges, which are the costs to retrieve the data from the cloud for analysis. We do not charge egress for variant data. Please see the Genomics FAQ for Recommendations for processing CRAMs with GATK on the Researcher Workbench.
srWGS SNP & Indel variant data
The srWGS SNP & Indel dataset is joint-called and delivered as a complete callset in VariantDataset (VDS) format, which is a Hail data storage format for large datasets. Hail MT, VCFs, and PLINK files are available for all samples over limited regions, including the exome, ClinVar variants, and common variants within each genomic ancestry group. For further information about the Hail MT, VCF, and PLINK files, please see Smaller Callsets for Analyzing Short Read WGS SNP & Indel Data with Hail MT, VCF, and PLINK.
VariantDataset (VDS)
The Hail VariantDataset (VDS) is the data storage format we use for the All of Us srWGS SNP & Indel variant data. Because the All of Us callset is one of the largest in the world, we use the VDS to store variant data efficiently for all samples over the entire genome. The VDS is a sparse Hail data storage format: it stores variant calls and compact reference blocks rather than an entry for every sample at every site. As a comparison, the Hail MT is a dense variant storage format with every entry populated. For an overview of the VDS, check out ‘The new VDS format for All of Us srWGS data’ article.
If possible, we recommend that researchers use the smaller callsets for their analysis to save time and money. Most downstream analyses of the VDS involve filtering and converting the VDS into a VCF, Hail MT, or other dense format (“densifying”). We have performed this step already to cover most use cases with reduced srWGS SNP & Indel variant datasets in VCF, Hail MT, BGEN, and PLINK bed formats over commonly used areas of the genome (see Genomics FAQ: Smaller callsets for analyzing srWGS SNP & Indel data with Hail MT, VCF, and PLINK).
Instructions for densifying the VDS are available in the article ‘The new VDS format for All of Us srWGS data’ and the Manipulate Hail VariantDataset tutorial notebook.
In the following sections, we describe how the VDS stores variant data, reference data, and how to determine if a variant site is filtered.
Variant Data
The VDS uses variant level row fields to store data for all samples, including the variant locus (locus), a list of alternate alleles (alleles), and site level filtering data (filters). Local fields store data that only apply to a single sample, including genotype metadata and genotype filtering. The local alleles (LA) array maps the alleles that appear in the individual sample to the list of alternate alleles (alleles), thus genotype metadata is only stored for samples with the genotype.
Some familiar annotations from a VCF or Hail MT are not present in the VDS, but can be rendered when densifying the VDS. The allele count for each alternate allele (AC), the total number of alleles at each site (AN), and the frequency of each alternate allele (AF) are also stored in the Variant Annotation Table (VAT) for all variants that pass filtering.
Tables 2-5 describe the fields in the All of Us VDS. Please see the Hail documentation for more information on the Hail data types.
Table 2. VDS column fields: stores sample name
VDS Field | Description | Hail data type |
s | Research ID | str |
Table 3. VDS row fields: stores variant data
VDS Field | Description | Hail data type | Example |
locus | Positional data for the variant. Formatted as chromosome name and position separated by colon. | locus<GRCh38> | chr1:12807 |
alleles | List of alleles at a locus for all samples (otherwise known as global alleles). The first allele is the reference allele. All the alternate alleles are then listed in alphabetical order. | array<str> | [“C”, “T”] |
filters | Site level filtering information. Hard threshold filters include EXCESS_ALLELES, NO_HQ_GENOTYPES, LowQual, and ExcessHet. If no filtering reason is provided or there is a PASS, then the site has passed filtering. | set<str> | {“LowQual”, “NO_HQ_GENOTYPES”} |
as_vets | Variant Extract-Train-Score Filtering model information for this site. Does not contain information about whether or not the site was filtered. We recommend that most users ignore this field and look at filters for useful filtering information. | dict<str, struct { model: str, vqslod: float64, yng_status: str }> | {“A”:(“SNP”,-1.15e+00,“G”)} |
Table 4. VDS entry fields: stores genotype level variant data
VDS Field | Description | Hail data type | Example |
GQ | Genotype Quality. Follows VCF description. | int32 | 63 |
RGQ | Reference Genotype Quality. Follows VCF description. | int32 | 101 |
LGT | Local genotype. The coordinates map to LA. LA always includes the reference allele so the call can be [0/1], [1/1], or [1/2]. | call | [1/1] |
LAD | Local allele depth, describes the allele depth for one sample. Maps to the alleles described in the local alleles (LA) array. See VCF description. | array<int32> | [0,8] |
LA | Local alleles. The reference allele and allele(s) that appear in the sample are listed as coordinates mapping to the global alleles array. The reference coordinate is always included. | array<int32> | [0,1] |
FT | Boolean containing genotype level filtering. True for PASS, False for FAIL, and NA for (.). In most cases, NA should be treated as PASS. The filtering reason is not provided. | bool | True |
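The relationship between the local fields and the row-level alleles list can be illustrated with a small pure-Python sketch. It does not use Hail; the function name and the example site are hypothetical, and the only behavior taken from the tables above is that LGT indexes into LA, and LA indexes into the global alleles array.

```python
def local_to_global_genotype(global_alleles, la, lgt):
    """Translate a local genotype (LGT) into global allele indices and bases.

    global_alleles: the row-level `alleles` list (reference allele first)
    la: the sample's local alleles array (indices into global_alleles)
    lgt: the local genotype as a pair of indices into `la`
    """
    global_indices = [la[i] for i in lgt]
    bases = [global_alleles[g] for g in global_indices]
    return global_indices, bases


# Illustrative site with three alternate alleles; this sample carries only the third.
alleles = ["C", "A", "G", "T"]
la = [0, 3]   # reference plus global allele 3
lgt = (0, 1)  # heterozygous in local coordinates

indices, bases = local_to_global_genotype(alleles, la, lgt)
print(indices, bases)  # [0, 3] ['C', 'T']
```

The same index mapping applies to LAD: each depth in LAD belongs to the global allele at the matching position in LA.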
Table 5. VDS global fields: filtering metadata for the entire callset
Note: These fields report metadata of the filtering model. See the row filter field filters or entry genotype field FT to see whether a variant did not meet the threshold reported in these fields.
VDS Field | Description | Hail data type |
truth_sensitivity_snp_threshold | SNP sensitivity threshold | float64 |
truth_sensitivity_indel_threshold | Indel sensitivity threshold | float64 |
Reference Data
The VDS also stores reference data for each sample as reference blocks in a separate component table reference_data. The row key is the locus and the ref_allele denotes the reference base at the genomic coordinate. Columns are keyed by the sample ID. No data at a particular location indicates that the sample has a variant call.
Table 6. VDS reference data column fields: stores sample name
VDS Field | Description | Hail data type |
s | Research ID | str |
Table 7. VDS reference data row fields: stores reference data
VDS Field | Description | Hail data type | Example |
locus | Positional data for the variant. Formatted as chromosome name and position separated by colon. | locus<GRCh38> | chr1:10029 |
ref_allele | The reference allele at the genomic coordinate | str | “A” |
Table 8. VDS reference data entry fields: stores reference blocks
VDS Field | Description | Hail data type | Example |
GQ | Genotype Quality. Follows VCF description. | int32 | 40 |
END | Indicates the end of the reference block, which is the group of consecutive non-variant sites that have the same genotype quality. All coordinates between the start locus and the end coordinate are called as reference for the sample. | int32 | 10036 |
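To make the reference block idea concrete, here is a rough pure-Python sketch of looking up whether a position falls inside one of a sample's reference blocks. The tuple layout and function name are hypothetical, and the sketch assumes that END is inclusive; it does not read the actual reference_data table.

```python
def reference_gq_at(blocks, position):
    """Return the GQ of the reference block covering `position`, or None.

    blocks: list of (start, end, gq) tuples for one sample on one chromosome,
            where start is the block's locus position and end is its END value
            (assumed inclusive).
    """
    for start, end, gq in blocks:
        if start <= position <= end:
            return gq
    return None


# Hypothetical blocks for one sample on chr1; the first reuses the table examples.
sample_blocks = [(10029, 10036, 40), (10040, 10100, 20)]

print(reference_gq_at(sample_blocks, 10030))  # 40
print(reference_gq_at(sample_blocks, 10038))  # None
```

A result of None means no reference block covers the position, which is the case where the variant data should be consulted for that sample.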
Filtering Information
The variant filtering data is represented in two fields in the VDS, filters and the FT field (Table 3, Table 4). The filters field contains site level filters, including EXCESS_ALLELES, NO_HQ_GENOTYPES, LowQual, and ExcessHet. If no filtering reason is provided or the filters field contains PASS, then the site has passed filtering. The FT field contains genotype level filtering. The genotype level filtering reasons are not specified in the All of Us VDS; instead, a boolean describes the filtering status for the genotype, where True is PASS and False is FAIL. If all genotypes fail at a site, the True or False boolean can also apply to the filters field. The variant filtering process is described in depth in the QC report. All filtered variants are soft filtered, which means the variants will be marked but not removed from the callset.
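The pass/fail rules described here can be written down as a short pure-Python sketch. The function names are hypothetical and this is not the tutorial notebook's code; it only encodes the rules stated in this article, including the recommendation that a missing FT value be treated as PASS in most cases.

```python
def site_passes(filters):
    """A site passes if no filter reason is recorded or PASS is present."""
    return not filters or "PASS" in filters


def ft_to_vcf_string(ft):
    """Map the FT boolean to a VCF-style string.

    True -> PASS, False -> FAIL, None (NA) -> PASS, following the guidance
    that NA should be treated as PASS in most cases.
    """
    return "FAIL" if ft is False else "PASS"


print(site_passes(set()))                           # True
print(site_passes({"LowQual", "NO_HQ_GENOTYPES"}))  # False
print(ft_to_vcf_string(None))                       # PASS
```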
We provide a tutorial notebook for converting VDS to a Hail MT format, including code to transform the FT boolean True or False in the VDS to PASS or FAIL so that it is compatible for converting to a VCF.
srWGS SNP & Indel smaller callsets
We released the srWGS SNP and Indel callset in familiar data formats over limited genomic regions: VCF, Hail MT, BGEN, and PLINK bed formats. The smaller callsets, described in Smaller callsets for analyzing srWGS SNP & Indel data with Hail MT, VCF, and PLINK, cover regions of the genome that are popular for All of Us researchers: an Allele Count/Allele Frequency (ACAF) threshold callset, an exome callset, and a ClinVar callset. We recommend that you stick with these premade smaller callsets instead of using the VDS, if possible, to save time and money.
The ACAF threshold callset contains variants that have a population-specific allele frequency (AF) greater than 1% or a population-specific allele count (AC) greater than 100 in any computed ancestry subpopulations. The exome callset contains variants that are within the exon regions of the Gencode v42 basic transcripts, with padding of 15 bases on either side of each exon. The ClinVar callset contains variants in ClinVar, regardless of pathogenicity.
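The ACAF inclusion rule can be expressed as a small check. This is only an illustrative sketch: the function name, the dictionary layout, and the ancestry labels in the example are assumptions, while the thresholds (AF greater than 1% or AC greater than 100 in any subpopulation) come from the description above.

```python
def passes_acaf(per_population, af_threshold=0.01, ac_threshold=100):
    """Return True if any subpopulation has AF > 1% or AC > 100.

    per_population: dict mapping an ancestry label to an (AC, AF) tuple.
    """
    return any(
        af > af_threshold or ac > ac_threshold
        for ac, af in per_population.values()
    )


# Hypothetical per-population counts for two variants.
common_by_count = {"afr": (150, 0.004), "eur": (3, 0.0001)}
too_rare = {"afr": (100, 0.01), "eur": (2, 0.0001)}

print(passes_acaf(common_by_count))  # True
print(passes_acaf(too_rare))         # False
```

Both comparisons are strict, so a variant sitting exactly at AC = 100 and AF = 1% is not included under this reading of the rule.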
The complete srWGS SNP and Indel callset across all sites is released as a VDS, which is a Hail sparse data format. We provide a tutorial notebook for converting VDS to a Hail MT format, though we recommend that you stick with the premade Hail MT, if possible, to save time and money.
srWGS Hail MT
We provide two Hail MTs for each smaller callset, a multiallelic Hail MT and a multiallelic split Hail MT, resulting in six total Hail MT deliverables for the srWGS SNP and Indel callset. In the multiallelic split MT, sites with multiple alternate alleles will be split, so each row will only have one alternate allele. In the multiallelic MT, sites with multiple alternate alleles will be retained in the same row.
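The difference between the two layouts can be seen in a toy sketch that splits one multiallelic row into one row per alternate allele. The function name and example site are hypothetical, and the sketch only handles the site-level row; genotype recoding and allele trimming, which real splitting tools also perform, are left out.

```python
def split_multiallelic(locus, alleles):
    """Split a row with several alternate alleles into biallelic rows.

    alleles: list with the reference allele first, followed by the alternates.
    Returns a list of (locus, [ref, alt]) rows, one per alternate allele.
    """
    ref, *alts = alleles
    return [(locus, [ref, alt]) for alt in alts]


rows = split_multiallelic("chr1:12807", ["C", "A", "T"])
for row in rows:
    print(row)
# ('chr1:12807', ['C', 'A'])
# ('chr1:12807', ['C', 'T'])
```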
When using Hail MT files in the Researcher Workbench, read directly from the bucket location. Do not attempt to copy them locally.
srWGS VCF
The srWGS limited callset VCFs are sorted and block compressed in bgz format (.vcf.bgz) with a local tabix index (.vcf.bgz.tbi). Each VCF is split into multiple non-overlapping sections of the genome by chromosome in separate files for usability (sharding).
Please note that we recommend using the FILTER column and the filter tag (FT) to determine the filtering status of a variant because the QUAL information is not available.
FORMAT fields (per sample-site):
- Genotype (GT) -- The GT field specifies the alleles carried by the sample, encoded by a 0 for the reference (REF) allele, 1 for the first alternative (ALT) allele, 2 for the second ALT allele, etc. Since humans are diploid organisms, we expect two alleles (e.g. “0/1”). Please note that the GT calls on sex chromosomes will have two alleles, even in the case of chrY and chrX in males.
- Allelic Depth (AD) -- Allelic depths for the reference allele and the alternate allele(s) present at this site. For more information about AD and which reads are counted, see this article on Allele Depth.
- Genotype Quality (GQ) -- The phred-scaled confidence that the called genotype is correct. A higher score indicates a higher confidence. For more information on GQ, please see the GQ documentation. For more information on interpreting phred-scaled values, please see Phred-scaled quality scores.
- Reference Genotype Quality (RGQ) -- The phred-scaled confidence that the reference genotypes are correct. A higher score indicates a higher confidence. For more information on RGQ, please see the GQ documentation, but note that RGQ applies to the reference, not the variant. For more information on interpreting phred-scaled values, please see Phred-scaled quality scores.
- Genotype Filter (FT) -- The srWGS SNP & Indel genotype-level filtering information. As part of our joint callset quality control processing, we run the Variant Extract-Train-Score (VETS) method, which is a genotype-level filtering algorithm. If the genotype passes, there will be no value in this field. If the genotype fails, the value will be high_CALIBRATION_SENSITIVITY_SNP or high_CALIBRATION_SENSITIVITY_INDEL. An example code snippet for filtering genotypes, in Hail, can be found in the Manipulating Hail Matrix Table tutorial notebook.
- high_CALIBRATION_SENSITIVITY_SNP: Sample Genotype FT filter value indicating that the genotyped allele failed SNP model calibration sensitivity cutoff (0.997)
- high_CALIBRATION_SENSITIVITY_INDEL: Sample Genotype FT filter value indicating that the genotyped allele failed INDEL model calibration sensitivity cutoff (0.99)
INFO fields (per site):
Descriptions of the INFO fields can also be found in the header of the VCF.
- Allele Count (AC) -- the number of times we see each alternate allele for all samples. For example, a “1/1” genotype would count as 2 observations of the first alternate allele.
- Allele Number (AN) -- the total number of alleles seen. Usually, this will be the number of samples times two, since humans are diploid organisms. No-call genotypes (“./.”) are not counted towards AN.
- Allele Frequency (AF) -- the frequency of the alternate allele in the population that is the callset cohort. This is equivalent to AC/AN.
- QUAL approximation (QUALapprox) -- the sum of the phred-scaled homozygous reference probability values across all samples, which is a proxy for the site-level QUAL score, but without the SNP or indel heterozygosity applied as a per-site prior probability of variation.
- Allele-specific QUAL approximations (AS_QUALapprox) -- a per-allele, phred-scaled quality score derived from the sum of homozygous reference probability values across samples when each allele is considered in isolation. This is an approximation of the QUAL score for each allele. For more information on the QUAL score, see the VCF specification.
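The AC, AN, and AF definitions above can be worked through with a short pure-Python sketch that counts alleles from GT strings. The function name and the example genotypes are hypothetical; the counting rules (a “1/1” genotype counts twice, no-calls are excluded from AN, AF = AC/AN) follow the field descriptions.

```python
def allele_stats(genotypes, n_alt):
    """Compute (AC, AN, AF) from a list of GT strings such as "0/1".

    n_alt: number of alternate alleles at the site.
    No-call alleles (".") are not counted towards AN.
    """
    ac = [0] * n_alt
    an = 0
    for gt in genotypes:
        for allele in gt.replace("|", "/").split("/"):
            if allele == ".":
                continue
            an += 1
            index = int(allele)
            if index > 0:
                ac[index - 1] += 1
    af = [count / an if an else None for count in ac]
    return ac, an, af


ac, an, af = allele_stats(["0/1", "1/1", "./.", "0/0"], n_alt=1)
print(ac, an, af)  # [3] 6 [0.5]
```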
FILTER values (per site):
- QUAL score does not meet threshold (LowQual) -- sites with this filter have a posterior probability of being variant that is equal to or below the probability of being variant by chance, represented by the expected heterozygosity for humans (QUALapprox lower than 60 for SNPs; 69 for Indels)
- QUAL tells you how confident we are that there is some kind of variation at a given site. The variation may be present in one or more samples.
- No high-quality genotypes (NO_HQ_GENOTYPES) -- sites with this filter do not have any genotypes that are considered high quality (GQ>=20, DP>=10, and AB>=0.2 for heterozygotes)
- Allele Balance (AB) is calculated for each heterozygous variant as the number of reads supporting the least-represented allele over the total number of read observations. In other words, min(allele depth)/(total depth) for diploid GTs.
- Excess Heterozygosity (ExcessHet) -- sites with this filter have more heterozygote genotypes than expected by chance under Hardy-Weinberg equilibrium. ExcessHet is a phred-scaled p-value. We cut off anything more extreme than a z-score of -4.5 (p-value of 3.4e-06), which is 54.69 when phred-scaled.
- Excess alleles (EXCESS_ALLELES) -- sites with this filter have an excess of alternate alleles. Our cutoff is 100: when a site has more than 100 alternate alleles, this filter will be present.
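Two of the calculations behind these filters can be checked with a few lines of Python: the phred scaling of the ExcessHet p-value cutoff, and the high-quality genotype test (GQ>=20, DP>=10, AB>=0.2 for heterozygotes). This is a sketch, not the pipeline's code. The function names are hypothetical, and depth is approximated here as the sum of the allelic depths, which is an assumption.

```python
import math


def phred(p_value):
    """Phred-scale a probability: -10 * log10(p)."""
    return -10 * math.log10(p_value)


def allele_balance(ad):
    """min(allele depth) / (total depth) for a diploid heterozygous genotype."""
    total = sum(ad)
    return min(ad) / total if total else 0.0


def is_high_quality(gq, ad, is_het):
    """Apply the GQ, DP, and (for heterozygotes) AB thresholds listed above."""
    depth = sum(ad)  # assumption: DP approximated by the sum of allelic depths
    if gq < 20 or depth < 10:
        return False
    return allele_balance(ad) >= 0.2 if is_het else True


print(round(phred(3.4e-06), 2))                       # 54.69
print(is_high_quality(gq=63, ad=[9, 3], is_het=True))   # True, AB = 0.25
print(is_high_quality(gq=63, ad=[19, 1], is_het=True))  # False, AB = 0.05
```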
PLINK 1 binary biallelic genotype table (PLINK bed)
We provide PLINK 1.9 data (.bed, .bim, .fam) for the srWGS SNP and Indel smaller callsets. The PLINK files are converted from the Hail MT using the export_plink command in Hail and contain all information in the Hail MT. PLINK file type information can be found at the PLINK site. The .bed file is the PLINK binary biallelic genotype table and contains genotype calls. The .bim file is the PLINK extended .map file, and is a text file containing variant information. The .fam file is a text file with sample information for each participant. Please refer to the published notebooks on how to use the PLINK 1.9 data.
PLINK 2 binary genotype table (PGEN)
We provide PLINK 2 data (.pgen, .psam, .pvar) for the srWGS SNP and Indel smaller callsets. The PLINK files are converted from the smaller callset VCFs. The .pgen file is the file containing genotype calls. It is accompanied by a .pvar and a .psam file. Please see the PLINK documentation for more details. The .pvar is a text file containing variant information to accompany the .pgen file. The .psam file is a text file containing sample information.
Binary GEN format (BGEN)
We have released the srWGS SNP and Indel smaller callsets in Binary GEN format (BGEN). The files are sharded by chromosome and only contain hard calls, which are calls with probability values of 0.0 or 1.0. Please see the BGEN documentation for more information about this format.
srWGS SNP & Indel smaller callset BED files
We provide the genomic territory, otherwise known as interval files, used to create the srWGS SNP and Indel smaller callsets as UCSC BED files. The BED files contain the genomic regions for the exome, ACAF threshold, and ClinVar callsets.
Annotated Variants - Variant Annotation Table (VAT)
The Variant Annotation Table (VAT) is a resource provided for all samples with srWGS SNP & Indel data. The VAT gives functional annotations for all passing variants, meaning variants that pass both site-level (filters) and genotype-level (FT) filtering. The VAT also contains site-level annotations such as allele counts for each alternate allele (AC), the total number of alleles at each site (AN), and the frequency of each alternate allele (AF). These site-level annotations are not in the VDS, so the VAT can be used alongside the VDS to identify variants of interest for your analysis. We provide the annotations as a single merged TSV file (“.tsv.bgz”), which can be loaded into Hail. Please read the Variant Annotation Table article for more information.
srWGS auxiliary data
srWGS genetic predicted ancestry
We provide genetic ancestry groupings for all samples with srWGS data as a .tsv file, sorted by research ID. Genetic ancestry is inferred by measuring the genetic similarity of each participant to global reference populations. We compute these categorical groupings of genetic similarity to reference populations using harmonized continental metadata labels from the Human Genome Diversity Project (HGDP) and 1000 Genomes Project training data (N=3,942) for all srWGS samples in All of Us. Please see the All of Us Genomic Data Quality Report Appendix G for more information.
As genetic similarity is continuous, the groupings of the genetic similarity categories presented here are used to highlight genetic similarity between individuals to aid in variant classification and risk. The categories are based on the labels used in gnomAD, the HGDP, and 1000 Genomes. We use the following acronyms or terms to describe genetic similarity to a reference population: 1KGP-HGDP-AFR-like (AFR or African); 1KGP-HGDP-AMR-like (AMR or Americas); 1KGP-HGDP-EAS-like (EAS or East Asian); 1KGP-HGDP-EUR-like (EUR or European); 1KGP-HGDP-MID-like (MID or Middle Eastern); 1KGP-HGDP-SAS-like (SAS or South Asian); and remaining individuals who do not belong to one of the other ancestries or are admixed (OTH or remaining individuals).
We provide the genetic ancestry groupings as a .tsv file along with a plot of the ancestry predictions (html file). The PCA analysis was performed using Hail's hwe_normalized_pca method. In order to allow researchers to reproduce these files and also apply our method for predicting genetic ancestry groupings on their own data, we also provide a set of files we used to predict genetic ancestry, described as follows:
- Loadings file: captures how each genetic variant contributes to the principal components (PCs). The file can be used to project an individual’s genetic data on the same PCA space as the one used for the ancestry prediction
- Eigenvalues of the PCs: the eigenvalues represent the amount of genetic variation each PC explains.
- Classifier .pkl file: contains the trained ancestry prediction model.
- Training PCA: the genetic ancestry groupings of the training data (1000 Genomes and HGDP).
- Sites-only VCF: a sites-only VCF of the locations we used for training the ancestry predictions classifier (which is described as the HQ sites in the QC report, Appendix H). The VCF is block compressed and accompanied by a TBI index.
Table 9. srWGS genetic predicted ancestry TSV file description
Field Name | Key? | Type | Nullable? | Example Value | Notes |
research_id | yes | String | No | 1000055 | This comes from sample metadata. |
ancestry_pred | no | String | No | mid | The predicted ancestry for the sample, not including “other.” |
probabilities | no | Array[number] | No | [0.10, 0.99, 0.001, … 0.0] | Confidence of each output class (i.e. computed ancestry). Each will have a length equal to the number of possible computed ancestry labels minus one (6). The ancestry “Other” is computed separately based on the confidence of the other classes. |
pca_features | no | Array[number] | No | [8.1232, 0.01234, 3.1123, …, 0.00132] | The principal components of the projection for the sample. Each value is an array with a length of 16. |
ancestry_pred_other | no | String | No | oth | The predicted ancestry for the sample, including “other.” |
Column Explanations:
- Field name -- The name of the field. In tsv files, this will appear on the first row of the file.
- Type -- Data type. Arrays are possible.
- Key? -- Whether this field makes up a unique key for the row. Note that all key fields together make a unique key for the row.
- Notes -- Any other relevant information.
srWGS genetic admixture estimates
We provide genetic ancestry admixture estimates for all samples with srWGS data in .Q and FAM file formats. The analysis was performed with the Rye tool and the output file descriptions can be found in the Rye documentation.
The .Q file contains columns with the ancestry groups used in the training data and the rows are admixture estimates for each sample. The ancestry group labels that we use are 1KGP-HGDP-AFR-like (AFR), 1KGP-HGDP-AMR-like (AMR), 1KGP-HGDP-EAS-like (EAS), 1KGP-HGDP-EUR-like (EUR), 1KGP-HGDP-MID-like (MID), 1KGP-HGDP-SAS-like (SAS), and Remaining Individuals (OTH). We also provide the reference admixture estimates.
The .fam file contains the information for how each individual mapped to the training data.
Please note: The genetic admixture estimates for individuals with American ancestry may not be fully captured due to lack of appropriate samples from publicly available reference genome datasets (1KGP-HGDP in this case) to account for the full range of diversity within this group. This inaccuracy may also exist for other global populations where there is limited reference data available, such as the Middle Eastern group. Additionally, the ancestry proportion estimates for the All of Us participants in the 1KGP-HGDP-AMR-like genetic ancestry group are influenced by the presence of admixture within the genomes of 1KGP-HGDP-AMR individuals included in the reference datasets, which affects their accuracy. We advise caution when interpreting these estimates, as they may not fully capture the genetic diversity within the Americas population.
srWGS pharmacogenomics data
The pharmacogenomics auxiliary dataset includes haplotype calls and predicted phenotypes for over 19 genes relevant to human drug metabolism for all samples with srWGS data. Pharmacogenomic haplotype calling is also known as star allele calling. We provide variant data across 19 genes from PharmGKB Tier 1 and Tier 2 lists that are supported by the tool Stargazer. Genes with strong validation data are provided in a set of "high concordance" outputs. Genes that play significant roles in drug metabolism but do not have convincing validation results are included in a set of "low concordance" outputs.
Star allele calls are provided as per-gene .tsv files. The .tsvs contain sample names and gene names and can be concatenated easily, but are provided separately for memory usage considerations.
Cyrius v1.1.1 was run on per-sample CRAM input to call CYP2D6 star alleles and gene copy number. Structural variation nomenclature was harmonized and phenotypes were applied using the cyp2d6_parser package. For all other genes, we ran Stargazer 2.0.2. Stargazer output was post-processed to apply allele function definitions according to CPIC and improve phasing.
High concordance genes: CYP2C_CLUSTER, ABCG2, CACNA1S, CFTR, CYP2C9, CYP2D6, CYP3A5, G6PD, NUDT15, RYR1, TPMT, VKORC1
Low concordance genes: CYP2B6, CYP2C19, CYP4F2, DPYD, SLCO1B1, UGT1A1
srWGS relatedness kinship scores
We calculate relatedness for all samples with srWGS data and report the kinship score of any pair with a score over 0.1. The kinship score is half of the fraction of the genetic material shared. For example, parent-child or full-sibling pairs will have a score of about 0.25, while identical twins will have a score of about 0.5. Please see the Hail pc_relate function documentation for more information, including interpretation.
We provide the kinship scores for pairwise samples with kinship scores above 0.1. We do not provide identity kinship scores (i.e. kinship of a sample with itself). Each pair will only appear once (in other words, {sample1, sample2, 0.25} is equivalent to {sample2, sample1, 0.25}).
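Because each pair appears only once and in no guaranteed order, a lookup that ignores order can be convenient. The sketch below is one way to do that in plain Python; the function names and sample IDs are hypothetical, and the rows stand in for the i.s, j.s, and kin columns of the TSV.

```python
def build_kinship_lookup(rows):
    """Index (i, j, kin) rows by the unordered pair of sample IDs."""
    return {frozenset((i, j)): kin for i, j, kin in rows}


def kinship(lookup, sample_a, sample_b):
    """Return the reported kinship score, or None if the pair is not listed."""
    return lookup.get(frozenset((sample_a, sample_b)))


# Hypothetical rows in the shape of the kinship TSV.
rows = [("1000001", "1000002", 0.25), ("1000003", "1000004", 0.5)]
lookup = build_kinship_lookup(rows)

print(kinship(lookup, "1000002", "1000001"))  # 0.25
print(kinship(lookup, "1000001", "1000003"))  # None
```

A result of None means the pair was not reported, which for this file corresponds to a kinship score at or below the 0.1 reporting threshold.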
Table 10. srWGS pairwise samples with a kinship score over 0.1 TSV file description
Field name | Type | Key? | Notes |
i.s | string | yes | Sample ID of a sample in the pair |
j.s | string | yes | Sample ID of the other sample in the pair |
kin | float | no | Kinship score (0-0.5) |
Column Explanations:
- Field name -- The name of the field. In tsv files, this will appear on the first row of the file.
- Type -- Data type. Arrays are possible.
- Key? -- Whether this field makes up a unique key for the row. Note that all key fields together make a unique key for the row.
- Notes -- Any other relevant information.
srWGS SNP & Indel maximal set of unrelated samples
We provide a list of samples to prune in order to remove related samples from the srWGS SNP & Indel cohort. Relatedness is calculated as described in the kinship score description above. Pruning these samples leaves the maximal independent set of unrelated samples, which minimizes the number of samples that need pruning.
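To give a feel for what pruning related pairs involves, the sketch below uses a simple greedy heuristic: it repeatedly removes the sample with the most remaining relatives until no related pair is left. This is an illustration only, not the method used to produce the released list, and a greedy pass is not guaranteed to find the smallest possible prune set. The function name and sample labels are hypothetical.

```python
from collections import defaultdict


def samples_to_prune(related_pairs):
    """Greedily choose samples to remove so that no related pair remains."""
    neighbors = defaultdict(set)
    for a, b in related_pairs:
        neighbors[a].add(b)
        neighbors[b].add(a)

    pruned = set()
    while True:
        # Count each remaining sample's relatives that have not been pruned yet.
        degrees = {
            sample: len(related - pruned)
            for sample, related in neighbors.items()
            if sample not in pruned
        }
        degrees = {s: d for s, d in degrees.items() if d > 0}
        if not degrees:
            return pruned
        # Remove the most connected sample; sorting makes ties deterministic.
        worst = max(sorted(degrees), key=degrees.get)
        pruned.add(worst)


pairs = [("A", "B"), ("B", "C"), ("D", "E")]
to_prune = samples_to_prune(pairs)
print(sorted(to_prune))  # ['B', 'D']
```

In this toy input, removing B resolves two related pairs at once, so three of the five samples (A, C, and E) are kept.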
Table 11. List of srWGS SNP & Indel related samples to prune TSV file description
Field name | Type | Key? | Notes |
sample_id.s | string | Yes | Research ID of the sample |
Column Explanations:
- Field name -- The name of the field. In tsv files, this will appear on the first row of the file.
- Type -- Data type. Arrays are possible.
- Key? -- Whether this field makes up a unique key for the row. Note that all key fields together make a unique key for the row.
- Notes -- Any other relevant information.
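Applying the prune list amounts to a set difference against your cohort. The sketch below assumes the header shown in Table 11 (sample_id.s); the IDs and the cohort list are hypothetical.

```python
import csv
import io

# Hypothetical file content; the header matches Table 11.
PRUNE_TSV = "sample_id.s\n1000002\n1000004\n"

def load_prune_list(handle):
    """Read the research IDs that should be removed."""
    reader = csv.DictReader(handle, delimiter="\t")
    return {row["sample_id.s"] for row in reader}

def unrelated_subset(cohort_ids, prune_ids):
    """Keep cohort order while dropping every sample on the prune list."""
    return [s for s in cohort_ids if s not in prune_ids]

prune_ids = load_prune_list(io.StringIO(PRUNE_TSV))
cohort = ["1000001", "1000002", "1000003", "1000004"]
print(unrelated_subset(cohort, prune_ids))
```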
Flagged srWGS samples
We provide a table listing samples that are flagged as part of the sample outlier QC for the srWGS SNP and Indel joint callset. This includes the specific residual tests that were failed. The schema is described in the table below. The table will be released as a tsv.
Flagged sample tsv schema
- No fields can have a null value.
- Count fields do not include filtered variants.
- For all of the fail_* fields, a value of true indicates that the sample is an outlier and should be flagged.
Table 12. Flagged srWGS samples TSV file description
Field Name | Type | Key? | Example Value | Notes |
s | int | yes | 1000000 | Research ID |
ancestry_pred | string | no | eur | The predicted ancestry for the sample, not including “other.” |
probabilities | array<float> | no | [0.10, 0.99, 0.001, … 0.0] | Confidence of each output class (i.e. computed ancestry). Each will have a length equal to the number of possible computed ancestry labels minus one (6). The ancestry “Other” is computed separately based on the confidence of the other classes. |
pca_features | array<float> | no | [8.1232, 0.01234, 3.1123, …, 0.00132] | Each will have a length of 16. |
ancestry_pred_other | string | no | oth | The predicted ancestry for the sample, including “other.” |
snp_count | int | no | 3910035 | Number of SNPs called in this sample. |
ins_del_ratio | float | no | 0.98814 | Ratio of insertion to deletion counts. |
del_count | int | no | 427102 | |
ins_count | float | no | 456515 | |
snp_het_homvar_ratio | float | no | 2.1119 | |
indel_het_homvar_ratio | float | no | 2.3994 | |
ti_tv_ratio | float | no | 1.9967 | |
singleton | int | no | 15819 | IMPORTANT: This is not the number of singletons in a sample. This field is a count of the number of variants not appearing in gnomAD 3.1. |
fail_snp_count_residual | boolean | no | true | |
fail_ins_del_ratio_residual | boolean | no | false | |
fail_del_count_residual | boolean | no | true | |
fail_ins_count_residual | boolean | no | false | |
fail_snp_het_homvar_ratio_residual | boolean | no | true | |
fail_indel_het_homvar_ratio_residual | boolean | no | false | |
fail_ti_tv_ratio_residual | boolean | no | true | |
fail_singleton_residual | boolean | no | false | |
qc_metrics_filters | array<string> | no | ["indel_het_homvar_ratio_residual", "snp_count_residual"] | A list of each failed test. These will correspond to all fail_* fields with a value of “true.” |
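The relationship between the fail_* columns and qc_metrics_filters can be checked with a few lines of code. This is a sketch under two assumptions that the table suggests but does not state outright: booleans are written as "true"/"false", and each qc_metrics_filters entry is the fail_* column name without the "fail_" prefix. The example row is hypothetical and holds only a few of the fail_* columns.

```python
def failed_tests(row):
    """List the residual tests a sample failed, derived from its fail_* columns.

    `row` maps column name -> string value for one line of the TSV.
    Assumes booleans are written as "true"/"false" (case-insensitive).
    """
    prefix = "fail_"
    return sorted(
        name[len(prefix):]
        for name, value in row.items()
        if name.startswith(prefix) and value.strip().lower() == "true"
    )

# Hypothetical row restricted to a few fail_* columns from Table 12.
example_row = {
    "fail_snp_count_residual": "true",
    "fail_ins_del_ratio_residual": "false",
    "fail_indel_het_homvar_ratio_residual": "true",
    "fail_ti_tv_ratio_residual": "false",
}
print(failed_tests(example_row))
```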
srWGS genomic metrics
We provide a table with supplemental genomic QC metrics for each srWGS sample. The schema is described in the table below. The table will be released as a tsv.
Genomic metrics tsv schema
- No fields can have a null value.
- No samples will be in the table if they do not pass the QC thresholds.
Table 13. Supplemental genomic metrics for each srWGS sample TSV file description
Field Name | Type | Key? | Example Value | Notes |
research_id | int | yes | 1000000 | Unique identifier for each participant |
sample_source | string | no | Whole Blood | Sample source (blood or saliva) |
site_id | string | no | bi | The genome center (GC) where the sample was sequenced. This will be one of three values (bi = "Broad Institute", uw = "University of Washington", or bcm = "Baylor College of Medicine") |
sex_at_birth | string | no | Female | Participant provided information for sex at birth |
dragen_sex_ploidy | string | no | XX | Ploidy output from DRAGEN |
mean_coverage | float | no | 107.69 | Mean number of overlapping reads at every targeted base of the genome (threshold ≥30x) |
genome_coverage | float | no | 97.61 | Percent of bases with at least 20x coverage (threshold ≥90% at 20x) |
aou_hdr_coverage | float | no | 100 | Percent of bases in the All of Us Hereditary Disease Risk (AoUHDR) genes with at least 20x coverage (threshold ≥95% at 20x) |
dragen_contamination | float | no | 0.003 | Cross-individual contamination rate from DRAGEN |
aligned_q30_bases | float | no | 174329894399 | Aligned Q30 bases from DRAGEN (threshold ≥8e10) |
verify_bam_id2_contamination | float | no | 0.0000104116 | Cross-individual contamination rate from VerifyBamID2 |
biosample_collection_date | string | no | 2/11/2020 | Dates that biosamples were collected |
Column Explanations:
- Field name -- The name of the field. In tsv files, this will appear on the first row of the file.
- Type -- Data type.
- Key? -- Whether this field makes up a unique key for the row. Note that all key fields together make a unique key for the row.
- Notes -- Any other relevant information.
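The thresholds in the Notes column can be applied as a quick sanity check. Since the table only contains samples that passed QC, the check below should return an empty list for every row. This is a minimal sketch; the function and dictionary names are illustrative, and it assumes each threshold is an inclusive lower bound, as the ≥ signs indicate.

```python
# Thresholds copied from the Notes column of Table 13.
THRESHOLDS = {
    "mean_coverage": 30.0,       # >= 30x
    "genome_coverage": 90.0,     # >= 90% of bases at 20x
    "aou_hdr_coverage": 95.0,    # >= 95% of AoUHDR bases at 20x
    "aligned_q30_bases": 8e10,   # >= 8e10 aligned Q30 bases
}

def below_threshold(row):
    """Return the metric names in `row` that fall below their documented threshold."""
    return [name for name, minimum in THRESHOLDS.items() if float(row[name]) < minimum]

# Example values taken from Table 13.
example = {
    "mean_coverage": "107.69",
    "genome_coverage": "97.61",
    "aou_hdr_coverage": "100",
    "aligned_q30_bases": "174329894399",
}
print(below_threshold(example))
```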
Structural variants (SVs) for srWGS data
We provide structural variant (SV) calls for 97,061 participants with srWGS data. The SV dataset includes a standard VCF with genotypes, a sites-only VCF, a list of the maximal set of unrelated samples, a sites-only VCF containing annotations from the maximal set of unrelated samples, and lists of the samples with probable aneuploidies. Please read more information about the SV calls and pipeline in the All of Us Genomic Data Quality Report.
srWGS SV VCF
The SVs are joint-called and delivered as a joint VCF for all samples, a sites-only VCF, and a sites-only VCF with annotations for the maximal set of unrelated samples. The VCFs are sorted and block compressed (.vcf.gz) with a local tabix index (.vcf.gz.tbi).
The full VCF has genotypes for all 97,061 participants and is sharded by chromosome.
The GATK-SV team has documented the SV VCF format in an article on the GATK site: How to interpret SV VCFs. The format has many similarities to a short variant VCF but you will see some differences that are necessary to specify SV variant details. The header describes the data fields in the VCF.
The SV VCF is annotated with the GATK tool SVAnnotate. It adds the gene overlap and the predicted functional consequence. These annotations are added in the INFO field. The annotations produced by SVAnnotate are described in detail in the tool documentation. The GTF used for gene annotations was GENCODE v39.
Some of the most important fields in the VCF are described below:
- CHROM: The chromosome location of the start position of the SV
- POS: The start position of the SV
- ID: Unique identifier for the SV
- REF: Not commonly used in structural variant VCFs; it usually contains an N
- ALT: Information about the SV type, descriptions of the SV types can be found in the header
- FILTER: Filtering information for the SV
  - HIGH_NCR: Unacceptably high rate of no-call GTs.
  - MULTIALLELIC: Multiallelic CNV site. This FILTER status does not mean that the site is not real, but it should be treated differently from a biallelic SV site.
  - UNRESOLVED: Variant is unresolved. There was some evidence for an SV at this site but it was not able to be resolved completely from the available evidence.
  - VARIABLE_ACROSS_BATCHES: Site appears at variable frequencies across batches. Likely reflects technical batch effects.
  - PASS: None of the above site-level filters were applied.
- INFO: Important annotations describing the variant at the site level. The annotations are described in depth in the SV VCF header. Some of these annotations include:
  - END: End position of the structural variant
  - CHR2: Second chromosome for interchromosomal events
  - END2: Position of breakpoint on CHR2
  - ALGORITHMS: The original algorithm that called the SV (GATK-SV is an ensemble method)
  - SVLEN: SV length in base pairs
  - SVTYPE: SV type
  - CPX_TYPE: Subtype of complex rearrangement
  - CPX_INTERVALS: Details of complex rearrangement
- FORMAT: Annotations describing the variant at the genotype level (site and sample specific annotations). Depends on the SV type and the evidence categories that support the SV. All FORMAT annotations are described in the VCF header.
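To make the field layout concrete, the sketch below splits one VCF data line into the site-level fields listed above. It uses only standard VCF conventions (tab-separated columns, semicolon-separated INFO entries). The record itself, including its ID, coordinates, and ALGORITHMS value, is invented for illustration and does not come from the All of Us callset.

```python
def parse_sv_record(line):
    """Split one VCF data line into the site-level fields described above."""
    chrom, pos, sv_id, ref, alt, qual, filt, info = line.rstrip("\n").split("\t")[:8]
    info_map = {}
    for item in info.split(";"):
        key, _, value = item.partition("=")
        info_map[key] = value if value else True  # flag-type INFO keys carry no value
    return {
        "chrom": chrom,
        "start": int(pos),
        "id": sv_id,
        "alt": alt,
        # An empty list means the site passed all site-level filters.
        "filters": [] if filt == "PASS" else filt.split(";"),
        "info": info_map,
    }

# Hypothetical record; every value here is invented.
line = "chr1\t10000\texample_DEL_1\tN\t<DEL>\t.\tPASS\tEND=12000;SVTYPE=DEL;SVLEN=2000;ALGORITHMS=manta"
record = parse_sv_record(line)
print(record["info"]["SVTYPE"], record["info"]["END"], record["filters"])
```

For real analyses, a dedicated VCF library is usually a better choice than hand parsing; the sketch is only meant to show where each field lives.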
srWGS SV sites-only VCF
The sites-only VCF contains all of the sites and site-level annotations in the full VCF but no genotype information. It is useful as a smaller file when genotype information is not required. See the above information for SV VCF details.
srWGS SV maximal set of unrelated samples
We provide a list of samples to prune in order to remove related samples from the srWGS SV cohort. Relatedness is calculated as described in the kinship score description above. This will be the minimal list of related samples to prune in order to produce the maximal independent set of unrelated samples.
The samples are reported in a txt file as a list of research IDs. One research ID is listed per line and there is no header in the file.
srWGS SV unrelated sites-only VCF
We provide a sites-only VCF, containing no genotype information, with annotations for the maximal set of 93,360 unrelated samples. We removed from the complete VCF the 3,701 samples from the above list of samples to prune to obtain the maximal set of unrelated samples. Sites that were unique to the removed samples were removed from the VCF. We re-annotated allele frequencies in the VCF based on the remaining samples. This VCF is provided in order to save researchers computational time for analyses requiring unrelated samples.
srWGS SV samples with probable aneuploidies
We provide lists of samples with probable aneuploidies identified during srWGS SV ploidy estimation as tsv files. Ploidy estimation was performed across the srWGS SV samples using coverage estimations over binned regions of the genome as part of the GATK-SV pipeline. Details of this ploidy estimation process are described in the All of Us QC report.
There are three separate files for samples with probable aneuploidies: mosaic autosomal aneuploidy, mosaic allosomal aneuploidy, and germline allosomal aneuploidy.
srWGS SV samples with probable mosaic aneuploidies
We provide two files describing samples with probable mosaic aneuploidies. One is samples with mosaic autosomal aneuploidy and the second is samples with mosaic allosomal aneuploidy. Both files have the same format, described in Table 14. Note that fewer than 20 samples had more than one probable mosaic autosomal aneuploidy, so these samples appear once per affected chromosome.
Table 14. srWGS SV samples with probable mosaic aneuploidies TSV file description
Field name | Type | Key? | Notes |
research_id | string | no | Research ID of the sample |
chromosome | string | no | Chromosome for which the sample is predicted to have a mosaic aneuploidy, e.g. chr8 or chrX |
estimated_copy_ratio | float | no | Estimated copy ratio (see QC report Ploidy Estimation) for the chromosome with the probable mosaic aneuploidy |
aneuploidy_type | string | no | Type of aneuploidy predicted. For the probable mosaic aneuploidies, the possible values are MOSAIC_GAIN or MOSAIC_LOSS |
Column Explanations:
- Field name -- The name of the field. In tsv files, this will appear on the first row of the file.
- Type -- Data type.
- Key? -- Whether this field makes up a unique key for the row. Note that all key fields together make a unique key for the row.
- Notes -- Any other relevant information.
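Since a sample can appear on more than one row (once per affected chromosome), it is often convenient to group the rows by research_id before counting samples. The sketch below assumes the header shown in Table 14; the IDs and copy ratios are hypothetical.

```python
import csv
import io
from collections import defaultdict

# Hypothetical rows in the Table 14 layout; one sample appears on two chromosomes.
MOSAIC_TSV = (
    "research_id\tchromosome\testimated_copy_ratio\taneuploidy_type\n"
    "1000001\tchr8\t2.4\tMOSAIC_GAIN\n"
    "1000001\tchr12\t2.3\tMOSAIC_GAIN\n"
    "1000002\tchr7\t1.7\tMOSAIC_LOSS\n"
)

def chromosomes_by_sample(handle):
    """Group (chromosome, aneuploidy_type) calls under each research_id."""
    grouped = defaultdict(list)
    for row in csv.DictReader(handle, delimiter="\t"):
        grouped[row["research_id"]].append((row["chromosome"], row["aneuploidy_type"]))
    return dict(grouped)

calls = chromosomes_by_sample(io.StringIO(MOSAIC_TSV))
print(len(calls), calls["1000001"])
```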
srWGS SV samples with probable germline allosomal aneuploidy
We provide one file describing samples with probable germline allosomal aneuploidies, described in Table 15.
Table 15. srWGS SV samples with probable germline allosomal aneuploidy TSV file description
Field name | Type | Key? | Notes |
research_id | string | yes | Research ID of the sample |
copy_number_chrX | integer | no | Estimated copy number for chrX, rounded to the nearest integer |
copy_number_chrY | integer | no | Estimated copy number for chrY, rounded to the nearest integer |
aneuploidy_type | string | no | Type of aneuploidy predicted. For the probable germline allosomal aneuploidies, the possible values are JACOBS, KLINEFELTER, and TRIPLE X (contains a space) |
Column Explanations:
- Field name -- The name of the field. In tsv files, this will appear on the first row of the file.
- Type -- Data type.
- Key? -- Whether this field makes up a unique key for the row. Note that all key fields together make a unique key for the row.
- Notes -- Any other relevant information.
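Because the value TRIPLE X contains a space, the file should be split on tabs rather than on arbitrary whitespace. The sketch below assumes the header shown in Table 15; the research IDs are hypothetical, and the copy numbers simply follow the usual karyotypes for each label (XXX, XXY, XYY).

```python
import csv
import io

# Hypothetical rows in the Table 15 layout.
GERMLINE_TSV = (
    "research_id\tcopy_number_chrX\tcopy_number_chrY\taneuploidy_type\n"
    "1000001\t3\t0\tTRIPLE X\n"
    "1000002\t2\t1\tKLINEFELTER\n"
    "1000003\t1\t2\tJACOBS\n"
)

def load_germline_calls(handle):
    """Read the TSV with a tab delimiter so 'TRIPLE X' is kept as one value."""
    return {
        row["research_id"]: (
            int(row["copy_number_chrX"]),
            int(row["copy_number_chrY"]),
            row["aneuploidy_type"],
        )
        for row in csv.DictReader(handle, delimiter="\t")
    }

germline = load_germline_calls(io.StringIO(GERMLINE_TSV))
print(germline["1000001"])
```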
srWGS SV sample list
We provide a list file of all research_ids that have srWGS SV data. The file is a text file containing one research_id per line.
Genotyping Array ("Array") Data
The array data represents 447,278 participants and includes single sample VCFs, joint Hail MT files, joint PLINK files, and raw genotyping data in IDAT format.
Array IDAT files
We provide IDAT files for all array samples. The IDAT file is a binary file containing raw BeadArray data directly from the scanner. There are two files for each sample, corresponding to the red and green intensity values. These values give information about specific nucleotides on the genome. You can read more about the steps to call variants from these IDAT files in the Genomic Quality Report.
For an in-depth description of these files and how to process them, read more about the illuminaio tool.
Array variant data
The variant data for array samples is delivered in VCF, Hail MT, and PLINK format.
Array VCFs
We provide single-sample VCFs for all 447,278 participants with array data. The array VCFs are sorted and block compressed (.vcf.gz) with local tabix index files (.vcf.gz.tbi).
Array VCFs in the All of Us genomic dataset will contain the following:
Header
The header field of the VCF contains many attributes which generally describe the processing of the sample in the array. Many of these are specific to a single sample.
- arrayType - This contains the name of the genotyping array that was processed.
- autocallDate - The date that the genotyping array was processed by ‘autocall’ (aka gencall), the Illumina genotype calling software.
- autocallGender - The gender (sex) that autocall determined for the sample processed.
- autocallVersion - The version of the autocall/gencall software used.
- chipWellBarcode - The chip well barcode (a unique identifier for sample as processed on a specific location on the Illumina genotyping array).
- clusterFile - The cluster file used.
- extendedIlluminaManifestVersion - The version of the ‘extended Illumina manifest’ used by the VCF generation software.
- extendedManifestFile - The filename of the ‘extended Illumina manifest’ used by the VCF generation software.
- fingerprintGender - The gender (sex) determined using an orthogonal fingerprinting technology. This is populated by an optional parameter used by the VCF generation software.
- gtcCallRate - The gtc call rate of the sample processed. This value is generated by the autocall/gencall software and represents the fraction of callable loci that had valid calls.
- imagingDate - The date that the IDAT files (raw image scans) for the chip well barcode were created.
- manifestFile - The name of the Illumina manifest (.bpm) file used by the VCF generation software.
- sampleAlias - The name of the sample.
Note that there are many other attributes in the header (Biotin*, DNP*, Extension*, Hyb*, NP*, NSB*, Restore, String*, TargetRemoval) that are populated with Illumina control values. They are not described here.
Filtered Sites (FILTER)
There are several filters specific to genotyping array content. These are:
- DUPE - This filter is applied if there are multiple rows in the VCF for the same locus and alleles. That is, if there are two or more rows that share the same chromosome, position, ref allele and alternate alleles, all but one of them will have the ‘DUPE’ filter set.
- TRIALLELIC - This filter is applied if there is a site at which there are two alternate alleles and neither of them is the same as the reference allele.
- ZEROED_OUT_ASSAY - This filter is applied if the variant at the site was ‘zeroed out’ in the Illumina cluster file. This is typically done when the calls at the site are intentionally marked as unusual. Genotypes called at sites that are ‘zeroed out’ will always be no-calls.
Genotype (sample level fields)
These fields describe attributes specific to the sample genotyped on the array. The FORMAT specifier in the VCF header describes these fields. They are:
- GT - GENOTYPE. This field describes the genotype. It is a standard field, described in the VCF specification.
- IGC - Illumina GenCall Confidence Score. A measure of the call confidence.
- X - Raw X intensity as scanned from the original genotyping array
- Y - Raw Y intensity as scanned from the original genotyping array
- NORMX - Normalized X intensity
- NORMY - Normalized Y intensity
- R - Normalized R Value (one of the polar coordinates after the transformation of NORMX and NORMY)
- THETA - Normalized Theta value (one of the polar coordinates after the transformation of NORMX and NORMY)
- LRR - Log R Ratio
- BAF - B Allele Frequency
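The R and THETA fields are a polar-style re-expression of NORMX and NORMY. This article does not spell out the formula, so the sketch below relies on an assumption: the convention commonly described for Illumina arrays, where R is the sum of the two normalized intensities and THETA is the angle rescaled to lie between 0 and 1. Treat it as an illustration of the idea rather than the exact computation behind the VCF values.

```python
import math

def polar_from_normalized(norm_x, norm_y):
    """Convert normalized intensities to (R, THETA).

    Assumed convention (not stated in this article): R = NORMX + NORMY and
    THETA = (2 / pi) * arctan(NORMY / NORMX), so THETA runs from 0 to 1.
    """
    r = norm_x + norm_y
    theta = (2.0 / math.pi) * math.atan2(norm_y, norm_x)
    return r, theta

print(polar_from_normalized(1.0, 0.0))  # signal only on X
print(polar_from_normalized(0.5, 0.5))  # equal signal on X and Y
```

Under this convention, samples homozygous for one allele cluster near THETA = 0 or 1 and heterozygous samples cluster near the middle.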
INFO (site level fields)
These fields describe attributes specific to the probe on an array. The INFO specifier in the VCF header describes these fields. They are:
- AC - Allele Count in genotypes, for each ALT allele. A standard field, described in the VCF specification
- AF - Allele Frequency. A standard field, described in the VCF specification
- AN - Allele Number. A standard field, described in the VCF specification
- ALLELE_A - The A Allele, as annotated in the Illumina manifest (a * suffix indicates this is the reference allele)
- ALLELE_B - The B Allele, as annotated in the Illumina manifest (a * suffix indicates this is the reference allele)
- BEADSET_ID - The BeadSet ID. An Illumina identifier. Used for normalization.
- GC_SCORE - The Illumina GenTrain Score. A quality score describing the probe design
- ILLUMINA_BUILD - The Genome Build for the design probe sequence, as annotated in the Illumina manifest
- ILLUMINA_CHR - The chromosome of the design probe sequence, as annotated in the Illumina manifest.
- ILLUMINA_POS - The position of the design probe sequence (on ILLUMINA_CHR), as annotated in the Illumina manifest.
- ILLUMINA_STRAND - The strand for the design probe sequence, as annotated in the Illumina manifest.
- PROBE_A - The allele A probe sequence as annotated in the Illumina manifest.
- PROBE_B - The allele B probe sequence as annotated in the Illumina manifest. Note that this is only present on strand ambiguous SNPs.
- SOURCE - The probe source as annotated in the Illumina manifest.
- refSNP - The dbSNP rsId for this probe
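The * suffix on ALLELE_A and ALLELE_B can be used to work out which manifest allele matches the reference. A minimal sketch follows; the allele values are hypothetical, and the function name is illustrative.

```python
def reference_allele(allele_a, allele_b):
    """Return (ref, others) from ALLELE_A/ALLELE_B values where a '*' suffix marks the reference."""
    alleles = [allele_a, allele_b]
    ref = [a.rstrip("*") for a in alleles if a.endswith("*")]
    others = [a for a in alleles if not a.endswith("*")]
    return (ref[0] if ref else None), others

print(reference_allele("T*", "C"))
print(reference_allele("A", "G"))  # neither manifest allele is marked as the reference
```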
Array Hail MT
We have merged the array VCFs into a Hail MT with no additional processing across samples. Each column corresponds to the research ID of the sample and each row corresponds to the variant. Since the single sample array VCFs have identical sites and FILTER values, the FILTER field is populated with the value from a single sample VCF.
In conversion, we dropped all 505 variants on alternate, unlocalized, and unplaced contigs: 436 variants from ALT contigs (e.g. chr19_KI270866v1_alt), 56 from random contigs (e.g. chr1_KI270706v1_random), and 13 from chrUn contigs (e.g. chrUn_KI270742v1). These variants are still in the compressed array VCFs. Please refer to the published Featured Workspaces on how we generated the Hail MT from the VCFs.
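The three contig classes can be told apart by name alone. The sketch below is an illustration based on the example names quoted above (an "_alt" suffix, a "_random" suffix, and a "chrUn_" prefix); it is not the code used to build the Hail MT, and anything that matches none of the patterns is simply labeled "primary".

```python
def contig_class(contig):
    """Classify a GRCh38 contig name by the naming patterns quoted above."""
    if contig.endswith("_alt"):
        return "alt"
    if contig.endswith("_random"):
        return "random"      # unlocalized contigs
    if contig.startswith("chrUn_"):
        return "unplaced"
    return "primary"

def is_kept_in_mt(contig):
    """True for contigs that do not match any of the dropped classes."""
    return contig_class(contig) == "primary"

for name in ["chr1", "chr19_KI270866v1_alt", "chr1_KI270706v1_random", "chrUn_KI270742v1"]:
    print(name, contig_class(name))
```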
Array PLINK 1 binary biallelic genotype table (PLINK bed)
We provide PLINK 1.9 data (.bed / .bim / .fam) for array data. The files were converted from the Hail MT using the export_plink command in Hail and contain all information in the Hail MT. PLINK file type information can be found within the PLINK documentation. The .bed file is the PLINK binary biallelic genotype table and contains genotype calls. The .bim file is the PLINK extended .map file, a text file containing variant information. The .fam file is a text file with sample information for each participant. Please refer to the published Featured Workspaces on how to use the PLINK files.
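For readers curious about what the binary .bed file holds, the sketch below decodes the PLINK 1 variant-major layout: a three-byte header followed by two bits per genotype, four samples per byte. The number of samples comes from the line count of the .fam file and the number of variants from the .bim file. The toy byte string is synthetic, and in practice PLINK or Hail should be used to read these files.

```python
# PLINK 1 .bed files start with these three bytes (variant-major layout).
BED_MAGIC = bytes([0x6C, 0x1B, 0x01])

# Two-bit genotype codes -> count of the second (.bim) allele; None = missing.
CODE_TO_A2_COUNT = {0b00: 0, 0b01: None, 0b10: 1, 0b11: 2}

def decode_variant(block, n_samples):
    """Decode one variant's bytes into per-sample second-allele counts."""
    calls = []
    for i in range(n_samples):
        byte = block[i // 4]
        code = (byte >> (2 * (i % 4))) & 0b11  # first sample sits in the lowest bits
        calls.append(CODE_TO_A2_COUNT[code])
    return calls

def decode_bed(data, n_samples, n_variants):
    """Decode a whole in-memory .bed payload into one list of calls per variant."""
    if data[:3] != BED_MAGIC:
        raise ValueError("not a variant-major PLINK 1 .bed file")
    stride = (n_samples + 3) // 4  # each variant is padded to whole bytes
    return [
        decode_variant(data[3 + v * stride: 3 + (v + 1) * stride], n_samples)
        for v in range(n_variants)
    ]

# Tiny synthetic file: 1 variant, 4 samples with codes 00, 10, 11, 01.
toy = BED_MAGIC + bytes([0b01111000])
print(decode_bed(toy, n_samples=4, n_variants=1))
```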
Long-Read Whole Genome Sequencing (lrWGS)
We provide lrWGS data representing 2,800 participants in the CDRv7 and CDRv8 callsets. These data are particularly useful for resolving complex genomic regions, structural variants, and phasing of alleles, to provide a more comprehensive view of the genome. The CDRv8 callsets represent 1,773 participants and the CDRv7 callset represents 1,027 participants (Table 16).
These 2,800 participants are represented by a total of 2,842 samples, because 41 participants are sequenced on both PacBio and ONT. In addition, one participant was sequenced at both BI and UW, though to different coverage.
Table 16. Sample cohorts for all 2,800 participants with lrWGS data
Cohort name | Sequencing facility | Sequencing platform | Number of samples | Minimum coverage | Notes |
HA_Rev_mid | HA | PacBio Revio | 65 | Mid-pass (12x) | |
HA_Seq_CDRv7 | HA | PacBio Sequel IIe and Sequel II | 1027 | Mid-pass (12x) | The CDRv7 data |
BI_Seq_high | BI | PacBio Sequel IIe | 84 | High-pass (25x) | |
BI_Seq_mid | BI | PacBio Sequel IIe | 198 | Mid-pass (12x) | |
BI_Rev_mid | BI | PacBio Revio | 803 | Mid-pass (12x) | |
BCM_Seq_high | BCM | PacBio Sequel IIe | 77 | High-pass (25x) | |
BCM_Rev_high | BCM | PacBio Revio | 111 | High-pass (25x) | |
BCM_ONT_high | BCM | ONT R10.4 on PromethION | 196 | High-pass (25x) | |
JHU_ONT_high | JHU | ONT R10.4 on PromethION | 128 | High-pass (25x) | |
UW_Seq_high | UW | PacBio Sequel IIe | 100 | High-pass (25x) | |
UW_Rev_high | UW | PacBio Revio | 53 | High-pass (25x) | |
Total samples | | | 2842 | | 42 CDRv8 participants were each sequenced as two different samples |
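The totals can be cross-checked directly from the table: the cohort sample counts sum to 2,842, which equals the 2,800 participants plus the 42 participants who were sequenced twice. The short check below copies the counts from Table 16; the variable names are illustrative.

```python
# Sample counts copied from Table 16.
COHORT_SAMPLES = {
    "HA_Rev_mid": 65,
    "HA_Seq_CDRv7": 1027,
    "BI_Seq_high": 84,
    "BI_Seq_mid": 198,
    "BI_Rev_mid": 803,
    "BCM_Seq_high": 77,
    "BCM_Rev_high": 111,
    "BCM_ONT_high": 196,
    "JHU_ONT_high": 128,
    "UW_Seq_high": 100,
    "UW_Rev_high": 53,
}

total_samples = sum(COHORT_SAMPLES.values())
participants = 2800
sequenced_twice = 41 + 1  # 41 on both PacBio and ONT, 1 at both BI and UW
print(total_samples, participants + sequenced_twice)
```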
The file types available depend on the cohort of each sample; please see Table 17 for more information. Joint callsets are generated per cohort, and single sample files are available for each sample. One file with auxiliary metrics is generated for each sequencing location, and all samples sequenced at that location are represented in it on a per-sample level. The main difference between the data available for each cohort is that PacBio cohorts have de novo assembly data while ONT cohorts do not.
For locations of the lrWGS files available, please see the lrWGS manifest and the CDR Directory Document.
Table 17. Data available for each lrWGS cohort
Cohort name | Sequencing reads | Variant data | Auxiliary data |
PacBio cohorts | grch38_noalt BAM; T2Tv2.0 BAM; GFA files: primary de novo assembly, alternative de novo assembly, one de novo assembly for each chromosome copy; FASTA: one for each GFA file | One for each of grch38_noalt & T2Tv2.0: Joint-called SNP & Indel variants (GVCF & Hail MT); Single sample SNP & Indel variants (GVCF); Single sample PBSV SVs (VCF); Single sample Sniffles2 SVs (VCF); Single sample Sniffles2 SNF; Single sample PAV variants (VCF) | Single sample data available in the file per sequencing facility: Auxiliary metrics grch38_noalt; Auxiliary metrics T2Tv2.0 |
ONT cohorts | grch38_noalt BAM; T2Tv2.0 BAM | One for each of grch38_noalt & T2Tv2.0: Joint-called SNP & Indel variants (GVCF & Hail MT); Single sample SNP & Indel variants (GVCF); Single sample PBSV SVs (VCF); Single sample Sniffles2 SVs (VCF); Single sample Sniffles2 SNF | Single sample data available in the file per sequencing facility: Auxiliary metrics grch38_noalt; Auxiliary metrics T2Tv2.0 |
HA_Seq_CDRv7 | grch38_noalt BAM: standard & haplotagged; T2Tv2.0 BAM: standard & haplotagged; GFA files: primary de novo assembly, alternative de novo assembly, one de novo assembly for each chromosome copy; FASTA: one for each GFA file | One for each of grch38_noalt & T2Tv2.0: Joint-called SNP & Indel variants (VCF & Hail MT); Joint-called SVs (VCF): strict & lenient; Single sample SNP & Indel variants (VCF); Single sample SNP & Indel phased variants (VCF); Single sample PBSV SVs (VCF); Single sample Sniffles2 SVs (VCF); Single sample Sniffles2 SNF; Single sample PAV variants (VCF) | Auxiliary metrics grch38_noalt; Auxiliary metrics T2Tv2.0 |
lrWGS sequencing reads
The reads for each lrWGS sample are aligned to two references, grch38_noalt and T2Tv2.0, and delivered in BAM format. Each BAM file is accompanied by a BAI index file.
grch38_noalt corresponds to the GRCh38 reference with no alternate sequences. T2Tv2.0 in the CDRv8 release corresponds to the T2T-CHM13v2.0 reference with these modifications: the EBV contig is added from the grch38_noalt reference, Chromosome Y is hardmasked with N bases in the human pseudoautosomal region (PAR), and the mitochondrial genome is updated to the revised Cambridge Reference Sequence (rCRS). We updated the T2Tv2.0 reference for this CDRv8 release, so it is different from the previous CDRv7 T2Tv2.0 version. Please see Known Issue #7 in the All of Us Genomic Data Quality Report regarding how the T2Tv2.0 reference in CDRv8 is different from that used in CDRv7.
In the CDRv7 cohort, we additionally provide haplotagged files. A haplotagged BAM file contains additional information for each read to distinguish reads that come from different haplotypes.
lrWGS de novo assembly
Haplotype-resolved de novo assembly is available for all PacBio HiFi samples (Table 16) in Graphical Fragment Assembly (GFA) and FASTA format. Each de novo assembly includes a primary de novo assembly, an alternative de novo assembly, and two chromosome copies. The tool PAV is used to call variants from the PacBio GFA files.
GFA files
We release four Graphical Fragment Assembly (GFA) files for each PacBio sample, which are de novo graph-based assemblies. One GFA file is the primary assembly for the sample, another is the alternative assembly, and the other two are the chromosome copy assemblies. The GFA files describe the graph layouts of the contigs.
We generate the assemblies with hifiasm, a tool for producing haplotype-resolved de novo assemblies. Please check out the GFA specifications for more details about GFA format.
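As a rough illustration of the GFA layout, the sketch below counts segment (S) and link (L) records and totals the segment lengths, following the GFA 1 record types. The three-line graph is invented, and the assumption that a segment with an omitted sequence carries an LN:i: length tag comes from the GFA specification rather than from inspecting the All of Us files.

```python
def summarize_gfa(lines):
    """Count segments and links and total the segment sequence lengths."""
    segments = links = total_bases = 0
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if fields[0] == "S":          # S <name> <sequence> [tags]
            segments += 1
            sequence = fields[2]
            if sequence != "*":
                total_bases += len(sequence)
            else:
                # Sequence omitted; fall back to the optional LN:i:<length> tag.
                for tag in fields[3:]:
                    if tag.startswith("LN:i:"):
                        total_bases += int(tag[5:])
        elif fields[0] == "L":        # L <from> <orient> <to> <orient> <overlap>
            links += 1
    return segments, links, total_bases

# Hypothetical three-line graph; names and sequences are invented.
toy_gfa = ["S\tctg1\tACGTAC\n", "S\tctg2\t*\tLN:i:10\n", "L\tctg1\t+\tctg2\t-\t0M\n"]
print(summarize_gfa(toy_gfa))
```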
FASTA files
We provide four de novo assemblies as FASTA files for each PacBio HiFi lrWGS sample, matching the sequences from the GFA files. A FASTA file is a text file representation of genomic data. Each genomic sequence is described in two lines: a description line beginning with a greater-than (">") symbol, followed by a line containing the nucleotide sequence as a string. This pair of lines is repeated for each sequence in the file.
The CDRv8 files are block-gzipped and the CDRv7 files are gzipped. Each FASTA is accompanied by an index file.
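A small reader for this layout might look like the sketch below. It pairs each description line with its sequence and also tolerates sequences wrapped over several lines. Block-gzipped files are valid gzip streams, so Python's gzip module can open both the CDRv8 and CDRv7 files. The contig names and sequences in the example are hypothetical, and the file path argument is a placeholder.

```python
import gzip
import io

def read_fasta(handle):
    """Yield (description, sequence) pairs from an open text handle."""
    description, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if description is not None:
                yield description, "".join(chunks)
            description, chunks = line[1:], []
        else:
            chunks.append(line)  # tolerate sequences wrapped over several lines
    if description is not None:
        yield description, "".join(chunks)

def read_gzipped_fasta(path):
    """Read a gzipped or block-gzipped FASTA file into a list of records."""
    with gzip.open(path, "rt") as handle:
        return list(read_fasta(handle))

# Hypothetical contig names and sequences.
toy = io.StringIO(">ctg1\nACGT\n>ctg2\nGGCC\n")
records = list(read_fasta(toy))
print(records)
```

Whole-genome assemblies are large, so for real files it is usually better to iterate over read_fasta lazily or to use the accompanying index with a tool such as samtools faidx.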
Long-read variant data
For a detailed description of the CDRv7 lrWGS variant data, please refer to the CDRv7 version of the article How the All of Us Genomic Data are Organized.
lrWGS SNP & Indel GVCF
We perform SNP and Indel variant calling per-sample with DeepVariant for each reference version, grch38_noalt and T2Tv2.0. The single-sample variant data is released in GVCF format with accompanying GVCF TBI index files.
lrWGS joint callset Hail MT
We generate a lrWGS joint SNP & Indel callset by joining the single-sample GVCFs with GLNexus. The joint callsets are generated per-cohort, not across the entire lrWGS sample set (see Table 16 for the cohorts). The joint callset is available in Hail MT and GVCF format. The variants are hard-filtered with a QUAL cutoff of 40 for PacBio samples and 34 for ONT samples (see the All of Us Genomic Data Quality Report for more information).
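The platform-specific cutoffs can be expressed as a small lookup. This is only a sketch: the function name and the platform keys are illustrative, and whether a variant exactly at the cutoff is kept is an assumption here (the code keeps it), so check the Genomic Data Quality Report for the exact rule.

```python
# QUAL cutoffs quoted above for the joint callsets.
QUAL_CUTOFF = {"pacbio": 40.0, "ont": 34.0}

def passes_hard_filter(qual, technology):
    """Assumes variants with QUAL at or above the cutoff are kept (boundary handling is a guess)."""
    return qual >= QUAL_CUTOFF[technology.lower()]

print(passes_hard_filter(36.0, "ONT"), passes_hard_filter(36.0, "PacBio"))
```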
lrWGS structural variant VCF
Structural variants are called with both PBSV and Sniffles2 for all lrWGS samples. Each lrWGS sample has a single VCF from each of the two variant callers, accompanied by TBI index files. Please see the headers of these VCF files for descriptions of the VCF fields. In addition, we output a Sniffles2 binary SNF file for use with Sniffles2’s multi-sample SV calling mode.
lrWGS PAV phased variants
Variants from the tool PAV are provided in VCF format for each PacBio HiFi sample. The VCF files are accompanied by a TBI index. PAV variants are derived from the haplotype resolved assembly (GFA files) generated by hifiasm. PAV-generated VCFs are phased. Please see the header of the PAV VCFs for a description of the VCF fields.
lrWGS CDRv7 cohort
Please see the CDRv7 version of the article How the All of Us Genomic Data are Organized for a thorough description of each lrWGS sample in the CDRv7 cohort. The files available for the CDRv7 callset are featured in Table 17. The major differences are as follows:
- CDRv7 has haplotagged BAM files available.
- The CDRv8 joint callsets are broken up into smaller cohorts.
- In CDRv7, the single sample SNP & Indel variants were called with the PEPPER-MARGIN-DeepVariant pipeline.
lrWGS sample metrics
We provide two lrWGS variant metrics files, one for each lrWGS reference, described in Table 18.
Table 18. lrWGS variant metrics file description
Field name | Type | Key? | Notes |
research_id | string | yes | Research ID of the sample |
mosdepth_cov | float | no | Coverage from the mosdepth tool (See the QC report for a description) |
aligned_frac_bases | float | no | Fraction of bases aligned to the reference |
aligned_num_bases | float | no | Number of bases aligned to the reference |
aligned_num_reads | float | no | Number of reads aligned to the reference |
aligned_read_length_N50 | float | no | N50 of the aligned reads |
aligned_read_length_median | float | no | Median length of the aligned reads |
aligned_read_length_mean | float | no | Mean length of the aligned reads |
aligned_read_length_stdev | float | no | Standard deviation of the aligned read length |
average_identity | float | no | Mean percentage of matches to the reference per aligned read |
median_identity | float | no | Median percentage of matches to the reference per aligned read |
dvp_ft_pass_snp_cnt | float | no | Number of PASS SNPs after filtering |
pbsv_nonBND_50bpSV_cnt | float | no | Number of SVs >= 50 bp called by PBSV (excluding break-end calls) |
snf2_nonBND_50bpSV_cnt | float | no | Number of SVs >= 50 bp called by Sniffles2 (excluding break-end calls) |
Column Explanations:
- Field name -- The name of the field. In tsv files, this will appear on the first row of the file.
- Type -- Data type.
- Key? -- Whether this field makes up a unique key for the row. Note that all key fields together make a unique key for the row.
- Notes -- Any other relevant information.
lrWGS flagged samples
As described in the QC doc, several lrWGS samples were flagged during the QC process, but not filtered. We release a separate file—a 4-column TSV—listing the samples that have been flagged. To uniquely identify a sample, you need the combination of the sample_id, sequencing facility, and platform.
Field name | Notes |
sample_id | Research ID of the sample |
sequencing_facility | The sequencing facility of the sample. Possible values are: BCM, BI, HA, JHU, UW. |
platform | Sequencing technology of the sample. Possible values are revio, sequel, ont |
reasons_for_flagging | The reasons for the sample to be flagged. There can be more than one reason for the sample to be flagged, separated by comma. No white spaces. Possible values: contamination_between_1_and_3_pct, coverage_slightly_below_target, diploid_assembly_length_anomaly, female_with_low_chrX_coverage, male_with_low_chrY_coverage, read_len_median_below_10kbp |
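Since a research ID alone may not be unique, the flagged-sample file is best keyed on all three identifying columns, with the reasons split on commas. The sketch below assumes the file has a header row with the four field names listed above; the rows are hypothetical.

```python
import csv
import io

# Hypothetical rows using the four columns listed above.
FLAGGED_TSV = (
    "sample_id\tsequencing_facility\tplatform\treasons_for_flagging\n"
    "1000001\tBI\trevio\tcoverage_slightly_below_target,read_len_median_below_10kbp\n"
    "1000001\tBCM\tont\tcontamination_between_1_and_3_pct\n"
)

def load_flags(handle):
    """Key each row by (sample_id, facility, platform); split reasons on commas."""
    flags = {}
    for row in csv.DictReader(handle, delimiter="\t"):
        key = (row["sample_id"], row["sequencing_facility"], row["platform"])
        flags[key] = row["reasons_for_flagging"].split(",")
    return flags

flags = load_flags(io.StringIO(FLAGGED_TSV))
print(flags[("1000001", "BI", "revio")])
```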
lrWGS manifest
The location of each single sample file is listed in the lrWGS manifest file. This resource goes hand in hand with the Controlled CDR Directory Document, which lists the location of the manifest file and the paths for all joint callsets. Some samples will have two rows in the lrWGS manifest because they were sequenced at multiple sequencing facilities or on multiple platforms. To uniquely identify a sample, you need the combination of the research_id, center, and platform.
Not all columns will be filled, depending on what data is available for the sample. See Table 17 for details.
Table 20 -- lrWGS manifest
Field name | Notes |
research_id | Research ID of the sample |
center | Sequencing facility of the sample. Possible values are: BCM, BI, HA, JHU, UW. |
Platform | Sequencing technology of the sample. Possible values are revio, sequel, ont |
assembly_alternate_fa | De novo alternate assembly, in FASTA format. Only available for PacBio samples. CDRv8 files are block-gzipped and CDRv7 files are gzipped. |
assembly_alternate_fa_gzi | De novo alternate assembly, FASTA index file. Only available for PacBio samples in CDRv8. |
assembly_alternate_gfa | De novo alternate assembly, in GFA format. Only available for PacBio samples. |
assembly_hap1_fa | De novo haplotype-resolved assembly for haplotype-1 (in no particular order), in FASTA format. Only available for PacBio samples. CDRv8 files are block-gzipped and CDRv7 files are gzipped. |
assembly_hap1_fa_gzi | De novo haplotype-resolved assembly for haplotype-1 (in no particular order), FASTA index file. Only available for PacBio samples in CDRv8. |
assembly_hap1_gfa | De novo haplotype-resolved assembly for haplotype-1 (in no particular order), in GFA format. Only available for PacBio samples. |
assembly_hap2_fa | De novo haplotype-resolved assembly for haplotype-2 (in no particular order), in FASTA format. Only available for PacBio samples. CDRv8 files are block-gzipped and CDRv7 files are gzipped. |
assembly_hap2_fa_gzi | De novo haplotype-resolved assembly for haplotype-2 (in no particular order), FASTA index file. Only available for PacBio samples in CDRv8. |
assembly_hap2_gfa | De novo haplotype-resolved assembly for haplotype-2 (in no particular order), in GFA format. Only available for PacBio samples. |
assembly_primary_fa | De novo primary assembly, in FASTA format. Only available for PacBio samples. CDRv8 files are block-gzipped and CDRv7 files are gzipped. |
assembly_primary_fa_gzi | De novo primary assembly, FASTA index file. Only available for PacBio samples in CDRv8. |
assembly_primary_gfa | De novo primary assembly, in GFA format. Only available for PacBio samples. |
assembly_quast_report_html | HTML report for the primary, haplotype-1 and haplotype-2 assemblies generated by the QUAST program. Only available for CDRv7 PacBio samples. |
assembly_quast_report_summary | A summary about the quality of the primary, haplotype-1 and haplotype-2 assemblies, reported by the QUAST program. Only available for PacBio samples. |
chm13v2.0_bai | The accompanying index for the T2Tv2.0 BAM. |
chm13v2.0_bam | T2Tv2.0 sequencing reads in BAM format |
chm13v2.0_bam_pbi | The accompanying PBI index for the T2Tv2.0 BAM. Only available for CDRv7 samples. |
chm13v2.0_deepvariant_phased_tbi | TBI index for the T2Tv2.0 PEPPER-Margin-DeepVariant phased VCF. Only available for CDRv7 samples. |
chm13v2.0_deepvariant_phased_vcf | T2Tv2.0 PEPPER-Margin-DeepVariant phased single-sample VCF; a filter of QUAL<40 has been applied. Only available for CDRv7 samples. |
chm13v2.0_deepvariant_tbi | TBI index for the T2Tv2.0 PEPPER-Margin-DeepVariant VCF. Only available for CDRv7 samples. |
chm13v2.0_deepvariant_vcf | T2Tv2.0 PEPPER-Margin-DeepVariant single-sample VCF; a filter of QUAL<40 has been applied. Only available for CDRv7 samples. |
chm13v2.0_dv_gtbi | TBI index for the DeepVariant T2Tv2.0 GVCF. Only available for CDRv8 samples. |
chm13v2.0_dv_gvcf | T2Tv2.0 DeepVariant single-sample SNP & Indel GVCF. Only available for CDRv8 samples. |
chm13v2.0_haplotagged_bai | T2Tv2.0 haplotagged BAM index. Only available for CDRv7 samples. |
chm13v2.0_haplotagged_bam | T2Tv2.0 haplotagged BAM. Only available for CDRv7 samples. |
chm13v2.0_pav_tbi | TBI index for the T2Tv2.0 PAV VCF. Only available for the PacBio samples. |
chm13v2.0_pav_vcf | T2Tv2.0 PAV single-sample VCF. Only available for the PacBio samples. |
chm13v2.0_pbsv_tbi | TBI index for the T2Tv2.0 PBSV SV single-sample VCF |
chm13v2.0_pbsv_vcf | T2Tv2.0 PBSV SV single-sample VCF |
chm13v2.0_sniffles_snf | T2Tv2.0 Sniffles2 single-sample SNF file |
chm13v2.0_sniffles_tbi | TBI index for the T2Tv2.0 Sniffles2 VCF file |
chm13v2.0_sniffles_vcf | T2Tv2.0 Sniffles2 single-sample VCF file |
grch38_bai | The accompanying index for the grch38_noalt BAM. |
grch38_bam | grch38_noalt sequencing reads in BAM format |
grch38_bam_pbi | The accompanying PBI index for the grch38_noalt BAM. Only available for CDRv7 samples. |
grch38_deepvariant_phased_tbi | TBI index for the grch38_noalt PEPPER-Margin-DeepVariant phased VCF. Only available for CDRv7 samples. |
grch38_deepvariant_phased_vcf | grch38_noalt PEPPER-Margin-DeepVariant phased single-sample VCF; a filter of QUAL<40 has been applied. Only available for CDRv7 samples. |
grch38_deepvariant_tbi | TBI index for the grch38_noalt PEPPER-Margin-DeepVariant VCF. Only available for CDRv7 samples. |
grch38_deepvariant_vcf | grch38_noalt PEPPER-Margin-DeepVariant single-sample VCF; a filter of QUAL<40 has been applied. Only available for CDRv7 samples. |
grch38_dv_gtbi | TBI index for the DeepVariant grch38_noalt GVCF. Only available for CDRv8 samples. |
grch38_dv_gvcf | grch38_noalt DeepVariant single-sample SNP & Indel GVCF. Only available for CDRv8 samples. |
grch38_haplotagged_bai | grch38_noalt haplotagged BAM index. Only available for CDRv7 samples. |
grch38_haplotagged_bam | grch38_noalt haplotagged BAM. Only available for CDRv7 samples. |
grch38_pav_tbi | TBI index for the grch38_noalt PAV VCF. Only available for the PacBio samples. |
grch38_pav_vcf | grch38_noalt single-sample PAV VCF. Only available for the PacBio samples. |
grch38_pbsv_tbi | TBI index for the grch38_noalt PBSV SV VCF |
grch38_pbsv_vcf | grch38_noalt PBSV SV single-sample VCF |
grch38_sniffles_snf | grch38_noalt Sniffles2 single-sample SNF file |
grch38_sniffles_tbi | TBI index for the grch38_noalt Sniffles2 VCF file |
grch38_sniffles_vcf | grch38_noalt single-sample Sniffles2 VCF file |
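Because a sample can appear on more than one manifest row and not every column is filled, it can help to index paths by the full key. The following Python sketch shows one way to do that. It assumes the manifest is a TSV with a header row, that missing files appear as empty cells, and that header names can be compared case-insensitively; the rows and bucket paths are invented, and files_by_sample is a hypothetical helper name.

```python
import csv
import io
from collections import defaultdict

# Invented rows and paths; only a few of the manifest columns are shown.
MANIFEST_TSV = (
    "research_id\tcenter\tplatform\tgrch38_bam\tgrch38_bai\tgrch38_pav_vcf\n"
    "1000001\tBCM\trevio\tgs://example-bucket/1000001.bam"
    "\tgs://example-bucket/1000001.bam.bai\tgs://example-bucket/1000001.pav.vcf.gz\n"
    "1000001\tUW\tont\tgs://example-bucket/1000001.ont.bam"
    "\tgs://example-bucket/1000001.ont.bam.bai\t\n"
)

def files_by_sample(handle, column):
    """Return {research_id: {(center, platform): path}} where `column` is filled."""
    reader = csv.DictReader(handle, delimiter="\t")
    # Normalize header case so "Platform" and "platform" are treated alike.
    reader.fieldnames = [name.lower() for name in reader.fieldnames]
    paths = defaultdict(dict)
    for row in reader:
        path = (row.get(column) or "").strip()
        if path:  # assumed: an empty cell means the file is not available
            paths[row["research_id"]][(row["center"], row["platform"])] = path
    return dict(paths)

bams = files_by_sample(io.StringIO(MANIFEST_TSV), "grch38_bam")
pav_vcfs = files_by_sample(io.StringIO(MANIFEST_TSV), "grch38_pav_vcf")
```

In this toy input, the sample has a BAM for both rows but a PAV VCF only for the PacBio row, which mirrors the "Only available for the PacBio samples" notes in the table.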
Frequently Asked Questions (FAQs) Regarding the Genomic Data Organization
1. Which variants in the VDS are included in the VAT?
Variants included in the VAT must meet the following criteria:
- Sites that pass the 'filters' field
- Sites with 50 or fewer alternative alleles (for CDRv7)
- Variants from these sites that pass the 'FT' field and can be annotated by Nirvana from the VDS.
Passing the 'FT' field means that at least one call for the variant has passed the 'FT' filter.
Note: The cutoff for alternative alleles in CDRv7 is 50; this number may change in other releases.
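The criteria above can be read as a simple predicate. The Python sketch below applies them to a plain dictionary standing in for one site; it is not the pipeline that builds the VAT, and the dictionary layout, the annotatable callable, and the function name variants_for_vat are all assumptions made for illustration.

```python
MAX_ALT_ALLELES = 50  # CDRv7 cutoff stated above; may differ in other releases

def variants_for_vat(site, annotatable):
    """Return the alternate alleles of one site that meet the stated criteria.

    `site` is a plain dict standing in for a VDS site (hypothetical layout):
      filters     - set of site-level filter names, empty if the site passes
      alt_alleles - list of alternate alleles at the site
      ft_pass     - {alt_allele: [bool, ...]}, one FT result per call
    `annotatable` is a callable standing in for "Nirvana can annotate this".
    """
    if site["filters"]:  # site fails the 'filters' field
        return []
    if len(site["alt_alleles"]) > MAX_ALT_ALLELES:  # too many alternate alleles
        return []
    return [
        alt
        for alt in site["alt_alleles"]
        # At least one call for the variant must pass 'FT'.
        if any(site["ft_pass"].get(alt, [])) and annotatable(alt)
    ]

example_site = {
    "filters": set(),
    "alt_alleles": ["G", "T"],
    "ft_pass": {"G": [False, True], "T": [False]},
}
kept = variants_for_vat(example_site, lambda alt: True)  # ['G']
```

Here "G" is kept because one of its calls passes 'FT', while "T" is dropped because none of its calls do.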
2. Does the All of Us genomic dataset have Whole Exome Sequencing (WES) data?
No, the All of Us genomic dataset has Whole Genome Sequencing (WGS) data and not WES data. WES data only contains sequencing data for the protein-coding regions of the genome, known as exons, whereas WGS data covers the entire genome. If you are only interested in the exome, we recommend that you use the exome callset, a smaller callset that provides the variants within the exome.
3. Where can I find the research ID in the CRAM and IDAT files?
The research ID is in the file names of the CRAM and IDAT files. To correlate research IDs between the variant files and the raw data files, use the research IDs in the file name of the raw data files (CRAM and IDATs).
4. Where is the rsID stored for each variant?
The rsID for each variant is stored in the Variant Annotation Table (VAT). If you have an rsID of interest, you can use the VAT to determine the genomic coordinates of the variant for analysis in the Hail MT, VCFs, or PLINK formats.
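A lookup of this kind can be sketched in Python as follows. The column names used here (vid, contig, position, ref_allele, alt_allele, dbsnp_rsid) are assumptions that should be checked against the VAT schema for your release, the rows are invented, and the sketch assumes each dbsnp_rsid cell holds a single rsID.

```python
import csv
import io

# Invented rows; column names are assumed and should be checked against the VAT schema.
VAT_TSV = (
    "vid\tcontig\tposition\tref_allele\talt_allele\tdbsnp_rsid\n"
    "1-1000-A-G\tchr1\t1000\tA\tG\trs0000001\n"
    "1-1000-A-G\tchr1\t1000\tA\tG\trs0000001\n"  # same variant repeated on another row
    "2-2500-C-T\tchr2\t2500\tC\tT\trs0000002\n"
)

def coordinates_for_rsid(handle, rsid):
    """Return unique (contig, position, ref, alt) tuples whose rsID matches."""
    hits = []
    for row in csv.DictReader(handle, delimiter="\t"):
        if row["dbsnp_rsid"] == rsid:
            hits.append(
                (row["contig"], int(row["position"]), row["ref_allele"], row["alt_allele"])
            )
    # A variant may appear on several rows; keep one entry each, in file order.
    return list(dict.fromkeys(hits))

coords = coordinates_for_rsid(io.StringIO(VAT_TSV), "rs0000001")
```

The resulting coordinates can then be used to subset the variant files by position.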