Introduction

The All of Us genomic data includes short-read whole genome sequencing (srWGS) data, long-read whole genome sequencing (lrWGS) data, and genotyping microarray (“array”) data. Researchers access this genomic data through the Researcher Workbench (RW) Controlled Tier dataset (i.e., genomic data is not available through the Registered Tier). Bucket locations for accessing the data in analysis notebooks can be found in the Controlled CDR Directory.

Short variants - Single Nucleotide Polymorphisms (SNPs) and Insertions & Deletions (Indels) - are available for srWGS data, lrWGS data, and arrays. Structural variants (SVs) are available for srWGS and lrWGS data. We provide variant data in VariantDataset (VDS), Hail MatrixTable (MT), Variant Call Format (VCF), Binary GEN format (BGEN), and PLINK 1.9 bed/bim/fam triplets. Raw data is available in compressed CRAM or BAM format for the WGS reads and IDAT files for array data. We also provide auxiliary tabular data, such as the joint callset QC flagged samples or related pairs. A summary of the file formats for each data type can be found in Table 1.

In this article, we will summarize the genomic data formats and what information is available in each data type. In some cases, we will refer to other documentation when it describes the data format we deliver. This article assumes a general knowledge of genomics and bioinformatics. For a workspace on getting started with genomic data on the Researcher Workbench, please see the How to Work with All of Us Genomic Data Featured Workspace. We also provide a detailed report on the quality of the genomic data with each release in the All of Us Genomic Data Quality Report available on the User Support Hub.

List of All of Us genomic data

Short-read whole genome sequencing (srWGS) - 414,830

Sequencing reads in CRAM format, aligned to hg38/GRCh38
SNP & Indel Variant data
- Hail Variant Dataset (VDS): joint callset across the entire genome
- Exome callset: Hail MT, VCF, PLINK bed, PGEN, BGEN
- ClinVar callset: Hail MT, VCF, PLINK bed, PGEN, BGEN
- ACAF threshold callset: Hail MT, VCF, PLINK bed, PGEN, BGEN - variants that have population-specific AF > 1% or population-specific AC > 100
- Challenging medically relevant genes (CMRG)
Annotated variants: Variant Annotation Table
Auxiliary Data

Short-read whole genome sequencing structural variants (SVs) - 97,061

Joint-called SV VCF
Sites-only SV VCF
srWGS SV maximal set of unrelated samples
Unrelated sites-only VCF
srWGS SV samples with probable aneuploidies
Sample list

Genotyping array - 447,278

Raw genotyping scanner data in IDAT format
SNP & Indel variant data

Long-read whole genome sequencing (lrWGS) - 2,800

11 cohorts grouped by sequencing facility and platform
Sequencing reads in BAM format, aligned to grch38_noalt & T2Tv2.0
- Annotated with methylation signals
De novo assembly in GFA and FASTA format for cohorts with PacBio data
Variant data for each grch38_noalt & T2Tv2.0 assembly
- Joint SNP & Indel variants for each cohort in GVCF & Hail MT formats
- Single-sample SNP & Indel variants in GVCF format
- Single-sample SVs from PBSV & Sniffles2
- Single-sample PAV variants in VCF format for samples with PacBio data
- Auxiliary sample metrics for both reference versions

Overview of the Genomic Data

The main deliverables of interest in the All of Us Research Program are the genomic variants, which are delivered in multiple data formats in order to meet researchers' various needs.

Table 1 – Deliverables for each genomic data type

Deliverable	srWGS SNP & Indel	Array	srWGS SVs	lrWGS
Reference version	hg38/GRCh38 reference	hg38/GRCh38 reference (Note: variants are originally called against the hg19 reference and are lifted over before release on the RW)	hg38/GRCh38 reference	T2Tv2.0; grch38_noalt
Raw data	CRAM files	IDAT files	CRAM files (same deliverable as srWGS SNP & Indels)	BAM files for each reference version; De novo assembly for PacBio cohorts: primary, alternate, and two chromosome copies in Graphical Fragment Assembly (GFA) format and FASTA format
Variant data	Joint-callset variant data for all samples: VDS; Smaller callsets (ACAF threshold, exome, ClinVar) in VCF, Hail MT, BGEN, PGEN, and PLINK bed formats; CMRG callset	Single-sample VCFs (all VCFs have the same variants); Hail MT (merged); PLINK bed files (merged)	Joint-called VCF; Sites-only VCF; Unrelated sites-only VCF	Joint SNP & Indel variants (GVCF & Hail MT format); Single-sample SNP & Indel variants (GVCF); Single-sample PBSV SVs (VCF); Single-sample Sniffles2 SVs (VCF & SNF); Single-sample PAV variants (VCF) for PacBio cohorts
Auxiliary files	Annotated variants (Variant Annotation Table); Genetic ancestry; Genetic admixture estimates; Pharmacogenomics variant calls (star alleles); Statistical phasing; Relatedness; Maximal set of unrelated samples; Flagged samples; Genomic QC values; srWGS Genomic metrics file; Control samples	Genetic ancestry, admixture estimates, pharmacogenomics, and relatedness available for array samples that have srWGS data	Genetic ancestry, admixture estimates, pharmacogenomics, and relatedness available for srWGS samples based on the srWGS SNP & Indel deliverables; Maximal set of unrelated samples; Samples with probable aneuploidies; srWGS SV sample list	Genetic ancestry, admixture estimates, pharmacogenomics, and relatedness available for lrWGS samples based on the srWGS SNP & Indel deliverables; lrWGS sample metrics; lrWGS flagged samples

Short-Read Whole Genome Sequencing (srWGS) Data

srWGS CRAM files

We provide raw data for srWGS samples in CRAM format, otherwise known as compressed SAM (sequence alignment map) format. The data are mapped to the hg38/GRCh38 reference. Refer to the All of Us Genomic Data Quality Report for more information on how variant calling was performed on these raw data files.

There is one CRAM file and one CRAM index file for each srWGS sample, and the research ID appears in the file name. The path to each CRAM file is found in the manifest CSV file, which contains one row per sample with the columns person_id, cram_uri, and cram_index_uri.
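
As an illustration of how the manifest might be used, the sketch below looks up the CRAM and index paths for one research ID. It assumes the manifest has a header row with the three column names given above; the bucket paths and IDs in the sample text are placeholders, not real locations.

```python
import csv
import io

# Hypothetical manifest contents (placeholder IDs and bucket paths).
MANIFEST_TEXT = """person_id,cram_uri,cram_index_uri
1000001,gs://example-bucket/wgs_1000001.cram,gs://example-bucket/wgs_1000001.cram.crai
1000002,gs://example-bucket/wgs_1000002.cram,gs://example-bucket/wgs_1000002.cram.crai
"""


def load_manifest(handle):
    """Return {person_id: (cram_uri, cram_index_uri)} from a manifest CSV handle."""
    reader = csv.DictReader(handle)
    return {
        row["person_id"]: (row["cram_uri"], row["cram_index_uri"])
        for row in reader
    }


manifest = load_manifest(io.StringIO(MANIFEST_TEXT))
cram_uri, crai_uri = manifest["1000001"]
print(cram_uri)
print(crai_uri)
```

In a notebook, the same function could be given an open file handle for the real manifest instead of the in-memory sample.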

The raw data is more expensive to use than the variant data because you must pay egress charges, which are the costs of retrieving the data from the cloud for analysis. We do not charge egress for variant data. Please see the Genomics FAQ for Recommendations for processing CRAMs with GATK on the Researcher Workbench.

srWGS SNP & Indel variant data

The srWGS SNP & Indel dataset is joint-called and delivered as a complete callset in VariantDataset (VDS) format, which is a Hail data storage format for large datasets. Hail MT, VCFs, and PLINK files are available for all samples over limited regions, including the exome, ClinVar variants, and common variants within each genomic ancestry group. For further information about the Hail MT, VCF, and PLINK files, please see Smaller Callsets for Analyzing Short Read WGS SNP & Indel Data with Hail MT, VCF, and PLINK.

VariantDataset (VDS)

The Hail VariantDataset (VDS) is a data storage format we use for the All of Us srWGS SNP & Indel variant data. With one of the largest callsets in the world, the VDS helps to store variant data efficiently for all samples over the entire genome. The VDS is a sparse Hail data storage format: it stores less data than a dense format while retaining reference information. As a comparison, the Hail MT is a dense variant storage format with every entry populated. For an overview of the VDS, check out the article ‘The new VDS format for All of Us srWGS data’.

If possible, we recommend that researchers use the smaller callsets for their analysis to save time and money. Most downstream analyses of the VDS involve filtering and converting the VDS into a VCF, Hail MT, or other dense format (“densifying”). We have performed this step already to cover most use cases with reduced srWGS SNP & Indel variant datasets in VCF, Hail MT, BGEN, and PLINK bed formats over commonly used areas of the genome (see Genomics FAQ: Smaller callsets for analyzing srWGS SNP & Indel data with Hail MT, VCF, and PLINK).

Instructions for densifying the VDS are available in the article ‘The new VDS format for All of Us srWGS data’ and the Manipulate Hail VariantDataset tutorial notebook.

In the following sections, we describe how the VDS stores variant data, reference data, and how to determine if a variant site is filtered.

Variant Data

The VDS uses variant level row fields to store data for all samples, including the variant locus (locus), a list of alternate alleles (alleles), and site level filtering data (filters). Local fields store data that only apply to a single sample, including genotype metadata and genotype filtering. The local alleles (LA) array maps the alleles that appear in the individual sample to the list of alternate alleles (alleles), thus genotype metadata is only stored for samples with the genotype.

Some familiar annotations from a VCF or Hail MT are not present in the VDS, but can be rendered when densifying the VDS. The allele count for each alternate allele (AC), the total number of alleles at each site (AN), and the frequency of each alternate allele (AF) are also stored in the Variant Annotation Table (VAT) for all variants that pass filtering.

Tables 2-5 describe the fields in the All of Us VDS. Please see the Hail documentation for more information on the Hail data types.

Table 2. VDS column fields: stores sample name

VDS Field	Description	Hail data type
s	Research ID	str

Table 3. VDS row fields: stores variant data

VDS Field	Description	Hail data type	Example
locus	Positional data for the variant. Formatted as chromosome name and position separated by colon.	locus<GRCh38>	chr1:12807
alleles	List of alleles at a locus for all samples (otherwise known as global alleles). The first allele is the reference allele. All the alternate alleles are then listed in alphabetical order.	array<str>	[“C”, “T”]
filters	Site level filtering information. Hard threshold filters include EXCESS_ALLELES, NO_HQ_GENOTYPES, LowQual, and ExcessHet. If no filtering reason is provided or there is a PASS, then the site has passed filtering.	set<str>	{“LowQual”, “NO_HQ_GENOTYPES”}
as_vets	Variant Extract-Train-Score Filtering model information for this site. Does not contain information about whether or not the site was filtered. We recommend that most users ignore this field and look at filters for useful filtering information.	dict<str, struct { model: str, calibration_sensitivity: float64 }>	{"T":("INDEL",7.58e-01)}

Table 4. VDS entry fields: stores genotype level variant data

VDS Field	Description	Hail data type	Example
GQ	Genotype Quality. Follows VCF description.	int32	63
RGQ	Reference Genotype Quality. Follows VCF description.	int32	101
PS	Phase set - the set of phased genotypes to which this genotype belongs. The PS field contains an integer that represents the position of the first phased variant in the set. If the genotype is unphased, the corresponding PS field is ignored.	int64	26887031
LGT	Local genotype. The coordinates map to LA. LA always includes the reference allele so the call can be [0/1], [1/1], or [1/2].	call	[1/1]
LAD	Local allele depth, describes the allele depth for one sample. Maps to the alleles described in the local alleles (LA) array. See VCF description.	array<int32>	[0,8]
LA	Local alleles. The reference allele and allele(s) that appear in the sample are listed as coordinates mapping to the global alleles array. The reference coordinate is always included.	array<int32>	[0,1]
FT	Boolean containing genotype level filtering. True for PASS, False for FAIL, and NA for (.). In most cases, NA should be treated as PASS. The filtering reason is not provided.	bool	True
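
The local-allele encoding in Tables 3 and 4 can be illustrated with a small plain-Python sketch (not Hail code). It translates a sample's LGT and LAD into indices on the global alleles array using LA. The site and values are made up, and filling a depth of 0 for alleles the sample does not carry is an assumption made for illustration.

```python
def local_to_global_gt(lgt, la):
    """Translate local genotype indices (LGT) into global allele indices via LA."""
    return tuple(la[i] for i in lgt)


def local_to_global_ad(lad, la, n_global_alleles):
    """Expand LAD to one depth per global allele (0 where the allele is absent)."""
    depths = [0] * n_global_alleles
    for local_index, global_index in enumerate(la):
        depths[global_index] = lad[local_index]
    return depths


alleles = ["C", "T", "G", "CA"]  # global alleles at a hypothetical site
la = [0, 2]                      # sample carries the reference and global allele 2
lgt = (1, 1)                     # homozygous for local allele 1
lad = [0, 8]                     # depths for the two local alleles

gt = local_to_global_gt(lgt, la)
print(gt)                                # (2, 2)
print([alleles[i] for i in gt])          # ['G', 'G']
print(local_to_global_ad(lad, la, len(alleles)))  # [0, 0, 8, 0]
```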

Table 5. VDS global fields: filtering metadata for the entire callset

Note: These fields report metadata of the filtering model. See the row filter field filters or entry genotype field FT to see whether a variant did not meet the threshold reported in these fields.

VDS Field	Description	Hail data type
truth_sensitivity_snp_threshold	SNP sensitivity threshold	float64
truth_sensistivity_indel_threshold	Indel sensitivity threshold	float64

Reference Data

The VDS also stores reference data for each sample as reference blocks in a separate component table reference_data. The row key is the locus and the ref_allele denotes the reference base at the genomic coordinate. Columns are keyed by the sample ID. No data at a particular location indicates that the sample has a variant call.

Table 6. VDS reference data column fields: stores sample name

VDS Field	Description	Hail data type
s	Research ID	str

Table 7. VDS reference data row fields: stores reference data

VDS Field	Description	Hail data type	Example
locus	Positional data for the variant. Formatted as chromosome name and position separated by colon.	locus<GRCh38>	chr1:10029
ref_allele	The reference allele at the genomic coordinate	str	“A”

Table 8. VDS reference data entry fields: stores reference blocks

VDS Field	Description	Hail data type	Example
GQ	Genotype Quality. Follows VCF description.	int32	40
END	Indicates the end of the reference block, which is the group of consecutive non-variant sites that have the same genotype quality. All coordinates between the start locus and the end coordinate are called as reference for the sample.	int32	10036
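
To make the reference block layout concrete, here is a minimal plain-Python sketch that checks whether a position is covered by one of a sample's reference blocks. It assumes the blocks for one chromosome are available as (start, END, GQ) tuples sorted by start and that END is inclusive; the values are illustrative only.

```python
from bisect import bisect_right

# One sample's reference blocks on one chromosome: (start, END, GQ).
ref_blocks = [(10029, 10036, 40), (10040, 10100, 20)]


def reference_gq(blocks, position):
    """Return the block GQ if `position` lies in a reference block, else None."""
    starts = [start for start, _, _ in blocks]
    i = bisect_right(starts, position) - 1  # last block starting at or before position
    if i >= 0:
        start, end, gq = blocks[i]
        if start <= position <= end:  # END treated as inclusive (assumption)
            return gq
    return None


print(reference_gq(ref_blocks, 10033))  # 40
print(reference_gq(ref_blocks, 10038))  # None
```

A result of None means the position is not covered by a reference block for that sample, so the variant data should be consulted instead.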

Filtering Information

The variant filtering data is represented in two fields in the VDS: filters and FT (Table 3, Table 4). The filters field contains site-level filters, including EXCESS_ALLELES, NO_HQ_GENOTYPES, LowQual, and ExcessHet. If no filtering reason is provided or the filters field contains PASS, then the site has passed filtering. The FT field contains genotype-level filtering. The genotype-level filtering reasons are not specified in the All of Us VDS; instead, a boolean describes the filtering status for the genotype. True is PASS and False is FAIL. If all genotypes fail at a site, the True or False boolean can also apply to the filters field. The variant filtering process is described in depth in the QC report. All filtered variants are soft filtered, which means the variants are marked but not removed from the callset.

We provide a tutorial notebook for converting VDS to a Hail MT format, including code to transform the FT boolean True or False in the VDS to PASS or FAIL so that it is compatible for converting to a VCF.
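
The filtering rules described in this section can be summarized in a short plain-Python sketch. This is not the tutorial notebook's code; the function names are hypothetical, and mapping NA to PASS follows the "in most cases" guidance in Table 4 rather than a stated rule.

```python
def site_passes(filters):
    """A site passes when no filter reason is set or the only value is PASS."""
    return not filters or set(filters) == {"PASS"}


def ft_to_vcf(ft):
    """Map the VDS FT boolean to a VCF-style string.

    True -> "PASS", False -> "FAIL". None (NA) is treated as PASS here,
    which is an assumption based on the guidance for most use cases.
    """
    return "FAIL" if ft is False else "PASS"


print(site_passes(set()))                           # True
print(site_passes({"LowQual", "NO_HQ_GENOTYPES"}))  # False
print(ft_to_vcf(True), ft_to_vcf(False), ft_to_vcf(None))
```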

srWGS SNP & Indel smaller callsets

We released the srWGS SNP and Indel callset in familiar data formats over limited genomic regions: VCF, Hail MT, BGEN, and PLINK bed formats. The smaller callsets, described in Smaller callsets for analyzing srWGS SNP & Indel data with Hail MT, VCF, and PLINK, cover regions of the genome that are popular with All of Us researchers: an Allele Count/Allele Frequency (ACAF) threshold callset, an exome callset, and a ClinVar callset. We recommend that you use these premade smaller callsets instead of the VDS, if possible, to save time and money.

The ACAF threshold callset contains variants that have a population-specific allele frequency (AF) greater than 1% or a population-specific allele count (AC) greater than 100 in any computed ancestry subpopulations. The exome callset contains variants that are within the exon regions of the Gencode v42 basic transcripts, with padding of 15 bases on either side of each exon. The ClinVar callset contains variants in ClinVar, regardless of pathogenicity.
The complete srWGS SNP and Indel callset across all sites is released as a VDS, which is a Hail sparse data format. We provide a tutorial notebook for converting VDS to a Hail MT format, though we recommend that you stick with the premade Hail MT, if possible, to save time and money.
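
The ACAF threshold criterion described above (population-specific AF greater than 1% or population-specific AC greater than 100 in any computed ancestry subpopulation) can be written as a one-line predicate. The sketch below is illustrative only; the input layout (a dict of per-population AC and AF) and the example values are assumptions.

```python
def passes_acaf(per_population):
    """Return True if any population has AC > 100 or AF > 0.01.

    `per_population` maps an ancestry label to an (AC, AF) tuple.
    """
    return any(ac > 100 or af > 0.01 for ac, af in per_population.values())


common_in_one_group = {"afr": (150, 0.004), "eur": (3, 0.0001)}
rare_everywhere = {"afr": (20, 0.002), "eur": (1, 0.00005)}

print(passes_acaf(common_in_one_group))  # True
print(passes_acaf(rare_everywhere))      # False
```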

srWGS Hail MT

We provide two Hail MTs for each smaller callset, both a multiallelic and multiallelic split Hail MT, resulting in six total Hail MT deliverables for the srWGS SNP and Indel callset. In the multiallelic split MT, sites with multiple alternate alleles will be split, so each row will only have one alternate allele. In the multiallelic MT, sites with multiple alternate alleles will be retained in the same row.
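
To show what splitting a multiallelic site means, the plain-Python sketch below turns one row with several alternate alleles into one row per alternate allele. Recoding every other allele to reference in the split rows is one common convention and is an assumption here; the source does not state how genotypes were recoded in the delivered MTs, and the sketch does not trim or re-normalize alleles.

```python
def split_multiallelic(locus, alleles, genotypes):
    """Yield (locus, [ref, alt], genotypes) for each alternate allele.

    Genotypes are tuples of allele indices; None marks a no-call.
    In each output row, the kept alternate allele becomes 1 and all
    other alleles become 0.
    """
    ref = alleles[0]
    for alt_index in range(1, len(alleles)):
        recoded = [
            None if gt is None else tuple(1 if a == alt_index else 0 for a in gt)
            for gt in genotypes
        ]
        yield locus, [ref, alleles[alt_index]], recoded


rows = list(split_multiallelic("chr1:12807", ["C", "T", "G"], [(0, 1), (1, 2), None]))
for row in rows:
    print(row)
```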

When using Hail MT files in the Researcher Workbench, read directly from the bucket location. Do not attempt to copy them locally.

The Hail MT follows Hail format specifications. For additional examples for what you may expect to see in the data, see the following VCF examples.

srWGS VCF

The srWGS limited callset VCFs are sorted and block compressed in bgz format (.vcf.bgz) with a local tabix index (.vcf.bgz.tbi). Each VCF is split into multiple non-overlapping sections of the genome by chromosome in separate files for usability (sharding).
Please note that we recommend using the FILTER column and the filter tag (FT) to determine the filtering status of a variant because the QUAL information is not available.

FORMAT fields (per sample-site):

Genotype (GT) -- The GT field specifies the alleles carried by the sample, encoded as 0 for the reference (REF) allele, 1 for the first alternative (ALT) allele, 2 for the second ALT allele, etc. The allele values are separated by a / or |, depending on if the genotype is phased or not. The / separator is for unphased variants and the | separator is for phased variants (see PS field below). Since humans are diploid organisms, we expect two alleles (e.g. “0/1”). Please note that the GT calls for chrX and chrY variants may be reported as either haploid or diploid, even in the case of chrY and chrX in males.
Allelic Depth (AD) -- Allelic depths for the reference allele and the alternate allele(s) present at this site. For more information about AD and which reads are counted, see this article on Allele Depth.
Genotype Quality (GQ) -- The phred-scaled confidence that the called genotype is correct. A higher score indicates a higher confidence. For more information on GQ, please see the GQ documentation. For more information on interpreting phred-scaled values, please see Phred-scaled quality scores.
Reference Genotype Quality (RGQ) -- The phred-scaled confidence that the reference genotypes are correct. A higher score indicates a higher confidence. For more information on RGQ, please see the GQ documentation, but note that RGQ applies to the reference, not the variant. For more information on interpreting phred-scaled values, please see Phred-scaled quality scores.
Genotype Filter (FT) -- The srWGS SNP & Indel genotype-level filtering information. As part of our joint callset quality control processing, we run the Variant Extract-Train-Score (VETS) method, which is a genotype-level filtering algorithm. If the genotype passes, there will be no value in this field. If the genotype fails, the value will be high_CALIBRATION_SENSITIVITY_SNP or high_CALIBRATION_SENSITIVITY_INDEL. An example code snippet for filtering genotypes, in Hail, can be found in the Manipulating Hail Matrix Table tutorial notebook.
- high_CALIBRATION_SENSITIVITY_SNP: Sample Genotype FT filter value indicating that the genotyped allele failed SNP model calibration sensitivity cutoff (0.997)
- high_CALIBRATION_SENSITIVITY_INDEL: Sample Genotype FT filter value indicating that the genotyped allele failed INDEL model calibration sensitivity cutoff (0.99)
Phase Set (PS) -- A phase set is defined as a set of phased genotypes to which this genotype belongs (See VCF 4.1 specifications). The PS value is an integer representing the position of the first phased variant in the set. It is not available for all variants. The first variant in the phase set will contain the PS identifier. If the genotype in the GT field is unphased, the corresponding PS field is ignored. The phasing data is from the DRAGEN 3.7.8 pipeline during the genotyping step, by comparing haplotypes and variants within an active variant region.
- The PS field will appear in the VCF, Hail MT, and VDS and will not appear in PLINK data types. If using downstream tools from Hail or VCF that expect unphased data, then researchers need to perform a step to unphase the data.
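
As a small illustration of the GT encoding and of the unphasing step mentioned above, the following plain-Python sketch parses a GT string and rewrites a phased call as unphased. It is a simplified example for text-level handling, not the method used by Hail or any specific downstream tool.

```python
def parse_gt(gt):
    """Return (allele_indices, phased); missing alleles ('.') become None."""
    phased = "|" in gt
    separator = "|" if phased else "/"
    alleles = tuple(None if a == "." else int(a) for a in gt.split(separator))
    return alleles, phased


def unphase(gt):
    """Rewrite a GT string with the unphased separator, sorting called alleles."""
    alleles, _ = parse_gt(gt)
    if any(a is None for a in alleles):
        return "/".join("." if a is None else str(a) for a in alleles)
    return "/".join(str(a) for a in sorted(alleles))


print(parse_gt("0|1"))  # ((0, 1), True)
print(parse_gt("1"))    # ((1,), False), a haploid call
print(unphase("1|0"))   # 0/1
```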

INFO fields (per site):

Descriptions of the INFO fields can also be found in the header of the VCF.

Allele Count (AC) -- the number of times we see each alternate allele for all samples. For example, a “1/1” genotype would count as 2 observations of the first alternate allele.
Allele Number (AN) -- the total number of alleles seen. Usually, this will be the number of samples times two, since humans are diploid organisms. No-call genotypes (“./.”) are not counted towards AN.
Allele Frequency (AF) -- the frequency of the alternate allele in the population that is the callset cohort. This is equivalent to AC/AN.
QUAL approximation (QUALapprox) -- the sum of the phred-scaled homozygous reference probability values across all samples, which is a proxy for the site-level QUAL score, but without the SNP or indel heterozygosity applied as a per-site prior probability of variation.
Allele-specific QUAL approximations (AS_QUALapprox) -- a per-allele, phred-scaled quality score derived from the sum of homozygous reference probability values across samples when each allele is considered in isolation. This is an approximation of the QUAL score for each allele. For more information on the QUAL score, see the VCF specification.
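
The relationship between AC, AN, and AF described above can be checked with a short plain-Python sketch. The genotype list is made up; no-calls are represented as None and, as stated above, do not count toward AN.

```python
def site_counts(genotypes, n_alt_alleles):
    """Compute AC (per alternate allele), AN, and AF = AC / AN.

    Each genotype is a tuple of allele indices, or None for a no-call.
    """
    ac = [0] * n_alt_alleles
    an = 0
    for gt in genotypes:
        if gt is None:
            continue  # no-call genotypes are excluded from AN
        for allele in gt:
            an += 1
            if allele > 0:
                ac[allele - 1] += 1
    af = [count / an if an else 0.0 for count in ac]
    return ac, an, af


ac, an, af = site_counts([(0, 1), (1, 1), (0, 0), None], n_alt_alleles=1)
print(ac, an, af)  # [3] 6 [0.5]
```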

FILTER values (per site):

QUAL score does not meet threshold (LowQual) -- sites with this filter have a posterior probability of being variant that is equal to or below the probability of being variant by chance, represented by the expected heterozygosity for humans (QUALapprox lower than 60 for SNPs; 69 for Indels)
- QUAL tells you how confident we are that there is some kind of variation at a given site. The variation may be present in one or more samples.
No high-quality genotypes (NO_HQ_GENOTYPES) -- sites with this filter do not have any genotypes that are considered high quality (GQ>=20, DP>=10, and AB>=0.2 for heterozygotes)
- Allele Balance (AB) is calculated for each heterozygous variant as the number of reads supporting the least-represented allele over the total number of read observations. In other words, min(allele depth)/(total depth) for diploid GTs.
Excess Heterozygosity (ExcessHet) -- sites with this filter have more heterozygote genotypes than expected by chance under Hardy-Weinberg equilibrium. ExcessHet is a phred-scaled p-value. We cut off anything more extreme than a z-score of -4.5 (p-value of 3.4e-06), which is 54.69 when phred-scaled.
Excess alleles (EXCESS_ALLELES) -- sites with this filter have an excess of alternate alleles. Our cutoff is 100: when a site has more than 100 alternate alleles, this filter will be present.
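
The thresholds quoted in the FILTER descriptions can be reproduced with a few lines of plain Python. The sketch below computes allele balance for a heterozygous genotype, applies the high-quality genotype criteria (GQ>=20, DP>=10, and AB>=0.2 for heterozygotes), and converts a p-value to the phred scale. The function names and example values are hypothetical, and using DP as the total number of read observations is an assumption.

```python
import math


def allele_balance(gt, ad, dp):
    """Reads for the least-represented genotype allele over total depth."""
    return min(ad[a] for a in gt) / dp if dp else 0.0


def is_high_quality(gt, ad, dp, gq):
    """Apply GQ>=20, DP>=10, and (for heterozygotes only) AB>=0.2."""
    if gq < 20 or dp < 10:
        return False
    is_het = len(set(gt)) > 1
    return allele_balance(gt, ad, dp) >= 0.2 if is_het else True


def phred(p_value):
    """Phred-scale a probability: -10 * log10(p)."""
    return -10 * math.log10(p_value)


print(is_high_quality((0, 1), [12, 8], dp=20, gq=50))  # True (AB = 0.4)
print(is_high_quality((0, 1), [18, 2], dp=20, gq=50))  # False (AB = 0.1)
print(round(phred(3.4e-06), 2))                        # 54.69
```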

PLINK 1 binary biallelic genotype table (PLINK bed)

We provide PLINK 1.9 data (.bed/.bim/.fam) for the srWGS SNP and Indel smaller callsets. The PLINK files are converted from the Hail MT using the export_plink command in Hail and contain all information in the Hail MT. PLINK file type information can be found at the PLINK site. The .bed file is the PLINK binary biallelic genotype table and contains genotype calls. The .bim file is the PLINK extended .map file, and is a text file containing variant information. The .fam file is a text file with sample information for each participant. Please refer to the published notebooks on how to use the PLINK 1.9 data.
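
For readers who want to inspect the text companions of the .bed file, the sketch below parses .bim and .fam lines using the six-column layouts defined by the PLINK 1.9 file format. The dictionary key names are our own labels, and the example rows (variant ID style, sample ID) are made up.

```python
# Column order follows the PLINK 1.9 format; the key names are illustrative labels.
BIM_COLUMNS = ["chrom", "variant_id", "cm", "pos", "allele1", "allele2"]
FAM_COLUMNS = ["fid", "iid", "father", "mother", "sex", "phenotype"]


def parse_plink_text(lines, columns):
    """Parse whitespace-delimited .bim or .fam lines into dictionaries."""
    return [dict(zip(columns, line.split())) for line in lines if line.strip()]


bim = parse_plink_text(["chr1\tchr1:12807:C:T\t0\t12807\tT\tC"], BIM_COLUMNS)
fam = parse_plink_text(["0 1000055 0 0 0 -9"], FAM_COLUMNS)

print(bim[0]["variant_id"], bim[0]["pos"])
print(fam[0]["iid"])
```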

PLINK 2 binary genotype table (PGEN)

We provide PLINK 2 data (.pgen, .psam, .pvar) for the srWGS SNP and Indel smaller callsets. The PLINK files are converted from the smaller callset VCFs. The .pgen file is the file containing genotype calls. It is accompanied by a .pvar and a .psam file. Please see the PLINK documentation for more details. The .pvar is a text file containing variant information to accompany the .pgen file. The .psam file is a text file containing sample information.

Binary GEN format (BGEN)

We have released the srWGS SNP and Indel smaller callsets in Binary GEN format (BGEN). The files are sharded by chromosome and only contain hard calls, which are calls with probability values of 0.0 or 1.0. Please see the BGEN documentation for more information about this format.

srWGS SNP & Indel smaller callset BED files

We provide the genomic territory, otherwise known as interval files, used to create the srWGS SNP and Indel smaller callsets as UCSC BED files. The BED files contain the genomic regions for the exome, ACAF threshold, and ClinVar callsets.
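
When comparing variant positions against these interval files, keep in mind that UCSC BED coordinates are 0-based and half-open, while VCF positions are 1-based. The sketch below shows one way to test membership under that convention; the interval and function names are illustrative.

```python
def parse_bed(lines):
    """Parse BED lines into (chrom, start, end) tuples (0-based, half-open)."""
    intervals = []
    for line in lines:
        if not line.strip() or line.startswith(("#", "track", "browser")):
            continue
        chrom, start, end = line.split()[:3]
        intervals.append((chrom, int(start), int(end)))
    return intervals


def in_intervals(intervals, chrom, pos_1based):
    """True if a 1-based position (as in a VCF) lies inside any BED interval."""
    return any(c == chrom and s < pos_1based <= e for c, s, e in intervals)


intervals = parse_bed(["chr1\t100\t200"])
print(in_intervals(intervals, "chr1", 100))  # False
print(in_intervals(intervals, "chr1", 101))  # True
print(in_intervals(intervals, "chr1", 200))  # True
```

A linear scan is fine for a handful of intervals; for whole-callset work, an interval-aware tool is a better fit.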

Challenging medically relevant genes (CMRG)

The CMRG callset is a separate callset for 30 protein-coding genes, including 7 challenging medically relevant genes (CMRG) such as KCNE1, CBS, and MAP2K3. These genes were impacted by falsely duplicated and collapsed errors in the GRCh38 reference genome as identified in a previous report in Science. We currently see reduced sensitivity in the srWGS SNP & Indel callsets for these genes.

To provide variant calls for these genes, we extracted reads from the CRAM files, reconstructed BAM files using FixItFelix, called variants with DRAGEN-GATK, reblocked the VCFs, and performed joint calling with the GVS pipeline. We have provided UCSC BED files containing the genomic regions that we called for the CMRG callset, available on the Researcher Workbench. The FixItFelix tool reconstructed the BAMs using a modified version of the hg38 reference with duplicate genes masked out. The modified version of the hg38 reference can be found at this link for download. Note: this callset has not been filtered and should not be intersected with other All of Us callsets because it is called on a different reference.

Annotated Variants - Variant Annotation Table (VAT)

The Variant Annotation Table (VAT) is a resource provided for all samples with srWGS SNP & Indel data. The VAT gives functional annotations for all passing variants. Variants must pass both site-level (filters) and genotype-level (FT) filtering. The Variant Annotation Table contains site-level annotations such as allele counts for each alternate allele (AC), the total number of alleles at each site (AN), and the frequency of each alternate allele (AF). These site-level annotations are not in the VDS. The VAT can be used alongside the VDS to determine variants of interest for your analysis. We provide the annotations as one single, merged tsv file (“.tsv.bgz”), which can be loaded into Hail. Please read the Variant Annotation Table article for more information.

srWGS auxiliary data

srWGS genetic predicted ancestry

We provide genetic ancestry groupings for all samples with srWGS data as a .tsv file, sorted by research ID. Genetic ancestry is inferred by measuring the genetic similarity of each participant to global reference populations. We compute these categorical groupings of genetic similarity to reference populations using harmonized continental metadata labels from the Human Genome Diversity Project (HGDP) and 1000 Genomes Project training data (N=3,942) for all srWGS samples in All of Us. Please see the All of Us Genomic Data Quality Report Appendix G for more information.

As genetic similarity is continuous, the groupings of the genetic similarity categories presented here are used to highlight genetic similarity between individuals to aid in variant classification and risk. The categories are based on the labels used in gnomAD, the HGDP, and 1000 Genomes. We use the following acronyms or terms to describe genetic similarity to a reference population: 1KGP-HGDP-AFR-like (AFR or African); 1KGP-HGDP-AMR-like (AMR or Americas); 1KGP-HGDP-EAS-like (EAS or East Asian); 1KGP-HGDP-EUR-like (EUR or European); 1KGP-HGDP-MID-like (MID or Middle Eastern); 1KGP-HGDP-SAS-like (SAS or South Asian); and OTH (remaining individuals) for samples that do not belong to one of the other ancestries or are admixed.

We provide the genetic ancestry groupings as a .tsv file along with a plot of the ancestry predictions (html file). The PCA analysis was performed using Hail's hwe_normalized_pca method. In order to allow researchers to reproduce these files and also apply our method for predicting genetic ancestry groupings on their own data, we also provide a set of files we used to predict genetic ancestry, described as follows:

Loadings file: captures how each genetic variant contributes to the principal components (PCs). The file can be used to project an individual’s genetic data on the same PCA space as the one used for the ancestry prediction. The loadings file is a Hail file type.
Eigenvalues of the PCs: the eigenvalues represent the amount of genetic variation each PC explains.
Classifier .pkl file: contains the trained ancestry prediction model.
Training PCA: The genetic ancestry groupings of the training data (1000 Genomes and HGDP)
Sites-only VCF: a sites-only VCF of the locations we used for training the ancestry predictions classifier (which is described as the HQ sites in the QC report, Appendix H). The VCF is block compressed and accompanied by a TBI index.

Table 9. srWGS genetic predicted ancestry TSV file description

Field Name	Key?	Type	Nullable?	Example Value	Notes
research_id	yes	String	No	1000055	This comes from sample metadata.
ancestry_pred	no	String	No	mid	The predicted ancestry for the sample, not including “other.”
probabilities	no	Array[number]	No	[0.10, 0.99, 0.001, … 0.0]	Confidence of each output class (i.e. computed ancestry). Each will have a length equal to the number of possible computed ancestry labels minus one (6). Probabilities are listed in the order: AFR, AMR, EAS, EUR, MID, and SAS. The ancestry “Other” is computed separately based on the confidence of the other classes.
pca_features	no	Array[number]	No	[8.1232, 0.01234, 3.1123, …, 0.00132]	The principal components of the projection for the sample. Each value is an array with a length of 16.
ancestry_pred_other	no	String	No	oth	The predicted ancestry for the sample, including “other.”

Column Explanations:

Field name -- The name of the field. In tsv files, this will appear on the first row of the file.
Type -- Data type. Arrays are possible.
Key? -- Whether this field makes up a unique key for the row. Note that all key fields together make a unique key for the row.
Notes -- Any other relevant information.
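
As a sketch of how the ancestry TSV described in Table 9 might be read, the example below pairs each probability with its label in the documented order (AFR, AMR, EAS, EUR, MID, SAS). It assumes the array columns are serialized as bracketed, comma-separated lists that parse as JSON; check the actual file before relying on this. The sample row is made up, and its pca_features array is shortened from the documented length of 16.

```python
import csv
import io
import json

LABELS = ["AFR", "AMR", "EAS", "EUR", "MID", "SAS"]

# Made-up row; pca_features is truncated for readability.
TSV_TEXT = (
    "research_id\tancestry_pred\tprobabilities\tpca_features\tancestry_pred_other\n"
    "1000055\tmid\t[0.01, 0.02, 0.0, 0.05, 0.9, 0.02]\t[8.1232, 0.01234]\tmid\n"
)


def read_ancestry(handle):
    """Return {research_id: parsed row} from an ancestry TSV handle."""
    rows = {}
    for row in csv.DictReader(handle, delimiter="\t"):
        rows[row["research_id"]] = {
            "ancestry_pred": row["ancestry_pred"],
            "ancestry_pred_other": row["ancestry_pred_other"],
            "probabilities": dict(zip(LABELS, json.loads(row["probabilities"]))),
            "pca_features": json.loads(row["pca_features"]),
        }
    return rows


ancestry = read_ancestry(io.StringIO(TSV_TEXT))
sample = ancestry["1000055"]
print(sample["ancestry_pred"], sample["probabilities"]["MID"])
```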

srWGS genetic admixture estimates

We provide genetic ancestry admixture estimates for all samples with srWGS data in .Q and FAM file formats. The analysis was performed with the Rye tool and the output file descriptions can be found in the Rye documentation.

The .Q file contains columns with the ancestry groups used in the training data and the rows are admixture estimates for each sample. The ancestry group labels that we use are 1KGP-HGDP-AFR-like (AFR), 1KGP-HGDP-AMR-like (AMR), 1KGP-HGDP-EAS-like (EAS), 1KGP-HGDP-EUR-like (EUR), 1KGP-HGDP-MID-like (MID), 1KGP-HGDP-SAS-like (SAS), and Remaining Individuals (OTH). We also provide the reference admixture estimates.

The .fam file contains the information for how each individual mapped to the training data.

Please note: The genetic admixture estimates for individuals with American ancestry may not be fully captured due to lack of appropriate samples from publicly available reference genome datasets (1KGP-HGDP in this case) to account for the full range of differences within this group. This inaccuracy may also exist for other global populations where there is limited reference data available such as the Middle Eastern group. Additionally, the ancestry proportion estimates for the All of Us participants in the 1KGP-HGDP-AMR-like genetic ancestry group is influenced by the presence of admixture within the genomes of 1KGP-HGDP-AMR individuals included in the reference datasets, affecting the accuracy. We advise caution when interpreting these estimates, as they may not fully capture the genetic richness within the Americas population.

srWGS pharmacogenomics data

A full description of the pharmacogenomics data is available in the article All of Us Pharmacogenomics (Star Allele) Calling.

The pharmacogenomics auxiliary dataset includes haplotype calls and predicted phenotypes for 18 genes relevant to human drug metabolism for all samples with srWGS data. Pharmacogenomic haplotype calling is also known as star allele calling. We provide variant data across 17 genes from the PharmGKB Tier 1 and Tier 2 lists that are supported by the tool Stargazer and one gene called by Cyrius v1.1.1. Genes with strong validation data are provided in a set of "high concordance" outputs. Genes that play significant roles in drug metabolism but do not have convincing validation results are included in a set of "low concordance" outputs.

Star allele calls are provided as per-gene .tsv files. The .tsvs contain sample names and gene names and can be concatenated easily, but are provided separately for memory usage considerations.
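As a minimal sketch of how the per-gene files could be concatenated, the example below keeps the header row once and appends the data rows of each file. The column names and values are hypothetical placeholders used only for illustration; check the header of the delivered .tsv files for the real field names.

```python
import csv
import io


def concat_tsvs(tsv_texts):
    """Concatenate TSV documents that share one header, keeping the header once."""
    header = None
    rows = []
    for text in tsv_texts:
        reader = csv.reader(io.StringIO(text), delimiter="\t")
        file_header = next(reader)
        if header is None:
            header = file_header
        elif file_header != header:
            raise ValueError("TSV headers differ; cannot concatenate")
        rows.extend(reader)
    return header, rows


# Hypothetical column names and values, for illustration only.
cyp2c9_tsv = "sample_id\tgene\tdiplotype\n1000000\tCYP2C9\t*1/*2\n"
tpmt_tsv = "sample_id\tgene\tdiplotype\n1000000\tTPMT\t*1/*1\n"
header, rows = concat_tsvs([cyp2c9_tsv, tpmt_tsv])
```

Processing one gene at a time and concatenating only the genes you need keeps memory usage low, which is the reason the files are delivered separately.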

We ran Stargazer 2.0.2 for the 17 genes other than CYP2D6. Stargazer output was post-processed to apply allele function definitions according to CPIC and improve phasing. Cyrius v1.1.1 was run on per-sample CRAM input to call CYP2D6 star alleles and gene copy number. Structural variation nomenclature was harmonized and phenotypes were applied using the cyp2d6_parser package.

High concordance genes: CYP2C_CLUSTER, ABCG2, CACNA1S, CFTR, CYP2C9, CYP2D6, G6PD, NUDT15, RYR1, TPMT, VKORC1

Low concordance genes: CYP2B6, CYP2C19, CYP3A5, CYP4F2, DPYD, SLCO1B1, UGT1A1

srWGS statistical phasing

We provide haplotype phasing data for all samples in the CDRv8 srWGS callset. The data is delivered as multi-sample VCFs, sharded by chromosome. Haplotype phasing is the estimation of haplotypes that are inherited from each parent. We use statistical methods to infer the sequence of alleles on each chromosome, following methods from the 2021 paper from Browning, Brian L. et al.: Fast two-stage phasing of large-scale sequence data.

To generate the phasing data, we performed focused QC on the srWGS VDS. We first removed variants with more than 31 alternate alleles. We then removed variants with an average sum of their AD less than 28, removed variants with a max AC less than 2 (to remove singletons and doubletons), removed variants with a mean GQ of less than 30, and variants with FILTER values of LowQual, NO_HQ_GENOTYPES, or ExcessHet.
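The filters above can be read as a single pass/fail rule per variant. The sketch below is illustrative only: it assumes the per-variant summary values have already been computed, and the dictionary keys (n_alt_alleles, mean_ad_sum, max_ac, mean_gq, filters) are hypothetical names, not fields of the VDS.

```python
EXCLUDED_FILTERS = {"LowQual", "NO_HQ_GENOTYPES", "ExcessHet"}


def passes_phasing_qc(variant):
    """Return True if a variant summary survives the phasing QC described above."""
    if variant["n_alt_alleles"] > 31:      # more than 31 alternate alleles
        return False
    if variant["mean_ad_sum"] < 28:        # average sum of AD less than 28
        return False
    if variant["max_ac"] < 2:              # max AC less than 2
        return False
    if variant["mean_gq"] < 30:            # mean GQ less than 30
        return False
    if EXCLUDED_FILTERS & set(variant["filters"]):
        return False
    return True


example = {
    "n_alt_alleles": 2,
    "mean_ad_sum": 31.5,
    "max_ac": 120,
    "mean_gq": 45.0,
    "filters": [],
}
keep = passes_phasing_qc(example)
```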

We used Beagle 5.5 for phasing with a window parameter of 15, 20, or 40, depending on the chromosome. All other parameters were default. Genetic distances were interpolated from the HapMap genetic map.

srWGS relatedness kinship scores

We calculate relatedness for all samples with srWGS data and report the kinship score of any pair with a score over 0.1. The kinship score is half of the fraction of the genetic material shared: parent-child pairs and siblings will have a score of 0.25, while identical twins will have a score of 0.5. Please see the Hail pc_relate function documentation for more information, including interpretation.

We provide the kinship scores for pairwise samples with kinship scores above 0.1. We do not provide identity kinship scores (i.e. kinship of a sample with itself). Each pair will only appear once (in other words, {sample1, sample2, 0.25} is equivalent to {sample2, sample1, 0.25}).

Table 10. srWGS pairwise samples with a kinship score over 0.1 TSV file description

Field name	Type	Key?	Notes
i.s	string	yes	Sample ID of a sample in the pair
j.s	string	yes	Sample ID of the other sample in the pair
kin	float	no	Kinship score (0-0.5)

Column Explanations:

Field name -- The name of the field. In tsv files, this will appear on the first row of the file.
Type -- Data type. Arrays are possible.
Key? -- Whether this field makes up a unique key for the row. Note that all key fields together make a unique key for the row.
Notes -- Any other relevant information.
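As a minimal sketch, the kinship TSV described in Table 10 could be loaded as shown below. Because each pair appears only once, the example stores pairs as unordered sets so that a lookup works in either order. The sample IDs and scores are invented for illustration.

```python
import csv
import io


def load_kinship(tsv_text):
    """Map each unordered sample pair to its kinship score."""
    pairs = {}
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    for row in reader:
        pair = frozenset((row["i.s"], row["j.s"]))
        pairs[pair] = float(row["kin"])
    return pairs


def fraction_shared(kin):
    """The kinship score is half of the fraction of genetic material shared."""
    return 2.0 * kin


# Invented example rows using the column names from Table 10.
example_tsv = "i.s\tj.s\tkin\n1000001\t1000002\t0.25\n1000003\t1000004\t0.5\n"
kinship = load_kinship(example_tsv)
score = kinship[frozenset(("1000002", "1000001"))]
```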

srWGS SNP & Indel maximal set of unrelated samples

We provide a list of samples to prune in order to remove related samples from the srWGS SNP & Indel cohort. Relatedness is calculated as described in the kinship score description above. This is the minimal list of related samples to prune in order to produce the maximal independent set of unrelated samples.

Table 11. List of srWGS SNP & Indel related samples to prune TSV file description

Field name	Type	Key?	Notes
sample_id.s	string	yes	Research ID of the sample

Column Explanations:

Field name -- The name of the field. In tsv files, this will appear on the first row of the file.
Type -- Data type. Arrays are possible.
Key? -- Whether this field makes up a unique key for the row. Note that all key fields together make a unique key for the row.
Notes -- Any other relevant information.
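A short sketch of how the prune list might be applied to a cohort is shown below. It assumes the TSV has the single sample_id.s column from Table 11; the research IDs are invented for illustration.

```python
import csv
import io


def read_prune_list(tsv_text):
    """Read the research IDs to prune from the TSV described in Table 11."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return {row["sample_id.s"] for row in reader}


def remove_related(cohort_ids, prune_ids):
    """Keep cohort order while dropping every sample on the prune list."""
    return [sample for sample in cohort_ids if sample not in prune_ids]


# Invented research IDs, for illustration only.
prune_ids = read_prune_list("sample_id.s\n1000002\n1000004\n")
unrelated = remove_related(["1000001", "1000002", "1000003", "1000004"], prune_ids)
```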

Flagged srWGS samples

We provide a table listing samples that are flagged as part of the sample outlier QC for the srWGS SNP and Indel joint callset. This includes the specific residual tests that were failed. The schema is described in the table below. The table will be released as a tsv.

Flagged sample tsv schema

No fields can have a null value.
Count fields do not include filtered variants.
For all of the fail_* fields, a value of true indicates that the sample is an outlier and should be flagged.

Table 12. Flagged srWGS samples TSV file description

Field Name	Type	Key?	Example Value	Notes
s	int	yes	1000000	Research ID
ancestry_pred	string	no	eur	The predicted ancestry for the sample, not including “other.”
probabilities	array<float>	no	[0.10, 0.99, 0.001, … 0.0]	Confidence of each output class (i.e. computed ancestry). Each will have a length equal to the number of possible computed ancestry labels minus one (6). The ancestry “Other” is computed separately based on the confidence of the other classes.
pca_features	array<float>	no	[8.1232, 0.01234, 3.1123, …, 0.00132]	Each will have a length of 16.
ancestry_pred_other	string	no	oth	The predicted ancestry for the sample, including “other.”
snp_count	int	no	3910035	Number of SNPs called in this sample.
ins_del_ratio	float	no	0.98814	Ratio of insertion to deletion counts.
del_count	int	no	427102
ins_count	float	no	456515
snp_het_homvar_ratio	float	no	2.1119
indel_het_homvar_ratio	float	no	2.3994
ti_tv_ratio	float	no	1.9967
singleton	int	no	15819	IMPORTANT: This is not the number of singletons in a sample. This field is a count of the number of variants not appearing in gnomAD 3.1.
fail_snp_count_residual	boolean	no	true
fail_ins_del_ratio_residual	boolean	no	false
fail_del_count_residual	boolean	no	true
fail_ins_count_residual	boolean	no	false
fail_snp_het_homvar_ratio_residual	boolean	no	true
fail_indel_het_homvar_ratio_residual	boolean	no	false
fail_ti_tv_ratio_residual	boolean	no	true
fail_singleton_residual	boolean	no	false
qc_metrics_filters	array<string>	no	["indel_het_homvar_ratio_residual", "snp_count_residual"]	A list of each failed test. These will correspond to all fail_* fields with a value of “true.”
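The relationship between the fail_* fields and qc_metrics_filters can be sketched as follows. This is an illustration only: it assumes that booleans are written as the strings "true" and "false" in the tsv, which should be confirmed against the delivered file.

```python
def failed_tests(row):
    """Derive the names of failed residual tests from the fail_* fields of one row."""
    return sorted(
        name[len("fail_"):]
        for name, value in row.items()
        if name.startswith("fail_") and str(value).lower() == "true"
    )


# A partial row using example values from Table 12.
row = {
    "s": "1000000",
    "fail_snp_count_residual": "true",
    "fail_ins_del_ratio_residual": "false",
    "fail_del_count_residual": "true",
    "fail_ti_tv_ratio_residual": "true",
}
failed = failed_tests(row)
```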

srWGS genomic QC values

We provide the QC testing values for all srWGS samples, which is a duplicate of the srWGS flagged sample schema, but for all srWGS samples. The table is released as a tsv and corresponds to the schema in Table 12.

srWGS genomic metrics

We provide a table with supplemental genomic QC metrics for each srWGS sample. The schema is described in the table below. The table will be released as a tsv.

Genomic metrics tsv schema

No fields can have a null value.
No samples will be in the table if they do not pass the QC thresholds.

Table 13. Supplemental genomic metrics for each srWGS sample TSV file description

Field Name	Type	Key?	Example Value	Notes
research_id	int	yes	1000000	Unique identifier for each participant
sample_source	string	no	Whole Blood	Sample source (blood or saliva)
site_id	string	no	bi	The genome center (GC) where the sample was sequenced. This will be one of three values (bi = "Broad Institute", uw = "University of Washington", or bcm = "Baylor College of Medicine")
sex_at_birth	string	no	Female	Participant provided information for sex at birth
dragen_sex_ploidy	string	no	XX	Ploidy output from DRAGEN
mean_coverage	float	no	107.69	Mean number of overlapping reads at every targeted base of the genome (threshold ≥30x)
genome_coverage	float	no	97.61	Percent of bases with at least 20x coverage (threshold ≥90% at 20x)
aou_hdr_coverage	float	no	100	Percent of bases in the All of Us Hereditary Disease Risk genes (AoUHDR) with at least 20x coverage (threshold ≥95% at 20x)
dragen_contamination	float	no	0.003	Cross-individual contamination rate from DRAGEN
aligned_q30_bases	float	no	174329894399	Aligned Q30 bases from DRAGEN (threshold ≥8e10)
verify_bam_id2_contamination	float	no	0.0000104116	Cross-individual contamination rate from VerifyBamID2
biosample_collection_date	string	no	2/11/2020	Date when biosamples were collected. This reflects the date the collection site finalized the order, which is generally close to, but may not exactly match, the actual time of collection.

Column Explanations:

Field name -- The name of the field. In tsv files, this will appear on the first row of the file.
Type -- Data type.
Key? -- Whether this field makes up a unique key for the row. Note that all key fields together make a unique key for the row.
Notes -- Any other relevant information.
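The thresholds listed in Table 13 can be collected into a small lookup, as in the sketch below. Since samples that do not pass the QC thresholds are not in the table, every delivered row is expected to return an empty list; the function is only meant to make the thresholds explicit.

```python
# Minimum values taken from the Notes column of Table 13.
THRESHOLDS = {
    "mean_coverage": 30.0,        # mean coverage of at least 30x
    "genome_coverage": 90.0,      # at least 90% of bases at 20x
    "aou_hdr_coverage": 95.0,     # at least 95% of AoUHDR bases at 20x
    "aligned_q30_bases": 8e10,    # at least 8e10 aligned Q30 bases
}


def failing_metrics(row):
    """Return the names of metrics that fall below their threshold."""
    return [
        name for name, minimum in THRESHOLDS.items()
        if float(row[name]) < minimum
    ]


# Example values from Table 13.
row = {
    "mean_coverage": "107.69",
    "genome_coverage": "97.61",
    "aou_hdr_coverage": "100",
    "aligned_q30_bases": "174329894399",
}
problems = failing_metrics(row)
```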

srWGS control samples

We provide GVCF files for the eight public control samples that were used for the srWGS SNP and Indel sensitivity and precision evaluation (see Table F.1 in the All of Us Genomic Data Quality Report). The samples are Genome in a Bottle (GIAB) samples that originate from the International HapMap Project and the Personal Genome Project. They were sequenced with the same protocol as the srWGS All of Us samples and are provided for researchers to use in their own QC processes and analyses. The data is provided in GVCF format along with index files.

Structural variants (SVs) for srWGS data

We provide structural variant (SV) calls for 97,061 participants with srWGS data. The SV dataset includes a standard VCF with genotypes, a sites-only VCF, a list of the maximal set of unrelated samples, a sites-only VCF containing annotations from the maximal set of unrelated samples, and lists of the samples with probable aneuploidies. Please read more information about the SV calls and pipeline in the All of Us Genomic Data Quality Report.

srWGS SV VCF

The SVs are joint-called and delivered as a joint VCF for all samples, a sites-only VCF, and a sites-only VCF with annotations for the maximal set of unrelated samples. The VCFs are sorted and block compressed (.vcf.gz) with a local tabix index (.vcf.gz.tbi).

The full VCF has genotypes for all 97,061 participants and is sharded by chromosome.

The GATK-SV team has documented the SV VCF format in an article on the GATK site: How to interpret SV VCFs. The format has many similarities to a short variant VCF but you will see some differences that are necessary to specify SV variant details. The header describes the data fields in the VCF.

The SV VCF is annotated with the GATK tool SVAnnotate. It adds the gene overlap and the predicted functional consequence. These annotations are added in the INFO field. The annotations produced by SVAnnotate are described in detail in the tool documentation. The GTF used for gene annotations was GENCODE v39.

Some of the most important fields in the VCF are described below:

CHROM: The chromosome location of the start position of the SV
POS: The start position of the SV
ID: Unique identifier for the SV
REF: Not commonly used in structural variant VCFs; it usually contains an N
ALT: Information about the SV type, descriptions of the SV types can be found in the header
FILTER: Filtering information for the SV
- HIGH_NCR: Unacceptably high rate of no-call GTs.
- MULTIALLELIC: Multiallelic CNV site. This FILTER status does not mean that the site is not real, but it should be treated differently from a biallelic SV site.
- UNRESOLVED: Variant is unresolved. There was some evidence for an SV at this site but it was not able to be resolved completely from the available evidence.
- VARIABLE_ACROSS_BATCHES: Site appears at variable frequencies across batches. Likely reflects technical batch effects.
- PASS: None of the above site-level filters were applied.
INFO: Important annotations describing the variant at the site level. The annotations are described in depth in the SV VCF header. Some of these annotations include:
- END: End position of the structural variant
- CHR2: Second chromosome for interchromosomal events
- END2: Position of breakpoint on CHR2
- ALGORITHMS: The original algorithm that called the SV (GATK-SV is an ensemble method)
- SVLEN: SV length in base pairs
- SVTYPE: SV type
- CPX_TYPE: Subtype of complex rearrangement
- CPX_INTERVALS: Details of complex rearrangement
FORMAT: Annotations describing the variant at the genotype level (site and sample specific annotations). Depends on the SV type and the evidence categories that support the SV. All FORMAT annotations are described in the VCF header.
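As a minimal sketch of how the fixed columns and INFO annotations listed above fit together, the example below splits one VCF data line using only the standard library. The record is invented for illustration and is not taken from the callset; in practice a VCF library would normally be used instead of manual parsing.

```python
def parse_sv_record(line):
    """Split one VCF data line into its fixed columns and an INFO dictionary."""
    fields = line.rstrip("\n").split("\t")
    chrom, pos, sv_id, ref, alt, _qual, filters, info = fields[:8]
    info_dict = {}
    for item in info.split(";"):
        key, sep, value = item.partition("=")
        # Flag-type INFO keys have no "=value" part.
        info_dict[key] = value if sep else True
    return {
        "chrom": chrom,
        "pos": int(pos),
        "id": sv_id,
        "ref": ref,
        "alt": alt,
        "filters": filters.split(";"),
        "info": info_dict,
    }


# Invented deletion record, for illustration only.
line = "chr1\t10000\texample_DEL_1\tN\t<DEL>\t.\tPASS\tEND=15000;SVTYPE=DEL;SVLEN=5000"
record = parse_sv_record(line)
span = int(record["info"]["END"]) - record["pos"]
```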

srWGS SV sites-only VCF

The sites-only VCF contains all of the sites and site-level annotations in the full VCF but no genotype information. It is useful as a smaller file when genotype information is not required. See the above information for SV VCF details.

srWGS SV maximal set of unrelated samples

We provide a list of samples to prune in order to remove related samples from the srWGS SV cohort. Relatedness is calculated as described in the kinship score description above. This will be the minimal list of related samples to prune in order to produce the maximal independent set of unrelated samples.

The samples are reported in a txt file as a list of research IDs. One research ID is listed per line and there is no header in the file.

srWGS SV unrelated sites-only VCF

We provide a sites-only VCF, containing no genotype information, with annotations for the maximal set of 93,360 unrelated samples. We removed from the complete VCF the 3,701 samples from the above list of samples to prune to obtain the maximal set of unrelated samples. Sites that were unique to the removed samples were removed from the VCF. We re-annotated allele frequencies in the VCF based on the remaining samples. This VCF is provided in order to save researchers computational time for analyses requiring unrelated samples.

srWGS SV samples with probable aneuploidies

We provide lists of samples with probable aneuploidies identified during srWGS SV ploidy estimation as tsv files. Ploidy estimation was performed across the srWGS SV samples using coverage estimations over binned regions of the genome as part of the GATK-SV pipeline. Details of this ploidy estimation process are described in the All of Us QC report.

There are three separate files for samples with probable aneuploidies: mosaic autosomal aneuploidy, mosaic allosomal aneuploidy, and germline allosomal aneuploidy.

srWGS SV samples with probable mosaic aneuploidies

We provide two files describing samples with probable mosaic aneuploidies. One is samples with mosaic autosomal aneuploidy and the second is samples with mosaic allosomal aneuploidy. Both files have the same format, described in Table 14. Note that fewer than 20 samples had more than one probable mosaic autosomal aneuploidy, so these samples appear once per affected chromosome.

Table 14. srWGS SV samples with probable mosaic aneuploidies TSV file description

Field name	Type	Key?	Notes
research_id	string	no	Research ID of the sample
chromosome	string	no	Chromosome for which the sample is predicted to have a mosaic aneuploidy, e.g. chr8 or chrX
estimated_copy_ratio	float	no	Estimated copy ratio (see QC report Ploidy Estimation) for the chromosome with the probable mosaic aneuploidy
aneuploidy_type	string	no	Type of aneuploidy predicted. For the probable mosaic aneuploidies, the possible values are MOSAIC_GAIN or MOSAIC_LOSS

Column Explanations:

Field name -- The name of the field. In tsv files, this will appear on the first row of the file.
Type -- Data type.
Key? -- Whether this field makes up a unique key for the row. Note that all key fields together make a unique key for the row.
Notes -- Any other relevant information.

srWGS SV samples with probable germline allosomal aneuploidy

We provide one file describing samples with probable germline allosomal aneuploidies, described in Table 15.

Table 15. srWGS SV samples with probable germline allosomal aneuploidy TSV file description

Field name	Type	Key?	Notes
research_id	string	yes	Research ID of the sample
copy_number_chrX	integer	no	Estimated copy number for chrX, rounded to the nearest integer
copy_number_chrY	integer	no	Estimated copy number for chrY, rounded to the nearest integer
aneuploidy_type	string	no	Type of aneuploidy predicted. For the probable germline allosomal aneuploidies, the possible values are JACOBS, KLINEFELTER, and TRIPLE X (contains a space)

Column Explanations:

Field name -- The name of the field. In tsv files, this will appear on the first row of the file.
Type -- Data type.
Key? -- Whether this field makes up a unique key for the row. Note that all key fields together make a unique key for the row.
Notes -- Any other relevant information.

srWGS SV sample list

We provide a list file of all research_ids that have srWGS SV data. The file is a text file containing one research_id per line.

Genotyping Array ("Array") Data

The array data represents 447,278 participants and includes single sample VCFs, joint Hail MT files, joint PLINK files, and raw genotyping data in IDAT format.

Array IDAT files

We provide IDAT files for all array samples. The IDAT file is a binary file containing raw BeadArray data directly from the scanner. There are two files for each sample, corresponding to the red and green intensity values. These values give information about specific nucleotides on the genome. You can read more about the steps to call variants from these IDAT files in the Genomic Quality Report.

For an in-depth description of these files and how to process them, read more about the illuminaio tool.

Array variant data

The variant data for array samples is delivered in VCF, Hail MT, and PLINK format.

Array VCFs

We provide single-sample VCFs for all 447,278 participants with array data. The array VCFs are sorted and block compressed (.vcf.gz) with local tabix index files (.vcf.gz.tbi).

Array VCFs in the All of Us genomic dataset will contain the following:

Header

The header field of the VCF contains many attributes which generally describe the processing of the sample in the array. Many of these are specific to a single sample.

arrayType - This contains the name of the genotyping array that was processed.
autocallDate - The date that the genotyping array was processed by ‘autocall’ (aka gencall), the Illumina genotype calling software.
autocallGender - The gender (sex) that autocall determined for the sample processed.
autocallVersion - The version of the autocall/gencall software used.
chipWellBarcode - The chip well barcode (a unique identifier for sample as processed on a specific location on the Illumina genotyping array).
clusterFile - The cluster file used.
extendedIlluminaManifestVersion - The version of the ‘extended Illumina manifest’ used by the VCF generation software.
extendedManifestFile - The filename of the ‘extended Illumina manifest’ used by the VCF generation software.
fingerprintGender - The gender (sex) determined using an orthogonal fingerprinting technology. This is populated by an optional parameter used by the VCF generation software.
gtcCallRate - The gtc call rate of the sample processed. This value is generated by the autocall/gencall software and represents the fraction of callable loci that had valid calls.
imagingDate - The date that the IDAT files (raw image scans) for the chip well barcode were created.
manifestFile - The name of the Illumina manifest (.bpm) file used by the VCF generation software.
sampleAlias - The name of the sample.

Note that there are many other attributes in the header (Biotin*, DNP*, Extension*, Hyb*, NP*, NSB*, Restore, String*, TargetRemoval) that are populated with Illumina control values. They are not described here.

Filtered Sites (FILTER)

There are several filters specific to genotyping array content. These are:

DUPE - This filter is applied if there are multiple rows in the VCF for the same loci and alleles. That is, if there are two or more rows that share the same chromosome, position, ref allele and alternate alleles, all but one of them will have the ‘DUPE’ filter set.
TRIALLELIC - This filter is applied if there is a site at which there are two alternate alleles and neither of them is the same as the reference allele.
ZEROED_OUT_ASSAY - This filter is applied if the variant at the site was ‘zeroed out’ in the Illumina cluster file. This is typically done when the calls at the site are intentionally marked as unusual. Genotypes called at sites that are ‘zeroed out’ will always be no-calls.
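The DUPE rule can be sketched as a check on the chromosome, position, ref allele, and alternate alleles of each row. The example below assumes that the first row of each duplicate group is the one left unflagged; the text above only states that all but one row is flagged, so the choice of which row is kept is an assumption.

```python
def flag_duplicates(records):
    """Return one boolean per record: True where the DUPE filter would apply."""
    seen = set()
    flags = []
    for chrom, pos, ref, alts in records:
        key = (chrom, pos, ref, tuple(alts))
        # Assumption: the first occurrence is kept unflagged.
        flags.append(key in seen)
        seen.add(key)
    return flags


# Invented records: the second row repeats the first, the third differs in ALT.
records = [
    ("chr1", 100, "A", ["G"]),
    ("chr1", 100, "A", ["G"]),
    ("chr1", 100, "A", ["T"]),
]
dupe_flags = flag_duplicates(records)
```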

Genotype (sample level fields)

These fields describe attributes specific to the sample genotyped on the array. The FORMAT specifier in the VCF header describes these fields. They are:

GT - GENOTYPE. This field describes the genotype. It is a standard field, described in the VCF specification.
IGC - Illumina GenCall Confidence Score. A measure of the call confidence.
X - Raw X intensity as scanned from the original genotyping array
Y - Raw Y intensity as scanned from the original genotyping array
NORMX - Normalized X intensity
NORMY - Normalized Y intensity
R - Normalized R Value (one of the polar coordinates after the transformation of NORMX and NORMY)
THETA - Normalized Theta value (one of the polar coordinates after the transformation of NORMX and NORMY)
LRR - Log R Ratio
BAF - B Allele Frequency
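The polar transformation behind R and THETA can be sketched as follows. This assumes the convention commonly described for Illumina arrays, in which R is the sum of the normalized intensities and THETA is the angle between them scaled to the range 0 to 1. The VCF header remains the authoritative description of how these fields were computed.

```python
import math


def polar_from_normalized(normx, normy):
    """Convert normalized X/Y intensities to (R, THETA) polar coordinates."""
    r = normx + normy
    # atan2 handles normx == 0; the 2/pi factor scales the angle to [0, 1]
    # for non-negative intensities.
    theta = (2.0 / math.pi) * math.atan2(normy, normx)
    return r, theta


r, theta = polar_from_normalized(1.0, 1.0)
```

Under this convention, a sample with signal only from the X channel has a THETA near 0 and a sample with signal only from the Y channel has a THETA near 1.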

INFO (site level fields)

These fields describe attributes specific to the probe on an array. The INFO specifier in the VCF header describes these fields. They are:

AC - Allele Count in genotypes, for each ALT allele. A standard field, described in the VCF specification
AF - Allele Frequency. A standard field, described in the VCF specification
AN - Allele Number. A standard field, described in the VCF specification
ALLELE_A - The A Allele, as annotated in the Illumina manifest (a * suffix indicates this is the reference allele)
ALLELE_B - The B Allele, as annotated in the Illumina manifest (a * suffix indicates this is the reference allele)
BEADSET_ID - The BeadSet ID. An Illumina identifier. Used for normalization.
GC_SCORE - The Illumina GenTrain Score. A quality score describing the probe design
ILLUMINA_BUILD - The Genome Build for the design probe sequence, as annotated in the Illumina manifest
ILLUMINA_CHR - The chromosome of the design probe sequence, as annotated in the Illumina manifest.
ILLUMINA_POS - The position of the design probe sequence (on ILLUMINA_CHR), as annotated in the Illumina manifest.
ILLUMINA_STRAND - The strand for the design probe sequence, as annotated in the Illumina manifest.
PROBE_A - The allele A probe sequence as annotated in the Illumina manifest.
PROBE_B - The allele B probe sequence as annotated in the Illumina manifest. Note that this is only present on strand ambiguous SNPs.
SOURCE - The probe source as annotated in the Illumina manifest.
refSNP - The dbSNP rsId for this probe

Array Hail MT

We have merged the array VCFs into a Hail MT with no additional processing across samples. Each column corresponds to the research ID of the sample and each row corresponds to the variant. Since the single sample array VCFs have identical sites and FILTER values, the FILTER field is populated with the value from a single sample VCF.

In conversion, we have dropped all of the 505 variants from alternate, unlocalized, and unplaced contigs: 436 variants from ALT contigs (e.g. chr19_KI270866v1_alt), 72 from random contigs (e.g. chr1_KI270706v1_random), and 13 from chrUn (e.g. chrUn_KI270742v1). These variants are still in the compressed array VCFs. Please refer to the published Featured Workspaces on how we generated the Hail MT from the VCFs.
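Based on the contig naming shown in the examples above, a check for the dropped contig classes might look like the sketch below. The suffix and prefix rules are inferred from those example names and are an assumption; the Featured Workspaces describe the actual conversion.

```python
def is_dropped_contig(contig):
    """Return True for alternate, unlocalized (random), or unplaced (chrUn) contigs."""
    return (
        contig.endswith("_alt")
        or contig.endswith("_random")
        or contig.startswith("chrUn_")
    )


contigs = ["chr1", "chr19_KI270866v1_alt", "chr1_KI270706v1_random", "chrUn_KI270742v1", "chrX"]
kept = [c for c in contigs if not is_dropped_contig(c)]
```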

Array PLINK 1 binary biallelic genotype table (PLINK bed)

We provide PLINK 1.9 data (.bed / .bim / .fam) for array data. The files were converted from the Hail MT using the export_plink command in Hail and contain all information in the Hail MT. PLINK file type information can be found within the PLINK documentation. The .bed file is the PLINK binary biallelic genotype table and contains genotype calls. The .bim file is the PLINK extended .map file, and is a text file containing variant information. The .fam file is a text file with sample information for each participant. Please refer to the published Featured Workspaces on how to use the PLINK files.
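The two text files can be read without PLINK itself. The sketch below follows the six-column layouts given in the PLINK 1.9 documentation for .bim (chromosome, variant ID, centimorgan position, base-pair position, allele 1, allele 2) and .fam (family ID, within-family ID, father ID, mother ID, sex code, phenotype). The example lines are invented for illustration.

```python
from typing import NamedTuple


class BimRecord(NamedTuple):
    chrom: str
    variant_id: str
    cm: float
    pos: int
    allele1: str
    allele2: str


class FamRecord(NamedTuple):
    family_id: str
    sample_id: str
    father_id: str
    mother_id: str
    sex: int          # 1 = male, 2 = female, 0 = unknown
    phenotype: str


def parse_bim_line(line):
    """Parse one whitespace-delimited line of a .bim file."""
    chrom, variant_id, cm, pos, allele1, allele2 = line.split()
    return BimRecord(chrom, variant_id, float(cm), int(pos), allele1, allele2)


def parse_fam_line(line):
    """Parse one whitespace-delimited line of a .fam file."""
    family_id, sample_id, father_id, mother_id, sex, phenotype = line.split()
    return FamRecord(family_id, sample_id, father_id, mother_id, int(sex), phenotype)


# Invented example lines, for illustration only.
variant = parse_bim_line("1\texample_variant\t0\t12345\tA\tG")
sample = parse_fam_line("0\t1000000\t0\t0\t0\t-9")
```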

Long-Read Whole Genome Sequencing (lrWGS)

We provide lrWGS data representing 2,800 participants in the CDRv7 and CDRv8 callsets. These data are particularly useful for resolving complex genomic regions, structural variants, and phasing of alleles, to provide a more comprehensive view of the genome. The CDRv8 callsets represent 1,773 participants and the CDRv7 callset represents 1,027 participants (Table 16).

These 2,800 participants are represented by a total of 2,842 samples, because 41 participants were sequenced on both PacBio and ONT. In addition, one participant was sequenced at both BI and UW, though to different coverage.

Table 16. Sample cohorts for all 2,800 participants with lrWGS data

Cohort name	Sequencing facility	Sequencing platform	Number of samples	Minimum coverage	Notes
HA_Rev_mid	HA	PacBio Revio	65	Mid-pass (12x)
HA_Seq_CDRv7	HA	PacBio Sequel IIe and Sequel II	1027	Mid-pass (12x)	The CDRv7 data
BI_Seq_high	BI	PacBio Sequel IIe	84	High-pass (25x)
BI_Seq_mid	BI	PacBio Sequel IIe	198	Mid-pass (12x)
BI_Rev_mid	BI	PacBio Revio	803	Mid-pass (12x)
BCM_Seq_high	BCM	PacBio Sequel IIe	77	High-pass (25x)
BCM_Rev_high	BCM	PacBio Revio	111	High-pass (25x)
BCM_ONT_high	BCM	ONT R10.4 on PromethION	196	High-pass (25x)
JHU_ONT_high	JHU	ONT R10.4 on PromethION	128	High-pass (25x)
UW_Seq_high	UW	PacBio Sequel IIe	100	High-pass (25x)
UW_Rev_high	UW	PacBio Revio	53	High-pass (25x)
Total samples			2842		42 CDRv8 participants were sequenced in two different samples

The file types available depend on the cohort of each sample; please see Table 17 for more information. Joint callsets are generated per-cohort and single sample files are available on a per-sample level. One file with auxiliary metrics is generated for each sequencing location. All samples sequenced at that sequencing location are represented on a per-sample level in the auxiliary metrics. The main difference between the data available for each cohort is that PacBio cohorts have de novo assembly data while ONT cohorts do not.

For locations of the lrWGS files available, please see the lrWGS manifest and the CDR Directory Document.

Table 17. Data available for each lrWGS cohort

Cohort name	Sequencing reads	Variant data	Auxiliary data
PacBio cohorts	grch38_noalt BAM with methylation signals; T2Tv2.0 BAM with methylation signals; GFA files: primary de novo assembly, alternative de novo assembly, one de novo assembly for each chromosome copy; FASTA: one for each GFA file	One for each of grch38_noalt & T2Tv2.0: Joint-called SNP & Indel variants (GVCF & Hail MT); Single sample SNP & Indel variants (GVCF); Single sample PBSV SVs (VCF); Single sample Sniffles2 SVs (VCF); Single sample Sniffles2 SNF; Single sample PAV variants (VCF)	Single sample data available in the file per sequencing facility: Auxiliary metrics grch38_noalt; Auxiliary metrics T2Tv2.0
ONT cohorts	grch38_noalt BAM with methylation signals; T2Tv2.0 BAM with methylation signals	One for each of grch38_noalt & T2Tv2.0: Joint-called SNP & Indel variants (GVCF & Hail MT); Single sample SNP & Indel variants (GVCF); Single sample PBSV SVs (VCF); Single sample Sniffles2 SVs (VCF); Single sample Sniffles2 SNF	Single sample data available in the file per sequencing facility: Auxiliary metrics grch38_noalt; Auxiliary metrics T2Tv2.0
HA_Seq_CDRv7	grch38_noalt BAM: standard & haplotagged; T2Tv2.0 BAM: standard & haplotagged; GFA files: primary de novo assembly, alternative de novo assembly, one de novo assembly for each chromosome copy; FASTA: one for each GFA file	One for each of grch38_noalt & T2Tv2.0: Joint-called SNP & Indel variants (VCF & Hail MT); Joint-called SVs (VCF): strict & lenient; Single sample SNP & Indel variants (VCF); Single sample SNP & Indel phased variants (VCF); Single sample PBSV SVs (VCF); Single sample Sniffles2 SVs (VCF); Single sample Sniffles2 SNF; Single sample PAV variants (VCF)	Auxiliary metrics grch38_noalt; Auxiliary metrics T2Tv2.0

lrWGS sequencing reads

Each sample in the lrWGS data is aligned to two references, grch38_noalt and T2Tv2.0, and the alignments are provided in BAM format. Each BAM file is accompanied by an index BAI file.

grch38_noalt corresponds to the GRCh38 reference with no alternate sequences. T2Tv2.0 in the CDRv8 release corresponds to the T2T-CHM13v2.0 reference with these modifications: the EBV contig is added from the grch38_noalt reference, Chromosome Y is hardmasked with N bases in the human pseudoautosomal region (PAR), and the mitochondrial genome is updated to the revised Cambridge Reference Sequence (rCRS). We updated the T2Tv2.0 reference for the CDRv8 release, so it differs from the T2Tv2.0 version used in CDRv7. Please see Known Issue #7 in the All of Us Genomic Data Quality Report for details on how the two versions differ.

In the CDRv7 cohort, we additionally provide haplotagged files. A haplotagged BAM file contains additional information for each read to distinguish reads that come from different haplotypes.

Methylation signals

Methylation data are available in the long-read BAM files for both reference versions across all PacBio and ONT cohorts (Table 17). DNA methylation occurs in various forms, with 5-methylcytosine (5mC) being the predominant type in adult human genomes. This process involves enzymes adding a methyl group to a cytosine (C) at CpG sites—regions where a cytosine is immediately followed by a guanine (G). Methylation at these sites can influence gene transcription.

In the BAM files, methylation calls are stored in the MM and ML tags. The MM tag lists the positions of the bases in each read that have a modification call, and the ML tag provides the probability of each call as assigned by the caller. Note that while most reads include these methylation calls, some do not. To interpret the methylation data, see the PacBio BAM format specification and the SAM optional tags specification. The documentation applies to both PacBio and ONT data.
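As an illustration of the tag layout in the SAM optional tags specification, the sketch below decodes an MM string and its ML values into read positions and probabilities. In that specification, each MM entry names a base and modification code followed by counts of how many occurrences of that base to skip before the next call, and each ML value N covers the probability range N/256 to (N+1)/256. The example is simplified: it assumes a forward-strand read and one modification code per MM entry, and the read and tag values are invented.

```python
def decode_base_mods(seq, mm, ml):
    """Decode MM/ML values into (read_position, modification_code, probability) tuples.

    seq: read bases (forward-strand read assumed).
    mm:  MM tag string, e.g. "C+m,0,1;".
    ml:  ML tag values, one integer from 0 to 255 per modification call.
    """
    calls = []
    ml_index = 0
    for entry in mm.split(";"):
        if not entry:
            continue
        header, *skips = entry.split(",")
        base = header[0]                      # unmodified base, e.g. "C"
        mod_code = header[2:].rstrip("?.")    # e.g. "m" for 5mC
        # Indices in the read where the unmodified base occurs.
        base_positions = [i for i, b in enumerate(seq) if b == base]
        cursor = 0
        for skip in skips:
            cursor += int(skip)               # skip this many occurrences of the base
            position = base_positions[cursor]
            # Report the lower bound of the probability range for this ML value.
            calls.append((position, mod_code, ml[ml_index] / 256))
            ml_index += 1
            cursor += 1
    return calls


# Invented read and tags: the C bases are at read positions 1, 4, and 5.
calls = decode_base_mods("ACGTCCGA", "C+m,0,1;", [255, 26])
```

For real data, a BAM library that exposes modified-base calls directly is usually a better choice, since it also handles reverse-strand reads and combined modification codes.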

lrWGS de novo assembly

Haplotype-resolved de novo assembly is available for all PacBio HiFi samples (Table 16) in Graphical Fragment Assembly (GFA) and FASTA format. Each de novo assembly includes a primary de novo assembly, an alternative de novo assembly, and two chromosome copies. The tool PAV is used to call variants from the PacBio GFA files.

GFA files

We release four Graphical Fragment Assembly (GFA) files for each PacBio sample, which are de novo graph-based assemblies. One GFA file is the primary assembly for the sample, another is the alternative assembly, and the other two are the chromosome copy assemblies. The GFA files describe the graph layouts of the contigs.

We use the tool hifiasm, which is a tool for generating haplotype-resolved de novo assemblies. Please check out the GFA specifications for more details about GFA format.

FASTA files

We provide four de novo assemblies as FASTA files for each PacBio HiFi lrWGS sample, matching the sequences from the GFA files. A FASTA file is a text file representation of genomic data. Each genomic sequence is described in two lines: the first is a description line that starts with a greater-than (">") symbol, and the second contains the nucleotide sequence as a string. This pair of lines is repeated for each sequence in the file.
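A minimal FASTA reader is sketched below to illustrate the layout. It uses the first word of each description line as the sequence name and also tolerates sequences wrapped over several lines, which some FASTA files use. The contig names and sequences are invented for illustration.

```python
def parse_fasta(text):
    """Parse FASTA text into a dictionary of sequence name -> sequence string."""
    sequences = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            description = line[1:].strip()
            name = description.split()[0] if description else ""
            sequences[name] = []
        elif name is None:
            raise ValueError("sequence data found before any description line")
        else:
            sequences[name].append(line)
    return {key: "".join(parts) for key, parts in sequences.items()}


# Invented contigs, for illustration only.
example = ">contig_1 example description\nACGT\nACGT\n>contig_2\nGGCC\n"
contigs = parse_fasta(example)
```

Compressed files would need to be opened with a gzip-aware reader before the text is passed to a function like this.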

The CDRv8 files are block-gzipped and the CDRv7 files are gzipped. Each FASTA is accompanied by an index file.
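The sketch below shows one way to read these FASTA files with only the Python standard library. Block-gzipped files are valid gzip files, so gzip.open can read both the CDRv8 and CDRv7 files, although it does not use the index for random access. The function names are hypothetical, and the parser also accepts sequences wrapped over several lines in case a file is formatted that way.

```python
import gzip
import io


def read_fasta(handle):
    """Yield (name, sequence) pairs from an open FASTA text handle."""
    name, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            # Keep the first word of the description line as the sequence name.
            name, chunks = (line[1:].split() or [""])[0], []
        else:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def contig_lengths(path):
    """Return {contig name: length} for a gzipped or block-gzipped FASTA file."""
    with gzip.open(path, "rt") as handle:
        return {name: len(seq) for name, seq in read_fasta(handle)}


# Toy FASTA text (hypothetical) to show the record layout.
records = list(read_fasta(io.StringIO(">ctg1 primary\nACGTAC\n>ctg2\nGG\n")))
print(records)
```

Calling contig_lengths on a path from the lrWGS manifest reads the whole assembly sequentially, which is usually acceptable for a one-off summary.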

Long-read variant data

For a detailed description of the CDRv7 lrWGS variant data, please refer to the CDRv7 How the All of Us Genomic Data are Organized.

lrWGS SNP & Indel GVCF

We perform SNP and Indel variant calling per-sample with DeepVariant for each reference version, grch38_noalt and T2Tv2.0. The single-sample variant data is released in GVCF format with accompanying GVCF TBI index files.

lrWGS joint callset Hail MT

We generate an lrWGS joint SNP & Indel callset by joining the single-sample GVCFs with GLNexus. The joint callsets are generated per-cohort, not across the entire lrWGS sample set (see Table 16 for cohorts). The joint callset is available in Hail MT and GVCF format. The variants are hard-filtered with a QUAL cutoff of 40 for PacBio samples and 34 for ONT samples (see the All of Us Genomic Data Quality Report for more information).

lrWGS structural variant VCF

Structural variants are called with both PBSV and Sniffles2 for all lrWGS samples. Each lrWGS sample has a single VCF from each of the two variant callers, accompanied by TBI index files. Please see the headers of these VCF files for descriptions of the VCF fields. In addition, we output a Sniffles2 binary SNF file for use with Sniffles2’s multi-sample SV calling mode.

lrWGS PAV phased variants

Variants from the tool PAV are provided in VCF format for each PacBio HiFi sample. The VCF files are accompanied by a TBI index. PAV variants are derived from the haplotype-resolved assembly (GFA files) generated by hifiasm. PAV-generated VCFs are phased. Please see the header of the PAV VCFs for a description of the VCF fields.

lrWGS CDRv7 cohort

Please see the CDRv7 version of the article How the All of Us Genomic Data are Organized for a thorough description of each lrWGS sample in the CDRv7 cohort. The files available for the CDRv7 callset are featured in Table 17. The major differences are as follows:

CDRv7 has haplotagged BAM files available.
The CDRv8 joint callsets are broken up into smaller cohorts.
In CDRv7, the single sample SNP & Indel variants were called with the PEPPER-MARGIN-DeepVariant pipeline.

lrWGS sample metrics

We provide two lrWGS variant metrics files, one for each lrWGS reference, described in Table 18.

Table 18 -- lrWGS variant metrics file description

Field name	Type	Key?	Notes
research_id	string	yes	Research ID of the sample
mosdepth_cov	float	no	Coverage from the mosdepth tool (See the QC report for a description)
aligned_frac_bases	float	no	Fraction of bases aligned to the reference
aligned_num_bases	float	no	Number of bases aligned to the reference
aligned_num_reads	float	no	Number of reads aligned to the reference
aligned_read_length_N50	float	no	N50 of the aligned reads
aligned_read_length_median	float	no	Median length of the aligned reads
aligned_read_length_mean	float	no	Mean length of the aligned reads
aligned_read_length_stdev	float	no	Standard deviation of the aligned read length
average_identity	float	no	Mean percentage of matches to the reference per aligned read
median_identity	float	no	Median percentage of matches to the reference per aligned read
dvp_ft_pass_snp_cnt	float	no	Number of PASS SNPs after filtering
pbsv_nonBND_50bpSV_cnt	float	no	Number of SVs >= 50 bp called by PBSV (excluding break-end calls)
snf2_nonBND_50bpSV_cnt	float	no	Number of SVs >= 50 bp called by Sniffles2 (excluding break-end calls)

Column Explanations:

Field name -- The name of the field. In tsv files, this will appear on the first row of the file.
Type -- Data type.
Key? -- Whether this field makes up a unique key for the row. Note that all key fields together make a unique key for the row.
Notes -- Any other relevant information.

lrWGS flagged samples

As described in the QC doc, several lrWGS samples were flagged during the QC process but not filtered. We release a separate 4-column TSV file listing the samples that have been flagged. To uniquely identify a sample, you need the combination of sample_id, sequencing_facility, and platform.

Field name	Notes
sample_id	Research ID of the sample
sequencing_facility	The sequencing facility of the sample. Possible values are: BCM, BI, HA, JHU, UW.
platform	Sequencing technology of the sample. Possible values are: revio, sequel, ont.
reasons_for_flagging	The reasons the sample was flagged. A sample can have more than one reason; multiple reasons are separated by commas with no whitespace. Possible values: contamination_between_1_and_3_pct, coverage_slightly_below_target, diploid_assembly_length_anomaly, female_with_low_chrX_coverage, male_with_low_chrY_coverage, read_len_median_below_10kbp
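
As a minimal sketch of how this file might be read, the example below keys each row on the three identifying fields and splits the comma-separated reasons into a list. It assumes the TSV has a header row with the field names listed above. The function name and the toy rows are hypothetical and do not correspond to real participants.

```python
import csv
import io


def load_flagged_samples(handle):
    """Return {(sample_id, sequencing_facility, platform): [reasons]} from the TSV."""
    flagged = {}
    for row in csv.DictReader(handle, delimiter="\t"):
        key = (row["sample_id"], row["sequencing_facility"], row["platform"])
        # Reasons are comma-separated with no whitespace.
        flagged[key] = row["reasons_for_flagging"].split(",")
    return flagged


# Toy rows (hypothetical) in the 4-column layout described above.
toy_tsv = (
    "sample_id\tsequencing_facility\tplatform\treasons_for_flagging\n"
    "1000001\tBI\trevio\tcoverage_slightly_below_target\n"
    "1000002\tUW\tont\tcontamination_between_1_and_3_pct,read_len_median_below_10kbp\n"
)
flagged = load_flagged_samples(io.StringIO(toy_tsv))
print(flagged[("1000002", "UW", "ont")])
```

Keying on all three fields keeps a sample that was sequenced at more than one facility or on more than one platform from overwriting its own entries.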

lrWGS manifest

The location of each single sample file is listed in the lrWGS manifest file. This resource goes hand in hand with the Controlled CDR Directory Document, which lists the location of the manifest file and the paths for all joint callsets. Some samples will have two rows in the lrWGS manifest because they were sequenced at multiple sequencing facilities or on multiple platforms. To uniquely identify a sample, you need the combination of the sample_id, center, and platform.

Not all columns will be filled, depending on what data is available for the sample. See Table 17 for details.

Table 20 -- lrWGS manifest

Field name	Notes
research_id	Research ID of the sample
center	Sequencing facility of the sample. Possible values are: BCM, BI, HA, JHU, UW.
Platform	Sequencing technology of the sample. Possible values are revio, sequel, ont
assembly_alternate_fa	De novo alternate assembly, in FASTA format. Only available for PacBio samples. CDRv8 files are block-gzipped and CDRv7 files are gzipped.
assembly_alternate_fa_gzi	De novo alternate assembly, FASTA index file. Only available for PacBio samples in CDRv8.
assembly_alternate_gfa	De novo alternate assembly, in GFA format. Only available for PacBio samples.
assembly_hap1_fa	De novo haplotype-resolved assembly for haplotype-1 (in no particular order), in FASTA format. Only available for PacBio samples. CDRv8 files are block-gzipped and CDRv7 files are gzipped.
assembly_hap1_fa_gzi	De novo haplotype-resolved assembly for haplotype-1 (in no particular order), FASTA index file. Only available for PacBio samples in CDRv8.
assembly_hap1_gfa	De novo haplotype-resolved assembly for haplotype-1 (in no particular order), in GFA format. Only available for PacBio samples.
assembly_hap2_fa	De novo haplotype-resolved assembly for haplotype-2 (in no particular order), in FASTA format. Only available for PacBio samples. CDRv8 files are block-gzipped and CDRv7 files are gzipped.
assembly_hap2_fa_gzi	De novo haplotype-resolved assembly for haplotype-2 (in no particular order), FASTA index file. Only available for PacBio samples in CDRv8.
assembly_hap2_gfa	De novo haplotype-resolved assembly for haplotype-2 (in no particular order), in GFA format. Only available for PacBio samples.
assembly_primary_fa	De novo primary assembly, in FASTA format. Only available for PacBio samples. CDRv8 files are block-gzipped and CDRv7 files are gzipped.
assembly_primary_fa_gzi	De novo primary assembly, FASTA index file. Only available for PacBio samples in CDRv8.
assembly_primary_gfa	De novo primary assembly, in GFA format. Only available for PacBio samples.
assembly_quast_report_html	HTML report for the primary, haplotype-1 and haplotype-2 assemblies generated by the QUAST program. Only available for CDRv7 PacBio samples.
assembly_quast_report_summary	A summary about the quality of the primary, haplotype-1 and haplotype-2 assemblies, reported by the QUAST program. Only available for PacBio samples.
chm13v2.0_bai	The accompanying index for the T2Tv2.0 BAM.
chm13v2.0_bam	T2Tv2.0 sequencing reads in BAM format
chm13v2.0_bam_pbi	The accompanying PBI index for the T2Tv2.0 BAM. Only available for CDRv7 samples.
chm13v2.0_deepvariant_phased_tbi	TBI index for the T2Tv2.0 PEPPER-Margin-DeepVariant phased VCF. Only available for CDRv7 samples.
chm13v2.0_deepvariant_phased_vcf	T2Tv2.0 PEPPER-Margin-DeepVariant phased single-sample VCF; a filter of QUAL<40 has been applied. Only available for CDRv7 samples.
chm13v2.0_deepvariant_tbi	TBI index for the T2Tv2.0 PEPPER-Margin-DeepVariant VCF. Only available for CDRv7 samples.
chm13v2.0_deepvariant_vcf	T2Tv2.0 PEPPER-Margin-DeepVariant single-sample VCF; a filter of QUAL<40 has been applied. Only available for CDRv7 samples.
chm13v2.0_dv_gtbi	TBI index for the DeepVariant T2Tv2.0 GVCF. Only available for CDRv8 samples.
chm13v2.0_dv_gvcf	T2Tv2.0 DeepVariant single-sample SNP & Indel GVCF. Only available for CDRv8 samples.
chm13v2.0_haplotagged_bai	T2Tv2.0 haplotagged BAM index. Only available for CDRv7 samples.
chm13v2.0_haplotagged_bam	T2Tv2.0 haplotagged BAM. Only available for CDRv7 samples.
chm13v2.0_pav_tbi	TBI index for the T2Tv2.0 PAV VCF. Only available for the PacBio samples.
chm13v2.0_pav_vcf	T2Tv2.0 PAV single-sample VCF. Only available for the PacBio samples.
chm13v2.0_pbsv_tbi	TBI index for the T2Tv2.0 PBSV SV single-sample VCF
chm13v2.0_pbsv_vcf	T2Tv2.0 PBSV SV single-sample VCF
chm13v2.0_sniffles_snf	T2Tv2.0 Sniffles2 single-sample SNF file
chm13v2.0_sniffles_tbi	TBI index for the T2Tv2.0 Sniffles2 VCF file
chm13v2.0_sniffles_vcf	T2Tv2.0 Sniffles2 single-sample VCF file
grch38_bai	The accompanying index for the grch38_noalt BAM.
grch38_bam	grch38_noalt sequencing reads in BAM format
grch38_bam_pbi	The accompanying PBI index for the grch38_noalt BAM. Only available for CDRv7 samples.
grch38_deepvariant_phased_tbi	TBI index for the grch38_noalt PEPPER-Margin-DeepVariant phased VCF. Only available for CDRv7 samples.
grch38_deepvariant_phased_vcf	grch38_noalt PEPPER-Margin-DeepVariant phased single-sample VCF; a filter of QUAL<40 has been applied. Only available for CDRv7 samples.
grch38_deepvariant_tbi	TBI index for the grch38_noalt PEPPER-Margin-DeepVariant VCF. Only available for CDRv7 samples.
grch38_deepvariant_vcf	grch38_noalt PEPPER-Margin-DeepVariant single-sample VCF; a filter of QUAL<40 has been applied. Only available for CDRv7 samples.
grch38_dv_gtbi	TBI index for the DeepVariant grch38_noalt GVCF. Only available for CDRv8 samples.
grch38_dv_gvcf	grch38_noalt DeepVariant single-sample SNP & Indel GVCF. Only available for CDRv8 samples.
grch38_haplotagged_bai	grch38_noalt haplotagged BAM index. Only available for CDRv7 samples.
grch38_haplotagged_bam	grch38_noalt haplotagged BAM. Only available for CDRv7 samples.
grch38_pav_tbi	TBI index for the grch38_noalt PAV VCF. Only available for the PacBio samples.
grch38_pav_vcf	grch38_noalt single-sample PAV VCF. Only available for the PacBio samples.
grch38_pbsv_tbi	TBI index for the grch38_noalt PBSV SV VCF
grch38_pbsv_vcf	grch38_noalt PBSV SV single-sample VCF
grch38_sniffles_snf	grch38_noalt Sniffles2 single-sample SNF file
grch38_sniffles_tbi	TBI index for the grch38_noalt Sniffles2 VCF file
grch38_sniffles_vcf	grch38_noalt single-sample Sniffles2 VCF file
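
The sketch below illustrates one way to look up file locations in the manifest. It assumes the manifest is a TSV with a header row, lower-cases the column names so that the capitalization of the platform column does not matter, and drops empty columns because not every file type exists for every sample. The function name, the toy rows, and the bucket paths are hypothetical placeholders.

```python
import csv
import io


def load_manifest(handle):
    """Return {(research_id, center, platform): {column: path}} with empty cells dropped."""
    manifest = {}
    for row in csv.DictReader(handle, delimiter="\t"):
        row = {name.lower(): value for name, value in row.items() if name}
        key = (row.pop("research_id"), row.pop("center"), row.pop("platform"))
        manifest[key] = {column: path for column, path in row.items() if path}
    return manifest


# Toy manifest (hypothetical): one participant sequenced on two platforms.
toy_manifest = (
    "research_id\tcenter\tPlatform\tgrch38_bam\tgrch38_pav_vcf\n"
    "1000001\tBI\trevio\tgs://example-bucket/a.bam\tgs://example-bucket/a.pav.vcf.gz\n"
    "1000001\tJHU\tont\tgs://example-bucket/b.bam\t\n"
)
manifest = load_manifest(io.StringIO(toy_manifest))

# Rows belonging to the same participant, across facilities and platforms.
rows_for_sample = [key for key in manifest if key[0] == "1000001"]
print(rows_for_sample)
```

Here the ONT row has no PAV VCF entry, which matches the note above that PAV files are only available for PacBio samples.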

Frequently Asked Questions (FAQs) Regarding the Genomic Data Organization

1. Which variants in the VDS are included in the VAT?

Variants included in the VAT must meet the following criteria:

Sites that pass the 'filters' field
Sites with 50 or fewer alternative alleles (for CDRv7)
Variants from these sites that pass the 'FT' field and can be annotated by Nirvana from the VDS.

Passing the 'FT' field means that at least one call for the variant has passed the 'FT' filter.

Note: The cutoff for alternative alleles in CDRv7 is 50, though with other releases, this number can change.
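
The criteria above can be read as a single yes/no check per variant. The following sketch is only an illustration of that logic, not the pipeline that builds the VAT. The function name and its inputs are hypothetical; in particular, whether a site passes 'filters' and whether Nirvana can annotate a variant are passed in as booleans rather than computed.

```python
MAX_ALT_ALLELES = 50  # CDRv7 cutoff stated above; other releases may differ


def variant_in_vat(site_passes_filters, n_alt_alleles, calls_pass_ft, annotatable):
    """Apply the stated inclusion criteria to one variant.

    site_passes_filters -- True if the site passes the 'filters' field
    n_alt_alleles       -- number of alternative alleles at the site
    calls_pass_ft       -- one boolean per call for this variant ('FT' passed or not)
    annotatable         -- True if Nirvana can annotate the variant
    """
    return bool(
        site_passes_filters
        and n_alt_alleles <= MAX_ALT_ALLELES
        and any(calls_pass_ft)  # at least one call must pass 'FT'
        and annotatable
    )


# One passing call is enough, even if other calls fail 'FT'.
included = variant_in_vat(True, 3, [False, True, False], True)
# A site with more than 50 alternative alleles is excluded.
excluded = variant_in_vat(True, 51, [True], True)
print(included, excluded)
```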

2. Does the All of Us genomic dataset have Whole Exome Sequencing (WES) data?

No, the All of Us genomic dataset has Whole Genome Sequencing (WGS) data and not WES data. WES data only contains sequencing data for the protein-coding regions of the genome, known as exons, whereas WGS data sequences the entire genome. If you are only interested in the exome, we recommend that you use the exome smaller callset, which provides the variants within the exome.

3. Where can I find the research ID in the CRAM and IDAT files?

The research ID is in the file names of the CRAM and IDAT files. To correlate research IDs between the variant files and the raw data files, use the research IDs in the file name of the raw data files (CRAM and IDATs).

4. Where is the rsID stored for each variant?

The rsID for each variant is stored in the Variant Annotation Table (VAT). If you have an rsID of interest, you can use the VAT to determine the genomic coordinates of the variant for analysis in the Hail MT, VCFs, or PLINK formats.
