How the All of Us Genomic data are organized (ARCHIVED C2021Q3R6 CDR CT Dataset v5)

The All of Us genomic dataset contains whole genome sequencing (WGS) data and microarray genotype data (Array). The genomic data are accessible only through the Researcher Workbench Controlled Tier dataset (i.e., genomic data are not available through the Registered Tier). Bucket locations for accessing the data in analysis notebooks can be found in the Controlled CDR Directory. We provide variant data in Variant Call Format (VCF), Hail MatrixTables (MT), and PLINK 1.9 bed/bim/fam triplets. PLINK files are only available for the array variant data. We provide auxiliary tabular data, such as the joint callset QC flagged samples or related pairs, as tab-separated values (tsv), with the column headers in the first row. Please see Appendix A for a detailed comparison of the WGS data and the Array data. Please note that all variants are called against the hg38/GRCh38 reference.

Variant Call Format (VCF)

The Variant Call Format (VCF) is a text file format that stores genomic variant data in a tabular form (genomic position by sample ID) with a descriptive header. All of Us VCF files are based on the VCF version 4.2 specification. Most genomic tools that handle variant data analyses (e.g. Hail, PLINK, Variant Effect Predictor (VEP), Genome Analysis Toolkit (GATK)) support the VCF format.

The current WGS dataset includes 20,039 joint VCF files and corresponding index files (.tbi) for 98,622 participants. Each file covers a separate region of the genome, and no two files overlap; interval lists for each VCF shard are available. The current Array data includes 165,208 single sample VCFs and corresponding index files (.tbi) for 165,208 participants.

The information within a VCF is broken into two basic categories: site-level (INFO field) and per genotype (FORMAT field). Additionally, each site has a list of failed filters (FILTER field). If the FILTER field is empty (“.”) or “PASS,” then researchers should assume that there is a call at this site (though see Genotype Filter for WGS data, below).
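As a minimal sketch of the FILTER rule described above, the following Python function checks whether a VCF data line passes the site-level filters. The function name `site_passes` and the two example records are hypothetical and for illustration only; the only thing taken from the VCF specification is that FILTER is the seventh tab-separated column.

```python
def site_passes(vcf_line: str) -> bool:
    """Return True if the FILTER column is "." (empty) or "PASS"."""
    columns = vcf_line.rstrip("\n").split("\t")
    # VCF columns: CHROM POS ID REF ALT QUAL FILTER INFO [FORMAT sample...]
    filter_value = columns[6]
    return filter_value in (".", "PASS")


# Hypothetical records, not taken from the All of Us callset.
passing = "chr1\t10177\t.\tA\tC\t.\t.\tAC=1;AN=2\tGT\t0/1"
failing = "chr1\t10180\t.\tT\tG\t.\tLowQual\tAC=1;AN=2\tGT\t0/1"

print(site_passes(passing), site_passes(failing))
```

For WGS data, a passing site-level FILTER is not sufficient on its own, because individual genotypes can still carry a genotype filter (FT).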

In the current genomic dataset, VCFs only include single-nucleotide variants (SNVs) and short insertion-deletion (Indel) information.

Please note that individual VCFs may contain substantially different information. For example, a VCF generated from WGS samples will typically have different fields than what would be found in a VCF from array data. The header in a VCF file will explain which fields are present and what their data type (e.g. number vs string) and descriptions are.

WGS VCFs

The WGS VCFs are sharded by genomic region, with no shard spanning multiple contigs. We also provide interval lists (.interval_list) that describe the genomic region covered by each shard and a summary file that details the extent of each shard (see Appendix B). Note that the filter tag (FT) on the genotypes is populated, and analyses should take it into account by treating genotypes that carry a filter value as “no calls” (“./.”).

Please note that several fields in the WGS VCF (and corresponding Hail MatrixTable) should be ignored. Please see Known Issues #6 in the All of Us Research Program Genomic Research Data Quality Report. Those fields are not included in the descriptions below.

WGS VCFs in the All of Us genomic dataset contain the following:

FORMAT fields (per sample-site):

Genotype (GT) -- The GT field specifies the alleles carried by the sample, encoded by a 0 for the reference (REF) allele, 1 for the first alternative (ALT) allele, 2 for the second ALT allele, etc. Since humans are diploid organisms, we expect two alleles (e.g. “0/1”). Please note that the GT calls on sex chromosomes will have two alleles, even in the case of chrY and chrX in males.
Genotype Quality (GQ) -- The phred-scaled confidence that the called genotype is correct. A higher score indicates a higher confidence. For more information on GQ, please see the GQ documentation. For more information on interpreting phred-scaled values, please see Phred-scaled quality scores.
Reference Genotype Quality (RGQ) -- The phred-scaled confidence that the reference genotypes are correct. A higher score indicates a higher confidence. For more information on RGQ, please see the GQ documentation, but note that RGQ applies to the reference, not the variant. For more information on interpreting phred-scaled values, please see Phred-scaled quality scores.
Genotype Filter (FT) -- The WGS information has additional genotype-level filtering information. As part of our joint callset quality control processing, we run Allele-Specific Variant Quality Score Recalibration, and use the results to populate the genotype filter (FT) field. An example code snippet for filtering genotypes, in Hail, can be found in the Manipulating Hail Matrix Table tutorial notebook.

Possible values:

low_VQSLOD_INDEL -- The site did not pass the indel model cutoff for the recalibrated variant quality score. The cutoff corresponds to a target sensitivity of 0.990.
low_VQSLOD_SNP -- The site did not pass the SNP model cutoff for the recalibrated variant quality score. The cutoff corresponds to a target sensitivity of 0.997.
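One way to apply the genotype filter when reading WGS VCF text directly is sketched below. This is an illustration under stated assumptions, not the program's reference implementation: the function name `effective_genotype` is hypothetical, the FORMAT order in the examples is made up, and the sketch assumes that a missing FT, “.”, or “PASS” means the genotype is unfiltered.

```python
# Assumption: these FT values mean "not filtered".
PASSING_FT = {".", "PASS"}


def effective_genotype(format_field: str, sample_field: str) -> str:
    """Return the GT to use, replacing filtered genotypes with a no-call."""
    keys = format_field.split(":")
    values = sample_field.split(":")
    # VCF allows trailing sub-fields to be dropped; zip stops at the shorter list.
    entry = dict(zip(keys, values))
    ft = entry.get("FT", ".")
    if ft not in PASSING_FT:
        return "./."  # treat filtered genotypes as no-calls
    return entry.get("GT", "./.")


# Hypothetical FORMAT/sample pairs for illustration.
print(effective_genotype("GT:GQ:RGQ:FT", "0/1:35:40:low_VQSLOD_SNP"))
print(effective_genotype("GT:GQ:RGQ:FT", "0/1:35:40:PASS"))
```

The Manipulating Hail Matrix Table tutorial notebook remains the place to look for the equivalent Hail filtering code.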

INFO fields (per site):

Descriptions of the INFO fields can also be found in the header of the VCF.

Allele Count (AC) -- the number of times we see each alternate allele in the sample. For example, a “1/1” genotype would count as 2 observations of the first alternate allele.
Allele Number (AN) -- the total number of alleles seen. Usually, this will be the number of samples times two, since humans are diploid organisms. No-call genotypes (“./.”) are not counted towards AN.
Allele Frequency (AF) -- the frequency of the alternate allele in the population that is the callset cohort. This is equivalent to AC/AN.
QUAL approximation (QUALapprox) -- the sum of the phred-scaled homozygous reference probability values across all samples, which is a proxy for the site-level QUAL score, but without the SNP or indel heterozygosity applied as a per-site prior probability of variation. For more information on the QUAL score, see the VCF specification.
Allele-specific QUAL approximations (AS_QUALapprox) -- a per-allele, phred-scaled quality score derived from the sum of homozygous reference probability values across samples when each allele is considered in isolation. This is an approximation of the QUAL score for each allele. For more information on the QUAL score, see the VCF specification.
Allele-specific variant quality score log-odds (AS_VQSLOD) -- for each alt allele, the log odds of being a true variant versus being false under the trained gaussian mixture model. Please see the GATK VQSR documentation or Genomics in the Cloud for more information on VQSR.
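The AC, AN, and AF definitions above can be illustrated with a short Python sketch that recomputes them from a list of GT strings. The function name `allele_stats` and the example genotypes are hypothetical. As a simplification, any genotype containing a “.” is skipped entirely.

```python
def allele_stats(genotypes, n_alt):
    """Compute AC (per ALT allele), AN, and AF from GT strings such as "0/1"."""
    ac = [0] * n_alt
    an = 0
    for gt in genotypes:
        alleles = gt.replace("|", "/").split("/")
        if "." in alleles:
            continue  # no-call genotypes do not count towards AN
        for allele in alleles:
            index = int(allele)
            an += 1
            if index > 0:
                ac[index - 1] += 1  # allele 1 is the first ALT allele
    af = [count / an if an else None for count in ac]
    return ac, an, af


# Hypothetical site with two ALT alleles and four samples.
ac, an, af = allele_stats(["0/1", "1/1", "./.", "0/2"], n_alt=2)
print(ac, an, af)
```

In this example the “1/1” genotype contributes two observations of the first ALT allele, and the no-call sample is excluded from AN.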

FILTER values (per site):

QUAL score does not meet threshold (LowQual) -- sites with this filter have a posterior probability of being variant that is equal to or below the probability of being variant by chance, represented by the expected heterozygosity for humans (QUALapprox lower than 60 for SNPs; 69 for Indels)

QUAL tells you how confident we are that there is some kind of variation at a given site. The variation may be present in one or more samples.

No high-quality genotypes (NO_HQ_GENOTYPES) -- sites with this filter do not have any genotypes that are considered high quality (GQ>=20, DP>=10, and AB>=0.2 for heterozygotes)

Allele Balance (AB) is calculated for each heterozygous variant as the number of reads supporting the least-represented allele over the total number of read observations. In other words, min(allele depth)/(total depth) for diploid GTs.
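The allele balance formula and the high-quality genotype thresholds can be sketched as follows. The function names and inputs are hypothetical. The sketch assumes the caller supplies the read counts for the two alleles of the diploid genotype together with the total depth, and it is not the pipeline's actual code.

```python
def allele_balance(allele_depths, total_depth):
    """min(allele depth) / (total depth) for a diploid heterozygous genotype."""
    if total_depth == 0:
        return None
    return min(allele_depths) / total_depth


def is_high_quality(gq, dp, allele_depths, is_het):
    """Apply GQ>=20, DP>=10, and (for heterozygotes only) AB>=0.2."""
    if gq < 20 or dp < 10:
        return False
    if is_het:
        ab = allele_balance(allele_depths, dp)
        return ab is not None and ab >= 0.2
    return True


# Hypothetical heterozygous genotypes: 4 of 16 reads vs. 2 of 16 reads.
print(allele_balance((12, 4), 16))
print(is_high_quality(gq=30, dp=16, allele_depths=(12, 4), is_het=True))
print(is_high_quality(gq=30, dp=16, allele_depths=(14, 2), is_het=True))
```

A site receives the NO_HQ_GENOTYPES filter only when no genotype at that site satisfies these conditions.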

Excess Heterozygosity (ExcessHet) -- sites with this filter have more heterozygote genotypes than expected by chance under Hardy-Weinberg equilibrium. ExcessHet is a phred-scaled p-value. We cut off anything more extreme than a z-score of -4.5 (p-value of 3.4e-06), which is 54.69 when phred-scaled.
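Several values in this section (GQ, RGQ, ExcessHet) are phred-scaled. As a small worked example, the conversion Q = -10 * log10(p) reproduces the ExcessHet cutoff quoted above. The helper names `phred` and `from_phred` are hypothetical.

```python
import math


def phred(probability: float) -> float:
    """Convert a probability to a phred-scaled value."""
    return -10 * math.log10(probability)


def from_phred(quality: float) -> float:
    """Convert a phred-scaled value back to a probability."""
    return 10 ** (-quality / 10)


print(round(phred(3.4e-06), 2))  # 54.69
```

By the same conversion, a GQ of 20 corresponds to a 1-in-100 chance that the called genotype is wrong.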

Array VCFs

Array VCFs in the All of Us genomic dataset will contain the following:

Header

The header field of the VCF contains many attributes which generally describe the processing of the sample in the array. Many of these are specific to a single sample.

arrayType - This contains the name of the genotyping array that was processed.
autocallDate - The date that the genotyping array was processed by ‘autocall’ (aka gencall), the Illumina genotype calling software.
autocallGender - The gender (sex) that autocall determined for the sample processed.
autocallVersion - The version of the autocall/gencall software used.
chipWellBarcode - The chip well barcode (a unique identifier for sample as processed on a specific location on the Illumina genotyping array).
clusterFile - The cluster file used.
extendedIlluminaManifestVersion - The version of the ‘extended Illumina manifest’ used by the VCF generation software.
extendedManifestFile - The filename of the ‘extended Illumina manifest’ used by the VCF generation software.
fingerprintGender - The gender (sex) determined using an orthogonal fingerprinting technology. This is populated by an optional parameter used by the VCF generation software.
gtcCallRate - The gtc call rate of the sample processed. This value is generated by the autocall/gencall software and represents the fraction of callable loci that had valid calls.
imagingDate - The date that the idats (raw image scans) for the chip well barcode were created.
manifestFile - The name of the Illumina manifest (.bpm) file used by the VCF generation software.
sampleAlias - The name of the sample.

Note that there are many other attributes in the header (Biotin*, DNP*, Extension*, Hyb*, NP*, NSB*, Restore, String*, TargetRemoval) that are populated with Illumina control values. They are not described here.

Filtered Sites (FILTER)

There are several filters specific to genotyping array content. These are:

DUPE - This filter is applied if there are multiple rows in the VCF for the same locus and alleles. That is, if there are two or more rows that share the same chromosome, position, ref allele and alternate alleles, all but one of them will have the ‘DUPE’ filter set.
TRIALLELIC - This filter is applied if there is a site at which there are two alternate alleles and neither of them is the same as the reference allele.
ZEROED_OUT_ASSAY - This filter is applied if the variant at the site was ‘zeroed out’ in the Illumina cluster file - this is typically done when the calls at the site are intentionally marked as unusual. Genotypes called at sites that are ‘zeroed out’ will always be no-calls.

Genotype (sample level fields)

These fields describe attributes specific to the sample genotyped on the array. The FORMAT specifier in the VCF header describes these fields. They are:

GT - GENOTYPE. This field describes the genotype. It is a standard field, described in the VCF specification.
IGC - Illumina GenCall Confidence Score. A measure of the call confidence.
X - Raw X intensity as scanned from the original genotyping array
Y - Raw Y intensity as scanned from the original genotyping array
NORMX - Normalized X intensity
NORMY - Normalized Y intensity
R - Normalized R Value (one of the polar coordinates after the transformation of NORMX and NORMY)
THETA - Normalized Theta value (one of the polar coordinates after the transformation of NORMX and NORMY)
LRR - Log R Ratio
BAF - B Allele Frequency

INFO (site level fields)

These fields describe attributes specific to the probe on an array. The INFO specifier in the VCF header describes these fields. They are:

AC - Allele Count in genotypes, for each ALT allele. A standard field, described in the VCF specification
AF - Allele Frequency. A standard field, described in the VCF specification
AN - Allele Number. A standard field, described in the VCF specification
ALLELE_A - The A Allele, as annotated in the Illumina manifest (a * suffix indicates this is the reference allele)
ALLELE_B - The B Allele, as annotated in the Illumina manifest (a * suffix indicates this is the reference allele)
BEADSET_ID - The BeadSet ID. An Illumina identifier. Used for normalization.
GC_SCORE - The Illumina GenTrain Score. A quality score describing the probe design
ILLUMINA_BUILD - The Genome Build for the design probe sequence, as annotated in the Illumina manifest
ILLUMINA_CHR - The chromosome of the design probe sequence, as annotated in the Illumina manifest.
ILLUMINA_POS - The position of the design probe sequence (on ILLUMINA_CHR), as annotated in the Illumina manifest.
ILLUMINA_STRAND - The strand for the design probe sequence, as annotated in the Illumina manifest.
PROBE_A - The allele A probe sequence as annotated in the Illumina manifest.
PROBE_B - The allele B probe sequence as annotated in the Illumina manifest. Note that this is only present on strand ambiguous SNPs.
SOURCE - The probe source as annotated in the Illumina manifest.
refSNP - The dbSNP rsId for this probe

Hail MatrixTable

A Hail MatrixTable (MT) is a set of binary files describing a two-dimensional matrix of entry fields where each entry is indexed by row key (variants) and column key (samples). We provide Hail MatrixTables for both WGS and Array data. Please refer to the published notebooks on how to use Hail MatrixTables.

Array Hail MatrixTable

We have merged the Array VCFs into a dense Hail MatrixTable (MT) (i.e. no additional processing across samples). Each column corresponds to the research ID of the sample and each row corresponds to the variant. Since the single sample array VCFs have identical sites and FILTER values, the FILTER field is populated with the value from a single sample VCF.

In conversion, we have dropped all of the 548 variants from alternate, unlocalized, and unplaced contigs (453 variants from ALT contigs (e.g. chr19_KI270866v1_alt), 82 from random contigs (e.g. chr1_KI270706v1_random), and 13 from chrUn). Please refer to the published notebooks on how we generated the Hail MatrixTable from the VCFs.

WGS Hail MatrixTable

The WGS MT contains the result of the WGS joint callset. The WGS MT does not split multi-allelic variants, and every “PASS” value in the FILTER field is written as “.” (i.e. missing) in the Hail MatrixTable. The WGS Hail MatrixTable contains all the information presented in the VCFs. Please refer to the published notebooks on how we generated the Hail MatrixTable from the VCFs and on how to filter WGS variants.

Array PLINK 1.9 data

We provide binary PLINK 1.9 data (.bed / .bim / .fam) for Array data. The PLINK files are converted from the Hail MatrixTable using the export_plink command in Hail and contain all information in the Hail MatrixTable. The .fam file contains information on the participants, the .bim file contains information on the genetic markers, and the binary .bed file contains the genotypes. Please refer to the published notebooks on how to use the PLINK 1.9 data.
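In the PLINK 1.9 text formats, the .fam file has one line per sample and the .bim file has one line per variant, each with six whitespace-separated columns. The following sketch parses both with the Python standard library. The function names and the example lines are hypothetical and do not reflect actual All of Us identifiers.

```python
import io


def read_fam(handle):
    """Yield one dict per sample line of a PLINK 1.9 .fam file."""
    for line in handle:
        fid, iid, father, mother, sex, phenotype = line.split()
        yield {"fid": fid, "iid": iid, "father": father,
               "mother": mother, "sex": sex, "phenotype": phenotype}


def read_bim(handle):
    """Yield one dict per variant line of a PLINK 1.9 .bim file."""
    for line in handle:
        chrom, variant_id, cm, bp, allele1, allele2 = line.split()
        yield {"chrom": chrom, "id": variant_id, "cm": float(cm),
               "bp": int(bp), "allele1": allele1, "allele2": allele2}


# Hypothetical file contents for illustration.
fam_text = "0\t1000001\t0\t0\t0\t-9\n0\t1000002\t0\t0\t0\t-9\n"
bim_text = "1\trs0000001\t0\t10177\tC\tA\n"

samples = list(read_fam(io.StringIO(fam_text)))
variants = list(read_bim(io.StringIO(bim_text)))
print(len(samples), "samples;", len(variants), "variants")
```

The .bed file is binary and is best read with PLINK itself or a library that supports the format, rather than parsed by hand.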

Please note that we will provide pgen files, instead of bed/bim/fam, in future callset releases.

Relatedness

We report the kinship score of any pair with a score over 0.1.

The kinship score is half of the fraction of the genetic material shared.

Parent-child or siblings: 0.25
Identical twins: 0.5

Please see the Hail pc_relate function documentation for more information, including interpretation.

We provide two tables:

Pairwise samples with kinship scores above 0.1. We will not provide identity kinship scores (i.e. kinship of a sample with itself).
- Note that a pair will only appear once. In other words, {sample1, sample2, 0.25} is equivalent to {sample2, sample1, 0.25}.

Field name	Type	Key?	Notes
i.s	string	yes	Sample ID of a sample in the pair
j.s	string	yes	Sample ID of the other sample in the pair
kin	float	no	Kinship score (0-0.5)

Column Explanations

- Field name -- The name of the field. In tsvs, this will appear on the first row of the file.

- Type -- Data type. Arrays are possible.

- Key? -- Whether this field makes up a unique key for the row. Note that all key fields together make a unique key for the row.
- Notes -- Any other relevant information.
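Because each pair appears only once in the pairwise table, a lookup needs to treat the pair as unordered. A minimal sketch, assuming the tsv has exactly the `i.s`, `j.s`, and `kin` columns listed above; the function names and example rows are hypothetical.

```python
import csv
import io


def load_kinship(handle):
    """Map each unordered sample pair to its kinship score."""
    reader = csv.DictReader(handle, delimiter="\t")
    return {frozenset((row["i.s"], row["j.s"])): float(row["kin"])
            for row in reader}


def kinship(pairs, sample_a, sample_b):
    """Return the reported score, or None if the pair is not in the table."""
    return pairs.get(frozenset((sample_a, sample_b)))


def shared_fraction(kin):
    """The kinship score is half of the fraction of genetic material shared."""
    return 2 * kin


# Hypothetical table contents for illustration.
tsv_text = "i.s\tj.s\tkin\n1000001\t1000002\t0.25\n1000003\t1000004\t0.5\n"
pairs = load_kinship(io.StringIO(tsv_text))

print(kinship(pairs, "1000002", "1000001"))  # same pair, either order
print(kinship(pairs, "1000001", "1000003"))  # not reported
```

A missing pair only means the score was not above the 0.1 reporting threshold. It does not mean the kinship is exactly zero.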

A list of the samples to flag, computed from the maximal independent set of related samples. This is the minimal set of samples to prune in order to remove related samples from the cohort.

Field name	Type	Key?	Notes
sample_id.s	string	Yes	Research ID of the sample

Column Explanations

- Field name -- The name of the field. In tsvs, this will appear on the first row of the file.

- Type -- Data type. Arrays are possible.

- Key? -- Whether this field makes up a unique key for the row. Note that all key fields together make a unique key for the row.
- Notes -- Any other relevant information.
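Pruning a cohort with the flagged-sample list amounts to a set difference. The sketch below assumes a tsv with the single `sample_id.s` column described above. The function names and IDs are hypothetical.

```python
import csv
import io


def load_flagged(handle):
    """Read the flagged-sample tsv into a set of research IDs."""
    reader = csv.DictReader(handle, delimiter="\t")
    return {row["sample_id.s"] for row in reader}


def prune_related(cohort, flagged):
    """Keep cohort order, dropping every flagged sample."""
    return [sample for sample in cohort if sample not in flagged]


# Hypothetical contents for illustration.
flagged = load_flagged(io.StringIO("sample_id.s\n1000002\n1000004\n"))
cohort = ["1000001", "1000002", "1000003", "1000004"]

print(prune_related(cohort, flagged))
```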

Genetic predicted ancestry

We provide categorical genetic ancestry for all participants as a .tsv file. The ancestry categories correspond directly to the categorical ancestry definitions used within gnomAD, the Human Genome Diversity Project, and 1000 Genomes:

African/African American (afr), American Admixed/Latino (amr), East Asian (eas), European (eur), Middle Eastern (mid), South Asian (sas), and Other (oth; not belonging to one of the other ancestries or is a balanced admixture).

The report is a table, saved as a tsv, sorted by research ID.

Field Name	Key?	Type	Nullable?	Example Value	Notes
research_id	yes	String	No	1000055	This comes from sample metadata.
ancestry_pred	no	String	No	mid	The predicted ancestry for the sample, not including “other”.
probabilities	no	Array[number]	No	[0.10, 0.99, 0.001, … 0.0]	Confidence of each output class (i.e. computed ancestry). Each will have a length equal to the number of possible computed ancestry labels minus one (6). The ancestry “Other” is computed separately based on the confidence of the other classes.
pca_features	no	Array[number]	No	[8.1232, 0.01234, 3.1123, …, 0.00132]	The principal components of the projection for the sample. Each value is an array with a length of 16.
ancestry_pred_other	no	String	No	oth	The predicted ancestry for the sample, including “other”.

Column Explanations

- Field name -- The name of the field. In tsvs, this will appear on the first row of the file.

- Type -- Data type. Arrays are possible.

- Key? -- Whether this field makes up a unique key for the row. Note that all key fields together make a unique key for the row.

- Notes -- Any other relevant information.
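A row of the ancestry table could be parsed and checked against the documented array lengths as sketched below. How the array columns are serialized in the tsv is not stated here, so the sketch assumes JSON-style lists; that assumption, the function name `parse_ancestry_row`, and the example values are all hypothetical.

```python
import csv
import io
import json

N_PROBABILITIES = 6   # number of computed ancestry labels, excluding "Other"
N_PCA_FEATURES = 16   # documented length of pca_features


def parse_ancestry_row(row):
    """Decode the array columns and check their documented lengths."""
    # Assumption: arrays are stored as JSON-style lists, e.g. "[0.1, 0.9]".
    probabilities = json.loads(row["probabilities"])
    pca_features = json.loads(row["pca_features"])
    if len(probabilities) != N_PROBABILITIES:
        raise ValueError("unexpected number of probabilities")
    if len(pca_features) != N_PCA_FEATURES:
        raise ValueError("unexpected number of PCA features")
    return {
        "research_id": row["research_id"],
        "ancestry_pred": row["ancestry_pred"],
        "ancestry_pred_other": row["ancestry_pred_other"],
        "probabilities": probabilities,
        "pca_features": pca_features,
    }


# Hypothetical single-row table for illustration.
example_tsv = (
    "research_id\tancestry_pred\tprobabilities\tpca_features\tancestry_pred_other\n"
    "1000055\tmid\t[0.01, 0.02, 0.03, 0.9, 0.03, 0.01]\t"
    + json.dumps([0.1] * N_PCA_FEATURES)
    + "\toth\n"
)
rows = [parse_ancestry_row(r)
        for r in csv.DictReader(io.StringIO(example_tsv), delimiter="\t")]
print(rows[0]["ancestry_pred"], rows[0]["ancestry_pred_other"])
```

If the real files use a different array encoding, only the two `json.loads` calls would need to change.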

Flagged WGS samples

We provide a table listing samples that are flagged as part of the sample outlier QC for the WGS joint callset. This includes the specific residual tests that were failed. The schema is described in the table below. The table will be released as a tsv.

Flagged sample tsv schema

No fields can have a null value.
Count fields do not include filtered variants.
For all of the fail_* fields, a value of true indicates that the sample is an outlier and should be flagged.

Field Name	Type	Key?	Example Value	Notes
s	int	yes	1000000	Research ID
ancestry_pred	string	no	eur	The predicted ancestry for the sample, not including “other”.
probabilities	array<float>	no	[0.10, 0.99, 0.001, … 0.0]	Confidence of each output class (i.e. computed ancestry). Each will have a length equal to the number of possible computed ancestry labels minus one (6). The ancestry “Other” is computed separately based on the confidence of the other classes.
pca_features	array<float>	no	[8.1232, 0.01234, 3.1123, …, 0.00132]	Each will have a length of 16.
ancestry_pred_other	string	no	oth	The predicted ancestry for the sample, including “other”.
snp_count	int	no	3910035	Number of SNPs called in this sample.
ins_del_ratio	float	no	0.98814	Ratio of insertion to deletion counts.
del_count	int	no	427102
ins_count	float	no	456515
snp_het_homvar_ratio	float	no	2.1119
indel_het_homvar_ratio	float	no	2.3994
ti_tv_ratio	float	no	1.9967
singleton	int	no	15819	IMPORTANT: This is not the number of singletons in a sample. This field is a count of the number of variants not appearing in gnomAD 3.1.
fail_snp_count_residual	boolean	no	true
fail_ins_del_ratio_residual	boolean	no	false
fail_del_count_residual	boolean	no	true
fail_ins_count_residual	boolean	no	false
fail_snp_het_homvar_ratio_residual	boolean	no	true
fail_indel_het_homvar_ratio_residual	boolean	no	false
fail_ti_tv_ratio_residual	boolean	no	true
fail_singleton_residual	boolean	no	false
qc_metrics_filters	array<string>	no	["indel_het_homvar_ratio_residual", "snp_count_residual"]	A list of each failed test. These will correspond to all fail_* fields with a value of “true”.
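The relationship between the fail_* fields and qc_metrics_filters can be checked with a short sketch. It assumes booleans are written as the text “true”/“false” and that qc_metrics_filters is a JSON-style list, as in the example values above. The function names and the example row are hypothetical.

```python
import json


def failed_tests(row):
    """Names of all fail_* fields set to true, without the "fail_" prefix."""
    failed = []
    for name, value in row.items():
        if name.startswith("fail_") and str(value).strip().lower() == "true":
            failed.append(name[len("fail_"):])
    return sorted(failed)


def is_consistent(row):
    """Check that qc_metrics_filters lists exactly the failed tests."""
    listed = sorted(json.loads(row["qc_metrics_filters"]))
    return listed == failed_tests(row)


# Hypothetical row (only the relevant fields) for illustration.
example_row = {
    "s": "1000000",
    "fail_snp_count_residual": "true",
    "fail_ins_del_ratio_residual": "false",
    "fail_indel_het_homvar_ratio_residual": "true",
    "fail_ti_tv_ratio_residual": "false",
    "qc_metrics_filters":
        '["indel_het_homvar_ratio_residual", "snp_count_residual"]',
}

print(failed_tests(example_row))
print(is_consistent(example_row))
```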

Variant Annotation Table

We provide 102 functional annotations through the Variant Annotation Table. We provide the annotations as compressed tab-separated value text files (“.tsv.gz”) that match each shard of the VCFs. Additionally, we provide a single, merged tsv file (“.tsv.bgz”).

Appendix A: WGS and Array data comparison for beta

Table A.1 -- Comparison between WGS and Array genomic data

Deliverables	WGS	Array	Notes
VCFs	Joint called VCFs sharded by genomic region	Single sample VCFs	For arrays, VCFs all have the same variants. All VCFs are sorted and block compressed (.vcf.gz) with a local tabix index (.vcf.gz.tbi).
Hail Matrix Table	Yes	Yes
PLINK files	No	Yes	Generated from the Hail MatrixTable.
AC, AN, and AF in the VCFs	Yes	No	Not meaningful in single sample VCFs (i.e. array VCFs).
AC, AN, and AF in the Hail MatrixTable	Yes	Yes	For arrays, we have created one Hail MT based on all of the array VCFs. We re-calculate the AC, AN, and AF.
Ancestry	Yes	No	WGS samples are a subset of array samples, so the intersecting array samples will have ancestry information.
Relatedness	Yes	No	WGS samples are a subset of array samples, so the intersecting array samples will have relatedness information.
Joint callset QC	Yes	No	Please see the Joint Callset section of the All of Us Beta Release Genomic Quality Report.
Functional Annotations	Yes	No

Appendix B: WGS VCF shard interval extent file format

We include a single tsv file that details the genomic extent of each VCF shard interval file. This allows researchers to prune the considered intervals up front in their analyses. The format of the file is described in Table B.1:

Table B.1 -- Interval list extent tsv file

No fields can have a null value.

Field Name	Type	Key?	Example Value	Notes
filename	string	yes	0000000000-scattered.interval_list	The filename of the corresponding interval list file. Note that the shard number is in the filename.
start_contig	string	no	chr1	This will always be the same as the end_contig, since the interval lists do not span multiple contigs.
start_position	int	no	10001	Position on the contig for the start of the interval list file.
end_contig	string	no	chr1	This will always be the same as the start_contig, since the interval lists do not span multiple contigs.
end_position	int	no	124997	This will always be greater than the start_position. Position on the contig for the end of the last interval in the interval list file.
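Pruning shards up front can be sketched as an overlap test against the extent file. The sketch assumes the tsv header uses exactly the field names in Table B.1. The function name, the second example row, and the query regions are hypothetical.

```python
import csv
import io


def overlapping_shards(handle, contig, start, end):
    """Return interval-list filenames whose extent overlaps contig:start-end."""
    reader = csv.DictReader(handle, delimiter="\t")
    hits = []
    for row in reader:
        # start_contig always equals end_contig, so one comparison suffices.
        if row["start_contig"] != contig:
            continue
        if int(row["start_position"]) <= end and int(row["end_position"]) >= start:
            hits.append(row["filename"])
    return hits


# First row follows the example values in Table B.1; the second is made up.
extent_tsv = (
    "filename\tstart_contig\tstart_position\tend_contig\tend_position\n"
    "0000000000-scattered.interval_list\tchr1\t10001\tchr1\t124997\n"
    "0000000001-scattered.interval_list\tchr1\t124998\tchr1\t250000\n"
)

print(overlapping_shards(io.StringIO(extent_tsv), "chr1", 120000, 130000))
print(overlapping_shards(io.StringIO(extent_tsv), "chr2", 1, 1000))
```

Because the extent spans from the first interval's start to the last interval's end, a hit means the shard may cover the query region. The interval list itself gives the exact intervals.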
