The All of Us genomic dataset contains whole genome sequencing (WGS) data and microarray genotype data (Array). The genomic data are accessible through the Researcher Workbench Controlled Tier dataset (i.e., genomic data are not available through the Registered Tier). Bucket locations for accessing the data in analysis notebooks can be found in the Controlled CDR Directory. We provide variant data in Variant Call Format (VCF), Hail MatrixTables (MT), and PLINK 1.9 bed/bim/fam triplets. PLINK files are only available for the array variant data. Raw data are available in compressed CRAM format for the WGS reads and as IDAT files for Array data. We also provide auxiliary tabular data, such as the joint callset QC flagged samples or related pairs, as tab-separated values (tsv), with the column headers in the first row. Please see Appendix A for a detailed comparison of the WGS data and the Array data. Please note that all variants are called against the hg38/GRCh38 reference. You can find public reference files here.
Variant Call Format (VCF)
The Variant Call Format (VCF) is a text file that stores genomic variant data in a tabular form (genomic position by sample ID) with a descriptive header. All of Us VCF files are based on VCF version 4.2 specification. Most genomic tools that handle variant data analyses (e.g. Hail, PLINK, Variant Effect Predictor (VEP), Genome Analysis Toolkit (GATK)) support VCF format.
The current WGS dataset includes 20,039 joint VCF files and corresponding index files (.tbi) for 98,590 participants. Each file covers a separate region of the genome, and no two files overlap; an interval list is available for each VCF shard. The current Array data includes 165,127 single sample VCFs and corresponding index files (.tbi) for 165,127 participants.
The information within a VCF is broken into two basic categories: site-level (INFO field) and per genotype (FORMAT field). Additionally, each site has a list of failed filters (FILTER field). If the FILTER field is empty (“.”) or “PASS,” then researchers should assume that there is a call at this site (though see Genotype Filter for WGS data, below).
In the current genomic dataset, VCFs only include single-nucleotide variants (SNVs) and short insertion-deletion (Indel) information.
Please note that individual VCFs may contain substantially different information. For example, a VCF generated from WGS samples will typically have different fields than what would be found in a VCF from array data. The header in a VCF file will explain which fields are present and what their data type (e.g. number vs string) and descriptions are.
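As a minimal sketch of the layout described above, the following Python splits one VCF data line into its site-level (INFO, FILTER) and per-genotype (FORMAT) parts. The function name `parse_vcf_line` and the example line are hypothetical and only for illustration; real analyses would normally use an established tool such as Hail or GATK.

```python
def parse_vcf_line(line):
    """Split one tab-separated VCF data line into site-level and per-genotype parts."""
    cols = line.rstrip("\n").split("\t")
    chrom, pos, _vid, ref, alt, _qual, flt, info = cols[:8]

    # INFO is a ';'-separated list of KEY=VALUE pairs or bare flags.
    info_fields = {}
    if info != ".":
        for item in info.split(";"):
            key, _, value = item.partition("=")
            info_fields[key] = value if value else True

    # FORMAT names the ':'-separated values found in every sample column.
    genotypes = []
    if len(cols) > 9:
        keys = cols[8].split(":")
        genotypes = [dict(zip(keys, sample.split(":"))) for sample in cols[9:]]

    passes = flt in (".", "PASS")
    return {
        "chrom": chrom,
        "pos": int(pos),
        "ref": ref,
        "alt": alt.split(","),
        "site_passes": passes,
        "filters": [] if passes else flt.split(";"),
        "info": info_fields,
        "genotypes": genotypes,
    }


# Illustrative line only; the values are made up.
example = "chr1\t10001\t.\tA\tG\t.\tPASS\tAC=1;AN=4;AF=0.25\tGT:GQ\t0/1:40\t0/0:35"
record = parse_vcf_line(example)
```

Here `record["site_passes"]` reflects only the site-level FILTER field; genotype-level filtering for WGS data is a separate step.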
WGS VCFs
The WGS VCFs are sharded by genomic region, with no shard spanning multiple contigs. We also provide interval lists (.interval_list) that describe the genomic region covered by each shard and a summary file that details the extent of each shard (see Appendix B). Note that the genotype filter (FT) field is populated; analyses should take it into account by treating genotypes that fail the filter as “no calls” (“./.”).
Please note that there are two known issues in the WGS VCF (and corresponding Hail MatrixTable). The Allelic Depth (AD) field has an unconventional format and there is an extraneous annotation in the INFO field. You can read more about this in Known Issues #4 and #5 in the All of Us Research Program Genomic Research Data Quality Report.
WGS VCFs in the All of Us genomic dataset contain the following:
FORMAT fields (per sample-site):
- Genotype (GT) -- The GT field specifies the alleles carried by the sample, encoded by a 0 for the reference (REF) allele, 1 for the first alternative (ALT) allele, 2 for the second ALT allele, etc. Since humans are diploid organisms, we expect two alleles (e.g. “0/1”). Please note that the GT calls on sex chromosomes will have two alleles, even in the case of chrY and chrX in males.
- Allelic Depth (AD) -- Allelic depths for the reference allele and the alternate allele(s) present at this site. The AD is only specified for variants with a non-ref allele (not homozygous reference). Heterozygous reference samples will have 2 values and heterozygous alternate samples will have 3 values because the reference count is always included. For more information about AD and which reads are counted, see this article on Allele Depth. The AD format in the WGS VCFs is not the conventional AD format. For more information about the unconventional format, see Known Issue #4 in the All of Us Research Program Genomic Research Data Quality Report.
- Genotype Quality (GQ) -- The phred-scaled confidence that the called genotype is correct. A higher score indicates a higher confidence. For more information on GQ, please see the GQ documentation. For more information on interpreting phred-scaled values, please see Phred-scaled quality scores.
- Reference Genotype Quality (RGQ) -- The phred-scaled confidence that the reference genotypes are correct. A higher score indicates a higher confidence. For more information on RGQ, please see the GQ documentation, but note that RGQ applies to the reference, not the variant. For more information on interpreting phred-scaled values, please see Phred-scaled quality scores.
- Genotype Filter (FT) -- The WGS information has additional genotype-level filtering information. As part of our joint callset quality control processing, we run Allele-Specific Variant Quality Score Recalibration, and use the results to populate the genotype filter (FT) field. An example code snippet for filtering genotypes, in Hail, can be found in the Manipulating Hail Matrix Table tutorial notebook. Possible values:
  - low_VQSLOD_INDEL -- The site did not pass the indel model cutoff for the recalibrated variant quality score. The cutoff corresponds to a target sensitivity of 0.990.
  - low_VQSLOD_SNP -- The site did not pass the SNP model cutoff for the recalibrated variant quality score. The cutoff corresponds to a target sensitivity of 0.997.
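The recommendation to treat filtered genotypes as no-calls can be sketched in plain Python as follows. This is an illustration, not the Hail snippet from the tutorial notebook: the helper name `apply_genotype_filter` is hypothetical, and it assumes that an FT value of “PASS” or “.” (or an absent FT field) means the genotype was not filtered.

```python
def apply_genotype_filter(format_keys, sample_value):
    """Return the per-sample fields, with GT masked to './.' when FT flags a failure."""
    fields = dict(zip(format_keys.split(":"), sample_value.split(":")))
    # Assumption: "PASS", "." or a missing FT all mean "not filtered".
    ft = fields.get("FT", ".")
    if ft not in (".", "PASS"):
        fields["GT"] = "./."
    return fields


# Illustrative values only.
filtered = apply_genotype_filter("GT:GQ:FT", "0/1:12:low_VQSLOD_SNP")
kept = apply_genotype_filter("GT:GQ:FT", "0/1:45:PASS")
```

In this sketch `filtered["GT"]` becomes a no-call while `kept["GT"]` is left unchanged.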
INFO fields (per site):
Descriptions of the INFO fields can also be found in the header of the VCF.
- Allele Count (AC) -- the number of times we see each alternate allele in the sample. For example, a “1/1” genotype would count as 2 observations of the first alternate allele.
- Allele Number (AN) -- the total number of alleles seen. Usually, this will be the number of samples times two, since humans are diploid organisms. No-call genotypes (“./.”) are not counted towards AN.
- Allele Frequency (AF) -- the frequency of the alternate allele in the population that is the callset cohort. This is equivalent to AC/AN.
- QUAL approximation (QUALapprox) -- the sum of the phred-scaled homozygous reference probability values across all samples, which is a proxy for the site-level QUAL score, but without the SNP or indel heterozygosity applied as a per-site prior probability of variation. For more information on the QUAL score, see the VCF specification.
- Allele-specific QUAL approximations (AS_QUALapprox) -- a per-allele, phred-scaled quality score derived from the sum of homozygous reference probability values across samples when each allele is considered in isolation. This is an approximation of the QUAL score for each allele. For more information on the QUAL score, see the VCF specification.
- Allele-specific variant quality score log-odds (AS_VQSLOD) -- for each alt allele, the log odds of being a true variant versus being false under the trained gaussian mixture model. Please see the GATK VQSR documentation or Genomics in the Cloud for more information on VQSR.
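The relationship between AC, AN, and AF described above can be checked with a short calculation. The sketch below is illustrative only; `allele_stats` is a hypothetical helper that takes a list of GT strings for one site and the number of ALT alleles at that site.

```python
import re


def allele_stats(genotypes, n_alt):
    """Compute AC (per ALT allele), AN, and AF from GT strings such as '0/1'."""
    ac = [0] * n_alt
    an = 0
    for gt in genotypes:
        for allele in re.split(r"[/|]", gt):
            if allele == ".":
                continue  # no-calls do not count towards AN
            an += 1
            index = int(allele)
            if index > 0:
                ac[index - 1] += 1
    af = [count / an for count in ac] if an else [None] * n_alt
    return ac, an, af


# Four samples at a biallelic site; one sample is a no-call.
ac, an, af = allele_stats(["0/1", "1/1", "./.", "0/0"], n_alt=1)
# ac == [3], an == 6, af == [0.5]
```

The “1/1” genotype contributes two observations of the first ALT allele, and the no-call contributes nothing to AN, so AF = AC/AN = 3/6.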
FILTER values (per site):
- QUAL score does not meet threshold (LowQual) -- sites with this filter have a posterior probability of being variant that is equal to or below the probability of being variant by chance, represented by the expected heterozygosity for humans (QUALapprox lower than 60 for SNPs; 69 for Indels).
  - QUAL tells you how confident we are that there is some kind of variation at a given site. The variation may be present in one or more samples.
- No high-quality genotypes (NO_HQ_GENOTYPES) -- sites with this filter do not have any genotypes that are considered high quality (GQ>=20, DP>=10, and AB>=0.2 for heterozygotes).
  - Allele Balance (AB) is calculated for each heterozygous variant as the number of reads supporting the least-represented allele over the total number of read observations. In other words, min(allele depth)/(total depth) for diploid GTs.
- Excess Heterozygosity (ExcessHet) -- sites with this filter have more heterozygote genotypes than expected by chance under Hardy-Weinberg equilibrium. ExcessHet is a phred-scaled p-value. We cut off anything more extreme than a z-score of -4.5 (p-value of 3.4e-06), which phred-scaled is 54.69.
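The arithmetic behind these thresholds can be reproduced with a few lines of Python. This is a sketch under stated assumptions: the function names are hypothetical, and `allele_depths` is assumed to hold conventional read counts for the two called alleles of a heterozygous genotype (the AD field in the WGS VCFs has an unconventional format, see Known Issue #4, so it may need conversion first).

```python
import math


def phred(p_value):
    """Phred-scale a probability: -10 * log10(p)."""
    return -10.0 * math.log10(p_value)


def allele_balance(allele_depths):
    """min(allele depth) / (total depth) for a diploid heterozygous genotype."""
    total = sum(allele_depths)
    return min(allele_depths) / total if total else 0.0


def is_high_quality(gq, dp, allele_depths=None):
    """Apply GQ>=20 and DP>=10, plus AB>=0.2 when depths for a heterozygote are given."""
    if gq < 20 or dp < 10:
        return False
    if allele_depths is not None:
        return allele_balance(allele_depths) >= 0.2
    return True


excess_het_cutoff = round(phred(3.4e-06), 2)  # 54.69
balanced = allele_balance([30, 10])           # 0.25
```

The first value matches the ExcessHet cutoff quoted above; the second shows a heterozygote with 30 and 10 supporting reads clearing the AB>=0.2 requirement.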
Array VCFs
Array VCFs in the All of Us genomic dataset will contain the following:
Header
The header field of the VCF contains many attributes which generally describe the processing of the sample in the array. Many of these are specific to a single sample.
- arrayType - This contains the name of the genotyping array that was processed.
- autocallDate - The date that the genotyping array was processed by ‘autocall’ (aka gencall), the Illumina genotype calling software.
- autocallGender - The gender (sex) that autocall determined for the sample processed.
- autocallVersion - The version of the autocall/gencall software used.
- chipWellBarcode - The chip well barcode (a unique identifier for sample as processed on a specific location on the Illumina genotyping array).
- clusterFile - The cluster file used.
- extendedIlluminaManifestVersion - The version of the ‘extended Illumina manifest’ used by the VCF generation software.
- extendedManifestFile - The filename of the ‘extended Illumina manifest’ used by the VCF generation software.
- fingerprintGender - The gender (sex) determined using an orthogonal fingerprinting technology. This is populated by an optional parameter used by the VCF generation software.
- gtcCallRate - The gtc call rate of the sample processed. This value is generated by the autocall/gencall software and represents the fraction of callable loci that had valid calls.
- imagingDate - The date that the IDAT files (raw image scans) for the chip well barcode were created.
- manifestFile - The name of the Illumina manifest (.bpm) file used by the VCF generation software.
- sampleAlias - The name of the sample.
Note that there are many other attributes in the header (Biotin*, DNP*, Extension*, Hyb*, NP*, NSB*, Restore, String*, TargetRemoval) that are populated with Illumina control values. They are not described here.
Filtered Sites (FILTER)
There are several filters specific to genotyping array content. These are:
- DUPE - This filter is applied if there are multiple rows in the VCF for the same loci and alleles. That is, if there are two or more rows that share the same chromosome, position, ref allele and alternate alleles, all but one of them will have the ‘DUPE’ filter set.
- TRIALLELIC - This filter is applied if there is a site at which there are two alternate alleles and neither of them is the same as the reference allele.
- ZEROED_OUT_ASSAY - This filter is applied if the variant at the site was ‘zeroed out’ in the Illumina cluster file - this is typically done when the calls at the site are intentionally marked as unusable. Genotypes called at sites that are ‘zeroed out’ will always be no-calls.
Genotype (sample level fields)
These fields describe attributes specific to the sample genotyped on the array. The FORMAT specifier in the VCF header describes these fields. They are:
- GT - GENOTYPE. This field describes the genotype. It is a standard field, described in the VCF specification.
- IGC - Illumina GenCall Confidence Score. A measure of the call confidence.
- X - Raw X intensity as scanned from the original genotyping array
- Y - Raw Y intensity as scanned from the original genotyping array
- NORMX - Normalized X intensity
- NORMY - Normalized Y intensity
- R - Normalized R Value (one of the polar coordinates after the transformation of NORMX and NORMY)
- THETA - Normalized Theta value (one of the polar coordinates after the transformation of NORMX and NORMY)
- LRR - Log R Ratio
- BAF - B Allele Frequency
INFO (site level fields)
These fields describe attributes specific to the probe on an array. The INFO specifier in the VCF header describes these fields. They are:
- AC - Allele Count in genotypes, for each ALT allele. A standard field, described in the VCF specification
- AF - Allele Frequency. A standard field, described in the VCF specification
- AN - Allele Number. A standard field, described in the VCF specification
- ALLELE_A - The A Allele, as annotated in the Illumina manifest (a * suffix indicates this is the reference allele)
- ALLELE_B - The B Allele, as annotated in the Illumina manifest (a * suffix indicates this is the reference allele)
- BEADSET_ID - The BeadSet ID. An Illumina identifier. Used for normalization.
- GC_SCORE - The Illumina GenTrain Score. A quality score describing the probe design
- ILLUMINA_BUILD - The Genome Build for the design probe sequence, as annotated in the Illumina manifest
- ILLUMINA_CHR - The chromosome of the design probe sequence, as annotated in the Illumina manifest.
- ILLUMINA_POS - The position of the design probe sequence (on ILLUMINA_CHR), as annotated in the Illumina manifest.
- ILLUMINA_STRAND - The strand for the design probe sequence, as annotated in the Illumina manifest.
- PROBE_A - The allele A probe sequence as annotated in the Illumina manifest.
- PROBE_B - The allele B probe sequence as annotated in the Illumina manifest. Note that this is only present on strand ambiguous SNPs.
- SOURCE - The probe source as annotated in the Illumina manifest.
- refSNP - The dbSNP rsId for this probe
Hail MatrixTable
A Hail MatrixTable (MT) is a set of binary files describing a two-dimensional matrix of entry fields where each entry is indexed by row key (variants) and column key (samples). We provide Hail MatrixTables for both WGS and Array data. Please refer to the published notebooks on how to use Hail MatrixTables.
Array Hail MatrixTable
We have merged the Array VCFs into a dense Hail MatrixTable (MT) (i.e. no additional processing across samples). Each column corresponds to the research ID of the sample and each row corresponds to the variant. Since the single sample array VCFs have identical sites and FILTER values, the FILTER field is populated with the value from a single sample VCF.
In conversion, we have dropped all of the 548 variants from alternate, unlocalized, and unplaced contigs (453 variants from ALT contigs (e.g. chr19_KI270866v1_alt), 82 from random contigs (e.g. chr1_KI270706v1_random), and 13 from chrUn). These variants are still in the compressed array VCFs. Please refer to the published notebooks on how we generated the Hail MatrixTable from the VCFs.
WGS Hail MatrixTable
The WGS MT contains the result of the WGS joint callset. The WGS MT does not split multi-allelic variants, and every “PASS” value in the filter field is written as “.” (i.e. missing) in the Hail MatrixTable. The WGS Hail MatrixTable contains all the information presented in the VCFs. Please refer to the published notebooks on how we generated the Hail MatrixTable from the VCFs and on how to filter WGS variants.
CRAM Files
We provide CRAM files for WGS samples. CRAM files are compressed SAM (sequence alignment map) files and contain read data mapped to the hg38/GRCh38 reference. Once uncompressed, the SAM file format contains records describing the reads, their mapping information, and quality score information. The SAM and CRAM file formats are described in this specification doc. Refer to the Genomics Quality Report for more information on how variant calling was performed on these WGS CRAM files.
When comparing the CRAM file to research IDs in the variant files, look at the CRAM file name for the research ID.
Array PLINK 1.9 Data
We provide binary PLINK 1.9 data (.bed / .bim / .fam) for Array data. The PLINK files are converted from the Hail MatrixTable using the export_plink command in Hail and contain all information in the Hail MatrixTable. The .fam file contains information on the participants, the .bim file contains information on the genetic markers, and the binary .bed file contains individual identifiers and genotypes. Please refer to the published notebooks on how to use the PLINK 1.9 data.
Please note that we will provide pgen files, instead of bed/bim/fam, in future callset releases.
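For readers unfamiliar with the text files in the triplet, the sketch below parses one line each of a .fam file (one row per sample) and a .bim file (one row per variant), following the standard PLINK 1.9 column order. The helper names and the example lines are hypothetical and do not come from the All of Us files.

```python
from collections import namedtuple

# Standard PLINK 1.9 column order for the two text files.
FamRow = namedtuple("FamRow", "family_id sample_id father_id mother_id sex phenotype")
BimRow = namedtuple("BimRow", "chrom variant_id cm_position bp_position allele1 allele2")


def parse_fam_line(line):
    """One .fam row describes one sample (participant)."""
    return FamRow(*line.split())


def parse_bim_line(line):
    """One .bim row describes one variant (genetic marker)."""
    chrom, variant_id, cm, bp, a1, a2 = line.split()
    return BimRow(chrom, variant_id, float(cm), int(bp), a1, a2)


# Made-up example lines for illustration.
sample = parse_fam_line("0 1000055 0 0 0 -9")
variant = parse_bim_line("1\trs0000001\t0\t10001\tA\tG")
```

The binary .bed file stores the genotype matrix itself and is best read with PLINK or Hail rather than parsed by hand.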
IDAT Files
We provide IDAT files for all Array samples. The IDAT file is a binary file containing raw BeadArray data directly from the scanner. There are two files for each sample, corresponding to the red and green intensity values. These values give information about specific nucleotides on the genome. You can read more about the steps to call variants from these IDAT files in the Genomic Quality Report.
For an in-depth description of these files and how to process them, read more about the illuminaio tool.
Relatedness
We report the kinship score of any pair with a score over 0.1.
The kinship score is half of the fraction of the genetic material shared.
- Parent-child or siblings: 0.25
- Identical twins: 0.5
Please see the Hail pc_relate function documentation for more information, including interpretation.
We provide two tables:
1. Pairwise samples with kinship scores above 0.1. We will not provide identity kinship scores (i.e. kinship of a sample with itself).
- Note that a pair will only appear once. In other words, {sample1, sample2, 0.25} is equivalent to {sample2, sample1, 0.25}.
Field name | Type | Key? | Notes |
i.s | string | yes | Sample ID of a sample in the pair |
j.s | string | yes | Sample ID of the other sample in the pair |
kin | float | no | Kinship score (0-0.5) |
Column Explanations
- Field name -- The name of the field. In tsvs, this will appear on the first row of the file.
- Type -- Data type. Arrays are possible.
- Key? -- Whether this field makes up a unique key for the row. Note that all key fields together make a unique key for the row.
- Notes -- Any other relevant information.
2. A list of the samples to flag. This will be the maximal independent set for related samples. This set is the minimal set of samples to prune to remove related samples from the cohort.
Field name | Type | Key? | Notes |
sample_id.s | string | Yes | Research ID of the sample |
Column Explanations
- Field name -- The name of the field. In tsvs, this will appear on the first row of the file.
- Type -- Data type. Arrays are possible.
- Key? -- Whether this field makes up a unique key for the row. Note that all key fields together make a unique key for the row.
- Notes -- Any other relevant information.
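As a rough illustration of how the two relatedness tables fit together, the following Python drops the flagged samples from a cohort and then confirms that no related pair remains. The column names (`i.s`, `j.s`, `kin`, `sample_id.s`) follow the schemas above; the sample IDs, the in-memory TSV strings, and the helper names are made up for the example.

```python
import csv
import io

# Tiny stand-ins for the two tsv files (header in the first row).
PAIRS_TSV = "i.s\tj.s\tkin\n1000001\t1000002\t0.25\n1000003\t1000004\t0.5\n"
FLAGGED_TSV = "sample_id.s\n1000002\n1000004\n"


def read_tsv(text):
    """Read tab-separated text into a list of dict rows."""
    return list(csv.DictReader(io.StringIO(text), delimiter="\t"))


def remove_flagged(cohort, flagged_rows):
    """Drop every sample that appears in the flagged-samples table."""
    flagged = {row["sample_id.s"] for row in flagged_rows}
    return [sample for sample in cohort if sample not in flagged]


def related_pairs_within(samples, pair_rows):
    """Return the related pairs whose two members are both still in the cohort."""
    keep = set(samples)
    return [
        frozenset((row["i.s"], row["j.s"]))  # unordered: each pair is listed once
        for row in pair_rows
        if row["i.s"] in keep and row["j.s"] in keep
    ]


cohort = ["1000001", "1000002", "1000003", "1000004", "1000005"]
pruned = remove_flagged(cohort, read_tsv(FLAGGED_TSV))
remaining = related_pairs_within(pruned, read_tsv(PAIRS_TSV))
```

With these inputs `pruned` keeps three samples and `remaining` is empty, which is the expected outcome when the flagged list covers every related pair.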
Genetic predicted ancestry
We provide categorical genetic ancestry for all participants as a .tsv file. The ancestry categories correspond directly to the categorical ancestry definitions used within gnomAD, the Human Genome Diversity Project, and 1000 Genomes:
African/African American (afr), American Admixed/Latino (amr), East Asian (eas), European (eur), Middle Eastern (mid), South Asian (sas), and Other (oth; not belonging to one of the other ancestries or is a balanced admixture).
The report is a table, saved as a tsv, sorted by research ID.
Field Name | Key? | Type | Nullable? | Example Value | Notes |
research_id | yes | String | No | 1000055 | This comes from sample metadata. |
ancestry_pred | no | String | No | mid | The predicted ancestry for the sample, not including “other.” |
probabilities | no | Array[number] | No | [0.10, 0.99, 0.001, … 0.0] | Confidence of each output class (i.e. computed ancestry). |
pca_features | no | Array[number] | No | [8.1232, 0.01234, 3.1123, …, 0.00132] | The principal components of the projection for the sample. Each value is an array with a length of 16. |
ancestry_pred_other | no | String | No | oth | The predicted ancestry for the sample, including “other.” |
Column Explanations
- Field name -- The name of the field. In tsvs, this will appear on the first row of the file.
- Type -- Data type. Arrays are possible.
- Key? -- Whether this field makes up a unique key for the row. Note that all key fields together make a unique key for the row.
- Notes -- Any other relevant information.
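A small sketch of working with the ancestry table: it tallies samples per predicted ancestry and lists the samples whose label changes to “oth” once the “other” category is allowed. The in-memory TSV is a made-up stand-in that omits the `probabilities` and `pca_features` columns, since the exact text serialization of those array columns is not described here.

```python
import csv
import io
from collections import Counter

# Made-up rows using the column names from the schema above.
ANCESTRY_TSV = (
    "research_id\tancestry_pred\tancestry_pred_other\n"
    "1000055\tmid\toth\n"
    "1000056\teur\teur\n"
    "1000057\teur\teur\n"
)

rows = list(csv.DictReader(io.StringIO(ANCESTRY_TSV), delimiter="\t"))

# Samples per category when "other" is not allowed.
counts = Counter(row["ancestry_pred"] for row in rows)

# Samples that fall into "oth" once that category is allowed.
reassigned_to_other = [
    row["research_id"]
    for row in rows
    if row["ancestry_pred_other"] == "oth" and row["ancestry_pred"] != "oth"
]
```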
Flagged WGS samples
We provide a table listing samples that are flagged as part of the sample outlier QC for the WGS joint callset. This includes the specific residual tests that were failed. The schema is described in the table below. The table will be released as a tsv.
Flagged sample tsv schema
- No fields can have a null value.
- Count fields do not include filtered variants.
- For all of the fail_* fields, a value of true indicates that the sample is an outlier and should be flagged.
Field Name | Type | Key? | Example Value | Notes |
s | int | yes | 1000000 | Research ID |
ancestry_pred | string | no | eur | The predicted ancestry for the sample, not including “other.” |
probabilities | array<float> | no | [0.10, 0.99, 0.001, … 0.0] | Confidence of each output class (i.e. computed ancestry). |
pca_features | array<float> | no | [8.1232, 0.01234, 3.1123, …, 0.00132] | Each will have a length of 16. |
ancestry_pred_other | string | no | oth | The predicted ancestry for the sample, including “other.” |
snp_count | int | no | 3910035 | Number of SNPs called in this sample. |
ins_del_ratio | float | no | 0.98814 | Ratio of insertion to deletion counts. |
del_count | int | no | 427102 | |
ins_count | float | no | 456515 | |
snp_het_homvar_ratio | float | no | 2.1119 | |
indel_het_homvar_ratio | float | no | 2.3994 | |
ti_tv_ratio | float | no | 1.9967 | |
singleton | int | no | 15819 | IMPORTANT: This is not the number of singletons in a sample. This field is a count of the number of variants not appearing in gnomAD 3.1. |
fail_snp_count_residual | boolean | no | true | |
fail_ins_del_ratio_residual | boolean | no | false | |
fail_del_count_residual | boolean | no | true | |
fail_ins_count_residual | boolean | no | false | |
fail_snp_het_homvar_ratio_residual | boolean | no | true | |
fail_indel_het_homvar_ratio_residual | boolean | no | false | |
fail_ti_tv_ratio_residual | boolean | no | true | |
fail_singleton_residual | boolean | no | false | |
qc_metrics_filters | array<string> | no | ["indel_het_homvar_ratio_residual", "snp_count_residual"] | A list of each failed test. These will correspond to all fail_* fields with a value of “true.” |
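The correspondence between the fail_* fields and qc_metrics_filters can be sketched as follows. This is an illustration only: `failed_tests` is a hypothetical helper, and it assumes the boolean columns are serialized in the tsv as the text “true” or “false”.

```python
def failed_tests(row):
    """List the residual tests a sample failed, derived from its fail_* fields."""
    prefix = "fail_"
    return sorted(
        name[len(prefix):]
        for name, value in row.items()
        # Assumption: booleans appear as the text "true" / "false".
        if name.startswith(prefix) and str(value).lower() == "true"
    )


# Made-up row with only a few of the columns, for illustration.
row = {
    "s": "1000000",
    "fail_snp_count_residual": "true",
    "fail_ins_del_ratio_residual": "false",
    "fail_indel_het_homvar_ratio_residual": "true",
}
failed = failed_tests(row)
# failed == ["indel_het_homvar_ratio_residual", "snp_count_residual"]
```

A sample with an empty result here would not be an outlier on any of the residual tests included in the row.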
Genomics Metrics
We provide a table with supplemental genomic QC metrics for each WGS sample. The schema is described in the table below. The table will be released as a tsv.
Genomic metrics tsv schema
- No fields can have a null value.
- No samples will be in the table if they do not pass the QC thresholds.
Field Name | Type | Key? | Example Value | Notes |
research_id | int | yes | 1000000 | Unique identifier for each participant |
sample_source | string | no | Whole Blood | Sample source (blood or saliva) |
sex_at_birth | string | no | Female | Participant provided information for sex at birth |
dragen_sex_ploidy | string | no | XX | Ploidy output from DRAGEN |
mean_coverage | float | no | 107.69 | Mean number of overlapping reads at every targeted base of the genome (threshold ≥30x) |
genome_coverage | float | no | 97.61 | Percent of bases with at least 20x coverage (threshold ≥90% at 20x) |
aou_hdr_coverage | float | no | 100 | Percent of bases in the All of Us Hereditary Disease Risk gene (AoUHDR) with at least 20x coverage (threshold ≥95% at 20x) |
dragen_contamination | float | no | 0.003 | Cross-individual contamination rate from DRAGEN |
aligned_q30_bases | float | no | 174329894399 | Aligned Q30 bases from DRAGEN (threshold ≥8e10) |
verify_bam_id2_contamination | float | no | 0.0000104116 | Cross-individual contamination rate from VerifyBamID2 |
Column Explanations
- Field name -- The name of the field. In tsvs, this will appear on the first row of the file.
- Type -- Data type.
- Key? -- Whether this field makes up a unique key for the row. Note that all key fields together make a unique key for the row.
- Notes -- Any other relevant information.
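The QC thresholds quoted in the Notes column can be expressed as a simple sanity check. The sketch below is illustrative: the dictionary and function names are hypothetical, and only the four metrics with an explicit threshold in the table are covered. Because samples that fail QC are not included in the table, every real row would be expected to return an empty list.

```python
# Minimum values taken from the Notes column of the schema above.
THRESHOLDS = {
    "mean_coverage": 30.0,       # >= 30x
    "genome_coverage": 90.0,     # >= 90% of bases at 20x
    "aou_hdr_coverage": 95.0,    # >= 95% of AoUHDR bases at 20x
    "aligned_q30_bases": 8e10,   # >= 8e10 aligned Q30 bases
}


def failed_thresholds(row):
    """Return the names of the metrics that fall below their threshold."""
    return [name for name, minimum in THRESHOLDS.items() if float(row[name]) < minimum]


# Example values copied from the schema table.
example_row = {
    "research_id": "1000000",
    "mean_coverage": "107.69",
    "genome_coverage": "97.61",
    "aou_hdr_coverage": "100",
    "aligned_q30_bases": "174329894399",
}
problems = failed_thresholds(example_row)  # []
```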
WGS Sites
We provide a table with information about which Genome Center sequenced the samples for each WGS participant.
- Research_id is the same as person_id in CDR
- Acronyms: BI (Broad Institute), BCM (Baylor College of Medicine), UW (Northwest Genomics Center at the University of Washington)
Variant Annotation Table
We provide 102 functional annotations through the Variant Annotation Table. We provide the annotations as compressed tab-separated value text files (“.tsv.gz”) that match each shard of the VCFs. Additionally, we provide a single, merged tsv file (“.tsv.bgz”).
Appendix A: WGS and Array data comparison for Q2 2022
Table A.1 -- Comparison between WGS and Array genomic data
Deliverables | WGS | Array | Notes |
VCFs | Joint called VCFs sharded by genomic region | Single sample VCFs | For arrays, VCFs all have the same variants. All VCFs are sorted and block compressed (.vcf.gz) with a local tabix index (.vcf.gz.tbi). |
Hail Matrix Table | Yes | Yes | |
PLINK files | No | Yes | Generated from the Hail MatrixTable |
AC, AN, and AF in the VCFs | Yes | No | Not meaningful in single sample VCFs (i.e. array VCFs). |
AC, AN, and AF in the Hail MatrixTable | Yes | Yes | For arrays, we have created one Hail MT based on all of the array VCFs. We re-calculate the AC, AN, and AF. |
IDAT files | No | Yes | Two files for each sample, research ID appears in file name |
CRAM files | Yes | No | One file for each sample, research ID appears in file name |
Ancestry | Yes | No | WGS samples are a subset of array samples, so the intersecting array samples will have ancestry information. |
Relatedness | Yes | No | WGS samples are a subset of array samples, so the intersecting array samples will have relatedness information. |
Joint callset QC | Yes | No | Please see the Joint Callset section of the All of Us Beta Release Genomic Quality Report |
Functional Annotations | Yes | No | |
Appendix B: WGS VCF shard interval extent file format
We include a single tsv file that details the genomic extent of each VCF shard interval file. This allows researchers to prune the considered intervals up front in their analyses. The format of the file is given in Table B.1:
Table B.1 -- Interval list extent tsv file
- No fields can have a null value.
Field Name | Type | Key? | Example Value | Notes |
filename | string | yes | 0000000000-scattered.interval_list | The filename of the corresponding interval list file. Note that the shard number is in the filename. |
start_contig | string | no | chr1 | This will always be the same as the end_contig, since the interval lists do not span multiple contigs. |
start_position | int | no | 10001 | Position on the contig for the start of the interval list file. |
end_contig | string | no | chr1 | This will always be the same as the start_contig, since the interval lists do not span multiple contigs. |
end_position | int | no | 124997 | This will always be greater than the start_position. Position on the contig for the end of the last interval in the interval list file. |
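Pruning shards up front, as described above, amounts to an interval-overlap test against the extent file. The sketch below is a minimal illustration: `shards_overlapping` is a hypothetical helper, positions are assumed to be 1-based and inclusive, the first row reuses the example values from Table B.1, and the second row is invented.

```python
def shards_overlapping(extent_rows, contig, start, end):
    """Return the interval-list filenames whose extent overlaps contig:start-end."""
    hits = []
    for row in extent_rows:
        if row["start_contig"] != contig:
            continue  # a shard never spans multiple contigs
        # Assumption: positions are 1-based and inclusive on both ends.
        if int(row["start_position"]) <= end and int(row["end_position"]) >= start:
            hits.append(row["filename"])
    return hits


extent_rows = [
    {"filename": "0000000000-scattered.interval_list", "start_contig": "chr1",
     "start_position": "10001", "end_contig": "chr1", "end_position": "124997"},
    # Hypothetical second shard, for illustration only.
    {"filename": "0000000001-scattered.interval_list", "start_contig": "chr1",
     "start_position": "124998", "end_contig": "chr1", "end_position": "300000"},
]

both = shards_overlapping(extent_rows, "chr1", 120000, 130000)
second_only = shards_overlapping(extent_rows, "chr1", 200000, 210000)
```

Since the shard number is part of each filename, the returned names can be mapped back to the corresponding VCF shards.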
Appendix C: Frequently Asked Questions (FAQs) Regarding the Genomic Data Organization
- Where can I find the research ID in the CRAM and IDAT files?
The research ID is in the file names of the CRAM and IDAT files. To correlate research IDs between the variant files and the raw data files, use the research IDs in the file name of the raw data files (CRAM and IDATs).