Variant Annotation Table

Functional annotations for passing variants in the short-read whole genome sequencing (srWGS) SNP and Indel dataset are available in the Variant Annotation Table (VAT). Sites with 50 or more alternate alleles are not included in the VAT. The VAT includes annotations such as the gene symbol and protein change, and is delivered as a block-compressed tab-separated value text file (.tsv.bgz), which can be loaded into Hail. Each row represents a variant-transcript combination, and there is only one record per variant-transcript combination. However, a variant can overlap multiple transcripts and thus have multiple records, each representing a different variant-transcript combination.
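For a quick look at the row structure without Hail, the sketch below reads a VAT-style file with the Python standard library and groups its records by vid. This is an illustrative alternative, not the workflow described above: it relies on the fact that block-compressed (BGZF) files are a series of gzip members, which Python's gzip module can read. The function name and the file path are hypothetical, and because this document does not state how null values are encoded in the file, every value is left as a raw string.

```python
import csv
import gzip
from collections import defaultdict


def group_vat_rows_by_vid(path):
    """Read a VAT-style .tsv.bgz file and group its rows by variant ID (vid)."""
    rows_by_vid = defaultdict(list)
    # BGZF is a series of concatenated gzip members, which gzip.open can read.
    with gzip.open(path, "rt", newline="") as handle:
        reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
        for row in reader:
            # One row per variant-transcript combination, so a vid can repeat.
            rows_by_vid[row["vid"]].append(row)
    return dict(rows_by_vid)
```

Calling group_vat_rows_by_vid("vat.tsv.bgz") (a placeholder path) returns a dictionary in which a variant overlapping several transcripts maps to several rows. This loads every row into memory, so it only suits small extracts of the VAT.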

The variants are called against the hg38/GRCh38 reference; a detailed description of the variant calling analysis is in the Genomic Research Data Quality Report (All of Us Genomic QC Report). We generate most of the functional annotations using NIRVANA 3.18, a functional annotation tool from Illumina that annotates genomic variants based on Sequence Ontology consequences and uses external data sources for additional context (e.g., gnomAD, SpliceAI, ClinVar). The remaining annotations are the All of Us population metrics (fields: gvs_*), which we generate internally. Please note that all of the All of Us population annotations exclude filtered genotypes (those whose FT tag is populated with a value other than missing or “PASS”). Therefore, allele counts (fields: gvs_*_ac) and allele numbers (fields: gvs_*_an) of zero are possible. Table 1 details the fields in the VAT.

Variants do not correspond 1:1 to the genomic sites represented in a Variant Call Format (VCF) file, since a single row in a VCF (a “site”) can be multi-allelic. In other words, one site in a VCF can have multiple variants. A code snippet is provided in the featured notebook 01_Get Started with Genomic Data. For the exact locations of these files, please see the Controlled CDR Directory.

Table 1 -- VAT Schema

Field name	Type	Key?	Example value	Nullable?	transcript-specific?	Notes
vid	string	yes	1-3414320-G-A	No	No	Variant ID. Unique string for identifying a variant (as produced by NIRVANA based on a spec from Broad Institute). Note that a variant cannot have multiple alternate alleles -- only one. The vid is <contig>-<position>-<ref_allele>-<alt_allele>.
transcript	string	yes	ENST00000372090.5	Yes	Yes	Transcript ID. Null indicates that this variant does not overlap any transcripts (i.e., the variant is in an intergenic region (IGR)). We include Ensembl transcripts only.
contig	string	no	chr1	No	No	Contig names match the hg38 reference.
position	integer	no	3414320	No	No	Must be positive; exact position for a SNP and the position before the alteration in an indel.
ref_allele	string	no	G	No	No	base(s). This should always be one base for SNPs and insertions. More than one base for deletions.
alt_allele	string	no	A	No	No	base(s). This should always be one base for SNPs and deletions. More than one base for insertions.
gvs_all_ac	integer	no	2	No	No	Alternate allele count across all available samples in the WGS joint callset.
gvs_all_an	integer	no	4	No	No	Allele number across all available samples in the WGS joint callset.
gvs_all_af	float	no	0.5	No	No	Alternate allele frequency (AC/AN) across all available samples in the WGS joint callset.
gvs_all_sc	integer	no	4	No	No	Sample count of heterozygous plus homozygous alternate genotypes. For rules on calculating this field, please see Appendix B.
gene_symbol	string	no	TESK2	Yes	Yes	Gene symbol. A variant can have more than one associated gene symbol, since about 3% of genes overlap. Note that transcript to gene is still one-to-one, so this field is a single gene symbol. See FAQ #1 for more info on the relationship between transcripts, genes, and variants. A null value indicates that the variant is in an IGR, i.e., the variant has no associated gene. This should only happen when transcript is null.
transcript_source	string	no	Ensembl	Yes	Yes
aa_change	string	no	ENSP00000426975.1:p.(Ser534Pro)	Yes	Yes	HGVS p. nomenclature; Amino acid change.
consequence	array[string]	no	['splice_region_variant', 'intron_variant', 'non_coding_transcript_variant']	Yes	Yes	Amino acid change type.
dna_change_in_transcript	string	no	ENST00000352527.5: c.77+1714C>T	Yes	Yes	HGVS c. nomenclature; DNA change in transcript space.
variant_type	string	no	SNV	No	No	DNA change type (HGVS).
exon_number	string	no	1/4	Yes	Yes	Exon number
intron_number	string	no	3/9	Yes	Yes	Intron number
genomic_location	string	no	NC_000001.11: g.3128801A>G	No	No	HGVS g. nomenclature; variant location.
dbsnp_rsid	array	no	rs7549050	Yes	No	rsID
gene_id	string	no	ENSG00000228888	Yes	Yes	Gene ID for the transcript
gene_omim_id	integer	no	610972	Yes	No	OMIM ID
is_canonical_transcript	boolean	no	True	Yes	Yes	Primary Transcript ID (see canonical description in NIRVANA)
gnomad_all_af	float	no	0.25608	Yes	No	gnomAD: "Total" frequency
gnomad_all_ac	integer	no	8023	Yes	No	gnomAD: "Total" allele count
gnomad_all_an	integer	no	31330	Yes	No	gnomAD: "Total" allele number
gnomad_max_af	float	no	0.30391	Yes	No	gnomAD: Max subpopulation frequency. gnomAD subpopulations can be found in Table 2. To calculate the max AF, see Appendix A.
gnomad_max_ac	integer	no	2644	Yes	No	gnomAD: Max subpopulation allele count. gnomAD subpopulations can be found in Table 2. To calculate the max AC, see Appendix A.
gnomad_max_an	integer	no	8700	Yes	No	gnomAD: Max subpopulation allele number. gnomAD subpopulations can be found in Table 2. To calculate the max AN, see Appendix A.
gnomad_max_subpop	string	no	afr	Yes	No	gnomAD: Max subpopulation ancestry. gnomAD subpopulations can be found in Table 2. To calculate the max subpopulation, see Appendix A.
gnomad_<subpop>_ac	integer	no	8028	Yes	No	This is actually multiple fields -- one for each of the eight gnomAD subpopulations. gnomAD: allele count in the <subpop>. gnomAD subpopulations can be found in Table 2.
gnomad_<subpop>_an	integer	no	31330	Yes	No	This is actually multiple fields -- one for each of the eight gnomAD subpopulations. gnomAD: allele number in the <subpop>. gnomAD subpopulations can be found in Table 2.
gnomad_<subpop>_af	float	no	0.30391	Yes	No	This is actually multiple fields -- one for each of the eight gnomAD subpopulations. gnomAD: allele frequency in the <subpop>. gnomAD subpopulations can be found in Table 2.
gvs_max_af	float	no	0.500	No	No	All of Us: Max subpopulation frequency. All of Us subpopulations can be found in Table 2. To calculate the max AF, see Appendix A.
gvs_max_ac	integer	no	100	No	No	All of Us: Max subpopulation allele count. All of Us subpopulations can be found in Table 2. To calculate the max AC, see Appendix A.
gvs_max_an	integer	no	200	No	No	All of Us: Max subpopulation allele number. All of Us subpopulations can be found in Table 2. To calculate the max AN, see Appendix A.
gvs_max_sc	integer	no	200	No	No	Max subpopulation sample count. All of Us subpopulations can be found in Table 2. The max is calculated in the same way as max AN, max AC, etc. See Appendix A.
gvs_max_subpop	string	no	sas	No	No	All of Us: Max subpopulation ancestry. All of Us subpopulations can be found in Table 2. To calculate the max subpopulation, see Appendix A.
gvs_<subpop>_ac	integer	no	100	No	No	This is actually multiple fields -- one for each ancestry determination. All of Us subpopulations can be found in Table 2. For example, one field would be gvs_afr_ac. The subpop AC values for a site will need to be split into values for each alternate allele.
gvs_<subpop>_an	integer	no	1000	No	No	This is actually multiple fields -- one for each ancestry determination. All of Us subpopulations can be found in Table 2. For example, one field would be gvs_afr_an. Allele number for samples, in the <subpop>, in this instance of GVS.
gvs_<subpop>_af	float	no	0.100	No	No	This is actually multiple fields -- one for each ancestry determination. All of Us subpopulations can be found in Table 2. For example, one field would be gvs_afr_af. Alternate allele frequency (AC/AN) across samples, in the <subpop>, in this instance of GVS.
gvs_<subpop>_sc	integer	no	243	No	No	This is actually multiple fields -- one for each ancestry determination. All of Us subpopulations can be found in Table 2. For example, one field would be gvs_afr_sc. Sample count of heterozygous plus homozygous alternate genotypes in the subpopulation. For rules on calculating this field, please see Appendix B.
revel	float	no	0.135	Yes	No	REVEL
splice_ai_acceptor_gain_score	float	no	0.3	Yes	No	Splice AI. All spliceAI values from the variant will be mapped to the proper transcript based on the gene_symbol/hgnc. This will cause some redundant info.
splice_ai_acceptor_gain_distance	integer	no	-8	Yes	No	Splice AI
splice_ai_acceptor_loss_score	float	no	0.5	Yes	No	Splice AI
splice_ai_acceptor_loss_distance	integer	no	0	Yes	No	Splice AI
splice_ai_donor_gain_score	float	no	0.9	Yes	No	Splice AI
splice_ai_donor_gain_distance	integer	no	12	Yes	No	Splice AI
splice_ai_donor_loss_score	float	no	0.1	Yes	No	Splice AI
splice_ai_donor_loss_distance	integer	no	-8	Yes	No	Splice AI
omim_phenotypes_id	array[integer]	no	[616781]	Yes	No	OMIM Disease ID. Must be ordered corresponding to omim_phenotypes_name.
omim_phenotypes_name	array[string]	no	[“Joubert syndrome 25”]	Yes	No	OMIM Disease Name. Must be ordered corresponding to omim_phenotypes_id.
clinvar_classification	array[string]	no	benign	Yes	No	ClinVar classification. Note that significance is an array, so we union all values, ordered by: "benign", "likely benign", "uncertain significance", "likely pathogenic", "pathogenic", "drug response", "association", "risk factor", "protective", "affects", "conflicting data from submitters", "other", "not provided", "'-'"
clinvar_last_updated	date	no	2020-08-27	Yes	No	ClinVar Classification Date
clinvar_phenotype	array[string]	no	[“Nephronophthisis 4”, “Nephronophthisis”]	Yes	No	ClinVar Disease Name

Column Explanations

Field name -- The name of the field in the database
Type -- Data type. Arrays are possible.
Key? -- Whether this field makes up a unique key for the row. Note that all key fields together make a unique key for the row.
Nullable? -- Whether this field is allowed to take a null value.
transcript-specific? -- Whether this field is unique to the transcript. In other words, whether this field can change value within a single variant. For example, a single variant can have different consequences depending on the transcript, but the gnomad population data will not change based on the selected transcript for a variant.
Notes -- Any other relevant information.

Frequently Asked Questions about the VAT

Why is a variant associated with multiple transcripts?

This is a question about biology and sequencing. A gene codes for a protein through transcription and translation (i.e., DNA is transcribed into mRNA, and the mRNA is then translated into a protein). In the wild, we observe that each gene actually has multiple mRNA isoforms (and therefore proteins). Scientists have mapped each mRNA isoform back to the genome; these mapped isoforms are the “transcripts” discussed in this document. This means that a variant can have different effects on the coded protein depending on the transcript, and we have to determine the effect on each transcript separately. Note that some genes overlap each other, so overlapping transcripts can have multiple gene names as well.

The two most common data sources for transcripts are Ensembl and RefSeq. These data sources also have rules for canonical transcripts, which provide guidance for choosing an exemplar transcript when determining the most likely effect of a variant.

Why do we suggest downstream applications use Ensembl transcripts (over RefSeq)?

RefSeq transcripts are a subset of Ensembl. This makes Ensembl more appropriate for exploratory research.

https://bioinformatics.stackexchange.com/questions/21/feature-annotation-refseq-vs-ensembl-vs-gencode-whats-the-difference

How does a variant relate to a row in a Variant Call Format (VCF) file?

Each row in a VCF is a single genomic location (site). A VCF can have multiple variants at a site. A variant is the site, plus a reference allele and a single alternate allele. This definition of a variant keeps the VAT table design simple and ensures row independence. The transcript is required, since certain annotations, such as gene or protein impact, can change depending on the transcript used for annotating a variant. For more information about the relationship between variants and transcripts, please see FAQ #1.
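The relationship between a site and its variants can be sketched as follows, using the vid layout from Table 1 (<contig>-<position>-<ref_allele>-<alt_allele>). The function names are hypothetical. The sketch passes the contig through unchanged and does not trim or normalize alleles, which is an assumption on our part rather than a description of how the VAT is built.

```python
def site_to_vids(contig, position, ref_allele, alt_alleles):
    """Return one vid per alternate allele at a (possibly multi-allelic) VCF site."""
    return [f"{contig}-{position}-{ref_allele}-{alt}" for alt in alt_alleles]


def parse_vid(vid):
    """Split a vid into (contig, position, ref_allele, alt_allele)."""
    # Split from the right so that a contig name containing "-" stays intact.
    contig, position, ref_allele, alt_allele = vid.rsplit("-", 3)
    return contig, int(position), ref_allele, alt_allele


# The multi-allelic site used in Appendix B: chr1, position 100, REF C, ALT T,CT.
vids = site_to_vids("chr1", 100, "C", ["T", "CT"])
# vids == ["chr1-100-C-T", "chr1-100-C-CT"]
```

Each vid then appears in the VAT once per overlapping transcript, which is why (vid, transcript) together form the row key.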

Why are so many fields null for some variants?

When a variant does not overlap a gene (and therefore does not overlap any transcripts), any annotation regarding the protein coding of the gene (e.g. aa_change) will be null, since there is no applicable value for the field without a transcript overlap. Note that we annotate variants as upstream or downstream of a transcript, which means that overlap with a transcript extends beyond the actual boundaries of the transcript (by up to 5,000 bases).

Can information in the VAT be shared publicly?

No. Variant counts in the VAT can be under 20, and sharing such counts publicly is against policy. Please see the All of Us Statistics Dissemination Policy.

In the VAT, what computed ancestries are included in gnomAD (gnomad_*) and All of Us allele metric annotations (gvs_*)?

Table 2 lists the computed ancestries with allele metric annotations in the VAT.

Table 2 -- Computed ancestries available for allele metric annotations in the VAT.

Computed Ancestry (abbreviation)	gnomAD annotations	All of Us annotations	Notes
African (afr)	Yes	Yes
Latino/Admixed American (amr)	Yes	Yes
Ashkenazi Jew (asj)	Yes	No
East Asian (eas)	Yes	Yes
European (eur)	No (see note)	Yes	Includes both Finnish (fin) and Non-Finnish European (nfe)
Finnish (fin)	Yes	No
Middle-Eastern (mid)	No	Yes
Non-Finnish European (nfr)	No	No	This subpopulation is named nfr in the VAT but corresponds to the nfe subpopulation in gnomAD. Please note that the nfr annotations are missing in the VAT.
South Asian (sas)	Yes	Yes
Other (oth)	Yes	Yes

Appendix A: Calculating _max_ fields in the VAT

The VAT schema specifies several fields with “_max_” in the name. Each of these holds the value for the subpopulation with the maximum AF for the variant, in either the gnomAD or the GVS data.

For example, gnomad_max_ac is “in gnomAD, the AC for the subpopulation with the highest AF”.

Likewise, gvs_max_an is “in GVS, the AN for the subpopulation with the highest AF”.

Example to populate gnomad_max_ac and gnomad_max_subpop

gnomAD has eight ancestry subpopulations, so the VAT already contains the following AC fields:

gnomad_afr_ac
gnomad_amr_ac
gnomad_eas_ac
gnomad_fin_ac
gnomad_nfe_ac
gnomad_asj_ac
gnomad_oth_ac
gnomad_sas_ac

And AF fields:

gnomad_afr_af
gnomad_amr_af
gnomad_eas_af
gnomad_fin_af
gnomad_nfe_af
gnomad_asj_af
gnomad_oth_af
gnomad_sas_af

Then let’s say that we have the following values for each AF field:

gnomad_afr_af: 0.75
gnomad_amr_af: 0.1
gnomad_eas_af: 0.1
gnomad_fin_af: 0.1
gnomad_nfe_af: 0.1
gnomad_asj_af: 0.1
gnomad_oth_af: 0.1
gnomad_sas_af: 0.1

Since gnomad_afr_af is the highest, gnomad_max_subpop is “afr” and gnomad_max_ac is the value in gnomad_afr_ac. Ties are broken by alphabetical order of subpopulation (i.e., “afr” would come before “oth”).
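The selection rule can be sketched in a few lines of Python. The function names are hypothetical, the AF values are the ones from the example above, and skipping null (None) values is our assumption, since this appendix does not say how nulls are handled.

```python
def pick_max_subpop(af_by_subpop):
    """Return the subpopulation with the highest AF, breaking ties alphabetically."""
    # Assumption: subpopulations with a null AF are ignored.
    candidates = {name: af for name, af in af_by_subpop.items() if af is not None}
    if not candidates:
        return None
    # Highest AF first; for equal AFs, the alphabetically first name wins.
    return min(candidates, key=lambda name: (-candidates[name], name))


def max_fields(ac_by_subpop, an_by_subpop, af_by_subpop):
    """Build the *_max_* values from per-subpopulation AC, AN, and AF."""
    subpop = pick_max_subpop(af_by_subpop)
    if subpop is None:
        return None
    return {
        "max_subpop": subpop,
        "max_af": af_by_subpop[subpop],
        "max_ac": ac_by_subpop[subpop],
        "max_an": an_by_subpop[subpop],
    }


example_af = {"afr": 0.75, "amr": 0.1, "eas": 0.1, "fin": 0.1,
              "nfe": 0.1, "asj": 0.1, "oth": 0.1, "sas": 0.1}
max_subpop = pick_max_subpop(example_af)  # "afr"
```

Note that max_ac and max_an are read from the subpopulation with the highest AF; they are not the largest AC or AN across subpopulations.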

Appendix B: Calculating sample_count (and n_hets, n_hom_alts)

The sample_count is defined as the number of samples having the alternate allele in the variant. This can be one copy (“heterozygous (het)”) or two copies (“homozygous alternate (homalt)”). We cannot calculate the sample_count from AC and AN, since we do not know how many samples are het or homalt. In this algorithm, we assume a ploidy of two. The sample_count is allele specific. We also calculate n_hets and n_hom_alts, but we currently only report sample_count (Table 1).

Please see the example below from a VCF:

CHROM	POS	REF	ALT	SAMPLE1	SAMPLE2	SAMPLE3	SAMPLE4	SAMPLE5	SAMPLE6
chr1	100	C	T,CT	0/0	0/1	0/2	1/1	1/2	./.

We expect to have two variants:

vid	sample_count	n_het	n_homalt
chr1-100-C-T	3 (SAMPLE 2,4,5)	2 (SAMPLE 2,5)	1 (SAMPLE 4)
chr1-100-C-CT	2 (SAMPLE 3,5)	2 (SAMPLE 3,5)	0

sample_count = the number of genotypes that include the alternate allele.

n_het = the number of genotypes with a het call including the alternate allele.

n_homalt = the number of genotypes with a homalt call for the alternate allele. “0/0” is never a homalt call; it is homozygous reference (“homref”).

Other rules:

If a genotype is filtered (FT field exists and is neither “.” nor “PASS”), then the genotype should not count towards sample_count, n_het, or n_homalt.
If a genotype is a no_call (“./.” or missing), then the genotype should not count towards sample_count, n_het, or n_homalt.
If a genotype is a partial call (e.g. “./1”), then the genotype should count towards sample_count, but not towards n_het or n_homalt.
AC = 2(n_homalt) + n_het + partial call
AN = 2(n_het + n_homalt + n_homref) + partial call
SC = AC_Homalt/2 + AC_Het + AC_Hemi
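The counting rules above can be sketched in Python and checked against the example VCF row. This is a minimal sketch that assumes a ploidy of two, as this appendix does. The function name, the argument layout, and the “LowQual” filter value in the usage note are hypothetical.

```python
def count_samples(genotypes, allele_index, filters=None):
    """Count sample_count, n_het, and n_homalt for one alternate allele.

    genotypes: GT strings such as "0/1", one per sample.
    allele_index: 1-based index of the alternate allele in the ALT column.
    filters: optional FT values aligned with genotypes (None, ".", or "PASS" pass).
    """
    target = str(allele_index)
    sample_count = n_het = n_homalt = 0
    for i, gt in enumerate(genotypes):
        ft = filters[i] if filters else None
        if ft not in (None, ".", "PASS"):
            continue  # filtered genotype: counts towards nothing
        alleles = gt.replace("|", "/").split("/")
        if target not in alleles:
            continue  # no-call, homref, or a different alternate allele
        sample_count += 1
        if "." in alleles:
            continue  # partial call: counts towards sample_count only
        if all(allele == target for allele in alleles):
            n_homalt += 1
        else:
            n_het += 1
    return {"sample_count": sample_count, "n_het": n_het, "n_homalt": n_homalt}


# SAMPLE1..SAMPLE6 from the example row (chr1, 100, C, ALT T,CT).
gts = ["0/0", "0/1", "0/2", "1/1", "1/2", "./."]
counts_t = count_samples(gts, 1)   # chr1-100-C-T
counts_ct = count_samples(gts, 2)  # chr1-100-C-CT
```

Here counts_t is {"sample_count": 3, "n_het": 2, "n_homalt": 1} and counts_ct is {"sample_count": 2, "n_het": 2, "n_homalt": 0}, matching the table above. Passing filters=[None, "LowQual", None, None, None, None] would drop SAMPLE2 from the counts for chr1-100-C-T.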
