Variant Annotation Table

The All of Us genomic dataset includes accompanying metadata about functional annotations of the variants, such as the gene symbol and protein change.  This information is stored in the Variant Annotation Table (VAT), where each row represents a variant-transcript combination.  Note that each variant can overlap multiple transcripts and will then have multiple records associated with it, but only one record per variant-transcript combination.  Please note that all variants are called against the hg38/GRCh38 reference.

A variant does not correspond 1:1 to the genomic sites represented in a Variant Call Format (VCF) file, since a single row in a VCF (“site”) can be multi-allelic (read: “one site can have multiple variants in a VCF”).  The VAT is delivered as a block-compressed tab-separated value text file (.tsv.bgz), which can be loaded into Hail.  We also provide a compressed, sharded TSV corresponding to the shards in the WGS joint callset.  A code snippet is provided in the featured notebook 01_Get Started with Genomic Data.  For the exact locations of these files, please see the Controlled CDR Directory.
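For a quick look at the file outside Hail, the sketch below reads a few rows using only the Python standard library.  It relies on the fact that block-compressed (BGZF) files are a series of concatenated gzip members, which Python's gzip module reads transparently.  The function name read_vat_rows and any path passed to it are illustrative assumptions, not part of the All of Us tooling; the real file locations are listed in the Controlled CDR Directory.

```python
import csv
import gzip


def read_vat_rows(path, limit=None):
    """Yield VAT rows as dicts keyed by the column names in the header line.

    `path` is a placeholder for a local copy of the .tsv.bgz file.
    `limit` optionally stops after that many rows, which is useful for peeking.
    """
    # gzip.open handles multi-member gzip streams, which is what BGZF is.
    with gzip.open(path, mode="rt", newline="") as handle:
        reader = csv.DictReader(handle, delimiter="\t")
        for i, row in enumerate(reader):
            if limit is not None and i >= limit:
                break
            yield row
```

For example, list(read_vat_rows(path, limit=5)) would return the first five records as dictionaries.  All values come back as strings, so numeric fields such as gvs_all_ac need an explicit conversion.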

We generate most of the functional annotations using NIRVANA 3.14, a functional annotation tool from Illumina that provides annotations of genomic variants based on the Sequence Ontology consequences and external data sources for additional context (e.g. gnomAD, ClinGen, 1000 Genomes Project).  The remaining annotations are the All of Us population metrics (fields: gvs_*), which we generate internally.  Please note that all of the All of Us population annotations exclude filtered genotypes (FT tag populated with a value other than missing or “PASS”).  Therefore, allele counts (fields: gvs_*_ac) and allele numbers (fields: gvs_*_an) of zero are possible.  Table 1 details the fields in the VAT.

 

Table 1 -- VAT Schema

Field name | Type | Key? | Example value | Nullable? | transcript-specific? | Notes
--- | --- | --- | --- | --- | --- | ---
vid | string | Yes | 1-3414320-G-A | No | No | Variant ID.  Unique string for identifying a variant (as produced by NIRVANA based on a spec from the Broad Institute).  Note that a variant cannot have multiple alternate alleles, only one.  The vid is <contig>-<position>-<ref_allele>-<alt_allele>.
transcript | string | Yes | ENST00000372090.5 | Yes | Yes | Transcript ID.  Null indicates that this variant does not overlap any transcripts (i.e. the variant is in an intergenic region (IGR)).  We include Ensembl transcripts only.
contig | string | No | chr1 | No | No | Contig names match the hg38 reference.
position | integer | No | 3414320 | No | No | Must be positive; exact position for a SNP and the position before the alteration in an indel.
ref_allele | string | No | G | No | No | Reference base(s).  This should always be one base for SNPs and insertions, and more than one base for deletions.
alt_allele | string | No | A | No | No | Alternate base(s).  This should always be one base for SNPs and deletions, and more than one base for insertions.
gvs_all_ac | integer | No | 2 | No | No | Alternate allele count across all available samples in the WGS joint callset.
gvs_all_an | integer | No | 4 | No | No | Allele number across all available samples in the WGS joint callset.
gvs_all_af | float | No | 0.5 | No | No | Alternate allele frequency (AC/AN) across all available samples in the WGS joint callset.
gvs_all_sc | integer | No | 4 | No | No | Sample count of heterozygous plus homozygous alternate genotypes.  For rules on calculating this field, please see Appendix B.
gene_symbol | string | No | TESK2 | Yes | Yes | Gene symbol.  A variant can have more than one associated gene symbol, since about 3% of genes overlap.  Note that transcript to gene is still one-to-one, so this field is a single gene symbol.  See FAQ #1 for more info on the relationship between transcripts, genes, and variants.  A null value indicates that the variant is in an IGR, i.e. the variant has no associated gene.  This should only happen when transcript is null.
transcript_source | string | No | Ensembl | Yes | Yes |
aa_change | string | No | ENSP00000426975.1:p.(Ser534Pro) | Yes | Yes | HGVS p. nomenclature; amino acid change.
consequence | array[string] | No | ['splice_region_variant', 'intron_variant', 'non_coding_transcript_variant'] | Yes | Yes | Amino acid change type (Sequence Ontology consequence terms).
dna_change_in_transcript | string | No | ENST00000352527.5:c.77+1714C>T | Yes | Yes | HGVS c. nomenclature; DNA change in transcript space.
variant_type | string | No | SNV | No | No | DNA change type (HGVS).
exon_number | string | No | 1/4 | Yes | Yes | Exon number.
intron_number | string | No | 3/9 | Yes | Yes | Intron number.
genomic_location | string | No | NC_000001.11:g.3128801A>G | No | No | HGVS g. nomenclature; variant location.
dbsnp_rsid | array | No | rs7549050 | Yes | No | rsID.
gene_id | string | No | ENSG00000228888 | Yes | Yes | Gene ID for the transcript.
gene_omim_id | integer | No | 610972 | Yes | No | OMIM ID.
is_canonical_transcript | boolean | No | True | Yes | Yes | Primary Transcript ID (see canonical description in NIRVANA).
gnomad_all_af | float | No | 0.25608 | Yes | No | gnomAD: "Total" frequency.
gnomad_all_ac | integer | No | 8023 | Yes | No | gnomAD: "Total" allele count.
gnomad_all_an | integer | No | 31330 | Yes | No | gnomAD: "Total" allele number.
gnomad_max_af | float | No | 0.30391 | Yes | No | gnomAD: Max subpopulation frequency.  gnomAD subpopulations can be found in Table 2.  To calculate the max AF, see Appendix A.
gnomad_max_ac | integer | No | 2644 | Yes | No | gnomAD: Max subpopulation allele count.  gnomAD subpopulations can be found in Table 2.  To calculate the max AC, see Appendix A.
gnomad_max_an | integer | No | 8700 | Yes | No | gnomAD: Max subpopulation allele number.  gnomAD subpopulations can be found in Table 2.  To calculate the max AN, see Appendix A.
gnomad_max_subpop | string | No | afr | Yes | No | gnomAD: Max subpopulation ethnicity.  gnomAD subpopulations can be found in Table 2.  To calculate the max subpopulation, see Appendix A.
gnomad_<subpop>_ac | integer | No | 8028 | Yes | No | This is actually multiple fields, one for each of the eight gnomAD subpopulations.  gnomAD subpopulations can be found in Table 2.
gnomad_<subpop>_an | integer | No | 31330 | Yes | No | This is actually multiple fields, one for each of the eight gnomAD subpopulations.  gnomAD: allele number in the <subpop>.  gnomAD subpopulations can be found in Table 2.
gnomad_<subpop>_af | float | No | 0.30391 | Yes | No | This is actually multiple fields, one for each of the eight gnomAD subpopulations (see above).  gnomAD: allele frequency in the <subpop>.  gnomAD subpopulations can be found in Table 2.
gvs_max_af | float | No | 0.500 | No | No | AoU: Max subpopulation frequency.  AoU subpopulations can be found in Table 2.  To calculate the max AF, see Appendix A.
gvs_max_ac | integer | No | 100 | No | No | AoU: Max subpopulation allele count.  AoU subpopulations can be found in Table 2.  To calculate the max AC, see Appendix A.
gvs_max_an | integer | No | 200 | No | No | AoU: Max subpopulation allele number.  AoU subpopulations can be found in Table 2.  To calculate the max AN, see Appendix A.
gvs_max_sc | integer | No | 200 | No | No | Max subpopulation sample count.  AoU subpopulations can be found in Table 2.  The max is calculated in the same way as max AN, max AC, etc.  See Appendix A.
gvs_max_subpop | string | No | sas | No | No | AoU: Max subpopulation ancestry.  AoU subpopulations can be found in Table 2.  To calculate the max subpopulation, see Appendix A.
gvs_<subpop>_ac | integer | No | 100 | No | No | This is actually multiple fields, one for each ancestry determination.  AoU subpopulations can be found in Table 2.  For example, one field would be gvs_afr_ac.  The subpop AC values for a site will need to be split into values for each alternate allele.
gvs_<subpop>_an | integer | No | 1000 | No | No | This is actually multiple fields, one for each ancestry determination.  AoU subpopulations can be found in Table 2.  For example, one field would be gvs_afr_an.  Allele number for samples, in the <subpop>, in this instance of GVS.
gvs_<subpop>_af | float | No | 0.100 | No | No | This is actually multiple fields, one for each ancestry determination.  AoU subpopulations can be found in Table 2.  For example, one field would be gvs_afr_af.  Alternate allele frequency (AC/AN) across samples, in the <subpop>, in this instance of GVS.
gvs_<subpop>_sc | integer | No | 243 | No | No | This is actually multiple fields, one for each ancestry determination.  AoU subpopulations can be found in Table 2.  For example, one field would be gvs_afr_sc.  Sample count of heterozygous plus homozygous alternate genotypes in the subpopulation.  For rules on calculating this field, please see Appendix B.
revel | float | No | 0.135 | Yes | No | REVEL.
splice_ai_acceptor_gain_score | float | No | 0.3 | Yes | No | Splice AI.  All spliceAI values from the variant will be mapped to the proper transcript based on the gene_symbol/hgnc.  This will cause some redundant info.
splice_ai_acceptor_gain_distance | integer | No | -8 | Yes | No | Splice AI.
splice_ai_acceptor_loss_score | float | No | 0.5 | Yes | No | Splice AI.
splice_ai_acceptor_loss_distance | integer | No | 0 | Yes | No | Splice AI.
splice_ai_donor_gain_score | float | No | 0.9 | Yes | No | Splice AI.
splice_ai_donor_gain_distance | integer | No | 12 | Yes | No | Splice AI.
splice_ai_donor_loss_score | float | No | 0.1 | Yes | No | Splice AI.
splice_ai_donor_loss_distance | integer | No | -8 | Yes | No | Splice AI.
omim_phenotypes_id | array[integer] | No | [616781] | Yes | No | OMIM Disease ID.  Must be ordered corresponding to omim_phenotypes_name.
omim_phenotypes_name | array[string] | No | [“Joubert syndrome 25”] | Yes | No | OMIM Disease Name.  Must be ordered corresponding to omim_phenotypes_id.
clinvar_classification | array[string] | No | benign | Yes | No | ClinVar Classification.  Note that significance is an array, so we will union all values, ordered by: "Benign", "Likely Benign", "Uncertain significance", "Likely pathogenic", "Pathogenic", "drug response", "association", "risk factor", "protective", "Affects", "conflicting data from submitters", "other", "not provided", "'-'".
clinvar_last_updated | date | No | 2020-08-27 | Yes | No | ClinVar Classification Date.
clinvar_phenotype | array[string] | No | [“Nephronophthisis 4”, “Nephronophthisis”] | Yes | No | ClinVar Disease Name.

Column Explanations

    • Field name -- The name of the field in the database
    • Type -- Data type.  Arrays are possible.
    • Key? -- Whether this field makes up a unique key for the row.  Note that all key fields together make a unique key for the row.
    • Nullable?  -- Whether this field is allowed to take a null value.
    • transcript-specific? -- Whether this field is unique to the transcript.  In other words, whether this field can change value within a single variant.  For example, a single variant can have different consequences depending on the transcript, but the gnomad population data will not change based on the selected transcript for a variant.
    • Notes -- Any other relevant information.
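To make the key and transcript-specific columns concrete, here is a minimal sketch that checks the (vid, transcript) key for duplicates and groups transcripts per variant.  It assumes rows are already loaded as dictionaries keyed by the Table 1 field names, with a null transcript represented as None; the helper names are hypothetical.

```python
from collections import Counter, defaultdict


def duplicate_keys(rows):
    """Return every (vid, transcript) pair that appears on more than one row."""
    counts = Counter((row["vid"], row["transcript"]) for row in rows)
    return [key for key, n in counts.items() if n > 1]


def transcripts_by_variant(rows):
    """Group transcript IDs by vid; a variant may map to several transcripts."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row["vid"]].append(row["transcript"])
    return dict(grouped)
```

An empty result from duplicate_keys is what the schema leads us to expect, since vid and transcript together form the unique key.  Fields marked transcript-specific may differ between the rows that transcripts_by_variant groups under one vid, while the remaining fields repeat unchanged.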

FAQs

  • 1. Why is a variant associated with multiple transcripts?

This is a question about biology and sequencing.  A gene codes for a protein through transcription and translation (i.e. DNA → mRNA, and the mRNA is then translated into a protein).  In the wild, we observe that each gene actually has multiple mRNA isoforms (and therefore proteins).  Scientists have mapped each mRNA isoform back onto the genome; these mapped isoforms are the “transcripts” discussed in this document.  This means that variants can have different effects on the coded protein depending on the transcript, and we have to determine the effect on each transcript separately.  The two most common data sources for transcripts are Ensembl and RefSeq.  These data sources also have rules for canonical transcripts, which provide guidance for choosing an exemplar transcript when determining the most likely effect of a variant.

Note that some genes overlap each other, so overlapping transcripts can have multiple gene names as well.

  • 2. Why do we suggest downstream applications use Ensembl transcripts (over RefSeq)?

RefSeq transcripts are a subset of Ensembl.  This makes Ensembl more appropriate for exploratory research.

https://bioinformatics.stackexchange.com/questions/21/feature-annotation-refseq-vs-ensembl-vs-gencode-whats-the-difference

  • 3. How does a variant relate to a row in a Variant Call Format (VCF) file?

Each row in a VCF is a single genomic location (site).  A VCF can have multiple variants at a site.  A variant is the site, plus a reference allele and a single alternate allele.  This definition of a variant keeps the VAT table design simple and ensures row independence.  The transcript is required, since certain annotations, such as gene or protein impact, can change depending on the transcript used for annotating a variant.  For more information about the relationship between variants and transcripts, please see FAQ #1.
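As an illustration of the site-to-variant relationship, the sketch below splits one VCF row into one variant ID per alternate allele, following the <contig>-<position>-<ref_allele>-<alt_allele> pattern from Table 1.  The function name split_site is hypothetical, and the sketch does not attempt any allele normalization or trimming that a production pipeline may apply.

```python
def split_site(contig, position, ref, alts):
    """Return one vid string per alternate allele at a VCF site.

    `alts` is the comma-separated ALT column of the VCF row.
    The contig is used exactly as passed in.
    """
    return [f"{contig}-{position}-{ref}-{alt}" for alt in alts.split(",")]


# The multi-allelic site used in Appendix B yields two variants.
variants = split_site("chr1", 100, "C", "T,CT")
```

Here variants holds chr1-100-C-T and chr1-100-C-CT, which are then annotated independently, once per overlapping transcript.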

  • 4. Why are so many fields null for some variants?

When a variant does not overlap a gene (and therefore does not overlap any transcripts), any annotation regarding the protein coding of the gene (e.g. aa_change) will be null, since there is no applicable value for the field without a transcript overlap.  Note that we annotate variants as upstream or downstream of a transcript, which means that overlap with a transcript extends beyond the actual boundaries of the transcript (by up to 5,000 bases).

  • 5. Can information in the VAT be shared publicly?

No, because variant counts can be under 20, and sharing such counts publicly is against policy.  Please see the All of Us Statistics Dissemination Policy.

  • 6. In the VAT, what computed ancestries are included in gnomAD (gnomad_*) and All of Us allele metric annotations (gvs_*)?

Table 2 lists the computed ancestries with allele metric annotations in the VAT.

Table 2 -- Computed ancestries available for allele metric annotations in the VAT.

Computed Ancestry (abbreviation) | gnomAD annotations | All of Us annotations | Notes
--- | --- | --- | ---
African (afr) | Yes | Yes |
Latino/Admixed American (amr) | Yes | Yes |
Ashkenazi Jewish (asj) | Yes | No |
East Asian (eas) | Yes | Yes |
European (eur) | No (see note) | Yes | Includes both Finnish (fin) and Non-Finnish European (nfe)
Finnish (fin) | Yes | No |
Middle-Eastern (mid) | No | Yes |
Non-Finnish European (nfe) | Yes | No |
South Asian (sas) | Yes | Yes |
Other (oth) | Yes | Yes |

 

Appendix A: Calculating *_max_* fields in the VAT

The VAT schema specifies several fields with a “_max_” in the name.  These indicate the value corresponding to the subpopulation with the maximum AF for the variant in either the gnomad or gvs corpus.

For example, gnomad_max_ac would be “In gnomAD, the AC for the subpopulation with the highest AF”.

For example, gvs_max_an would be “In GVS, the AN for the subpopulation with the highest AF”.

Example to populate gnomad_max_ac and gnomad_max_subpop

We have eight ancestry subpopulations in gnomad:  

  • afr
  • amr
  • eas
  • fin
  • nfe
  • asj
  • oth
  • sas

Therefore, we will already have the following AC fields in the VAT:

  • gnomad_afr_ac
  • gnomad_amr_ac
  • gnomad_eas_ac
  • gnomad_fin_ac
  • gnomad_nfe_ac
  • gnomad_asj_ac
  • gnomad_oth_ac
  • gnomad_sas_ac

And AF fields:

  • gnomad_afr_af
  • gnomad_amr_af
  • gnomad_eas_af
  • gnomad_fin_af
  • gnomad_nfe_af
  • gnomad_asj_af
  • gnomad_oth_af
  • gnomad_sas_af

Then let’s say that we have the following values for each AF field:

  • gnomad_afr_af: 0.75
  • gnomad_amr_af: 0.1
  • gnomad_eas_af: 0.1
  • gnomad_fin_af: 0.1
  • gnomad_nfe_af: 0.1
  • gnomad_asj_af: 0.1
  • gnomad_oth_af: 0.1
  • gnomad_sas_af: 0.1

Since gnomad_afr_af is the highest, gnomad_max_subpop is “afr” and gnomad_max_ac is the value in gnomad_afr_ac.

Ties are broken by alphabetical order of subpopulation (i.e. “afr” would come before “oth”).
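The selection rule above can be sketched as follows.  This is an illustration under stated assumptions, not the production code: the helper names are hypothetical, rows are plain dictionaries keyed by the Table 1 field names, and subpopulations with a null AF are skipped.

```python
def max_subpop(af_by_subpop):
    """Pick the subpopulation with the highest AF; ties go to the alphabetically first name."""
    known = {name: af for name, af in af_by_subpop.items() if af is not None}
    if not known:
        return None
    # Sort key: highest AF first (negated), then name in alphabetical order.
    return min(known, key=lambda name: (-known[name], name))


def max_fields(row, prefix, subpops):
    """Build the <prefix>_max_* values from the per-subpopulation fields of one row."""
    afs = {s: row.get(f"{prefix}_{s}_af") for s in subpops}
    best = max_subpop(afs)
    result = {f"{prefix}_max_subpop": best}
    if best is not None:
        for metric in ("af", "ac", "an"):
            result[f"{prefix}_max_{metric}"] = row.get(f"{prefix}_{best}_{metric}")
    return result
```

With the AF values from the example, max_subpop selects “afr”, and max_fields then copies gnomad_afr_ac into gnomad_max_ac.  Passing "gvs" as the prefix with the All of Us subpopulations from Table 2 applies the same rule to the gvs_max_* fields.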

Appendix B: Calculating sample_count (and n_hets, n_hom_alts)

Notes:

  • The algorithms described here assume a ploidy of two.  
  • In the current design, only sample_count is ever surfaced, but this appendix covers the calculation of n_hets and n_hom_alts as well.  See Table 2

The sample_count is defined as the number of samples having the alternate allele in the variant.  This can be one copy (“heterozygous (het)”) or two copies (“homozygous alternate (homalt)”).  We cannot calculate the sample_count from AC and AN, since we do not know how many samples are het or homalt.  Please note that sample_count is allele specific.  Please see example below from a VCF:

CHROM | POS | REF | ALT | SAMPLE1 | SAMPLE2 | SAMPLE3 | SAMPLE4 | SAMPLE5 | SAMPLE6
--- | --- | --- | --- | --- | --- | --- | --- | --- | ---
chr1 | 100 | C | T,CT | 0/0 | 0/1 | 0/2 | 1/1 | 1/2 | ./.

We expect to have two variants:

vid | sample_count | n_het | n_homalt
--- | --- | --- | ---
chr1-100-C-T | 3 (SAMPLE 2,4,5) | 2 (SAMPLE 2,5) | 1 (SAMPLE 4)
chr1-100-C-CT | 2 (SAMPLE 3,5) | 2 (SAMPLE 3,5) | 0

sample_count = the number of genotypes with at least one copy of the alternate allele.

n_het = the number of genotypes with a het call including the alternate allele.

n_homalt = the number of genotypes with a homalt call including the alternate allele.  “0/0” is never a homalt call; it is homozygous reference (“homref”).

Other rules

  • If a genotype is filtered (FT field exists and is neither “.” nor “PASS”), then the genotype should not count towards sample_count, n_het, nor n_homalt.
  • If a genotype is a no_call (“./.” or missing), then the genotype should not count towards sample_count, n_het, nor n_homalt.
  • If a genotype is a partial call (e.g. “./1”), then the genotype should count towards sample_count, but not n_het, nor n_homalt.  
  • AC = 2(n_homalt) + n_het + partial call
  • AN = 2(n_het + n_homalt + n_homref) + partial call 
  • SC = AC_Homalt/2 + AC_Het + AC_Hemi
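The counting rules above can be sketched in a few lines of Python.  This is a minimal illustration assuming a ploidy of two and genotypes given as VCF GT strings; the function name count_genotypes and its filters argument (one FT value per sample, or None) are hypothetical.

```python
def count_genotypes(genotypes, alt_index, filters=None):
    """Return (sample_count, n_het, n_homalt) for one alternate allele.

    `alt_index` is the 1-based index of the allele in the ALT column.
    """
    target = str(alt_index)
    sample_count = n_het = n_homalt = 0
    for i, gt in enumerate(genotypes):
        ft = filters[i] if filters else None
        if ft not in (None, ".", "PASS"):
            continue  # filtered genotype: counts towards nothing
        alleles = gt.replace("|", "/").split("/")
        called = [a for a in alleles if a != "."]
        if target not in called:
            continue  # no-call, or the genotype does not carry this allele
        sample_count += 1
        if len(called) < len(alleles):
            continue  # partial call: sample_count only
        if all(a == target for a in called):
            n_homalt += 1
        else:
            n_het += 1
    return sample_count, n_het, n_homalt


# Genotypes of SAMPLE1..SAMPLE6 from the example VCF row (ALT = T,CT).
example = ["0/0", "0/1", "0/2", "1/1", "1/2", "./."]
counts_t = count_genotypes(example, 1)   # chr1-100-C-T
counts_ct = count_genotypes(example, 2)  # chr1-100-C-CT
```

For the example row this gives (3, 2, 1) for chr1-100-C-T and (2, 2, 0) for chr1-100-C-CT, matching the table above.  Note that the “1/2” genotype of SAMPLE5 counts as a het call for both alternate alleles.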
