Variant Annotation Table

Functional annotations for passing variants in the short-read whole genome sequencing (srWGS) SNP and Indel dataset are available in the Variant Annotation Table (VAT). Sites with 50 or more alternate alleles are not included in the VAT. The VAT includes annotations such as the gene symbol and protein change, and is delivered as a block-compressed tab-separated value text file (.tsv.bgz), which can be loaded into Hail. Each row represents one variant-transcript combination, and there is exactly one record per combination. However, a variant can overlap multiple transcripts and therefore have multiple records, one for each variant-transcript combination.
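The one-row-per-variant-transcript layout can be explored with a short sketch. The article loads the file into Hail; the example below swaps in the Python standard library instead, relying on the fact that block-compressed (.bgz) files are concatenated gzip blocks. It assumes the file has a header row with the vid and transcript columns from Table 1; the function names are hypothetical.

```python
import csv
import gzip
from collections import defaultdict


def read_vat_rows(path):
    """Yield each VAT row as a dict keyed by column name.

    Assumes a tab-separated file with a header row. BGZF (.bgz) is a
    series of gzip blocks, so the gzip module can stream it.
    """
    with gzip.open(path, mode="rt", newline="") as handle:
        yield from csv.DictReader(handle, delimiter="\t")


def transcripts_by_variant(rows):
    """Map each vid to the list of transcript values found in its rows."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row["vid"]].append(row["transcript"])
    return dict(grouped)
```

A variant that overlaps two transcripts appears under one vid with two entries. How a null transcript is encoded in the file is not specified here, so the sketch passes the raw string through unchanged.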

The variants are called against the hg38/GRCh38 reference; a detailed description of the variant calling analysis is in the Genomic Research Data Quality Report (All of Us Genomic QC Report). We generate most of the functional annotations using NIRVANA 3.18, a functional annotation tool from Illumina that annotates genomic variants based on Sequence Ontology consequences and external data sources for additional context (e.g. gnomAD, SpliceAI, ClinVar). The remaining annotations are the All of Us population metrics (fields: gvs_*), which we generate internally. Please note that all of the All of Us population annotations exclude filtered genotypes (FT tag populated with a value other than missing or “PASS”). Therefore, allele counts (fields: gvs_*_ac) and allele numbers (fields: gvs_*_an) of zero are possible. Table 1 details the fields in the VAT.

 

Variants do not correspond 1:1 to the genomic sites represented in a Variant Call Format (VCF) file, since a single row in a VCF (a “site”) can be multi-allelic. In other words, one site in a VCF can contain multiple variants. A code snippet is provided in the featured notebook 01_Get Started with Genomic Data. For the exact locations of these files, please see the Controlled CDR Directory.
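The site-to-variant relationship can be sketched as follows. This is a minimal illustration that builds one identifier per alternate allele in the <contig>-<position>-<ref_allele>-<alt_allele> form described for the vid field in Table 1. The function name is hypothetical, and the sketch does not trim or left-align alleles, which a production pipeline may do.

```python
def site_to_vids(contig, position, ref, alts):
    """Return one vid-style string per alternate allele at a VCF site.

    `alts` is the comma-separated ALT column of the VCF row.
    No allele normalization is attempted (assumption).
    """
    return [f"{contig}-{position}-{ref}-{alt}" for alt in alts.split(",")]


# A multi-allelic site with two alternate alleles yields two variants.
print(site_to_vids("chr1", 100, "C", "T,CT"))
```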

 

Table 1 -- VAT Schema

Field name Type Key? Example value Nullable? transcript-specific? Notes
vid string yes 1-3414320-G-A No No Variant ID. Unique string for identifying a variant (as produced by NIRVANA based on a spec from the Broad Institute). Note that a variant cannot have multiple alternate alleles -- only one. The vid is <contig>-<position>-<ref_allele>-<alt_allele>.
transcript string yes ENST00000372090.5 Yes Yes Transcript ID. Null indicates that this variant does not overlap any transcripts (i.e. the variant is in an intergenic region (IGR)). We include Ensembl transcripts only.

contig string no chr1 No No Contig names match the hg38 reference.
position integer no 3414320 No No Must be positive; exact position for a SNP and the position before the alteration in an indel.
ref_allele string no G No No Base(s). Always one base for SNPs and insertions; more than one base for deletions.
alt_allele string no A No No Base(s). Always one base for SNPs and deletions; more than one base for insertions.
gvs_all_ac integer no 2 No No Alternate allele count across all available samples in the WGS joint callset.
gvs_all_an integer no 4 No No Allele number across all available samples in the WGS joint callset.
gvs_all_af float no 0.5 No No Alternate allele frequency (AC/AN) across all available samples in the WGS joint callset.
gvs_all_sc integer no 4 No No Sample count of heterozygous plus homozygous alternate genotypes. For rules on calculating this field, please see Appendix B.
gene_symbol string no TESK2 Yes Yes Gene symbol. A variant can have more than one associated gene symbol, since about 3% of genes overlap. Note that transcript to gene is still one-to-one, so this field is a single gene symbol. See FAQ #1 for more info on the relationship between transcripts, genes, and variants. A null value indicates that the variant is in an IGR, i.e. the variant has no associated gene. This should only happen when transcript is null.
transcript_source string no Ensembl Yes Yes
aa_change string no ENSP00000426975.1:p.(Ser534Pro) Yes Yes HGVS p. nomenclature; Amino acid change.
consequence array<string> no ['splice_region_variant', 'intron_variant', 'non_coding_transcript_variant'] Yes Yes Amino acid change type.
dna_change_in_transcript string no ENST00000352527.5:c.77+1714C>T Yes Yes HGVS c. nomenclature; DNA change in transcript space.
variant_type string no SNV No No DNA change type (HGVS).
exon_number string no 1/4 Yes Yes Exon number
intron_number string no 3/9 Yes Yes Intron number
genomic_location string no NC_000001.11:g.3128801A>G No No HGVS g. nomenclature; variant location.
dbsnp_rsid array no rs7549050 Yes No rsID
gene_id string no ENSG00000228888 Yes Yes Gene ID for the transcript
gene_omim_id integer no 610972 Yes No OMIM ID
is_canonical_transcript boolean no True Yes Yes Primary Transcript ID (see canonical description in NIRVANA)
gnomad_all_af float no 0.25608 Yes No gnomAD: "Total" frequency
gnomad_all_ac integer no 8023 Yes No gnomAD: "Total" allele count
gnomad_all_an integer no 31330 Yes No gnomAD: "Total" allele number
gnomad_max_af float no 0.30391 Yes No gnomAD: Max subpopulation frequency. gnomAD subpopulations can be found in Table 2. To calculate the max AF, see Appendix A.
gnomad_max_ac integer no 2644 Yes No gnomAD: Max subpopulation allele count. gnomAD subpopulations can be found in Table 2. To calculate the max AC, see Appendix A.
gnomad_max_an integer no 8700 Yes No gnomAD: Max subpopulation allele number. gnomAD subpopulations can be found in Table 2. To calculate the max AN, see Appendix A.
gnomad_max_subpop string no afr Yes No gnomAD: Max subpopulation ancestry. gnomAD subpopulations can be found in Table 2. To calculate the max subpopulation, see Appendix A.
gnomad_<subpop>_ac integer no 8028 Yes No This is actually multiple fields -- one for each of the eight gnomAD subpopulations. gnomAD: allele count in the <subpop>. gnomAD subpopulations can be found in Table 2.
gnomad_<subpop>_an integer no 31330 Yes No This is actually multiple fields -- one for each of the eight gnomAD subpopulations. gnomAD: allele number in the <subpop>. gnomAD subpopulations can be found in Table 2.
gnomad_<subpop>_af float no 0.30391 Yes No This is actually multiple fields -- one for each of the eight gnomAD subpopulations (see above). gnomAD: allele frequency in the <subpop>. gnomAD subpopulations can be found in Table 2.

gvs_max_af float no 0.500 No No All of Us: Max subpopulation frequency. All of Us subpopulations can be found in Table 2. To calculate the max AF, see Appendix A.
gvs_max_ac integer no 100 No No All of Us: Max subpopulation allele count. All of Us subpopulations can be found in Table 2. To calculate the max AC, see Appendix A.
gvs_max_an integer no 200 No No All of Us: Max subpopulation allele number. All of Us subpopulations can be found in Table 2. To calculate the max AN, see Appendix A.
gvs_max_sc integer no 200 No No All of Us: Max subpopulation sample count. All of Us subpopulations can be found in Table 2. The max is calculated in the same way as max AN, max AC, etc. See Appendix A.
gvs_max_subpop string no sas No No All of Us: Max subpopulation ancestry. All of Us subpopulations can be found in Table 2. To calculate the max subpopulation, see Appendix A.
gvs_<subpop>_ac integer no 100 No No This is actually multiple fields -- one for each ancestry determination. All of Us subpopulations can be found in Table 2. For example, one field would be gvs_afr_ac. Alternate allele count for samples in the <subpop>; the subpop AC values for a site are split into one value per alternate allele.
gvs_<subpop>_an integer no 1000 No No This is actually multiple fields -- one for each ancestry determination. All of Us subpopulations can be found in Table 2. For example, one field would be gvs_afr_an. Allele number for samples, in the <subpop>, in this instance of GVS.
gvs_<subpop>_af float no 0.100 No No This is actually multiple fields -- one for each ancestry determination. All of Us subpopulations can be found in Table 2. For example, one field would be gvs_afr_af. Alternate allele frequency (AC/AN) across samples, in the <subpop>, in this instance of GVS.
gvs_<subpop>_sc integer no 243 No No This is actually multiple fields -- one for each ancestry determination. All of Us subpopulations can be found in Table 2. For example, one field would be gvs_afr_sc. Sample count of heterozygous plus homozygous alternate genotypes in the subpopulation. For rules on calculating this field, please see Appendix B.

revel float no 0.135 Yes No REVEL
splice_ai_acceptor_gain_score float no 0.3 Yes No Splice AI.  All spliceAI values from the variant will be mapped to the proper transcript based on the gene_symbol/hgnc. This will cause some redundant info.
splice_ai_acceptor_gain_distance integer no -8 Yes No Splice AI
splice_ai_acceptor_loss_score float no 0.5 Yes No Splice AI
splice_ai_acceptor_loss_distance integer no 0 Yes No Splice AI
splice_ai_donor_gain_score float no 0.9 Yes No Splice AI
splice_ai_donor_gain_distance integer no 12 Yes No Splice AI
splice_ai_donor_loss_score float no 0.1 Yes No Splice AI
splice_ai_donor_loss_distance integer no -8 Yes No Splice AI
omim_phenotypes_id array[integer] no [616781] Yes No OMIM Disease ID.  Must be ordered corresponding to omim_phenotypes_name.
omim_phenotypes_name array[string] no [“Joubert syndrome 25”] Yes No OMIM Disease Name.  Must be ordered corresponding to omim_phenotypes_id.
clinvar_classification array[string] no benign Yes No ClinVar Classification. Note that significance is an array, so we union all values, ordered by: "benign", "likely benign", "uncertain significance", "likely pathogenic", "pathogenic", "drug response", "association", "risk factor", "protective", "affects", "conflicting data from submitters", "other", "not provided", "'-'"

clinvar_last_updated date no 2020-08-27 Yes No ClinVar Classification Date
clinvar_phenotype array[string] no [“Nephronophthisis 4”, “Nephronophthisis”] Yes No ClinVar Disease Name
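The ordered union described for clinvar_classification can be sketched as below. This is an illustration only: the order list is copied from the notes for that field, the function name is hypothetical, and sorting unrecognized labels last is an assumption rather than documented behavior.

```python
# Order copied verbatim from the clinvar_classification notes in Table 1.
CLINVAR_ORDER = [
    "benign", "likely benign", "uncertain significance",
    "likely pathogenic", "pathogenic", "drug response", "association",
    "risk factor", "protective", "affects",
    "conflicting data from submitters", "other", "not provided", "'-'",
]


def union_classifications(*value_lists):
    """Union several lists of classifications and sort by CLINVAR_ORDER.

    Labels not in CLINVAR_ORDER sort after all known labels (assumption).
    """
    rank = {label: i for i, label in enumerate(CLINVAR_ORDER)}
    seen = set()
    for values in value_lists:
        seen.update(values)
    return sorted(seen, key=lambda label: (rank.get(label, len(rank)), label))
```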

Column Explanations

  • Field name -- The name of the field in the database
  • Type -- Data type.  Arrays are possible.
  • Key? -- Whether this field makes up a unique key for the row.  Note that all key fields together make a unique key for the row.
  • Nullable?  -- Whether this field is allowed to take a null value.
  • transcript-specific? -- Whether this field is unique to the transcript.  In other words, whether this field can change value within a single variant.  For example, a single variant can have different consequences depending on the transcript, but the gnomad population data will not change based on the selected transcript for a variant.
  • Notes -- Any other relevant information.
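The key and allele rules in Table 1 can be checked with a small sketch. It assumes rows are available as dictionaries keyed by field name. The function names and the "insertion", "deletion", and "other" labels are hypothetical; only "SNV" appears as an example value in Table 1.

```python
from collections import Counter


def parse_vid(vid):
    """Split a vid into (contig, position, ref_allele, alt_allele).

    Splitting from the right tolerates contig names that contain hyphens.
    """
    contig, position, ref, alt = vid.rsplit("-", 3)
    return contig, int(position), ref, alt


def allele_class(ref, alt):
    """Classify a variant by allele lengths, following the Table 1 notes."""
    if len(ref) == 1 and len(alt) == 1:
        return "SNV"
    if len(ref) == 1 and len(alt) > 1:
        return "insertion"
    if len(ref) > 1 and len(alt) == 1:
        return "deletion"
    return "other"


def duplicate_keys(rows):
    """Return (vid, transcript) pairs that occur more than once."""
    counts = Counter((row["vid"], row["transcript"]) for row in rows)
    return [key for key, n in counts.items() if n > 1]
```

Because vid and transcript together form the row key, duplicate_keys should return an empty list for a well-formed VAT.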

 

 

 

Frequently Asked Questions about the VAT

  1. Why is a variant associated with multiple transcripts?

This is a question about biology and sequencing. A gene codes for a protein through transcription and translation (i.e. DNA is transcribed into mRNA, and the mRNA is then translated into a protein). In the wild, we observe that each gene actually has multiple mRNA isoforms (and therefore proteins). Scientists have mapped each mRNA isoform back to the genome; these mapped isoforms are the “transcripts” discussed in this document. This means that a variant can have different effects on the coded protein depending on the transcript, and we have to determine the effect on each transcript separately. Note that some genes overlap each other, so a variant that falls in overlapping transcripts can have multiple gene names as well.

The two most common data sources for transcripts are Ensembl and RefSeq.  These data sources also have rules for canonical transcripts, which provide guidance for choosing an exemplar transcript when determining the most likely effect of a variant.

 

  2. Why do we suggest downstream applications use Ensembl transcripts (over RefSeq)?

RefSeq transcripts are a subset of Ensembl.  This makes Ensembl more appropriate for exploratory research.

https://bioinformatics.stackexchange.com/questions/21/feature-annotation-refseq-vs-ensembl-vs-gencode-whats-the-difference

 

  3. How does a variant relate to a row in a Variant Call Format (VCF) file?

Each row in a VCF is a single genomic location (site). A VCF can have multiple variants at a site. A variant is the site, plus a reference allele and a single alternate allele. This definition of a variant keeps the VAT table design simple and ensures row independence. The transcript is required, since certain annotations, such as gene or protein impact, can change depending on the transcript used for annotating a variant. For more information about the relationship between variants and transcripts, please see FAQ #1.

 

  4. Why are so many fields null for some variants?

When a variant does not overlap a gene (and therefore does not overlap any transcripts), any annotation regarding the protein coding of the gene (e.g. aa_change) will be null, since there is no applicable value for the field without a transcript overlap.  Note that we annotate variants as upstream or downstream of a transcript, which means that overlap with a transcript extends beyond the actual boundaries of the transcript (by up to 5,000 bases).

 

  5. Can information in the VAT be shared publicly?

No. Variant counts in the VAT can be under 20, and sharing such counts publicly is against policy. Please see the All of Us Statistics Dissemination Policy.

 

  6. In the VAT, which computed ancestries are included in the gnomAD (gnomad_*) and All of Us (gvs_*) allele metric annotations?

Table 2 lists the computed ancestries with allele metric annotations in the VAT.

Table 2 -- Computed ancestries available for allele metric annotations in the VAT.

Computed Ancestry (abbreviation) gnomAD annotations All of Us annotations Notes
African (afr) Yes Yes
Latino/Admixed American (amr) Yes Yes
Ashkenazi Jewish (asj) Yes No
East Asian (eas) Yes Yes
European (eur) No (see note) Yes Includes both Finnish (fin) and Non-Finnish European (nfe)
Finnish (fin) Yes No
Middle-Eastern (mid) No Yes
Non-Finnish European (nfr) No No The nfr subpopulation appears as nfr in the VAT but corresponds to the nfe subpopulation in gnomAD. Please note that the nfr annotations are missing in the VAT.

South Asian (sas) Yes Yes  
Other (oth) Yes Yes  

 

 

 

Appendix A: Calculating *_max_* fields in the VAT

The VAT schema specifies several fields with a “_max_” in the name.  These indicate the value corresponding to the subpopulation with the maximum AF for the variant in either the gnomad or gvs corpus.

For example, gnomad_max_ac would be “In gnomAD, the AC for the subpopulation with the highest AF”.

For example, gvs_max_an would be “In GVS, the AN for the subpopulation with the highest AF”.

Example to populate gnomad_max_ac and gnomad_max_subpop

We have eight ancestry subpopulations in gnomad:  

  • afr
  • amr
  • eas
  • fin
  • nfr
  • asj
  • oth
  • sas

Therefore, we will already have the following AC fields in the VAT:

  • gnomad_afr_ac
  • gnomad_amr_ac
  • gnomad_eas_ac
  • gnomad_fin_ac
  • gnomad_nfe_ac
  • gnomad_asj_ac
  • gnomad_oth_ac
  • gnomad_sas_ac

And AF fields:

  • gnomad_afr_af
  • gnomad_amr_af
  • gnomad_eas_af
  • gnomad_fin_af
  • gnomad_nfe_af
  • gnomad_asj_af
  • gnomad_oth_af
  • gnomad_sas_af

Then let’s say that we have the following values for each AF field:

  • gnomad_afr_af: 0.75
  • gnomad_amr_af: 0.1
  • gnomad_eas_af: 0.1
  • gnomad_fin_af: 0.1
  • gnomad_nfe_af: 0.1
  • gnomad_asj_af: 0.1
  • gnomad_oth_af: 0.1
  • gnomad_sas_af: 0.1

Since gnomad_afr_af is the highest, gnomad_max_subpop is “afr” and gnomad_max_ac is the value in gnomad_afr_ac. Ties are broken by alphabetical order of subpopulation (i.e. “afr” would come before “oth”).
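The selection rule above can be sketched in a few lines. This is a minimal illustration, not the production implementation: the function names are hypothetical, the row is assumed to be a dictionary keyed by VAT field name, and skipping null AF values is an assumption made because the gnomAD fields are nullable.

```python
def max_subpop(af_by_subpop):
    """Return the subpopulation with the highest AF.

    Ties go to the alphabetically first subpopulation. Null (None) AFs
    are skipped (assumption); returns None if no AF is available.
    """
    candidates = {s: af for s, af in af_by_subpop.items() if af is not None}
    if not candidates:
        return None
    # Sort key: highest AF first, then alphabetical subpopulation name.
    return min(candidates, key=lambda s: (-candidates[s], s))


def max_fields(row, prefix, subpops):
    """Build the <prefix>_max_* values from the per-subpopulation fields."""
    best = max_subpop({s: row.get(f"{prefix}_{s}_af") for s in subpops})
    if best is None:
        return {f"{prefix}_max_subpop": None}
    return {
        f"{prefix}_max_subpop": best,
        f"{prefix}_max_af": row[f"{prefix}_{best}_af"],
        f"{prefix}_max_ac": row.get(f"{prefix}_{best}_ac"),
        f"{prefix}_max_an": row.get(f"{prefix}_{best}_an"),
    }
```

With the AF values from the example (0.75 for afr and 0.1 for the others), max_subpop returns "afr", and gnomad_max_ac is read from gnomad_afr_ac.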

 

 

Appendix B: Calculating sample_count (and n_hets, n_hom_alts)

The sample_count is defined as the number of samples having the alternate allele in the variant.  This can be one copy (“heterozygous (het)”) or two copies (“homozygous alternate (homalt)”). We cannot calculate the sample_count from AC and AN, since we do not know how many samples are het or homalt. In this algorithm, we assume a ploidy of two. The sample_count is allele specific. We also calculate n_hets and n_hom_alts, but we currently only report sample_count (Table 1).

Please see the example below from a VCF:

CHROM POS REF ALT SAMPLE1 SAMPLE2 SAMPLE3 SAMPLE4 SAMPLE5 SAMPLE6
chr1 100 C T,CT 0/0 0/1 0/2 1/1 1/2 ./.

We expect to have two variants:

vid sample_count n_het n_homalt
chr1-100-C-T 3 (SAMPLE 2,4,5) 2 (SAMPLE 2,5) 1 (SAMPLE 4)
chr1-100-C-CT 2 (SAMPLE 3,5) 2 (SAMPLE 3,5) 0

sample_count = the number of genotypes with at least one copy of the alternate allele (a non-ref call that includes that allele).

n_het = the number of genotypes with a het call including the alternate allele 

n_homalt = the number of genotypes with a homalt call including the alternate allele.  “0/0” is never a homalt call, this is homozygous reference (“homref”).

 

Other rules:

  • If a genotype is filtered (FT field exists and is not “.” nor “PASS”), then the genotype should not count towards sample_count, n_het, nor n_homalt.
  • If a genotype is a no_call (“./.” or missing), then the genotype should not count towards sample_count, n_het, nor n_homalt.
  • If a genotype is a partial call (e.g. “./1”), then the genotype should count towards sample_count, but not n_het, nor n_homalt.  
  • AC = 2(n_homalt) + n_het + partial call
  • AN = 2(n_het + n_homalt + n_homref) + partial call 
  • SC = AC_Homalt/2 + AC_Het + AC_Hemi
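The counting rules above can be sketched as follows. This is an illustration under stated assumptions, not the production code: the function name is hypothetical, genotypes are passed as VCF GT strings, the alternate allele is identified by its 1-based index in the ALT column, and a ploidy of two is assumed as in the text.

```python
def count_genotypes(genotypes, alt_index, filters=None):
    """Count sample_count, n_het, n_homalt and AC for one alternate allele.

    genotypes: GT strings such as "0/1", "1/2", "./." (ploidy two assumed).
    alt_index: 1-based index of the alternate allele in the ALT column.
    filters:   optional per-genotype FT values; None, "." and "PASS"
               mean the genotype is not filtered.
    """
    if filters is None:
        filters = [None] * len(genotypes)
    sample_count = n_het = n_homalt = ac = 0
    for gt, ft in zip(genotypes, filters):
        if ft not in (None, ".", "PASS"):
            continue  # filtered genotypes count towards nothing
        alleles = gt.replace("|", "/").split("/")
        copies = alleles.count(str(alt_index))
        if copies == 0:
            continue  # homref, no_call, or a different alternate allele
        sample_count += 1
        ac += copies
        if "." in alleles:
            continue  # partial call: sample_count only
        if copies == 2:
            n_homalt += 1
        else:
            n_het += 1
    return {"sample_count": sample_count, "n_het": n_het,
            "n_homalt": n_homalt, "ac": ac}


# The six samples from the example VCF row (ALT = T,CT).
example = ["0/0", "0/1", "0/2", "1/1", "1/2", "./."]
print(count_genotypes(example, 1))  # allele T
print(count_genotypes(example, 2))  # allele CT
```

For the example row this reproduces the table above: allele T has a sample_count of 3 with two hets and one homalt, and allele CT has a sample_count of 2 with two hets.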
