Functional annotations for passing variants in the short read whole genome sequencing (srWGS) SNP and Indel dataset are available in the Variant Annotation Table (VAT). Sites with 50 or more alternate alleles are not included in the VAT. The VAT includes annotations like the gene symbol and protein change, delivered as a block compressed tab-separated value text file (.tsv.bgz), which can be loaded into Hail. Each row represents a variant-transcript combination and there is only one record per variant transcript combination. However, each variant can overlap multiple transcripts and thus have multiple records, representing different variant transcript combinations.
The variants are called against the hg38/GRCh38 reference; a detailed description of the variant calling analysis is in the Genomic Research Data Quality Report (All of Us Genomic QC Report). We generate most of the functional annotations using NIRVANA 3.18, a functional annotation tool from Illumina that annotates genomic variants based on the Sequence Ontology consequences and external data sources for additional context (e.g. gnomAD, SpliceAI, ClinVar). The remaining annotations are the All of Us population metrics (fields: gvs_*), which we generate internally. Please note that all of the All of Us population annotations exclude filtered genotypes (genotypes whose FT tag is populated with a value other than missing or “PASS”; see Appendix B). Therefore, allele counts (fields: gvs_*_ac) and allele numbers (fields: gvs_*_an) of zero are possible. Table 1 details the fields in the VAT.
Variants do not correspond 1:1 to the genomic sites represented in a Variant Call Format (VCF) file, since a single row in a VCF (“site”) can be multi-allelic. In other words, one site in a VCF can have multiple variants. A code snippet is provided in the featured notebook 01_Get Started with Genomic Data. For the exact locations of these files, please see the Controlled CDR Directory.
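As a minimal sketch of the row structure described above, the following Python reads a VAT-style file with the standard library and groups records by vid. It assumes that the block compressed file can be read as ordinary gzip text, that the header row uses the field names in Table 1, and that the path passed in points to a hypothetical local copy. The function names are illustrative and are not part of any All of Us tooling.

```python
import csv
import gzip
from collections import defaultdict


def read_vat_rows(path):
    """Yield one dict per VAT row (one variant-transcript combination)."""
    # Assumption: the .tsv.bgz file is readable as regular gzip text.
    with gzip.open(path, mode="rt", newline="") as handle:
        # QUOTE_NONE keeps quote characters inside array-like values untouched.
        reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
        yield from reader


def transcripts_by_variant(rows):
    """Map each vid to the list of transcript IDs it is recorded against."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row["vid"]].append(row["transcript"])
    return dict(grouped)
```

A variant that overlaps several transcripts shows up as one vid with several transcript IDs in the returned mapping.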
Table 1 -- VAT Schema
Field name | Type | Key? | Example value | Nullable? | transcript-specific? | Notes |
vid | string | yes | 1-3414320-G-A | No | No | Variant ID. Unique string for identifying a variant (as produced by NIRVANA based on a spec from Broad Institute). Note that a variant cannot have multiple alternate alleles -- only one. The vid is <contig>-<position>-<ref_allele>-<alt_allele>. |
transcript | string | yes | ENST00000372090.5 | Yes | Yes | Transcript ID. Null indicates that this variant does not overlap any transcripts (i.e. the variant is in an intergenic region (IGR)). We include Ensembl transcripts only. |
contig | string | no | chr1 | No | No | Contig names match the hg38 reference. |
position | integer | no | 3414320 | No | No | Must be positive; exact position for a SNP and the position before the alteration in an indel. |
ref_allele | string | no | G | No | No | Reference base(s). This should always be one base for SNPs and insertions, and more than one base for deletions. |
alt_allele | string | no | A | No | No | Alternate base(s). This should always be one base for SNPs and deletions, and more than one base for insertions. |
gvs_all_ac | integer | no | 2 | No | No | Alternate allele count across all available samples in the WGS joint callset. |
gvs_all_an | integer | no | 4 | No | No | Allele number across all available samples in the WGS joint callset. |
gvs_all_af | float | no | 0.5 | No | No | Alternate allele frequency (AC/AN) across all available samples in the WGS joint callset. |
gvs_all_sc | integer | no | 4 | No | No | Sample count of heterozygous plus homozygous alternate genotypes. For rules on calculating this field, please see Appendix B. |
gene_symbol | string | no | TESK2 | Yes | Yes | Gene symbol. A variant can have more than one associated gene symbol, since about 3% of genes overlap. Note that transcript to gene is still one-to-one, so this field is a single gene symbol. See FAQ #1 for more info on the relationship between transcripts, genes, and variants. A null value indicates that the variant is in an IGR, i.e. the variant has no associated gene. This should only happen when transcript is null. |
transcript_source | string | no | Ensembl | Yes | Yes | |
aa_change | string | no | ENSP00000426975.1:p.(Ser534Pro) | Yes | Yes | HGVS p. nomenclature; Amino acid change. |
consequence | array<string> | no | ['splice_region_variant', 'intron_variant', 'non_coding_transcript_variant'] | Yes | Yes | Amino acid change type. |
dna_change_in_transcript | string | no | ENST00000352527.5:c.77+1714C>T | Yes | Yes | HGVS c. nomenclature; DNA change in transcript space. |
variant_type | string | no | SNV | No | No | DNA change type (HGVS). |
exon_number | string | no | 1/4 | Yes | Yes | Exon number |
intron_number | string | no | 3/9 | Yes | Yes | Intron number |
genomic_location | string | no | NC_000001.11:g.3128801A>G | No | No | HGVS g. nomenclature; variant location. |
dbsnp_rsid | array | no | rs7549050 | Yes | No | rsID |
gene_id | string | no | ENSG00000228888 | Yes | Yes | Gene ID for the transcript |
gene_omim_id | integer | no | 610972 | Yes | No | OMIM ID |
is_canonical_transcript | boolean | no | True | Yes | Yes | Primary Transcript ID (see canonical description in NIRVANA) |
gnomad_all_af | float | no | 0.25608 | Yes | No | gnomAD: "Total" frequency |
gnomad_all_ac | integer | no | 8023 | Yes | No | gnomAD: "Total" allele count |
gnomad_all_an | integer | no | 31330 | Yes | No | gnomAD: "Total" allele number |
gnomad_max_af | float | no | 0.30391 | Yes | No | gnomAD: Max subpopulation frequency. gnomAD subpopulations can be found in Table 2. |
gnomad_max_ac | integer | no | 2644 | Yes | No | gnomAD: Max subpopulation allele count. gnomAD subpopulations can be found in Table 2. To calculate the max AC, see Appendix A. |
gnomad_max_an | integer | no | 8700 | Yes | No | gnomAD: Max subpopulation allele number. gnomAD subpopulations can be found in Table 2. To calculate the max AN, see Appendix A. |
gnomad_max_subpop | string | no | afr | Yes | No | gnomAD: Max subpopulation ethnicity. gnomAD subpopulations can be found in Table 2. To calculate the max subpopulation, see Appendix A. |
gnomad_<subpop>_ac | integer | no | 8028 | Yes | No | This is actually multiple fields -- one for each of the eight gnomAD subpopulations. gnomAD: allele count in the <subpop>. gnomAD subpopulations can be found in Table 2. |
gnomad_<subpop>_an | integer | no | 31330 | Yes | No | This is actually multiple fields -- one for each of the eight gnomAD subpopulations. gnomAD: allele number in the <subpop>. gnomAD subpopulations can be found in Table 2. |
gnomad_<subpop>_af | float | no | 0.30391 | Yes | No | This is actually multiple fields -- one for each of the eight gnomAD subpopulations. gnomAD: allele frequency in the <subpop>. gnomAD subpopulations can be found in Table 2. |
gvs_max_af | float | no | 0.500 | No | No | All of Us: Max subpopulation frequency. All of Us subpopulations can be found in Table 2. To calculate the max AF, see Appendix A. |
gvs_max_ac | integer | no | 100 | No | No | All of Us: Max subpopulation allele count. All of Us subpopulations can be found in Table 2. To calculate the max AC, see Appendix A. |
gvs_max_an | integer | no | 200 | No | No | All of Us: Max subpopulation allele number. All of Us subpopulations can be found in Table 2. To calculate the max AN, see Appendix A. |
gvs_max_sc | integer | no | 200 | No | No | All of Us: Max subpopulation sample count. All of Us subpopulations can be found in Table 2. The max is calculated in the same way as max AN, max AC, etc. See Appendix A. |
gvs_max_subpop | string | no | sas | No | No | All of Us: Max subpopulation ancestry. All of Us subpopulations can be found in Table 2. To calculate the max subpopulation, see Appendix A. |
gvs_<subpop>_ac | integer | no | 100 | No | No | This is actually multiple fields -- one for each ancestry determination. All of Us subpopulations can be found in Table 2. For example, one field would be gvs_afr_ac. The subpop AC values for a site will need to be split into values for each alternate allele. |
gvs_<subpop>_an | integer | no | 1000 | No | No | This is actually multiple fields -- one for each ancestry determination. All of Us subpopulations can be found in Table 2. For example, one field would be gvs_afr_an. Allele number for samples, in the <subpop>, in this instance of GVS. |
gvs_<subpop>_af | float | no | 0.100 | No | No | This is actually multiple fields -- one for each ancestry determination. All of Us subpopulations can be found in Table 2. For example, one field would be gvs_afr_af. Alternate allele frequency (AC/AN) across samples, in the <subpop>, in this instance of GVS. |
gvs_<subpop>_sc | integer | no | 243 | No | No | This is actually multiple fields -- one for each ancestry determination. All of Us subpopulations can be found in Table 2. For example, one field would be gvs_afr_sc. Sample count of heterozygous plus homozygous alternate genotypes in the subpopulation. For rules on calculating this field, please see Appendix B. |
revel | float | no | 0.135 | Yes | No | REVEL |
splice_ai_acceptor_gain_score | float | no | 0.3 | Yes | No | Splice AI. All spliceAI values from the variant will be mapped to the proper transcript based on the gene_symbol/hgnc. This will cause some redundant info. |
splice_ai_acceptor_gain_distance | integer | no | -8 | Yes | No | Splice AI |
splice_ai_acceptor_loss_score | float | no | 0.5 | Yes | No | Splice AI |
splice_ai_acceptor_loss_distance | integer | no | 0 | Yes | No | Splice AI |
splice_ai_donor_gain_score | float | no | 0.9 | Yes | No | Splice AI |
splice_ai_donor_gain_distance | integer | no | 12 | Yes | No | Splice AI |
splice_ai_donor_loss_score | float | no | 0.1 | Yes | No | Splice AI |
splice_ai_donor_loss_distance | integer | no | -8 | Yes | No | Splice AI |
omim_phenotypes_id | array[integer] | no | [616781] | Yes | No | OMIM Disease ID. Must be ordered corresponding to omim_phenotypes_name. |
omim_phenotypes_name | array[string] | no | [“Joubert syndrome 25”] | Yes | No | OMIM Disease Name. Must be ordered corresponding to omim_phenotypes_id. |
clinvar_classification | array[string] | no | benign | Yes | No | ClinVar classification. Note that significance is an array, so we union all values, ordered by: "association", "risk factor", "protective", "affects", "conflicting data from submitters", "other", "not provided", "'-'" |
clinvar_last_updated | date | no | 2020-08-27 | Yes | No | ClinVar Classification Date |
clinvar_phenotype | array[string] | no | [“Nephronophthisis 4”, “Nephronophthisis”] | Yes | No | ClinVar Disease Name |
Column Explanations
- Field name -- The name of the field in the database
- Type -- Data type. Arrays are possible.
- Key? -- Whether this field makes up a unique key for the row. Note that all key fields together make a unique key for the row.
- Nullable? -- Whether this field is allowed to take a null value.
- transcript-specific? -- Whether this field is unique to the transcript. In other words, whether this field can change value within a single variant. For example, a single variant can have different consequences depending on the transcript, but the gnomad population data will not change based on the selected transcript for a variant.
- Notes -- Any other relevant information.
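The key and allele rules in Table 1 can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the pipeline's own code: the helper names are hypothetical, rows are assumed to be dictionaries keyed by the Table 1 field names, and the labels returned by allele_shape are illustrative rather than the values stored in variant_type.

```python
from typing import NamedTuple


class Variant(NamedTuple):
    contig: str
    position: int
    ref_allele: str
    alt_allele: str


def parse_vid(vid):
    """Split a vid of the form <contig>-<position>-<ref_allele>-<alt_allele>."""
    # Split from the right so a contig name containing "-" stays intact.
    contig, position, ref, alt = vid.rsplit("-", 3)
    return Variant(contig, int(position), ref, alt)


def allele_shape(variant):
    """Classify by allele lengths, following the ref_allele/alt_allele notes."""
    ref_len, alt_len = len(variant.ref_allele), len(variant.alt_allele)
    if ref_len == 1 and alt_len == 1:
        return "SNP"
    if ref_len == 1 and alt_len > 1:
        return "insertion"
    if ref_len > 1 and alt_len == 1:
        return "deletion"
    return "other"


def duplicate_keys(rows):
    """Return (vid, transcript) pairs that occur more than once."""
    seen, dupes = set(), set()
    for row in rows:
        key = (row["vid"], row["transcript"])
        if key in seen:
            dupes.add(key)
        seen.add(key)
    return dupes
```

Because vid and transcript together form the row key, duplicate_keys is expected to return an empty set for a well-formed VAT.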
Frequently Asked Questions about the VAT
- Why is a variant associated with multiple transcripts?
This is a question about biology and sequencing. A gene codes for a protein through transcription and translation (i.e. DNA is transcribed into mRNA, and the mRNA is then translated into a protein). In the wild, we observe that each gene actually has multiple mRNA isoforms (and therefore proteins). Scientists have mapped each mRNA isoform back to the genome; these mapped isoforms are the “transcripts” discussed in this document. This means that a variant can have different effects on the coded protein depending on the transcript, and we have to determine the effect on each transcript separately. Note that some genes overlap each other, so a variant that overlaps transcripts from more than one gene can have multiple gene names as well.
The two most common data sources for transcripts are Ensembl and RefSeq. These data sources also have rules for canonical transcripts, which provide guidance for choosing an exemplar transcript when determining the most likely effect of a variant.
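One way to apply the canonical-transcript guidance is to keep a single record per variant, preferring a row flagged by is_canonical_transcript. The sketch below assumes rows have already been parsed into dictionaries with a boolean is_canonical_transcript value. Falling back to the first record seen when no canonical transcript is flagged, and keeping the first canonical row when several are flagged, are illustrative choices and not documented rules.

```python
def one_record_per_variant(rows):
    """Pick one row per vid, preferring a canonical-transcript row."""
    chosen = {}
    for row in rows:
        vid = row["vid"]
        current = chosen.get(vid)
        if current is None:
            # First record seen for this variant becomes the fallback.
            chosen[vid] = row
        elif row.get("is_canonical_transcript") and not current.get("is_canonical_transcript"):
            # Upgrade the fallback to the first canonical record.
            chosen[vid] = row
    return chosen
```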
- Why do we suggest downstream applications use Ensembl transcripts (over RefSeq)?
RefSeq transcripts are a subset of Ensembl. This makes Ensembl more appropriate for exploratory research.
- How does a variant relate to a row in a Variant Call Format (VCF) file?
Each row in a VCF is a single genomic location (site). A VCF can have multiple variants at a site. A variant is the site, plus a reference allele and a single alternate allele. This definition of a variant keeps the VAT table design simple and ensures row independence. The transcript is required, since certain annotations, such as gene or protein impact, can change depending on the transcript used for annotating a variant. For more information about the relationship between variants and transcripts, please see FAQ #1.
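The site-to-variant relationship can be illustrated with a short Python sketch that expands one VCF row into one vid per alternate allele. It is a simplification: alleles are used exactly as written in the VCF row, so any allele trimming or normalization done when the VAT is built is not modeled, and the function name is hypothetical.

```python
def site_to_vids(contig, position, ref, alts):
    """Return one vid per alternate allele listed in a VCF ALT column."""
    # ALT is a comma-separated list; each entry becomes its own variant.
    return [f"{contig}-{position}-{ref}-{alt}" for alt in alts.split(",")]


vids = site_to_vids("chr1", 100, "C", "T,CT")
```

Here a single multi-allelic site yields two variants, each with exactly one alternate allele.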
- Why are so many fields null for some variants?
When a variant does not overlap a gene (and therefore does not overlap any transcripts), any annotation regarding the protein coding of the gene (e.g. aa_change) will be null, since there is no applicable value for the field without a transcript overlap. Note that we annotate variants as upstream or downstream of a transcript, which means that overlap with a transcript extends beyond the actual boundaries of the transcript (by up to 5,000 bases).
- Can information in the VAT be shared publicly?
No. Variant counts can be under 20, and sharing such counts publicly is against policy. Please see the All of Us Statistics Dissemination Policy.
- In the VAT, what computed ancestries are included in gnomAD (gnomad_*) and All of Us allele metric annotations (gvs_*)?
Table 2 lists the computed ancestries with allele metric annotations in the VAT.
Table 2 -- Computed ancestries available for allele metric annotations in the VAT.
Computed Ancestry (abbreviation) | gnomAD annotations | All Of Us annotations | Notes |
African (afr) | Yes | Yes | |
Latino/Admixed American (amr) | Yes | Yes | |
Ashkenazi Jewish (asj) | Yes | No | |
East Asian (eas) | Yes | Yes | |
European (eur) | No (see note) | Yes | Includes both Finnish (fin) and Non-Finnish European (nfe) |
Finnish (fin) | Yes | No | |
Middle-Eastern (mid) | No | Yes | |
Non-Finnish European (nfr) | No | No | The nfr subpopulation appears as nfr in the VAT but correlates to the nfe subpopulation in gnomAD. Please note that the nfr annotations are missing in the VAT. |
South Asian (sas) | Yes | Yes | |
Other (oth) | Yes | Yes | |
Appendix A: Calculating *_max_* fields in the VAT
The VAT schema specifies several fields with “_max_” in the name. Each holds the value for the subpopulation with the maximum AF for the variant, in either the gnomAD or the GVS corpus.
For example, gnomad_max_ac is “in gnomAD, the AC for the subpopulation with the highest AF”.
Likewise, gvs_max_an is “in GVS, the AN for the subpopulation with the highest AF”.
Example to populate gnomad_max_ac and gnomad_max_subpop
We have eight ancestry subpopulations in gnomad:
- afr
- amr
- eas
- fin
- nfr
- asj
- oth
- sas
Therefore, we will already have the following AC fields in the VAT:
- gnomad_afr_ac
- gnomad_amr_ac
- gnomad_eas_ac
- gnomad_fin_ac
- gnomad_nfe_ac
- gnomad_asj_ac
- gnomad_oth_ac
- gnomad_sas_ac
And AF fields:
- gnomad_afr_af
- gnomad_amr_af
- gnomad_eas_af
- gnomad_fin_af
- gnomad_nfe_af
- gnomad_asj_af
- gnomad_oth_af
- gnomad_sas_af
Then let’s say that we have the following values for each AF field:
- gnomad_afr_af: 0.75
- gnomad_amr_af: 0.1
- gnomad_eas_af: 0.1
- gnomad_fin_af: 0.1
- gnomad_nfe_af: 0.1
- gnomad_asj_af: 0.1
- gnomad_oth_af: 0.1
- gnomad_sas_af: 0.1
Since gnomad_afr_af is the highest, gnomad_max_subpop is “afr” and gnomad_max_ac is the value in gnomad_afr_ac. Ties are broken by alphabetical order of subpopulation (i.e. “afr” would come before “oth”).
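The selection rule above can be sketched in Python as follows. The function name and the dictionary inputs are hypothetical, the AC values in the usage example are invented for illustration, and treating a null AF as "not a candidate" is an assumption rather than a documented rule.

```python
def max_subpop_fields(af, ac, an):
    """Return (subpop, af, ac, an) for the subpopulation with the highest AF.

    af, ac and an map a subpopulation abbreviation to its value;
    None marks a null value.
    """
    candidates = [subpop for subpop, value in af.items() if value is not None]
    if not candidates:
        return None
    # Highest AF wins; ties are broken alphabetically by subpopulation.
    best = min(candidates, key=lambda subpop: (-af[subpop], subpop))
    return best, af[best], ac.get(best), an.get(best)


# AF values from the example above; AC values are made up for illustration.
example_af = {"afr": 0.75, "amr": 0.1, "eas": 0.1, "fin": 0.1,
              "nfe": 0.1, "asj": 0.1, "oth": 0.1, "sas": 0.1}
example_ac = {"afr": 30, "amr": 4, "eas": 4, "fin": 4,
              "nfe": 4, "asj": 4, "oth": 4, "sas": 4}
result = max_subpop_fields(example_af, example_ac, {})
```

With these inputs the first element of result is "afr" and the third is the afr allele count, matching the worked example.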
Appendix B: Calculating sample_count (and n_hets, n_hom_alts)
The sample_count is defined as the number of samples having the alternate allele in the variant. This can be one copy (“heterozygous (het)”) or two copies (“homozygous alternate (homalt)”). We cannot calculate the sample_count from AC and AN, since we do not know how many samples are het or homalt. In this algorithm, we assume a ploidy of two. The sample_count is allele specific. We also calculate n_hets and n_hom_alts, but we currently only report sample_count (Table 1).
Please see the example below from a VCF:
CHROM | POS | REF | ALT | SAMPLE1 | SAMPLE2 | SAMPLE3 | SAMPLE4 | SAMPLE5 | SAMPLE6 |
chr1 | 100 | C | T,CT | 0/0 | 0/1 | 0/2 | 1/1 | 1/2 | ./. |
We expect to have two variants:
vid | sample_count | n_het | n_homalt |
chr1-100-C-T | 3 (SAMPLE 2,4,5) | 2 (SAMPLE 2,5) | 1 (SAMPLE 4) |
chr1-100-C-CT | 2 (SAMPLE 3,5) | 2 (SAMPLE 3,5) | 0 |
sample_count = the number of genotypes with a call that includes the alternate allele.
n_het = the number of genotypes with a het call including the alternate allele.
n_homalt = the number of genotypes with a homalt call for the alternate allele. “0/0” is never a homalt call; it is homozygous reference (“homref”).
Other rules:
- If a genotype is filtered (FT field exists and is not “.” nor “PASS”), then the genotype should not count towards sample_count, n_het, nor n_homalt.
- If a genotype is a no_call (“./.” or missing), then the genotype should not count towards sample_count, n_het, nor n_homalt.
- If a genotype is a partial call (e.g. “./1”), then the genotype should count towards sample_count, but not n_het, nor n_homalt.
- AC = 2(n_homalt) + n_het + partial call
- AN = 2(n_het + n_homalt + n_homref) + partial call
- SC = AC_Homalt/2 + AC_Het + AC_Hemi
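The counting rules above can be sketched in Python for a single alternate allele. This is an illustration under stated assumptions and not the production algorithm: the function name is hypothetical, genotypes are passed as plain GT strings, FT values are passed as an optional parallel list, and ploidy two is assumed as in the text.

```python
import re


def count_alt_allele(genotypes, alt_index, filters=None):
    """Count sample_count, n_het, n_homalt and partial calls for one alt allele.

    genotypes: list of GT strings such as "0/1" or "./." (ploidy two assumed).
    alt_index: 1-based index of the alternate allele in the VCF ALT column.
    filters:   optional list of FT values aligned with genotypes.
    """
    counts = {"sample_count": 0, "n_het": 0, "n_homalt": 0, "n_partial": 0}
    target = str(alt_index)
    for i, gt in enumerate(genotypes):
        ft = filters[i] if filters else None
        if ft not in (None, ".", "PASS"):
            continue  # filtered genotype: ignored entirely
        alleles = re.split(r"[/|]", gt) if gt else []
        called = [allele for allele in alleles if allele != "."]
        if not called or target not in called:
            continue  # no-call, or the genotype lacks this alternate allele
        counts["sample_count"] += 1
        if len(called) < len(alleles):
            counts["n_partial"] += 1  # partial call such as "./1"
        elif called.count(target) == len(called):
            counts["n_homalt"] += 1
        else:
            counts["n_het"] += 1
    return counts


# Genotypes from the example VCF row (SAMPLE1..SAMPLE6).
example_gts = ["0/0", "0/1", "0/2", "1/1", "1/2", "./."]
counts_t = count_alt_allele(example_gts, 1)   # chr1-100-C-T
counts_ct = count_alt_allele(example_gts, 2)  # chr1-100-C-CT
```

For the example row this reproduces the table above, and 2 * n_homalt + n_het + n_partial gives the AC for each alternate allele.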