Smaller Callsets for Analyzing Short Read WGS SNP & Indel Data with Hail MT, VCF, and PLINK


We released the srWGS SNP and Indel callset over limited genomic regions in familiar data formats: VCF, Hail MT, BGEN, and PLINK bed. These reduced datasets are available in addition to the complete callset, which is available in VariantDataset (VDS) format. We recommend that all researchers use the reduced datasets where possible, as downstream analysis is easier with these formats than with the VDS.

The smaller callsets, described in Table 1, cover regions of the genome that are popular with All of Us researchers: an Allele Count/Allele Frequency (ACAF) threshold callset, an exome callset, and a ClinVar callset. Each callset is available in five data formats, described in Table 2. We provide the genomic territory used to create the three smaller callsets as UCSC BED files.
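To make the role of the BED territory files concrete, here is a minimal sketch of checking whether a variant position falls inside a BED-defined territory. It relies on the UCSC BED convention that coordinates are 0-based and half-open, while VCF positions are 1-based. The function names and the toy intervals are hypothetical and are not taken from the All of Us files.

```python
def parse_bed(lines):
    """Parse UCSC BED lines into {chrom: [(start, end), ...]}.

    BED coordinates are 0-based and half-open: [start, end).
    """
    intervals = {}
    for line in lines:
        line = line.strip()
        # Skip blank lines, comments, and UCSC header lines.
        if not line or line.startswith(("#", "track", "browser")):
            continue
        chrom, start, end = line.split()[:3]
        intervals.setdefault(chrom, []).append((int(start), int(end)))
    return intervals


def in_territory(intervals, chrom, pos_1based):
    """Return True if a 1-based (VCF-style) position lies in any interval."""
    pos0 = pos_1based - 1  # convert to the 0-based BED coordinate system
    # Linear scan for clarity; a real territory would use an interval index.
    return any(start <= pos0 < end for start, end in intervals.get(chrom, []))


# Toy territory (hypothetical intervals, for illustration only).
toy_bed = ["chr1\t999\t1010", "chr2\t0\t5"]
territory = parse_bed(toy_bed)
```

With this toy territory, position chr1:1000 is inside the first interval, while chr1:1011 is just past its half-open end.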

The Hail VDS format, described in depth in ‘The new VDS format for All of Us srWGS data’, is optimal for storing large callsets, but few analysis methods currently operate directly on a VDS. Most downstream analyses involve converting the VDS to another format, like a VCF or Hail MT, which can be time-consuming and expensive. Since many researchers do not need the entire genome for their analyses, we provide these pre-made reduced callsets to save researchers time and money.

It’s important to note that converting variant data from a VDS to a VCF or Hail MT greatly increases the storage footprint for the same amount of information. If you use the VDS, you will need to reduce the dataset before converting the format and using it for analyses - the v7 callset is very large!
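The growth in storage footprint can be illustrated with a toy model. The sketch below is not Hail's implementation or its on-disk layout. It assumes a simplified sparse representation (per-sample reference blocks plus per-sample variant entries) and renders it into a dense grid with one entry per site per sample. All names and data are hypothetical.

```python
def make_sample(variant_positions, region_end=400):
    """Build a toy sparse sample: reference blocks around its variant calls."""
    blocks, start = [], 1
    for pos in sorted(variant_positions):
        if start <= pos - 1:
            blocks.append((start, pos - 1))  # inclusive reference block
        start = pos + 1
    if start <= region_end:
        blocks.append((start, region_end))
    return {"ref_blocks": blocks,
            "variants": {pos: "0/1" for pos in variant_positions}}


def densify(samples):
    """Render every variant site for every sample (the dense grid)."""
    sites = sorted({pos for s in samples.values() for pos in s["variants"]})
    dense = {}
    for pos in sites:
        row = {}
        for sample_id, s in samples.items():
            if pos in s["variants"]:
                row[sample_id] = s["variants"][pos]
            elif any(lo <= pos <= hi for lo, hi in s["ref_blocks"]):
                row[sample_id] = "0/0"   # covered by a reference block
            else:
                row[sample_id] = "./."   # no data at this position
        dense[pos] = row
    return dense


def sparse_entry_count(samples):
    return sum(len(s["ref_blocks"]) + len(s["variants"])
               for s in samples.values())


def dense_entry_count(dense):
    return sum(len(row) for row in dense.values())


# Four toy samples, each with two private variants.
toy = {f"s{i}": make_sample([100 * i + 10, 100 * i + 20]) for i in range(4)}
dense = densify(toy)
```

In this toy, the sparse form holds 20 entries and the dense grid holds 32. The dense count grows with sites × samples, which is why reducing the dataset before conversion matters.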

The genomic data are all available through the Researcher Workbench, and their locations can be found in the Controlled CDR Directory document. We encourage you to reach out to the Help Desk (support@researchallofus.org) if you have questions or are unsure which data format to use for your research study.

Note: the smaller srWGS callsets were updated to improve usability and include some missing ACAF variants; more details are available in this article.

 

Table 1 -- srWGS SNP & Indel callset deliverables

Region: ACAF Threshold
Samples: All
Formats (see Table 2 for descriptions): VCF, Hail multi MT, Hail split MT, BGEN, PLINK bed
Notes:
  • Variants that are frequent in the AoU computed ancestry subpopulations.
  • The cutoff is population-specific allele frequency (AF) > 1% OR population-specific allele count (AC) > 100, in any computed ancestry subpopulation.
  • The AC and AF values are stored in the Variant Annotation Table (VAT) fields gvs_max_ac and gvs_max_af. The gvs_max_ac and gvs_max_af calculation is described in the VAT doc.

Region: Exome
Samples: All
Formats (see Table 2 for descriptions): VCF, Hail multi MT, Hail split MT, BGEN, PLINK bed
Notes:
  • Variants that are within exon regions of the Gencode v42 basic transcripts.
  • This includes protein_coding transcripts with a padding of 15 bases on both sides of each exon to capture splice site variants. In addition, PGx sites are included.

Region: ClinVar
Samples: All
Formats (see Table 2 for descriptions): VCF, Hail multi MT, Hail split MT, BGEN, PLINK bed
Notes:
  • Variants in ClinVar. Note that these are not limited to pathogenic or likely pathogenic variants.

Region: Whole Genome
Samples: All
Formats: VDS
Notes:
  • All variants for all srWGS SNP & Indel samples.
  • The Hail VDS is a sparse data storage format: it stores less data but more information.
    1. For each sample, a majority of the genome is stored as reference blocks.
    2. Variant data at each site is stored only for samples with a variant allele.
    3. Converting a VDS to a VCF or Hail MT involves rendering all variant calls, or “densifying” the VDS.
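The ACAF Threshold cutoff listed in Table 1 (AF > 1% OR AC > 100 in any computed ancestry subpopulation) can be sketched as a simple predicate. This is an illustration of the stated rule only, not the pipeline that produced the callset. The input layout (a dict of per-subpopulation `ac` and `af` values) and the subpopulation labels are hypothetical.

```python
def passes_acaf(subpop_stats, af_cutoff=0.01, ac_cutoff=100):
    """Return True if any subpopulation has AF > af_cutoff OR AC > ac_cutoff.

    `subpop_stats` maps a subpopulation label to {"ac": int, "af": float}.
    Both comparisons are strict, matching the ">" in the stated cutoff.
    """
    return any(
        stats["af"] > af_cutoff or stats["ac"] > ac_cutoff
        for stats in subpop_stats.values()
    )


# Hypothetical variants, for illustration only.
common_by_af = {"pop_a": {"ac": 5, "af": 0.02}, "pop_b": {"ac": 0, "af": 0.0}}
common_by_ac = {"pop_a": {"ac": 150, "af": 0.001}}
at_the_boundary = {"pop_a": {"ac": 100, "af": 0.01}}
```

Here `common_by_af` and `common_by_ac` each pass on one criterion, while `at_the_boundary` does not pass because neither value strictly exceeds its cutoff.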

 

Table 2 -- Smaller callset format details

Format | FILTER sites removed? | FT filtering? | Multiallelics split? | Sharded by chromosome? | Notes
Variant Call Format (VCF) | No | Yes | No | Yes | bgz compressed
Multiallelic Hail MatrixTable (Hail multi MT) | No | Yes | No | No | Multiple alternate alleles per site, so there is one row per genomic locus.
Multiallelic split Hail MatrixTable (Hail split MT) | Yes | Yes | Yes | No | Each alternate allele gets a separate row, so there can be multiple rows for one genomic locus.
Binary GEN format (BGEN) | Yes | Yes | Yes | Yes | Hard calls only (probability values of 0.0 or 1.0)
PLINK binary biallelic genotype table (bed) | Yes | Yes | Yes | Yes |
VDS | No | No | NA | No |
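To illustrate what "hard calls only" means for the BGEN files, the sketch below maps a biallelic diploid genotype to a probability triple in which one entry is 1.0 and the others are 0.0. It assumes the usual (hom-ref, het, hom-alt) ordering for a diploid biallelic site and represents a missing call as `None`. The function name is hypothetical, and this is not a BGEN reader or writer.

```python
def hard_call_to_probs(alt_allele_count):
    """Map a hard call (0, 1, or 2 alt alleles) to genotype probabilities.

    Returns (P(hom-ref), P(het), P(hom-alt)), or None for a missing call.
    """
    if alt_allele_count is None:
        return None  # missing genotype: no probabilities to report
    if alt_allele_count not in (0, 1, 2):
        raise ValueError("expected 0, 1, 2, or None for a diploid biallelic call")
    return tuple(1.0 if k == alt_allele_count else 0.0 for k in range(3))
```

A heterozygous call, for example, becomes `(0.0, 1.0, 0.0)`, so no genotype uncertainty is carried in the file.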

 

Notes:

  • Format -- Description of each format we provide for the smaller callsets. More information about the data formats is covered in the article ‘How the All of Us Genomic Data are Organized’ on the User Support Hub.
  • FILTER sites removed -- Whether we removed sites flagged with a non-PASS (and non-missing) filter value (e.g., “ExcessHet”). If “Yes”, these sites do not appear in the file(s).
  • FT filtering -- Whether we filter out genotypes that fail genotype filtering (FT). The failed genotypes appear as missing calls (“no calls”).
  • Multiallelics split -- Whether we split multiallelic sites into separate records. E.g., 1:1000 C→A,T becomes 1:1000 C→A and 1:1000 C→T.
  • Sharded by chromosome -- Whether we provide separate files for each chromosome.
  • AS_YNG and AS_VQSLOD are not present in any smaller callsets (see Known Issue #2).
  • For all the smaller callset formats, we drop sites with more than 100 alternate alleles because they are typically not useful. Sites with more than 100 alternate alleles are available in the whole genome VDS (see ‘The new VariantDataset (VDS) format for All of Us short read WGS data’).
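The per-format transformations described in the notes above can be sketched on toy records. This is an illustration under stated assumptions, not the code used to build the callsets: records are plain dicts with hypothetical keys, an FT value other than PASS or missing is assumed to mean the genotype failed, and the split step handles site-level fields only (genotype recoding is omitted).

```python
MAX_ALTS = 100  # sites with more alternate alleles than this are dropped


def keep_site(record):
    """FILTER sites removed: drop sites with a non-PASS, non-missing FILTER."""
    return record["filter"] in (None, ".", "PASS")


def mask_failed_genotype(gt, ft):
    """FT filtering: a genotype that fails FT becomes a no-call."""
    return gt if ft in (None, ".", "PASS") else "./."


def split_multiallelic(record):
    """Multiallelics split: emit one record per alternate allele."""
    return [{**record, "alts": [alt]} for alt in record["alts"]]


def has_too_many_alts(record):
    """True for sites with more than MAX_ALTS alternate alleles."""
    return len(record["alts"]) > MAX_ALTS


# The example from the notes: 1:1000 C→A,T
site = {"chrom": "1", "pos": 1000, "ref": "C", "alts": ["A", "T"],
        "filter": "PASS"}
flagged = {**site, "filter": "ExcessHet"}
split = split_multiallelic(site)
```

Here `split` holds two records, 1:1000 C→A and 1:1000 C→T, and `flagged` would be removed from any format marked “Yes” under FILTER sites removed.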

 

Table 3 -- Variant and site numbers for smaller callsets

Smaller callset | Number of sites | Number of variants | Number of samples
ACAF Threshold | 48,314,438 | 99,250,816 | 245,394
Exome | 30,013,262 | 34,807,589 | 245,394
ClinVar | 921,988 | 1,281,259 | 245,394
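As a quick reading aid for Table 3, the sketch below computes the average number of variants per site for each smaller callset from the counts above. It assumes that “Number of variants” counts alternate alleles, so a multiallelic site contributes more than one variant. That interpretation is an assumption, not something the table states.

```python
# (number of sites, number of variants), copied from Table 3
callsets = {
    "ACAF Threshold": (48_314_438, 99_250_816),
    "Exome": (30_013_262, 34_807_589),
    "ClinVar": (921_988, 1_281_259),
}


def variants_per_site(sites, variants):
    """Average number of variants recorded per site."""
    return variants / sites


ratios = {name: variants_per_site(sites, variants)
          for name, (sites, variants) in callsets.items()}
```

Rounded to two decimals, the ratios are about 2.05 for ACAF Threshold, 1.16 for Exome, and 1.39 for ClinVar.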

 
