Smaller Callsets for Analyzing Short Read WGS SNP & Indel Data with Hail MT, VCF, and PLINK


We released the srWGS SNP and Indel callset over limited genomic regions in familiar data formats: Variant Call Format (VCF), Hail MatrixTable (MT), Binary GEN format (BGEN), PLINK 2 binary genotype table (PGEN), and PLINK binary biallelic genotype table (PLINK bed). These reduced datasets are provided in addition to the complete callset, which is available in VariantDataset (VDS) format. We recommend that all researchers use the reduced datasets where possible, as downstream analysis is easier with these formats than with the VDS.

The smaller callsets, described in Table 1, cover regions of the genome that are popular among All of Us researchers: an Allele Count/Allele Frequency (ACAF) threshold callset, an exome callset, and a ClinVar callset. Each callset is available in five data formats, described in Table 2 (the Hail MT is provided in both a multiallelic and a split version). We provide the genomic territory used to create the three smaller callsets as UCSC BED files.
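To check whether a variant of interest falls inside one of these territories, its position can be compared against the BED intervals. The following is a minimal sketch in plain Python, assuming the standard UCSC BED convention (whitespace-separated columns, 0-based start, end-exclusive) and 1-based VCF-style positions. The function names and the example intervals are hypothetical and are not taken from the actual All of Us BED files.

```python
from collections import defaultdict


def load_bed(lines):
    """Parse BED lines into {chrom: [(start, end), ...]} (0-based, end-exclusive)."""
    regions = defaultdict(list)
    for line in lines:
        line = line.strip()
        # Skip blank lines and optional BED header/comment lines.
        if not line or line.startswith(("#", "track", "browser")):
            continue
        chrom, start, end = line.split()[:3]
        regions[chrom].append((int(start), int(end)))
    return dict(regions)


def in_territory(regions, chrom, pos_1based):
    """Return True if a 1-based (VCF-style) position lies inside any interval."""
    pos0 = pos_1based - 1  # BED coordinates are 0-based
    return any(start <= pos0 < end for start, end in regions.get(chrom, []))


# Hypothetical interval covering 1-based positions 1000 through 1010 on chr1.
example_regions = load_bed(["chr1\t999\t1010"])
print(in_territory(example_regions, "chr1", 1000))
print(in_territory(example_regions, "chr1", 1011))
```

The linear scan keeps the sketch short; for a large territory file, an interval index or a dedicated tool would be a better fit.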

The Hail VDS format, described in depth in ‘The new VDS format for All of Us srWGS data’, is optimal for storing large callsets, but few analysis methods can currently be run on the VDS directly. Most downstream analyses involve converting the VDS to another format, like a VCF or Hail MT, which can be time-consuming and expensive. Since many researchers do not need the entire genome for their analyses, we provide these pre-made reduced callsets to save researchers time and money.

It’s important to note that converting variant data from a VDS to a VCF or Hail MT greatly increases the storage footprint for the same amount of information. If you use the VDS, you will need to reduce the dataset before converting the format and using it for analyses, because the CDRv8 callset is very large.
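The size increase comes from rendering a call for every sample at every site. A sparse store keeps entries only for samples that carry a variant allele, while a dense matrix holds one entry per site per sample. The toy example below illustrates the idea only. It is not the actual VDS layout or the Hail API, the sample IDs and sites are hypothetical, and it assumes that every sample without a stored entry is homozygous reference (the real VDS stores reference data as blocks).

```python
# Sparse, VDS-like toy storage: only non-reference genotypes are kept per site.
samples = ["s1", "s2", "s3", "s4", "s5"]
sparse_calls = {
    ("chr1", 1000): {"s2": "0/1"},
    ("chr1", 2000): {"s1": "1/1", "s4": "0/1"},
    ("chr1", 3000): {"s5": "0/1"},
}


def densify(sparse, sample_ids, ref_call="0/0"):
    """Render one genotype per site per sample, filling in reference calls."""
    return {
        site: [calls.get(sample, ref_call) for sample in sample_ids]
        for site, calls in sparse.items()
    }


dense_calls = densify(sparse_calls, samples)

# Count how many genotype entries each representation has to hold.
sparse_entries = sum(len(calls) for calls in sparse_calls.values())
dense_entries = sum(len(row) for row in dense_calls.values())
print(f"sparse entries: {sparse_entries}, dense entries: {dense_entries}")
```

In this toy case 4 stored entries become 15 after densifying. With hundreds of thousands of samples and mostly rare variants, the gap is far larger.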

The genomic data are all available through the Researcher Workbench, and their locations can be found in the Controlled CDR Directory document. We encourage you to reach out to the Help Desk (support@researchallofus.org) with any questions, or if you are unsure which data format to use for your research study.

 

Table 1 -- srWGS SNP & Indel callset deliverables

Region: ACAF Threshold
Samples: All
Format (see Table 2 for descriptions): VCF, Hail multi MT, Hail split MT, BGEN, PGEN, PLINK bed
Notes:
  • Variants that are frequent in the AoU computed ancestry subpopulations.
  • The cutoff we use is population-specific allele frequency (AF) > 1% OR population-specific allele count (AC) > 100, in any computed ancestry subpopulations.
  • The AC and AF values are stored in the Variant Annotation Table (VAT) fields, gvs_max_ac and gvs_max_af. The gvs_max_ac and gvs_max_af calculation is described in the VAT doc.

Region: Exome
Samples: All
Format (see Table 2 for descriptions): VCF, Hail multi MT, Hail split MT, BGEN, PGEN, PLINK bed
Notes:
  • Variants that are within exon regions of the Gencode v42 basic transcripts
  • This includes protein_coding transcripts with a padding of 15 bases to capture splice site variants on both sides of each exon. In addition, PGx sites are included.

Region: ClinVar
Samples: All
Format (see Table 2 for descriptions): VCF, Hail multi MT, Hail split MT, BGEN, PGEN, PLINK bed
Notes:
  • Variants in ClinVar. Note that these are not limited to pathogenic or likely pathogenic variants.
  • Please see Known Issue #4 in the QC report, as a small number of variants in ClinVar are missing.

Region: Whole Genome
Samples: All
Format (see Table 2 for descriptions): VDS
Notes:
  • All variants for all srWGS SNP & Indel samples
  • The Hail VDS is a sparse data storage format; it stores less data but more information
    1. For each sample, a majority of the genome is stored as reference blocks
    2. Variant data at each site is only stored for samples with a variant allele
    3. Converting a VDS to a VCF or Hail MT involves rendering all variant calls, or “densifying” the VDS
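
The ACAF cutoff described in Table 1 can be written as a simple predicate: a variant is kept if, in any computed ancestry subpopulation, AF > 1% or AC > 100. The sketch below only illustrates that rule and is not the code used to build the callset. The subpopulation labels and the per-population input structure are hypothetical; in the released data, the relevant values are stored in the VAT fields gvs_max_ac and gvs_max_af.

```python
AF_CUTOFF = 0.01  # population-specific allele frequency must exceed 1%
AC_CUTOFF = 100   # or population-specific allele count must exceed 100


def passes_acaf(per_population):
    """Apply the ACAF rule to one variant.

    per_population maps a subpopulation label to an (AC, AF) pair.
    Returns True if any subpopulation exceeds either cutoff.
    """
    return any(
        ac > AC_CUTOFF or af > AF_CUTOFF
        for ac, af in per_population.values()
    )


# Hypothetical values: pop_b exceeds the AC cutoff even though its AF is below 1%.
print(passes_acaf({"pop_a": (5, 0.0002), "pop_b": (150, 0.004)}))
```

Both comparisons are strict, so a variant with AC of exactly 100 and AF of exactly 1% in every subpopulation would not pass under this reading of the rule.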

 

Table 2 -- Smaller callset format details

| Format | FILTER sites removed? | FT filtering? | Multiallelics split? | Sharded by chromosome? | INFO calculations | Notes |
|---|---|---|---|---|---|---|
| VCF | No | No | No | No* | non-FAIL | bgz compressed. *Widely scattered with a manifest |
| Multiallelic Hail MT (Hail multi MT) | No | No | No | No | All data | Multiple alternate alleles per site, so there is one row per genomic locus |
| Multiallelic split Hail MT (Hail split MT) | Yes | No | Yes | No | All data | Each alternate allele gets a separate site, so there can be multiple rows for one genomic locus |
| BGEN | Yes | Yes | Yes | Yes | All data | Hard calls only (probability values of 0.0 or 1.0) |
| PGEN | No | Yes | No | Yes | non-FAIL | Hard calls only |
| PLINK bed | Yes | Yes | Yes | Yes | All data | Hard calls only. Variants with AC=0 are removed |
| VDS | No | No | No | No | NA | |

 

Notes:

  • Format -- description of each format we provide for the smaller callsets. More information about the data formats is covered in the article How the All of Us Genomic Data are Organized
  • FILTER sites removed -- whether we removed sites flagged with a non-PASS (and non-missing) filter value (e.g., “ExcessHet”). If “Yes”, these sites do not appear in the file(s).
  • FT filtering -- whether we filter out genotypes when they fail genotype filtering (FT). The failed genotypes appear as missing calls (“no calls”).
  • Multiallelics split -- whether we split multiallelic sites into separate records. E.g., 1:1000 C→A,T becomes 1:1000 C→A and 1:1000 C→T
  • Sharded by chromosome -- whether we provide separate files for each chromosome.
  • INFO calculations -- whether we provide INFO field data for all sites or only sites that passed filtering
  • For all the smaller callset formats, we drop sites with more than 100 alternate alleles because they are typically not useful. Sites with more than 100 alternate alleles are available in the whole genome VDS; see the article The new VariantDataset (VDS) format for All of Us short read WGS data.
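
Two of the operations described in these notes, splitting multiallelic sites and turning genotypes that fail FT into missing calls, can be illustrated on a toy record. This is only a sketch of the idea, not the pipeline used to build the callsets. The record layout, the function names, and the FT value "low_qual" are hypothetical, and the sketch assumes that "PASS" and "." (missing) are the only non-failing FT values.

```python
def split_multiallelic(chrom, pos, ref, alts):
    """Turn one multiallelic site into one biallelic record per alternate allele."""
    return [(chrom, pos, ref, alt) for alt in alts]


def apply_ft_filter(genotypes, ft_values, missing="./."):
    """Replace genotypes whose FT value marks a failure with a missing call."""
    return [
        gt if ft in ("PASS", ".") else missing
        for gt, ft in zip(genotypes, ft_values)
    ]


# 1:1000 C→A,T becomes 1:1000 C→A and 1:1000 C→T.
records = split_multiallelic("1", 1000, "C", ["A", "T"])
print(records)

# The second genotype fails its (hypothetical) FT filter and becomes a no call.
masked = apply_ft_filter(["0/1", "1/1", "0/0"], ["PASS", "low_qual", "."])
print(masked)
```

The sketch deliberately leaves out how sample genotypes are re-coded after a split, since that convention depends on the tool used.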

 

Table 3 -- Variant and site numbers for smaller callsets.

The number of sites was counted from the Hail multi MT, and the number of variants was counted from the Hail split MT.

| Smaller callset | Number of sites | Number of variants | Number of samples |
|---|---|---|---|
| ACAF Threshold | 56,993,186 | 116,456,419 | 414,830 |
| Exome | 38,150,243 | 45,704,594 | 414,830 |
| ClinVar | 1,510,012 | 2,180,727 | 414,830 |

 
