Smaller Callsets for Analyzing Short Read WGS SNP & Indel Data with Hail MT, VCF, and PLINK

We released the srWGS SNP and Indel callset over limited genomic regions in familiar data formats: Variant Call Format (VCF), Hail MatrixTable (MT), Binary GEN format (BGEN), PLINK 2 binary genotype table (PGEN), and PLINK binary biallelic genotype table (PLINK bed). These reduced datasets are available in addition to the complete callset, which is available in VariantDataset (VDS) format. We recommend that all our researchers make use of the reduced datasets if possible, as downstream analysis will be easier with these formats than with the VDS.

The smaller callsets, described in Table 1, cover regions of the genome that are popular among All of Us researchers: an Allele Count/Allele Frequency (ACAF) threshold callset, an exome callset, and a ClinVar callset. Each callset is available in five data formats, described in Table 2 (the Hail MT is provided in both a multiallelic and a split version). We provide the genomic territory used to create the three smaller callsets as UCSC BED files.
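As an illustration of how a BED territory file can be used, the sketch below checks whether a variant position falls inside a set of BED intervals. It relies on the UCSC BED convention of 0-based, half-open intervals, while VCF positions are 1-based. The function names and the example intervals are made up for illustration and are not taken from the released files.

```python
from typing import Iterable, List, Tuple

Interval = Tuple[str, int, int]  # (chrom, 0-based start, exclusive end)


def parse_bed(lines: Iterable[str]) -> List[Interval]:
    """Parse the first three columns of BED lines, skipping headers and blanks."""
    intervals = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith(("#", "track", "browser")):
            continue
        chrom, start, end = line.split()[:3]
        intervals.append((chrom, int(start), int(end)))
    return intervals


def in_territory(chrom: str, pos_1based: int, intervals: List[Interval]) -> bool:
    """Return True if a 1-based (VCF-style) position lies in any BED interval."""
    pos0 = pos_1based - 1  # convert to the 0-based BED coordinate
    return any(c == chrom and s <= pos0 < e for c, s, e in intervals)


# Made-up territory for illustration only.
example_bed = ["chr1\t999\t1010", "chr2\t5000\t5100"]
territory = parse_bed(example_bed)
print(in_territory("chr1", 1000, territory))  # first base covered by the interval
print(in_territory("chr1", 1011, territory))  # first base past the interval end
```

A linear scan is fine for a handful of intervals; for a full territory file, a sorted or indexed interval structure would be a better choice.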

The Hail VDS format, described in depth in ‘The new VDS format for All of Us srWGS data’, is optimal for storing large callsets, but few analysis methods currently operate on the VDS directly. Most downstream analyses involve converting the VDS to another format, like a VCF or Hail MT, which can be time-consuming and expensive. Since many researchers do not need the entire genome for their analyses, we provide these pre-made reduced callsets to save researchers time and money.

Converting variant data from a VDS to a VCF or Hail MT greatly increases the storage footprint for the same amount of information. If you use the VDS, you will need to reduce the dataset before converting it to another format for analysis: the CDRv8 callset is very large!
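A minimal sketch of the "reduce first, then convert" order in Hail follows. The region, the function names, and the use of GRCh38 as the reference genome are assumptions made for illustration; the VDS path must be supplied by the caller (locations are listed in the Controlled CDR Directory). Hail is imported inside the function so that the small string helper can be run without a Hail installation.

```python
from typing import List, Tuple


def to_interval_strings(regions: List[Tuple[str, int, int]]) -> List[str]:
    """Format (contig, start, end) tuples as 'contig:start-end' strings."""
    return [f"{contig}:{start}-{end}" for contig, start, end in regions]


def densify_regions(vds_path: str, regions: List[Tuple[str, int, int]]):
    """Filter a VDS to the given regions, then densify only that subset."""
    import hail as hl  # imported lazily so the helper above works without Hail

    vds = hl.vds.read_vds(vds_path)
    intervals = [
        # GRCh38 is an assumption here; use the genome build of your data.
        hl.parse_locus_interval(s, reference_genome="GRCh38")
        for s in to_interval_strings(regions)
    ]
    # Reduce first: keep only the data that falls inside these intervals.
    vds = hl.vds.filter_intervals(vds, intervals, keep=True)
    # Densify last: this is the step that inflates the storage footprint.
    return hl.vds.to_dense_mt(vds)


# Hypothetical region; prints the interval string that would be passed to Hail.
print(to_interval_strings([("chr21", 10_000_000, 10_500_000)]))
```

The returned MatrixTable is lazy, so the cost is paid when it is written out or aggregated. Keeping the region list small is what keeps that cost manageable.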

The genomic data are all available through the Researcher Workbench and their locations can be found in the Controlled CDR Directory document. We encourage you to reach out to the Help Desk (support@researchallofus.org) with any questions if you are unsure of what data format to use for your research study.

Table 1 -- srWGS SNP & Indel callset deliverables

Region	Samples	Format (See Table 2 for descriptions)	Notes
ACAF Threshold	All	- VCF - Hail multi MT - Hail split MT - BGEN - PGEN - PLINK bed	Variants that are frequent in the AoU computed ancestry subpopulations. The cutoff we use is population-specific allele frequency (AF) > 1% OR population-specific allele count (AC) > 100, in any computed ancestry subpopulation. The AC and AF values are stored in the Variant Annotation Table (VAT) fields, gvs_max_ac and gvs_max_af. The gvs_max_ac and gvs_max_af calculation is described in the VAT doc.
Exome	All	- VCF - Hail multi MT - Hail split MT - BGEN - PGEN - PLINK bed	Variants that are within exon regions of the Gencode v42 basic transcripts. This includes protein_coding transcripts, with a padding of 15 bases on both sides of each exon to capture splice site variants. In addition, PGx sites are included.
ClinVar	All	- VCF - Hail multi MT - Hail split MT - BGEN - PGEN - PLINK bed	Variants in ClinVar. Note that these are not limited to pathogenic or likely pathogenic variants. Please see Known Issue #4 in the QC report, as a small number of variants in ClinVar are missing.
Whole Genome	All	- VDS	All variants for all srWGS SNP & Indel samples. The Hail VDS is a sparse data storage format; it stores less data but more information. For each sample, a majority of the genome is stored as reference blocks. Variant data at each site is only stored for samples with a variant allele. Converting a VDS to a VCF or Hail MT involves rendering all variant calls, or “densifying” the VDS.
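The ACAF cutoff in Table 1 can be written as a small predicate. This is a sketch of the stated rule only, not the pipeline that produced the callset. The subpopulation labels and the per-population dictionaries are hypothetical, and treating the maxima as equivalent to the VAT fields gvs_max_af and gvs_max_ac is an assumption.

```python
from typing import Dict

AF_CUTOFF = 0.01  # population-specific allele frequency > 1%
AC_CUTOFF = 100   # population-specific allele count > 100


def passes_acaf(af_by_pop: Dict[str, float], ac_by_pop: Dict[str, int]) -> bool:
    """Return True if any subpopulation exceeds the AF or the AC cutoff."""
    # "Any subpopulation exceeds X" is the same as "the maximum exceeds X".
    max_af = max(af_by_pop.values(), default=0.0)
    max_ac = max(ac_by_pop.values(), default=0)
    return max_af > AF_CUTOFF or max_ac > AC_CUTOFF


# Hypothetical subpopulation labels and values, for illustration only.
rare_everywhere = passes_acaf({"pop_a": 0.0002, "pop_b": 0.001}, {"pop_a": 3, "pop_b": 40})
common_in_one = passes_acaf({"pop_a": 0.02, "pop_b": 0.0001}, {"pop_a": 12, "pop_b": 5})
high_count_only = passes_acaf({"pop_a": 0.0008}, {"pop_a": 150})
print(rare_everywhere, common_in_one, high_count_only)
```

Both comparisons are strict, so a variant sitting exactly at AF = 1% and AC = 100 in every subpopulation would not qualify under this reading of the rule.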

Table 2 -- Smaller callset format details

Format	FILTER sites removed?	FT filtering?	Multiallelics split?	Sharded by chromosome?	INFO calculations	Notes
VCF	No	No	No	No*	non-FAIL	bgz compressed. *Widely scattered with a manifest.
Multiallelic Hail MT (Hail Multi MT)	No	No	No	No	All data	Multiple alternate alleles per site, so there is one row per genomic locus.
Multiallelic split Hail MT (Hail split MT)	Yes	No	Yes	No	All data	Each alternate allele gets a separate row, so there can be multiple rows for one genomic locus.
BGEN	Yes	Yes	Yes	Yes	All data	Hard calls only (probability values of 0.0 or 1.0). Variants with AC=0 are removed. Please see Known Issue #11, as there is an issue affecting variants on the X and Y chromosomes.
PGEN	No	Yes	No	Yes	non-FAIL	Hard calls only
PLINK bed	Yes	Yes	Yes	Yes	All data	Hard calls only. Variants with AC=0 are removed. Please see Known Issue #11, as there is an issue affecting variants on the X and Y chromosomes.
VDS	No	No	No	No	NA
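To make the "hard calls only" note concrete, the sketch below maps a biallelic diploid genotype to the probability triple a hard call corresponds to, where each value is either 0.0 or 1.0. The function name and the tuple representation of a genotype are assumptions for illustration; this is not a BGEN reader or writer.

```python
from typing import List, Optional, Tuple


def hard_call_probabilities(gt: Optional[Tuple[int, int]]) -> Optional[List[float]]:
    """Map a biallelic diploid genotype to [P(hom-ref), P(het), P(hom-alt)]."""
    if gt is None:  # missing call, e.g. a genotype that failed FT filtering
        return None
    if len(gt) != 2 or any(allele not in (0, 1) for allele in gt):
        raise ValueError(f"expected a biallelic diploid genotype, got {gt!r}")
    probs = [0.0, 0.0, 0.0]
    probs[sum(gt)] = 1.0  # index = number of alternate-allele copies
    return probs


print(hard_call_probabilities((0, 0)))
print(hard_call_probabilities((0, 1)))
print(hard_call_probabilities((1, 1)))
print(hard_call_probabilities(None))
```

Because every probability is 0.0 or 1.0, no genotype uncertainty is carried in these files; tools that expect dosages will see values of exactly 0, 1, or 2.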

Notes:

Format -- description of each format we provide for the smaller callsets. More information about the data formats is covered in the article How the All of Us Genomic Data are Organized.
FILTER sites removed -- whether we removed sites flagged with a non-PASS (and non-missing) filter value (e.g., “ExcessHet”). If “Yes”, these sites do not appear in the file(s).
FT filtering -- whether we filter out genotypes when they fail genotype filtering (FT). The failed genotypes appear as missing calls (“no calls”).
Multiallelics split -- whether we split multiallelic sites into separate records. E.g., 1:1000 C→A,T becomes 1:1000 C→A and 1:1000 C→T.
Sharded by chromosome -- whether we provide separate files for each chromosome.
INFO calculations -- whether we provide INFO field data for all sites (“All data”) or only for sites that passed filtering (“non-FAIL”).
For all the smaller callset formats, we drop sites with more than 100 alternate alleles because they are typically not useful. Sites with more than 100 alternate alleles are available in the whole genome VDS; see the article The new VariantDataset (VDS) format for All of Us short read WGS data.
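The notes above can be sketched at the record level: splitting a multiallelic site into one record per alternate allele, dropping sites with more than 100 alternate alleles, and turning a genotype that fails FT into a no-call. The tuple layouts, function names, and the "FAIL" label are hypothetical. Real splitting tools also rewrite genotypes and may trim alleles, which this sketch ignores.

```python
from typing import List, Optional, Tuple

MAX_ALT_ALLELES = 100  # sites with more alternate alleles are dropped

Site = Tuple[str, int, str, List[str]]  # (contig, position, ref, alts)
Record = Tuple[str, int, str, str]      # (contig, position, ref, alt)


def split_multiallelic(site: Site) -> List[Record]:
    """Return one record per alternate allele, or [] if the site is dropped."""
    contig, pos, ref, alts = site
    if len(alts) > MAX_ALT_ALLELES:
        return []
    return [(contig, pos, ref, alt) for alt in alts]


def apply_ft_filter(gt: str, ft: Optional[str]) -> Optional[str]:
    """Return the genotype unchanged, or None (a no-call) if FT marks a failure."""
    # Assumption: a missing FT (None or ".") and "PASS" both mean "not failed".
    if ft in (None, ".", "PASS"):
        return gt
    return None


# The example from the notes: 1:1000 C→A,T becomes two records.
print(split_multiallelic(("1", 1000, "C", ["A", "T"])))
print(apply_ft_filter("0/1", "PASS"))
print(apply_ft_filter("0/1", "FAIL"))  # "FAIL" is a made-up label
```

Splitting is why a split-format file can hold several rows for one genomic locus, while a multiallelic file holds exactly one.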

Table 3 -- Variant and site numbers for smaller callsets.

The number of sites was counted from the Hail multi MT and the number of variants was counted from the Hail split MT.

Smaller callset	Number of sites	Number of variants	Number of samples
ACAF Threshold	56,993,186	116,456,419	414,830
Exome	38,150,243	45,704,594	414,830
ClinVar	1,510,012	2,180,727	414,830
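As a quick worked calculation, the counts in Table 3 give a rough variants-per-site ratio for each callset. The snippet only divides the published numbers. The ratio is a rough guide rather than an exact count of alternate alleles per site, because the Hail split MT has FILTER sites removed while the Hail multi MT does not (see Table 2).

```python
# (number of sites, number of variants) copied from Table 3
counts = {
    "ACAF Threshold": (56_993_186, 116_456_419),
    "Exome": (38_150_243, 45_704_594),
    "ClinVar": (1_510_012, 2_180_727),
}


def variants_per_site(n_sites: int, n_variants: int) -> float:
    """Return the average number of split variants per multiallelic site."""
    return n_variants / n_sites


ratios = {name: variants_per_site(*pair) for name, pair in counts.items()}
for name, ratio in ratios.items():
    print(f"{name}: {ratio:.2f} variants per site")
```

The ACAF threshold callset has the highest ratio of the three, at roughly two variants per site.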
