We released the srWGS SNP and Indel callset over limited genomic regions in familiar data formats: Variant Call Format (VCF), Hail MatrixTable (MT), Binary GEN format (BGEN), PLINK 2 binary genotype table (PGEN), and PLINK binary biallelic genotype table (PLINK bed). These reduced datasets are available in addition to the complete callset, which is provided in VariantDataset (VDS) format. We recommend that researchers use the reduced datasets where possible, because downstream analysis is easier with these formats than with the VDS.
The smaller callsets, described in Table 1, cover regions of the genome that are popular among All of Us researchers: an Allele Count/Allele Frequency (ACAF) threshold callset, an exome callset, and a ClinVar callset. Each callset is available in five data formats, described in Table 2. We provide the genomic territory used to create the three smaller callsets as UCSC BED files.
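As a rough illustration of how the BED territory files can be used, the sketch below totals the number of bases covered by a set of BED intervals. It assumes the standard UCSC BED convention (whitespace-separated columns, 0-based start, end-exclusive) and non-overlapping intervals. The function names and the example lines are hypothetical and are not taken from the released files.

```python
def parse_bed_line(line):
    """Return (contig, start, end) from one BED data line."""
    contig, start, end = line.split()[:3]
    return contig, int(start), int(end)


def territory_size(bed_lines):
    """Sum interval lengths, skipping blank lines and header/comment lines."""
    total = 0
    for line in bed_lines:
        line = line.strip()
        if not line or line.startswith(("#", "track", "browser")):
            continue
        _, start, end = parse_bed_line(line)
        total += end - start  # BED is 0-based, half-open: length is end - start
    return total


# Hypothetical intervals, for illustration only.
example_bed = [
    "chr1\t999\t2000",
    "chr2\t0\t500",
]
print(territory_size(example_bed))
```

If the intervals in a file overlap, they would need to be merged first, otherwise shared bases are counted more than once.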
The Hail VDS format, described in depth in ‘The new VDS format for All of Us srWGS data’, is optimal for storing large callsets, but few analysis methods currently work on the VDS directly. Most downstream analyses involve converting the VDS to another format, like a VCF or Hail MT, which can be time-consuming and expensive. Since many researchers do not need the entire genome for their analyses, we provide these pre-made reduced callsets to save researchers time and money.
Note that converting variant data from a VDS to a VCF or Hail MT greatly increases the storage footprint for the same amount of information. If you use the VDS, you will need to reduce the dataset before converting it to another format and using it for analyses: the CDRv8 callset is very large!
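The "reduce first, then convert" order described above might look like the following in Hail. This is a minimal sketch, not the pipeline used to build the released callsets. The function names, the `vds_path` and `bed_intervals` arguments, and the GRCh38 default are assumptions for illustration, and the Hail calls (`hl.parse_locus_interval`, `hl.vds.read_vds`, `hl.vds.filter_intervals`, `hl.vds.to_dense_mt`) should be checked against the Hail version available in your environment.

```python
def bed_to_interval_string(contig, start, end):
    """Convert a 0-based, half-open BED interval to a closed, 1-based
    interval string such as "[chr1:1000-2000]"."""
    return f"[{contig}:{start + 1}-{end}]"


def reduce_then_densify(vds_path, bed_intervals, reference_genome="GRCh38"):
    """Filter a VDS to a few regions first, then convert to a dense MatrixTable.

    vds_path and bed_intervals are placeholders supplied by the caller;
    bed_intervals is an iterable of (contig, start, end) tuples.
    """
    import hail as hl  # imported here so the helper above works without Hail

    intervals = [
        hl.parse_locus_interval(
            bed_to_interval_string(contig, start, end),
            reference_genome=reference_genome,
        )
        for contig, start, end in bed_intervals
    ]
    vds = hl.vds.read_vds(vds_path)
    # Reduce while the data is still in the compact VDS representation ...
    vds = hl.vds.filter_intervals(vds, intervals, keep=True)
    # ... and only then expand to a dense MatrixTable.
    return hl.vds.to_dense_mt(vds)
```

Filtering before densifying keeps the expensive, storage-heavy conversion limited to the regions an analysis actually needs.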
The genomic data are all available through the Researcher Workbench, and their locations are listed in the Controlled CDR Directory document. If you are unsure which data format to use for your research study, we encourage you to reach out to the Help Desk (support@researchallofus.org).
Table 1 -- srWGS SNP & Indel callset deliverables
Region | Samples | Format (see Table 2 for descriptions) | Notes |
ACAF Threshold | All | VCF, Hail multi MT, Hail split MT, BGEN, PGEN, PLINK bed | |
Exome | All | VCF, Hail multi MT, Hail split MT, BGEN, PGEN, PLINK bed | |
ClinVar | All | VCF, Hail multi MT, Hail split MT, BGEN, PGEN, PLINK bed | |
Whole Genome | All | VDS | |
Table 2 -- Smaller callset format details
Format | FILTER sites removed? | FT filtering? | Multiallelics split? | Sharded by chromosome? | INFO calculations | Notes |
VCF | No | No | No | No* | non-FAIL | bgz compressed. *Widely scattered with a manifest |
Multiallelic Hail MT (Hail multi MT) | No | No | No | No | All data | Multiple alternate alleles are kept at one site, so there is one row per genomic locus |
Multiallelic split Hail MT (Hail split MT) | Yes | No | Yes | No | All data | Each alternate allele gets a separate site, so there can be multiple rows for one genomic locus |
BGEN | Yes | Yes | Yes | Yes | All data | Hard calls only (probability values of 0.0 or 1.0) |
PGEN | No | Yes | No | Yes | non-FAIL | Hard calls only |
PLINK bed | Yes | Yes | Yes | Yes | All data | Hard calls only. Variants with AC=0 are removed |
VDS | No | No | No | No | NA | |
Notes:
- Format -- description of each format we provide for the smaller callsets. More information about the data formats is available in the article ‘How the All of Us Genomic Data are Organized’.
- FILTER sites removed -- whether we removed sites flagged with a non-PASS (and non-missing) filter value (e.g., “ExcessHet”). If “Yes”, these sites do not appear in the file(s).
- FT filtering -- whether we filter out genotypes when they fail genotype filtering (FT). The failed genotypes appear as missing calls (“no calls”).
- Multiallelics split -- whether we split multiallelic sites into separate records. For example, 1:1000 C→A,T becomes 1:1000 C→A and 1:1000 C→T.
- Sharded by chromosome -- whether we provide separate files for each chromosome.
- INFO calculations -- whether we provide INFO field data for all sites or only for sites that passed filtering.
- For all the smaller callset formats, we drop sites with more than 100 alternate alleles because they are typically not useful. These sites remain available in the whole genome VDS; see ‘The new VariantDataset (VDS) format for All of Us short read WGS data’.
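The transformations named in the Table 2 columns can be sketched on toy records. The example below is an illustration only, not the code used to build the callsets: the `Site` class, the function names, and the toy data are hypothetical, any FT value other than PASS or missing is assumed to be a failure, and real splitting tools also re-code genotypes and trim alleles, which is omitted here.

```python
from dataclasses import dataclass

MAX_ALT_ALLELES = 100  # sites with more alternate alleles are dropped


@dataclass(frozen=True)
class Site:
    contig: str
    pos: int
    ref: str
    alts: tuple
    filters: frozenset = frozenset()  # empty set stands for PASS or missing


def remove_filtered_sites(sites):
    """'FILTER sites removed': keep only PASS or missing-FILTER sites."""
    return [s for s in sites if not (s.filters - {"PASS"})]


def split_multiallelics(sites):
    """'Multiallelics split': emit one record per alternate allele."""
    records = []
    for site in sites:
        if len(site.alts) > MAX_ALT_ALLELES:
            continue  # too many alternate alleles: not in the smaller callsets
        for alt in site.alts:
            records.append(
                Site(site.contig, site.pos, site.ref, (alt,), site.filters)
            )
    return records


def apply_ft_filter(genotype, ft):
    """'FT filtering': a genotype that fails FT becomes a missing call (None)."""
    passed = ft in (None, ".", "PASS")
    return genotype if passed else None


# Toy data: one multiallelic PASS site and one site flagged ExcessHet.
sites = [
    Site("1", 1000, "C", ("A", "T")),
    Site("1", 2000, "G", ("A",), frozenset({"ExcessHet"})),
]
for record in split_multiallelics(remove_filtered_sites(sites)):
    print(f"{record.contig}:{record.pos} {record.ref}>{record.alts[0]}")
```

With the toy data, the ExcessHet site is removed and the remaining site at 1:1000 is printed as two records, one for C>A and one for C>T.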
Table 3 -- Variant and site numbers for smaller callsets.
The number of sites was counted from the Hail multi MT, and the number of variants was counted from the Hail split MT.
Smaller callset | Number of sites | Number of variants | Number of samples |
ACAF Threshold | 56,993,186 | 116,456,419 | 414,830 |
Exome | 38,150,243 | 45,704,594 | 414,830 |
ClinVar | 1,510,012 | 2,180,727 | 414,830 |
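To get a feel for how much splitting multiallelic sites expands each callset, the counts in Table 3 can be turned into a variants-per-site ratio. This is a back-of-the-envelope sketch with hypothetical names. Because the sites come from the Hail multi MT and the variants from the Hail split MT, which also has FILTER sites removed, the ratio is only a rough indicator and not an exact count of alternate alleles per site.

```python
# Counts copied from Table 3:
# (sites from the Hail multi MT, variants from the Hail split MT).
TABLE_3 = {
    "ACAF Threshold": (56_993_186, 116_456_419),
    "Exome": (38_150_243, 45_704_594),
    "ClinVar": (1_510_012, 2_180_727),
}


def variants_per_site(table):
    """Return the variants-to-sites ratio for each smaller callset."""
    return {name: variants / sites for name, (sites, variants) in table.items()}


for name, ratio in variants_per_site(TABLE_3).items():
    print(f"{name}: {ratio:.2f} variants per site")
```

The ratios come out at roughly 2.04 for ACAF Threshold, 1.20 for Exome, and 1.44 for ClinVar.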