We released the srWGS SNP and Indel callset over limited genomic regions in familiar data formats: VCF, Hail MT, BGEN, and PLINK bed. These reduced datasets are available in addition to the complete callset, which is available in VariantDataset (VDS) format. We recommend that researchers use the reduced datasets where possible, because downstream analysis is easier with these formats than with the VDS.
The smaller callsets, described in Table 1, cover regions of the genome that are popular among All of Us researchers: an Allele Count/Allele Frequency (ACAF) threshold callset, an exome callset, and a ClinVar callset. Each callset is available in five data formats, described in Table 2. We provide the genomic territory used to create the three smaller callsets as UCSC BED files.
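Because the territory files are in UCSC BED format, it helps to remember that BED intervals are 0-based and half-open, while VCF positions are 1-based. The following is a minimal sketch, not part of any All of Us tooling, of checking whether a 1-based position falls inside a BED territory. The function names and the sample intervals are hypothetical and exist only for illustration.

```python
from typing import Dict, Iterable, List, Tuple

Intervals = Dict[str, List[Tuple[int, int]]]


def parse_bed(lines: Iterable[str]) -> Intervals:
    """Collect (start, end) pairs per chromosome from BED lines."""
    intervals: Intervals = {}
    for line in lines:
        line = line.strip()
        # Skip blank lines, comments, and UCSC header lines.
        if not line or line.startswith(("#", "track", "browser")):
            continue
        chrom, start, end = line.split()[:3]
        intervals.setdefault(chrom, []).append((int(start), int(end)))
    return intervals


def position_in_territory(intervals: Intervals, chrom: str, pos: int) -> bool:
    """Return True if a 1-based position (as in a VCF) is covered.

    BED start is 0-based and end is exclusive, so a 1-based position p
    is covered when start < p <= end.
    """
    return any(start < pos <= end for start, end in intervals.get(chrom, []))


# Hypothetical territory, for illustration only.
example_bed = ["chr1\t999\t1000", "chr2\t0\t10"]
territory = parse_bed(example_bed)
print(position_in_territory(territory, "chr1", 1000))
```

A linear scan like this is fine for spot checks; for whole-callset filtering, an interval-aware tool is a better fit.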
The Hail VDS format, described in depth in ‘The new VDS format for All of Us srWGS data’, is optimal for storing large callsets, but few analysis methods currently work on the VDS directly. Most downstream analyses involve converting the VDS to another format, such as a VCF or Hail MT, which can be time-consuming and expensive. Since many researchers do not need the entire genome for their analyses, we provide these pre-made reduced callsets to save researchers time and money.
Note that converting variant data from a VDS to a VCF or Hail MT greatly increases the storage footprint for the same amount of information. If you use the VDS, reduce the dataset before converting it to another format for analysis; the v7 callset is very large!
The genomic data are all available through the Researcher Workbench, and their locations can be found in the Controlled CDR Directory document. If you are unsure which data format to use for your research study, we encourage you to contact the Help Desk (support@researchallofus.org).
Note: the smaller srWGS callsets were updated to improve usability and include some missing ACAF variants; more details are available in this article.
Table 1 -- srWGS SNP & Indel callset deliverables
Region | Samples | Format (See Table 2 for descriptions) | Notes |
ACAF Threshold | All | VCF, Hail multi MT, Hail split MT, BGEN, PLINK bed | |
Exome | All | VCF, Hail multi MT, Hail split MT, BGEN, PLINK bed | |
ClinVar | All | VCF, Hail multi MT, Hail split MT, BGEN, PLINK bed | |
Whole Genome | All | VDS | |
Table 2 -- Smaller callset format details
Format | FILTER sites removed? | FT filtering? | Multiallelics split? | Sharded by chromosome? | Notes |
Variant Call Format (VCF) | No | Yes | No | Yes | bgz compressed |
Multiallelic Hail MatrixTable (Hail Multi MT) | No | Yes | No | No | Multiple alternate alleles per site, so there is one row per genomic locus. |
Multiallelic split Hail MatrixTable (Hail split MT) | Yes | Yes | Yes | No | Each alternate allele gets a separate row, so there can be multiple rows for one genomic locus. |
Binary GEN format (BGEN) | Yes | Yes | Yes | Yes | Hard calls only (probability values of 0.0 or 1.0) |
PLINK binary biallelic genotype table (bed) | Yes | Yes | Yes | Yes | |
VDS | No | No | NA | No | |
Notes:
- Format -- Description of each format we provide for the smaller callsets. More information about the data formats is available in the article ‘How the All of Us Genomic Data are Organized’ on the User Support Hub.
- FILTER sites removed -- Whether we removed sites flagged with a non-PASS (and non-missing) filter value (e.g., “ExcessHet”). If “Yes”, these sites do not appear in the file(s).
- FT filtering -- Whether we filter out genotypes that fail genotype filtering (FT). The failed genotypes appear as missing calls (“no calls”).
- Multiallelics split -- Whether we split multiallelic sites into separate records. E.g., 1:1000 C→A,T becomes 1:1000 C→A and 1:1000 C→T.
- Sharded by chromosome -- Whether we provide separate files for each chromosome.
- AS_YNG and AS_VQSLOD are not present in any smaller callsets (See Known Issue #2)
- For all the smaller callset formats, we drop sites with more than 100 alternate alleles because they are typically not useful. Sites with more than 100 alternate alleles are available in the whole genome VDS; see ‘The new VariantDataset (VDS) format for All of Us short read WGS data’.
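The processing steps in the notes above can be illustrated with a small pure-Python sketch. This is not the pipeline used to build the callsets. The `Site` type, the function names, and the example record are hypothetical. The sketch also assumes that a genotype counts as failing when its FT value is anything other than PASS or missing, and that splitting recodes every other alternate allele as reference (the downcoding convention used by common tools); the article confirms neither detail.

```python
from typing import Dict, List, NamedTuple, Optional, Tuple

MAX_ALT_ALLELES = 100  # sites with more alternate alleles are dropped
Genotype = Optional[Tuple[int, ...]]  # allele indices; None is a no-call


class Site(NamedTuple):
    chrom: str
    pos: int
    ref: str
    alts: List[str]
    filters: List[str]              # [] or ["."] is missing, ["PASS"] passes
    genotypes: Dict[str, Genotype]  # sample ID -> genotype
    ft: Dict[str, str]              # sample ID -> FT value


def site_is_kept(site: Site, remove_filtered_sites: bool) -> bool:
    """Apply the >100 alternate allele rule and, optionally, the FILTER rule."""
    if len(site.alts) > MAX_ALT_ALLELES:
        return False
    flagged = any(f not in ("PASS", ".") for f in site.filters)
    return not (remove_filtered_sites and flagged)


def apply_ft_filter(site: Site) -> Site:
    """Turn genotypes that fail FT into no-calls (assumed failure rule)."""
    masked = {
        sample: gt if site.ft.get(sample, "PASS") in ("PASS", ".") else None
        for sample, gt in site.genotypes.items()
    }
    return site._replace(genotypes=masked)


def split_multiallelic(site: Site) -> List[Site]:
    """Emit one biallelic record per alternate allele.

    In each record the chosen allele becomes index 1 and every other
    allele becomes 0 (assumed downcoding convention).
    """
    records = []
    for index, alt in enumerate(site.alts, start=1):
        recoded = {
            sample: None if gt is None
            else tuple(1 if allele == index else 0 for allele in gt)
            for sample, gt in site.genotypes.items()
        }
        records.append(site._replace(alts=[alt], genotypes=recoded))
    return records


# Hypothetical record matching the 1:1000 C→A,T example above.
example = Site(
    chrom="chr1", pos=1000, ref="C", alts=["A", "T"], filters=["PASS"],
    genotypes={"s1": (0, 1), "s2": (1, 2), "s3": (2, 2)},
    ft={"s1": "PASS", "s2": "PASS", "s3": "LowQual"},
)
for record in split_multiallelic(apply_ft_filter(example)):
    print(record.pos, record.ref, record.alts, record.genotypes)
```

In this sketch the multiallelic formats would call `site_is_kept(site, remove_filtered_sites=False)`, while the split formats would pass `True`, matching the “FILTER sites removed?” column in Table 2.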
Table 3 -- Variant and site numbers for smaller callsets
Smaller callset | Number of sites | Number of variants | Number of samples |
ACAF Threshold | 48,314,438 | 99,250,816 | 245,394 |
Exome | 30,013,262 | 34,807,589 | 245,394 |
ClinVar | 921,988 | 1,281,259 | 245,394 |
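As a quick sanity check on Table 3, the ratio of variants to sites gives a rough sense of how multiallelic each callset is. The short sketch below assumes that a “site” is a distinct genomic locus and a “variant” is one alternate allele at a site; the article does not define the terms explicitly, so treat that reading as an assumption.

```python
# (sites, variants) copied from Table 3.
TABLE_3 = {
    "ACAF Threshold": (48_314_438, 99_250_816),
    "Exome": (30_013_262, 34_807_589),
    "ClinVar": (921_988, 1_281_259),
}


def variants_per_site(sites: int, variants: int) -> float:
    """Average number of variants recorded per site."""
    return variants / sites


ratios = {name: variants_per_site(*counts) for name, counts in TABLE_3.items()}
for name, ratio in ratios.items():
    print(f"{name}: {ratio:.2f} variants per site")
```

The ratios come out to roughly 2.05 for ACAF Threshold, 1.16 for Exome, and 1.39 for ClinVar.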