Starting with the v7 data release, the short read whole genome sequencing (srWGS) SNP and Indel (“small”) variants dataset was released as a Hail VariantDataset (VDS). We know that many of the All of Us researchers are familiar and comfortable with variant call format (VCF) and Hail MatrixTable (MT) format, so this is a significant change. In this article, we will walk through why we are changing to a VDS for the complete dataset, a summary of the data storage in the VDS, and how you can update your analyses to utilize this new format.
Before we continue describing the VDS, we want to point out that we are still releasing the srWGS SNP and Indel dataset in VCF, Hail MT, and PLINK formats. These formats cover limited regions of the genome that are heavily used by All of Us researchers. You can read more about this in the article Smaller callsets for analyzing srWGS SNP & Indel data with Hail MT, VCF, and PLINK.
The All of Us research study aims to build the largest cohort of srWGS samples. The more samples we have in this joint cohort, the more statistical power we have to understand the human genome. In the v6 All of Us genomics data release, we released 98,590 samples in the joint srWGS callset; in the most recent v7 genomics data release, we released 245,394 samples, increasing the size of the callset roughly 2.5-fold. Because we are continuing to grow our sample cohort, we are actively studying and investing in the best methods for variant analysis and storage. Unfortunately, the VCF and Hail MT formats have not been able to scale with our cohort size. Many researchers already saw issues when trying to utilize the 98,590-sample joint srWGS dataset because of the time and money it took to analyze the dataset in VCF or Hail MT format.
Because of the issues both in generation and utilization of the srWGS dataset in VCF and Hail MT format, we are making the change to use VDS format for the complete srWGS callset. VariantDataset (VDS) is a new sparse Hail format that stores less data but more information, leading to incredible improvements for storing and analyzing variant data. You might be familiar with using Hail and the Hail dense MT format, which stores data similarly to a VCF.
There are Hail functions for the VDS, including computing QC metrics, filtering samples, filtering variants, and filtering intervals. Make sure you’re looking at the Hail 0.2 docs - older Hail docs refer to an outdated VariantDataset definition. In addition, the VDS can be manipulated in Hail and then “densified”, which is the process of rendering all the variant calls for all samples.
The VDS uses new concepts “local alleles” and “global alleles”. Global alleles are the reference allele and all alternate alleles across all samples in the cohort. The global alleles are stored in the alleles array. Local alleles are the reference allele and alternate allele for each sample, describing one genotype. Local alleles are stored in the VDS as an array (LA) for each genotype. The LA array maps a local allele to a corresponding global allele.
There are also genotype-level fields that use local alleles and by convention these fields start with “L”. For example, the LAD is the local allele depth and the LGT is the local genotype. The LAD for each genotype can have at most three numbers: the depth of the reference allele and the depth of at most two alternate alleles for the genotype. If the genotype is homozygous reference, the LAD only has one number.
Let’s consider a concrete example. Suppose we have a variant whose reference allele is A and whose alternate alleles are AA, AAA, …, through to an allele with 100 adenines. These are the global alleles. Consider two samples whose genotypes are AAA/AAA and AA/AAAA. Their respective local alleles are: A, AAA and A, AA, AAAA. Their respective local allele arrays (LA) are [0, 2] and [0, 1, 3]. Their local genotypes (LGT) are 1/1 and 1/2. Their LADs may respectively be [0, 30] and [0, 13, 16].
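The mapping in this example can be checked with plain Python: LGT indexes into the LA array, which in turn indexes into the global alleles array. This is an illustrative sketch of the indexing scheme, not the Hail API:

```python
# Decode a local genotype (LGT) into allele strings via the LA array.
# Illustrative plain Python, not the Hail API.

def decode_lgt(lgt, la, global_alleles):
    """Map local allele indices through LA to the global allele strings."""
    return tuple(global_alleles[la[i]] for i in lgt)

# Global alleles: 'A' (the reference) plus alts 'AA', 'AAA', ...,
# up to an allele with 100 adenines.
global_alleles = ["A" * n for n in range(1, 101)]

# Sample 1: genotype AAA/AAA, LA = [0, 2], LGT = 1/1
print(decode_lgt((1, 1), [0, 2], global_alleles))    # ('AAA', 'AAA')

# Sample 2: genotype AA/AAAA, LA = [0, 1, 3], LGT = 1/2
print(decode_lgt((1, 2), [0, 1, 3], global_alleles))  # ('AA', 'AAAA')
```

Note that the local index 1 resolves to different global alleles for the two samples; only the LA array disambiguates them.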
To decrease the amount of data needed to store large portions of the genome, the VDS preserves reference blocks. The reference blocks are stored in a component table, reference_data. The row key is the locus, and the ref_allele field denotes the reference base at that genomic coordinate. Columns are keyed by sample ID; if there is data at a particular location for a sample, then that location’s variant call matches the reference.
Entries are grouped together so that many sites are stored as one record, with start and end locations indicating the extent of each reference block. Blocks are grouped by genotype quality, so non-variant sites with lower or higher genotype quality remain distinguishable for analysis. When the non-variant sites for each sample are stored together this way, fewer individual per-sample sites need to be stored, reducing the storage footprint.
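The grouping idea can be sketched in plain Python: contiguous reference sites whose genotype quality falls in the same band collapse into one (start, end, band) record. This is a simplified toy, with made-up GQ bands; the actual VDS reference_data table stores additional fields:

```python
# Collapse per-site reference calls into reference blocks.
# Illustrative sketch only; GQ bands here are hypothetical.

def gq_band(gq):
    # Bucket genotype quality into coarse tiers (hypothetical cutoffs).
    if gq >= 30:
        return 30
    elif gq >= 20:
        return 20
    return 0

def to_reference_blocks(sites):
    """sites: list of (position, gq) tuples, sorted by position."""
    blocks = []
    for pos, gq in sites:
        band = gq_band(gq)
        if blocks and blocks[-1][1] == pos - 1 and blocks[-1][2] == band:
            # Adjacent site in the same band: extend the current block.
            blocks[-1] = (blocks[-1][0], pos, band)
        else:
            blocks.append((pos, pos, band))
    return blocks

sites = [(100, 45), (101, 42), (102, 25), (103, 22), (104, 48)]
print(to_reference_blocks(sites))
# [(100, 101, 30), (102, 103, 20), (104, 104, 30)]
```

Five per-site records collapse into three block records here; across hundreds of millions of hom-ref sites per sample, this compression is what keeps the VDS small.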
The drawbacks of VCF for variant storage
The VCF does not scale well for variant storage; there are two main issues that cause super-linear growth.
First, each unique variant in a sample, or “singleton”, adds a new row to the VCF with a column for every sample. This storage scheme causes the byte size of the VCF to grow faster than the sample size. For example, in a VCF with 100,000 samples, a singleton adds not just one new variant call but a row of 100,000 entries, 99,999 of which repeat the same non-variant information. The VDS solves this problem by storing reference blocks for each sample, removing the need to store individual entries for non-variant sites in all samples while preserving the benefits of including non-variant data in a callset.
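The effect is easy to quantify with a toy model: in a dense layout every new row costs one entry per sample, while a sparse layout pays roughly one genotype entry per actual variant call (ignoring the reference blocks' own, much smaller, footprint):

```python
# Toy model of dense vs. sparse storage growth when adding singletons.
# Dense (VCF / dense MT): every row stores an entry for every sample.
# Sparse (VDS-like): a row stores entries only for samples that vary.

def dense_entries(n_rows, n_samples):
    return n_rows * n_samples

def sparse_entries(n_singleton_rows):
    # One genotype entry per singleton call; hom-ref data is block-compressed.
    return n_singleton_rows

n_samples = 100_000
new_singletons = 1_000
print(dense_entries(new_singletons, n_samples))  # 100,000,000 entries
print(sparse_entries(new_singletons))            # 1,000 entries
```

Adding a thousand singletons to a 100,000-sample dense callset costs a hundred million entries; the sparse layout costs a thousand.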
The second drawback of the way VCF stores variant data is the quadratic growth of genotype metadata. For each sample, a diploid genotype has at most two distinct alleles. However, a VCF requires storing genotype metadata for all the global alleles across all samples. If a site has a high number of alternate alleles, this genotype metadata can become extremely large, even though metadata for only two alleles per sample is actually necessary: the number of data entries is proportional to the square of the total number of alleles at a variant site. The VDS stores genotype metadata for each sample only for the alleles that occur in that sample (local alleles).
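For a genotype field like PL, a diploid call needs one value per unordered pair of alleles, i.e. n(n + 1)/2 values for n alleles, so per-entry metadata grows quadratically with the global allele count, while local alleles cap it at three (reference plus at most two alternates):

```python
# Number of diploid genotype-likelihood (PL) entries for n alleles:
# one per unordered allele pair = n * (n + 1) / 2.

def pl_entries(n_alleles):
    return n_alleles * (n_alleles + 1) // 2

# Global representation: a site with 100 alleles needs 5050 PL values
# per sample, even though each diploid sample carries at most 2 of them.
print(pl_entries(100))  # 5050

# Local representation: at most 3 local alleles (ref + 2 alts) per genotype.
print(pl_entries(3))    # 6
```

At a highly multi-allelic site, the local representation stores 6 likelihood values per genotype instead of 5050, a saving that compounds across every sample in the callset.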
Although VCF compression partially addresses these problems during storage, analyzing data in this dense format is untenable. The Hail MatrixTable is also a dense representation and therefore suffers from the same issues. The VCFs produced by the gnomAD project have grown super-linearly in the number of samples. The size of their VCF callset far exceeds the sum total size of their GVCFs. In response, the gnomAD project abandoned VCF for the v4 callset.
A 100,000-sample compressed VCF is 17 TB, whereas a 250,000-sample VDS is 24 TB: a 41% increase in bytes for 2.5 times as many samples. We estimate that the 250,000 All of Us srWGS samples would be well over 50 TB as a VCF, which is not usable.
Using the All of Us VDS
There are multiple options for using the new VDS format. If you usually start your analysis with a VCF or Hail MatrixTable, you can densify the VDS and convert it to a VCF or Hail MatrixTable. At this time, there are limited downstream analysis tools available for using the sparse VDS format without densifying. However, the VDS format also enables some analyses that are impractical with any of the dense variant storage options.
Because many of our researchers are interested in specific variants, like variants within the exome, clinical variants, or population-specific frequent variants, we are releasing VCF, Hail MT, and PLINK datasets covering these limited regions of the genome (read more in Smaller callsets for analyzing srWGS SNP & Indel data with Hail MT, VCF, and PLINK). We recommend that researchers use these pre-made VCF, Hail MT, and PLINK datasets because they cover the most popular variants of interest and they will save researchers time and money. The densification step from the VDS to a VCF or Hail MatrixTable can be time consuming and expensive.
We describe the All of Us VDS in depth in the document ‘How the All of Us Genomic data are organized’, which you can find on the User Support Hub. Some features of the All of Us VDS are not standard, so you may see differences from this documentation if you are using a different VDS.
Analyses directly from the VDS
Some analyses can be executed directly against the sparse VDS format. In gnomAD’s sex inference metric, described in Karczewski et al., the reference blocks in the sparse VDS format were used to find an accurate read depth. As briefly described under “Sex imputation”, chromosome Y typically has too few variant calls for accurate read depth. The mean read depth on chromosome 20 is used to normalize the read depth values and then estimate sex chromosomal ploidy and ultimately sex karyotype. Supplementary figure 4 from Karczewski et al. scatter plots the sex ploidy estimates with points colored by estimated sex karyotype.
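The normalization step can be sketched in plain Python: mean depth on a sex chromosome is divided by mean autosomal (chromosome 20) depth to get a ploidy estimate, and the estimates are then rounded into a karyotype call. This is an illustrative toy with naive rounding, not gnomAD's actual fitted cutoffs:

```python
# Sketch of depth-based sex-chromosome ploidy estimation.
# Illustrative only: the karyotype call here uses naive rounding,
# not gnomAD's fitted thresholds.

def ploidy(mean_depth_chrom, mean_depth_chr20):
    # Autosomes are diploid, so chr20 mean depth corresponds to ploidy 2.
    return 2.0 * mean_depth_chrom / mean_depth_chr20

def karyotype(x_ploidy, y_ploidy):
    x = "X" * max(1, round(x_ploidy))
    y = "Y" * round(y_ploidy)
    return x + y

# A sample with half-autosomal depth on both X and Y -> XY.
xp = ploidy(15.2, 30.1)   # ~1.0
yp = ploidy(14.8, 30.1)   # ~1.0
print(karyotype(xp, yp))  # XY
```

The reference blocks matter here because accurate mean depth requires the depth of non-variant sites, which a variants-only callset discards.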
Another example is CHARR, a soon-to-be-published contamination estimation method by Lu et al. CHARR estimates sample DNA contamination directly from the sparse VDS format. In fact, CHARR only relies on the non-reference variant data thereby avoiding reading the vast majority of the VDS data. As a result, compared to current methods based on CRAMs or BAMs, CHARR is three orders of magnitude faster and cheaper.
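The intuition can be illustrated with a simplified toy (this is not the published CHARR estimator, which also corrects for population allele frequencies): at homozygous-alternate sites, reads supporting the reference allele should be rare, so their average fraction tracks the contamination rate.

```python
# Toy illustration of the intuition behind read-based contamination
# estimation from hom-alt genotypes. NOT the published CHARR estimator.

def ref_read_fraction(ad_ref, ad_alt):
    return ad_ref / (ad_ref + ad_alt)

def rough_contamination(hom_alt_sites):
    """hom_alt_sites: list of (ref_depth, alt_depth) at hom-alt calls.
    In a clean sample these ref depths are ~0; contamination inflates them."""
    fracs = [ref_read_fraction(r, a) for r, a in hom_alt_sites]
    return sum(fracs) / len(fracs)

# A sample whose hom-alt sites carry ~2% reference-supporting reads.
sites = [(1, 49), (0, 50), (2, 48), (1, 47)]
print(round(rough_contamination(sites), 3))  # ~0.02
```

Because only hom-alt variant calls are needed, the vast bulk of the VDS (the reference blocks) never has to be read, which is why this style of estimator is so much cheaper than CRAM/BAM-based methods.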
Densifying the VDS
Another option for using the complete srWGS dataset and starting with the VDS is to extract the samples and regions of interest for your study and convert the VDS to a VCF or Hail MatrixTable. We provide examples for how to do this in our new tutorial notebook about using the VDS format on the Researcher Workbench (RW). Since the densification step is costly, please make sure you read our instructions carefully and ask any questions if you have them.
The densification steps can be run in Hail and are described below, though you will need to make some modifications depending on whether you create a VCF or a Hail MT. For example, if creating a VCF, you will also need to render the header; if creating an MT, you will have to drop some fields. Please also see Known Issue #7 in the QC report: if you filter any participants from the VDS, you will need to recalculate GT before densifying.
- vds = hl.vds.read_vds(original_vds)
- vds = hl.vds.filter_samples(vds, samples)
- vds = hl.vds.filter_intervals(vds, intervals, split_reference_blocks=True)
- mt = hl.vds.to_dense_mt(vds)
- mt = mt.transmute_rows(info=mt.info.annotate(AF=mt.info.AF[1:], AC=mt.info.AC[1:]))
- hl.export_vcf(mt, out_path)
We do not recommend trying to analyze the full (all samples and entire genome) srWGS callset in VCF or MatrixTable format. At the scale of the All of Us genomic dataset, most downstream tools will not work, and attempting it will be costly. We have three pre-made datasets covering the most popular regions of the genome in VCF, Hail MT, and PLINK formats. See Smaller callsets for analyzing srWGS SNP & Indel data with Hail MT, VCF, and PLINK for more information.
Even though this is a big change, we hope you can join in our excitement about the number of srWGS samples we have and getting closer to our goal of 1,000,000 samples! We are always available to help and also welcome any feedback about how we can make your research easier. Reach out to us at the User Support Hub to work with our support team.
Hail 0.2 Documentation: https://hail.is/docs/0.2/vds/index.html
All of Us User Support Hub: https://aousupporthelp.zendesk.com/hc/en-us
Karczewski, K.J., Solomonson, M., Tiao, G., et al. Systematic single-variant and gene-based association testing of thousands of phenotypes in 394,841 UK Biobank exomes. Cell Genomics (2022). https://doi.org/10.1016/j.xgen.2022.100168