Filtering the VDS can be expensive, especially when the variants of interest are spread across the entire genome. Even for a small number of variants, you may need a larger cluster and more time if those variants span the whole genome. For reference, generating a dense MatrixTable containing 73,545,961 variants and 245,394 samples from the v7 VDS took approximately 5.5 hours with 50 workers and 500 preemptible workers, at a cost of around $330 (~$60 per hour). Please ensure that preemptible workers constitute less than 50% of all workers (primary plus secondary) in your cluster; exceeding this limit may lead to job failures.
It is important to note that costs do not scale linearly with the number of variants or samples; they depend largely on how scattered the filtered regions are. If your variants are limited to specific chromosomes rather than spanning the entire genome, filter the VDS by chromosome first (filter_chromosomes) and then apply a filter_intervals step using a BED file. This approach can significantly reduce processing time. When dealing with large interval sets, filter_rows is faster than filter_intervals and less likely to encounter memory issues. Details about the All of Us VDS can be found in the support article: The new VariantDataset (VDS) format for All of Us short read WGS data, and examples of analyzing the VDS can be found in the tutorial notebook: 03_Manipulate Hail VariantDataset (Researcher Workbench login required).
We recommend first checking whether the variants of interest are present in one of the smaller callsets before using the VDS. You can find more details about the smaller callsets in the support article: Smaller Callsets for Analyzing Short Read WGS SNP & Indel Data with Hail MT, VCF, and PLINK.