Suggested Optimizations for Cost and Compute Time

Understanding your analysis before creating your cloud environment

Optimizing your workspace can be as simple as understanding the analyses you intend to perform before you start. Knowing the computational needs of your analyses lets you customize your cloud environment so that it meets those needs without unnecessary cost. For example, if the analysis you’re running submits only one job at a time, meaning it cannot be parallelized, then you should run the standard virtual machine (VM). However, if your analysis can be parallelized via Hail or Spark, then a Spark cluster will give you the most efficient cloud analysis environment.

If your analysis can be executed and/or parallelized via dsub, Nextflow, or Cromwell, then you should run the standard virtual machine (VM).
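The guidance above can be summarized as a small decision rule. The sketch below is illustrative only: the function name and the returned labels are hypothetical and are not part of the Researcher Workbench.

```python
# Tools that rely on a Spark cluster, per the guidance above.
SPARK_TOOLS = {"hail", "spark"}

def suggest_environment(tool=None):
    """Return a suggested environment type for the tool driving the analysis.

    Hypothetical helper: it only encodes the rule of thumb from the text.
    """
    name = (tool or "").strip().lower()
    if name in SPARK_TOOLS:
        return "Spark cluster"
    # Single-job analyses and workflow engines (dsub, Nextflow, Cromwell)
    # are both launched from a standard VM.
    return "standard VM"

print(suggest_environment("Hail"))
print(suggest_environment("Nextflow"))
print(suggest_environment())
```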

Test your analysis on a small sample and/or a small region of the genome

One of the greatest assets of the All of Us dataset is its large sample size. However, the larger the sample, the greater the compute cost and time. If your analysis fails partway through or produces an unwanted result, the compute cost and time spent on it are wasted.

To avoid these unnecessary compute costs, we recommend that researchers:

  1. Try to run the analysis against a smaller sample and/or a smaller region of the genome

  2. Verify the outcome before utilizing more compute resources 

This is especially relevant for notebooks designed to run to completion or to run in the background.
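One way to build a smaller test sample is to draw a fixed, reproducible subset of sample IDs and run the analysis on that subset first. The following is a minimal sketch in plain Python; the function name, the sample IDs, and the subset size are placeholders chosen for illustration.

```python
import random

def pilot_subset(sample_ids, n, seed=0):
    """Return a reproducible random subset of at most n sample IDs."""
    ids = sorted(sample_ids)      # sort so the result does not depend on input order
    if n >= len(ids):
        return ids
    rng = random.Random(seed)     # local generator; global random state is untouched
    return sorted(rng.sample(ids, n))

# Placeholder IDs standing in for the real sample list.
all_ids = [f"sample_{i:04d}" for i in range(1000)]
pilot = pilot_subset(all_ids, 50)
print(len(pilot), "of", len(all_ids), "samples selected for the test run")
```

Because the seed is fixed, rerunning the notebook selects the same subset, which makes it easier to compare results between test runs.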

Tip: for a Hail analysis, use filter_intervals to limit your analysis to a small region of the genome.

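import hail as hl

# mt is assumed to be a MatrixTable you have already loaded.
test_intervals = ['chr1:100M-200M', 'chr16:29.1M-30.2M']

# Passing reference_genome explicitly makes the 'chr' contig names parse
# even if the session's default reference is not GRCh38.
mt = hl.filter_intervals(
    mt,
    [hl.parse_locus_interval(x, reference_genome='GRCh38')
     for x in test_intervals])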

  • With filter_rows, Hail still reads every data partition, even if the final set of rows is small.
  • With filter_intervals, Hail can skip the data partitions that are not relevant, which is much faster.
  • We recommend using filter_intervals whenever you can, and applying it before any filter_rows operation.
  • See examples of how to specify genomic region intervals in the Hail documentation.
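To see why an interval filter can be faster than a row filter, the toy model below treats each data partition as a contiguous range of genomic positions. It is a simplified illustration, not Hail's implementation: the partition layout and the function name are made up for this sketch.

```python
# Toy layout: each partition covers a contiguous position range on one contig.
# The boundaries are illustrative round numbers, not real partition boundaries.
partitions = [
    {"contig": "chr1", "start": 1, "end": 100_000_000},
    {"contig": "chr1", "start": 100_000_001, "end": 200_000_000},
    {"contig": "chr1", "start": 200_000_001, "end": 250_000_000},
    {"contig": "chr16", "start": 1, "end": 90_000_000},
]

def partitions_to_read(partitions, contig, start, end):
    """Keep only the partitions whose range overlaps the requested interval."""
    return [p for p in partitions
            if p["contig"] == contig and p["start"] <= end and p["end"] >= start]

# An interval filter knows the region up front, so it can skip partitions.
needed = partitions_to_read(partitions, "chr16", 29_100_000, 30_200_000)
print(len(needed), "of", len(partitions), "partitions read")
```

A row-level predicate has no such range to compare against partition boundaries, so in this model every partition would have to be scanned before any row could be discarded.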

 

Tip: for Hail analysis, you can start with a small Spark cluster (e.g. one master and two workers) to test your analysis.

  • After you have confirmed that your analysis is working correctly and are ready to run it on the full data, you can increase the number of workers or preemptibles.
  • As long as you do not change the CPU, RAM, or disk settings, the additional VMs will be added to your existing Spark cluster.
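A rough cost comparison shows why testing on a small cluster first is worthwhile. The sketch below assumes a single flat hourly rate per VM; the rate is a hypothetical placeholder, not an actual price, so substitute the rates shown in your own environment configuration.

```python
def cluster_cost(hours, n_workers, worker_rate, master_rate):
    """Approximate compute cost: one master plus n_workers, billed per hour."""
    return hours * (master_rate + n_workers * worker_rate)

RATE = 0.20  # hypothetical USD per VM-hour, NOT a real price

# One hour on a small test cluster versus one hour on a larger cluster.
test_run = cluster_cost(hours=1, n_workers=2, worker_rate=RATE, master_rate=RATE)
full_run = cluster_cost(hours=1, n_workers=50, worker_rate=RATE, master_rate=RATE)
print(f"test cluster: ${test_run:.2f}, full cluster: ${full_run:.2f}")
```

Under these assumptions, an error caught on the test cluster costs a small fraction of the same error caught on the full cluster.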

 

Combine analyses and utilize Google Buckets

As you progress in the Researcher Workbench and create your workspace, you may develop numerous Jupyter Notebooks to conduct your analysis. We recommend combining analyses spread across multiple notebooks into one, as long as all of the notebooks use the same programming language (R or Python). Keeping your analyses in fewer notebooks cuts down on navigation time and helps prevent your cloud analysis environment from stalling out due to inactivity.

By combining analyses within a single notebook, you can end up with multiple dataset queries in that notebook. Instead of rerunning every query each time you open the notebook, you can save the result of each one to a Google Bucket for faster access. A Google Bucket is cloud storage where you can save files from your notebook. After running a query or a command, save the resulting files to your Google Bucket; you can then access them at any time, from any other notebook in the same workspace, without rerunning any commands. The support article, How do I use Google buckets to save my files in notebooks?, gives instructions and a tutorial video on how to do this.
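As a rough illustration of saving a query result, the sketch below builds a gsutil copy command for a local file. It assumes the bucket path is available in an environment variable named WORKSPACE_BUCKET; the helper name, the file name, the folder name, and the fallback bucket are placeholders, and the support article above remains the authoritative guide.

```python
import os

def bucket_copy_command(local_path, bucket, folder="data"):
    """Build the gsutil command that copies a local file into the bucket."""
    destination = f"{bucket.rstrip('/')}/{folder}/{os.path.basename(local_path)}"
    return ["gsutil", "cp", local_path, destination]

# Assumption: the workspace bucket path is exposed as WORKSPACE_BUCKET.
# The fallback is a placeholder so that the sketch runs anywhere.
bucket = os.getenv("WORKSPACE_BUCKET", "gs://example-bucket")
command = bucket_copy_command("cohort_query.csv", bucket)
print(" ".join(command))

# To perform the copy from a notebook (not run here):
#   import subprocess
#   subprocess.run(command, check=True)
```

Swapping the source and destination arguments gives the command for reading the saved file back into another notebook in the same workspace.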

 

Run notebook in the background

Another way to optimize your time spent running analyses is to run them in the background. The benefit of running a notebook in the background is that you can work in other notebooks or workspaces, or step away from your computer, without worrying about your cloud analysis environment becoming inactive and stalling out. The virtual machine (VM) will run until the job is done. Once the job is finished, the output will be saved to the specified location. We have provided a tutorial workspace on how to perform this activity, which can be accessed here.
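One general way to execute a notebook without keeping it open is Jupyter's nbconvert tool. The sketch below only builds the command; it is an assumption that this approach suits your workflow, and the tutorial workspace may use a different mechanism. The helper name and file names are placeholders.

```python
def background_run_command(notebook, output):
    """Build an nbconvert command that executes a notebook non-interactively."""
    return [
        "jupyter", "nbconvert",
        "--to", "notebook",
        "--execute", notebook,
        "--output", output,
        "--ExecutePreprocessor.timeout=-1",  # disable the per-cell timeout
    ]

command = background_run_command("analysis.ipynb", "analysis_output.ipynb")
print(" ".join(command))

# To launch it without blocking the current notebook (not run here):
#   import subprocess
#   subprocess.Popen(command)
```

The executed copy of the notebook, including its cell outputs, is written to the output file, so you can review the results after the job finishes.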

Acknowledge your current cloud analysis environments

There are two phases to a cloud analysis environment: active and paused.

  • An active environment should be utilized while directly working in a notebook performing analysis. 

  • A paused environment should be utilized when you are not directly working in a particular notebook but would like to keep the environment so that you can return to it after a short period of time. A paused environment uses less computing power and therefore accrues less cost.

  • It is also important to end (delete) large Spark clusters when you are finished working in a notebook, which stops compute usage entirely and avoids unnecessary costs. A standard VM does not cost much to keep unless it has a large disk (e.g. you transferred a very large file, such as the WGS PLINK data, to the persistent disk and want to keep it).
