Understanding your analysis before creating your cloud environment
Optimizing your workspace can be as simple as understanding the analyses you intend to perform beforehand. Knowing the computational needs of your analyses lets you customize your cloud environment so that you meet your compute needs accurately and precisely, avoiding unnecessary costs. For example, if the analysis you're running will only submit one job at a time, meaning it cannot be parallelized, then you should run the standard virtual machine (VM). However, if your analysis can be parallelized via Hail or Spark, then a Spark cluster will give you the most optimized cloud analysis environment.
If your analysis can be executed and/or parallelized via dsub, Nextflow, or Cromwell, you should still run the standard VM: these tools launch their own worker machines, so the notebook VM only needs to submit and monitor the jobs.
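If you do provision a Spark cluster for Hail, your notebook only needs to initialize Hail against it; Hail then distributes work across the cluster's workers automatically. A minimal sketch, assuming a Hail-enabled Spark cluster environment like the one the Workbench provides:

    import hail as hl

    # On a Spark cluster, hl.init() attaches to the existing Spark context,
    # so subsequent MatrixTable operations run on the cluster's workers.
    hl.init(default_reference='GRCh38')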
Test your analysis on a small sample and/or a small region of the genome
One of the greatest assets of the All of Us dataset is its large sample size. However, the larger the sample, the larger the compute cost and time. If an analysis produces an undesired outcome or fails with an error along the way, the compute cost and time it consumed are wasted.
To avoid these unnecessary compute costs, we recommend that researchers:
- Try to run the analysis against a smaller sample and/or a smaller region of the genome
- Verify the outcome before utilizing more compute resources
Some cases where this is especially relevant are notebooks designed to run to completion and notebooks designed to run in the background.
Tip: for a Hail analysis, use filter_intervals to limit your analysis to a small region of the genome:

    import hail as hl

    # mt is your existing Hail MatrixTable.
    test_intervals = ['chr1:100M-200M', 'chr16:29.1M-30.2M']
    mt = hl.filter_intervals(
        mt,
        [hl.parse_locus_interval(x) for x in test_intervals])
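Relatedly, to test against a smaller participant sample rather than a smaller region, you can randomly subsample the columns (samples) of a Hail MatrixTable before running the full pipeline. A minimal sketch; the 1% fraction and the seed are arbitrary examples:

    # Keep roughly 1% of samples; fixing the seed makes test runs reproducible.
    mt = mt.sample_cols(0.01, seed=42)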
Tip: for a Hail analysis, you can start with a small Spark cluster (e.g., one master and two workers) to test your analysis.
Combine analyses and utilize Google Buckets
As you progress in the Researcher Workbench and build out your workspace, you may develop numerous Jupyter Notebooks to conduct your analysis. We recommend combining analyses spread across multiple notebooks into one, as long as the notebooks use a consistent coding language (R or Python). Keeping analyses in fewer notebooks cuts down on navigation time and prevents your cloud analysis environment from stalling out due to inactivity.
By combining analyses within a single notebook, you may end up with multiple dataset queries in that notebook. Instead of re-running every query each time you open the notebook, you can save each result to a Google Bucket for optimized access. A Google Bucket is a cloud storage space where you can save files from your notebook. After running a query or a command, save the resulting files to your Google Bucket; you can then access them at any time, from any other notebook in the same workspace, without re-running any commands. The support article, How do I use Google buckets to save my files in notebooks?, gives instructions and a tutorial video on how to do this.
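For illustration, a minimal sketch of this pattern in Python. It assumes the All of Us notebook environment, where the workspace bucket path is exposed through the WORKSPACE_BUCKET environment variable and the gsutil command-line tool is available; the dataframe and file names here are hypothetical:

    import os
    import subprocess
    import pandas as pd

    # Hypothetical query result standing in for your own dataframe.
    df = pd.DataFrame({'person_id': [1, 2, 3], 'value': [0.1, 0.2, 0.3]})

    # The Workbench exposes the workspace bucket path as an environment variable.
    bucket = os.environ['WORKSPACE_BUCKET']

    # Save the result locally, then copy it into the bucket.
    df.to_csv('my_query_result.csv', index=False)
    subprocess.run(['gsutil', 'cp', 'my_query_result.csv', f'{bucket}/data/'],
                   check=True)

    # Later, from any other notebook in the same workspace, copy it back
    # without re-running the query:
    subprocess.run(['gsutil', 'cp', f'{bucket}/data/my_query_result.csv', '.'],
                   check=True)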
Run notebooks in the background
Another way to optimize the time you spend running analyses is to run them in the background. The benefit of running a notebook in the background is that you can work in other notebooks or workspaces, or step away from your computer, without worrying about your cloud analysis environment becoming inactive and stalling out. The virtual machine (VM) will run until the job is done, and once the job is finished, the output will be saved to the specified location. We have provided a tutorial workspace on how to perform this activity, which can be accessed here.
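As a sketch of what a background run can look like, the papermill library (an assumption here; the tutorial workspace may use a different mechanism) executes a notebook end-to-end and writes an executed copy with all cell outputs, so the run continues on the VM even after you close your browser tab. The notebook file names are hypothetical:

    import papermill as pm

    # Execute every cell of the input notebook and write the executed copy,
    # including all cell outputs, to a new file on the VM.
    pm.execute_notebook(
        'my_analysis.ipynb',
        'my_analysis_executed.ipynb',
    )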
Keep track of your current cloud analysis environments
A cloud analysis environment has two states: active and paused.
- An active environment should be used while you are directly working in a notebook performing analysis.
- A paused environment should be used when you are not directly working in a particular notebook but would like to keep it running so you can return to it after a short period of time. A paused environment uses less computing power and therefore accrues less cost.
- It is also important to end (delete) large Spark clusters when you are finished working in a notebook, to stop using compute power entirely and avoid unnecessary costs. Standard VMs cost little to keep unless they have a large disk (e.g., you transferred a very large file, such as the WGS PLINK data, to the persistent disk and want to keep it).