We understand that cloud analysis can come at a cost. We hope the recommendations in this article assist you in your project planning and minimize the potential cost of your analysis. To understand more about billing and credits, please visit our support materials here.
Roughly calculating cost
Cost = Data + (Compute Resources × Time) + (Storage × Time)
Data: the amount of data you pull from BigQuery.
Compute Resources: the number of CPUs/GPUs and the amount of RAM in your environment.
Time: the amount of time your environment spends active or idle.
Storage: the amount of storage space used on your persistent disk or in your workspace bucket.
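As a rough illustration, the formula above can be written as a small helper. All of the rates below are placeholders, not actual Workbench pricing; substitute the figures from your own billing breakdown.

```python
# Rough cost sketch based on the formula above.
# All rates are illustrative placeholders, NOT actual Workbench pricing.

def estimate_cost(data_gb, data_rate,          # data pulled from BigQuery ($/GB)
                  compute_rate, active_hours,  # compute resources ($/hour) * time
                  storage_gb, storage_rate, storage_hours):  # storage ($/GB-hour) * time
    """Cost = Data + Compute Resources * Time + Storage * Time."""
    return (data_gb * data_rate
            + compute_rate * active_hours
            + storage_gb * storage_rate * storage_hours)

# Example: 10 GB pulled, 4 hours on a $0.20/hour VM,
# 100 GB stored for a month (~720 hours) at $0.0001/GB-hour
cost = estimate_cost(10, 0.01, 0.20, 4, 100, 0.0001, 720)  # about $8.10
```

The point of the sketch is that compute and storage are both multiplied by time, so pausing or deleting idle environments and cleaning up unneeded storage directly reduce cost.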
The equation above gives a general breakdown of how costs accumulate in the Researcher Workbench. Additional factors come into play as you customize your Cloud Analysis Environments and work with specific kinds of data. For more detail, see the support article What exactly am I paying for?
Understanding your analysis before creating your cloud environment
Optimizing your workspace can be as simple as having a good understanding of the analyses you intend to perform beforehand. Knowing the computational needs of your analyses lets you customize your cloud environment to meet those needs accurately and precisely, avoiding unnecessary costs. For example, if your analysis submits only one job at a time, meaning it cannot be parallelized, you should run a standard virtual machine (VM). If your analysis can be parallelized, a Spark cluster will give you the most optimized cloud analysis environment. Another approach is to start with a larger environment (more compute resources) that fits the needs of your analysis and scale back as needed. These support articles will help when considering the size of your VM, when to increase your compute resources, and your storage options:
- Can I increase the size of my virtual machine (VM)?
- Prevent your Kernel from dying
- Storage Options Explained
- Optimizing your cloud environment (video)
Test your analysis on a small sample and/or a small region of the genome
One of the greatest assets of the All of Us dataset is its large sample size. However, the larger the sample, the higher the compute cost and the longer the run time. If your analysis produces an error or an unexpected result partway through, you will have incurred that compute cost and time unnecessarily.
To avoid these unnecessary compute costs, we recommend that researchers:
- Run the analysis against a smaller sample and/or a smaller region of the genome first
- Verify the outcome before committing more compute resources
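For example, a pilot run on a small random subset of your cohort can be drawn with Python's standard library. The `person_ids` list below is a hypothetical stand-in for the IDs returned by your dataset query.

```python
import random

# Hypothetical list of participant IDs from your cohort query
person_ids = list(range(1, 10001))

# Draw a small, reproducible pilot sample before committing to the full cohort
random.seed(42)
pilot_ids = random.sample(person_ids, k=500)

# Run and verify your analysis on pilot_ids first, then scale up to person_ids
```

Seeding the random generator makes the pilot sample reproducible, so you can re-run the same subset while debugging.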
Some cases where this is especially relevant are notebooks designed to run to completion or to run in the background. You can learn more about controlling cloud costs using sample use cases in this Terra support article: Controlling Cloud costs - sample use cases.
Tips for using Hail
For a Hail analysis, use filter_intervals to limit your analysis to a small region of the genome:
# assumes `import hail as hl` and an existing MatrixTable `mt`
test_intervals = ['chr1:100M-200M', 'chr16:29.1M-30.2M']
mt = hl.filter_intervals(
    mt,
    [hl.parse_locus_interval(x, reference_genome='GRCh38')
     for x in test_intervals])
- With filter_rows, Hail still reads every data partition, even if the final set of rows is small.
- With filter_intervals, Hail can skip the data partitions that are not relevant, which is much faster.
- We recommend using filter_intervals whenever you can, and applying it before any filter_rows operation.
- See examples of how to specify genomic region intervals in the Hail documentation.
When running Hail, you can start with a small Spark cluster (e.g. one master and two workers) to test your analysis.
- After you have confirmed that your analysis works correctly and you are ready to run it on the full data, you can increase the number of workers or preemptible workers.
- As long as you do not change the CPU/RAM/disk settings, the additional VMs will be added to your existing Spark cluster.
Combine analyses and utilize Workspace Buckets
As you progress in the Researcher Workbench and create your workspace, you may develop numerous Jupyter Notebooks to conduct your analysis. We recommend combining analyses located across multiple notebooks into one, as long as you are using a consistent coding language across all notebooks (R or Python). Having analyses located in fewer notebooks cuts down on navigation time and prevents your cloud analysis environment from stalling out due to inactivity.
By combining analyses within a single notebook, you may end up with multiple dataset queries in that notebook. Instead of re-running every query each time you open the notebook, you can save each query's results to your workspace bucket for faster access. A workspace bucket is a Google Cloud Storage space attached to your workspace for saving files from your notebooks. After running a query or command, save the resulting files to your workspace bucket; you can then access them at any time, from any notebook in the same workspace, without re-running the commands. The support article How do I access the workspace bucket and copy data to and from it? gives instructions on how to do this.
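As a minimal sketch: in a Researcher Workbench notebook the bucket path is available in the WORKSPACE_BUCKET environment variable, and files are typically copied with gsutil. The `data/` subfolder, the file name, and the fallback bucket name below are illustrative choices, not a required layout.

```python
import os
import subprocess

def bucket_copy_command(local_path, bucket, subdir='data'):
    """Build the gsutil command that copies a local file into the workspace bucket."""
    return ['gsutil', 'cp', local_path, f'{bucket}/{subdir}/']

# In a Workbench notebook, WORKSPACE_BUCKET is set for you; the fallback here
# is a placeholder so this sketch also runs outside the Workbench.
bucket = os.getenv('WORKSPACE_BUCKET', 'gs://example-workspace-bucket')

cmd = bucket_copy_command('my_query_result.csv', bucket)
# To actually perform the copy inside your notebook:
# subprocess.run(cmd, check=True)
```

Saving a query's output once and copying it back from the bucket in later sessions avoids paying for the same query repeatedly.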
Run notebook in the background
Another way to optimize your time spent running analyses is to run them in the background. The benefit of running a notebook in the background is that you can work in other notebooks, workspaces, or step away from your computer, and not have to worry about your cloud analysis environment becoming inactive and stalling out. The virtual machine (VM) will run until the job is done. Once the job is finished, the output will be saved to the specified location. We have provided a tutorial workspace on how to perform this activity, which can be accessed here.
Keep track of your current cloud analysis environments
A cloud analysis environment has two states, active and paused. These states behave differently across the apps available in the Workbench (Jupyter Notebooks & Cromwell).
- An active environment should be utilized while directly working in a Jupyter Notebook performing analysis.
- A paused environment should be utilized when you are not directly working in a particular Jupyter Notebook, but would like to keep it running in order to return to it after a short period of time. A paused environment uses less computing power and therefore accrues less cost.
In the Cromwell Application, environments can only be active and are not subject to automatic environment deletion.
- It is also important to end (delete) your environment when you are finished working in a notebook to stop using compute power entirely and avoid unnecessary costs.