Efficient genomic analysis requires strategic planning and adherence to best practices, especially when dealing with large datasets and complex computational tools like Hail. This article outlines general recommendations for all genomic analyses and provides specific guidance for using Hail in genomic data processing on the Workbench.
General Recommendations for Genomic Analyses
Start Small Before Scaling Up
Before committing a significant amount of expensive computational resources to a full-scale analysis, it’s crucial to validate your workflow on a smaller dataset. Testing on a limited subset helps identify potential issues early and ensures that your code functions correctly.
- Testing Sample: Subset your dataset to a small portion for testing, such as a small genomic interval or one of the smaller chromosomes, like chromosome 22 (see the Hail sketch after this list). Testing your pipeline on this small extract lets you confirm that it works end to end before you pay to spin up large compute resources that may go to waste if the pipeline hits an unexpected error.
- Iterative Debugging: Run your code manually when testing, if possible, stepping through the pipeline one stage at a time to locate bottlenecks or errors so you can optimize them promptly.
- Resource Management: By starting small, you can estimate the computational resources your full analysis will need, preventing over-allocation or under-utilization. To aid in this estimation, we recommend monitoring your compute environment while testing; monitoring resources are discussed later in this article.
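As referenced above, here is a minimal sketch of subsetting a Hail MatrixTable to chromosome 22, or to an even smaller interval, for pipeline testing. The `mt_path` value is a placeholder for wherever your data live; adjust the paths and reference genome to your own setup.

```python
import hail as hl

hl.init(default_reference='GRCh38')

# Placeholder path: point this at the MatrixTable you are analyzing
mt_path = 'gs://your-bucket/path/to/data.mt'
mt = hl.read_matrix_table(mt_path)

# Keep only chromosome 22 for a quick end-to-end test of the pipeline
mt_small = hl.filter_intervals(
    mt, [hl.parse_locus_interval('chr22', reference_genome='GRCh38')])

# Or keep an even smaller genomic interval
mt_tiny = hl.filter_intervals(
    mt, [hl.parse_locus_interval('chr22:20000000-20100000',
                                 reference_genome='GRCh38')])
```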
Run Different Processes in Separate Notebooks
Genomic analysis pipelines often involve multiple complex sample-processing steps, each of which can consume a large amount of computational resources. If every step runs in the same notebook or script, that notebook takes a long time to complete all of its operations and becomes very difficult to debug. Instead, perform each task in its own notebook to improve the clarity and time efficiency of each step. Separating the pipeline into individual notebooks or scripts also makes testing more modular and easier to monitor and maintain over time.
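One way to keep notebooks modular is to write intermediate results to your workspace bucket at the end of one notebook and read them back at the start of the next. The sketch below uses the `WORKSPACE_BUCKET` environment variable, which is set automatically in Workbench cloud environments; the `chr22_test.mt` output name is hypothetical, and `mt_small` stands in for the subsetted MatrixTable from the sketch above.

```python
import os
import hail as hl

bucket = os.getenv('WORKSPACE_BUCKET')

# End of notebook 1: persist the processed MatrixTable for downstream notebooks
mt_small.write(f'{bucket}/data/chr22_test.mt', overwrite=True)

# Start of notebook 2: pick up exactly where the previous notebook left off
mt_small = hl.read_matrix_table(f'{bucket}/data/chr22_test.mt')
```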
- Phenotypic Data Preparation: It is best practice to use a dedicated notebook for each step of your analysis pipeline. For example, when processing phenotypic data, such as Electronic Health Record (EHR) data, keep the phenotypic processing steps separate from any genotypic data processing steps.
- VAT Processing: The Variant Annotation Table (VAT) is incredibly large, so it is best practice to handle VAT queries and manipulations in a separate notebook. This modularity helps in isolating issues and optimizing each part of the workflow individually.
- Keep Notebooks Small: Complex genomic analyses not only process large amounts of data; the code and outputs in the notebooks themselves can grow quite large too. The Workbench includes security measures that preserve the privacy of the Program’s participants, and these measures can sometimes interfere with normal operations, such as opening a notebook. If a notebook file grows large enough, it can trigger the security system and possibly lead to the temporary suspension of your account in an event known as an ‘egress alert’. For this reason, split the steps of your analysis among different notebooks, and clear any cells with large outputs (1,000+ rows) before saving and halting the notebook and environment (see the example below).
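If you prefer to clear every output at once rather than clearing cells by hand, one option is Jupyter’s built-in nbconvert tool, run here from a notebook cell with the `!` shell escape. The notebook filename is just an example.

```python
# Strip all cell outputs from the saved notebook file, in place
!jupyter nbconvert --clear-output --inplace my_analysis_notebook.ipynb
```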
Monitor Analyses in Progress
If you are running a time-intensive analysis that can take days to complete, or if a command seems to be taking much longer than it should, we suggest monitoring your Cloud Analysis Environment while the process runs to inform your compute resource use. Monitoring the environment as it performs computations lets you see in real time whether you have enough resources or whether increasing your resources or cluster size could help the command complete sooner. Please reference the 05_Monitor_Cloud_Analysis_Environment.ipynb notebook and the Google Cloud article on monitoring CPU utilization with the GCP Metrics Explorer, as well as the tutorial video on monitoring environments on the Workbench. To use the 05 monitoring notebook, copy it to your Workspace so you can run it in edit mode and generate reusable links to the Metrics Explorer graphs for your Workspace.
The GCP Metrics Explorer page will help you track both CPU and memory usage (and many other metrics) so you can optimize your environment for time and cost.
- Resource Utilization Tracking: Use the monitoring tools provided through the GCP Metrics Explorer to keep an eye on CPU, memory, and disk usage; a minimal in-notebook snapshot is also sketched after this list. Note: the available metrics change based on your environment’s configuration and whether the environment is active.
- Auto-Pause Considerations: Be aware that the cluster will auto-pause after 24 hours. If a background job takes longer than 24 hours, prevent the cluster from shutting down by logging in and starting any notebook in the workspace to reset the auto-pause timer.
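The Metrics Explorer is the most complete view, but for a quick, point-in-time check you can also take a snapshot from inside a running notebook. This sketch uses the third-party psutil package (assumed to be installed; `pip install psutil` if not) and only reflects the machine the notebook runs on, typically the master/driver node rather than the Dataproc workers; the `/home/jupyter` disk path is an assumption about the environment’s layout.

```python
import psutil

# CPU utilization across all cores, sampled over one second
print(f"CPU: {psutil.cpu_percent(interval=1)}%")

# Memory usage on this machine (the notebook's master/driver node)
mem = psutil.virtual_memory()
print(f"Memory: {mem.percent}% of {mem.total / 1e9:.1f} GB used")

# Persistent disk usage for the notebook's home directory (path is an assumption)
disk = psutil.disk_usage('/home/jupyter')
print(f"Disk: {disk.percent}% of {disk.total / 1e9:.1f} GB used")
```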
Optimize Computational Resources
Scaling up your resources is often necessary to make Hail genomic analyses more time efficient. While we recommend starting small for testing, as discussed in the first section, some full analyses are unavoidably resource-intensive and will require large Dataproc clusters to complete. When scaling your Dataproc clusters and running longer analyses, please keep the following in mind:
- More expensive environments can make large-scale analyses cheaper: It is sometimes cheaper to run a job on a larger cluster (more $/hr) and have it complete faster than to let the job run in the background on a smaller cluster over a longer timeframe (see the hypothetical cost sketch after this list).
- Preemptible Recommendations: Google provides recommendations for getting the best results with preemptible workers. The number of preemptible (secondary) workers should be less than 50% of the total number of workers (primary plus all secondary) in your cluster. For example, in a cluster of 100 workers, fewer than 50 should be preemptible.
- Random Resource Reallocation: Google can reclaim preemptible workers at any time, so if you are running an analysis or a command that takes multiple days to complete on a large Dataproc cluster, your jobs are at a higher risk of failure if Google reclaims your preemptibles mid-run. Because of this, an analysis can fail at random even though the code itself is error-free; when this happens, the job simply needs to be re-submitted or re-run.
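As a purely hypothetical illustration of the first point above (all prices and runtimes are made up for the example), a larger cluster can cost less overall, and finishing within 24 hours also sidesteps the auto-pause concerns described in the next section:

```python
# Hypothetical numbers only; check the Workbench cost estimates for your actual configuration
small_cluster_rate = 2.50   # $/hr for a smaller cluster
small_cluster_hours = 20    # job runs longer on fewer workers

large_cluster_rate = 7.00   # $/hr for a larger cluster
large_cluster_hours = 5     # same job finishes much faster

print(f"Small cluster: ${small_cluster_rate * small_cluster_hours:.2f}")  # $50.00
print(f"Large cluster: ${large_cluster_rate * large_cluster_hours:.2f}")  # $35.00
```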
Manage Long-Running Processes
Since many commands, like Hail commands, can take a while to execute, we have a number of recommendations for running the process and troubleshooting longer execution times. To manage the large data burden of the WGS data and the lengthy analysis times, you can run notebooks in the background and monitor their progress. The 05_How to Run Notebooks in the Background.ipynb notebook allows you to run any of your analysis notebooks in the background. It is ideal for notebooks that take longer than 30 minutes to run, so you can use a reasonably priced environment and still complete your analysis. To use the 05 background notebook, copy it to your Workspace so you can edit and use it for your analysis.
Please note: Dataproc clusters on the Workbench will auto-pause after 24 hours, so be sure to log in at least once a day and click around in an open notebook, including the title area at the top, to reset the auto-pause timer. Although the background harness notebook should output a timestamped notebook when it completes, whether successfully or in error, your Workspace environment may not shut down even after background activity has ceased. This is another reason we recommend checking on your background run and monitoring your compute metrics at least once a day.
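The 05 background notebook is the supported way to do this on the Workbench. For context only, a generic way to execute a notebook non-interactively and capture a timestamped output copy is the papermill package; this is a sketch of the general idea, not necessarily what the 05 harness does internally, and the notebook filename is hypothetical.

```python
import datetime
import papermill as pm  # assumes papermill is installed in the environment

# Run the analysis notebook top to bottom and save a timestamped copy with all outputs
stamp = datetime.datetime.now().strftime('%Y%m%d_%H%M%S')
pm.execute_notebook(
    '02_run_gwas.ipynb',                  # hypothetical analysis notebook
    f'02_run_gwas_output_{stamp}.ipynb',  # timestamped output notebook
)
```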
Summary
Implementing these best practices in your genomic analyses can lead to more efficient, reliable, and cost-effective workflows. Starting small, modularizing your processes, and monitoring your analyses are foundational steps applicable to all genomic work. When working with Hail, understanding its unique features like lazy evaluation and optimizing your code accordingly can significantly enhance performance.
By thoughtfully managing computational resources, separating concerns in your analysis pipeline, and utilizing tools effectively, you can navigate the complexities of genomic data analysis with greater confidence and success.
For additional information, please review these resources:
- Hail Documentation
- Hail Cheatsheets
- Dataproc Secondary Worker (Preemptible) Recommendations
- Google Cloud Monitoring with the Metrics Explorer
- Batch Processing with dsub
Feel free to reach out to the support team at support@researchallofus.org with specific questions about your analysis workflow.