Efficient genomic analysis requires strategic planning and adherence to best practices, especially when dealing with large datasets and complex computational tools like Hail. This article outlines general recommendations for all genomic analyses and provides specific guidance for using Hail in genomic data processing on the Workbench.
General Recommendations for Genomic Analyses
Start Small Before Scaling Up
Before committing a significant amount of expensive computational resources to a full-scale analysis, it’s crucial to validate your workflow on a smaller dataset. Testing on a limited subset helps identify potential issues early and ensures that your code functions correctly.
- Testing Sample: Subset your dataset to a small portion for testing, such as a small genomic interval or one of the smaller chromosomes, like chromosome 22 (see the Hail sketch after this list). Testing your pipeline on this small extract lets you confirm that it works end to end before you pay to spin up large compute resources that may go to waste if the pipeline hits an unexpected error.
- Iterative Debugging: Run your code manually when testing, if possible, stepping through the pipeline one stage at a time to locate bottlenecks or errors so you can optimize them promptly.
- Resource Management: By starting small, you can estimate the computational resources your full analysis will need, preventing over-allocation or under-utilization. To aid in this estimation, we recommend monitoring your compute environment while testing; monitoring resources are discussed later in this article.
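As referenced above, here is a minimal sketch of subsetting a Hail MatrixTable to chromosome 22, or to an even smaller interval, for pipeline testing. The `mt_path` value is a placeholder for wherever your data live; adjust the paths and reference genome to your own setup.

```python
import hail as hl

hl.init(default_reference='GRCh38')

# Placeholder path: point this at the MatrixTable you are analyzing
mt_path = 'gs://your-bucket/path/to/data.mt'
mt = hl.read_matrix_table(mt_path)

# Keep only chromosome 22 for a quick end-to-end test of the pipeline
mt_small = hl.filter_intervals(
    mt, [hl.parse_locus_interval('chr22', reference_genome='GRCh38')])

# Or keep an even smaller genomic interval
mt_tiny = hl.filter_intervals(
    mt, [hl.parse_locus_interval('chr22:20000000-20100000',
                                 reference_genome='GRCh38')])
```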
Run Different Processes in Separate Notebooks
Genomic analysis pipelines often involve multiple complex sample-processing steps, each of which can consume a large amount of computational resources. If every step runs in the same notebook or script, that notebook takes a long time to complete all of its operations and becomes very difficult to debug. Instead, perform each task in its own notebook to improve the clarity and time efficiency of each step. Separating the pipeline into individual notebooks or scripts also makes testing more modular and easier to monitor and maintain over time.
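One way to keep notebooks modular is to write intermediate results to your workspace bucket at the end of one notebook and read them back at the start of the next. The sketch below uses the `WORKSPACE_BUCKET` environment variable, which is set automatically in Workbench cloud environments; the `chr22_test.mt` output name is hypothetical, and `mt_small` stands in for the subsetted MatrixTable from the sketch above.

```python
import os
import hail as hl

bucket = os.getenv('WORKSPACE_BUCKET')

# End of notebook 1: persist the processed MatrixTable for downstream notebooks
mt_small.write(f'{bucket}/data/chr22_test.mt', overwrite=True)

# Start of notebook 2: pick up exactly where the previous notebook left off
mt_small = hl.read_matrix_table(f'{bucket}/data/chr22_test.mt')
```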
- Phenotypic Data Preparation: It is best practice to use a dedicated notebook for each step of your analysis pipeline. For example, when processing phenotypic data, such as Electronic Health Record (EHR) data, keep the phenotypic processing steps separate from any genotypic data processing steps.
- VAT Processing: The Variant Annotation Table (VAT) is incredibly large, so it is best practice to handle VAT queries and manipulations in a separate notebook. This modularity helps in isolating issues and optimizing each part of the workflow individually.
- Keep Notebooks Small: Complex genomic analyses not only process large amounts of data; the code and outputs in the notebooks themselves can grow quite large too. The Workbench includes security measures that preserve the privacy of the Program’s participants, and these measures can sometimes interfere with normal operations, such as opening a notebook. If a notebook file grows large enough, it can trigger the security system and possibly lead to the temporary suspension of your account in an event known as an ‘egress alert’. For this reason, split the steps of your analysis among different notebooks, and clear any cells with large outputs (1,000+ rows) before saving and halting the notebook and environment (see the example below).
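If you prefer to clear every output at once rather than clearing cells by hand, one option is Jupyter’s built-in nbconvert tool, run here from a notebook cell with the `!` shell escape. The notebook filename is just an example.

```python
# Strip all cell outputs from the saved notebook file, in place
!jupyter nbconvert --clear-output --inplace my_analysis_notebook.ipynb
```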
Monitor Analyses in Progress
If you are running a time-intensive analysis that can take days to complete, or if a command seems to be taking much longer than it should, we suggest monitoring your Cloud Analysis Environment while the process runs to inform your compute resource use. Monitoring the environment as it performs computations lets you see in real time whether you have enough resources or whether increasing your resources or cluster size could help the command complete sooner. Please reference the 05_Monitor_Cloud_Analysis_Environment.ipynb notebook and the Google Cloud article on monitoring CPU utilization with the GCP Metrics Explorer, as well as the tutorial video on monitoring environments on the Workbench. To use the 05 monitoring notebook, copy it to your Workspace so you can run it in edit mode and generate reusable links to the Metrics Explorer graphs for your Workspace.
The GCP Metrics Explorer page will help you track both CPU and memory usage (and many other metrics) so you can optimize your environment for time and cost.
- Resource Utilization Tracking: Use the monitoring tools provided through the GCP Metrics Explorer to keep an eye on CPU, memory, and disk usage; a minimal in-notebook snapshot is also sketched after this list. Note: the available metrics change based on your environment’s configuration and whether the environment is active.
- Auto-Pause Considerations: Be aware that the cluster will auto-pause after 24 hours. If a background job takes longer than 24 hours, prevent the cluster from shutting down by logging in and starting any notebook in the workspace to reset the auto-pause timer.
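The Metrics Explorer is the most complete view, but for a quick, point-in-time check you can also take a snapshot from inside a running notebook. This sketch uses the third-party psutil package (assumed to be installed; `pip install psutil` if not) and only reflects the machine the notebook runs on, typically the master/driver node rather than the Dataproc workers; the `/home/jupyter` disk path is an assumption about the environment’s layout.

```python
import psutil

# CPU utilization across all cores, sampled over one second
print(f"CPU: {psutil.cpu_percent(interval=1)}%")

# Memory usage on this machine (the notebook's master/driver node)
mem = psutil.virtual_memory()
print(f"Memory: {mem.percent}% of {mem.total / 1e9:.1f} GB used")

# Persistent disk usage for the notebook's home directory (path is an assumption)
disk = psutil.disk_usage('/home/jupyter')
print(f"Disk: {disk.percent}% of {disk.total / 1e9:.1f} GB used")
```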
Optimize Computational Resources
Scaling up your resources is often necessary to make Hail genomic analyses more time efficient. While we recommend starting small for testing, as discussed in the first section, some full analyses are unavoidably resource-intensive and will require large Dataproc clusters to complete. When scaling your Dataproc clusters and running longer analyses, please keep the following in mind:
- More expensive environments can make large-scale analyses cheaper: It is sometimes cheaper to run a job on a larger cluster (more $/hr) and have it complete faster than to let the job run in the background on a smaller cluster over a longer timeframe (see the hypothetical cost sketch after this list).
- Preemptible Recommendations: Google provides recommendations for getting the best results with preemptible workers. The number of preemptible (secondary) workers should be less than 50% of the total number of workers (primary plus all secondary) in your cluster. For example, in a cluster of 100 workers, fewer than 50 should be preemptible.
- Random Resource Reallocation: Google can reclaim preemptible workers at any time, so if you are running an analysis or a command that takes multiple days to complete on a large Dataproc cluster, your jobs are at a higher risk of failure if Google reclaims your preemptibles mid-run. Because of this, an analysis can fail at random even though the code itself is error-free; when this happens, the job simply needs to be re-submitted or re-run.
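As a purely hypothetical illustration of the first point above (all prices and runtimes are made up for the example), a larger cluster can cost less overall, and finishing within 24 hours also sidesteps the auto-pause concerns described in the next section:

```python
# Hypothetical numbers only; check the Workbench cost estimates for your actual configuration
small_cluster_rate = 2.50   # $/hr for a smaller cluster
small_cluster_hours = 20    # job runs longer on fewer workers

large_cluster_rate = 7.00   # $/hr for a larger cluster
large_cluster_hours = 5     # same job finishes much faster

print(f"Small cluster: ${small_cluster_rate * small_cluster_hours:.2f}")  # $50.00
print(f"Large cluster: ${large_cluster_rate * large_cluster_hours:.2f}")  # $35.00
```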
Manage Long-Running Processes
Since many commands, like Hail commands, can take a while to execute, we have a number of recommendations for running the process and troubleshooting longer execution times. To manage the large data burden of the WGS data and the lengthy analysis times, you can run notebooks in the background and monitor their progress. The 05_How to Run Notebooks in the Background.ipynb notebook allows you to run any of your analysis notebooks in the background. It is ideal for notebooks that take longer than 30 minutes to run, so you can use a reasonably priced environment and still complete your analysis. To use the 05 background notebook, copy it to your Workspace so you can edit and use it for your analysis.
Please note: Dataproc clusters on the Workbench will auto-pause after 24 hours, so be sure to log in at least once a day and click around in an open notebook, including the title area at the top, to reset the auto-pause timer. Although the background harness notebook should output a timestamped notebook when it completes, whether successfully or in error, your Workspace environment may not shut down even after background activity has ceased. This is another reason we recommend checking on your background run and monitoring your compute metrics at least once a day.
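The 05 background notebook is the supported way to do this on the Workbench. For context only, a generic way to execute a notebook non-interactively and capture a timestamped output copy is the papermill package; this is a sketch of the general idea, not necessarily what the 05 harness does internally, and the notebook filename is hypothetical.

```python
import datetime
import papermill as pm  # assumes papermill is installed in the environment

# Run the analysis notebook top to bottom and save a timestamped copy with all outputs
stamp = datetime.datetime.now().strftime('%Y%m%d_%H%M%S')
pm.execute_notebook(
    '02_run_gwas.ipynb',                  # hypothetical analysis notebook
    f'02_run_gwas_output_{stamp}.ipynb',  # timestamped output notebook
)
```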
Summary
Implementing these best practices in your genomic analyses can lead to more efficient, reliable, and cost-effective workflows. Starting small, modularizing your processes, and monitoring your analyses are foundational steps applicable to all genomic work. When working with Hail, understanding its unique features like lazy evaluation and optimizing your code accordingly can significantly enhance performance.
By thoughtfully managing computational resources, separating concerns in your analysis pipeline, and utilizing tools effectively, you can navigate the complexities of genomic data analysis with greater confidence and success.
For additional information, please review these resources:
- Hail Documentation
- Hail Cheatsheets
- Dataproc Secondary Worker (Preemptible) Recommendations
- Google Cloud Monitoring with the Metrics Explorer
- Batch Processing with dsub
Feel free to reach out to the support team at support@researchallofus.org with specific questions about your analysis workflow.