Introduction to batch processing in the All of Us Researcher Workbench
With the launch of genomic data in the All of Us Researcher Workbench, the amount and complexity of data now available provide the opportunity for larger-scale genomic research, both in breadth (types of analyses) and in scale (e.g., number of samples). The Researcher Workbench now includes pre-alpha support for running batch processes, allowing you to automate and parallelize repetitive steps in your analyses.
We believe researchers will have other downstream use cases, such as filtering, format conversions (e.g., PLINK bed to GENESIS Genomic Data Structure, GDS), and/or custom QC. Because these bulk processing steps vary between researchers, we recommend using batch processes to automate this work. This helps you avoid repeating manual steps and apply the same processing steps to different sets of data. If you have a task or script that you want to run multiple times, or that processes many files (e.g., all of the VCF files in the WGS joint callset), you may benefit from using a batch process instead of an analysis notebook. Note: Hail is not supported by any workflow tool in the Researcher Workbench.
Note: we suggest using the us-central1 region when launching Google Cloud Life Sciences API batch jobs, because the CDR bucket and your workspace buckets live in us-central1. If you launch jobs in other regions, you will incur network egress charges.
Three tools have been added to the Researcher Workbench to support batch processing: Cromwell, Nextflow, and dsub.
Which batch processing tool should you use?
- If you want to run a workflow, refer to our introduction to workflows, which includes instructions for getting started with Cromwell and Nextflow via our featured workspaces.
- If you have a bash or python script you would like to parallelize, we recommend dsub: How to use dsub in the Researcher Workbench (see the sketch after this list).
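For a sense of what this looks like, below is a minimal sketch of parallelizing a script with dsub's --tasks file. The script name, image, and bucket paths are placeholders, and GOOGLE_PROJECT and WORKSPACE_BUCKET are assumed to be set in your Workbench environment; consult the dsub support article above for the exact provider and project settings the Workbench requires.

```bash
# Sketch: fan a script out over many inputs with a dsub --tasks file.
# my_script.sh, the image, and all paths are placeholders.
#
# tasks.tsv (tab-separated), one row per parallel task, e.g.:
#   --input INPUT_VCF                              --output OUTPUT_STATS
#   gs://<your-workspace-bucket>/vcf/chr1.vcf.gz   gs://<your-workspace-bucket>/stats/chr1.txt
#   gs://<your-workspace-bucket>/vcf/chr2.vcf.gz   gs://<your-workspace-bucket>/stats/chr2.txt
# Inside each task, my_script.sh sees INPUT_VCF and OUTPUT_STATS as local paths.

dsub \
  --provider google-cls-v2 \
  --project "${GOOGLE_PROJECT}" \
  --regions us-central1 \
  --logging "${WORKSPACE_BUCKET}/dsub-logs/" \
  --image "us.gcr.io/broad-gatk/gatk:4.2.5.0" \
  --script my_script.sh \
  --tasks tasks.tsv \
  --wait
```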
Batch Processing Limitations
Internet Access Restriction
Within the Researcher Workbench, internet access from batch VMs is restricted. With the exception of Google APIs, batch VMs cannot send or receive network traffic, including downloading files, calling external APIs, or installing packages or code.
If you need additional data, tools, or scripts on your batch VMs, first upload them to any accessible Cloud Storage location (including your workspace bucket). You can then bring these files onto your batch VM using dsub's data localization (the --input flag) or by running gsutil from within your batch process script, as in the sketch below.
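For example, a batch script can stage its own inputs and outputs with gsutil. This is a sketch only: it assumes the Docker image you run includes gsutil and that you pass the bucket path into the job (for example with dsub's --env flag); the file names are placeholders.

```bash
#!/bin/bash
# Sketch of a batch script that stages its own inputs/outputs with gsutil.
# WORKSPACE_BUCKET is assumed to be provided to the job (e.g., via --env).
set -euo pipefail

# Pull a helper script and a data file from the workspace bucket.
gsutil cp "${WORKSPACE_BUCKET}/scripts/helper.py" .
gsutil cp "${WORKSPACE_BUCKET}/data/samples.tsv" .

python3 helper.py samples.tsv > results.tsv

# Push results back to the workspace bucket.
gsutil cp results.tsv "${WORKSPACE_BUCKET}/results/"
```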
Docker Hub
To use a Docker image from DockerHub, you will use the Google Artifact Registry remote repository feature to pull the image into the Researcher Workbench. To learn more about this process, please see this support article, Using Docker Images on the Workbench.
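As a rough illustration only (the actual registry host, project, and repository used by the Workbench are documented in that support article, so treat the values below as hypothetical), an image that originates on Docker Hub is referenced through an Artifact Registry remote repository by a pkg.dev path rather than its Docker Hub name:

```bash
# Hypothetical path only; substitute the host, project, and repository names
# from "Using Docker Images on the Workbench". Official Docker Hub images sit
# under the library/ namespace when proxied by a remote repository.
IMAGE="us-central1-docker.pkg.dev/<project>/<remote-repo>/library/python:3.10"

# This is the string you would hand to your batch tool, e.g. dsub --image.
echo "Using image: ${IMAGE}"
```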
Docker container images in GCR
Each of these docker images is a lightweight, standalone, executable package of software that includes everything needed to run an application: code, runtime, system tools, system libraries and settings. Below is a list of tools and the associated public docker images in GCR that will work with batch processes:
- All public GCR images: https://console.cloud.google.com/gcr/images/google-containers/GLOBAL
- All of Us public GCR images: https://console.cloud.google.com/gcr/images/broad-dsp-gcr-public/us/terra-jupyter-aou
- GATK: us.gcr.io/broad-gatk/gatk:{version} (e.g., us.gcr.io/broad-gatk/gatk:4.2.5.0)
  - The canonical image for GATK (see the example after this list)
- Tools available in Researcher Workbench (RW) Jupyter notebooks: us.gcr.io/broad-dsp-gcr-public/terra-jupyter-aou:{version}
  - You can find vcftools, PLINK, and other tools here. A complete list can be found in the image's Dockerfile: https://github.com/DataBiosphere/terra-docker/blob/master/terra-jupyter-aou/Dockerfile
  - This image only works in Cromwell and dsub; Nextflow cannot run tasks with this Docker image. If you need to run tools from the RW analysis notebooks in Nextflow, we recommend identifying individual Docker containers for each tool. We are investigating the possibility of making this image compatible with Nextflow.
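For instance, the canonical GATK image could be used with dsub along these lines. This is a sketch only: the VCF path is a placeholder, and the same Workbench-specific provider/project settings apply as in the earlier examples.

```bash
# Sketch: run a GATK tool using the canonical GATK image listed above.
# The input path is a placeholder pointing at a VCF in your workspace bucket.
dsub \
  --provider google-cls-v2 \
  --project "${GOOGLE_PROJECT}" \
  --regions us-central1 \
  --logging "${WORKSPACE_BUCKET}/dsub-logs/" \
  --image "us.gcr.io/broad-gatk/gatk:4.2.5.0" \
  --input VCF="${WORKSPACE_BUCKET}/data/my_cohort.vcf.gz" \
  --command 'gatk CountVariants -V "${VCF}"' \
  --wait
```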
Compute cost must be estimated manually
There is no instrumentation in the Researcher Workbench to track workflow cost. To determine the compute cost of a workflow, you will need to estimate the cost of the VMs and the time spent on each VM. Please note that storage is billed separately; the Researcher Workbench uses standard storage for workspace buckets.
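A rough estimate is (number of VMs) × (hours per VM) × (hourly machine rate). The sketch below uses an assumed example rate; check current Google Cloud pricing for your machine type, disk, and region.

```bash
# Back-of-the-envelope compute-cost estimate (illustrative numbers only).
N_VMS=100            # number of parallel tasks/VMs
HOURS_PER_VM=2       # approximate wall-clock hours per VM
RATE_PER_HOUR=0.19   # assumed $/hour for the chosen machine type (example rate)
awk -v n="$N_VMS" -v h="$HOURS_PER_VM" -v r="$RATE_PER_HOUR" \
    'BEGIN { printf "Estimated compute cost: $%.2f\n", n * h * r }'
# Remember to add persistent disk time and any egress; workspace bucket
# storage is billed separately as standard storage.
```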