Introduction to batch processing in the All of Us Researcher Workbench
With the launch of genomic data in the All of Us Researcher Workbench, the amount and complexity of data now available provide the opportunity for larger-scale genomic research, both in breadth (types of analyses) and scale (e.g., number of samples). The All of Us Researcher Workbench now includes pre-alpha support for running batch processes, allowing you to automate and parallelize repetitive steps in your analyses.
We believe researchers will have other downstream use cases, such as filtering, format conversions (e.g., PLINK bed to GENESIS Genomic Data Structure (GDS)), and/or custom QC. Since these bulk processing steps vary between researchers, we recommend using batch processes to automate this work. This helps you avoid repeating manual steps and apply the same processing steps to different sets of data. If you have a task or script you want to run multiple times, or need to process many files (e.g., all of the VCF files in the WGS joint callset), you may benefit from using a batch process in lieu of an analysis notebook.
Note: we suggest using the us-central1 region when launching Google Life Sciences API batch jobs, because our CDR bucket and your workspace buckets live in us-central1. If you launch jobs in other regions, you will incur network egress charges.
Three tools have been added to the Researcher Workbench to support batch processing: Cromwell, Nextflow, and dsub.
Which batch processing tool should you use?
- If you want to run a workflow, refer to our introduction to workflows, which includes instructions for getting started with Cromwell and Nextflow via our featured workspaces.
- If you have a bash or Python script you would like to parallelize, we recommend dsub: How to use dsub in the Researcher Workbench. A small example is sketched below.
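To give a flavor of what dsub parallelization looks like, here is a minimal sketch that fans a single command out over many input files. The project, bucket, and TSV contents are hypothetical placeholders; `--tasks` is dsub's standard mechanism for launching one task per row of a TSV file, and `GOOGLE_PROJECT` and `WORKSPACE_BUCKET` are assumed to be set as they are in Workbench notebook environments.

```bash
# tasks.tsv (tab-separated, hypothetical paths): dsub launches one VM per data row.
# The header row maps each column to a dsub flag and an environment variable.
#
#   --input INPUT_VCF	--output OUTPUT_STATS
#   gs://my-bucket/chr1.vcf.gz	gs://my-bucket/out/chr1.stats.txt
#   gs://my-bucket/chr2.vcf.gz	gs://my-bucket/out/chr2.stats.txt

dsub \
  --provider google-cls-v2 \
  --project "${GOOGLE_PROJECT}" \
  --regions us-central1 \
  --logging "${WORKSPACE_BUCKET}/dsub-logs" \
  --image "us.gcr.io/broad-gatk/gatk:4.2.5.0" \
  --tasks tasks.tsv \
  --command 'gatk CountVariants -V "${INPUT_VCF}" > "${OUTPUT_STATS}"'
```

Each task downloads its `INPUT_VCF`, runs the command, and uploads whatever was written to the `OUTPUT_STATS` path back to the bucket.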
Batch Processing Limitations
Internet Access Restriction
Within the Researcher Workbench, internet access from batch VMs is restricted. With the exception of Google APIs, batch VMs are unable to send or receive network traffic; this includes downloading files, calling external APIs, and installing packages or code.
If you need to load additional data, tools, or scripts onto your batch VMs, first upload them to an accessible Cloud Storage location (including your workspace bucket). You can then download these files onto your batch VM using dsub's data localization or by running gsutil from within your batch process script.
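For example, a batch script can stage its own dependencies at startup. A minimal sketch, assuming you have already uploaded a helper script to your workspace bucket (the path and file names are hypothetical) and that `WORKSPACE_BUCKET` and `INPUT_VCF` are provided to the task, e.g., via dsub's `--env` and `--input` flags:

```bash
#!/bin/bash
set -euo pipefail

# Download a previously uploaded helper script from the workspace bucket.
# Cloud Storage is reachable from batch VMs even though general internet access is blocked.
gsutil cp "${WORKSPACE_BUCKET}/scripts/my_qc.py" /tmp/my_qc.py

# INPUT_VCF points at a file dsub already localized onto this VM via --input.
python3 /tmp/my_qc.py "${INPUT_VCF}"
```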
Only Google Container Registry is supported for docker images
Due to the internet access restriction on Workbench batch VMs, standard docker registries such as Docker Hub will not be accessible. It is instead recommended to configure all tasks to run public docker images from Google Container Registry (GCR). Typically, GCR docker URLs start with `us.gcr.io/` (e.g., "us.gcr.io/broad-gatk/gatk:4.2.5.0" for the GATK 4.2.5.0 docker image in GCR), as opposed to a bare repository name (e.g., "broadinstitute/gatk:4.2.5.0" for GATK 4.2.5.0 on Docker Hub).
When using third-party workflows in the Researcher Workbench, researchers must check that each task uses a GCR image. In WDL, this is the docker attribute in the runtime block of each task. In Nextflow, this is the process.container setting in the configuration file. The sketch below shows one quick way to check.
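For example, a quick audit of a third-party workflow before running it (the file names are hypothetical):

```bash
# List every container image referenced in a WDL workflow's runtime blocks.
grep -n 'docker' my_workflow.wdl

# List every container image referenced in a Nextflow configuration.
grep -n 'container' nextflow.config

# Any image that does not point at a gcr.io host (e.g., us.gcr.io/...) will need
# to be replaced with a GCR equivalent before the workflow can run.
```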
If you have a specific image that you want to use and cannot find it in GCR, please reach out to our support team at support@researchallofus.org.
Note: images in GCP Artifact Registry are not currently available.
Docker container images in GCR
A docker image is a lightweight, standalone, executable package of software that includes everything needed to run an application: code, runtime, system tools, system libraries, and settings. Below is a list of tools and the associated public docker images in GCR that will work with batch processes:
- All public GCR images: https://console.cloud.google.com/gcr/images/google-containers/GLOBAL
- All of Us public GCR images: https://console.cloud.google.com/gcr/images/broad-dsp-gcr-public/us/terra-jupyter-aou
- GATK: us.gcr.io/broad-gatk/gatk:{version} (e.g., us.gcr.io/broad-gatk/gatk:4.2.5.0)
  - The canonical image for GATK
- Tools available in Researcher Workbench (RW) jupyter notebooks: us.gcr.io/broad-dsp-gcr-public/terra-jupyter-aou:{version}
  - You can find vcftools, PLINK, and other tools here. A complete list can be found in the image's Dockerfile: https://github.com/DataBiosphere/terra-docker/blob/master/terra-jupyter-aou/Dockerfile
  - This image only functions in Cromwell and dsub; Nextflow cannot run tasks with this docker image. If you need to run tools from the RW analysis notebooks in Nextflow, we recommend identifying a docker container for each tool. We are investigating the possibility of making this image compatible with Nextflow. A dsub example using this image is sketched below.
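As an illustration, here is a hedged sketch of running PLINK from the terra-jupyter-aou image via dsub. The image tag, bucket paths, and file names are hypothetical placeholders (check the Dockerfile link above for available tools and pick a current tag), and `GOOGLE_PROJECT` and `WORKSPACE_BUCKET` are assumed to be set as in Workbench notebook environments.

```bash
# Substitute a current terra-jupyter-aou tag; this one is a placeholder.
AOU_IMAGE_TAG="2.1.2"

dsub \
  --provider google-cls-v2 \
  --project "${GOOGLE_PROJECT}" \
  --regions us-central1 \
  --logging "${WORKSPACE_BUCKET}/dsub-logs" \
  --image "us.gcr.io/broad-dsp-gcr-public/terra-jupyter-aou:${AOU_IMAGE_TAG}" \
  --input BED="${WORKSPACE_BUCKET}/data/study.bed" \
          BIM="${WORKSPACE_BUCKET}/data/study.bim" \
          FAM="${WORKSPACE_BUCKET}/data/study.fam" \
  --output FREQ="${WORKSPACE_BUCKET}/out/study.frq" \
  --command 'plink --bed "${BED}" --bim "${BIM}" --fam "${FAM}" --freq --out "${FREQ%.frq}"'
```

plink writes its allele-frequency report to `<out>.frq`, which here lines up with the `FREQ` output path that dsub uploads back to the workspace bucket.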
Hail in workflows is unsupported
Hail can be used in a limited capacity if the Hail code can be run on a single node (i.e., no Dataproc cluster is needed). If your analysis needs more than a single node, we do not recommend using Hail in workflows at this time. Hail in the interactive notebook environment supports the use of Dataproc clusters, but the support in AoU workflows is rudimentary.
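For instance, a workflow task could run single-node Hail along these lines. This is a minimal sketch; it assumes the task's docker image has Hail installed, and the matrix table path is a stand-in for data you have localized onto the VM.

```bash
python3 - <<'EOF'
import hail as hl

# 'local[*]' runs Spark on this VM's cores only; no Dataproc cluster is involved.
hl.init(master='local[*]')

mt = hl.read_matrix_table('/mnt/data/subset.mt')  # hypothetical localized path
print(mt.count())  # (rows, cols) as a quick sanity check
EOF
```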
Compute cost must be estimated manually
There is no instrumentation in the Researcher Workbench to track workflow cost. To determine the compute cost of a workflow, researchers will need to estimate the cost of the VMs and the time spent on each VM. Please note that storage cost is separate, and the RW uses standard storage for workspace buckets.
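As a rough sketch, compute cost can be approximated as hourly VM rate x hours per task x number of tasks. The rate below is illustrative only, not a current price; check GCP pricing for your machine type.

```bash
RATE_PER_HOUR=0.10   # illustrative hourly rate for your chosen machine type
HOURS_PER_TASK=2     # estimated wall-clock runtime per task
NUM_TASKS=100        # number of tasks the batch job launches

# Approximate compute cost in USD: 0.10 * 2 * 100 = 20.00
echo "scale=2; ${RATE_PER_HOUR} * ${HOURS_PER_TASK} * ${NUM_TASKS}" | bc
```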