Overview of Batch Processing


Introduction to batch processing in the All of Us Researcher Workbench

With the launch of genomic data in the All of Us Researcher Workbench, the amount and complexity of the data now available provide the opportunity for larger-scale genomic research, both in breadth (types of analyses) and in scale (eg, number of samples). The Researcher Workbench now includes pre-alpha support for running batch processes, allowing you to automate and parallelize repetitive steps in your analyses.

We believe researchers will have other downstream use cases, such as filtering, format conversions (eg, PLINK bed to GENESIS Genomic Data Structure (GDS)), and/or custom QC. Since these are bulk processing steps that vary between researchers, we recommend using batch processes to automate this work. Doing so helps you avoid repeating manual steps and apply the same processing to different sets of data. If you have a task or script that you want to run many times, or that needs to process a large number of files (eg, all of the VCF files in the WGS joint callset), you may benefit from using a batch process in lieu of an analysis notebook.

Note: we suggest using the us-central1 region when launching Google Cloud Life Sciences API batch jobs, because the CDR bucket and your workspace buckets live in us-central1. If you launch jobs in other regions, you will incur network egress charges.
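
For illustration, a minimal dsub submission pinned to us-central1 might look like the sketch below. The project ID, bucket paths, input file, and command are placeholders rather than values from the Workbench; the GATK image shown is the GCR example used later in this article.

  # Minimal dsub sketch; values in angle brackets are placeholders.
  # --regions us-central1 keeps the batch VM in the same region as the CDR
  # and workspace buckets, avoiding network egress charges.
  dsub \
    --provider google-cls-v2 \
    --project "<your-google-project>" \
    --regions us-central1 \
    --logging "gs://<your-workspace-bucket>/dsub-logs/" \
    --image "us.gcr.io/broad-gatk/gatk:4.2.5.0" \
    --input INPUT_VCF="gs://<your-workspace-bucket>/data/sample.vcf.gz" \
    --output OUTPUT_STATS="gs://<your-workspace-bucket>/results/sample.stats.txt" \
    --command 'gatk CountVariants -V "${INPUT_VCF}" > "${OUTPUT_STATS}"' \
    --wait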

Three tools have been added to the Researcher Workbench to support batch processing: 

  1. Cromwell + WDL
  2. Nextflow
  3. dsub

Which batch processing tool should you use?

Batch Processing Limitations

Internet Access Restriction

Within the Researcher Workbench, internet access from batch VMs is restricted. With the exception of Google APIs, batch VMs are unable to send or receive network traffic, including downloading files, calling external APIs, or installing packages and code.

If you need additional data, tools, or scripts on your batch VMs, first upload them to an accessible Cloud Storage location (including your workspace bucket). You can then download the files onto the batch VM using dsub’s data localization or by running gsutil from within your batch process script.
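
For example, a batch script might stage an analysis script and a reference file from your workspace bucket before the main work begins; the bucket URI and file names below are placeholders, not paths that exist in your workspace.

  # Sketch of the first lines of a batch script (paths are placeholders).
  # Copy supporting files from an accessible Cloud Storage location, such as
  # your workspace bucket, onto the batch VM before the analysis starts.
  WORKSPACE_BUCKET="gs://<your-workspace-bucket>"   # replace with your bucket URI
  gsutil cp "${WORKSPACE_BUCKET}/scripts/my_analysis.py" .
  gsutil cp "${WORKSPACE_BUCKET}/reference/regions.bed" .
  # ... run the analysis using the localized files ...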

 

Only Google Container Registry is supported for Docker images

Due to the internet access restriction on Workbench batch VMs, standard Docker registries such as Docker Hub will not be accessible. We instead recommend configuring all tasks to run public Docker images hosted in Google Container Registry (GCR). GCR image URLs typically start with `us.gcr.io/` (eg, "us.gcr.io/broad-gatk/gatk:4.2.5.0" for the GATK 4.2.5.0 image in GCR), as opposed to a bare repository name (eg, "broadinstitute/gatk:4.2.5.0" for the same image on Docker Hub).

When using third-party workflows in the Researcher Workbench, researchers must check that every task uses a GCR image. In WDL, the image is specified by the docker attribute in each task’s runtime block; in Nextflow, it is specified by the process.container setting in the configuration file.
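
One lightweight way to audit a third-party workflow before running it is to search its files for image references; the file names below are placeholders for whatever workflow you are importing.

  # In WDL, each task's runtime block should reference a GCR image, eg:
  #   docker: "us.gcr.io/broad-gatk/gatk:4.2.5.0"
  grep -n 'docker' workflow.wdl
  # In Nextflow, the image is set in the configuration file, eg:
  #   process.container = 'us.gcr.io/broad-gatk/gatk:4.2.5.0'
  grep -n 'container' nextflow.config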

If there is a specific image that you want to use and you cannot find it in GCR, please reach out to our support team at support@researchallofus.org.

Note: images in GCP Artifact Registry are not currently available.

 

Docker container images in GCR

A Docker image is a lightweight, standalone, executable package of software that includes everything needed to run an application: code, runtime, system tools, system libraries, and settings. Below is a list of tools and the associated public Docker images in GCR that will work with batch processes:

 

Hail in workflows is unsupported

Hail can be used in a limited capacity if the Hail code can run on a single node (ie, no Dataproc cluster is needed). If your analysis requires more than a single node, we do not recommend using Hail in workflows at this time. Hail in the interactive notebook environment supports Dataproc clusters, but that support in All of Us workflows is rudimentary.

 

Compute cost must be estimated manually

There is no instrumentation in the Researcher Workbench to track workflow cost. To determine the compute cost of a workflow, researchers will need to estimate the cost of the VMs used and the time spent on each VM. Please note that storage is billed separately; the Researcher Workbench uses standard storage for workspace buckets.
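
As a rough sketch of the arithmetic (the machine-type rate below is a placeholder, not a Workbench price): estimated compute cost ≈ (number of VMs) × (hours each VM runs) × (hourly rate of the machine type), plus the cost of any persistent disk attached to each VM. For example, ten VMs of a type billed at roughly $0.20 per hour, each running for two hours, would cost on the order of 10 × 2 × $0.20 = $4.00 in compute, before disk and storage. Current machine-type and disk rates are listed on the Google Cloud pricing pages.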
