Use dsub in the All of Us Researcher Workbench

This document describes dsub support in the All of Us Researcher Workbench and is intended for users with Controlled Tier access who are processing large volumes of data. It assumes basic familiarity with Google Cloud, including buckets, and with Docker.

This document focuses on using dsub for genomic data, but the same approach applies to any data accessible in the Researcher Workbench.

Note: we suggest using the us-central1 region when launching Google Cloud Life Sciences API batch jobs, because the CDR bucket and your workspace buckets live in us-central1. If you launch jobs in other regions, you will incur network egress charges.
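
As a minimal sketch (the flags are standard dsub options; the logging path and the Workbench-provided environment variables GOOGLE_PROJECT and WORKSPACE_BUCKET are assumptions about your session), pinning a job to us-central1 looks like this:

```bash
# Keep compute in the same region as the CDR and workspace buckets
# to avoid cross-region network egress charges.
dsub \
  --provider google-cls-v2 \
  --project "${GOOGLE_PROJECT}" \
  --regions us-central1 \
  --logging "${WORKSPACE_BUCKET}/dsub/logs/" \
  --command 'echo "Hello from us-central1"'
```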

Introduction 

dsub is a command-line tool that makes it easy to submit and run batch scripts in the cloud. With dsub, you write a shell script and then submit it to a job scheduler from Jupyter. Unlike Cromwell/WDL and Nextflow, dsub does not use a domain-specific language (DSL). dsub supports Google Cloud as the backend batch job runner. Refer to the dsub documentation for additional guidance on writing dsub commands and more example dsub scripts.
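
For example, here is a minimal sketch of the write-a-script-then-submit pattern (the script name and bucket paths are placeholders, not Workbench requirements):

```bash
# An ordinary shell script; dsub stages the file named by INPUT_VCF in
# and uploads the file written to COUNTS when the task finishes.
cat > count_lines.sh <<'EOF'
#!/bin/bash
set -o errexit
zcat "${INPUT_VCF}" | wc -l > "${COUNTS}"
EOF

dsub \
  --provider google-cls-v2 \
  --project "${GOOGLE_PROJECT}" \
  --regions us-central1 \
  --logging "${WORKSPACE_BUCKET}/dsub/logs/" \
  --input INPUT_VCF="${WORKSPACE_BUCKET}/data/example.vcf.gz" \
  --output COUNTS="${WORKSPACE_BUCKET}/results/line_count.txt" \
  --script count_lines.sh \
  --wait
```

The `--input` and `--output` names become environment variables inside the task, so the script itself never needs to know the bucket paths.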

Refer to the dsub Tutorial Notebook for more guidance on getting started with dsub. The Notebook walks through:

  • Setting up dsub within the Researcher Workbench (a minimal install sketch follows this list).
  • Applying best practices with dsub and debugging dsub workflows.
  • Extracting sample IDs from a VCF file (from Alpha 3) with dsub, using a bash script from the DataBiosphere repository.
  • Accessing PLINK files (from Alpha 3) in parallel and counting the number of lines in each.
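
As a minimal setup sketch (the Tutorial Notebook covers this in full), dsub is distributed on PyPI and can be installed directly into the notebook environment:

```bash
# Install the dsub CLI and confirm it is available on the PATH.
pip install --upgrade dsub
dsub --version
```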

Within the Researcher Workbench, the Google Cloud Life Sciences API executes dsub tasks; this is the `google-cls-v2` provider used in the examples above.

If you have any feedback or questions on using dsub, reach out to support@researchallofus.org.

Suggestions for running dsub tasks

With dsub, you can check the status of a job at any time by running the dstat command. This works both in a Notebook and in a Terminal session.
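
For example (a sketch; the job ID is a placeholder, and GOOGLE_PROJECT is assumed to be set in your session):

```bash
# Summarize all of your jobs in this project, whatever their status.
dstat \
  --provider google-cls-v2 \
  --project "${GOOGLE_PROJECT}" \
  --users "${USER}" \
  --status '*'

# Print the full record for one job, including events and error messages.
dstat \
  --provider google-cls-v2 \
  --project "${GOOGLE_PROJECT}" \
  --jobs 'my-job-id' \
  --full
```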

See https://github.com/DataBiosphere/dsub/blob/main/docs/troubleshooting.md for more examples.

Potential Limitations

Public Docker images from Google Container Registry (GCR) are the only images officially supported for tasks, but Docker Hub images may also work in Researcher Workbench workflows. GCR image URLs typically start with `us.gcr.io/` (e.g., `us.gcr.io/broad-gatk/gatk:4.2.5.0` for the GATK 4.2.5.0 image in GCR), as opposed to a bare repository name (e.g., `broadinstitute/gatk:4.2.5.0` for the same image on Docker Hub).
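
For instance, here is a sketch of pointing a task at the GCR GATK image mentioned above (bucket paths and environment variables are placeholders):

```bash
# dsub pulls the named image onto the worker VM before running the command.
dsub \
  --provider google-cls-v2 \
  --project "${GOOGLE_PROJECT}" \
  --regions us-central1 \
  --logging "${WORKSPACE_BUCKET}/dsub/logs/" \
  --image "us.gcr.io/broad-gatk/gatk:4.2.5.0" \
  --command 'gatk --version'
```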

If there is a specific image that you want to use and you cannot find it in GCR, please reach out to our support team at support@researchallofus.org.

For a full list of limitations applicable to all batch processing, refer to Overview of Batch Processing.

Docker container images in GCR

A Docker container image is a lightweight, standalone, executable package of software that includes everything needed to run an application: code, runtime, system tools, system libraries, and settings. Many common tools are available as public Docker images in GCR that work with dsub.
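
As a quick sanity check that such an image is publicly accessible (a sketch using the GATK image from the example above; run it wherever Docker is available):

```bash
# Confirm the image exists and can be pulled anonymously before
# referencing it in a dsub --image flag.
docker pull us.gcr.io/broad-gatk/gatk:4.2.5.0
```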
