Note: We still support using Cromwell as described below, but if you plan to use Cromwell we suggest the Cromwell environments available within the Researcher Workbench, as described in this article.
This document describes workflow support in the All of Us Researcher Workbench. It explains what workflows are and when to use them, and is intended to help users with Controlled Tier access who are processing large volumes of data. This document assumes basic familiarity with Google Cloud, including Cloud Storage, and with Docker.
This document focuses on workflows for genomic data, but workflows are applicable to processing any data that is accessible in the Researcher Workbench (RW).
Note: We suggest using the us-central1 region when launching Google Cloud Life Sciences API batch jobs, because our CDR bucket and your buckets live in us-central1. If you launch jobs in other regions, you will incur network egress charges.
Introduction
The Researcher Workbench supports two workflow engines, Nextflow (version 21.03.0-edge and above) and Cromwell/WDL (version 76 and above). To operate either engine effectively, you need to be familiar with Google Cloud Storage, virtual machines (VMs), the Google Cloud cost model for both compute and storage, the command-line interface (CLI) in the RW, and Docker.
Within the Researcher Workbench, the Google Cloud Life Sciences API is the executor of Cromwell and Nextflow workflows.
For an overview of batch processing in the Researcher Workbench (as well as limitations relevant to all batch processing), refer to Overview of Batch Processing.
What are workflows?
Workflows consist of multiple processing steps that are performed by an external compute engine. Building a workflow typically involves defining analysis tasks, chaining them together, and parallelizing their execution. In the Researcher Workbench, each task runs in a Docker container on its own virtual machine (VM), which starts when the task starts and shuts down when the task completes. Files are read from, and written to, cloud buckets. Before a task starts, input files are copied to the VM (“localization”), and when a task finishes, output files are copied to a destination bucket (“delocalization”). Each task can have its own runtime characteristics, which describe the VM specs (eg, RAM, number of CPUs) and the Docker image.
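The lifecycle of a single task described above can be sketched in a few lines of Python. This is a conceptual illustration only, not how Cromwell or Nextflow are implemented: a local directory stands in for a cloud bucket, a temporary directory stands in for the task VM's disk, and the names run_task and count_lines are hypothetical.

```python
import shutil
import tempfile
from pathlib import Path


def run_task(bucket: Path, input_name: str, output_name: str, command) -> Path:
    """Mimic one workflow task: localize, run, delocalize.

    `bucket` is a local directory standing in for a cloud bucket (an
    assumption for illustration; a real engine copies to and from gs:// URLs).
    """
    with tempfile.TemporaryDirectory() as vm_disk:  # stand-in for the task VM's disk
        local_in = Path(vm_disk) / input_name
        local_out = Path(vm_disk) / output_name
        shutil.copy(bucket / input_name, local_in)  # localization
        command(local_in, local_out)                # the task itself
        destination = bucket / "outputs" / output_name
        destination.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy(local_out, destination)         # delocalization
    # The "VM" disk is gone at this point; only the delocalized output remains.
    return destination


def count_lines(src: Path, dst: Path) -> None:
    """Hypothetical task body: write the number of lines in src to dst."""
    dst.write_text(str(len(src.read_text().splitlines())))


bucket = Path(tempfile.mkdtemp())
(bucket / "sample.vcf").write_text("#header\nrec1\nrec2\n")
result = run_task(bucket, "sample.vcf", "line_count.txt", count_lines)
print(result.read_text())  # prints 3
```

The point of the sketch is that the task body only ever sees local files; the copying in and out is handled around it, which is the work the workflow engine takes off your hands.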
Important features of workflows
Workflow engines provide automation for manual tasks that do not scale. For example, the workflow engines detailed in this document do not require users to manually copy files from cloud buckets, as required by many analysis software packages (eg, PLINK, vcftools).
Feature | Cromwell? | Nextflow? | Why is this important? | Notes |
Localization of cloud files | Yes | Yes | Users do not have to manage the file copying when running tasks. For example, the file inputs for a task can be cloud URLs, since the workflow engine will automatically copy the file locally to the tool. | Cromwell also supports optional localization when the tool itself can read files directly from a cloud location (eg, Genome Analysis ToolKit (GATK)) |
Automated parallelization | Yes | Yes | The workflow engine will automatically figure out which tasks can be run in parallel. Users need only map the inputs and outputs. | |
Metadata | Yes | Yes | Users can track the status of each task in a workflow, even after it completes. | |
Output writes to a bucket | Yes | Yes | Workflow outputs are preserved even after the cloud environment is deleted. | This cannot be disabled. |
Optional workflow and task inputs | Yes | Yes | Allows for default values, including files. | |
Separate output bucket for successfully completed workflows and failed workflows | Yes | No | Keeps outputs from failed workflows separate, which makes cleanup easier. | Cromwell can automatically write the outputs of failed workflows to a different directory from those of successful ones. Nextflow writes outputs to buckets for successfully completed workflows, but cannot separate failed and successful workflows. |
Re-entrancy | Yes* | Yes | The workflow engine will not rerun successful tasks when a downstream task fails or is changed. | Currently, call-caching is disabled for Cromwell in the Researcher Workbench (a file-based DB is required and this is not yet enabled). Checkpointing is enabled for Cromwell. Call-caching and checkpointing can be enabled for Nextflow. |
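To make the "Automated parallelization" row more concrete, here is a minimal Python sketch of how an engine can work out what may run in parallel purely from each task's declared inputs and outputs. The task names and file names are hypothetical, and real engines typically start each task as soon as its own inputs are ready rather than in the strict waves shown here.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical tasks: name -> (input files, output files).
tasks = {
    "align_a": ({"a.fastq"}, {"a.bam"}),
    "align_b": ({"b.fastq"}, {"b.bam"}),
    "call_a": ({"a.bam"}, {"a.vcf"}),
    "call_b": ({"b.bam"}, {"b.vcf"}),
    "merge": ({"a.vcf", "b.vcf"}, {"merged.vcf"}),
}


def parallel_waves(tasks):
    """Group tasks into waves; every task within a wave can run in parallel."""
    # Map each output file to the task that produces it.
    producer = {out: name for name, (_, outs) in tasks.items() for out in outs}
    # A task depends on whichever tasks produce its inputs.
    deps = {
        name: {producer[i] for i in ins if i in producer}
        for name, (ins, _) in tasks.items()
    }
    sorter = TopologicalSorter(deps)
    sorter.prepare()
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # tasks whose inputs all exist now
        waves.append(ready)
        sorter.done(*ready)
    return waves


for wave in parallel_waves(tasks):
    print(wave)
```

Nothing in the task definitions says "run these two together"; the grouping falls out of the input and output mapping, which is why users need only declare that mapping.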
Picking a workflow engine
If you decide to run a workflow within the Researcher Workbench, you can use either Cromwell or Nextflow to execute it. We recommend the following criteria to determine which workflow engine to use:
- If you already have a pipeline that uses Cromwell or Nextflow, we recommend starting with that pipeline. We also recommend searching for existing pipelines that do what you want or are close enough that you can modify them to fit your use case.
- If equivalent pipelines exist for both Nextflow and Cromwell, we recommend choosing based on your comfort level with each engine and on what others in your institution use.
Workflow options in the Researcher Workbench
With the initial launch of workflow support in the RW, two workflow engines are available via Jupyter Notebook or Terminal: Cromwell and Nextflow. Please note that these workflow engines are not compatible with each other.
Cromwell + WDL
Cromwell is a workflow management system geared towards scientific workflows. Documentation for Cromwell can be found at the Cromwell wiki. Cromwell executes scripts written in Workflow Description Language (WDL), a community-driven domain-specific language (DSL) designed for data-intensive workflows. WDL allows users to define tasks, including scripts written in bash, and specify connections between tasks.
Refer to the Cromwell Tutorial Notebook for more guidance on getting started with Cromwell. This notebook will walk through:
- Setting up Cromwell within the RW
- Using GATK to validate variant call format (VCF) files
For information on the structure of WDL, please see the WDL documentation website.
Nextflow
Nextflow is a workflow engine that uses a DSL (Groovy with workflow-specific extensions). Processes describe tasks to be run; they can be written in any scripting language supported by Linux, and Nextflow runs one task for each set of inputs. Channels manipulate the flow of data from one process to the next. Workflows define the interaction between processes and channels. Documentation for Nextflow can be found at the Nextflow wiki.
Refer to the Nextflow Tutorial Notebook for more guidance on getting started with Nextflow. This notebook will walk through:
- Setting up Nextflow within the RW
- Using GATK to validate variant call format (VCF) files
Workflow Limitations
For a full list of limitations applicable to all batch processing, refer to Overview of Batch Processing.
Manual cleanup of runs
In the default configuration, both Cromwell and Nextflow keep intermediate files from workflow runs. While these files can be useful for debugging or as output, every workflow run generates them, even if it fails. This increases storage costs for data that may not be useful. We recommend periodically deleting the execution buckets of failed or superseded workflow runs once they are no longer useful.
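One cautious way to approach this cleanup from a notebook is to build the delete commands first and review them before running anything. The sketch below assumes that the WORKSPACE_BUCKET environment variable holds your workspace bucket URL; the fallback bucket name, the workflow-runs/run-N layout, and the cleanup_commands helper are hypothetical. It only prints gsutil commands and does not delete anything.

```python
import os
import shlex

# Assumption: WORKSPACE_BUCKET holds your workspace bucket URL. The fallback
# bucket name and the run directory layout below are hypothetical.
bucket = os.environ.get("WORKSPACE_BUCKET", "gs://example-workspace-bucket")


def cleanup_commands(run_dirs, keep):
    """Build (but do not run) one gsutil delete command per run directory not in `keep`."""
    kept = {d.rstrip("/") for d in keep}
    return [
        f"gsutil -m rm -r {shlex.quote(d.rstrip('/'))}"
        for d in run_dirs
        if d.rstrip("/") not in kept
    ]


runs = [f"{bucket}/workflow-runs/run-{n}" for n in (1, 2, 3)]
for command in cleanup_commands(runs, keep=[runs[2]]):
    print(command)  # review each command, then run it in the Terminal
```

Keeping the deletion as a separate, manual step is deliberate: removing a run directory is irreversible, so it is worth checking the printed paths against the runs you actually want to discard.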
Cromwell Specific Limitations
Cromwell within the Researcher Workbench does not support full Cromwell-as-a-service functionality. Note that jobs are not tracked for your Cromwell workflow as they would be when running a dedicated Cromwell server.
Nextflow Specific Limitations
Nextflow recommends only launching one Nextflow instance in a single directory at a time. More details are available here: Demystifying Nextflow resume
Suggestions for running workflows
Managing timeouts when running a workflow
By default, the RW pauses after 30 minutes of inactivity. You can change the auto pause setting in the Cloud Analysis Environment panel to extend that timeout period, which we recommend doing while running genomic workflows.
Use screen with Cromwell
You can use the screen command to run Cromwell in the background. This allows Cromwell to keep running if the notebook times out. Note that deleting the cloud environment will end the Cromwell process. Here are instructions via Terminal:
- screen -S cromwell
- Start Cromwell
- Detach: while holding Control, press A and then D
- Do other things in the terminal
- screen -ls - shows running sessions
- screen -r cromwell - reconnect to Cromwell
- Control+D - exit the session when Cromwell completes
Use screen with Nextflow
You can use the screen command to run Nextflow in the background. This allows Nextflow to keep running if the notebook times out. Note that deleting the cloud environment will end the Nextflow process. Here are instructions via Terminal:
- screen -S Nextflow
- Start Nextflow
- Detach: while holding Control, press A and then D
- Do other things in the terminal
- screen -ls - shows running sessions
- screen -r Nextflow - reconnect to Nextflow
- Control+D - exit the session when Nextflow completes
FAQ
Q: Do all tasks require a Docker image?
Yes. If the Docker image is left blank, the task will run using the default image. Please see the next question for where the defaults are documented.
Q: If I do not specify a worker configuration for a task, what is the default?
- Cromwell: refer to https://cromwell.readthedocs.io/en/stable/RuntimeAttributes/
- Nextflow: refer to Google Cloud — Nextflow 21.10.0 documentation