The All of Us Researcher Workbench (RW) added Cromwell for workflow management. Using a workflow manager can be helpful when running complex workflows; the workflow manager will submit and manage jobs for you, coordinating the interdependencies between tasks. Cromwell runs workflows written in the Workflow Description Language (WDL) and was specifically designed for scientific workflows, though it can run any type of workflow.
In the RW, you can submit workflows to Cromwell through Cromshell, a command line tool for interacting with Cromwell. Cromshell can be run in a Jupyter environment, either through a Jupyter notebook or a Jupyter terminal.
Using Cromwell will make it easier to manage complex workflows in the RW, especially when dealing with large-scale data such as the genomics data available through the All of Us Research Program. See this video to learn more about Cromwell in the Researcher Workbench.
Note: we suggest using the us-central1 region when launching Google Life Sciences API batch jobs because our CDR bucket and your buckets live in us-central1. If you launch API jobs in other regions, you will incur network egress charges.
Create the Cromwell and Jupyter Environments
Cromwell runs as an application within a workspace. To get started, open an existing workspace or create a new one. Once you are within a workspace, select the Cromwell icon (a pink pig) on the right hand menu. The Cromwell Cloud Environment information screen will open; select ‘Start’ at the bottom right of the screen.
Once your Cromwell Cloud Environment is created, you will access Cromshell through a Jupyter terminal or notebook. While your Cromwell Cloud Environment is being created, create a Jupyter environment in the same workspace. In the same right hand menu, select the Jupyter icon. The default Jupyter configuration is sufficient for this step, so you can select ‘Start’.
You can check the status of the Cromwell and Jupyter environments by selecting the cloud icon (a cloud and lightning bolt) on the same right hand menu. This brings up a summary of your active applications.
Run snippets to link the Cromwell and Jupyter environments
Once you have created the Cromwell environment and Jupyter environment, you can create a new Jupyter notebook where you can submit your workflow with Cromwell. In the top menu, select ‘Snippets’ and then select the ‘All of Us Cromwell Setup Python snippets’. This snippet will set up the network connection between the Jupyter and Cromwell environments. In addition, it will run a status check to verify that Cromwell and Jupyter are set up correctly.
Run this snippet in the Jupyter notebook. Once it has been run, you are ready to run workflows managed by Cromwell. You can submit workflows and interact with Cromwell through the command line tool Cromshell in a Jupyter terminal or notebook. To open a Jupyter terminal, launch a new terminal from the right hand menu.
If you see any problems with your submissions, you should try re-running the ‘All of Us Cromwell Setup Python snippets’. If the snippet fails, you can try fixing the problem by restarting or recreating (delete and create) Cromwell.
You can submit a workflow to Cromwell using the following Cromshell command: cromshell-alpha submit workflow.wdl parameters.json. Cromshell submits the workflow written in WDL to Cromwell along with the configuration options in the JSON file.
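If you prefer to submit from a notebook cell rather than a terminal, the command above can be wrapped in a small Python helper. This is a minimal sketch, not an official API: the function names are hypothetical, and it assumes only that `cromshell-alpha` is on the PATH of your Jupyter environment and accepts the `submit <wdl> <json>` form shown above.

```python
import shlex
import subprocess

CROMSHELL = "cromshell-alpha"  # command name used on the RW, per this article


def build_submit_command(wdl_path, inputs_path):
    """Return the argument list for submitting a WDL with its inputs JSON."""
    return [CROMSHELL, "submit", wdl_path, inputs_path]


def submit_workflow(wdl_path, inputs_path):
    """Run the submit command and return everything Cromshell printed.

    stderr is merged into stdout so that no part of the output is lost,
    whichever stream the tool writes to.
    """
    command = build_submit_command(wdl_path, inputs_path)
    print("Running:", shlex.join(command))
    result = subprocess.run(
        command,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        text=True,
        check=True,  # raise CalledProcessError if the submission fails
    )
    return result.stdout
```

Calling `submit_workflow("workflow.wdl", "parameters.json")` runs the same command as the one shown above and returns its output as a string.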
After you submit a workflow, you can find the submission ID at the bottom of the output.
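If you captured the submit output as text, the ID can be pulled out programmatically. The sketch below assumes the submission ID is a Cromwell workflow ID, which is a UUID, and that the last UUID in the output is the one you want. The exact layout of the Cromshell output is not assumed, and the function name is hypothetical.

```python
import re

# Cromwell workflow IDs are UUIDs, e.g. 8-4-4-4-12 hexadecimal digits.
UUID_PATTERN = re.compile(
    r"[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}",
    re.IGNORECASE,
)


def extract_submission_id(submit_output):
    """Return the last UUID found in the submit output, or None if absent."""
    matches = UUID_PATTERN.findall(submit_output)
    return matches[-1] if matches else None
```

Returning `None` rather than raising lets the caller decide how to handle output that contains no ID, for example when the submission itself failed.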
You can check the status of the workflow with the following Cromshell command: cromshell-alpha status <submissionID>.
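For long-running workflows you may want to poll the status instead of re-running the command by hand. The following is a sketch under stated assumptions: the helper names are hypothetical, the status names are the standard Cromwell workflow states, and `fetch_status` simply searches the command output for one of those names rather than relying on a particular output format.

```python
import subprocess
import time

# Standard Cromwell workflow states; terminal states end the polling loop.
TERMINAL_STATUSES = ("Succeeded", "Failed", "Aborted")
ACTIVE_STATUSES = ("Submitted", "Running", "Aborting", "On Hold")


def parse_status(status_output):
    """Return the first known status name found in the text, else None."""
    for status in TERMINAL_STATUSES + ACTIVE_STATUSES:
        if status in status_output:
            return status
    return None


def fetch_status(submission_id):
    """Run `cromshell-alpha status <id>` and parse a status from its output."""
    result = subprocess.run(
        ["cromshell-alpha", "status", submission_id],
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        text=True,
    )
    return parse_status(result.stdout)


def wait_for_workflow(submission_id, get_status=fetch_status,
                      interval_seconds=60, max_checks=120, sleep=time.sleep):
    """Poll until the workflow reaches a terminal state or checks run out.

    Returns the last status seen (which may be non-terminal or None).
    """
    status = None
    for _ in range(max_checks):
        status = get_status(submission_id)
        if status in TERMINAL_STATUSES:
            return status
        sleep(interval_seconds)
    return status
```

Keep in mind that the Cromwell environment is billed while it runs, so a bounded `max_checks` is safer than an open-ended loop.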
You can abort a workflow with the following command: cromshell-alpha abort <submissionID>
Refer to https://github.com/broadinstitute/cromshell for additional commands. Please note that you need to use cromshell-alpha as the command to call Cromshell in a command line on the RW.
WDL File Configuration
When configuring your WDL file, you can use the following docker image:
Saving a WDL or JSON file
There are multiple options for saving WDL and JSON files in your Jupyter environment:
1. In a Jupyter notebook, use the %%writefile <filename> command followed by the WDL or JSON file
2. From a Jupyter notebook, select the Jupyter icon in the upper left corner to access the file browser. You can use the Jupyter file browser to upload existing JSON or WDL files. The root folder here is available at /home/jupyter.
3. In a Jupyter terminal, use vim <filename> to create (and edit) a file
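As a fourth, scripted option, both files can be written from a Python cell. The sketch below is illustrative only: the `hello` workflow, the `say_hello` task, the input value, and the file names are made up for the example, and the docker value is a placeholder that you must replace with a real GCR-hosted image before submitting.

```python
import json
from pathlib import Path

# A minimal WDL 1.0 workflow with one task. The docker value is a placeholder.
HELLO_WDL = """\
version 1.0

workflow hello {
  input {
    String name
  }
  call say_hello { input: name = name }
  output {
    String greeting = say_hello.greeting
  }
}

task say_hello {
  input {
    String name
  }
  command <<<
    echo "Hello, ~{name}"
  >>>
  output {
    String greeting = read_string(stdout())
  }
  runtime {
    # Placeholder: substitute a GCR-hosted image that your task needs.
    docker: "us.gcr.io/<project>/<image>:<tag>"
  }
}
"""


def write_workflow_files(directory="."):
    """Write workflow.wdl and parameters.json into the given directory."""
    target = Path(directory)
    wdl_path = target / "workflow.wdl"
    inputs_path = target / "parameters.json"
    wdl_path.write_text(HELLO_WDL)
    # Workflow inputs are keyed as <workflow name>.<input name>.
    inputs = {"hello.name": "All of Us"}
    inputs_path.write_text(json.dumps(inputs, indent=2) + "\n")
    return wdl_path, inputs_path
```

Calling `write_workflow_files("/home/jupyter")` would place both files in the root folder shown in the Jupyter file browser.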
We also recommend saving your WDL and JSON files to your workspace bucket. This allows you to easily access the files to run your workflow and to use the files again even after you delete your cloud environment. You can save files to your bucket with the following command:
gsutil cp <filename> $WORKSPACE_BUCKET
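The same copy can be run from a notebook cell. This is a small sketch with hypothetical function names; it assumes the `WORKSPACE_BUCKET` environment variable is set in your Jupyter environment, as the command above implies, and that `gsutil` is on the PATH.

```python
import os
import subprocess


def build_copy_command(filename, bucket=None):
    """Return the gsutil argument list for copying a file to the bucket root."""
    bucket = bucket or os.environ.get("WORKSPACE_BUCKET")
    if not bucket:
        raise RuntimeError("WORKSPACE_BUCKET is not set in this environment")
    # A trailing slash makes gsutil treat the destination as a folder.
    destination = bucket.rstrip("/") + "/"
    return ["gsutil", "cp", filename, destination]


def copy_to_workspace_bucket(filename, bucket=None):
    """Copy one local file to the workspace bucket; raises if gsutil fails."""
    subprocess.run(build_copy_command(filename, bucket), check=True)
```

For example, `copy_to_workspace_bucket("workflow.wdl")` mirrors `gsutil cp workflow.wdl $WORKSPACE_BUCKET`.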
The workspace bucket is attached to your workspace. You can share the workspace bucket with your colleagues by sharing the workspace. See this article on workspace buckets for more details about workspace storage.
We recommend pausing or deleting your Cromwell environment when you are not actively running workflows in order to reduce cost. Auto-pause and auto-delete are currently not supported for Cromwell environments, so you must actively control the status of your Cromwell environments.
After completing your analysis, you can pause or delete your Cromwell environment from the right hand menu. Select the cloud icon to see a summary of your active applications. From there, you can pause or delete the Cromwell environment.
When you delete a Cromwell environment, you lose workflow metadata from that environment. Any of your results or data can be saved in your workspace bucket and will not get deleted when you delete the Cromwell environment.
Cromwell incurs a per-workspace cost of $0.20/hour whether it is running or paused. Each Cromwell instance also incurs a per-app cost of $0.20/hour while it is running.
- When running: $0.20/hour + ($0.20/hour x number of Cromwell apps running).
- When paused: $0.20/hour
If you have one Cromwell application in your workspace: Cromwell costs $296/month when running and $148/month when paused. Each additional Cromwell application will cost $148/month when running or paused.
Note: these costs do not include your persistent disk.
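The arithmetic behind these figures can be sketched as follows. The calculation follows the hourly formula above. The 740 hours per month is an assumption inferred from the quoted figures ($148 / $0.20 per hour), and the function names are hypothetical. Amounts are held in whole cents to avoid floating-point rounding.

```python
WORKSPACE_CENTS_PER_HOUR = 20  # per-workspace cost, running or paused
APP_CENTS_PER_HOUR = 20        # per-app cost, charged only while running
HOURS_PER_MONTH = 740          # assumption inferred from $148 / $0.20 per hour


def hourly_cost_cents(running_apps, paused=False):
    """Hourly Cromwell cost in cents, excluding persistent disk."""
    if paused:
        return WORKSPACE_CENTS_PER_HOUR
    return WORKSPACE_CENTS_PER_HOUR + APP_CENTS_PER_HOUR * running_apps


def monthly_cost_dollars(running_apps, paused=False, hours=HOURS_PER_MONTH):
    """Estimated monthly Cromwell cost in dollars for the given hours."""
    return hourly_cost_cents(running_apps, paused) * hours / 100


print(monthly_cost_dollars(1))               # one app running
print(monthly_cost_dollars(1, paused=True))  # one app paused
```

With one application this reproduces the $296 (running) and $148 (paused) monthly figures quoted above.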
We recommend always pausing your Cromwell environment when you are not running workflows in order to avoid the cost to run Cromwell. As a reminder, auto-pause and auto-delete are not supported.
Known issues and limitations for users
Two users cannot start applications at the same time in a workspace
In workspaces with more than one active user, two users cannot start an application (Cromwell, Jupyter, etc.) within a few minutes of each other. If this happens, the application will not start for one of the users.
- Mitigation: The solution is to wait a few minutes and try again.
- Remediation: This will be addressed with a clearer error message before shipping to prod.
Changing the combination of Jupyter and Cromwell environments in a workspace
The ‘All of Us Cromwell Setup Python snippets’ must be run anytime there is a change in the combination of Jupyter and Cromwell environments in order to correctly link the environments. We recommend re-running the snippets whenever there is an unexpected error.
Only Google Container Registry is supported for docker images
Due to the Internet access restriction on Workbench batch VMs, standard docker repositories such as Docker Hub are not accessible to WDLs. Instead, configure all tasks in your WDLs to run public docker images from Google Container Registry (GCR). Typically, GCR docker URLs start with us.gcr.io/. As an example, a GATK docker image in GCR is us.gcr.io/broad-gatk/gatk:<version>, as opposed to broadinstitute/gatk:<version> in Docker Hub. You can learn more about this limitation in the Overview of Batch Processing on the All of Us User Support Hub.
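Before submitting, it can help to scan a WDL for docker images that are not hosted on GCR. The check below is a rough sketch, not an official validator: the function name is hypothetical, and it only sees images written as literal strings on a `docker:` line, so images passed in through variables or inputs are not detected.

```python
import re

# Registry hostnames served by Google Container Registry.
GCR_HOSTS = ("gcr.io", "us.gcr.io", "eu.gcr.io", "asia.gcr.io")

# Matches lines such as:   docker: "us.gcr.io/project/image:tag"
DOCKER_LINE = re.compile(
    r"""^\s*docker\s*:\s*["']([^"']+)["']""",
    re.MULTILINE,
)


def find_non_gcr_images(wdl_text):
    """Return literal docker images whose registry host is not a GCR host."""
    images = DOCKER_LINE.findall(wdl_text)
    return [image for image in images
            if image.split("/", 1)[0] not in GCR_HOSTS]
```

An empty list means every literal image in the file points at GCR. Any entries returned, such as a Docker Hub style `broadinstitute/gatk:<version>`, should be replaced with their GCR equivalents.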