In this article, we explain how to use publicly available Docker images from DockerHub on the Researcher Workbench. Through the Google Artifact Registry remote repository feature, pulling images from DockerHub lets you quickly customize the software and tools you use for your analyses.
Introduction
Docker images are packages of customized software, libraries, and environment variables that can be loaded and run on your machine to set up your compute environment and dependencies. Using Docker images can save time and effort by eliminating the need to recreate common software environments from scratch.
One common use case for Docker images is to set up software and a compute environment that is necessary for your analysis but that is not available in the base environment on the Workbench.
There are multiple repositories where you can find Docker images. In this article, we are focusing on publicly available images on DockerHub. DockerHub is a commonly used repository for Docker images, where you can access or store images. To learn more about creating your own Docker image, see the Docker documentation.
Another common use case for Docker images on the Researcher Workbench is within batch processing, including Cromwell and Nextflow. You can learn more about how to use Docker images within your batch processing analyses in these resources:
- WDL resources on the Terra Support site
- WDL resources on the Dockstore site
- dsub resources on the User Support Hub
Using the All of Us artifact registry repository to pull Docker images
To use a Docker image from DockerHub, you will use the Google Artifact Registry remote repository feature to pull the image into the Researcher Workbench.
To use this feature, you will reference the environment variable ARTIFACT_REGISTRY_DOCKER_REPO, which corresponds to us-central1-docker.pkg.dev/all-of-us-rw-prod/aou-rw-gar-remote-repo-docker-prod. When you set up the Docker image, you refer to its location by prepending this artifact registry variable to the location of your Docker image.
For example, if you want to use the latest Ubuntu image from DockerHub, the base location is ubuntu:latest. You append this location to the artifact registry variable. Exactly how you combine the two depends on the analysis tool you are using. Generally, you append the DockerHub image location to the ARTIFACT_REGISTRY_DOCKER_REPO variable, separated by a forward slash: os.environ["ARTIFACT_REGISTRY_DOCKER_REPO"] + "/ubuntu:latest".
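To make this concrete, here is a minimal sketch of building the full image path in a Python notebook cell. On the Workbench the ARTIFACT_REGISTRY_DOCKER_REPO variable is already set; the fallback value below is included only so the example runs outside that environment:

```python
import os

# On the Workbench this variable is pre-set; the fallback here is only
# so the example runs outside that environment.
repo = os.environ.get(
    "ARTIFACT_REGISTRY_DOCKER_REPO",
    "us-central1-docker.pkg.dev/all-of-us-rw-prod/aou-rw-gar-remote-repo-docker-prod",
)

# Prepend the remote repository path to the DockerHub image location,
# separated by a forward slash.
image = f"{repo}/ubuntu:latest"
print(image)
```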
The following examples demonstrate how you would use this in different analysis tools.
Setting up a Docker image within a WDL
In this example of setting up a Docker image within a Workflow Description Language file (WDL for short), we build the docker runtime attribute by taking the value of ARTIFACT_REGISTRY_DOCKER_REPO and appending /ubuntu:latest.
This WDL script can be used in batch analyses using tools like Cromwell.
import os

wdl_filename = "hello.wdl"
wdl_content = """
task hello {
  String addressee
  command {
    echo "Hello ${addressee}!"
  }
  output {
    String salutation = read_string(stdout())
  }
  runtime {
    docker: '""" + os.environ["ARTIFACT_REGISTRY_DOCKER_REPO"] + """/ubuntu:latest'
  }
}

workflow wf_hello {
  call hello
  output {
    hello.salutation
  }
}
"""

with open(wdl_filename, "w") as fp:
    fp.write(wdl_content)
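The hello task above takes an addressee input, so when you run this WDL with Cromwell you will also supply an inputs JSON file. As a minimal sketch (the filename is arbitrary; the input key follows WDL's standard workflow.call.variable naming convention):

```python
import json

inputs_filename = "hello_inputs.json"

# Input keys follow WDL's workflow.call.variable naming convention.
inputs = {"wf_hello.hello.addressee": "World"}

with open(inputs_filename, "w") as fp:
    json.dump(inputs, fp, indent=2)
```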
Setting up a Docker image using Nextflow
When running a Nextflow batch analysis, you can set up the Docker image within the Nextflow run command.
Here is an example of the command in a Python Jupyter notebook. We are using the latest hla-la Docker image posted by zlskidmore.
!nextflow run test.nf -c ~/.nextflow/config -profile gls -process.container="${ARTIFACT_REGISTRY_DOCKER_REPO}/zlskidmore/hla-la:latest"
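The command above assumes a test.nf workflow file already exists. As an illustrative sketch only (the script name and process are hypothetical), you could write a minimal Nextflow script from a notebook cell the same way the WDL example writes hello.wdl; note that no container is hard-coded in the script, because the -process.container flag supplies it at run time:

```python
nf_filename = "test.nf"

# A minimal Nextflow (DSL2) process. The Docker container is injected
# at run time via -process.container, so none is specified here.
nf_content = """
process sayHello {
    output:
    stdout

    script:
    '''
    echo "Hello from Nextflow"
    '''
}

workflow {
    sayHello | view
}
"""

with open(nf_filename, "w") as fp:
    fp.write(nf_content)
```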
Using GCR to access Docker images
We generally recommend using the Google Artifact Registry steps above to access a Docker image. However, as a workaround, you can manually upload public images to the Google Container Registry (GCR) and then use them on the Researcher Workbench.
Because the All of Us Researcher Workbench is built on Google Cloud Platform (GCP) architecture, you can push images to GCR and use them on the Workbench. For a GCR image to be usable on the Workbench, the project or bucket that hosts it must be public; only public GCR images can be used. You cannot host a Docker image in the GCP project or bucket associated with a workspace, because workspace buckets are restricted in many ways for security reasons, and hosting images is one of those restrictions. In practice, this means you cannot use your @researchallofus.org account for this process; you will need a personal or institutional GCP account that can create public projects to host GCR images.
To summarize the caveats, this process has several requirements:
1. You will need to create a Google Cloud account that is separate from your @researchallofus.org account.
2. Using that Google Cloud account, you will need to create a new, public project to which you can push the image. If you only have a private project, create a new project that can be made public so that images pushed there can be accessed by environments on the Workbench. Consult the GCP documentation about this process or talk to your project's admins to ensure you have the correct permissions.
3. You will need to install the Google Cloud SDK to use the Google Cloud Command Line Interface (CLI) to run commands in a terminal session of your local machine.
4. You will need to install docker on your local machine.
5. You need a Docker image on Docker Hub that you want to pull and push to GCR. For this example, we will use the zlskidmore/hla-la image.
Once the above requirements are satisfied, here is how to transfer that image from Docker Hub to GCR so you can access it on the Workbench:
1. Open a command line or terminal session on your local machine.
2. Install and authenticate the Google Cloud SDK on your local machine. Once the SDK is installed, authenticate your terminal session:
gcloud auth login
3. Configure Docker with Google Cloud SDK: To configure Docker to use gcloud as a credential helper, run:
gcloud auth configure-docker
4. Pull the Docker Image: With Docker Desktop fully installed and open on your local machine, pull the Docker image from Docker Hub:
docker pull zlskidmore/hla-la
5. Tag the Image for GCR: Tag the Docker image with a registry name that includes your GCR path. The registry name includes the Project ID of the public Google Cloud project you created; the Project ID is visible on the GCP console page for your project (click ‘console’ in the main GCP menu to reach it).
Remember: you cannot use the Project ID associated with any Workspace Google Projects on the All of Us Researcher Workbench for this operation; the project must be created by you using a Google Cloud account that is distinct from your @researchallofus.org account. The path will typically look like gcr.io/[YOUR_PROJECT_ID]/[IMAGE_NAME]:[TAG]. Here is an example tagging command; replace [YOUR_PROJECT_ID] with your actual GCP project ID and [TAG] with the tag of the image you want to use:
docker tag zlskidmore/hla-la gcr.io/YOUR_PROJECT_ID/hla-la:latest
6. Push the Docker Image to GCR: Push the Docker image to Google Container Registry:
docker push gcr.io/YOUR_PROJECT_ID/hla-la:latest
7. Verify the Image in GCR: Visit the GCP Console's Container Registry section to verify that your image was successfully pushed. Remember to replace YOUR_PROJECT_ID with your actual GCP project ID in the commands above.
8. Once verified, you can now use the image in operations on the Workbench using the ‘gcr.io/YOUR_PROJECT_ID/hla-la:latest’ path.
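Once the image is public in GCR, you reference it by its full path rather than through ARTIFACT_REGISTRY_DOCKER_REPO. As a small sketch (YOUR_PROJECT_ID is a placeholder for your actual GCP project ID), the docker line in the earlier WDL runtime block would be built like this:

```python
project_id = "YOUR_PROJECT_ID"  # placeholder: your public GCP project's ID
gcr_image = f"gcr.io/{project_id}/hla-la:latest"

# e.g. the line you would put in a WDL runtime block:
runtime_line = f"docker: '{gcr_image}'"
print(runtime_line)
```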