Researcher Workbench 2.0 introduces powerful new tools for data exploration and analysis. Its modern interface gives you access to enhanced features such as JupyterLab, improved batch workflow support, and a more organized storage system. To help you compare these updates with Researcher Workbench 1.0, the glossary below outlines what is new and what carries over.
Glossary of Terms
Many of the terms used to describe the All of Us Research Program and dataset will remain unchanged. However, as you explore the Researcher Workbench 2.0, you may encounter new terminology. This glossary is designed to help you become familiar with the language and concepts introduced in the new platform. For a complete list of terms specific to the Verily Workbench, please refer to the Verily Workbench glossary.
| Researcher Workbench 2.0 Term | Researcher Workbench 1.0 Comparison |
| Pods - Pods connect workspaces to cloud billing. A billing pod is created to connect to a GCP billing account, and workspace costs are charged to the billing pod. One pod can be used for many workspaces, and you can invite other users to be part of your pod. Note: All of Us researchers have the pod_manager role, which allows you to create your own billing pod and link it to your designated GCP billing account. A billing pod can be connected to only one GCP billing account. | GCP Billing Account |
| Data Catalog - This is a catalog of the data collections you have access to. In the Researcher Workbench 2.0, your @researchallofus.org username gives you access to data in the Registered Tier or Controlled Tier data collections. | Access Tier, CDR version |
| Data Collection - Curated datasets that are published in the Verily Workbench catalog. All of Us Research Program data collections are synonymous with the Curated Data Repository (CDR). There are two data collections available in the Researcher Workbench 2.0: “All of Us Registered Tier” and “All of Us Controlled Tier.” | All of Us Dataset, Registered Tier, Controlled Tier, Curated Data Repository |
| Data Collection Policies - Restrictions that may be attached to workspaces and data collections and that dictate how the data can be accessed and used, chiefly for the purposes of privacy and legal compliance. Several data collection policies are attached to the All of Us data collections and workspaces, such as the perimeter policy, region policy, group policy, network policy, and “Researcher Use Statement Questions.” | |
| Group policy - A group policy limits workspace access and data sharing to members of the selected groups. The All of Us data collections group policy only allows users with affiliated @researchallofus.org usernames to access a workspace that contains an All of Us data collection. To collaborate on a workspace, all users must be approved for access to the data collection attached to the workspace. For example, only users who have access to Controlled Tier data can access workspaces that contain the All of Us Controlled Tier data collection. Users are automatically added to the All of Us data collections once they complete the applicable data access requirements. As in the legacy Researcher Workbench, a workspace cannot contain both Registered Tier and Controlled Tier data collections; you can add only one of the two to a given workspace. | |
| Perimeter policy - A perimeter policy restricts data movement, such as the copy, transfer, and retrieval of data, to within the cloud boundary. In the Researcher Workbench 2.0, data collections and workspaces can be placed within a perimeter to enforce these limits. The All of Us Research Program requires workspaces using All of Us data collections (Registered Tier and Controlled Tier) to be restricted within a perimeter, and each workspace can belong to only one perimeter. A workspace perimeter is automatically and permanently assigned when you add an All of Us data collection to your workspace. | |
| Region policy - A region policy limits which regions of a cloud platform may be used to create cloud resources and apps. The Researcher Workbench 2.0 uses Google Cloud Platform (GCP) and its regions. All of Us data collections and workspaces are restricted and automatically assigned to the region us-central1 (Iowa). When you create a workspace in Researcher Workbench 2.0, the cloud resources and apps created in that workspace are automatically kept within this region. This is the same region restriction that exists in the legacy Researcher Workbench. | |
| Network policy - A network policy disables direct internet access for virtual machines (VMs) that run batch jobs. For example, you will be unable to use certain SSH commands or to reach the VM externally over the internet. | |
| Data Explorer - Data Explorer is a point-and-click interface within the Researcher Workbench 2.0 that enables users to build datasets using All of Us Data Collections. It combines and modernizes the functionality of the Cohort Builder, Concept Selector, and Dataset Builder into a single streamlined tool. | Cohort Builder and Dataset Builder |
| Researcher Use Statement Questions - The All of Us Data User Code of Conduct (DUCC) requires researchers to provide transparency into their study plans for each workspace. Before you can create a workspace, you must provide a thorough, meaningful description of your research project and study plans in the “Workspace Description Form.” These questions are the same as the prompts provided in the Workspace Description Form in the legacy Researcher Workbench. | Workspace description |
| Applications (Apps) - Apps is a general term for any cloud-hosted software application available in the Researcher Workbench. A variety of applications are available in the Researcher Workbench 2.0. | Jupyter Environments, SAS Studio, RStudio |
| Cloud Resource - A cloud resource is any computing component or service provided by a cloud platform that can be provisioned, managed, and consumed on demand. These resources typically include compute (virtual machines, containers, etc.), storage (workbench bucket, persistent disk, object storage), and platform services (APIs, orchestration services). It is a broad term for resources that can be added or created via cloud storage or services. In the Researcher Workbench 2.0, under the “Resources” tab of a workspace, there are two main types of cloud resources: referenced resources and controlled resources. | |
| Notebook - A notebook resides within JupyterLab and provides an interface for writing and running code in languages such as R, Python, and SQL. It lets you work within a single file for tasks like cleaning and transforming data, exploring datasets, and building machine learning models, while also supporting visualizations and narrative text. | Jupyter Notebook |
| JupyterLab - JupyterLab is an open-source web application that provides an interactive environment for notebooks, code, and data. In the Researcher Workbench, selecting “JupyterLab” will create a standard Google Compute Engine (GCE) instance. | General Analysis Environment, Standard Environment |
| JupyterLab Spark cluster - This is a JupyterLab extension on the Dataproc cluster service on Google Cloud Platform. In the Researcher Workbench, selecting “JupyterLab Spark cluster” will create a Dataproc cluster environment. | Hail Genomics Analysis, Dataproc cluster |
| JupyterLab (NVIDIA NeMo) - NVIDIA NeMo is an NVIDIA software suite that uses a GPU-accelerated environment for large-scale generative AI tasks, such as training, customizing, and deploying your own AI, LLMs, and multimodal models. In the Researcher Workbench, selecting “JupyterLab (NVIDIA NeMo)” will create a Compute Engine (GCE) instance with GPUs for AI development. | |
| JupyterLab (NVIDIA Parabricks and CUDA-X Data Science) - NVIDIA Parabricks and CUDA-X Data Science is an NVIDIA suite of software and libraries that uses GPU-accelerated environments for secondary and tertiary genomics analysis. This environment allows large-scale genomic and multi-omics analyses to run much faster than in a CPU-only environment. In the Researcher Workbench, selecting “JupyterLab (NVIDIA Parabricks and CUDA-X Data Science)” will create a Compute Engine (GCE) instance with GPUs tailored for genomics analysis. | |
| R analysis environment - This is a Compute Engine instance used to launch the RStudio interface in Researcher Workbench 2.0. | RStudio |
| Controlled resources - A controlled resource is a cloud resource created within a specific workspace, such as a cloud storage bucket or bucket object. It is specific to that workspace and is deleted when the workspace, or the resource itself, is deleted. To use it in another workspace, you must create a reference to the original controlled resource. | Workspace Bucket |
| Workflows - This term broadly refers to computational workflows that automate multi-stage data processing, streamlining tasks by executing them autonomously. Researcher Workbench 2.0 supports workflows on Cromwell, dsub, and Nextflow, which can be accessed through the "Workflows" section of a workspace. | Cromwell, dsub, Nextflow |
| Verily Pre - Verily Pre is an AI-native precision health data platform from Verily, purpose-built to accelerate biomedical research and deploy AI solutions in healthcare environments. It is what powers the Researcher Workbench 2.0. Within Verily Pre, the All of Us Researcher Workbench leverages the Workbench and Exchange. | |
| Referenced resource - A referenced resource (or reference) is a pointer to data or elements that exist outside your workspace, allowing you to use them without altering the original source. For example, creating a reference to a BigQuery dataset lets you analyze it in your workspace while keeping the source intact. References can be safely deleted or duplicated across workspaces without affecting the original resource, provided you maintain access to the source. | |
| Exchange - This is a data collection catalog available in the Researcher Workbench 2.0 where researchers can find and access additional biomedical data available on the Verily Pre platform. | |
| Git repository - A Git repository is a version-controlled directory that stores project files and their change history. Typically, repositories are hosted on platforms such as GitHub. You can add Git repositories to your Researcher Workbench 2.0 workspace as references. When you create a cloud app for analysis, your repository is cloned into the app. | |
| Command-line interface (CLI) - A command-line interface (CLI) is a text-based interface that uses defined commands to execute user actions. Using the command line requires more computational knowledge than using a graphical user interface (GUI) such as JupyterLab. You can use the CLI from Linux or from a variety of virtual machines, including apps in Researcher Workbench 2.0 itself. The “Workbench CLI” package is pre-installed in all app images offered by Researcher Workbench 2.0. | |
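The difference between controlled and referenced resources in the glossary above can be illustrated with a small Python model. This is only a conceptual sketch: the `Workspace` and `Cloud` classes, their methods, and the bucket name are hypothetical and are not part of the Workbench CLI or any Verily API. It shows that deleting a reference leaves the source intact, while deleting the owning workspace removes its controlled resources.

```python
from dataclasses import dataclass, field


@dataclass
class Workspace:
    """Toy stand-in for a workspace (hypothetical, not the Workbench API)."""
    name: str
    controlled: set = field(default_factory=set)   # resources created in this workspace
    references: set = field(default_factory=set)   # pointers to resources owned elsewhere


class Cloud:
    """Hypothetical registry of the resources that actually exist."""

    def __init__(self):
        self.existing = set()

    def create_controlled(self, ws, resource):
        # A controlled resource is created inside, and owned by, one workspace.
        self.existing.add(resource)
        ws.controlled.add(resource)

    def add_reference(self, ws, resource):
        # A reference only points at a resource that already exists elsewhere.
        if resource not in self.existing:
            raise ValueError(f"{resource} does not exist")
        ws.references.add(resource)

    def delete_reference(self, ws, resource):
        # Removing the pointer does not touch the source.
        ws.references.discard(resource)

    def delete_workspace(self, ws):
        # Controlled resources are deleted together with their workspace.
        self.existing -= ws.controlled
        ws.controlled.clear()
        ws.references.clear()


cloud = Cloud()
source = Workspace("source")
analysis = Workspace("analysis")

cloud.create_controlled(source, "gs://example-bucket")   # hypothetical bucket name
cloud.add_reference(analysis, "gs://example-bucket")

cloud.delete_reference(analysis, "gs://example-bucket")
print("gs://example-bucket" in cloud.existing)   # the source survives reference deletion

cloud.delete_workspace(source)
print("gs://example-bucket" in cloud.existing)   # gone once the owning workspace is deleted
```

In this model a second workspace can only ever hold a pointer, which mirrors the glossary's statement that using a controlled resource elsewhere requires creating a reference to the original.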
Feature Comparison
| Feature | Researcher Workbench 1.0 | Researcher Workbench 2.0 |
| Jupyter Notebooks | ✔️ | ✔️ |
| JupyterLab | | ✔️ |
| Workspace Bucket | ✔️ | ✔️ |
| Re-attachable Persistent Disk | ✔️ | |
| RStudio | ✔️ | ✔️ |
| SAS Studio | ✔️ | ✔️ |
| Visual Studio Code | | ✔️ (coming soon) |
| GPUs | ✔️ | ✔️ |
| Hail | ✔️ | ✔️ |
| Docker Hub images | ✔️ | ✔️ |
| Plink | ✔️ | ✔️ |
| NVIDIA software/libraries | | ✔️ |
| Git repo integration | ✔️ | ✔️ |
| Data Explorer (aka Cohort Builder/Dataset Builder) | ✔️ | ✔️ |
| Dataproc clusters | ✔️ | ✔️ |
| Customizable Compute Engine resource (CPUs, RAM) | ✔️ | ✔️ |
| Nextflow | ✔️ | ✔️ |
| Cromwell | ✔️ | ✔️ |
| dsub | ✔️ | ✔️ |
| Featured Workspaces | ✔️ | ✔️ |
| Terminal access | ✔️ | ✔️ |
| Variant Search Tool | ✔️ | ✔️ |