In Researcher Workbench 2.0, cohorts and datasets can be created through the new Data Explorer, powered by Verily Pre. This updated tool is similar to the Cohort Builder and Dataset Builder in Researcher Workbench 1.0, and lets you visually explore data, design custom cohorts, and export datasets directly to your workspaces.
Accessing Data Explorer in your workspace
After adding an All of Us Data Collection, you can start working with Data Explorer by creating a cohort in the Resources tab of your workspace.
- In your workspace, navigate to the Resources tab.
- Select the ‘New Resource’ button, and then ‘New Cohort’.
- Choose the All of Us data collection that you selected when adding a data collection to your workspace. For example, for All of Us Registered Tier data, select R2024Q3R8, which is CDR v8 Registered Tier. A full list of CDR versions is available in the Data Dictionary.
  - CDR v8 Registered Tier = R2024Q3R8
  - CDR v8 Controlled Tier = C2024Q3R8
- Review the data collection policies, complete the “Researcher Use Statement Questions,” and select “I'm sure. I understand that all policies and terms above will be permanently applied to this workspace.”
- Enter the details for your cohort: a cohort name, a description, and the workspace bucket folder to add the cohort to. Then select ‘Add to your workspace.’
- You will automatically be redirected to the Data Explorer interface to begin building your cohort and dataset.
For a full step-by-step guide to creating a cohort and dataset in Data Explorer, see the article Get started with Data Explorer or this interactive tutorial here.
Overview of Data Explorer Features
Creating a cohort
The point‑and‑click interface in Data Explorer lets you apply criteria to select the participants you want to include in your research project. Below are a few key features of the updated interface.
- Create Custom Cohorts - The ‘Cohort filter’ allows you to filter your cohort by inclusion or exclusion criteria. You can create one or more groups based on certain criteria. Click ‘Add some criteria’ to display the options for your filter criteria, such as EHR domains, program data, or source code fields. To apply new filter criteria, you must select “Apply” at the bottom to refresh the cohort.
- Cohort Visualization - Cohort visualizations are a series of bar graphs that show breakdowns of your cohort, including age, sex assigned at birth, and top conditions. As you select your cohort criteria, the visualizations update to reflect the results. You can hover over each bar to see a more detailed breakdown.
- Review individuals - Similar to the ‘Cohort Review’ in the Researcher Workbench 1.0, selecting ‘Review Individuals’ allows you to examine the individual participants included in your cohort. You can generate a review set that provides a snapshot of participant data, helping you confirm that your chosen criteria produced the expected results. This feature also enables you to add annotations to document details about your cohort.
- Save data snapshot - After you’ve created a cohort with the appropriate inclusion and exclusion criteria, select ‘Save data snapshot’ to save your cohort and proceed to building your dataset.
Build a data snapshot to export
After creating your cohort, you can generate a data snapshot, which lets you apply additional concept criteria to assemble a dataset for export. You can create data snapshots and notebooks to export directly to your workspace. You'll also be able to view the SQL queries needed to generate the data snapshot. These queries are available in Python and R, and can be run in Jupyter notebooks in your workspace.
- Data snapshot steps - The main steps in generating a data snapshot are:
  - Add any additional concept sets about your cohort.
  - Select the file format you want to export. Tables and SQL queries can be exported in the following formats:
    - Zipped or unzipped .csv files
    - Queries for the cohort (IPYNB) with R Notebook
    - Queries for the cohort (IPYNB) with Python Notebook
  - Select a destination in your workspace bucket.
- Add concept sets - You can select concepts from various data domains that you want to examine in your cohort under ‘Browse more data.’ Some prepackaged concepts are available to you, such as common demographics from the person table. Once you add the concept set, select the ‘Apply’ button to refresh the preview.
- Manage columns and table views - After selecting concept sets, you can manage the columns available in your tables for export. By default, all table columns are selected, but you can deselect any you want to exclude, either by using the checkbox next to a column name or by selecting ‘Manage columns’ and using the toggle. We recommend including only the columns you need for your analysis. Several views are also available to review or copy before you export:
  - Tables - this view shows the CDR tables used in the query (e.g., person, condition_occurrence, measurement) in a structured format.
  - Queries for each table - this view displays the SQL query to access each CDR table.
  - Queries for cohort - this view displays the SQL query generated for your cohort based on your inclusion and exclusion criteria.
  - Summary - this view provides a high-level summary of the tables, fields, and criteria in your data snapshot prior to export.
Work with your data
Once you successfully export your data snapshot, it will be available in the workspace bucket destination you specified. To work with a data snapshot created in the Data Explorer, you’ll need to create a cloud app such as a JupyterLab environment.
- From your workspace's Apps tab, select ‘New app instance’ > ‘JupyterLab’
- Customize your cloud environment as needed.
- Once the cloud app environment has been created, open JupyterLab by clicking on the name of your app. The app has been created successfully when its status shows ‘Running’ in green.
- Search for the data snapshot under the workspace bucket file destination. Select the file format of your choice to run.
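As a minimal sketch of that last step, the Python below reads an exported table into memory using only the standard library. It assumes the snapshot was exported as a .csv file or as a .zip of .csv files, and that the file is reachable at a local path inside the JupyterLab environment. The function name `read_snapshot_csv` and the file name in the usage note are hypothetical and are not part of Data Explorer.

```python
import csv
import io
import zipfile
from pathlib import Path


def read_snapshot_csv(path):
    """Read an exported .csv, or every .csv inside a .zip, into lists of row dicts.

    Returns a dict that maps each file name to its rows.
    The path is an assumption: point it at wherever your export is mounted or copied.
    """
    path = Path(path)
    if path.suffix.lower() == ".zip":
        tables = {}
        with zipfile.ZipFile(path) as archive:
            for name in archive.namelist():
                if name.lower().endswith(".csv"):
                    with archive.open(name) as raw:
                        # Archive members open in binary mode; wrap them as text for csv.
                        text = io.TextIOWrapper(raw, encoding="utf-8", newline="")
                        tables[name] = list(csv.DictReader(text))
        return tables
    with path.open(newline="", encoding="utf-8") as handle:
        return {path.name: list(csv.DictReader(handle))}
```

For example, `tables = read_snapshot_csv("person.csv")` would return a dictionary with one entry, keyed by file name, whose value is a list of row dictionaries. All values are read as strings, so numeric columns need converting before analysis.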
Tips for using All of Us data collections with Data Explorer
Data Explorer is a new tool powered by Verily Pre. While it shares some features with the Cohort Builder and Dataset Builder, its underlying methods don’t always match those used in the original Researcher Workbench. The guidance below offers recommendations and tips to help you use the Data Explorer effectively with All of Us data collections.
- Cohorts autosave, including when you rename your cohort. However, to apply new filter criteria, you must select “Apply” at the bottom to refresh the cohort.
- Selecting ‘Meets any criteria’ applies an “OR” operator, while ‘Meets all criteria’ applies an “AND” operator. For example, choosing ‘Meets any criteria’ will include participants who satisfy either criterion A or criterion B, while choosing ‘Meets all criteria’ will include only participants who satisfy both criterion A and criterion B.
- We recommend using ‘Meets all criteria’ (the “AND” operator) when working with multiple criteria groups. For example, use Group 1 for demographic criteria and Group 2 for EHR domain criteria.
- Temporal options are available, but only work with EHR domain criteria. To enable the temporal feature, you’ll need to select one of the four temporal feature criteria.
- Only ‘Current Age’ is available for age demographic criteria within the Data Explorer user interface (UI). To use ‘Age at CDR’ or ‘Age at event’, we recommend manually calculating this age in your notebook.
- To filter your inclusion criteria, first select the criterion you want to add. Then hover over the criterion to display the available filter options. For example, choose the +Ethnicity criterion to add it to your group, and then select the ‘Ethnicity’ label to view its filter options.
- To edit, apply modifiers to, or delete inclusion or exclusion criteria, use the following icons:
  - Pencil icon = edit the criteria
  - Modifiers = apply the specified modifiers to the criteria
  - Filter slash = disable the criteria for export without deleting them
  - Trash bin = delete the criteria
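The ‘Meets any criteria’ and ‘Meets all criteria’ options described in the tips above behave like set union and set intersection. The sketch below illustrates that logic with plain Python sets. The participant IDs and the helper names `meets_any` and `meets_all` are hypothetical and only illustrate the operators; this is not how the Data Explorer computes cohorts internally.

```python
def meets_any(*criteria):
    """OR: participants who satisfy at least one criterion."""
    return set().union(*criteria)


def meets_all(*criteria):
    """AND: participants who satisfy every criterion (needs at least one)."""
    return set.intersection(*(set(c) for c in criteria))


# Hypothetical participant IDs matching each criterion.
criterion_a = {101, 102, 103}
criterion_b = {102, 103, 104}

print(sorted(meets_any(criterion_a, criterion_b)))  # [101, 102, 103, 104]
print(sorted(meets_all(criterion_a, criterion_b)))  # [102, 103]
```

An exclusion group can be pictured the same way, as a set difference: `meets_all(criterion_a, criterion_b) - excluded_ids`.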
Appendix A - Known Issues
Although the Data Explorer shares several features with the Cohort Builder and Dataset Builder, its backend methodology may not always align with those of the Researcher Workbench 1.0. As a result, you may notice differences in participant counts or changes in the user interface. The following list outlines known issues or anticipated differences between these products.
Known Issue - Personal and Family Health History (PFHH) survey
The Personal and Family Health History (PFHH) survey is currently missing from the Data Explorer when using v8 Controlled Tier data collection. However, it is available in the Data Explorer when using the Registered Tier v8 data collection. Our team is actively working to resolve this issue. In the meantime, the PFHH survey can be accessed by querying it directly using a custom SQL query.
Known Issue - Temporal feature count differences
You may see count differences between the Cohort Builder and the Data Explorer when using temporal criteria. The Data Explorer calculates date differences with TIMESTAMP_DIFF, which measures time down to the millisecond before rounding to days, while the Cohort Builder uses a different SQL method. Although both apply the same inclusive 30‑day logic, small variations, such as which event date is used as the reference point or how fractional days near the boundary are handled, can result in slight count discrepancies. This is to be expected. If you’ve already begun an analysis in Researcher Workbench 1.0 that relies on temporal criteria, we recommend continuing that work in the original environment for consistency. For any new cohort development, you can use the Data Explorer tool in Researcher Workbench 2.0.
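To make the boundary effect concrete, the sketch below compares two ways of measuring the gap between a pair of hypothetical events against a 30-day window. It assumes the timestamp-based method counts whole elapsed 24-hour periods, and it uses a calendar-date difference as a stand-in for the Cohort Builder's method, which this article does not specify. Neither function is the actual SQL used by either tool.

```python
from datetime import datetime


def elapsed_whole_days(start, end):
    """Whole 24-hour periods between two timestamps (assumed timestamp-style diff)."""
    return (end - start).days


def calendar_day_gap(start, end):
    """Difference between the calendar dates only (assumed date-style diff)."""
    return (end.date() - start.date()).days


# Hypothetical events: 30 days and 2 hours apart, but 31 calendar dates apart.
first_event = datetime(2024, 1, 1, 23, 0)
second_event = datetime(2024, 2, 1, 1, 0)

window_days = 30
print(elapsed_whole_days(first_event, second_event) <= window_days)  # True
print(calendar_day_gap(first_event, second_event) <= window_days)    # False
```

Under these assumptions the same pair of events falls inside the window with one method and outside it with the other, which is the kind of near-boundary case that can produce small count differences.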
Known Issue - Null or unknown age values
When using the ‘Current Age’ flag in the Data Explorer, some field values may return as ‘null’ or ‘unknown.’ This is due to the Data Explorer not returning results on deceased participants. To calculate age for these participants, you can query the date of birth directly in a Jupyter Notebook.
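Once dates of birth are loaded in a notebook, an age can be computed for any reference date, which also covers ‘Age at event’ or ‘Age at CDR’ style calculations. The sketch below is one common way to do this with the standard library. The function name `age_on` and the dates in the example are hypothetical, and the reference date for a CDR cutoff is something you would need to supply yourself.

```python
from datetime import date


def age_on(date_of_birth, reference_date):
    """Completed years between a date of birth and a reference date."""
    had_birthday = (reference_date.month, reference_date.day) >= (
        date_of_birth.month,
        date_of_birth.day,
    )
    return reference_date.year - date_of_birth.year - (0 if had_birthday else 1)


# Hypothetical participant and event dates.
birth = date(1980, 6, 15)
print(age_on(birth, date(2020, 6, 14)))  # 39, the day before the birthday
print(age_on(birth, date(2020, 6, 15)))  # 40, on the birthday
```

Because this compares calendar dates rather than dividing elapsed days by a fixed year length, it does not depend on the time of day of either value.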
Known Issue - Age cohort count difference
Age-cohort discrepancies may appear between the Cohort Builder and the Data Explorer when the upper bound of an age range is a multiple of four years. The Data Explorer approximates leap years using a 365.25-day divisor. Every four-year cycle aligns exactly with this calculation, which makes the computed age highly sensitive to small timestamp differences at those boundaries. As a result, even a few hours of difference between the query timestamp and the participant’s birth timestamp can shift the computed age across the boundary (e.g., from 39 to 40). At non-multiples of four, the fractional part of the age calculation provides a buffer, so small timing differences don’t affect the final rounded-down age. For example, if a participant was born on March 1, 1984 at 2:00 AM, and an age query was run in the Data Explorer on March 1, 2024 at 2:00 PM, the Cohort Builder may show 40 years old, but the Data Explorer may show 39 years old. This is an expected difference between the two tools.
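The sensitivity at four-year multiples can be seen in a small sketch. Forty calendar years starting on March 1, 1984 span exactly 14,610 days, which is exactly 40 × 365.25, so any sub-day offset decides which side of the boundary the result lands on. The function `approx_age` below is a simplified model of a 365.25-day divisor with hypothetical timestamps. It is not the Data Explorer's actual query, so the direction of the shift for a given pair of timestamps may differ from what the tool reports.

```python
from datetime import datetime, timedelta

DAYS_PER_YEAR = 365.25  # leap-year approximation described above


def approx_age(birth, as_of):
    """Rounded-down age from elapsed time divided by 365.25 days (simplified model)."""
    elapsed_days = (as_of - birth) / timedelta(days=1)
    return int(elapsed_days // DAYS_PER_YEAR)


# Hypothetical birth timestamp, queried twice on the 40th birthday.
birth = datetime(1984, 3, 1, 8, 0)
print(approx_age(birth, datetime(2024, 3, 1, 2, 0)))   # 39, six hours short of 14,610 days
print(approx_age(birth, datetime(2024, 3, 1, 14, 0)))  # 40, six hours past 14,610 days
```

In this model the two queries are only twelve hours apart on the same calendar date, yet they return different ages.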
Known Issue - Error messages when using source codes
Some source codes may occasionally trigger errors in the Data Explorer user interface, or pull in additional tables outside of your selection. You may encounter messages such as:
- Selecting the “Tables that contain cohort-defining criteria” tables together with CPT-4, ICD-9-Proc, and ICD-10-PCS source codes returns the error: “Error Previewing Data. Correlated subqueries that reference other tables are not supported unless they can be de-correlated, such as by transforming them into an efficient JOIN.”
- Intermittent timeout error when searching for source concepts: [Something went wrong Not found: Table wb-affable-acorn-7941:R2024Q3R8_index_110625 T_ENT_icd10pcsConcept was not found in location us-central1]
Our team is actively working on a fix for this issue. In the meantime, we recommend using standard concept IDs to avoid these error messages.
Known Issue - Variant Search UI
The variant search function in the Data Explorer is still under development. As a result, some aspects of this feature may be unavailable, such as gene sorting and filtering, including filtering by allele count, allele number, and allele frequency. Additionally, using the “+ Select All Results” button in the interface may display the message “Error: Something went wrong — Cohort not found” in the side panel view. Despite the message, you can still select all variants and proceed through the Data Explorer workflow.
Known Issue - Data Explorer support for CDR v7
Data Explorer is not available for use with CDR v7 in Researcher Workbench 2.0. To continue your analysis, please use the most recent CDR v8 in the Researcher Workbench.