Featured Workspaces - Table of Contents

Scope:

The purpose of the All of Us Featured Workspaces Table of Contents is to assist users in navigating the Featured Workspaces tab within the Researcher Workbench.  

 

Background:

All researchers with a Researcher Workbench account have access to “Featured Workspaces,” which are workspaces designed to provide examples of cohorts, concept sets, and data analyses that can be used to inform or enhance your work. These workspaces can be accessed from the left-hand navigation bar in the Researcher Workbench by clicking the section labeled “Featured Workspaces.” To edit one of these workspaces, you must first clone it: click the “snowman” icon on the left side of the workspace icon and then select “Duplicate.” This opens the cloned workspace's “About” section, where you can change the name or any other description before clicking “Duplicate Workspace.” If you do not clone a workspace, you will not be able to use it or copy its code for your own analyses. Featured Workspaces are currently divided into three categories: Tutorial Workspaces, Demonstration Projects, and Phenotype Library.

 

Tutorial Workspaces

If you are new to the Researcher Workbench, the Tutorial Workspaces tab within Featured Workspaces is a great starting point for learning how to analyze data within the All of Us dataset.  Workspaces in this section will walk you through basic data manipulation, analysis techniques specific to the All of Us data, and backing up your research within the Researcher Workbench. The list below describes the Tutorial Workspaces currently available in the Researcher Workbench: 

 

Skills Assessment Training Notebooks For Users

This Featured Workspace contains multiple notebooks that assess users' understanding of the Workbench and OMOP. These notebooks are meant to help users check their knowledge not only of Python, R, and SQL, but also of the general data structure and data model used by the All of Us program. If you are new to the All of Us Researcher Workbench or just need a refresher, this Featured Workspace is for you. The notebook suite in this workspace will teach you the basics of what you need to know to start and complete your analysis.

How to get started with Registered Tier Data (v7)

This Featured Workspace will give you an overview of what data is available in the current Registered Tier Curated Data Repository (CDR). It will also teach you how to retrieve information about Electronic Health Records (EHRs), Physical Measurements (PM), and Survey data. Notebooks are provided in both Python and R, so you can use your preferred programming language. The notebooks walk through the supplied code to explain what is being performed. The code blocks can be copied into a new notebook, set in either edit or playground mode, to be run.

How to get started with Controlled Tier Data (v7)

This Featured Workspace gives an overview of the data types available in the current Controlled Tier Curated Data Repository (CDR) that are not available in the Registered Tier. It teaches users how to set up the associated notebook, install and import software packages, and select the correct version of the CDR. Notebooks are provided in both Python and R, so you can use your preferred programming language. The notebooks walk through the supplied code to explain what is being performed. The code blocks can be copied into a new notebook, set in either edit or playground mode, to be run.

Data Wrangling in the All of Us Program (v7)

This Featured Workspace is targeted at new users and covers basic data wrangling in the Workbench. It is related to the popular Office Hours session on this topic and expands significantly on the Office Hours demo. It provides a step-by-step walkthrough of building cohorts, pulling specific types of data associated with the cohort, merging this data into one final data frame, and then visualizing it and analyzing it with some common statistical tests.

How to work with All of Us Survey Data (v7)

This Featured Workspace should help you become familiar with how to query PPI (Participant Provided Information) questions/surveys, and what the frequencies of answers for each question in the PPI module are. The tutorial notebooks will walk you through examples of various ways in which to extract and visualize Participant Provided Information (PPI) survey data from the All of Us CDR. Ultimately, you should be able to use example code from these notebooks and (with a few minor changes) see similar results in your own research. 

How to work with All of Us COPE Survey Data (v7) (CT)

This Featured Workspace will give an overview of COPE survey data available in the current Controlled Tier Curated Data Repository (CDR) and how to retrieve them. 

How to work with All of Us Physical Measurement Data (v7)

This Featured Workspace helps you get familiar with how to navigate physical measurements data. The tutorial notebooks in this Featured Workspace demonstrate how to extract Physical Measurement (PM) data on All of Us participants, get an idea of what is in the PM data, and determine whether PM data is derived from Participant Provided Information (PPI) or Electronic Health Records (EHR). Measurements collected include: height, weight, BMI, waist circumference, hip circumference, pregnancy status, blood pressure, heart rate, and wheelchair use.

How to work with Wearable Device Data (v7)

This Featured Workspace characterizes the Fitbit data elements available in the current Curated Data Repository (CDR) and provides best practices and tips for how to retrieve them.

How to Backup Notebooks and Intermediate Results

The notebooks in this Featured Workspace will give an overview of how to create snapshots of notebooks and backups of intermediate results stored in other files such as plot images and derived data. This includes how to save snapshots of a notebook for later review, allowing users to track changes to results in notebooks over time, and how to create files such as image files of plots or CSVs of intermediate results that you would like to retain. 
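The idea of timestamped snapshots and saved intermediate results can be sketched in a few lines of Python. This is a minimal illustration using only the standard library, not the code from the Featured Workspace itself; the function names (`snapshot_name`, `backup_file`, `save_intermediate_csv`) and the file names are hypothetical, and copying the backups to your workspace bucket is not shown.

```python
import csv
import shutil
from datetime import datetime, timezone
from pathlib import Path


def snapshot_name(path, stamp):
    """Insert a timestamp before the file extension (hypothetical naming scheme)."""
    p = Path(path)
    return f"{p.stem}_{stamp}{p.suffix}"


def backup_file(src, backup_dir, now=None):
    """Copy src into backup_dir under a timestamped name and return the new path."""
    now = now or datetime.now(timezone.utc)
    stamp = now.strftime("%Y%m%d_%H%M%S")
    dest_dir = Path(backup_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)
    dest = dest_dir / snapshot_name(src, stamp)
    shutil.copy2(src, dest)  # copy2 also preserves file timestamps
    return dest


def save_intermediate_csv(rows, fieldnames, path):
    """Write a list of dicts to a CSV file so derived results can be retained."""
    with open(path, "w", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)
    return Path(path)
```

For example, calling `backup_file("analysis.ipynb", "backups")` after each major change would keep a dated copy of a notebook that can be compared with later versions.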

How to Run Notebooks in the Background

Some analyses take a long time to run. To avoid interrupting long-running jobs, the notebook in this Featured Workspace runs code in the background. If you wish to capture all notebook cell outputs, use this notebook to run your long-running notebook (or any other long-running notebook). Please note that the cluster will auto-pause after 24 hours. To prevent your cluster from shutting down if your background job takes longer than 24 hours, be sure to log in and start a notebook, any notebook, to reset the auto-pause timer.
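One common way to run a notebook in the background and keep all of its cell outputs is to execute it with `jupyter nbconvert` from a detached process. The sketch below illustrates that general technique and is not necessarily how the Featured Workspace notebook works; the helper names and file names are hypothetical, and it assumes `jupyter nbconvert` is installed in your environment.

```python
import subprocess


def build_nbconvert_command(notebook, output_name):
    """Build a command that executes a notebook and saves an executed copy."""
    return [
        "jupyter", "nbconvert",
        "--to", "notebook",
        "--execute", str(notebook),
        "--output", output_name,
        # Disable the per-cell timeout so long-running cells are not cut off.
        "--ExecutePreprocessor.timeout=-1",
    ]


def run_in_background(notebook, output_name, log_path):
    """Start the execution in a separate session and return the process handle."""
    with open(log_path, "w") as log:
        process = subprocess.Popen(
            build_nbconvert_command(notebook, output_name),
            stdout=log,
            stderr=subprocess.STDOUT,
            start_new_session=True,  # keep running if the launching notebook stops
        )
    return process
```

The executed copy contains every cell output, so you can open it later to review the results. The auto-pause behavior described above still applies to jobs started this way.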

How to work with Genomic Data (Hail - Plink) (v7)

This Featured Workspace has a series of notebooks to help you get started with All of Us genomic data and tools. The All of Us Research Program provides whole genome sequencing (WGS) and microarray genotyping (arrays) data in different formats, such as variant call format (VCF), Hail MatrixTable, Hail VariantDataset, CRAM files, IDAT files, PLINK BED files and other auxiliary files. The notebooks in this Featured Workspace provide example analyses showing how to use Hail and PLINK to perform genome-wide association studies using the All of Us genomic and phenotypic data.

How to work with Genomics Data (CRAM_processing and IGV) _v7HC

The notebooks in this Featured Workspace demonstrate how to copy or localize All of Us CRAM files to your workspace bucket and active cloud environment in order to look at their contents with the Integrative Genomics Viewer (IGV).

How to Run WDLs using Cromwell in the Researcher Workbench (v7)

The notebook in this Featured Workspace shows how to set up Cromwell, how to use the automatically created Cromwell configuration file, and how to write a WDL script that uses All of Us genomic data as an input. In this tutorial workspace, you will learn how to set up your Cloud Environment and use Cromwell to execute an example script, validate_vcf.wdl. This workflow uses the GATK ValidateVariants tool to validate VCF files. VCF files, corresponding index files, and the human reference genome assembly are provided to the workflow as inputs. The tutorial should take around 20 minutes.

How to use Nextflow in the Researcher Workbench (v7)

This Featured Workspace shows how to set up Nextflow, how to use the automatically created Nextflow configuration file and how to write a Nextflow script to use All of Us genomic data as an input. This notebook is designed to be an example of how to use a Nextflow script within the Researcher Workbench with the array single sample variant call format (VCF) files as inputs. 

How to use dsub in the Researcher Workbench (v7)

This Featured Workspace has a series of notebooks that demonstrate how to set up dsub in the Researcher Workbench. They cover how to run a single job, how to check job status, how to debug a failed job, and how to run the wc command in parallel.
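As a rough illustration of the parallel case, dsub can read a tab-separated tasks file (passed with its `--tasks` flag) in which the header row names each parameter and every following row defines one task. The sketch below only generates such a file for a wc-style job. It is not taken from the Featured Workspace, and the function name, bucket paths, and variable names `INPUT_FILE` and `OUTPUT_FILE` are assumptions made for illustration.

```python
import csv
import io


def make_wc_tasks(input_uris, output_prefix):
    """Return tasks-file text with one row (one parallel task) per input file."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    # Header row: parameter type and name for each column.
    writer.writerow(["--input INPUT_FILE", "--output OUTPUT_FILE"])
    for uri in input_uris:
        name = uri.rsplit("/", 1)[-1]
        writer.writerow([uri, f"{output_prefix}/{name}.count"])
    return buffer.getvalue()


if __name__ == "__main__":
    # Hypothetical bucket paths; replace them with files in your own workspace bucket.
    print(make_wc_tasks(
        ["gs://example-bucket/a.txt", "gs://example-bucket/b.txt"],
        "gs://example-bucket/counts",
    ))
```

Each row would then run the same command, for example one that applies wc to `${INPUT_FILE}` and writes the result to `${OUTPUT_FILE}`, as an independent task.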

Genomics Undergrad Lesson Plan Exemplar (v6)

This Featured Workspace is user support content to help users mentor students and trainees on projects in the Registered and Controlled Tiers. Our plan is to produce documents outlining a recommended path to onboarding, with resources for mentors as well as sample workspaces for each type of Tier project using the R coding language, for All of Us users to use in the Workbench as they see fit.

How to Reproduce All of Us SARS-CoV-2 Antibody Study - Original Study (v4)

The notebooks in this Featured Workspace give an overview of the "Antibodies to SARS-CoV-2 in All of Us Research Program participants, January 2 - March 18, 2020" study in the current Curated Data Repository (CDR) and show how to reproduce it.

How to Reproduce All of Us SARS-CoV-2 Antibody Study (v5)

This Featured Workspace provides instructions for the replication of the "Antibodies to SARS-CoV-2 in All of Us Research Program participants, January 2 - March 18, 2020" study published in Clinical Infectious Diseases. Full citation: Althoff, K., Schlueter, D.J., Anton-Culver, H., Cherry, J., Denny, J., Thomsen, I., ... Schully, S. (2021). Antibodies to SARS-CoV-2 in All of Us Research Program participants, January 2 - March 18, 2020. Clinical Infectious Diseases, ciab519, https://doi.org/10.1093/cid/ciab519

How to work with Long Read Data (v7)

This Featured Workspace is a tutorial that shows how to localize the All of Us (AoU) long read BAM files individually and how to render the Integrative Genomics Viewer (IGV) on the All of Us Researcher Workbench to explore the BAM files. The workspace contains two functional notebooks, both written in Python: (1) a notebook dedicated to various approaches to BAM localization, and (2) a notebook for looking at the BAM files in your current environment with IGV.

 

Demonstration Projects

Workspaces in the Demonstration Projects section of Featured Workspaces show end-to-end analyses performed using All of Us data. These projects demonstrate the quality, utility, and diversity of the All of Us data by replicating findings in previously published studies. The list below describes the Demonstration Projects currently available in the Researcher Workbench:

 

Data Quality Reports - 2022Q2R2 v6 CDR

This workspace provides detailed demographic information about the participants in the 2022Q2R2 version 6 curated data repository (CDR), as well as a summary of participants by data type. Tables and useful graphs are included. Notebooks include: summary of participants by data type, demographic characteristics of participants by data types, and UBR (underrepresented in biomedical research) breakdown.

Data Quality Reports - C2022Q4R9 v7 CDR

This workspace provides an overview of data availability and demographic characterization. It also provides an overview of the number of participants who meet the UBR (underrepresented in biomedical research) definition. Notebooks include: summary of participants by data type, demographic characteristics of participants by data types, UBR breakdown, UBR by data types, and genomics by race and data types.

Demo -N3C Machine Learning PASC/Long COVID Phenotype Algorithm in the All of Us dataset

This demonstration workspace is a collaboration between the All of Us Research Program, National COVID Cohort Collaborative (N3C), PCORnet, and NIH/RECOVER to examine and identify participant risk of Long COVID utilizing the N3C's machine learning (ML) PASC/Long COVID Phenotype algorithm within the All of Us Researcher Workbench. The XGBoost machine learning model, initially published by Pfaff et al., was used to identify potential patients with PASC/Long COVID. These models were subsequently implemented within the All of Us Controlled Tier dataset (C2022Q2R2; v6).

Wearables and the Human Phenome

This demonstration workspace and corresponding notebook examine the associations between physical activity over time (measured using participant wearables) and incident chronic diseases as determined by EHR data. It is an example of how the Workbench can be leveraged to work with Fitbit data using R.

Demo - PheWAS Smoking 

This demonstration project presents the results of Phenome-Wide Association Studies (PheWAS) to show how the various sources of data contained within the All of Us research dataset can be used to inform scientific discovery. Specific goals of this workspace are to demonstrate how to implement a PheWAS within the All of Us Researcher Workbench, demonstrate use of heterogeneous data sources within the All of Us dataset, and develop plots that compare the results of the EHR smoking, PPI ever smoking, and PPI smoking every day PheWAS routines.

Demo - Cardiovascular Risk Scoring

In this demonstration project, the AHA algorithm/equation is used to calculate cardiovascular risk scores. The project also demonstrates how the smoking and race data collected by the program, which researchers usually have to extract using natural language processing, can facilitate the calculation of the cardiovascular risk score.

Demo - Medication Sequencing 

This demonstration project uses the medication sequencing algorithms developed at Columbia University and the OHDSI network to characterize treatment pathways at scale. It implements these algorithms in the All of Us research dataset to show how the various sources of data contained within the program can be used for this purpose.

Demo - All of Us Descriptive Statistics 

In this demonstration project, we apply data visualization libraries to aggregate information about the cohort. We measure age as the participant's age when the CDR was generated. Specific goals include: give an overview of the data types included in the beta release Curated Data Repository (CDR), describe participants by age, race, and ethnicity using all the available data types, and describe the underrepresented in biomedical research (UBR) population among All of Us Research Program participants.

Demo - Family History in EHR and PPI

This demonstration project summarizes structured data elements available in the All of Us Registered Tier and compares them to published survey results to describe data for reuse in disease-specific outcomes. Specific questions include: 1. Could harnessing informatics tools like predictive modeling and clinical decision support to detect and alert healthcare providers to these preventative measures significantly improve the precise care we deliver to patients? 2. How can one evaluate the availability of family medical history information within the All of Us Registered Tier data and characterize the structured data elements from both data sources?

Demo - Hypertensive Prevalence 

The aim of this demonstration project was to use published methods to replicate known differences in hypertension prevalence in UBR groups and illustrate variation in hypertension prevalence across geographic regions of the U.S. We compared our results to the 2015–2016 National Health and Nutrition Examination Survey (NHANES) hypertension prevalence results. https://www.cdc.gov/nchs/products/databriefs/db289.htm

Demo - Systemic Disease and Glaucoma 

The aims of this demonstration project were to: (1) externally validate our single-center model’s performance with All of Us data, (2) develop models trained by the All of Us data and compare their performance to our single-center model, and (3) share insights from our experience using All of Us data and the Researcher Workbench with other ophthalmology researchers who may be interested in using this novel data source. 

Demo - Siloed Analysis of All of Us and UK Biobank Genomic Data

The primary goal of this demonstration project is to demonstrate the potential of the All of Us Researcher Workbench for pooled analyses of All of Us and UK Biobank data. Specifically, we aim to: 1. Develop and describe an approved, secure path for connecting UK Biobank data to the All of Us Researcher Workbench. 2. Conduct a genome-wide association study of blood lipids on the pooled dataset aimed at demonstrating that biomedical researchers can be more productive when permitted to analyze the union of the cohorts, as opposed to computing aggregate results in separate data silos for each cohort and then combining those aggregates. In this workspace are all the notebooks needed to perform a regenie genome-wide association study of lipids over the exonic variants within the All of Us (AoU) alpha3 release of genomic data in a siloed fashion.

Replication of Dissecting Racial Bias Paper 

This demonstration project is a replication study for the paper “Dissecting racial bias in an algorithm used to manage the health of populations.” The associated notebook will be used in the AIM-AHEAD workshop to answer the following questions: 1. Can we predict the health status of participants in the next year using their health status and demographics? 2. Does the prediction for the health status differ by participants' race and education? 3. What are the most important features associated with the health status of participants? Notebooks in this workspace demonstrate how to train a machine learning model that predicts the health status of a participant in the year following the participant's enrollment.

Demo - Geographic Variation in Obesity (v4) 

This demonstration project examined the quality and utility of the All of Us Research Hub Workbench for accelerating precision medicine by replicating methods from existing studies that examine the prevalence of obesity at the population level. We evaluated the measurements of obesity in the physical measurements (PM) data set and the electronic health record (EHR) data set using methods similar to the Ward et al. NEJM December 2019 publication that assessed prevalence of obesity in the US by state using BRFSS data.

Phenotype Library

Workspaces in the Phenotype Library section of Featured Workspaces demonstrate how computable electronic phenotypes can be implemented within the All of Us dataset using examples of previously published phenotype algorithms. The list below describes the workspaces in the Phenotype Library currently available in the Researcher Workbench: 

 

Phenotype - Breast Cancer

By reading and running the notebooks in this Phenotype Library workspace, researchers can implement the following published phenotype algorithms: Ning Shang, George Hripcsak, Chunhua Weng, Wendy K. Chung, & Katherine Crew. Breast Cancer. Retrieved from https://phekb.org/phenotype/breast-cancer. The notebooks supplement the validated Breast Cancer Phenotype by demonstrating a method to identify additional participants according to their answers to breast cancer-related questions on participant provided information (PPI) surveys, and describe the query performed to capture a cohort according to a predefined algorithm for breast cancer.

Phenotype - Breast Cancer (Controlled Tier)

By reading and running the notebooks in this Phenotype Library workspace, researchers can implement the following published phenotype algorithms: Ning Shang, George Hripcsak, Chunhua Weng, Wendy K. Chung, & Katherine Crew. Breast Cancer. Retrieved from https://phekb.org/phenotype/breast-cancer.

Phenotype - Dementia 

By reading and running the notebooks in this Phenotype Library workspace, researchers can implement the following published phenotype algorithms: Ritchie, M., Denny, J., Crawford, D., Ramirez, A., Weiner, J., … Roden, D. (2010). Robust replication of genotype-phenotype associations across multiple diseases in an electronic medical record. American Journal of Human Genetics. 87(2):310 doi: 10.1016/j.ajhg.2010.03.003. The notebooks describe the query performed to capture a cohort according to a predefined phenotype algorithm for dementia.

Phenotype - Depression

By reading and running the notebooks in this Phenotype Library workspace, researchers can implement the following published phenotype algorithm from the eMERGE network: TBA. KPWA/UW. Depression. PheKB; 2018. Available from: https://phekb.org/phenotype/1095. The associated notebook describes the query performed to capture 3 cohorts according to a predefined phenotype algorithm for depression.

Phenotype - Ischemic Heart Disease 

By reading and running the notebooks in this Phenotype Library workspace, researchers can implement the following published phenotype algorithm: Christianne L. Roumie; Jana Shirey-Rice, Sunil Kripalani. Vanderbilt University. MidSouth CDRN - Coronary Heart Disease Algorithm. PheKB; 2014. Available from https://phekb.org/phenotype/234. The associated notebook describes the query performed to capture a cohort according to a predefined phenotype algorithm for ischemic heart disease, also known as coronary artery disease.

Phenotype - Ischemic Heart Disease (Controlled Tier)

By reading and running the notebooks in this Phenotype Library workspace, researchers can implement the following published phenotype algorithm: Christianne L. Roumie; Jana Shirey-Rice, Sunil Kripalani. Vanderbilt University. MidSouth CDRN - Coronary Heart Disease Algorithm. PheKB; 2014. Available from https://phekb.org/phenotype/234. The associated notebook describes the query performed to capture a cohort according to a predefined phenotype algorithm for ischemic heart disease, also known as coronary artery disease.

Phenotype - Diabetes

By reading and running the notebooks in this Phenotype Library workspace, researchers can implement the following published phenotype algorithm: Jennifer Pacheco and Will Thompson. Northwestern University. Type 2 Diabetes Mellitus. PheKB; 2012. Available from: https://phekb.org/phenotype/18. The associated notebook describes the query performed to capture a cohort according to a predefined phenotype algorithm for Type 2 Diabetes according to four different cases.

Phenotype - Diabetes (Controlled Tier)

By reading and running the notebooks in this Phenotype Library workspace, researchers can implement the following published phenotype algorithm: Jennifer Pacheco and Will Thompson. Northwestern University. Type 2 Diabetes Mellitus. PheKB; 2012. Available from: https://phekb.org/phenotype/18. The associated notebook describes the query performed to capture a cohort according to a predefined phenotype algorithm for Type 2 Diabetes according to four different cases.
