Featured Workspaces - Table of Contents


Featured workspaces provide examples of data querying, wrangling, and analysis to support you when using the All of Us Researcher Workbench. There are four types of featured workspaces: tutorial workspaces, demonstration projects, phenotype library, and community workspaces.

Tutorial Workspaces

From basic data manipulation to analysis techniques specific to the All of Us data, tutorial workspaces are a great place to start for learning how to analyze data within the All of Us Researcher Workbench.

Beginner Intro to AoU Data and the Workbench

This Featured Workspace contains multiple notebooks that assess users' understanding of the workbench and OMOP. These notebooks are meant to help users check their knowledge not only on Python, R, and SQL, but also on the general data structure and data model used by the All of Us program. If you are new to the All of Us Researcher Workbench or just need a refresher, this Featured Workspace is for you. The notebook suite in this workspace will teach you the basics of what you need to know to start and complete your analysis. 

How to get started with Registered Tier Data (v7)

This Featured Workspace gives you an overview of the data available in the current Registered Tier Curated Data Repository (CDR). It also teaches you how to retrieve Electronic Health Record (EHR), Physical Measurement (PM), and Survey data. Notebooks are provided in both Python and R, so you can work in your preferred programming language. The notebooks walk through the supplied code to explain what is being performed. The code blocks can be copied into a new notebook, set in either edit or playground mode, to be run. This Featured Workspace also includes a tutorial on how to use RStudio in the Researcher Workbench: Data_101_Fundamentals_Rstudio.Rmd

How to get started with Controlled Tier Data (v7)

This Featured Workspace gives an overview of the data types available in the current Controlled Tier Curated Data Repository (CDR) that are not available in the Registered Tier. It teaches users how to set up the associated notebook, install and import software packages, and select the correct version of the CDR. Notebooks are provided in both Python and R, so you can work in your preferred programming language. The notebooks walk through the supplied code to explain what is being performed. The code blocks can be copied into a new notebook, set in either edit or playground mode, to be run.

Data Wrangling in the All of Us Program (v7)

This Featured Workspace is targeted at new users and covers basic data wrangling in the Workbench. It is related to the popular Office Hours session on this topic and expands significantly on the Office Hours demo, providing a step-by-step walkthrough of building cohorts, pulling specific types of data associated with the cohort, merging this data into one final data frame, and then visualizing and analyzing it with some common statistical tests.

How to work with All of Us Survey Data (v7)

This Featured Workspace should help you become familiar with how to query PPI (Participant Provided Information) questions/surveys, and what the frequencies of answers for each question in the PPI module are. The tutorial notebooks will walk you through examples of various ways in which to extract and visualize Participant Provided Information (PPI) survey data from the All of Us CDR. Ultimately, you should be able to use example code from these notebooks and (with a few minor changes) see similar results in your own research. 
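
As a rough illustration of the answer-frequency idea described above, the following Python sketch tallies answers per question from a handful of made-up rows. The field names (`question`, `answer`) and the sample rows are hypothetical; they are not taken from the Featured Workspace notebooks or from the CDR.

```python
from collections import Counter, defaultdict

# Hypothetical survey rows; real PPI data would come from a CDR query.
SAMPLE_ROWS = [
    {"question": "General health", "answer": "Good"},
    {"question": "General health", "answer": "Good"},
    {"question": "General health", "answer": "Fair"},
    {"question": "Smoked 100 cigarettes", "answer": "No"},
]


def answer_frequencies(rows):
    """Count how often each answer appears for each question."""
    counts = defaultdict(Counter)
    for row in rows:
        counts[row["question"]][row["answer"]] += 1
    return counts


if __name__ == "__main__":
    # Print answers for each question, most frequent first.
    for question, answers in answer_frequencies(SAMPLE_ROWS).items():
        for answer, n in answers.most_common():
            print(f"{question}: {answer} = {n}")
```

The same tally could be done in SQL or with a data frame library; a plain `Counter` is used here only to keep the sketch dependency-free.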

How to work with All of Us COPE Survey Data (v7) (CT)

This Featured Workspace will give an overview of COPE survey data available in the current Controlled Tier Curated Data Repository (CDR) and how to retrieve them. 

How to work with All of Us Physical Measurement Data (v7)

This Featured Workspace helps you get familiar with how to navigate around physical measurements data. The tutorial notebooks in this Featured Workspace demonstrate how to extract Physical Measurement (PM) data on All of Us participants, get an idea of what is in the PM data, or determine whether PM data is derived from Participant Provided Information (PPI) or Electronic Health Records (EHR). Measurements collected include: height, weight, BMI, waist circumference, hip circumference, pregnancy-status, blood pressure, heart rate, and wheelchair use. 

How to work with Wearable Device Data (v7)

This Featured Workspace gives an overview of the Fitbit data elements available in the current Curated Data Repository (CDR) and provides best practices and tips for retrieving them.

How to Backup Notebooks and Intermediate Results

The notebooks in this Featured Workspace will give an overview of how to create snapshots of notebooks and backups of intermediate results stored in other files such as plot images and derived data. This includes how to save snapshots of a notebook for later review, allowing users to track changes to results in notebooks over time, and how to create files such as image files of plots or CSVs of intermediate results that you would like to retain. 
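
The snapshot idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: it copies a file into a local backup directory under a timestamped name, whereas the notebooks in the Featured Workspace may use a different mechanism and destination. The function name `snapshot_file` and the file names are hypothetical.

```python
import shutil
import tempfile
from datetime import datetime, timezone
from pathlib import Path


def snapshot_file(source, backup_dir, now=None):
    """Copy `source` into `backup_dir` under a timestamped name; return the new path."""
    source = Path(source)
    backup_dir = Path(backup_dir)
    backup_dir.mkdir(parents=True, exist_ok=True)
    now = now or datetime.now(timezone.utc)
    stamp = now.strftime("%Y%m%d_%H%M%S")
    # Keep the original extension so the snapshot opens like the source file.
    target = backup_dir / f"{source.stem}_{stamp}{source.suffix}"
    shutil.copy2(source, target)
    return target


if __name__ == "__main__":
    # Demonstrate on a throwaway file (hypothetical notebook name).
    with tempfile.TemporaryDirectory() as tmp:
        notebook = Path(tmp) / "analysis.ipynb"
        notebook.write_text("{}")
        print(snapshot_file(notebook, Path(tmp) / "backups"))
```

Because each snapshot gets its own timestamped name, earlier copies are never overwritten, which is what makes it possible to compare results over time.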

How to Run Notebooks in the Background

Some analyses take a long time to run. To avoid interrupting long-running jobs, the notebook in this Featured Workspace runs code in the background. If you wish to capture all notebook cell outputs, use it to run your long-running notebook (or any other long-running notebook). Please note that the cluster will auto-pause after 24 hours. If your background job takes longer than 24 hours, be sure to log in and start a notebook, any notebook, to reset the auto-pause timer and prevent your cluster from shutting down.
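
One general way to execute a notebook without keeping its browser tab open is to launch `jupyter nbconvert --execute` as a detached process. The sketch below illustrates that pattern only; it is an assumption, not necessarily how the notebook in this Featured Workspace works, and the helper names and file names are hypothetical. It does not reset the auto-pause timer.

```python
import os
import subprocess
import sys
import tempfile


def build_nbconvert_command(notebook, output):
    """Return the argument list that executes `notebook` and saves the result as `output`."""
    return [
        "jupyter", "nbconvert",
        "--to", "notebook",
        "--execute", notebook,
        "--output", output,
    ]


def run_in_background(command, log_path):
    """Start `command` without waiting for it; send stdout and stderr to `log_path`."""
    with open(log_path, "w") as log_file:
        # The child keeps its own handle to the log after this block closes ours.
        return subprocess.Popen(
            command,
            stdout=log_file,
            stderr=subprocess.STDOUT,
            start_new_session=True,  # detach from the launching session (POSIX)
        )


if __name__ == "__main__":
    print(build_nbconvert_command("long_job.ipynb", "long_job_out.ipynb"))
    # Demonstrate the launcher with a trivial command instead of a real notebook.
    with tempfile.TemporaryDirectory() as tmp:
        proc = run_in_background(
            [sys.executable, "-c", "print('done')"], os.path.join(tmp, "job.log")
        )
        print("exit code:", proc.wait())
```

Writing the executed notebook to a separate output file keeps the original untouched and preserves all cell outputs for later review.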

How to work with Genomic Data (Hail - Plink) (v7)

This Featured Workspace has a series of notebooks to help you get started with All of Us genomic data and tools. The All of Us Research Program provides whole genome sequencing (WGS) and microarray genotyping (arrays) data in different formats, such as variant call format (VCF), Hail MatrixTable, Hail VariantDataset, CRAM files, IDAT files, PLINK BED files, and other auxiliary files. The notebooks in this Featured Workspace provide example analyses showing how to use Hail and PLINK to perform genome-wide association studies using the All of Us genomic and phenotypic data.

How to work with Genomics Data (CRAM_processing and IGV) _v7HC

The notebooks of this Featured Workspace demonstrate how to copy or localize All of Us CRAM files to your workspace bucket and active cloud environment in order to look at their contents with the Integrated Genome Viewer (IGV). 

How to Run WDLs using Cromwell in the Researcher Workbench (v7)

The notebook in this Featured Workspace shows how to set up Cromwell, how to use the automatically created Cromwell configuration file, and how to write a WDL script that uses All of Us genomic data as an input. In this tutorial workspace, you will learn how to set up your Cloud Environment and use Cromwell to execute an example script, validate_vcf.wdl. This workflow uses the GATK ValidateVariants tool to validate VCF files. VCF files, corresponding index files, and the human reference genome assembly are provided to the workflow as inputs. The tutorial should take around 20 minutes.

How to use Nextflow in the Researcher Workbench (v7)

This Featured Workspace shows how to set up Nextflow, how to use the automatically created Nextflow configuration file and how to write a Nextflow script to use All of Us genomic data as an input. This notebook is designed to be an example of how to use a Nextflow script within the Researcher Workbench with the array single sample variant call format (VCF) files as inputs. 

How to use dsub in the Researcher Workbench (v7)

This Featured Workspace has a series of notebooks that demonstrate how to set up dsub in the Researcher Workbench. This includes how to run a single job, how to check job status, how to debug a failed job, and how to run the wc command in parallel.
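
For parallel runs, dsub can read per-task parameters from a tab-separated file passed with `--tasks`, where the header row names each parameter. The sketch below only generates such a file for a line-counting job; the bucket paths, the `INPUT_FILE` and `OUTPUT_FILE` variable names, and the helper name are hypothetical, and the Featured Workspace notebooks may organize their jobs differently.

```python
import csv
import io


def build_tasks_tsv(input_uris, output_prefix):
    """Return TSV text with one row per input file for a line-count job."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    # Header row: each column declares a parameter type and a variable name.
    writer.writerow(["--input INPUT_FILE", "--output OUTPUT_FILE"])
    for index, uri in enumerate(input_uris):
        writer.writerow([uri, f"{output_prefix}/count_{index}.txt"])
    return buffer.getvalue()


if __name__ == "__main__":
    # Hypothetical bucket paths used purely for illustration.
    print(build_tasks_tsv(
        ["gs://example-bucket/a.txt", "gs://example-bucket/b.txt"],
        "gs://example-bucket/out",
    ))
```

Each row becomes one task, so a command such as `wc -l "${INPUT_FILE}" > "${OUTPUT_FILE}"` would run once per input file.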

Genomics Undergrad Lesson Plan Exemplar (v6)

This Featured Workspace is user support content to help users mentor students and trainees on projects in the Registered and Controlled Tiers. The plan is to produce documents outlining a recommended onboarding path with resources for mentors, as well as sample workspaces for each type of Tier project using the R coding language, for All of Us users to use in the Workbench as they see fit.

How to Reproduce All of Us SARS-CoV-2 Antibody Study - Original Study (v4)

The notebooks in this Featured Workspace give an overview of the "Antibodies to SARS-CoV-2 in All of Us Research Program participants, January 2 - March 18, 2020" study in the current Curated Data Repository (CDR) and show how to reproduce it.

How to Reproduce All of Us SARS-CoV-2 Antibody Study (v5)

This Featured Workspace provides instructions for the replication of the "Antibodies to SARS-CoV-2 in All of Us Research Program participants, January 2 - March 18, 2020" study published in Clinical Infectious Diseases. Full citation: Althoff, K., Schlueter, D.J., Anton-Culver, H., Cherry, J., Denny, J., Thomsen, I., ... Schully, S. (2021). Antibodies to SARS-CoV-2 in All of Us Research Program participants, January 2 - March 18, 2020. Clinical Infectious Diseases, ciab519, https://doi.org/10.1093/cid/ciab519

How to work with Long Read Data (v7)

The purpose of this Featured Workspace is to serve as a tutorial which shows how to localize the All of Us (AoU) long read BAM files individually in addition to showing how to render the Integrated Genome Viewer (IGV) on the All of Us Researcher Workbench to explore the BAM files. This workspace contains two functional notebooks: A notebook (1) dedicated to various ways of analyzing BAM localization written in Python, and another notebook (2) for looking at the BAM files in your current environment with IGV written in Python.

Workshop: Intro to the All of Us Genomics Data

This workspace is meant for use with the Introduction to Analyzing All of Us Genomic Data workshop. In this workshop, users will get hands-on experience with the genomics data by running a genome-wide association study (GWAS) using Hail.

Best Practices for AoU Data Science

This tutorial workspace provides Python tutorial notebooks demonstrating best practices to query All of Us data and work with environment variables based on frequently asked questions by Researcher Workbench users during Office Hours or support tickets.
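
As a minimal sketch of working with environment variables when querying, the Python below builds a fully qualified table name from an environment variable instead of hard-coding a dataset. It assumes the workspace exposes the CDR dataset as `WORKSPACE_CDR` (verify this in your own environment); the helper names are hypothetical, and `person` is the standard OMOP table of that name.

```python
import os


def cdr_table(table, environ=None):
    """Return a backtick-quoted, fully qualified table name for the workspace CDR."""
    environ = os.environ if environ is None else environ
    # Assumption: the CDR dataset identifier is exposed as WORKSPACE_CDR.
    dataset = environ.get("WORKSPACE_CDR")
    if not dataset:
        raise RuntimeError("WORKSPACE_CDR is not set; run this inside a workspace notebook.")
    return f"`{dataset}.{table}`"


def person_count_query(environ=None):
    """Build a query that counts distinct participants in the OMOP person table."""
    return (
        "SELECT COUNT(DISTINCT person_id) AS n_participants "
        f"FROM {cdr_table('person', environ)}"
    )


if __name__ == "__main__":
    # Placeholder dataset name so the sketch runs outside a workspace.
    print(person_count_query({"WORKSPACE_CDR": "example-project.example_dataset"}))
```

Reading the dataset from the environment means the same notebook keeps working when it is copied to a workspace that uses a different CDR version.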

SAS 101 Data Fundamentals

This tutorial workspace provides multiple SAS files demonstrating best practices to work with the data using SAS Studio. This workspace includes best practices for performing common SAS procedures and how to explore the All of Us dataset using SAS. This tutorial is meant to educate users on the general data structure and data model used by the All of Us program.

How to query All by All results and analysis details

This Featured Workspace demonstrates methods to effectively filter and export summary statistics of interest from the All by All tables.

All by All - Drug Phenotypes Curation

This Featured Workspace provides more information about the drug exposure phenotypes for the All by All data.

All by All - Lab Measurements Phenotypes Curation

This Featured Workspace provides more information about the lab measurements phenotypes for the All by All data.  

All by All - PFHH Survey Phenotypes Curation

This Featured Workspace provides more information about the PFHH phenotypes for the All by All data.

All by All - Phecode Phenotypes Curation

This Featured Workspace provides more information about the phecode phenotypes for the All by All data.

All by All - PhecodeX Phenotypes Curation

This Featured Workspace provides more information about the phecode and phecodeX phenotypes for the All by All data.

All by All - Physical Measurements Phenotypes Curation

This Featured Workspace provides more information about the physical measurement phenotypes for the All by All data.

Introduction to Phenotypic and Survey Data

This Featured Workspace is intended to familiarize researchers with the survey, electronic health record (EHR) and Fitbit data on the Researcher Workbench. By running the exercises in this workspace, researchers will have a better understanding of how to build a cohort of participants with these data types. 

Demonstration Projects

Demonstration projects showcase the quality, utility, and diversity of the All of Us data by replicating end-to-end analyses in previously published studies in the All of Us Researcher Workbench.

Data Quality Reports - 2022Q2R2 v6 CDR

This workspace provides detailed demographic information about the participants in the 2022Q2R2 version 6 curated data repository (CDR), as well as a summary of participants by data type. Tables and useful graphs are included. Notebooks include: summary of participants by data type, demographic characteristics of participants by data types, and UBR (underrepresented in biomedical research) breakdown.

Data Quality Reports - C2022Q4R9 v7 CDR

This workspace provides an overview on the data availability and demographic characterization. It also provides an overview on the number of participants who meet the UBR (underrepresented in biomedical research) definition. Notebooks include: summary of participants by data type, demographic characteristics of participants by data types, UBR (underrepresented in biomedical research) breakdown, UBR by data types, and genomics by race and datatypes. 

Demo - N3C Machine Learning PASC/Long COVID Phenotype Algorithm in the All of Us dataset

This demonstration workspace is a collaboration between the All of Us Research Program, National COVID Cohort Collaborative (N3C), PCORnet, and NIH/RECOVER to examine and identify participant risk of Long COVID using the N3C's machine learning (ML) PASC/Long COVID Phenotype algorithm within the All of Us Researcher Workbench. The XGBoost machine learning model, initially published by Emily et al., was used to identify potential patients with PASC/Long COVID. These models were subsequently implemented within the All of Us Controlled Tier dataset (C2022Q2R2; v6).

Wearables and the Human Phenome

This demonstration workspace and corresponding notebook examine the associations between physical activity over time (measured using participant wearables) and incident chronic diseases as determined by EHR data. It is an example of how the Workbench can be used to work with Fitbit data using R.

Demo - PheWAS Smoking 

This demonstration project presents the results of Phenome-Wide Association Studies (PheWAS) to show how the various sources of data contained within the All of Us research dataset can be used to inform scientific discovery. Specific goals of this workspace are to demonstrate how to implement a Phenome-Wide Association Study within the All of Us Researcher Workbench, demonstrate use of heterogeneous data sources within the All of Us dataset, and develop plots that compare the results of EHR smoking with PPI ever smoking and PPI smoking every day PheWAS routines.

Demo - Cardiovascular Risk Scoring

In this demonstration project, the AHA algorithm/equation is used to calculate cardiovascular risk scores. The project also demonstrates the use of smoking and race data collected by the program, which researchers usually have to extract with natural language processing, to facilitate the calculation of the cardiovascular risk score.

Demo - Medication Sequencing 

This demonstration project uses the medication sequencing algorithms developed at Columbia University and the OHDSI network as a means to characterize treatment pathways at scale. The project also aims to demonstrate implementation of these medication sequencing algorithms in the All of Us research dataset to show how the various sources of data contained within the program can be used to characterize treatment pathways at scale.

Demo - All of Us Descriptive Statistics 

In this demonstration project, we apply data visualization libraries to aggregate information about the cohort. Age is measured as the participant's age when the CDR was generated. Specific goals include: give an overview of the data types included in the beta release Curated Data Repository (CDR), describe participants by age, race, and ethnicity using all the available data types, and describe the underrepresented in biomedical research (UBR) population among All of Us Research Program participants.

Demo - Family History in EHR and PPI

This demonstration project will summarize structured data elements available in the All of Us registered tier and compare to published survey results to describe data for reuse in disease specific outcomes. Specific questions include: 1. Could harnessing informatics tools like predictive modeling and clinical decision support to detect and alert health care providers to these preventative measures significantly improve the precise care we deliver to patients? 2. How can one evaluate the availability of family medical history information within the All of Us registered tier data and characterize the structured data elements from both data sources?

Demo - Hypertensive Prevalence 

The aim of this demonstration project was to use published methods to replicate known differences in hypertension prevalence in UBR groups and illustrate variation in hypertension prevalence across geographic regions of the U.S. We compared our results to the 2015–2016 National Health and Nutrition Examination Survey (NHANES) hypertension prevalence results. https://www.cdc.gov/nchs/products/databriefs/db289.htm

Demo - Systemic Disease and Glaucoma 

The aims of this demonstration project were to: (1) externally validate our single-center model’s performance with All of Us data, (2) develop models trained by the All of Us data and compare their performance to our single-center model, and (3) share insights from our experience using All of Us data and the Researcher Workbench with other ophthalmology researchers who may be interested in using this novel data source. 

Demo - Siloed Analysis of All of Us and UK Biobank Genomic Data

The primary goal of this demonstration project is to demonstrate the potential of the All of Us Researcher Workbench for pooled analyses of All of Us and UK Biobank data. Specifically, we aim to: 1. Develop and describe an approved, secure path for connecting UK Biobank data to the All of Us Researcher Workbench. 2. Conduct a genome-wide association study of blood lipids on the pooled dataset aimed at demonstrating that biomedical researchers can be more productive when permitted to analyze the union of the cohorts, as opposed to computing aggregate results in separate data silos for each cohort and then combining those aggregates. In this workspace are all the notebooks needed to perform a regenie genome-wide association study of lipids over the exonic variants within the All of Us (AoU) alpha3 release of genomic data in a siloed fashion.

Replication of Dissecting Racial Bias Paper 

This demonstration project is a replication study for the paper: Dissecting racial bias in an algorithm used to manage the health of populations. The associated notebook will be used in the AIM-AHEAD workshop to answer the questions: 1. Can we predict the health status of participants in the next year using their health status and demographics? 2. Does the prediction for the health status differ by participants' race and education? 3. What are the most important features associated with the health status of participants? Notebooks in this workspace demonstrate how to train a machine learning model that predicts the health status of a participant in the year following the participant's enrollment.

Demo - Geographic Variation in Obesity (v4) 

This demonstration project examined the quality and utility of the All of Us Research Hub Workbench for accelerating precision medicine by replicating methods from existing studies that examine the prevalence of obesity at the population level. We evaluated the measurements of obesity in the physical measurement (PM) data set and the electronic health record (EHR) data set using methods similar to the Ward et al. NEJM December 2019 publication, which assessed prevalence of obesity in the US by state using BRFSS data.

All of Us v6 GWAS on LDL Cholesterol with Regenie and dsub

The purpose of this demonstration project is to recreate an efficient and scalable Genome Wide Association Study (GWAS) across whole genome sequenced data on an LDL Cholesterol phenotype.

All of Us v7 GWAS on LDL Cholesterol with Regenie

The purpose of this demonstration project is to recreate an efficient and scalable Genome Wide Association Study (GWAS) across whole genome sequenced data on an LDL Cholesterol phenotype with Regenie and dsub with the v7 CDR.

Demo - Genetic Ancestry

This demonstration project explores how genetic ancestry can affect health outcomes via differences in the frequencies of variants associated with disease and drug response.

Regenie LDL GWAS using Cromwell (v7)

This demonstration project features notebooks using Cromwell to run regenie via WDL and a notebook to do the analysis of the regenie GWAS results. The phenotype of interest is LDL cholesterol and we'll be using participant age and sex assigned at birth as covariates along with the top 15 ancestry PCs.

Demo - Social Determinants of Health

This demonstration project reviews descriptive data analysis of the survey responses, psychometric analysis of SDOH scales, and item non-response rates across demographic variables.

Demo - Polygenic Risk Score Genetic Ancestry Calibration

This demonstration workspace aims to improve the ability to correct the genetic ancestry-dependent bias in polygenic risk score (PRS) for 10 conditions (Asthma, Atrial fibrillation, Breast Cancer, Chronic Kidney Disease, Coronary heart disease, Hypercholesterolemia, Obesity/BMI, Prostate cancer, Type 1 Diabetes, Type 2 Diabetes) using the All of Us dataset.

Demo - General Health and Well-Being of SGM Participants (v6)

This demonstration project shows the diversity and utility of the All of Us Research Program by characterizing the demographics and health conditions/behaviors of sexual and gender minority (SGM) participants.

SAS Demonstration: Diabetes mellitus medication prescription patterns (v7)

This demonstration project is related to this SAS analytics guide: SAS Analytics Guide - How to perform logistic regression. This project is an exploratory study designed to understand differences in prescriptions of newer generation diabetes medications such as GLP-1 agonists and SGLT-2 inhibitors by looking at patterns within the All of Us research dataset.

SAS Demonstration: Sociodemographic differences in treatment of mental disorders

This demonstration project is related to this SAS analytics guide: SAS Analytics Guide - How to perform binary logistic regression. This project provides an example of statistical analysis processes which were used to assess concordance between self-reported lifetime depression diagnosis and depressive disorder diagnoses documented in available electronic health records (EHR) using survey and EHR data from the All of Us dataset.

Demo - Pharmacogenomics (PGx) variant frequency and medication exposures

This demonstration project assesses the All of Us srWGS variant data for the presence of specific alleles and predicted phenotypes known to be associated with adverse drug reactions or altered dosage recommendations. Specifically, this project reviewed frequencies of pharmacogenomics variants/haplotypes in All of Us Research Program participants. Specific Controlled Tier CDR paths related to this project can be found here under srWGS PGx Haplotype Calls.

 

Phenotype Library

Phenotype library workspaces demonstrate how computable electronic phenotypes can be implemented within the All of Us dataset using examples of previously published phenotype algorithms.

Phenotype - Breast Cancer

By reading and running the notebooks in this Phenotype Library workspace, researchers can implement the following published phenotype algorithms: Ning Shang, George Hripcsak, Chunhua Weng, Wendy K. Chung, & Katherine Crew. Breast Cancer. Retrieved from https://phekb.org/phenotype/breast-cancer. The notebooks supplement the validated Breast Cancer Phenotype by demonstrating a method to identify additional participants according to their answers to breast cancer-related questions on participant provided information (PPI) surveys, and describe the query performed to capture a cohort according to a predefined algorithm for breast cancer.

Phenotype - Breast Cancer (Controlled Tier)

By reading and running the notebooks in this Phenotype Library workspace, researchers can implement the following published phenotype algorithms: Ning Shang, George Hripcsak, Chunhua Weng, Wendy K. Chung, & Katherine Crew. Breast Cancer. Retrieved from https://phekb.org/phenotype/breast-cancer.

Phenotype - Dementia 

By reading and running the notebooks in this Phenotype Library workspace, researchers can implement the following published phenotype algorithms: Ritchie, M., Denny, J., Crawford, D., Ramirez, A., Weiner, J., … Roden, D. (2010). Robust replication of genotype-phenotype associations across multiple diseases in an electronic medical record. American Journal of Human Genetics. 87(2):310 doi: 10.1016/j.ajhg.2010.03.003. The notebooks describe the query performed to capture a cohort according to a predefined phenotype algorithm for dementia.

Phenotype - Depression

By reading and running the notebooks in this Phenotype Library workspace, researchers can implement the following published phenotype algorithm from the eMERGE network: TBA. KPWA/UW. Depression. PheKB; 2018. Available from: https://phekb.org/phenotype/1095. The associated notebook describes the query performed to capture 3 cohorts according to a predefined phenotype algorithm for depression.

Phenotype - Ischemic Heart Disease 

By reading and running the notebooks in this Phenotype Library workspace, researchers can implement the following published phenotype algorithm: Christianne L. Roumie; Jana Shirey-Rice, Sunil Kripalani. Vanderbilt University. MidSouth CDRN - Coronary Heart Disease Algorithm. PheKB; 2014. Available from https://phekb.org/phenotype/234. The associated notebook describes the query performed to capture a cohort according to a predefined phenotype algorithm for ischemic heart disease, also known as coronary artery disease.

Phenotype - Ischemic Heart Disease (Controlled Tier)

By reading and running the notebooks in this Phenotype Library workspace, researchers can implement the following published phenotype algorithm: Christianne L. Roumie; Jana Shirey-Rice, Sunil Kripalani. Vanderbilt University. MidSouth CDRN - Coronary Heart Disease Algorithm. PheKB; 2014. Available from https://phekb.org/phenotype/234. The associated notebook describes the query performed to capture a cohort according to a predefined phenotype algorithm for ischemic heart disease, also known as coronary artery disease.

Phenotype - Diabetes

By reading and running the notebooks in this Phenotype Library workspace, researchers can implement the following published phenotype algorithm: Jennifer Pacheco and Will Thompson. Northwestern University. Type 2 Diabetes Mellitus. PheKB; 2012. Available from: https://phekb.org/phenotype/18. The associated notebook describes the query performed to capture a cohort according to a predefined phenotype algorithm for Type 2 Diabetes according to four different cases.

Phenotype - Diabetes (Controlled Tier)

By reading and running the notebooks in this Phenotype Library workspace, researchers can implement the following published phenotype algorithm: Jennifer Pacheco and Will Thompson. Northwestern University. Type 2 Diabetes Mellitus. PheKB; 2012. Available from: https://phekb.org/phenotype/18. The associated notebook describes the query performed to capture a cohort according to a predefined phenotype algorithm for Type 2 Diabetes according to four different cases.

Community Workspaces

Community workspaces foster knowledge-sharing, collaboration, and learning by allowing registered All of Us Researcher Workbench users to share their workspace with all other registered Researcher Workbench users.

For a full list of community workspaces, log in to the Researcher Workbench.
