Data Curation Process

Understanding the data curation process for both the phenotypic and genomic data available in the All of Us Researcher Workbench is essential for researchers utilizing the dataset.

Phenotypic data

Phenotypic data are collected from participants through surveys (Participant Provided Information [PPI]), physical measurements, electronic health records (EHRs), and wearables, both when participants enroll in the program and periodically thereafter.

  • PPI data are collected through surveys in the Participant Portal, both when participants enroll in the program and periodically thereafter. PPI data are transferred to the Data and Research Center (DRC) in Fast Healthcare Interoperability Resources (FHIR) format and transformed to the Observational Medical Outcomes Partnership (OMOP) Common Data Model (CDM) by designated data stewards.
  • Physical measurements data are collected at health partnering organizations (HPOs) when participants enroll in the program and/or via EHRs. Physical measurements data are transferred to the DRC in FHIR format and transformed to the OMOP CDM.
  • Electronic Health Record (EHR) data are collected when participants enroll and consent to share their EHRs with the program. EHR data are transformed to the OMOP CDM by designated data stewards at participant enrollment centers and sent to the DRC. These data are combined with the PPI and physical measurements data via participant identity linkage.
  • Wearables data are collected when participants enroll and consent to share their wearables data with the program. Wearables data are transferred to the DRC and transformed to BigQuery tables.
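
The linkage step described above, in which records from separate sources are combined under one participant identity, can be illustrated with a minimal sketch. This is not the program's actual pipeline, which the article does not describe. The field names `participant_id` and `value`, the source labels, and the sample records are all hypothetical.

```python
from collections import defaultdict


def link_by_participant(*sources):
    """Group records from several named sources under a shared participant ID.

    Each source is a (source_name, records) pair. Every record is assumed
    (hypothetically) to be a dict with "participant_id" and "value" keys.
    """
    linked = defaultdict(dict)
    for source_name, records in sources:
        for record in records:
            per_source = linked[record["participant_id"]].setdefault(source_name, [])
            per_source.append(record["value"])
    return dict(linked)


# Hypothetical sample records, for illustration only.
ppi = [{"participant_id": 1, "value": "survey: basics"}]
ehr = [
    {"participant_id": 1, "value": "condition: hypertension"},
    {"participant_id": 2, "value": "drug: metformin"},
]

combined = link_by_participant(("ppi", ppi), ("ehr", ehr))
```

Here participant 1 ends up with both PPI and EHR entries, while participant 2 has EHR entries only, which mirrors the fact that not every participant contributes every data type.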

Once collected, the data undergo transformation, harmonization, and anonymization before being stored in the Curated Data Repository (CDR). Researchers may access both the CDR_base and CDR via the Researcher Workbench.

  • CDR_base directly reflects the source data with changes to obfuscate participant identity, to correct errors from EHR sites or PPI data collection, and to better conform to the All of Us Research Program data model.
    Note: CDR_base is intended only for users who want to make their own decisions about how to further clean the data. It is available only by querying it directly in a Jupyter Notebook.
  • CDR is the “clean CDR,” which is more user-friendly and is the default repository available in all Researcher Workbench tools, such as the Cohort Builder and Dataset Builder. The CDR contains the same data as the CDR_base with additional cleaning applied to help harmonize and standardize the data.
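
As a rough illustration of what querying a repository directly from a notebook might look like, the sketch below builds a SQL statement against the OMOP `person` table. It rests on several assumptions: that the dataset identifier is exposed through an environment variable named `WORKSPACE_CDR`, that the fallback identifier shown is only a placeholder, and that the same pattern would apply to CDR_base. The article does not say how the CDR_base dataset is named or exposed.

```python
import os


def build_person_count_query(dataset: str) -> str:
    """Return SQL that counts distinct participants in the OMOP person table."""
    return (
        "SELECT COUNT(DISTINCT person_id) AS n_participants "
        f"FROM `{dataset}.person`"
    )


# Assumption: the workspace exposes the dataset ID in WORKSPACE_CDR.
# The fallback value is a placeholder, not a real dataset.
dataset = os.environ.get("WORKSPACE_CDR", "my-project.my_cdr_dataset")
query = build_person_count_query(dataset)
print(query)
```

The query string would then be passed to whichever BigQuery client is available in the notebook environment. That step is omitted here because it requires workspace credentials.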

Using the All of Us Researcher Workbench, researchers may access the CDR for research analysis. Read the All of Us Data Dictionaries for detailed descriptions of all the metadata for the data tables in the Controlled Tier and Registered Tier.

Flowchart of the curation process for the phenotypic data including surveys (Participant Provided Information), physical measurements, electronic health records (EHRs), and wearables.

Genomic data

Genomic data are derived from biospecimens (e.g., blood, saliva, and urine samples) that participants provide when they enroll in the program.

  • Biospecimens are collected at health partnering organizations (HPOs) when participants enroll in the program. Biospecimens are transferred to the Genome Center (GC) for array genotyping and short-read whole genome sequencing (srWGS).

Quality control is performed by both the GC and the DRC. It includes processing based on metrics (e.g., coverage in whole genome sequencing [WGS]) and on variants (e.g., call rate for arrays), as well as detection of sample swaps.
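
To make the call rate metric concrete, here is a minimal sketch that computes it as the fraction of genotype calls that are not missing. The 0.98 threshold, the use of `None` for a missing call, and the sample data are illustrative assumptions. The article does not state the thresholds or data representation that the GC and DRC actually use.

```python
def call_rate(genotypes):
    """Return the fraction of genotype calls that are not missing (None)."""
    if not genotypes:
        raise ValueError("call rate is undefined for an empty list of genotypes")
    called = sum(1 for genotype in genotypes if genotype is not None)
    return called / len(genotypes)


def passes_call_rate(genotypes, threshold=0.98):
    """Check a call rate against a threshold (0.98 is a hypothetical value)."""
    return call_rate(genotypes) >= threshold


# Hypothetical genotype calls; None marks a missing call.
sample = ["0/1", "1/1", None, "0/0"]
rate = call_rate(sample)           # 3 of 4 calls present
flagged = not passes_call_rate(sample)
```

With one of four calls missing, the call rate is 0.75, so this sample would be flagged under the assumed threshold.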

For WGS variants, the DRC creates a joint call set with additional quality control to improve accuracy based on information across samples. The final call sets are stored in the CDR and are accessible through the Researcher Workbench in multiple formats (e.g., VariantDataset [VDS], Variant Call Format [VCF], and Hail MatrixTable) or as subsets of the WGS variants.

Flowchart of the curation process for the genomic data including the SNP/Indel variant processing.
