Data curation process for the All of Us data


Data Curation


Figure 1 -- Overview of the curation process for survey, physical measurement, electronic health record (EHR) and wearables (Fitbit) data.


As shown in Figure 1, the combined dataset goes directly to the Public Tier, where it is accessible through the Data Browser. These data are aggregated, and counts are rounded to the nearest multiple of 20. For the Registered Tier and Controlled Tier, we transform, harmonize, and anonymize all of the survey (“Participant Provided Information” or “PPI”), physical measurement (PM), and electronic health record (EHR) data before storing them in the Curated Data Repository (CDR). PPI is obtained from surveys that participants complete in the Participant Portal, both when they enroll in the program and periodically throughout the duration of the program. Physical measurements (PM) are taken at the time of enrollment and can also be provided through EHRs. Both data types are transferred to the Data and Research Center (DRC) in FHIR format.
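The aggregation and rounding applied to Public Tier data can be sketched as follows. This is only an illustration, not the program's actual implementation: it assumes that rounding means rounding each count to the nearest multiple of 20, and the function names and the bin-size constant are hypothetical.

```python
from collections import Counter

BIN_SIZE = 20  # assumption: counts are rounded to multiples of 20


def round_to_bin(count, bin_size=BIN_SIZE):
    """Round a non-negative count to the nearest multiple of bin_size.

    Exact halves round up (for example, 30 becomes 40 when bin_size is 20).
    """
    return (count + bin_size // 2) // bin_size * bin_size


def aggregate_and_round(values, bin_size=BIN_SIZE):
    """Count how often each value occurs, then round every count."""
    counts = Counter(values)
    return {value: round_to_bin(n, bin_size) for value, n in counts.items()}


# Hypothetical survey answers: 33 "yes" responses and 9 "no" responses.
answers = ["yes"] * 33 + ["no"] * 9
rounded = aggregate_and_round(answers)
```

Rounding of this kind means that small groups of participants cannot be distinguished from one another in the published counts.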

EHR data are transformed to the OMOP Common Data Model (CDM) by designated data stewards at participant enrollment centers and sent to the DRC. These data are combined with the PPI and PM data via participant identity linkage. The combined data then undergo conformance checks, the removal of participant identifiers, and the application of a privacy methodology. The resulting dataset is called the Curated Data Repository base (CDR_base).
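The linkage and identifier-removal steps described above can be pictured with a minimal sketch. Everything in it is an assumption made for illustration: the field names (`participant_id`, `name`, `email`, `research_id`), the list of direct identifiers, and the sequential research IDs are hypothetical, and the program's actual privacy methodology is not reproduced here.

```python
import itertools

# Hypothetical set of fields treated as direct identifiers.
DIRECT_IDENTIFIERS = {"name", "email"}


def link_records(ppi_pm_rows, ehr_rows):
    """Combine rows from both sources into one record per participant."""
    linked = {}
    for row in itertools.chain(ppi_pm_rows, ehr_rows):
        linked.setdefault(row["participant_id"], {}).update(row)
    return linked


def remove_identifiers(linked):
    """Drop direct identifiers and swap the participant ID for a research ID.

    Sequential IDs are used only to keep the sketch short; a real pipeline
    would use a mapping that cannot be reversed by researchers.
    """
    curated = []
    for research_id, (_, record) in enumerate(sorted(linked.items()), start=1):
        cleaned = {
            key: value
            for key, value in record.items()
            if key not in DIRECT_IDENTIFIERS and key != "participant_id"
        }
        cleaned["research_id"] = research_id
        curated.append(cleaned)
    return curated


ppi_pm = [{"participant_id": "P1", "name": "A. Example", "survey_smoker": "no"}]
ehr = [{"participant_id": "P1", "condition_concept_id": 123}]
curated = remove_identifiers(link_records(ppi_pm, ehr))
```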

Researchers may access both the CDR_base and the CDR via the Researcher Workbench.
The CDR_base directly reflects the source data, with changes to obfuscate participant identity, to correct errors from EHR sites or PPI data collection, and to better conform to the All of Us Research Program data model. The CDR_base is intended only for users who would like to make their own decisions on how to further clean the data. It is available only through direct queries in a Jupyter Notebook. The CDR is the clean version of the repository; it is more user-friendly and is the default repository available in all Researcher Workbench tools, such as the Cohort Builder and Dataset Builder. The CDR contains the same data as the CDR_base, with additional cleaning applied to help harmonize and standardize the data.

See the All of Us data dictionary for a detailed description of all metadata for the data tables used to populate the Registered Tier dataset. 


Genomic Data Curation 

As shown in Figure 2, sample aliquots from participants in the CDR are sent to Genome Centers (GCs) for array genotyping and short-read whole genome sequencing (WGS). Quality control (QC) is performed at both the GCs and the Data and Research Center (DRC). QC covers processing metrics (e.g., coverage in WGS), variant-based metrics (e.g., call rate for arrays), and the detection of sample swaps. For WGS variants, the DRC creates a joint callset, with additional QC, to improve accuracy based on information across samples. The final callsets are published in the Researcher Workbench. Researchers can access the variants in multiple formats (e.g., VDS, VCF, Hail MT) or access subsets of the WGS variants through the Cohort Builder.
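Two of the QC metrics mentioned above, array call rate and WGS coverage, can be illustrated with a short sketch. The thresholds (0.98 and 30.0) and the function names are hypothetical placeholders chosen for the example; the article does not state the cutoffs used by the GCs or the DRC.

```python
def call_rate(genotypes):
    """Fraction of array sites with a genotype call (None marks a no-call)."""
    if not genotypes:
        raise ValueError("no genotypes supplied")
    called = sum(1 for g in genotypes if g is not None)
    return called / len(genotypes)


def mean_coverage(depths):
    """Average read depth across the sampled positions of a WGS sample."""
    if not depths:
        raise ValueError("no depths supplied")
    return sum(depths) / len(depths)


def array_passes_qc(genotypes, min_call_rate=0.98):
    """Hypothetical array check: the call rate must reach the threshold."""
    return call_rate(genotypes) >= min_call_rate


def wgs_passes_qc(depths, min_mean_coverage=30.0):
    """Hypothetical WGS check: mean coverage must reach the threshold."""
    return mean_coverage(depths) >= min_mean_coverage


# Toy inputs: four array sites (one no-call) and four depth measurements.
array_sample = ["AA", "AB", None, "BB"]
wgs_sample = [28, 31, 35, 30]
```

With these toy inputs, the array sample has a call rate of 0.75 and would fail the assumed threshold, while the WGS sample has a mean coverage of 31 and would pass.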


Figure 2 -- Overview of the SNP/Indel variant processing in AoU. Final versions of the variants (light blue) are published in the Researcher Workbench.
