Data curation process for the All of Us data

Data Curation

Data_Curation_Map.png

Figure 1 -- Overview of the curation process for survey, physical measurement, electronic health record (EHR) and wearables (Fitbit) data.

 

As seen in Figure 1, the combined dataset goes directly to the Public Tier, where it is accessible through the Data Browser. These data are aggregated, and counts are rounded to the nearest multiple of 20. For the Registered Tier and Controlled Tier, we transform, harmonize, and anonymize all of the survey (“Participant Provided Information” or “PPI”), physical measurement (PM), and electronic health record (EHR) data before storing them in the Curated Data Repository (CDR). PPI is obtained from surveys that participants complete in the Participant Portal, both when they enroll in the program and periodically throughout the program. Physical measurements are taken at the time of enrollment and can also be provided through EHRs. Both data types are transferred to the Data and Research Center (DRC) in FHIR format.
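The rounding applied to aggregated Public Tier counts can be illustrated with a minimal sketch. The function name, the example counts, and the tie-breaking rule (ties round up) are assumptions made for illustration; the exact rule used by the Data Browser may differ.

```python
def round_to_multiple(count: int, base: int = 20) -> int:
    """Round a non-negative count to the nearest multiple of `base` (ties round up)."""
    if count < 0:
        raise ValueError("count must be non-negative")
    # Integer arithmetic avoids the round-half-to-even behavior of Python's round().
    return (count + base // 2) // base * base


# Hypothetical aggregated counts for one survey question.
raw_counts = {"Yes": 1234, "No": 87, "Skip": 9}
public_counts = {answer: round_to_multiple(n) for answer, n in raw_counts.items()}
```

Rounding to a coarse multiple means small changes in the underlying participant counts are not visible in the published numbers.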

Electronic health record data are transformed to the OMOP Common Data Model (CDM) by designated data stewards at participant enrollment centers and sent to the DRC. These data are combined with the PPI and PM data via participant identity linkage. The combined data then undergo conformance checks, removal of participant identifiers, and application of a privacy methodology. The resulting dataset is called the Curated Data Repository (CDR) base, or CDR_base. Researchers may access both the CDR_base and the CDR via the Researcher Workbench.
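As a rough illustration of the linkage and identifier-removal steps described above, the sketch below joins two sets of records on a shared participant ID and then drops directly identifying fields. All field names (`participant_id`, `name`, `date_of_birth`, and so on) are hypothetical, and the sketch does not attempt to reproduce the program's conformance checks or privacy methodology.

```python
# Hypothetical identifier fields; the real schema is described in the data dictionary.
IDENTIFIER_FIELDS = {"name", "date_of_birth"}


def strip_identifiers(record, extra=()):
    """Return a copy of `record` without identifier fields (or any `extra` fields)."""
    drop = IDENTIFIER_FIELDS | set(extra)
    return {k: v for k, v in record.items() if k not in drop}


def link_and_strip(ppi_pm_records, ehr_records):
    """Attach each participant's EHR rows to their PPI/PM record, minus identifiers."""
    # Index EHR rows by participant so the join is a dictionary lookup.
    ehr_by_id = {}
    for rec in ehr_records:
        ehr_by_id.setdefault(rec["participant_id"], []).append(rec)

    combined = []
    for rec in ppi_pm_records:
        merged = strip_identifiers(rec)
        merged["ehr"] = [
            strip_identifiers(e, extra=("participant_id",))
            for e in ehr_by_id.get(rec["participant_id"], [])
        ]
        combined.append(merged)
    return combined
```

For example, a PPI record `{"participant_id": 1, "name": "A", "survey": "basics"}` linked with an EHR row for the same participant would come out with the `name` field removed and the EHR row nested under `ehr`.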

See the All of Us data dictionary for a detailed description of all metadata for the data tables used to populate the Registered Tier dataset. 

 

Genomic Data Curation 

As seen in Figure 2, sample aliquots from participants in the CDR are sent to Genome Centers (GCs) for array genotyping and short-read whole genome sequencing (WGS). Quality control (QC) is performed at both the GCs and the Data and Research Center (DRC). This includes checks based on processing metrics (e.g., coverage in WGS), checks based on variants (e.g., call rate for arrays), and detection of sample swaps. For WGS variants, the DRC creates a joint callset, with additional QC, to improve accuracy using information across samples. The final callsets are published in the Researcher Workbench. Researchers can access the variants in multiple formats (e.g., VDS, VCF, and Hail MT) or access subsets of the WGS variants through the Cohort Builder.

Genomic_Data_Curation_Map.png

Figure 2 -- Overview of the SNP/Indel variant processing in AoU. Final versions of the variants (light blue) are published in the Researcher Workbench.
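To make the variant-based QC metric mentioned above concrete, the sketch below computes a per-sample call rate, taken here to mean the fraction of array sites with a non-missing genotype call. The function names, the use of `None` for a missing call, and the 0.98 threshold are assumptions for illustration only, not the program's actual QC criteria.

```python
def call_rate(genotypes):
    """Fraction of sites with a non-missing genotype call (None means no call)."""
    if not genotypes:
        raise ValueError("no genotypes supplied")
    called = sum(1 for g in genotypes if g is not None)
    return called / len(genotypes)


def passes_call_rate(genotypes, threshold=0.98):
    """Return True if the sample's call rate meets a (hypothetical) threshold."""
    return call_rate(genotypes) >= threshold


# Hypothetical sample: genotypes coded as alternate-allele counts, None = no call.
sample = [0, 1, None, 2]
sample_rate = call_rate(sample)
```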
