Introduction to All of Us Electronic Health Record (EHR) Collection and Data Transformation Methods


Within the All of Us Research Program, participants have an opportunity to share their electronic health records (EHR) for research. The participant’s affiliated health care organization may send the data, or, after agreeing to share this information, participants may have the program collect it directly through the All of Us Participant Portal using API applications.

Inclusion Criteria

Participants are eligible to share their EHRs following consent to join All of Us. Participants who agree to share sign a separate EHR consent form. See information about the All of Us consent process for more background. EHR data are collected only for participants who sign the EHR consent.

Data Collection and Transformation 

EHR data types available in the curated data repository (CDR) include demographics, visits, diagnoses, procedures, medications, laboratory tests, and vital signs. The EHR data collected by the program are submitted by healthcare provider organizations or shared directly by participants through their participant portals. The data are transformed into the OMOP common data model structure. Recorded data elements are coded as source concepts (e.g., ICD-9, ICD-10) and mapped to standard concepts (e.g., SNOMED) based on international health standards. OMOP categorizes clinical data concepts according to several major health domains.
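As a minimal sketch of how a source concept relates to a standard concept, the example below builds a tiny in-memory stand-in for the OMOP concept and concept_relationship vocabulary tables and follows a "Maps to" relationship. The concept IDs and names are made up for illustration and are not real vocabulary entries; the helper names build_demo_vocabulary and standard_concepts_for are hypothetical.

```python
import sqlite3


def build_demo_vocabulary():
    """Create an in-memory database with two made-up concepts and one mapping."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE concept (
            concept_id INTEGER PRIMARY KEY,
            concept_name TEXT,
            vocabulary_id TEXT,
            standard_concept TEXT
        );
        CREATE TABLE concept_relationship (
            concept_id_1 INTEGER,
            concept_id_2 INTEGER,
            relationship_id TEXT
        );
        -- Illustrative rows only: 1001 plays the source code, 2001 the standard concept.
        INSERT INTO concept VALUES
            (1001, 'Demo diabetes code', 'ICD10CM', NULL),
            (2001, 'Demo diabetes concept', 'SNOMED', 'S');
        INSERT INTO concept_relationship VALUES (1001, 2001, 'Maps to');
    """)
    return conn


def standard_concepts_for(conn, source_concept_id):
    """Return the standard concepts that a source concept maps to."""
    return conn.execute("""
        SELECT c2.concept_id, c2.concept_name, c2.vocabulary_id
        FROM concept_relationship AS cr
        JOIN concept AS c2 ON c2.concept_id = cr.concept_id_2
        WHERE cr.concept_id_1 = ?
          AND cr.relationship_id = 'Maps to'
          AND c2.standard_concept = 'S'
    """, (source_concept_id,)).fetchall()


conn = build_demo_vocabulary()
print(standard_concepts_for(conn, 1001))
```

In a real CDR the same join would run against the full vocabulary tables, where one source concept can map to more than one standard concept.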

During the data transformation process, a series of steps are completed to ensure data conforms to both OMOP and All of Us standards. Details can be referenced in the CDR data dictionary release notes. Steps include:

  • Validation checks that occur before and after data are transferred to the All of Us Data and Research Center (DRC) for CDR inclusion. The goal is to ensure that data conforms to OMOP naming conventions as well as data type and column structures.
  • Characterization checks that occur after data are transferred to the DRC and are completed before and after CDR inclusion. The goal is to use OMOP-supported tools like the Automated Characterization of Health Information at Large-scale Longitudinal Evidence Systems (Achilles) to assess person-level data characterization and density.
  • Select tables, fields, and concepts are suppressed (i.e., removed). The goal is to ensure that data conforms to the All of Us privacy protection principles and rules.
  • Select fields within the Measurement table undergo a small amount of cleaning. The goal is to help support data plausibility and utility for select data of high priority and research interest. 
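To illustrate the kind of structural validation described in the first step, here is a minimal sketch that checks a submitted row against expected column names and data types. The abbreviated schema, the Python type choices, and the function name validate_row are assumptions for illustration; this is not the program's actual validation pipeline.

```python
from datetime import date

# Hypothetical, abbreviated schema: a few condition_occurrence fields and the
# Python types this sketch expects for them.
EXPECTED_COLUMNS = {
    "condition_occurrence_id": int,
    "person_id": int,
    "condition_concept_id": int,
    "condition_start_date": date,
}


def validate_row(row, expected=EXPECTED_COLUMNS):
    """Return a list of problems found in one row (an empty list means it passed)."""
    problems = []
    # Every expected column must be present.
    for name in expected:
        if name not in row:
            problems.append(f"missing column: {name}")
    # No unexpected columns, and non-null values must have the expected type.
    for name, value in row.items():
        if name not in expected:
            problems.append(f"unexpected column: {name}")
        elif value is not None and not isinstance(value, expected[name]):
            problems.append(f"wrong type for {name}: {type(value).__name__}")
    return problems


good = {
    "condition_occurrence_id": 1,
    "person_id": 7,
    "condition_concept_id": 2001,
    "condition_start_date": date(2020, 3, 1),
}
print(validate_row(good))
```

A misspelled column name shows up twice in the result, once as a missing column and once as an unexpected one, which makes naming errors easy to spot.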

Below is summary information about EHR data storage and structure to help with getting started. Researchers who have not previously worked with EHR data structured according to OMOP should learn about the model by browsing the OMOP website. The CDR data dictionaries should also be referenced for more details about CDR tables, fields, and privacy applications. Table and field availability as well as privacy protections are subject to change between CDR releases. Researchers should refer to the data dictionary materials relevant to the CDR version associated with their workspace project.

OMOP Domain Tables for Categorizing EHR Data
Domain | Type of Information | Integration with other EHR Tables
Condition Occurrence | Acute or chronic conditions that were diagnosed during interactions with healthcare providers. Allows temporal representation of conditions (e.g., start and end dates). Data are often mapped to source vocabularies such as ICD-9 and ICD-10 and standard vocabularies such as SNOMED. | Project design may require linking data from this table to other tables such as the Visit Occurrence table (to associate conditions with specific healthcare visits).
Observation | Range of observed findings related to health care examinations, assessments, or procedures. Includes any clinical observation or finding that cannot be represented in other domain tables (e.g., symptoms, social, lifestyle, medical history observations). Data are often mapped to source vocabularies such as ICD-9 and ICD-10 and standard vocabularies such as SNOMED and LOINC. | Project design may require linking data from this table to other tables such as the Visit Occurrence table (to associate observations with specific care visits) and the Measurement table (for measurement values).
Drug Exposure | Medications and other drug exposures received during health care encounters. Prescription medications, over-the-counter drugs, vaccines, and other therapeutic substances are included. Data are often mapped to source vocabularies such as the FDA’s National Drug Code (NDC) and to standard vocabularies such as RxNorm. | Project design may require linking data from this table to other tables such as the Visit Occurrence table (to associate drug exposures with specific healthcare visits) and the Condition table (to link drug exposures with specific medical conditions).
Procedure | Medical procedures and interventions including surgical procedures, diagnostic tests, therapeutic interventions, etc. Data are often mapped to source vocabularies such as ICD-9 and standard vocabularies such as CPT. | Project design may require linking data from this table to other tables such as the Visit Occurrence table (to associate procedures with specific healthcare visits) and the Condition table (to link procedures with specific medical conditions).
Measurement | Structured measure or value of a lab test or assessment. Includes physical measurements, blood assay measurements, vital signs, etc. Values are structured either numerically or categorically depending on the type. The Measurement and Observation tables sometimes contain categorical data that overlap. In those cases, the data differ in that the Measurement table stores quantitative or qualitative results from standardized tests and the Observation table stores an observed factual determination rather than a result. | Project design may require linking data from this table to other tables such as the Observation table to associate lab values with other observed findings.
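The table-linking idea described above can be sketched with a small in-memory example that joins condition records to their visits. It assumes the condition table carries a visit_occurrence_id key, as in the OMOP common data model; researchers should confirm field availability in the CDR data dictionary for their release. All rows and concept IDs below are placeholders, and the function names are hypothetical.

```python
import sqlite3


def build_demo_db():
    """Create toy visit_occurrence and condition_occurrence tables."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE visit_occurrence (
            visit_occurrence_id INTEGER PRIMARY KEY,
            person_id INTEGER,
            visit_concept_id INTEGER,
            visit_start_date TEXT,
            visit_end_date TEXT
        );
        CREATE TABLE condition_occurrence (
            condition_occurrence_id INTEGER PRIMARY KEY,
            person_id INTEGER,
            condition_concept_id INTEGER,
            condition_start_date TEXT,
            visit_occurrence_id INTEGER
        );
        -- Placeholder rows; the concept IDs are illustrative only.
        INSERT INTO visit_occurrence VALUES
            (10, 1, 5001, '2020-03-01', '2020-03-01'),
            (11, 1, 5002, '2021-06-10', '2021-06-14');
        INSERT INTO condition_occurrence VALUES
            (100, 1, 2001, '2020-03-01', 10),
            (101, 1, 2002, '2021-06-11', 11),
            (102, 1, 2003, '2022-01-05', NULL);
    """)
    return conn


def conditions_with_visits(conn):
    """Attach visit dates to each condition; conditions without a visit are kept."""
    return conn.execute("""
        SELECT co.condition_occurrence_id, co.condition_start_date,
               vo.visit_occurrence_id, vo.visit_start_date, vo.visit_end_date
        FROM condition_occurrence AS co
        LEFT JOIN visit_occurrence AS vo
          ON vo.visit_occurrence_id = co.visit_occurrence_id
         AND vo.person_id = co.person_id
        ORDER BY co.condition_occurrence_id
    """).fetchall()


for row in conditions_with_visits(build_demo_db()):
    print(row)
```

A LEFT JOIN is used so that condition records with no linked visit still appear, with empty visit columns, rather than silently dropping out of the result.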

 

In addition to the above domain tables, OMOP stores clinical data and accompanying metadata throughout other tables. Some tables are suppressed in the Registered and/or Controlled Tier CDRs to protect privacy. Similarly, some concepts within the available tables are suppressed. Other available tables include the Person, Condition Occurrence, Visit Occurrence, Visit Detail, Procedure Occurrence, Observation Period, and Death tables.

The Condition, Observation, Procedure, Drug Exposure, Measurement, and Visit Occurrence tables are commonly used. When creating projects in the Researcher Workbench, researchers can use the Cohort and Dataset Builder tools to quickly select project inclusion and exclusion criteria and data concepts of interest from the Condition, Observation, Procedure, Drug Exposure, and Measurement tables. Because of the size and breadth of the information stored in the Visit Occurrence table, researchers who want to accompany those clinical data with additional visit details will need to extract them from the CDR directly once the dataset has been imported into an analysis application.

The tables below include a subset of commonly used and frequently asked about fields within each table. Again, for more extensive details about OMOP tables and extensions, fields, and privacy application, refer to the CDR data dictionaries. Note: A selection of extension tables has been introduced across different CDR version releases. For example, the OMOP Survey Conduct table was first populated and introduced with CDR version 7, and the Observation Period, Drug Era, and Condition Era tables were first populated and introduced with CDR version 8.

Examples of OMOP Table Fields for Categorizing EHR Data
Condition Occurrence Table
Field Name | Description
condition_occurrence_id | A unique identifier for each condition occurrence event.
person_id | An identifier linking the condition occurrence event to the person.
condition_concept_id | A standardized concept identifier representing the clinical condition in a standardized vocabulary (e.g., SNOMED-CT, ICD-10).
condition_start_date | The date when the instance of the condition was first recorded.
condition_end_date | The date when the condition is considered to have resolved.
condition_type_concept_id | Indicates the type of occurrence (e.g., primary diagnosis, secondary diagnosis, etc.) and reflects the source data from which the occurrence was recorded.
Observation Table
Field Name | Description
observation_id | A unique identifier for each observation.
person_id | An identifier linking the observation to the person.
observation_concept_id | A standardized concept identifier representing the clinical observation in a standardized vocabulary (e.g., LOINC for laboratory results, SNOMED-CT for clinical findings).
observation_date | The date when the observation was recorded.
observation_type_concept_id | Indicates the type of observation (e.g., laboratory test result, clinical finding, vital sign).
value_as_number, value_as_string, value_as_concept_id | Depending on the nature of the observation, one of these fields will be populated. Data will represent a numeric result (value_as_number), text (value_as_string), or coded value (value_as_concept_id).
observation_source_value | The original source value as recorded in the source system (e.g., specific laboratory test code).
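Because an observation's result can sit in any one of the three value fields, analysis code usually has to check which one is populated. The helper below is a small sketch of that check; the function name observation_value and the dictionary-based row format are assumptions for illustration.

```python
def observation_value(row):
    """Return (field_name, value) for the first populated value_as_* field.

    `row` is a dict keyed by Observation table field names. Returns
    (None, None) when none of the three value fields is populated.
    """
    for field in ("value_as_number", "value_as_string", "value_as_concept_id"):
        value = row.get(field)
        if value is not None:
            return field, value
    return None, None


numeric = {"observation_id": 1, "value_as_number": 3.0,
           "value_as_string": None, "value_as_concept_id": None}
coded = {"observation_id": 2, "value_as_number": None,
         "value_as_string": None, "value_as_concept_id": 2001}

print(observation_value(numeric))
print(observation_value(coded))
```

The check uses `is not None` rather than truthiness so that a legitimate numeric result of 0 is still returned.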
Drug Exposure Table
Field Name | Description
drug_exposure_id | A unique identifier for each drug exposure event.
person_id | An identifier linking the drug exposure to the person.
drug_concept_id | A standardized concept identifier representing the specific drug or substance in a standardized vocabulary (e.g., RxNorm, SNOMED-CT).
drug_exposure_start_date | The start date of the drug exposure period, such as the start date of a prescription, the date a prescription was filled, or the date on which a drug administration procedure was recorded.
drug_exposure_end_date | The end date of the drug exposure period. This field may not be populated because the data are not available from all sources.
drug_type_concept_id | Indicates the type of drug exposure (e.g., prescription dispensed, prescription written, over-the-counter purchase, vaccine administered).
quantity | The quantity of the prescribed drug administered or dispensed (e.g., number of pills, dosage strength) according to the original prescription or dispensing record.
days_supply | The number of days the prescribed drug supply is intended to last (applicable for prescriptions) according to the original prescription or dispensing record.
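Since drug_exposure_end_date may be empty, analyses that need an exposure window often have to estimate one. The sketch below shows one possible fallback: use the recorded end date when present, otherwise derive it from the start date and days_supply, otherwise treat the exposure as a single day. This fallback order is an assumption for illustration, not a rule stated by the program, and the function name effective_end_date is hypothetical.

```python
from datetime import date, timedelta


def effective_end_date(start, end=None, days_supply=None):
    """Estimate the last day of a drug exposure.

    Assumed fallback order (illustrative only):
    1. the recorded end date, if any;
    2. start + days_supply - 1 day, if days_supply is a positive number;
    3. the start date itself (single-day exposure).
    """
    if end is not None:
        return end
    if days_supply is not None and days_supply > 0:
        return start + timedelta(days=days_supply - 1)
    return start


print(effective_end_date(date(2023, 1, 1), days_supply=30))
print(effective_end_date(date(2023, 1, 1)))
```

Counting the start date as day one is why one day is subtracted: a 30-day supply starting on January 1 covers January 1 through January 30.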
Procedure Occurrence Table
Field Name | Description
procedure_occurrence_id | A unique identifier for each procedure occurrence record.
person_id | An identifier linking the procedure occurrence to the person.
procedure_concept_id | A standardized concept identifier representing the specific procedure or intervention in a standardized vocabulary (e.g., SNOMED-CT, CPT-4).
procedure_date | The date when the procedure was performed.
procedure_type_concept_id | Indicates the type of procedure (e.g., surgical procedure, diagnostic procedure, therapeutic procedure).
modifier_concept_id | Additional information about the procedure, such as technique (e.g., bilateral).
Visit Occurrence Table
Field Name | Description
visit_occurrence_id | A unique identifier for each visit occurrence encounter.
person_id | An identifier linking the visit occurrence to the person.
visit_concept_id | A standardized concept identifier representing the type of visit (e.g., outpatient visit, inpatient admission, emergency department visit) in a standardized vocabulary (e.g., SNOMED-CT, CPT-4).
visit_start_date | The start date for the visit.
visit_end_date | The end date for the visit. The end date will match the start date if it was a one day visit.
visit_type_concept_id | Indicates the type of visit (e.g., ambulatory visit, inpatient visit, emergency visit).
Measurement Table
Field Name | Description
measurement_id | A unique identifier for each measurement.
person_id | An identifier linking the measurement to the person.
measurement_concept_id | A standardized concept identifier representing the type of measurement in a standardized vocabulary (e.g., LOINC).
measurement_date | The date of the measurement.
measurement_type_concept_id | Additional information about the measurement type.
value_as_number, value_as_concept_id | Depending on the nature of the measurement, either of these fields may be populated or omitted. Data will represent a numeric result (value_as_number) or coded value (value_as_concept_id).
unit_concept_id, range_low, range_high | If the measurement is for a lab test, these fields will be populated. The fields indicate the standard unit type in a standardized vocabulary (e.g., UCUM) as well as the lower limit (range_low) and upper limit (range_high) of the normal range of the measurement result. The lower and upper limits are assumed to be in the same unit of measure as the measurement value.
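As a small illustration of how range_low and range_high can be used alongside value_as_number, the sketch below labels a numeric result relative to its recorded normal range. The function name range_flag and the label strings are hypothetical, and the comparison relies on the stated assumption that the limits share the value's unit.

```python
def range_flag(value_as_number, range_low, range_high):
    """Label a numeric measurement relative to its recorded normal range.

    Returns "unknown" when the value or either limit is missing, since
    the range fields are only expected to be populated for lab tests.
    """
    if value_as_number is None or range_low is None or range_high is None:
        return "unknown"
    if value_as_number < range_low:
        return "below"
    if value_as_number > range_high:
        return "above"
    return "within"


print(range_flag(5.2, 4.0, 6.0))
print(range_flag(7.1, 4.0, 6.0))
print(range_flag(5.2, None, None))
```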

 

In addition to privacy protection applications, a small amount of programmatic cleaning is applied to select fields within the Measurement table.

Cleaning Rules Applied to Table Fields for EHR Data
Application | Affected Field | Resulting Output
Standardize select, high priority lab and vital units and values | unit_concept_id; value_as_number | For cortisol mass by volume measurements (measurement_concept_id = 3009905) submitted in thousands per cubic millimeter (unit_concept_id = 8961), the unit concept ID for all measurements with this unit is set to thousands per microliter (unit_concept_id = 8848) and the value is multiplied by 1.
Normalize units for height, then apply a cleaning algorithm to ensure rows seem reasonable based on history | unit_concept_id; value_as_number | Height units are normalized and checked for plausibility given history. If height is between 0.9 and 2.3, it is assumed to be entered in meters; between 3 and 7.5, in feet; between 36 and 89.9, in inches. In each case it is converted to cm accordingly. If height is recorded as 80 cm at one time and 120 cm at another for the same person, then the two measures disagree, with a standard deviation of 28 (which is > 10), so both of these records are dropped.
Remove values that are either implausible or have no utility | value_as_number and all other applicable fields | The value_as_number field is nulled for "9999999" values. The value_as_number field is nulled for sites with all "0" values. Duplicate rows are eliminated, as identified by person_id, measurement_concept_id, measurement_datetime, value_as_number, and value_as_concept_id. Rows with measurement_concept_id = 0 are dropped.
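The height and placeholder-value rules above can be sketched as follows. This is a simplified illustration built only from the thresholds stated in the table, not the program's actual cleaning code. Treating values outside the three listed ranges as already being in cm, using the sample standard deviation, and the function names are all assumptions.

```python
from statistics import stdev

SENTINEL = 9999999  # placeholder value named in the cleaning rules


def height_to_cm(value):
    """Guess the unit of a raw height from its magnitude and convert to cm."""
    if 0.9 <= value <= 2.3:      # assumed to be meters
        return value * 100.0
    if 3 <= value <= 7.5:        # assumed to be feet
        return value * 30.48
    if 36 <= value <= 89.9:      # assumed to be inches
        return value * 2.54
    return value                 # assumption: anything else is already in cm


def drop_disagreeing_heights(heights_cm, max_sd=10.0):
    """Return [] when one person's heights disagree (sample SD above max_sd)."""
    if len(heights_cm) >= 2 and stdev(heights_cm) > max_sd:
        return []
    return list(heights_cm)


def null_sentinel(value):
    """Null out the 9999999 placeholder; leave other values untouched."""
    return None if value == SENTINEL else value


print(round(height_to_cm(1.75), 1))          # 175.0
print(drop_disagreeing_heights([80, 120]))   # []
print(drop_disagreeing_heights([170, 171]))  # [170, 171]
print(null_sentinel(9999999))                # None
```

For the 80 cm and 120 cm example, the sample standard deviation is about 28.3, which matches the figure in the table and exceeds the threshold of 10, so both records are dropped.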

Considerations for Working with EHR Data

EHRs contain data that span the participant’s entire course of care within the affiliated health care organization system. This means that the volume and breadth of available data will vary by participant and recorded event. In some cases, data may be recorded as far back as early childhood; in others, they may span only a few years or less. In addition, although these data are made available for research use, they were originally recorded for clinical care. A common data model helps to standardize the data elements, but researchers should expect individual records to differ in completeness and correctness. There may also be instances where self-reported participant responses from surveys are prioritized over EHR data for mapping in the OMOP Person table. For answers to frequently asked questions about All of Us EHR data, see the FAQ page.
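One way to get a feel for how much record history each participant has is to compute the span between their earliest and latest recorded events. The sketch below does this for a list of (person_id, date) pairs, such as condition start dates. The function name record_span_years and the input format are assumptions for illustration.

```python
from datetime import date


def record_span_years(events):
    """Return {person_id: years between first and last event}, rounded to 0.1.

    `events` is an iterable of (person_id, date) pairs.
    """
    first, last = {}, {}
    for person_id, event_date in events:
        if person_id not in first or event_date < first[person_id]:
            first[person_id] = event_date
        if person_id not in last or event_date > last[person_id]:
            last[person_id] = event_date
    return {
        pid: round((last[pid] - first[pid]).days / 365.25, 1)
        for pid in first
    }


events = [
    (1, date(2000, 1, 1)),
    (1, date(2020, 1, 1)),
    (2, date(2019, 5, 1)),
    (2, date(2020, 5, 1)),
]
print(record_span_years(events))
```

A wide gap between participants' spans is expected and is worth checking before comparing event counts or onset dates across a cohort.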
