Types of All of Us data and how they are organized


Data Organization

All of Us data are organized into tables according to the Observational Medical Outcomes Partnership Common Data Model (OMOP CDM) version 5.2, when possible. Extensive information regarding the OMOP CDM relational database can be found at www.ohdsi.org. Participant-provided information (PPI), physical measurements (PM), and electronic health record (EHR) data are arranged into tables according to the OMOP CDM convention as shown below. Self-reported demographic data from the Basics survey (a PPI survey) populate the Person table. Other data obtained from PPI surveys are found in the Observation table (see survey codebooks for more information about PPI surveys). Program physical measurements as well as EHR measurements populate the Measurement table. EHR data concerning visits, procedures, drugs, and conditions are arranged into their respective tables. All tables relate to the Person table, and the tables containing procedure, drug, condition, and measurement data also relate to the Visit table.

image-0.png   

When it comes to building a cohort using the Cohort Builder tool, you will find that the data are organized by “program data” and “domains.” Program data includes demographics, surveys, and physical measurements. EHR data are arranged by domain (conditions, procedures, drugs, measurements, and visits).

 

Program Data

Demographics include age, gender, race, ethnicity, and deceased status. Demographic data are self-reported (collected via surveys) and subject to privacy methodology. The Controlled Tier dataset also contains demographic data supplied by EHRs; the Registered Tier does not.

 

Surveys consist of the questions, and their associated response options, that participants have completed.

All of the survey questions and potential answers are added to the All of Us specific PPI vocabulary and assigned a concept_id (source concept ID). When possible, the PPI concepts are then mapped to standard vocabularies such as Logical Observation Identifiers Names and Codes (LOINC), International Classification of Diseases (ICD), or Systematized Nomenclature of Medicine (SNOMED), and their associated “standard” concept_id. When mapping is not possible, the PPI concept_id serves as both the source and standard.

Survey questions and answers are stored in the Observational Medical Outcomes Partnership (OMOP) observation table and can be searched and analyzed via either the standard or source vocabulary concept_id.

For more information on how to use survey data in your research, see our Introduction to Survey Collection and Data Transformation Methods article and the How to Work with All of Us Survey Data notebooks in the Researcher Workbench tutorial workspaces.

 

Physical Measurements are taken at the time of participant enrollment and include blood pressure, heart rate, height, weight, body mass index (BMI), waist and hip circumference, pregnancy status, and wheelchair use.

Physical measurements are assigned an All of Us specific PPI concept_id (source concept ID). The PPI concepts are then mapped to the standard LOINC vocabulary and the associated “standard” concept_id and stored in the OMOP Measurement table. We recommend using the program-collected measurement data when possible. To distinguish measurements recorded at enrollment from those recorded in a participant’s EHR, use the source concept_ids. See the table below for more detail.

*Note: If you need physical measurement data and you use measurement_concept_id, you must also specify the data source found in the measurement_ext table. If you use measurement_source_concept_id, you don’t need to specify the data source.

measurement_concept_id | Standard | measurement_source_concept_id | Source
3004249 | Systolic blood pressure | 903109 | 1st systolic blood pressure
3004249 | Systolic blood pressure | 903114 | 2nd systolic blood pressure
3004249 | Systolic blood pressure | 903130 | 3rd systolic blood pressure
3012888 | Diastolic blood pressure | 903110 | 1st diastolic blood pressure
3012888 | Diastolic blood pressure | 903129 | 2nd diastolic blood pressure
3012888 | Diastolic blood pressure | 903106 | 3rd diastolic blood pressure
3025315 | Body weight | 903121 | Weight
3027018 | Heart rate | 903112 | 1st heart rate
3027018 | Heart rate | 903105 | 2nd heart rate
3027018 | Heart rate | 903108 | 3rd heart rate
3036277 | Body height | 903133 | Height
40759207 | Adult waist circumference protocol | 903127 | 1st waist circumference
40759207 | Adult waist circumference protocol | 903134 | 2nd waist circumference
40759207 | Adult waist circumference protocol | 903128 | 3rd waist circumference
40765148 | PhenX- hip circumference protocol | 903117 | 1st hip circumference
40765148 | PhenX- hip circumference protocol | 903125 | 2nd hip circumference
40765148 | PhenX- hip circumference protocol | 903123 | 3rd hip circumference
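
As a minimal sketch of the note on source concept IDs, the snippet below builds a SQL string that selects only program-collected systolic blood pressure readings by filtering on measurement_source_concept_id. The column names are standard OMOP Measurement columns; the dataset name `my_cdr` and the helper `build_pm_query` are hypothetical placeholders, and how the query is executed depends on your Researcher Workbench environment.

```python
# Source concept IDs for program-collected systolic blood pressure,
# copied from the concept table in this article.
SYSTOLIC_SOURCE_IDS = {
    903109: "1st systolic blood pressure",
    903114: "2nd systolic blood pressure",
    903130: "3rd systolic blood pressure",
}


def build_pm_query(source_ids, dataset="my_cdr"):
    """Return SQL selecting measurements by source concept ID.

    `dataset` is a hypothetical placeholder, not a real CDR name.
    """
    id_list = ", ".join(str(i) for i in sorted(source_ids))
    return (
        "SELECT person_id, measurement_source_concept_id, value_as_number\n"
        f"FROM {dataset}.measurement\n"
        f"WHERE measurement_source_concept_id IN ({id_list})"
    )


query = build_pm_query(SYSTOLIC_SOURCE_IDS)
print(query)
```

Filtering on the source IDs avoids the extra lookup in measurement_ext that would be needed if the query used the shared standard concept 3004249.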

For more information on how to use physical measurement data in your research, see the How to Work with All of Us Physical Measurement Data notebooks in the Researcher Workbench tutorial workspaces. 

Additionally, you may browse and/or download all of the All of Us concepts via ATHENA, the Observational Medical Outcomes Partnership (OMOP) community's searchable database of supported standardized vocabularies. To browse All of Us PPI concepts, select "Vocabulary" in the left-side navigation bar and scroll to "PPI."

 

Electronic Health Records (EHR)

EHR data are transformed into standard vocabulary across 14 structured tables. Click here for information on how privacy rules may affect access to information within a participant's EHR and see the Data Dictionary for Curated Data Repository (CDR) for a detailed description of EHR data available within each table (listed below).

  • Person
  • Visit Occurrence
  • Condition Occurrence
  • Drug Exposure 
  • Measurement
  • Procedure Occurrence
  • Observation
  • Location*
  • Provider*
  • Device Exposure
  • Death
  • Care Site*
  • Fact Relationship
  • Specimen

      *Suppressed information

 

EHR Domains:

Conditions come from EHRs and are listed by ICD9, ICD10, or SNOMED standard codes.

Procedures come from EHRs and are listed by ICD9, ICD10, CPT, or SNOMED standard codes.

Drugs or medications come from EHRs and are listed by ingredient and organized by therapeutic uses according to the Anatomical Therapeutic Chemical (ATC) Classification System.

Measurements include laboratory tests and vital signs found in the EHR and are organized in the LOINC (Logical Observation Identifiers Names and Codes) code hierarchy.

Visits describe the type of facility where the participant received medical care (e.g., emergency room, outpatient, or inpatient).

 

Data Not Structured According to OMOP CDM

Wearable Device Data

Fitbit data are available in a series of six tables within both the Registered Tier and Controlled Tier datasets, allowing researchers to parse the data themselves. The following list and grid display all currently available tables and associated fields.

  • Heart Rate (By Zone Summary)
  • Heart Rate (Minute-Level)
  • Activity (Daily Summary)
  • Activity: Intraday Steps (Minute-Level)
  • Sleep Level (Sequence of Sleep by level)
  • Daily Sleep Summary

Table Field
steps_intraday datetime
steps_intraday steps
steps_intraday person_id
heart_rate_summary person_id
heart_rate_summary date
heart_rate_summary zone_name
heart_rate_summary min_heart_rate
heart_rate_summary max_heart_rate
heart_rate_summary minute_in_zone
heart_rate_summary calorie_count
activity_summary date
activity_summary activity_calories
activity_summary calories_bmr
activity_summary calories_out
activity_summary elevation
activity_summary fairly_active_minutes
activity_summary floors
activity_summary lightly_active_minutes
activity_summary marginal_calories
activity_summary sedentary_minutes
activity_summary steps
activity_summary very_active_minutes
activity_summary person_id
heart_rate_minute_level datetime
heart_rate_minute_level heart_rate_value
heart_rate_minute_level person_id
sleep_level person_id
sleep_level sleep_date
sleep_level start_datetime
sleep_level is_main_sleep
sleep_level level
sleep_level duration_in_min
sleep_summary person_id
sleep_summary sleep_date
sleep_summary is_main_sleep
sleep_summary minute_in_bed
sleep_summary minute_asleep
sleep_summary minute_after_wakeup
sleep_summary minute_awake
sleep_summary minute_restless
sleep_summary minute_deep
sleep_summary minute_light
sleep_summary minute_rem
sleep_summary minute_wake

 

Below are the tables with the data format for each field, along with some notes to consider when using these data.

 

Daily Activity Summary 

Each row is one day's activity summary, including the daily step count, for a given participant

person_id | Date | activity calories | calories BMR | calories out | elevation | fairly active minutes | floors | lightly active minutes | marginal calories | sedentary minutes | steps | very active minutes
integer | date | float | float | float | float | float | integer | float | float | float | integer | float

 

Heart Rate (By Zone Summary)

person_id | Datetime | Zone Name | Min Heart Rate | Max Heart Rate | Number of Minutes in Zone | Calorie Count
integer | date | string | integer | integer | integer | float

 

Heart Rate (Minute-Level)

Each row is a one-minute count for a given participant

person_id | Datetime | Heart Rate Value
integer | datetime | integer

 

Activity: Intraday Steps (Minute-Level)

person_id | Datetime | Steps
integer | datetime | numeric
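
Because the intraday table stores one row per minute, a common first step is rolling it up to daily totals. The following is a rough sketch using only the Python standard library; the sample rows are made up, and the timestamp string format is an assumption about how the datetime field is exported.

```python
from collections import defaultdict
from datetime import datetime

# Made-up rows shaped like steps_intraday: (person_id, datetime, steps).
rows = [
    (1, "2020-01-01 08:00:00", 12),
    (1, "2020-01-01 08:01:00", 30),
    (1, "2020-01-02 09:15:00", 5),
    (2, "2020-01-01 10:00:00", 7),
]


def daily_steps(intraday_rows):
    """Sum minute-level steps per (person_id, calendar date)."""
    totals = defaultdict(int)
    for person_id, ts, steps in intraday_rows:
        # Assumed timestamp format; adjust to match the actual export.
        day = datetime.strptime(ts, "%Y-%m-%d %H:%M:%S").date()
        totals[(person_id, day)] += steps
    return dict(totals)


totals = daily_steps(rows)
for (person_id, day), steps in sorted(totals.items()):
    print(person_id, day.isoformat(), steps)
```

Daily totals computed this way can be compared with the steps field in the daily activity summary as a consistency check.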

 

Sleep Level (Sequence of Sleep by level)

Levels: awake, light, asleep, deep, restless, wake, rem, unknown

person_id | sleep_date | start_datetime | is_main_sleep | level | duration_in_min
integer | date | datetime | string | string | float

Sleep Daily Summary 

person_id | sleep_date | is_main_sleep | minute_in_bed | minute_asleep | minute_after_wakeup | minute_awake | minute_restless | minute_deep | minute_light | minute_rem | minute_wake
integer | date | string | integer | integer | integer | integer | integer | integer | integer | integer | integer

 

Considerations

  1. Daily summary data and daily goals for elevation (elevation, floors) are only included for users with a device that includes an altimeter.
  2. The steps field in Daily Activity Summary entries is included only for activities that have steps (e.g. "Walking," "Running").
  3. Calorie burn goal (CaloriesOut) represents either the dynamic daily target from the premium trainer plan or a manual calorie burn goal. Goals are included in the response only for today and the previous 21 days.
  4. Calorie Count is the top level time series for calories burned inclusive of basal metabolic rate (BMR), tracked activity, and manually logged activities.
  5. Calories BMR only includes BMR calories.
  6. Activity Calories are the number of calories burned during the day for periods of time when the user was active above sedentary level.
  7. Sleep stages are traditionally measured in a lab using an electroencephalogram to detect brain activity along with other systems to monitor eye and muscle activity. While this method is the gold standard for measuring sleep stages (source), Fitbit estimates a user’s sleep stages using a combination of movement and heart-rate patterns. When a user hasn't moved for about an hour, their tracker or watch assumes that the user is asleep. Additional data, such as the length of time the user's movements are indicative of sleep behavior (rolling over, etc.), help confirm that the user is asleep. While the user is sleeping, their device tracks the beat-to-beat changes in their heart rate, known as heart rate variability (HRV), which fluctuate as they transition between light sleep, deep sleep, and REM sleep stages. When the user syncs their device in the morning, Fitbit uses their movement and heart rate patterns to estimate their sleep cycles from the previous night.
  8. Two types of sleep data are available to researchers:
    1. Sleep patterns (awake, restless, asleep) - minute level and summary
    2. Sleep stages (awake, light, deep, REM) - minute level and summary.
    3. There are a few scenarios where a researcher might see a sleep pattern (which shows time asleep, restless, and awake) instead of sleep stages (awake, light, deep, REM):

        • If a user slept in a position that prevented the device from getting a consistent heart-rate reading or if the device is worn too loosely. 
        • For best results, the device should be worn higher on the wrist (about 2-3 finger widths above wrist bone). The band should feel secure but not too tight.
        • If the user used the Begin Sleep Now option in the Fitbit app (instead of simply wearing their device to bed).
        • If the user slept for less than 3 hours.
        • If the device’s battery is critically low.
  9. Other methodological considerations for using Fitbit data within the All of Us Research Program are highlighted here: Considerations while using Fitbit Data in the All of Us Research Program – User Support (researchallofus.org)
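
To make the pattern-versus-stages distinction in consideration 8 concrete, here is a small sketch that labels one night of sleep_level rows by the level names it contains. The grouping of level names is an assumption drawn from the lists in this article; because "awake" appears in both lists, the check keys only on the levels unique to each type. The function name `sleep_record_type` is hypothetical.

```python
# Level names unique to each record type, per the lists in this article.
STAGE_LEVELS = {"light", "deep", "rem"}
PATTERN_LEVELS = {"asleep", "restless"}


def sleep_record_type(levels):
    """Classify one night's levels as 'stages', 'pattern', or 'unknown'."""
    seen = {level.lower() for level in levels}
    if seen & STAGE_LEVELS:
        return "stages"
    if seen & PATTERN_LEVELS:
        return "pattern"
    return "unknown"


print(sleep_record_type(["wake", "light", "deep", "rem"]))
print(sleep_record_type(["awake", "restless", "asleep"]))
```

Splitting nights this way before analysis helps avoid mixing the two record types in the same summary statistic.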

Genomic Data 

The All of Us genomic dataset contains whole genome sequencing (WGS) data and microarray genotype data (Array). The genomic data are accessible through the Researcher Workbench. Bucket locations, for accessing the data in analysis notebooks, can be found in the Controlled CDR Directory. We provide variants in Variant Call Format (VCF), Hail MatrixTables (MT), and PLINK 1.9 bed/bim/fam triplets. PLINK files are only provided for the array variants. We provide the auxiliary tabular data, such as the joint callset QC flagged samples or related pairs, as tab-separated values (tsv), with the column headers in the first row.
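
Since the auxiliary files are tab-separated with headers in the first row, they can be read with Python's standard csv module. This is a minimal sketch: the sample content and the column names `sample_id` and `flag` are invented for illustration and will differ from the real files.

```python
import csv
import io

# Made-up stand-in for an auxiliary TSV file; real column names may differ.
sample_tsv = "sample_id\tflag\nS1\tpass\nS2\tfail\n"


def read_tsv(handle):
    """Read tab-separated rows, using the first row as column headers."""
    return list(csv.DictReader(handle, delimiter="\t"))


records = read_tsv(io.StringIO(sample_tsv))
flagged = [r["sample_id"] for r in records if r["flag"] == "fail"]
print(flagged)
```

In a notebook, the same function would be passed an open file handle for a TSV copied from the bucket rather than an in-memory string.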

For more detailed information on how the genomic data are organized, please see this article.

 

Externally Sourced Socioeconomic Status Data

A selection of socioeconomic status summary statistics, sourced from the U.S. Census American Community Survey via a three-digit ZIP code linkage, is made available within the Controlled Tier. These data are stored in an appended table and cover the following domains on a per Census block basis: proportion of population receiving assisted income benefits within the past 12 months, proportion of population aged 25 years or older with educational attainment of at least high school or GED equivalent, median household income in the past 12 months (in 2015 inflation-adjusted dollars), proportion of the population with no health insurance coverage, proportion of population with income below the federal poverty level within the past 12 months, proportion of houses that are vacant, and a deprivation index (see here for more info). Note: Participant-level concepts for the following related data elements are available via the Basics survey: educational attainment, household income, and health insurance coverage.

 

Datasets Unlinked from the Registered Tier CDR

All of Us SARS-CoV-2 Antibody Study

Data from the All of Us SARS-CoV-2 Antibody study [1] are available in a series of five tables that are not linked to the Registered Tier CDR data (see tables below). These data are provided for the purpose of study replication and will not be updated with future Registered Tier CDR data releases. Please note that there are considerations to keep in mind when reproducing this study: not all positive controls used in study analyses could be included in the datasets due to data use restrictions set by the sample provider, and race and ethnicity are combined categories within the paper. For more information about replicating this research, please see the “How to Reproduce the All of Us SARS-CoV-2 Antibody Study” notebook found in the Featured Workspaces section of the Researcher Workbench.

The serology dataset is only contained in the All of Us Registered Tier Dataset v4 CDR R2020Q4R2 and, therefore, must be accessed through that CDR version, following the rules associated with accessing old CDR versions.

 

Serology Person Table

Field Name | Field Description | Field Type | Enumerators | Registered Tier Rules
serology_person_id | A person id created specifically for the study | | | Distinct id generated for this dataset; not linked in any way to research_id
person_id | | | | Suppressed column in Registered Tier
state | State in which the individual/patient lives | string | | Generalized state of residence for participants who reside in all non-US states and Washington DC into a single group (Guam, Palau, Puerto Rico, American Samoa, Micronesia, Marshall Island, Virgin Islands)
race | Based on individual’s self-reported race, generalized according to existing Registered Tier privacy requirements | string | | Generalized based on Registered Tier CDR generalization rules
ethnicity | Based on individual’s self-reported ethnicity, generalized according to existing Registered Tier privacy requirements | string | | Generalized based on Registered Tier CDR generalization rules
sex_at_birth | Based on individual’s self-reported sex, generalized according to existing Privacy requirements | string | | Generalized based on Registered Tier CDR generalization rules
age | Individual’s age at date of specimen collection | numeric | | Generalized participants greater than 89 years into one group
control_status | | string | Positive / Negative / Non-Control |

 

 

Serology Test Table

Field Name | Field Description | Field Type | Enumerators | Notes
test_id | A primary key of test table | | |
sample_id | ID created for each sample tested | | |
serology_person_id | Foreign key to the person table | | |
test_code | Code corresponding to test_name | | | From the original flat files
test_name | Name of test (e.g. Abbott, EuroImmune, etc.) | | | From the original flat files
batch | Batch number | | |
run_date_time | | date/time | | From the original flat files
instrument_name | | | | From the original flat files
position | | | | From the original flat files

 

Serology Results Table

Field Name | Field Description | Field Type | Enumerators | Notes
result_id | A primary key of Result | | |
test_id | A foreign key linked to Test | | |
result_name | | | | From the original flat files
result_value | | | | From the original flat files
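
The key relationships described in the person, test, and results tables can be sketched with an in-memory SQLite database. This is an illustration only: the table names `person`, `test`, and `result`, the column types, and every inserted value are assumptions made for the example, not the names or contents of the actual dataset.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE person (
        serology_person_id INTEGER PRIMARY KEY,
        control_status TEXT
    );
    CREATE TABLE test (
        test_id INTEGER PRIMARY KEY,
        serology_person_id INTEGER,
        test_name TEXT
    );
    CREATE TABLE result (
        result_id INTEGER PRIMARY KEY,
        test_id INTEGER,
        result_name TEXT,
        result_value TEXT
    );
    -- Made-up rows for illustration only.
    INSERT INTO person VALUES (1, 'Non-Control');
    INSERT INTO test VALUES (10, 1, 'Abbott');
    INSERT INTO result VALUES (100, 10, 'example_result', 'example_value');
    """
)

# Walk the foreign keys: result -> test -> person.
rows = conn.execute(
    """
    SELECT p.serology_person_id, t.test_name, r.result_name, r.result_value
    FROM result AS r
    JOIN test AS t ON t.test_id = r.test_id
    JOIN person AS p ON p.serology_person_id = t.serology_person_id
    """
).fetchall()
print(rows)
```

The join follows test_id from the results table to the test table, then serology_person_id from the test table to the person table, as the field descriptions indicate.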

 

Validation Results Table

Field Name | Field Description | Field Type | Enumerators | Notes
person_serology_id | | | | From the original flat files
sample_id | | | |
roche_date | (Original field is “Final Date”) Date the test was reported in the Mayo system | | | From the original flat files
roche_result | (COVTI) Test result for Roche Test | | Positive / Negative / TNP / Pending | From the original flat files
roche_raw_result | (COVTS) Raw data for Roche test results | | Equal to or greater than 1 considered positive | From the original flat files
ortho_date | (Original field is “Final Date”) Date the test was reported in the Mayo system | | | From the original flat files
ortho_result | (VSARS) Test result for Ortho test | | Positive / Negative / TNP / Pending | From the original flat files
ortho_raw_result | (SCO7) Raw data for the Ortho test result | | Equal to or greater than 1 considered positive | From the original flat files
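
The raw-result rule in this table (a value equal to or greater than 1 is considered positive) can be expressed as a one-line check. This sketch assumes the raw value can be parsed as a number; the function name `raw_result_is_positive` is hypothetical, and the sample inputs are invented.

```python
def raw_result_is_positive(raw_value, threshold=1.0):
    """Apply the table's rule: raw values >= 1 are considered positive."""
    return float(raw_value) >= threshold


print(raw_result_is_positive("1.0"))
print(raw_result_is_positive(0.42))
```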

 

Titer

Field Name | Field Description | Field Type | Enumerators | Notes
sample_id | | numeric | |
serology_person_id | Foreign key to person | numeric | |
batch | | numeric | | From the original flat files
assay_type | | | | From the original flat files
material | | | | From the original flat files
test | | | | From the original flat files
result | | | | From the original flat files
comment | | | | From the original flat files

 

References

[1] Althoff, K., Schlueter, D.J., Anton-Culver, H., Cherry, J., Denny, J., Thomsen, I., ... Schully, S. (in press). Antibodies to SARS-CoV-2 in All of Us Research Program participants, January 2 - March 18, 2020. Clinical Infectious Diseases, ciab519. https://doi.org/10.1093/cid/ciab519
