Data Organization
All of Us data are organized into tables according to the Observational Medical Outcomes Partnership Common Data Model (OMOP CDM) version 5.2, when possible. Extensive information regarding the OMOP CDM relational database can be found at www.ohdsi.org. Participant-provided information (PPI), physical measurements (PM), and electronic health record (EHR) data are arranged into tables according to the OMOP CDM convention as shown below. Self-reported demographic data from the Basics survey (a PPI survey) populate the Person table. Other data obtained from PPI surveys are found in the Observation table (see survey codebooks for more information about PPI surveys). Program physical measurements as well as EHR measurements populate the Measurement table. EHR data concerning visits, procedures, drugs, and conditions are arranged into their respective tables. All tables relate to the Person table, and the tables containing procedure, drug, condition, and measurement data also relate to the Visit table.
When it comes to building a cohort using the Cohort Builder tool, you will find that the data are organized by “program data” and “domains.” Program data includes demographics, surveys, and physical measurements. EHR data are arranged by domain (conditions, procedures, drugs, measurements, and visits).
Program Data
Demographics include age, gender, race, ethnicity, and deceased status. Demographic data are self-reported (collected via surveys) and subject to privacy methodology. The Controlled Tier dataset also contains demographic data supplied by EHRs; the Registered Tier does not.
Surveys consist of the questions, and their associated response options, that participants completed.
All of the survey questions and potential answers are added to the All of Us specific PPI vocabulary and assigned a concept_id (source concept ID). When possible, the PPI concepts are then mapped to standard vocabularies such as Logical Observation Identifiers Names and Codes (LOINC), International Classification of Diseases (ICD), or Systematized Nomenclature of Medicine (SNOMED), and their associated “standard” concept_id. When mapping is not possible, the PPI concept_id serves as both the source and standard.
Survey questions and answers are stored in the Observational Medical Outcomes Partnership (OMOP) observation table and can be searched and analyzed via either the standard or source vocabulary concept_id.
For more information on how to use survey data in your research, see our Introduction to Survey Collection and Data Transformation Methods article and the How to Work with All of Us Survey Data notebooks in the Researcher Workbench tutorial workspaces.
Physical Measurements are taken at the time of participant enrollment, including blood pressure, heart rate, height, weight, body mass index (BMI), waist and hip circumference, pregnancy status, and wheelchair use.
Physical measurements are assigned an All of Us specific PPI concept_id (source concept ID). The PPI concepts are then mapped to the standard LOINC vocabulary and the associated “standard” concept_id and stored in the OMOP Measurement table. We recommend using the program-collected measurement data when possible. To distinguish between measurements recorded at enrollment and those recorded in a participant’s EHR, use the source concept_ids. See the table below for more detail.
*Note: if you need physical measurements data, and you use measurement_concept_id, then you need to specify the data source found in the measurement_ext table. If you use the measurement_source_concept_id, then you don’t need to specify the data source.
measurement_concept_id | Standard | measurement_source_concept_id | Source |
3004249 | Systolic blood pressure | 903109 | 1st systolic blood pressure |
3004249 | Systolic blood pressure | 903114 | 2nd systolic blood pressure |
3004249 | Systolic blood pressure | 903130 | 3rd systolic blood pressure |
3012888 | Diastolic blood pressure | 903110 | 1st diastolic blood pressure |
3012888 | Diastolic blood pressure | 903129 | 2nd diastolic blood pressure |
3012888 | Diastolic blood pressure | 903106 | 3rd diastolic blood pressure |
3025315 | Body weight | 903121 | Weight |
3027018 | Heart rate | 903112 | 1st heart rate |
3027018 | Heart rate | 903105 | 2nd heart rate |
3027018 | Heart rate | 903108 | 3rd heart rate |
3036277 | Body height | 903133 | Height |
40759207 | Adult waist circumference protocol | 903127 | 1st waist circumference |
40759207 | Adult waist circumference protocol | 903134 | 2nd waist circumference |
40759207 | Adult waist circumference protocol | 903128 | 3rd waist circumference |
40765148 | PhenX - hip circumference protocol | 903117 | 1st hip circumference |
40765148 | PhenX - hip circumference protocol | 903125 | 2nd hip circumference |
40765148 | PhenX - hip circumference protocol | 903123 | 3rd hip circumference |
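As a minimal sketch of the note above, the following Python snippet builds a filter that keeps only program-collected measurements by using source concept IDs. The concept IDs are copied from the table; the helper name `program_pm_filter` is hypothetical, and the table name `measurement` and column `value_as_number` are assumed to follow the standard OMOP CDM layout rather than anything confirmed in this article.

```python
# Source concept IDs for program physical measurements, keyed by the
# standard concept ID they map to (values copied from the table above).
PM_SOURCE_IDS = {
    3004249: [903109, 903114, 903130],   # systolic blood pressure, readings 1-3
    3012888: [903110, 903129, 903106],   # diastolic blood pressure, readings 1-3
    3025315: [903121],                   # weight
    3027018: [903112, 903105, 903108],   # heart rate, readings 1-3
    3036277: [903133],                   # height
    40759207: [903127, 903134, 903128],  # waist circumference, readings 1-3
    40765148: [903117, 903125, 903123],  # hip circumference, readings 1-3
}


def program_pm_filter(standard_concept_id: int) -> str:
    """Build a SQL WHERE fragment that keeps only program-collected rows.

    Hypothetical helper: filtering on measurement_source_concept_id avoids
    EHR rows that share the same standard concept ID.
    """
    source_ids = PM_SOURCE_IDS[standard_concept_id]
    id_list = ", ".join(str(i) for i in source_ids)
    return f"measurement_source_concept_id IN ({id_list})"


# Assumed OMOP table and column names; adjust to the dataset in use.
query = (
    "SELECT person_id, measurement_source_concept_id, value_as_number "
    "FROM measurement WHERE " + program_pm_filter(3004249)
)
print(query)
```

Because the filter is on the source concept ID, no additional lookup in measurement_ext is needed to separate enrollment readings from EHR readings.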
For more information on how to use physical measurement data in your research, see the How to Work with All of Us Physical Measurement Data notebooks in the Researcher Workbench tutorial workspaces.
Additionally, you may browse and/or download all of the All of Us concepts via ATHENA, the Observational Medical Outcomes Partnership (OMOP) community's searchable database of supported standardized vocabularies. To browse All of Us PPI concepts, select "Vocabulary" in the left-side navigation bar and scroll to "PPI."
Electronic Health Records (EHR)
EHR data are transformed into standard vocabulary across 14 structured tables. Click here for information on how privacy rules may affect access to information within a participant's EHR and see the Data Dictionary for Curated Data Repository (CDR) for a detailed description of EHR data available within each table (listed below).
- Person
- Visit Occurrence
- Condition Occurrence
- Drug Exposure
- Measurement
- Procedure Occurrence
- Observation
- Location*
- Provider*
- Device Exposure
- Death
- Care Site*
- Fact Relationship
- Specimen
*Suppressed information
EHR Domains:
Conditions come from EHRs and are listed by ICD9, ICD10, or SNOMED standard codes.
Procedures come from EHRs and are listed by ICD9, ICD10, CPT, or SNOMED standard codes.
Drugs or medications come from EHRs and are listed by ingredient and organized by therapeutic uses according to the Anatomical Therapeutic Chemical (ATC) Classification System.
Measurements include laboratory tests and vital signs found in the EHR and are organized in the LOINC (Logical Observation Identifiers Names and Codes) code hierarchy.
Visits describe the type of facility where the participant received medical care (e.g., emergency room, outpatient, or inpatient).
Data Not Structured According to OMOP CDM
Wearable Device Data
Fitbit data are available in a series of six tables within both the Registered Tier and Controlled Tier datasets, allowing researchers to parse the data themselves. The following list and grid display all currently available tables and associated fields.
- Heart Rate (By Zone Summary)
- Heart Rate (Minute-Level)
- Activity (Daily Summary)
- Activity: Intraday Steps (Minute-Level)
- Sleep Level (Sequence of Sleep by level)
- Daily Sleep Summary
Table | Field |
steps_intraday | datetime |
steps_intraday | steps |
steps_intraday | person_id |
heart_rate_summary | person_id |
heart_rate_summary | date |
heart_rate_summary | zone_name |
heart_rate_summary | min_heart_rate |
heart_rate_summary | max_heart_rate |
heart_rate_summary | minute_in_zone |
heart_rate_summary | calorie_count |
activity_summary | date |
activity_summary | activity_calories |
activity_summary | calories_bmr |
activity_summary | calories_out |
activity_summary | elevation |
activity_summary | fairly_active_minutes |
activity_summary | floors |
activity_summary | lightly_active_minutes |
activity_summary | marginal_calories |
activity_summary | sedentary_minutes |
activity_summary | steps |
activity_summary | very_active_minutes |
activity_summary | person_id |
heart_rate_minute_level | datetime |
heart_rate_minute_level | heart_rate_value |
heart_rate_minute_level | person_id |
sleep_level | person_id |
sleep_level | sleep_date |
sleep_level | start_datetime |
sleep_level | is_main_sleep |
sleep_level | level |
sleep_level | duration_in_min |
sleep_summary | person_id |
sleep_summary | sleep_date |
sleep_summary | is_main_sleep |
sleep_summary | minute_in_bed |
sleep_summary | minute_asleep |
sleep_summary | minute_after_wakeup |
sleep_summary | minute_awake |
sleep_summary | minute_restless |
sleep_summary | minute_deep |
sleep_summary | minute_light |
sleep_summary | minute_rem |
sleep_summary | minute_wake |
Below are the tables with the data format for each field, along with some notes to consider when using these data.
Daily Activity Summary
Each row is a daily activity summary for a given participant
person_id | Date | activity calories | calories BMR | calories out | elevation | fairly active minutes | floors | lightly active minutes | marginal calories | sedentary minutes | steps | very active minutes |
integer | date | float | float | float | float | float | integer | float | float | float | integer | float |
Heart Rate (By Zone Summary)
person_id | Date | Zone Name | Min Heart Rate | Max Heart Rate | Number of Minutes in Zone | Calorie Count |
integer | date | string | integer | integer | integer | float |
Heart Rate (Minute-Level)
Each row is a one-minute heart rate reading for a given participant
person_id | Datetime | Heart Rate Value |
integer | datetime | integer |
Activity: Intraday Steps (Minute-Level)
person_id | Datetime | Steps |
integer | datetime | numeric |
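As an illustration of working with the minute-level layout above, this sketch sums intraday steps into daily totals per participant. The rows are made up for the example, and the timestamp string format is an assumption; data exported from the Workbench may already arrive as datetime objects.

```python
from collections import defaultdict
from datetime import datetime

# Made-up rows shaped like steps_intraday: (person_id, datetime, steps).
rows = [
    (1, "2021-03-01 08:00:00", 12),
    (1, "2021-03-01 08:01:00", 30),
    (1, "2021-03-02 07:15:00", 5),
    (2, "2021-03-01 09:00:00", 40),
]


def daily_steps(minute_rows):
    """Sum minute-level steps per (person_id, calendar date)."""
    totals = defaultdict(int)
    for person_id, stamp, steps in minute_rows:
        # Assumed timestamp format; change the pattern if the data differ.
        day = datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S").date()
        totals[(person_id, day)] += steps
    return dict(totals)


for (person_id, day), total in sorted(daily_steps(rows).items()):
    print(person_id, day.isoformat(), total)
```

Totals computed this way may differ from the steps field in the daily activity summary, so treat the comparison as a check rather than an identity.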
Sleep Level (Sequence of Sleep by level)
Levels: awake, light, asleep, deep, restless, wake, rem, unknown
person_id | sleep_date | start_datetime | is_main_sleep | level | duration_in_min |
integer | date | datetime | string | string | float |
Sleep Daily Summary
person_id | sleep_date | is_main_sleep | minute_in_bed | minute_asleep | minute_after_wakeup | minute_awake | minute_restless | minute_deep | minute_light | minute_rem | minute_wake |
integer | date | string | integer | integer | integer | integer | integer | integer | integer | integer | integer |
Considerations
- Daily summary data and daily goals for elevation (elevation, floors) are only included for users with a device that includes an altimeter.
- The steps field in Daily Activity Summary entries is included only for activities that have steps (e.g., "Walking," "Running").
- Calorie burn goal (CaloriesOut) represents either a dynamic daily target from the premium trainer plan or a manual calorie burn goal. Goals are included in the response only for today and 21 days in the past.
- Calorie Count is the top level time series for calories burned inclusive of basal metabolic rate (BMR), tracked activity, and manually logged activities.
- Calories BMR only includes BMR calories.
- Activity Calories are the number of calories burned during the day for periods of time when the user was active above sedentary level.
- Sleep stages are traditionally measured in a lab using an electroencephalogram to detect brain activity, along with other systems to monitor eye and muscle activity. While this method is the gold standard for measuring sleep stages (source), Fitbit estimates a user's sleep stages using a combination of movement and heart-rate patterns. When a user hasn't moved for about an hour, their tracker or watch assumes that the user is asleep. Additional data, such as the length of time the user's movements are indicative of sleep behavior (for example, rolling over), help confirm that the user is asleep. While the user is sleeping, their device tracks the beat-to-beat changes in their heart rate, known as heart rate variability (HRV), which fluctuate as they transition between light sleep, deep sleep, and REM sleep stages. When the user syncs their device in the morning, Fitbit uses their movement and heart rate patterns to estimate their sleep cycles from the previous night.
- Two kinds of sleep data are available to researchers:
- Sleep patterns (awake, restless, asleep): minute level and summary
- Sleep stages (awake, light, deep, REM): minute level and summary
There are a few scenarios where a researcher might see sleep patterns (time asleep, restless, and awake) instead of sleep stages (awake, light, deep, REM):
- If a user slept in a position that prevented the device from getting a consistent heart-rate reading or if the device is worn too loosely.
- For best results, the device should be worn higher on the wrist (about 2-3 finger widths above wrist bone). The band should feel secure but not too tight.
- If the user used the Begin Sleep Now option in the Fitbit app (instead of simply wearing their device to bed).
- If the user slept for less than 3 hours.
- If the device’s battery is critically low.
- Other methodological considerations for using Fitbit data within the All of Us Research Program are highlighted here: Considerations while using Fitbit Data in the All of Us Research Program – User Support (researchallofus.org)
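The distinction between sleep patterns and sleep stages described in the considerations above can be checked for each night of data. The sketch below is one possible approach, not an official method: it assumes that the levels "asleep" and "restless" occur only in pattern logs and that "light", "deep", and "rem" occur only in stage logs, which is inferred from the lists in this article. The function name `sleep_log_kind` is hypothetical.

```python
# Levels assumed to appear in only one kind of sleep log (inferred from
# the sleep pattern and sleep stage lists in this article).
PATTERN_ONLY = {"asleep", "restless"}
STAGE_ONLY = {"light", "deep", "rem"}


def sleep_log_kind(levels):
    """Classify one night's sleep_level values as 'stages', 'pattern', or 'unknown'."""
    seen = {level.lower() for level in levels}
    if seen & STAGE_ONLY:
        return "stages"
    if seen & PATTERN_ONLY:
        return "pattern"
    # Only ambiguous levels (awake, wake, unknown) or no rows at all.
    return "unknown"


print(sleep_log_kind(["wake", "light", "deep", "rem"]))   # stages
print(sleep_log_kind(["asleep", "restless", "awake"]))    # pattern
```

Grouping sleep_level rows by person_id and sleep_date before calling the function would give one label per participant-night.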
Genomic Data
The All of Us genomic dataset contains whole genome sequencing (WGS) data and microarray genotype data (Array). The genomic data are accessible through the Researcher Workbench. Bucket locations for accessing the data in analysis notebooks can be found in the Controlled CDR Directory. We provide variants in Variant Call Format (VCF), Hail MatrixTables (MT), and PLINK 1.9 bed/bim/fam triplets. PLINK files are provided only for the array variants. We provide auxiliary tabular data, such as the joint callset QC flagged samples or related pairs, as tab-separated values (TSV) files with the column headers in the first row.
For more detailed information on how the genomic data are organized, please see this article.
Externally Sourced Socioeconomic Status Data
A selection of socioeconomic status summary statistics, sourced from the U.S. Census American Community Survey via a three digit zip code linkage, are made available within the Controlled Tier. These data are stored in an appended table and cover the following domains on a per Census block basis: proportion of population receiving assisted income benefits within the past 12 months, proportion of population aged 25 years or older with educational attainment of at least high school or GED equivalent, median household income in the past 12 months (in 2015 inflation-adjusted dollars), proportion of the population with no health insurance coverage, proportion of population with income below the federal poverty level within the past 12 months, proportion of houses that are vacant, and a deprivation index (see here for more info). Note: Participant level concepts for the following related data elements are available via the Basics survey: educational attainment, household income, and health insurance coverage.
Datasets Unlinked from the Registered Tier CDR
All of Us SARS-CoV-2 Antibody Study
Data from the All of Us SARS-CoV-2 Antibody study [1] are available in a series of five tables that are unlinked from the Registered Tier CDR data (see tables below). These data are provided for the purpose of study replication and will not be updated with future Registered Tier CDR data releases. Keep two considerations in mind when reproducing this study: not all positive controls used in the study analyses could be included in the datasets because of data use restrictions set by the sample provider, and race and ethnicity are combined categories within the paper. For more information about replicating this research, please see the “How to Reproduce the All of Us SARS-CoV-2 Antibody Study” notebook found in the Featured Workspaces section of the Researcher Workbench.
The serology dataset is only contained in the All of Us Registered Tier Dataset v4 CDR R2020Q4R2 and, therefore, must be accessed through that CDR version, following the rules associated with accessing old CDR versions.
Serology Person Table
Field Name | Field Description | Field Type | Enumerators | Registered Tier Rules |
serology_person_id | A person id created specifically for the study |  |  | Distinct id generated for this dataset; not linked in any way to research_id |
person_id |  |  |  | Suppressed column in Registered Tier |
state | State in which the individual/patient lives | string |  | Generalized state of residence into a single group for participants who reside in non-US states and Washington DC (Guam, Palau, Puerto Rico, American Samoa, Micronesia, Marshall Islands, Virgin Islands) |
race | Based on individual’s self-reported race, generalized according to existing Registered Tier privacy requirements | string |  | Generalized based on Registered Tier CDR generalization rules |
ethnicity | Based on individual’s self-reported ethnicity, generalized according to existing Registered Tier privacy requirements | string |  | Generalized based on Registered Tier CDR generalization rules |
sex_at_birth | Based on individual’s self-reported sex, generalized according to existing privacy requirements | string |  | Generalized based on Registered Tier CDR generalization rules |
age | Individual’s age at date of specimen collection | numeric |  | Generalized participants greater than 89 years into one group |
control_status |  | string | Positive / Negative / Non-Control |  |
Serology Test Table
Field Name | Field Description | Field Type | Enumerators | Notes |
test_id | A primary key of test table |  |  |  |
sample_id | ID created for each sample tested |  |  |  |
serology_person_id | Foreign key to the person table |  |  |  |
test_code | Code corresponding to test_name |  |  | From the original flat files |
test_name | Name of test (e.g. Abbott, EuroImmune, etc.) |  |  | From the original flat files |
batch | Batch number |  |  |  |
run_date_time |  | date/time |  | From the original flat files |
instrument_name |  |  |  | From the original flat files |
position |  |  |  | From the original flat files |
Serology Results Table
Field Name | Field Description | Field Type | Enumerators | Notes |
result_id | A primary key of Result |  |  |  |
test_id | A foreign key linked to Test |  |  |  |
result_name |  |  |  | From the original flat files |
result_value |  |  |  | From the original flat files |
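The key relationships described in the person, test, and results tables can be sketched with an in-memory SQLite database. This is an illustration only: the table names `serology_person`, `test`, and `result` are assumptions, the rows are made up, and only the key columns named in the tables above are taken from this article.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Assumed table names; only the key columns come from the article.
    CREATE TABLE serology_person (serology_person_id INTEGER PRIMARY KEY, state TEXT);
    CREATE TABLE test (test_id INTEGER PRIMARY KEY, serology_person_id INTEGER, test_name TEXT);
    CREATE TABLE result (result_id INTEGER PRIMARY KEY, test_id INTEGER, result_name TEXT, result_value TEXT);
    -- Made-up rows for illustration only.
    INSERT INTO serology_person VALUES (1, 'XX');
    INSERT INTO test VALUES (10, 1, 'Abbott');
    INSERT INTO result VALUES (100, 10, 'example_result', 'example_value');
    """
)

# Follow serology_person_id from person to test, then test_id from test to result.
rows = conn.execute(
    """
    SELECT p.serology_person_id, t.test_name, r.result_name, r.result_value
    FROM serology_person AS p
    JOIN test AS t ON t.serology_person_id = p.serology_person_id
    JOIN result AS r ON r.test_id = t.test_id
    """
).fetchall()
print(rows)
```

Because these tables are unlinked from the Registered Tier CDR, serology_person_id cannot be joined to person_id in the main OMOP tables.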
Validation Results Table
Field Name | Field Description | Field Type | Enumerators | Notes |
person_serology_id |  |  |  | From the original flat files |
sample_id |  |  |  |  |
roche_date | (Original field is “Final Date”) Date the test was reported in the Mayo system |  |  | From the original flat files |
roche_result | (COVTI) Test result for Roche Test |  | Positive / Negative / TNP / Pending | From the original flat files |
roche_raw_result | (COVTS) Raw data for Roche test results |  | Equal to or greater than 1 considered positive | From the original flat files |
ortho_date | (Original field is “Final Date”) Date the test was reported in the Mayo system |  |  | From the original flat files |
ortho_result | (VSARS) Test result for Ortho test |  | Positive / Negative / TNP / Pending | From the original flat files |
ortho_raw_result | (SCO7) Raw data for the Ortho test result |  | Equal to or greater than 1 considered positive | From the original flat files |
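The positivity rule stated for the raw Roche and Ortho results (a value equal to or greater than 1 is considered positive) can be expressed as a small helper. This is a sketch under the assumption that raw values have already been converted to numbers; the function name `raw_result_is_positive` is hypothetical.

```python
def raw_result_is_positive(raw_value: float, cutoff: float = 1.0) -> bool:
    """Return True when a raw Roche or Ortho value meets the positivity cutoff."""
    return raw_value >= cutoff


# Made-up values for illustration: below, at, and above the cutoff.
for value in (0.4, 1.0, 3.2):
    print(value, raw_result_is_positive(value))
```

The reported roche_result and ortho_result fields can also hold TNP or Pending, so a raw value alone may not reproduce the reported category.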
Titer
Field Name | Field Description | Field Type | Enumerators | Notes |
sample_id |  | numeric |  |  |
serology_person_id | Foreign key to person | numeric |  |  |
batch |  | numeric |  | From the original flat files |
assay_type |  |  |  | From the original flat files |
material |  |  |  | From the original flat files |
test |  |  |  | From the original flat files |
result |  |  |  | From the original flat files |
comment |  |  |  | From the original flat files |
References
1. Althoff, K., Schlueter, D.J., Anton-Culver, H., Cherry, J., Denny, J., Thomsen, I., ... Schully, S. (in press). Antibodies to SARS-CoV-2 in All of Us Research Program participants, January 2 - March 18, 2020. Clinical Infectious Diseases, ciab519, https://doi.org/10.1093/cid/ciab519