How Does All of Us Protect Participant Privacy?
Participant privacy is of utmost importance to the All of Us Research Program. Our privacy experts have created curation methods to remove all personally identifying information (PII) from participant records. Beyond the removal of PII, some data are subject to suppression (withholding or removing selected information) or generalization, depending on re-identification risk.
Overview of the privacy rules
The All of Us data are released to investigators through three separate tiers based on the privacy risks inherent in the data included for that tier and the requirements for access: Public, Registered, and Controlled Tiers.
- Public Tier data include only anonymized, aggregated data and are accessible to the public through the All of Us Research Hub's Data Browser and Survey Explorer.
- Registered Tier data include participant-level data with added transformations to protect participant privacy and are only available to registered researchers. These transformations are based on modeling and empirical analysis of re-identification risks associated with each data element.
- Controlled Tier data include participant-level data with fewer transformations than those applied in the Registered Tier.
These transformations can be summarized as follows:
Fields that are removed:
Registered Tier:
- All explicit identifiers (e.g., personal name, medical record number, and participant ID)
- All free-text fields in survey data and unstructured documents from electronic health records (EHR)
- All geolocation data more granular than the US state level (e.g., location ID, provider ID)
- Race and ethnicity subcategories in survey data (e.g., Hmong, Filipino, or Caribbean)
- Demographics (race/ethnicity and sex) (from EHR data only)
- Living situation (from survey data)
- Active-duty military status
- Death causes (EHR) (e.g., diagnosis codes specifying cause of death)
- Diagnosis codes subject to public knowledge (EHR)
- Codes indicative of suppressed sex, sexuality, or gender types (EHR)
- EHR structured concepts related to COVID-19 testing or diagnosis, including serology results
- COVID-19 Participant Experience (COPE) survey items detailing COVID-19 and flu tests, diagnoses, and symptoms
Controlled Tier:
The Controlled Tier will exclude all direct identifiers (similar to the Registered Tier). The data privacy rules currently in place for the Registered Tier will be modified for the Controlled Tier as follows:
- Unstructured text, including free-text survey responses and clinical documents, will be suppressed in the Beta release of the Controlled Tier. Future releases of the Controlled Tier may include extracts of clinical documents or unstructured text after privacy-protecting transformations such as mapping to standard concepts.
- Many of the transformations applied to demographic variables in the Registered Tier will be removed from the Controlled Tier curated data repository (CDR), enabling access to more granular demographic information.
- Genomic data currently excluded from the Registered Tier will be included in the Controlled Tier CDR. All direct identifiers as well as other identifying information are removed, with research_id used as the sample key for both array and whole genome sequence data.
- COVID-19 data (e.g., ICD codes and tests indicative of COVID-19 status), currently suppressed in the Registered Tier, will be included in the Controlled Tier.
Date transformation:
Registered Tier:
- All dates are shifted backward by a random number of days between 1 and 365
- The shift is constant for each participant so that temporality of events is preserved
- Exception: COPE survey data are not date shifted following this schema.
- Data from all participants older than 89 are removed
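The per-participant date shift described above can be illustrated with a minimal sketch. It is not the program's actual curation code: the function names, the participant ID `P001`, and the use of a seeded random generator are assumptions made for illustration only. It shows why a single constant shift per participant preserves the time between that participant's events.

```python
import datetime as dt
import random


def assign_shift_days(participant_ids, seed=None):
    """Draw one shift of 1-365 days per participant (hypothetical helper).

    The same value is reused for every date belonging to that participant.
    """
    rng = random.Random(seed)
    return {pid: rng.randint(1, 365) for pid in participant_ids}


def shift_date(date, shift_days):
    """Move a date backward by the participant's fixed number of days."""
    return date - dt.timedelta(days=shift_days)


# Hypothetical participant with two events 30 days apart.
shifts = assign_shift_days(["P001"], seed=0)
visit = dt.date(2020, 3, 1)
follow_up = dt.date(2020, 3, 31)

shifted_visit = shift_date(visit, shifts["P001"])
shifted_follow_up = shift_date(follow_up, shifts["P001"])

# Both dates move by the same amount, so the interval is unchanged.
assert shifted_follow_up - shifted_visit == follow_up - visit
```

Because the shift differs between participants, calendar dates should not be compared across participants in shifted data; only within-participant intervals are preserved.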
Controlled Tier:
- Real (unshifted) dates of events will be available, and data from participants older than 89 will be included. The exception is date of birth, which is generalized to year of birth.
Demographic fields that are generalized:
Registered Tier:
Note: Race, ethnicity, sex, and gender data from the EHR are excluded. Participant-provided information (PPI) is the only source for these fields in the Registered Tier. For additional information about generalization rules, see Sex, Gender, and Sexual Orientation Generalizations, Education and Employment Generalizations, and Race and Ethnicity Generalizations.
- Race and ethnicity (less common races are grouped together and selections of two or more races are bundled together as "More than one population")
- Sex at birth (grouped into "Male," "Female," and "Intersex, other sex, prefer not to answer, or skipped")
- Gender identity (grouped into "Man," "Woman," and "Another gender, multiple genders, prefer not to answer, or skipped")
- Sexual orientation (grouped into "Straight" and "Not Straight, prefer not to answer, or skipped")
- Education (grouped into "College 4 years or more or advanced degree," "Some college," "High school graduate," and "Never attended school or only attended kindergarten/primary/middle school/some high school")
- Employment (grouped into "Employed for wages or self-employed" and "Not currently employed for wages")
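The grouping rules listed above can be sketched as a small lookup. This is an illustrative example rather than the program's implementation: the field keys, the `generalize` function, and the representation of an answer as a list of selections (an empty list meaning skipped) are assumptions. In particular, sending any multi-select answer to the catch-all group is an assumption for fields other than gender identity. Only the group labels come from the list above. Education and employment are omitted because their raw answer options are not given here.

```python
# Each rule: (answers released unchanged, catch-all group label).
GENERALIZATION_RULES = {
    "sex_at_birth": (
        {"Male", "Female"},
        "Intersex, other sex, prefer not to answer, or skipped",
    ),
    "gender_identity": (
        {"Man", "Woman"},
        "Another gender, multiple genders, prefer not to answer, or skipped",
    ),
    "sexual_orientation": (
        {"Straight"},
        "Not Straight, prefer not to answer, or skipped",
    ),
}


def generalize(field, selections):
    """Return the Registered Tier group for a survey answer (hypothetical helper).

    `selections` is the list of options the participant chose; an empty
    list stands for a skipped question.
    """
    kept, catch_all = GENERALIZATION_RULES[field]
    # A single selection from the kept set passes through unchanged.
    if len(selections) == 1 and selections[0] in kept:
        return selections[0]
    # Everything else (other answers, multiple answers, skipped) is grouped.
    return catch_all


print(generalize("sex_at_birth", ["Female"]))
print(generalize("gender_identity", ["Man", "Woman"]))
print(generalize("sexual_orientation", []))
```

Treating the common categories as pass-through values and everything else as one group keeps rare answers from standing out in participant-level data.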
Controlled Tier:
- Except as indicated in the summary table below, all demographic information will be released within the Controlled Tier dataset.
Overall summary
The following table summarizes privacy rules for the Controlled Tier (compared to the current rules implemented in the Registered Tier):
| Data Element | Registered Tier | Controlled Tier |
| --- | --- | --- |
| Explicit identifiers | Suppress | Suppress |
| Free-text fields in survey and unstructured clinical documents | Suppress | Suppress |
| Dates (of events) | Shift backward by a random number of days between 1 and 365 | As Collected (unshifted) |
| Date of Birth | Shift backward by a random number of days between 1 and 365 | Generalize to year of birth |
| Date of Death | Shift backward by a random number of days between 1 and 365 | As Collected (unshifted) |
| Data of participants age >89 | Suppress | As Collected |
| Geolocation | Generalize to US state | Generalize to first 3 digits of zip code |
| Marital status | As Collected | As Collected |
| Living situation PPI (survey): Where are you currently living? | Suppress | As Collected |
| Own or rent | As Collected | As Collected |
| Higher-level Race/Ethnicity (e.g., Asian, White, Black, MENA) | Generalize | As Collected |
| Race/Ethnicity subcategory (e.g., Hmong, Filipino, Caribbean) | Suppress | Suppress |
| Sex at birth (PPI)* | Generalize | As Collected*; includes all branching logic questions |
| Gender identity (PPI) | Generalize | As Collected*; includes all branching logic questions |
| Sexual orientation (PPI) | Generalize | As Collected*; includes all branching logic questions |
| Race/Ethnicity (EHR) | Suppress (value from EHR is suppressed to harmonize with PPI data) | As Collected |
| Sex/Gender (EHR) | Suppress (value from EHR is suppressed to harmonize with PPI data) | As Collected |
| ICD codes indicative of suppressed sex/gender (list of codes here) | Suppress | As Collected |
| Education | Generalize | As Collected |
| Employment status | Generalize | As Collected |
| Annual household income | As Collected | As Collected |
| Death cause (i.e., death cause noted in the EHR, including relevant diagnosis codes) | Suppress | As Collected |
| Diagnosis codes subject to public knowledge (list of codes here) | Suppress | As Collected |
| ICD codes indicative of motor vehicle accidents (ICD9 E80*-E84*, ICD10 V*) | Suppress | Suppress |
| Active-duty military status | Suppress | As Collected |
| Born in US or not | As Collected | As Collected* |
| Genomic data (includes program-generated Whole Genome Sequencing and Array data) | Suppress | As Collected |
Note: ‘As Collected’ indicates that there will be no change to the data for the purpose of privacy protection.
*Free text responses will be suppressed.
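Two of the rules in the summary table lend themselves to a short illustration: generalizing a zip code to its first 3 digits (Controlled Tier geolocation) and recognizing the motor vehicle accident code ranges that are suppressed in both tiers (ICD9 E80*-E84*, ICD10 V*). The sketch below is a minimal example under stated assumptions, not the program's curation code. The function names and the vocabulary labels `"ICD9CM"` and `"ICD10CM"` are chosen for illustration. The code system has to be passed explicitly because ICD-9 also contains codes beginning with V that fall outside the suppressed range.

```python
import re

# ICD9 external-cause codes E80x through E84x (prefix match).
_ICD9_MOTOR_VEHICLE = re.compile(r"^E8[0-4]")


def generalize_zip(zip_code):
    """Keep only the first 3 digits of a US zip code (hypothetical helper)."""
    prefix = zip_code.strip()[:3]
    if len(prefix) != 3 or not prefix.isdigit():
        raise ValueError(f"not a usable zip code: {zip_code!r}")
    return prefix


def is_motor_vehicle_accident_code(code, vocabulary):
    """Check a diagnosis code against the suppressed ranges from the table.

    `vocabulary` names the code system; the labels used here are assumptions.
    """
    code = code.strip().upper()
    if vocabulary == "ICD9CM":
        return bool(_ICD9_MOTOR_VEHICLE.match(code))
    if vocabulary == "ICD10CM":
        return code.startswith("V")
    return False


print(generalize_zip("37203-1234"))
print(is_motor_vehicle_accident_code("E812.0", "ICD9CM"))
print(is_motor_vehicle_accident_code("V70.0", "ICD9CM"))
```

The last call shows why the code system matters: an ICD-9 code starting with V is not matched by the ICD10 V* rule.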