How All of Us protects participant privacy


Participant privacy is of utmost importance to the All of Us Research Program. Our privacy experts have created curation methods to remove all personally identifying information (PII) from participant records. Beyond the removal of PII, some data are suppressed (withheld or removed) or generalized, depending on their re-identification risk.

Overview of the privacy rules

The All of Us data are released to investigators through three separate tiers: Public, Registered, and Controlled. The tiers differ in the privacy risk of the data they include and in their access requirements.

  • Public Tier data include only anonymized, aggregated data and are accessible to the public through the All of Us Research Hub's Data Browser and Survey Explorer.
  • Registered Tier data include participant-level data with added transformations to protect participant privacy and are only available to registered researchers. These transformations are based on modeling and empirical analysis of re-identification risks associated with each data element. 
  • Controlled Tier data include participant-level data with fewer transformations than those applied in the Registered Tier.

The transformations applied in the Registered and Controlled Tiers can be summarized as follows:

Fields that are removed:

Registered Tier:

  • All explicit identifiers (e.g., personal name, medical record number, and participant ID)
  • All free-text fields in survey data and unstructured documents from electronic health records (EHR)
  • All geolocation data more granular than the US state level (e.g., location ID, provider ID)
  • Race and ethnicity subcategories in survey data (e.g., Hmong, Filipino, or Caribbean)
  • Demographics (race/ethnicity and sex) (from EHR data only)
  • Living situation (from survey data)
  • Active-duty military status 
  • Death causes (EHR) (e.g., diagnosis codes specifying cause of death)
  • Diagnosis codes subject to public knowledge (EHR)
  • Codes indicating suppressed sex, sexuality, or gender types (EHR)
  • EHR structured concepts related to COVID-19 testing or diagnosis, including serology results
  • COVID-19 Participant Experience (COPE) survey items detailing COVID-19 and flu tests, diagnoses, and symptoms

Controlled Tier:

Like the Registered Tier, the Controlled Tier will exclude all direct identifiers. The data privacy rules currently in place for the Registered Tier will be modified for the Controlled Tier as follows:

  • Unstructured text, including free-text survey responses and clinical documents, will be suppressed in the Beta release of the Controlled Tier. Future releases of the Controlled Tier may include extracts of clinical documents or unstructured text after privacy-protecting transformations such as mapping to standard concepts.
  • Many of the transformations applied to demographic variables in the Registered Tier will be removed from the Controlled Tier curated data repository (CDR), enabling access to more granular demographic information.
  • Genomic data currently excluded from the Registered Tier will be included in the Controlled Tier CDR. All direct identifiers and other identifying information are removed, and research_id is used as the sample key for both array and whole genome sequence data.
  • COVID-19 data (e.g., ICD codes and tests indicative of COVID-19 status), currently suppressed in the Registered Tier, will be included in the Controlled Tier.

Date transformation:

Registered Tier:

  • All dates are shifted backward by a random number of days between 1 and 365
    • The shift is constant for each participant, so the order of and intervals between that participant's events are preserved
    • Exception: COPE survey data are not date shifted under this scheme
    • All participants older than 89 are removed
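
The per-participant date shift described above can be illustrated with a short sketch. This is not the program's actual implementation, which the article does not describe. The class name DateShifter, the seed parameter, the in-memory offset store, and the assumption that the shift is measured in whole days are all hypothetical choices made for illustration.

```python
import random
from datetime import date, timedelta


class DateShifter:
    """Shift dates backward by a constant, per-participant number of days."""

    def __init__(self, seed=None):
        # The seed only makes the demo reproducible; it is an assumption,
        # not something stated in the privacy rules.
        self._rng = random.Random(seed)
        self._offsets = {}  # hypothetical store: participant id -> days

    def offset_for(self, participant_id):
        # Draw the offset once (1 to 365 inclusive) and reuse it afterwards,
        # so every date for this participant moves by the same amount.
        if participant_id not in self._offsets:
            self._offsets[participant_id] = self._rng.randint(1, 365)
        return self._offsets[participant_id]

    def shift(self, participant_id, event_date):
        # Subtracting the offset moves the date backward in time.
        return event_date - timedelta(days=self.offset_for(participant_id))


shifter = DateShifter(seed=42)
visit = date(2020, 3, 1)
followup = date(2020, 3, 15)
shifted_visit = shifter.shift("participant-1", visit)
shifted_followup = shifter.shift("participant-1", followup)

# Both events move by the same offset, so the 14-day gap is unchanged.
print(shifted_followup - shifted_visit == followup - visit)
```

Because the offset is drawn once and cached, any two events for the same participant keep their original spacing, while the true calendar dates are hidden.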

Controlled Tier:

  • Real (unshifted) dates of events will be available, and data from participants older than 89 will be included. The exception is date of birth, which is generalized to year of birth.

Demographic fields that are generalized:

Registered Tier:

Note: Race, ethnicity, sex, and gender data from EHR are excluded. In the Registered Tier, participant-provided information (PPI) from surveys is the only source for these fields. For additional information about generalization rules, see Sex, Gender, and Sexual Orientation Generalizations, Education and Employment Generalizations, and Race and Ethnicity Generalizations.

  • Race and ethnicity (less common races are grouped together and selections of two or more races are bundled together as "More than one population")
  • Sex at birth (grouped into "Male," "Female," and "Intersex, other sex, prefer not to answer, or skipped")
  • Gender identity (grouped into "Man," "Woman," and "Another gender, multiple genders, prefer not to answer, or skipped")
  • Sexual orientation (grouped into "Straight" and "Not Straight, prefer not to answer, or skipped")
  • Education (grouped into "College 4 years or more or advanced degree," "Some college," "High school graduate," and "Never attended school or only attended kindergarten/primary/middle school/some high school")
  • Employment (grouped into "Employed for wages or self-employed" and "Not currently employed for wages")
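
As a rough illustration of generalization, the sketch below collapses individual responses into the Registered Tier groups listed above. The group labels come from this article. The function names and the raw response values (such as "Intersex" or None for a skipped question) are hypothetical, because the article does not list the underlying survey answer codes.

```python
# Group labels are taken from the list above; raw response values are assumed.
SEX_AT_BIRTH_OTHER = "Intersex, other sex, prefer not to answer, or skipped"
ORIENTATION_OTHER = "Not Straight, prefer not to answer, or skipped"


def generalize_sex_at_birth(raw):
    """Keep "Male" and "Female"; fold every other response into one group."""
    return raw if raw in ("Male", "Female") else SEX_AT_BIRTH_OTHER


def generalize_sexual_orientation(raw):
    """Keep "Straight"; fold every other response into one group."""
    return "Straight" if raw == "Straight" else ORIENTATION_OTHER


# None stands in for a skipped question in this sketch.
for response in ("Female", "Intersex", None):
    print(generalize_sex_at_birth(response))
```

Folding rare responses, refusals, and skips into a single group means that a small category cannot be singled out in the released data.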

Controlled Tier:

  • Except as indicated in the summary table below, all demographic information will be released within the Controlled Tier dataset.

Overall summary 

The following table summarizes privacy rules for the Controlled Tier (compared to the current rules implemented in the Registered Tier): 

| Data Element | Registered Tier | Controlled Tier |
|---|---|---|
| Explicit identifiers | Suppress | Suppress |
| Free text fields in survey and unstructured clinical documents | Suppress | Suppress |
| Dates (of events) | Random shift: backward by a random number between 1 and 365 | As Collected (unshifted) |
| Date of Birth | Random shift: backward by a random number between 1 and 365 | Generalize to year of birth |
| Date of Death | Random shift: backward by a random number between 1 and 365 | As Collected (unshifted) |
| Data of participants age >89 | Suppress | As Collected |
| Geolocation | Generalize to US state | Generalize to first 3 digits of zip code |
| Marital status | As Collected | As Collected |
| Living situation (PPI survey: Where are you currently living?) | Suppress | As Collected |
| Own or rent | As Collected | As Collected |
| Higher level Race/Ethnicity (e.g., Asian, White, Black, MENA) | Generalize | As Collected |
| Race/Ethnicity subcategory (e.g., Hmong, Filipino, Caribbean) | Suppress | Suppress |
| Sex at birth (PPI)* | Generalize | As Collected* (includes all branching logic questions) |
| Gender identity (PPI) | Generalize | As Collected* (includes all branching logic questions) |
| Sexual orientation (PPI) | Generalize | As Collected* (includes all branching logic questions) |
| Race/Ethnicity (EHR) | Suppress (value from EHR is suppressed to harmonize with PPI data) | As Collected |
| Sex/Gender (EHR) | Suppress (value from EHR is suppressed to harmonize with PPI data) | As Collected |
| ICD codes indicative of suppressed sex/gender (list of codes here) | Suppress | As Collected |
| Education | Generalize | As Collected |
| Employment status | Generalize | As Collected |
| Annual household income | As Collected | As Collected |
| Death cause (i.e., death cause noted in the EHR, including relevant diagnosis codes) | Suppress | As Collected |
| Diagnosis codes subject to public knowledge (list of codes here) | Suppress | As Collected |
| ICD codes indicative of motor vehicle accidents (ICD9 E80*-E84*, ICD10 V*) | Suppress | Suppress |
| Active duty military status | Suppress | As Collected |
| Born in US or not | As Collected | As Collected* |
| Genomic data (includes program-generated Whole Genome Sequencing and Array data) | Suppress | As Collected |

Note: ‘As Collected’ indicates that there will be no change to the data for the purpose of privacy protection 

*Free text responses will be suppressed. 
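
One way to read the summary table is as a lookup from data element and tier to an action. The sketch below encodes a few rows that way. It is an illustration only: the field names, the RULES dictionary, the apply_rules function, the sample record, and the choice to withhold any field without a rule are all assumptions, not part of the All of Us curation pipeline. It also assumes the record stores the state and the zip code as separate fields, so that "generalize to US state" can be shown by releasing only the state in the Registered Tier.

```python
def as_collected(value):
    """Release the value unchanged."""
    return value


def first_three_zip_digits(value):
    """Generalize a zip code (kept as a string) to its first 3 digits."""
    return value[:3]


# None means "suppress". Only a few rows of the table are encoded here.
RULES = {
    "medical_record_number": {"registered": None, "controlled": None},
    "state": {"registered": as_collected, "controlled": as_collected},
    "zip_code": {"registered": None, "controlled": first_three_zip_digits},
    "active_duty_military_status": {"registered": None, "controlled": as_collected},
    "annual_household_income": {"registered": as_collected, "controlled": as_collected},
}


def apply_rules(record, tier):
    """Return the fields of record that the given tier would release."""
    released = {}
    for field, value in record.items():
        # Fields without a rule are withheld (a cautious default assumed here).
        transform = RULES.get(field, {}).get(tier)
        if transform is None:
            continue
        released[field] = transform(value)
    return released


record = {
    "medical_record_number": "MRN-001",  # made-up sample values
    "state": "MA",
    "zip_code": "02139",
    "active_duty_military_status": "No",
    "annual_household_income": "50k-75k",
}
registered_view = apply_rules(record, "registered")
controlled_view = apply_rules(record, "controlled")
print(registered_view)
print(controlled_view)
```

Keeping zip codes as strings preserves leading zeros, so "02139" generalizes to "021" rather than "213".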
