As part of the curation process, some participant-identifiable information has been removed from the Registered and Controlled Tier datasets to protect participant privacy and reduce the risk of re-identification. This includes some participant geolocation data. Geolocation data may come from participant-provided information (PPI) or from electronic health records (EHR) data. In most cases, participants’ self-reported addresses are used as the primary source of geolocation data. Below is information about the geolocation fields available in the Registered and Controlled Tier datasets, respectively. For additional information on how All of Us protects participant privacy by means of data removal, suppression, or generalization, click here.
Geolocation Data Available within the Registered Tier
State-level data are the most granular geolocation available within the Registered Tier dataset and are subject to Registered Tier state generalization rules. Researchers will not be able to identify the specific All of Us site where a participant enrolled.
Participants’ state-level geolocation data are generalized, with the value_as_concept_id = 2000000011, in the Registered Tier dataset if any of the following conditions are met:
- A participant has EHR data available in the dataset and self-reported state of residence differs from the state of the health provider organization sharing the EHR data.
- The participant has reported a state of residence from which there are 200 or fewer total participants enrolled.
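The two conditions above can be read as a simple either/or check. The following is a minimal sketch of that logic, not the curation pipeline itself. The function name, its parameters, and the idea of passing in a per-state participant count are assumptions made for illustration; only the concept ID and the 200-participant threshold come from the text.

```python
from typing import Optional

GENERALIZED_STATE_CONCEPT_ID = 2000000011  # value given in the text above
MAX_SMALL_STATE_PARTICIPANTS = 200  # "200 or fewer" threshold from the text above


def is_state_generalized(
    self_reported_state: str,
    ehr_provider_state: Optional[str],
    participants_in_state: int,
) -> bool:
    """Hypothetical helper: True if either Registered Tier condition applies."""
    # Condition 1: EHR data exists and the provider's state differs
    # from the participant's self-reported state.
    ehr_mismatch = (
        ehr_provider_state is not None
        and ehr_provider_state != self_reported_state
    )
    # Condition 2: 200 or fewer participants are enrolled from that state.
    small_state = participants_in_state <= MAX_SMALL_STATE_PARTICIPANTS
    return ehr_mismatch or small_state


# Illustrative calls with invented inputs.
print(is_state_generalized("Ohio", "Ohio", 5000))
print(is_state_generalized("Ohio", "Texas", 5000))
print(is_state_generalized("Ohio", None, 150))
```

A participant with no EHR data (`ehr_provider_state=None`) can still be generalized through the second condition alone.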
The state location data are stored in the observation table, in all rows where observation_source_concept_id = 1585249. The state code is stored in the value_as_concept_id column of those rows.
To get the state name:
- Join the value_as_concept_id in the observation table with concept_id from the concept table.
- Use concept_name from the concept table to extract the state.
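The join described in the two steps above can be tried out locally before running anything against the real dataset. The sketch below uses an in-memory SQLite database as a stand-in for BigQuery. The table contents, including concept IDs 1 and 2 and the state names "AA" and "BB", are invented for illustration, and the real tables have many more columns.

```python
import sqlite3

STATE_SOURCE_CONCEPT_ID = 1585249  # from the text above

# Toy stand-ins for the observation and concept tables (hypothetical rows).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE observation (
    person_id INTEGER,
    observation_source_concept_id INTEGER,
    value_as_concept_id INTEGER
);
CREATE TABLE concept (concept_id INTEGER, concept_name TEXT);
INSERT INTO observation VALUES
    (101, 1585249, 1),
    (102, 1585249, 2),
    (103, 1585249, 1),
    (101, 999, 2);  -- unrelated row that the WHERE clause should drop
INSERT INTO concept VALUES (1, 'PII State: AA'), (2, 'PII State: BB');
""")

# Step 1: join value_as_concept_id to concept_id.
# Step 2: read the state from concept_name.
STATE_QUERY = """
SELECT o.person_id, c.concept_name AS state
FROM observation AS o
JOIN concept AS c ON o.value_as_concept_id = c.concept_id
WHERE o.observation_source_concept_id = ?
ORDER BY o.person_id
"""

rows = conn.execute(STATE_QUERY, (STATE_SOURCE_CONCEPT_ID,)).fetchall()
# Strip the 'PII State: ' prefix to leave only the state.
state_by_person = {pid: name.replace("PII State: ", "") for pid, name in rows}
print(state_by_person)
```

On the Workbench, the same join would presumably be written against `{dataset}.observation` and `{dataset}.concept` and run through BigQuery rather than SQLite.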
Coding examples for how to retrieve state location are below.
Example of code in Python: ({dataset} refers to the fc-aou-...R2019Q4...)
To load the dataset:
import pandas as pd
# %env is an IPython magic, so the next line works only in a Jupyter notebook
dataset = %env WORKSPACE_CDR
To query the count of participants by state:
query = """
SELECT
concept_name AS concept,
COUNT(*) AS nparticipant
FROM `{dataset}.person` AS p
LEFT JOIN `{dataset}.concept` ON {concept} = concept_id
GROUP BY 1
ORDER BY 2
"""
state_df = pd.read_gbq(query.format(dataset=dataset,
                                    concept='state_of_residence_concept_id'))
# Strip the 'PII State: ' prefix from the concept name to leave only the state
state_df['state_of_residence'] = state_df['concept'].str.replace('PII State: ', '')
state_df.head()
Example of code in R: ({dataset} refers to the fc-aou-...R2019Q4...)
To load the dataset:
library(bigrquery)
library(DBI)
library(tidyverse)
download_data <- function(query) {
  tb <- bq_project_query(Sys.getenv('GOOGLE_PROJECT'), query)
  bq_table_download(tb)
}
dataset <- Sys.getenv('WORKSPACE_CDR')
To query each participant's state of residence (one row per participant):
extract_demographic <- function(concept) {
  query <- str_glue("
    SELECT
      concept_name AS concept,
      person_id
    FROM
      `{dataset}.person` AS p
    LEFT JOIN
      `{dataset}.concept` ON {concept} = concept_id
  ")
  download_data(query)
}
Extract the data:
state_df <- extract_demographic('state_of_residence_concept_id') %>%
  mutate(state_of_residence = gsub('PII State: ', '', concept))
See what the df looks like:
head(state_df)
Geolocation Data Available within the Controlled Tier
Three-digit zip code data are the most granular geolocation data available within the Controlled Tier dataset and are subject to Controlled Tier state generalization rules. Where fewer than 20,000 participants are reported to reside in a three-digit zip code area, those participants are aggregated into another nearby three-digit zip code area.
We determined that three-digit zip code information isn’t available in the v6 CT CDR observation tables for zip codes starting with the number “0”. This was caused by an error in the curation rules that protect participant privacy in areas with fewer than 20,000 inhabitants. These zip codes are instead listed as "00000", as they would be if they came from regions with fewer than 20,000 inhabitants. This issue has been resolved for the v7 CT CDR. The issue doesn’t result in any improper sharing of participant information, but it prevents Workbench users from utilizing three-digit zip code information from the following states: Maine, Massachusetts, Connecticut, Rhode Island, New Hampshire, Vermont, and New Jersey. Our curation team confirmed that these data will be fixed in the upcoming winter release. Further questions about this can be sent to the Help Desk at support@researchallofus.org.
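When working with the v6 CT CDR, it may help to separate rows carrying the "00000" placeholder from rows with a usable three-digit zip code before analysis. The sketch below shows one way to do that in plain Python. The helper name is hypothetical, and the example rows (including the "123**" zip format and the state labels) are invented; the column names follow the zip code query shown later in this article.

```python
SUPPRESSED_ZIP = "00000"  # placeholder value described in the text above


def split_by_suppression(rows):
    """Hypothetical helper: split rows into (usable, suppressed) by Zip_code."""
    usable = [r for r in rows if r["Zip_code"] != SUPPRESSED_ZIP]
    suppressed = [r for r in rows if r["Zip_code"] == SUPPRESSED_ZIP]
    return usable, suppressed


# Invented rows shaped like the zip code query result.
example_rows = [
    {"State": "PII State: AA", "Zip_code": "123**", "n_participant": 40},
    {"State": "PII State: BB", "Zip_code": "00000", "n_participant": 25},
    {"State": "PII State: AA", "Zip_code": "00000", "n_participant": 30},
]

usable, suppressed = split_by_suppression(example_rows)
print(len(usable), len(suppressed))
```

Note that the placeholder alone cannot distinguish zip codes affected by the v6 issue from zip codes suppressed for low population, since both appear as "00000".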
Geolocation data in the Controlled Tier can be accessed via the Dataset Builder as a pre-packaged concept set or by directly querying in the Jupyter notebook. Note: Researchers will not be able to identify the specific All of Us site where a participant enrolled.
To access via the Dataset Builder, select the pre-packaged concept set "Zip Code Socioeconomic Status Data."
You can also access geolocation by zip code via the following query:
Example of code in Python:
To load the dataset:
import pandas as pd
# %env is an IPython magic, so the next line works only in a Jupyter notebook
cdr = %env WORKSPACE_CDR
To query the count of participants by state and zip code:
query = f'''
SELECT
    COUNT(DISTINCT person_id) AS n_participant,
    state_of_residence_source_value AS State,
    value_as_string AS Zip_code
FROM `{cdr}.person_ext`
JOIN `{cdr}.observation` USING(person_id)
WHERE observation_source_concept_id = 1585250
GROUP BY 2, 3
'''
To run the query and save the result in a dataframe:
df = pd.read_gbq(query)
df.head()
Example of code in R:
library(bigrquery)
library(DBI)
library(tidyverse)
To load the dataset:
cdr <- Sys.getenv('WORKSPACE_CDR')
To query the count of participants by state and zip code:
query <- str_glue("
  SELECT
    COUNT(DISTINCT person_id) AS n_participant,
    state_of_residence_source_value AS State,
    value_as_string AS Zip_code
  FROM `{cdr}.person_ext`
  INNER JOIN `{cdr}.observation` USING(person_id)
  WHERE observation_source_concept_id = 1585250
  GROUP BY 2, 3
")
To run the query and save the result in a data frame:
tb <- bq_project_query(Sys.getenv('GOOGLE_PROJECT'), query)
df <- bq_table_download(tb)
head(df)