How are data cleaned in the CDR?


Data in the CDR have been “cleaned” according to the following rules:


  • Rows for PPI questions with incorrect mapping are dropped and replaced with duplicate rows for those questions that have the correct mapping
  • '-1' values on the pain PPI question are set to PMI_Skip, because a -1 was recorded when the participant did not touch the slider for that question
  • PPI free-numeric values are rounded to the nearest integer
  • value_source_concept_id and value_as_concept_id for PPI free-numeric answers are set to null for consistency
  • Five PPI observation_source_concept_ids are dropped from the observation table because they either add no value or are incorrectly mapped
  • "Sex at birth" columns are added to the person table, and sex at birth PPI information is moved from the gender columns to these new columns so that gender PPI information can be stored in the gender columns
  • Measurement table inputs are cleaned by nulling value_as_number when the value is "9999999" or similar, or when all values in that field for a site are 0; by dropping duplicated rows; and by dropping rows where measurement_concept_id = 0
  • Height values from EHR and physical measurements are cleaned by identifying outliers and flagging those that are out of range, based on specific conditions present (or absent) in the participant's EHR
  • Units are standardized for a group of ~65 labs deemed "high-priority"
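
To make the measurement-table rule above more concrete, here is a minimal Python sketch of how such a cleaning pass could look. It is not the CDR pipeline's actual code. The row layout, the `site` field name, the `SENTINEL_VALUES` set (the article only says "9999999" or similar), and the order in which the steps are applied are all assumptions made for illustration.

```python
from collections import defaultdict

# Assumption: only 9999999 is treated as a placeholder here; the article
# says '"9999999" or similar' without listing the other values.
SENTINEL_VALUES = {9999999}

# Hypothetical row layout; the real measurement table has many more columns.
COLUMNS = ("person_id", "site", "measurement_concept_id", "value_as_number")


def clean_measurements(rows):
    """Return a cleaned copy of `rows` (a list of dicts keyed by COLUMNS)."""
    # Drop rows where measurement_concept_id = 0.
    kept = [dict(r) for r in rows if r["measurement_concept_id"] != 0]

    # Find sites whose non-null value_as_number entries are all 0.
    values_by_site = defaultdict(list)
    for r in kept:
        if r["value_as_number"] is not None:
            values_by_site[r["site"]].append(r["value_as_number"])
    all_zero_sites = {
        site for site, values in values_by_site.items()
        if all(v == 0 for v in values)
    }

    # Null placeholder values and values from all-zero sites.
    for r in kept:
        value = r["value_as_number"]
        if value is not None and (
            value in SENTINEL_VALUES or r["site"] in all_zero_sites
        ):
            r["value_as_number"] = None

    # Drop duplicated rows, keeping the first occurrence of each.
    seen = set()
    deduped = []
    for r in kept:
        key = tuple(r[c] for c in COLUMNS)
        if key not in seen:
            seen.add(key)
            deduped.append(r)
    return deduped
```

Calling `clean_measurements` on a list of row dictionaries returns new dictionaries and leaves the input untouched, so the raw and cleaned rows can be compared side by side.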
