Overview
In the All of Us Research Program, participants may choose to share their electronic health records (EHR) for research, either through healthcare provider organizations or directly via participant portals. These records are standardized using the OMOP Common Data Model and made available in the Researcher Workbench as structured EHR tables (e.g., condition_occurrence, observation, measurement).
Clinical notes, the narrative entries within EHRs, offer rich, unstructured data that can add insight beyond structured fields. To harness this information, the CLAMP (Clinical Language Annotation, Modeling, and Processing) tool is used to extract medical concepts such as procedures, medications, and lab results with high accuracy. For this release, only extracted condition entities are retained: they are mapped to OMOP concept IDs and stored in the “Note_NLP” table as “NLP-derived” concepts, forming the clinical notes dataset within the Curated Data Repository (CDR). Clinical notes were derived for CDR v9 in both the Registered Tier and the Controlled Tier. Learn more in the Data Dictionaries and Release Notes.
Methodology
CLAMP (Clinical Language Annotation, Modeling, and Processing) is a natural language processing (NLP) tool tailored for clinical text analysis, capable of extracting problems, procedures, medications, and lab results with high accuracy. It has been widely adopted in clinical NLP pipelines that generate structured data from unstructured clinical notes, and it is used to generate the clinical notes dataset for the CDR. The extracted concepts are stored in the Note_NLP table and mapped to the OMOP CDM as “NLP-derived” concepts.
CLAMP was used to extract entities (lexical variants/snippets) from unstructured clinical notes across the multiple sites that participated in the notes initiative; these entities were then mapped to structured OMOP concept IDs. A two-phase data quality check (QC) was conducted to evaluate the performance of the NLP process. Only condition concepts are provided in this release, in accordance with our privacy methodology.
Data QC
Phase 1 - Assessing CLAMP’s named entity recognition (NER) performance
Phase 1 compared CLAMP's output against manual annotation performed by independent clinical experts. We randomly selected 459 notes from three clinical sites for annotation, limited to notes containing at least 250 characters so that each note was substantial enough for manual review. Annotators readjusted or manually corrected the entities (diagnosis, problem, treatment, drug, and test) and modifiers selected by CLAMP, and the corrected annotations were also used to train the model to improve CLAMP's accuracy and performance. Each note was annotated by at least one clinical annotator.
The annotation was performed by four annotators with strong backgrounds in biomedical sciences, including clinical knowledge and experience in medicine, dentistry, and physical therapy. All annotators underwent rigorous training and consistently achieved an agreement F1-score of at least 0.80 with the gold standard set by the lead annotator.
The manually annotated notes were then used to evaluate CLAMP's extraction performance by calculating precision, recall, and F1-scores. CLAMP showed high accuracy (Table 1) under both strict (i.e., exact match) and relaxed (i.e., allowing a difference of up to two characters in the ending offset) matching criteria: overall strict precision was 0.867, recall 0.829, and F1 0.848, and overall relaxed precision was 0.915, recall 0.875, and F1 0.894.
| Sites* | Number of Notes | Strict Precision | Strict Recall | Strict F-score | Relaxed Precision | Relaxed Recall | Relaxed F-score |
| Overall | 459 | 0.867 | 0.829 | 0.848 | 0.915 | 0.875 | 0.894 |
| Site 1 | 50 | 0.822 | 0.782 | 0.802 | 0.897 | 0.853 | 0.875 |
| Site 2 | 190 | 0.895 | 0.859 | 0.876 | 0.923 | 0.886 | 0.904 |
| Site 3 | 219 | 0.855 | 0.815 | 0.834 | 0.912 | 0.869 | 0.890 |
Please note: the raw notes used for the Phase 1 evaluation are a random sample of notes meeting the selection criterion (at least 250 characters) and may or may not be part of the actual CDR release.
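To illustrate how strict and relaxed matching lead to different scores, here is a minimal Python sketch. It is not the evaluation code used for Table 1. The `Entity` structure, the function names, and the toy offsets are hypothetical, and the sketch assumes that a match also requires the same entity label and the same starting offset, which the description above does not state.

```python
from typing import NamedTuple

class Entity(NamedTuple):
    start: int   # character offset where the entity begins
    end: int     # character offset where the entity ends
    label: str   # entity type, e.g. "problem"

def is_match(pred, gold, end_tolerance=0):
    """Assumed rule: same label and start; end offsets differ by at most end_tolerance."""
    return (pred.label == gold.label
            and pred.start == gold.start
            and abs(pred.end - gold.end) <= end_tolerance)

def score(predicted, gold, end_tolerance=0):
    """Return (precision, recall, F1), using each gold entity at most once."""
    unmatched = list(gold)
    true_pos = 0
    for pred in predicted:
        hit = next((g for g in unmatched if is_match(pred, g, end_tolerance)), None)
        if hit is not None:
            unmatched.remove(hit)
            true_pos += 1
    precision = true_pos / len(predicted) if predicted else 0.0
    recall = true_pos / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1

# Toy data (not from the evaluation set): the second prediction ends two
# characters late, so it counts only under relaxed matching.
gold = [Entity(0, 12, "problem"), Entity(20, 29, "drug"), Entity(40, 48, "test")]
pred = [Entity(0, 12, "problem"), Entity(20, 31, "drug"), Entity(60, 65, "test")]

strict = score(pred, gold)                    # exact match on the end offset
relaxed = score(pred, gold, end_tolerance=2)  # up to two characters of slack
```

In this toy example, one of three predictions matches under strict criteria and two of three match under relaxed criteria, which mirrors the direction of the gap between the strict and relaxed columns in Table 1.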
Phase 2 - Assessing the accuracy of mapping lexical variants
Phase 2 QC focused on assessing the accuracy of mapping lexical variants (i.e., the raw text extracted by the CLAMP NLP tool) to OMOP concept names using various embedding models. We first evaluated CLAMP's built-in mapping tool, which is based on string edit distance, for mapping lexical variants to OMOP concepts. Exploratory evaluation showed that this mapping was not ideal. For example, “axilla tissue” was incorrectly mapped to “animal tissue,” and “biopsy proven splenic tissue” was incorrectly mapped to “biopsy of perirenal tissue.” We therefore decided to use the latest embedding models to map each NLP-extracted lexical variant to the most appropriate OMOP concept.
We implemented and evaluated nine embedding models (MedEmbed [Small/Large], UF NLP, MiniLM-L6-v2, Snowflake Embed-S, GTE-small, E5-small-v2, UmlsBERT, and Qwen3). After removing snippets containing only numbers or special characters (N=533,750 terms), each of the nine embedding models generated an OMOP concept and a semantic similarity score. Non-standard concepts were mapped to standard concepts before analysis. We then defined the “popularity vote” as the largest number of models that mapped a lexical variant to the same concept, and tiered mappings by the number of models in agreement.
A gold standard set of 800 term–concept pairs was manually annotated by two expert biomedical reviewers, with adjudication by a third in case of disagreement. The set was stratified across vote tiers (i.e., the number of models that mapped an extracted entity to the same concept) to assess mapping accuracy at different levels of model agreement.
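The popularity vote can be sketched in a few lines of Python. This is an illustration rather than the production pipeline: the concept IDs are placeholders, the model keys are informal labels for the nine embedders, and tie-breaking is an assumption because the text does not describe how ties are resolved.

```python
from collections import Counter

def popularity_vote(model_concepts):
    """Return (winning concept ID, number of models that chose it).

    Ties go to the concept seen first, which is an assumption of this sketch.
    """
    counts = Counter(model_concepts.values())
    concept_id, votes = counts.most_common(1)[0]
    return concept_id, votes

# Placeholder concept IDs (1001, 1002, 1003) for one lexical variant.
model_concepts = {
    "MedEmbed-Small": 1001,
    "MedEmbed-Large": 1001,
    "UF NLP": 1001,
    "MiniLM-L6-v2": 1002,
    "Snowflake Embed-S": 1001,
    "GTE-small": 1002,
    "E5-small-v2": 1001,
    "UmlsBERT": 1003,
    "Qwen3": 1001,
}

concept_id, votes = popularity_vote(model_concepts)
# Six of the nine models agree on concept 1001, so this variant falls in vote tier 6.
```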
Two strategies were evaluated: (1) Popularity voting, which selects the concept with the most model agreement across the nine embedding models; and (2) Weighted method, in which logistic regression models trained on normalized z-scores of semantic similarity scores from all nine embedders, weighted by individual model accuracy, compute the probability of a correct mapping.
Precision, recall, and F1 scores were computed at multiple thresholds to identify a balance between accuracy and dataset coverage. The voting method reached a good balance of precision, recall, and F1 score at a threshold of at least 5 models in agreement, while the weighted method yielded its optimal F1 score at a probability threshold of ≥0.45 (Table 2). Combining both methods (vote count ≥5 and weighted probability ≥0.45) provides high-confidence mappings while minimally reducing data volume. This overlap strategy enhances concept accuracy where no gold standard is available. Based on these results (Table 2), the recommended release criteria require note_nlp concepts to meet both vote count ≥5 and weighted confidence ≥0.45, leveraging the strengths of both methods to ensure high semantic alignment of NLP-derived OMOP condition concepts.
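As a rough illustration of the weighted method's scoring step, the sketch below z-normalizes each model's similarity scores and passes them through a logistic function. Everything specific here is an assumption: it uses three stand-in models instead of nine, toy similarity scores, and hand-picked coefficients in place of coefficients fitted on the gold standard. It also omits the weighting by individual model accuracy, since the text does not specify how that weighting is applied.

```python
import math
from statistics import mean, pstdev

def z_scores(values):
    """Standardize one model's similarity scores to mean 0 and standard deviation 1."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma if sigma else 0.0 for v in values]

def mapping_probability(z_row, coefficients, intercept):
    """Logistic function applied to a weighted sum of per-model z-scores."""
    linear = intercept + sum(c * z for c, z in zip(coefficients, z_row))
    return 1.0 / (1.0 + math.exp(-linear))

# Toy similarity scores: one list per model, one entry per lexical variant.
similarity = {
    "model_a": [0.91, 0.42, 0.77],
    "model_b": [0.88, 0.35, 0.80],
    "model_c": [0.95, 0.50, 0.60],
}

z_by_model = [z_scores(scores) for scores in similarity.values()]
z_rows = list(zip(*z_by_model))  # regroup so each row holds one variant's z-scores

# Hypothetical coefficients; in practice they would be learned by logistic
# regression on the manually annotated term-concept pairs.
coefficients = [0.8, 0.6, 0.7]
intercept = 0.0

probabilities = [mapping_probability(row, coefficients, intercept) for row in z_rows]
```

A variant that scores above average on every model receives a probability above 0.5, and one that scores below average on every model receives a probability below 0.5.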
| Method | Threshold Applied | Precision | Recall | F1 | Notes Retained* | Participants Retained* |
| Popularity Voting | Vote count ≥5 | 0.847 | 1.000 | 0.917 | 9,727,311 | 107,912 |
| Weighted Method | Confidence ≥0.45 | 0.870 | 0.956 | 0.910 | 11,805,784 | 108,307 |
| Combined (Final Filter) | Vote count ≥5 and Confidence ≥0.45 | — | — | — | 9,647,500 | 107,729 |
Please note: the data used to evaluate and compute metrics for Phase 2 is a base/raw dataset (i.e., before privacy filtering and cleaning), which is not exactly the same as the researcher-facing release data. Therefore, the participant and note counts should not be used to estimate the data available in the Curated Data Repository (CDR).
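The combined release filter amounts to requiring both thresholds at once. The following Python sketch shows that logic on toy records. The `Mapping` fields and the example values are hypothetical and are not column names from the Note_NLP table.

```python
from dataclasses import dataclass

@dataclass
class Mapping:
    lexical_variant: str   # raw text extracted from the note
    concept_id: int        # placeholder OMOP concept ID
    vote_count: int        # number of embedding models agreeing on concept_id
    confidence: float      # probability from the weighted method

MIN_VOTES = 5          # popularity voting threshold from Table 2
MIN_CONFIDENCE = 0.45  # weighted method threshold from Table 2

def passes_release_filter(mapping):
    """A mapping is kept only when it meets both thresholds."""
    return (mapping.vote_count >= MIN_VOTES
            and mapping.confidence >= MIN_CONFIDENCE)

# Toy candidates (values invented for illustration).
candidates = [
    Mapping("variant A", 1001, 7, 0.82),  # meets both thresholds
    Mapping("variant B", 1002, 6, 0.30),  # enough votes, low confidence
    Mapping("variant C", 1003, 3, 0.91),  # high confidence, too few votes
    Mapping("variant D", 1004, 5, 0.45),  # exactly at both thresholds
]

released = [m for m in candidates if passes_release_filter(m)]
```

Both thresholds are inclusive, so a mapping with exactly 5 votes and a confidence of exactly 0.45 is retained. Researchers who want a stricter or looser subset for sensitivity analyses could adjust the two constants.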
QC Summary
This QC framework not only evaluates entity extraction accuracy and semantic mapping quality but also provides transparent metrics and methodology that give researchers clear insight into data provenance and reliability. By detailing the process and performance at each QC stage, this report enables researchers to determine the most appropriate way to use NLP-derived concepts for their own analyses, tailoring inclusion criteria or sensitivity analyses to match their specific study needs.