Diving into the tangle of unstructured clinical notes: Text mining to extract lifestyle classifications for prevention-related research

Problem description

Lifestyle data are key for prevention-related research. For example, they could be used to 1) identify individuals with an unhealthy lifestyle who are at high risk of developing (chronic) disease or of disease progression, and 2) examine the long-term effectiveness of lifestyle prescriptions or programs. However, lifestyle data are often recorded as free (unstructured) text in clinical notes. Extracting and interpreting these data is challenging because note-taking varies among clinicians. Furthermore, privacy and feasibility issues prevent researchers from going through individual patient records, which complicates extracting these data at a large scale. Recently, researchers at LUMC developed models based on Bidirectional Encoder Representations from Transformers (BERT) to classify smoking, alcohol, and drug use from Dutch free text in clinical notes from HagaZiekenhuis (Muizelaar et al., 2024). These state-of-the-art natural language processing (NLP) models allow efficient and valid extraction of lifestyle classifications on a large scale without violating patients’ privacy. More recently, large language models (LLMs) have been shown to match or outperform BERT-based models on several clinical information extraction tasks (Builtjes et al., 2025). However, they have not yet been used or validated for lifestyle data extraction.

Aim

This collaborative project between RTC AI and the Prevention Hub aims to 1) validate, optimize, and evaluate the performance of LLMs for classifying smoking, alcohol, and drug use, using clinical notes from Radboudumc and the RTC Health data General Practitioners (GP) database, and 2) explore extending these models to the classification of BMI, physical activity, and diet.

This project will also provide insight into the availability and usability of lifestyle data in clinical notes and how registration of lifestyle data could be improved.

Funding

This research project is funded by the Sectorplan and falls under the themes of "Data-driven innovation" and "Data-driven & AI".

People

Anushka Ashish Kore

Research Scientist RTC AI

Silvan Quax

Head RTC AI

 Esmée Bakker

Alina Vrieling

Assistant professor

Health Evidence, Radboudumc

 Sander Beekhuis
