Exploring Automatic Synonym Generation for Lexical Simplification of Swedish Electronic Health Records

This is a Master's thesis from Linköpings universitet / Institutionen för hälsa, medicin och vård

Abstract: Electronic health records (EHRs) are used in Sweden's healthcare system to store patients' medical information. Patients in Sweden have the right to access and read their health records, but the language used in EHRs is complex and presents a challenge for readers who lack medical knowledge. Simplifying this language could facilitate the transfer of information between medical staff and patients.

This project investigates the possibility of automatically generating Swedish medical synonyms. These synonyms are intended for use in future lexical simplification systems that can enhance the readability of Swedish EHRs and simplify medical terminology. Publicly available Swedish corpora that provide synonyms for medical terminology are too small to be used in a lexical simplification system. To overcome this obstacle, machine learning models are trained to generate synonyms and terms that convey medical concepts in a more understandable way. As a foundation for analysing complex medical terms, a simple mechanism for Complex Word Identification (CWI) is implemented; it relies on matching strings and substrings against a pre-existing corpus of hand-curated Swedish medical terms. To find a suitable strategy for automatic synonym generation, seven machine learning models, three based on BERT and four based on Word2Vec, are queried for synonym suggestions for 50 complex sample terms. To explore the effect of different input data, the models are trained on datasets of varying size. For each model, results for the 50 sample terms are generated, and raters with medical knowledge assess whether the automatically generated suggestions can be considered synonyms.

The results vary between the models and appear to be connected to the amount and quality of the data they were trained on. Furthermore, the raters show considerable disagreement, revealing how complex and subjective the task of finding suitable and widely accepted medical synonyms is. The method and models applied in this project do not yield a stable source of suitable synonyms. The chosen BERT approach, based on Masked Language Modelling, cannot reliably generate suitable synonyms because it is limited to producing a single token per synonym suggestion. The Word2Vec models show weaknesses because they do not take sentence context into account. Although the current performance of our models in generating automatic synonym suggestions is not entirely satisfactory, we observed a promising number of accurate suggestions. This gives us reason to believe that, with enhanced training and a larger amount of Swedish medical text as input data, the models could be improved and eventually applied effectively.
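To make the two core ingredients described in the abstract more concrete, the sketches below illustrate (a) the string/substring-matching CWI step and (b) the masked-language-modelling idea behind the BERT-based synonym suggestions. They are illustrative only, not the thesis's actual implementation: the file path, example terms, and the choice of the Swedish BERT model "KB/bert-base-swedish-cased" are assumptions.

```python
# Minimal sketch of CWI by matching strings/substrings against a hand-curated term list.
# The corpus path and the example terms are hypothetical, for illustration only.

def load_medical_terms(path: str) -> set[str]:
    """Load hand-curated Swedish medical terms, one per line, lowercased."""
    with open(path, encoding="utf-8") as f:
        return {line.strip().lower() for line in f if line.strip()}

def identify_complex_words(text: str, medical_terms: set[str]) -> list[str]:
    """Flag tokens that match a medical term exactly or contain one as a substring."""
    complex_words = []
    for token in text.lower().split():
        token = token.strip(".,;:()")
        if token in medical_terms or any(term in token for term in medical_terms if len(term) > 4):
            complex_words.append(token)
    return complex_words

if __name__ == "__main__":
    # Inline example terms instead of loading a real corpus file.
    terms = {"trombos", "hypertoni", "dyspné"}
    note = "Patienten har djup ventrombos och uttalad dyspné."
    print(identify_complex_words(note, terms))  # ['ventrombos', 'dyspné']
```

The masked-language-modelling approach masks the complex term in its sentence and lets a Swedish BERT model propose replacements; as noted in the abstract, this yields only one token per suggestion, which is one of the limitations observed.

```python
# Hedged sketch of synonym suggestion via fill-mask with a Swedish BERT model.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="KB/bert-base-swedish-cased")

# The complex term is replaced by the mask token; the model ranks single-token candidates.
sentence = "Patienten uppvisar tecken på [MASK] i vänster ben."
for suggestion in fill_mask(sentence, top_k=5):
    print(f"{suggestion['token_str']:15s} {suggestion['score']:.3f}")
```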
