Data Augmentation: Enhancing Named Entity Recognition Performance on Swedish Medical Texts

This is a Master's thesis from the University of Gothenburg / Department of Computer Science and Engineering

Abstract: Named Entity Recognition (NER) is the task of locating and classifying named entities within text sequences. In the medical domain, it supports applications such as de-identifying patient records and extracting data for downstream tasks. Building a highly reliable system is challenging, however, particularly for low-resource languages such as Swedish, where far less text data is accessible than for larger languages. Data augmentation, in which new data samples are generated artificially, has emerged as a promising way to address this challenge. This study explores various BERT models and data augmentation techniques to identify the best-performing method for NER on Swedish patient records from Karolinska University Hospital, namely the Stockholm EPR PHI Pseudo Corpus. Our findings show that SweDeClin-BERT was the best-performing model, yielding the highest F1 score. We also demonstrate that data augmentation can further improve performance, especially for smaller datasets: applying data augmentation to 50% of the training data gives results comparable to using 100% of the original training data without augmentation.
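As an illustration of what generating new samples can look like for NER, the sketch below implements mention replacement, a common augmentation technique in which an entity mention is swapped for another mention of the same type seen elsewhere in the training data. The abstract does not say which techniques the thesis evaluates, so this is an assumption chosen for illustration, not the authors' method. The function names, the BIO tag set (PER, LOC) and the toy sentences are all hypothetical and are not taken from the Stockholm EPR PHI Pseudo Corpus.

```python
import random
from collections import defaultdict


def spans(tags):
    """Return (start, end, label) for each BIO entity span; end is exclusive."""
    result, start, label = [], None, None
    # A trailing "O" sentinel closes a span that runs to the end of the sentence.
    for i, tag in enumerate(list(tags) + ["O"]):
        if start is not None and tag == "I-" + label:
            continue  # still inside the open span
        if start is not None:
            result.append((start, i, label))
            start, label = None, None
        if tag.startswith("B-"):
            start, label = i, tag[2:]
    return result


def collect_mentions(sentences):
    """Map each entity label to the mentions (token lists) seen in the data."""
    mentions = defaultdict(list)
    for tokens, tags in sentences:
        for start, end, label in spans(tags):
            mentions[label].append(tokens[start:end])
    return mentions


def replace_mentions(tokens, tags, mentions, rng, p=0.5):
    """Copy a sentence, swapping each entity with probability p for a
    randomly drawn mention of the same label; tags are rebuilt to match."""
    new_tokens, new_tags, cursor = [], [], 0
    for start, end, label in spans(tags):
        new_tokens += tokens[cursor:start]
        new_tags += tags[cursor:start]
        candidates = mentions.get(label)
        if candidates and rng.random() < p:
            mention = rng.choice(candidates)
        else:
            mention = tokens[start:end]
        new_tokens += mention
        new_tags += ["B-" + label] + ["I-" + label] * (len(mention) - 1)
        cursor = end
    new_tokens += tokens[cursor:]
    new_tags += tags[cursor:]
    return new_tokens, new_tags


# Hypothetical toy data, invented for this example only.
toy_data = [
    ("Anna Karlsson besökte Solna".split(), ["B-PER", "I-PER", "O", "B-LOC"]),
    ("Erik bor i Västra Frölunda".split(), ["B-PER", "O", "O", "B-LOC", "I-LOC"]),
]

pool = collect_mentions(toy_data)
rng = random.Random(0)  # fixed seed so runs are repeatable
for tokens, tags in toy_data:
    aug_tokens, aug_tags = replace_mentions(tokens, tags, pool, rng, p=1.0)
    print(list(zip(aug_tokens, aug_tags)))
```

Because each replacement keeps the entity label and rebuilds the B-/I- tags to fit the new mention length, the augmented sentences stay correctly labelled and can be appended to the training set without manual annotation.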
