Low-Resource Domain Adaptation for Jihadi Discourse: Tackling Low-Resource Domain Adaptation for Neural Machine Translation Using Real and Synthetic Data

This is a Master's thesis from Uppsala universitet/Institutionen för lingvistik och filologi

Author: Thea Tollersrud; [2023]

Keywords: machine translation; domain adaptation

Abstract: In this thesis, I explore the problem of low-resource domain adaptation for jihadi discourse. Because annotated parallel data is scarce, developing accurate and effective translation models in this domain is challenging. To address this issue, I propose a method that leverages a small, manually created in-domain corpus and a synthetic corpus created from monolingual data using back-translation. I evaluate the approach by fine-tuning a pre-trained language model on different proportions of real and synthetic data and measuring its performance on a held-out test set. My experiments show that fine-tuning a model on one-fifth real parallel data and synthetic parallel data effectively reduces occurrences of over-translation and bolsters the model's ability to translate in-domain terminology. My findings suggest that synthetic data can be a valuable resource for low-resource domain adaptation, especially when real parallel data is difficult to obtain. The proposed method can be extended to other low-resource domains where annotated data is scarce, potentially leading to more accurate models and better translation in these domains.
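
As an illustration only, the data preparation described in the abstract could be sketched as follows. This is not the thesis code. The function names (`back_translate`, `mix_corpora`), the `reverse_translate` callable, and the reading of "one-fifth real" as a 20% share of the training set are all assumptions made for this example.

```python
import random
from typing import Callable, List, Tuple

Pair = Tuple[str, str]  # (source sentence, target sentence)


def back_translate(
    target_sentences: List[str],
    reverse_translate: Callable[[str], str],
) -> List[Pair]:
    """Build synthetic pairs from target-language monolingual text.

    `reverse_translate` is a hypothetical stand-in for any target-to-source
    translation model. The source side is machine-generated, while the
    target side is the original human-written sentence.
    """
    return [(reverse_translate(t), t) for t in target_sentences]


def mix_corpora(
    real_pairs: List[Pair],
    synthetic_pairs: List[Pair],
    real_fraction: float,
    total: int,
    seed: int = 0,
) -> List[Pair]:
    """Sample `total` training pairs with the given share of real data."""
    if not 0.0 <= real_fraction <= 1.0:
        raise ValueError("real_fraction must be between 0 and 1")
    n_real = round(total * real_fraction)
    n_synth = total - n_real
    if n_real > len(real_pairs) or n_synth > len(synthetic_pairs):
        raise ValueError("not enough pairs for the requested mix")
    rng = random.Random(seed)  # fixed seed keeps the mix reproducible
    mixed = rng.sample(real_pairs, n_real) + rng.sample(synthetic_pairs, n_synth)
    rng.shuffle(mixed)
    return mixed


# Toy data; a real setup would load corpora and call an actual reverse model.
real = [(f"real src {i}", f"real tgt {i}") for i in range(5)]
mono = [f"mono tgt {i}" for i in range(20)]
synthetic = back_translate(mono, lambda t: "bt:" + t)

# Assumed reading of "one-fifth real": 20% real pairs, 80% synthetic pairs.
train = mix_corpora(real, synthetic, real_fraction=0.2, total=10)
```

Varying `real_fraction` across runs would produce the different proportions of real and synthetic data that the abstract mentions; the resulting `train` list would then be passed to whatever fine-tuning routine is in use.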
