An End-to-End Native Language Identification Model without the Need for Manual Annotation

This is a Master's thesis from Uppsala universitet / Institutionen för lingvistik och filologi (Uppsala University, Department of Linguistics and Philology)

Abstract: Native language identification (NLI) is a classification task that identifies the mother tongue of a language learner from spoken or written material. The task gained popularity when it was featured at the BEA-12 workshop in 2017, and since then NLI has found successful applications ranging from language learning to authorship identification and forensic science. While a considerable amount of research has already been done in this area, we introduce a novel approach that incorporates syntactic information into a BERT-based NLI model. In addition, we train separate models to test whether erroneous input sequences perform better than corrected sequences. To answer these questions we carry out both a quantitative and a qualitative analysis. We also test our idea of implementing a BERT-based GEC model to supply additional training data to our NLI model without the need for manual annotation. Our results suggest that our models do not outperform the SVM baseline, but we attribute this to the small amount of training data in our dataset: transformer-based architectures like BERT need large amounts of data to be fine-tuned successfully, whereas simple linear models like SVM perform well on small datasets. We also find that erroneous structures in the data prove useful when combined with syntactic information, but neither boosts the performance of the NLI model on its own. Furthermore, our implemented GEC system performs well enough to produce additional data for our NLI models, whose scores increase after adding the data resulting from our second experiment. We believe that our proposed architecture is potentially suitable for the NLI task if it is extended in the ways we suggest in the conclusion section.
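The abstract gives no implementation details, so purely as an illustration of the general setup it describes (fine-tuning BERT as a classifier over learner texts), here is a minimal sketch using the Hugging Face transformers library. The model checkpoint, label set, example sentences, and hyperparameters are all assumptions for the sketch, not details taken from the thesis, and the syntactic features and GEC augmentation discussed above are not shown.

```python
# Minimal sketch (not the thesis implementation): fine-tune a BERT
# sequence classifier to predict a learner's native language (L1)
# from an English text. Checkpoint, labels, and data are placeholders.
import torch
from torch.utils.data import Dataset
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          Trainer, TrainingArguments)

L1_LABELS = ["German", "Italian", "Japanese"]          # hypothetical label set
LABEL2ID = {lang: i for i, lang in enumerate(L1_LABELS)}

class NLIDataset(Dataset):
    """Wraps (text, L1 label) pairs as tokenized tensors."""
    def __init__(self, texts, labels, tokenizer, max_len=256):
        self.enc = tokenizer(texts, truncation=True, padding="max_length",
                             max_length=max_len, return_tensors="pt")
        self.labels = torch.tensor([LABEL2ID[l] for l in labels])
    def __len__(self):
        return len(self.labels)
    def __getitem__(self, i):
        item = {k: v[i] for k, v in self.enc.items()}
        item["labels"] = self.labels[i]
        return item

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-cased", num_labels=len(L1_LABELS))

# Tiny toy corpus purely for illustration; a real NLI corpus (e.g. learner
# essays with known L1) would be loaded here instead.
train_texts = ["I am agree with this opinion .",
               "He explained me the problem very good .",
               "Yesterday I have eaten sushi with my friend ."]
train_labels = ["Italian", "German", "Japanese"]

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="nli-bert", num_train_epochs=1,
                           per_device_train_batch_size=2),
    train_dataset=NLIDataset(train_texts, train_labels, tokenizer),
)
trainer.train()
```

In this kind of setup, the erroneous-versus-corrected comparison mentioned in the abstract would amount to training the same classifier once on original learner sentences and once on their corrected counterparts, and the GEC-based augmentation would add automatically corrected (or re-corrupted) sequences to the training set.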
