IŻ SWÓJ JĘZYK MAJĄ! An exploration of the computational methods for identifying language variation in Polish

This is a Master's thesis from the University of Gothenburg / Department of Philosophy, Linguistics and Theory of Science

Abstract: Computational approaches to language variation continue to contribute to various fields, including Natural Language Processing (NLP) and linguistics. Being able to accommodate variation within natural language increases the robustness of NLP models and their usefulness in real-life applications; at the same time, detecting and describing variation and the trends that govern it is one of the main goals of sociolinguistics and historical linguistics, so some of the advances in NLP can contribute to these fields as well. As quantitative and corpus research appears to be one of the current trends in historical linguistics, the need for annotated historical data is becoming increasingly apparent. In this thesis, several tools and methods are tested for their ability to detect variation between a manually annotated sample of non-standard historical Polish on the one hand and corpora of modern Polish, and tools based on those corpora, on the other. The experiments include part-of-speech tagging with two tagsets, lemmatization, vocabulary comparisons, and an n-gram analysis. The results reveal what kinds of variation each approach can discover in the text and to what extent. Since the majority of the presented methods require the data to be annotated, they would be time- and resource-consuming if applied to larger corpora; nevertheless, they reveal certain trends in variation, as well as what kind of preprocessing this sort of data may need before it can be annotated automatically. Such automatic annotation could enable the creation of a larger corpus and thereby facilitate further research. Additionally, a comparison of the tagging and lemmatizing performance of various tools on modern Polish is presented, and the annotated historical text itself constitutes a relevant contribution as well.
