Choosing the most reasonable split of a compound word using Wikipedia

Detta är en Master-uppsats från KTH/Skolan för datavetenskap och kommunikation (CSC)

Författare: Yvonne Le; [2017]

Nyckelord: Compound splitting compounding;

Sammanfattning: The purpose of this master thesis is to make use of the category taxonomy of Wikipedia to determine the most reasonable split from the suggestions generated by an independent compound word splitter. The articles a word was found in can be seen as a group of contexts the word can occur in and also different representations of the word, i.e. an article is a representation of the word. Instead of only analysing the data of each single article, the intention is to find more data for each representation/context to perform an analysis on. The idea is to expand each article representing one context by including related articles in the same category. Two perceptions of a ”reasonable split” was studied. The first case was a split consisting of only two parts and the second case of unlimited parts. This approach is well-suited for choosing the correct split out of a several suggestions but unsuitable for identifying compound words. It would more often than not decide to not split a compound word. It is very dependant on the compound words appearing in Wikipedia.

  HÄR KAN DU HÄMTA UPPSATSEN I FULLTEXT. (följ länken till nästa sida)