General Purpose Vector Representation for Swedish Documents: An application of Neural Language Models

This is a thesis for a professional degree at advanced level from Umeå universitet/Institutionen för fysik

Author: Simon Hedström; [2019]

Abstract: This thesis is a proof of concept for embedding Swedish documents as continuous vectors. These vectors can be used as input to any subsequent task and serve as an alternative to discrete bag-of-words vectors. The difference goes beyond fewer dimensions, as the continuous vectors also hold contextual information. This means that documents with no shared vocabulary can be directly identified as contextually similar, which is impossible with bag-of-words vectors. The continuous vectors are produced by neural language models together with algorithms that pool the model output into document-level representations. This thesis surveys the latest research on such models, starting from the Word2Vec algorithms. A wide variety of neural language models were selected, together with algorithms for pooling word and sentence vectors into document vectors. To train the neural language models, we assembled a training corpus spanning 1.2 billion Swedish words. The trained neural language models were then paired with pooling algorithms to produce an array of document vector models. The document vector models were evaluated on five classification tasks and compared against the baseline bag-of-words vectors. A few models that were trained directly on the evaluation data were also included as reference. For each evaluation task the setup was held constant, which ensured that any difference in performance came from the quality of the document representations. The results show that the continuous document vectors outperform the baseline on topic and text format classification tasks. The best performance was achieved when a document vector model was trained directly on the evaluation data. However, this result was only marginally better than that of the best general document vector models.
In conclusion, it was a successful proof of concept, but there are still improvements to be made, such as optimizing the composition of the training corpus. Due to its simplicity and overall performance, we recommend a general Sent2Vec model as a new baseline for future projects.
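
The contrast between bag-of-words vectors and pooled continuous vectors can be illustrated with a minimal sketch. It is not the thesis's implementation: the three-dimensional word vectors below are made up for illustration (a trained neural language model would supply real ones), and mean pooling is assumed here as one simple pooling choice, since the abstract does not name the pooling algorithms used. The helper names `bag_of_words`, `mean_pool`, and `cosine` are hypothetical.

```python
import math
from collections import Counter

# Hypothetical word vectors, invented for this example only.
# In practice they would come from a trained neural language model.
WORD_VECTORS = {
    "hund": [0.90, 0.10, 0.00],
    "springer": [0.80, 0.20, 0.10],
    "valp": [0.85, 0.15, 0.05],
    "leker": [0.75, 0.25, 0.10],
    "aktie": [0.00, 0.10, 0.90],
    "stiger": [0.10, 0.20, 0.80],
}


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def bag_of_words(tokens, vocabulary):
    """Discrete count vector with one dimension per vocabulary word."""
    counts = Counter(tokens)
    return [counts[word] for word in vocabulary]


def mean_pool(tokens, word_vectors):
    """Average the word vectors of all known tokens into one document vector."""
    vectors = [word_vectors[t] for t in tokens if t in word_vectors]
    if not vectors:
        raise ValueError("no known tokens in document")
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


doc_a = ["hund", "springer"]  # about a dog
doc_b = ["valp", "leker"]     # about a puppy, no words shared with doc_a
doc_c = ["aktie", "stiger"]   # about a stock, unrelated topic

vocabulary = sorted(WORD_VECTORS)

# Bag of words: disjoint vocabularies give zero similarity.
bow_sim_ab = cosine(bag_of_words(doc_a, vocabulary),
                    bag_of_words(doc_b, vocabulary))

# Pooled continuous vectors: related documents end up close together.
dense_sim_ab = cosine(mean_pool(doc_a, WORD_VECTORS),
                      mean_pool(doc_b, WORD_VECTORS))
dense_sim_ac = cosine(mean_pool(doc_a, WORD_VECTORS),
                      mean_pool(doc_c, WORD_VECTORS))

print(f"bag of words  a~b: {bow_sim_ab:.3f}")
print(f"mean pooling  a~b: {dense_sim_ab:.3f}")
print(f"mean pooling  a~c: {dense_sim_ac:.3f}")
```

With these toy vectors, the bag-of-words similarity between the two dog-related documents is zero because they share no words, while the pooled vectors place them closer to each other than to the unrelated document.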
