Using word embeddings and domain specific data for information retrieval in the Swedish consumer health domain

This is a Master's thesis from KTH/School of Electrical Engineering and Computer Science (EECS)

Author: Cristian Osorio Bretti; [2020]


Abstract: The amount of information in most of the world’s systems is increasing rapidly. To find relevant information in this vast amount of data, good information retrieval (IR) algorithms are needed. IR algorithms can be constructed and tuned to perform well in different domains. One such domain is consumer health, in which regular people (consumers) search for medical information. One way to improve IR algorithms is to use word embeddings (WE). WE are vector representations of words constructed so that similar words have similar vectors. Previous research has shown that using WE in IR gives promising results. In this thesis, an IR algorithm based on WE is implemented. It is evaluated in the domain of Swedish consumer health queries, using query logs from a Swedish digital healthcare provider as evaluation data. The popular BM25 algorithm was used as the baseline. A linear combination of the WE algorithm and the BM25 algorithm, the mixture model (MM), was also implemented. Experiments revealed that an MM weighted mostly toward the WE algorithm is preferable. When these three algorithms were evaluated, the MM algorithm performed best overall: of the four evaluation metrics used, it was the best in one and the second-best in the other three. Since the MM algorithm was most similar to the WE algorithm, this indicates that integrating WE in IR has a positive effect. Although further research in the area is recommended to confirm these initial findings, this thesis indicates that WE can potentially improve IR in the domain of Swedish consumer health.
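The mixture model described in the abstract can be illustrated with a minimal Python sketch. It is not the thesis implementation. The abstract does not specify the WE scoring function, the score normalization, or the mixing weight, so the choices here are assumptions: cosine similarity between averaged word vectors, min-max normalization of both score lists, and a hypothetical weight of 0.8 toward the WE score. The two-dimensional embeddings and the three documents are invented toy data.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each tokenized document against the tokenized query with BM25."""
    n = len(docs)
    avgdl = sum(len(d) for d in docs) / n
    # Document frequency: number of documents containing each term.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query:
            if term not in tf:
                continue
            idf = math.log((n - df[term] + 0.5) / (df[term] + 0.5) + 1.0)
            norm = tf[term] + k1 * (1 - b + b * len(d) / avgdl)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


def mean_vector(tokens, embeddings):
    """Average the vectors of all tokens that have an embedding."""
    vecs = [embeddings[t] for t in tokens if t in embeddings]
    if not vecs:
        return None
    dim = len(vecs[0])
    return [sum(v[i] for v in vecs) / len(vecs) for i in range(dim)]


def cosine(u, v):
    """Cosine similarity; 0.0 when either vector is missing or zero."""
    if u is None or v is None:
        return 0.0
    dot = sum(a * c for a, c in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(c * c for c in v))
    return dot / (nu * nv) if nu and nv else 0.0


def we_scores(query, docs, embeddings):
    """Assumed WE scoring: cosine between averaged query and document vectors."""
    q = mean_vector(query, embeddings)
    return [cosine(q, mean_vector(d, embeddings)) for d in docs]


def min_max(scores):
    """Rescale scores to [0, 1] so the two score lists are comparable."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0] * len(scores)
    return [(s - lo) / (hi - lo) for s in scores]


def mixture_scores(query, docs, embeddings, weight=0.8):
    """Linear combination: weight * WE + (1 - weight) * BM25 (weight is hypothetical)."""
    bm = min_max(bm25_scores(query, docs))
    we = min_max(we_scores(query, docs, embeddings))
    return [weight * w + (1 - weight) * s for w, s in zip(we, bm)]


# Invented toy data: "huvudvärk" (headache) and "migrän" (migraine) are close.
embeddings = {
    "huvudvärk": [1.0, 0.1], "migrän": [0.9, 0.2],
    "hosta": [0.1, 1.0], "feber": [0.2, 0.9],
    "behandling": [0.5, 0.5], "tips": [0.5, 0.5],
}
docs = [["migrän", "behandling"], ["hosta", "feber"], ["huvudvärk", "tips"]]
query = ["huvudvärk"]

for doc, score in zip(docs, mixture_scores(query, docs, embeddings)):
    print(doc, round(score, 3))
```

In this toy example BM25 alone gives the migraine document a score of zero, because it shares no term with the query, while the mixture ranks it above the unrelated cough-and-fever document. This mirrors the motivation for combining exact term matching with embedding similarity.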
