Trade-offs between Quality and Efficiency in Multilingual Dense Retrieval

This is a Master's thesis from KTH / School of Electrical Engineering and Computer Science (EECS)

Abstract: As the amount of online content grows, information retrieval becomes increasingly important. Traditional information retrieval ignores word order and depends on exact text matching between the query and the document. A query consisting of synonyms of words in a document will therefore not retrieve that document, even if it would have been relevant to the user. An alternative approach is dense retrieval, which addresses these issues by representing the semantic meaning of a query or document as a vector. Semantically similar queries and documents are represented by vectors that lie close to each other in the vector space, so vector similarity search can be used to find the most relevant documents for a query. Because the semantic meanings of the words are used, synonyms and paraphrases are handled implicitly. These representations can be designed in several ways: by using one or several vectors per query or document, by changing the dimensionality of the vectors, or by changing the range of values in the vectors. Each option brings its own trade-offs in search result quality, query latency, and index memory footprint. This study experimented with each of these alternatives. Since most previous research in the area has been done in a monolingual, mainly English context, this study used four different languages to investigate whether the trade-offs differed. In this study, quality, latency, and memory footprint moved in the same direction: when quality increased, latency increased as well. This was the case for all the languages. For the version that used one vector each for the document and the query, decreasing the dimensionality to 128 or 64 gave significant latency improvements but did not affect the quality. For the larger version, which used 32 vectors for the query and 64 for the document, converting the vector values to binary had no significant effect on quality but greatly reduced the storage size.
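
The design options described in the abstract can be illustrated with a small, self-contained sketch. It is not the thesis implementation. The cosine similarity measure, the max-similarity scoring for the multi-vector case, the sign-based binarization rule, and the 128-dimensional per-token vectors are all assumptions chosen for illustration. Only the counts of 32 query vectors and 64 document vectors come from the abstract.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors (assumed measure)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def single_vector_score(query_vec, doc_vec):
    """One vector per query and per document: a single similarity computation."""
    return cosine(query_vec, doc_vec)

def multi_vector_score(query_vecs, doc_vecs):
    """Several vectors per query and document.

    Assumed scoring rule: each query vector is matched with its most similar
    document vector, and the best matches are summed.
    """
    return sum(max(cosine(q, d) for d in doc_vecs) for q in query_vecs)

def binarize(vec):
    """Reduce each value to one bit by keeping only its sign (assumed rule)."""
    return [1 if x > 0 else 0 for x in vec]

def index_bytes(num_docs, vectors_per_doc, dim, bits_per_value):
    """Raw index size in bytes, ignoring any index overhead."""
    return num_docs * vectors_per_doc * dim * bits_per_value // 8

# Hypothetical corpus of 1,000 documents with 64 vectors of 128 dimensions each.
float_size = index_bytes(1000, 64, 128, 32)   # 32-bit floating-point values
binary_size = index_bytes(1000, 64, 128, 1)   # 1-bit values after binarization
print(float_size // binary_size)              # storage reduction factor
```

Under these assumptions, storing one bit per value instead of a 32-bit float shrinks the raw vector storage by a factor of 32, and the multi-vector score needs one similarity computation per pair of query and document vectors where the single-vector score needs only one in total. This matches the direction of the trade-offs reported above, although the measured numbers in the thesis depend on its actual models and index.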
