The Contrastive Tension Method and Corpus Variety

This is a Bachelor's thesis from Uppsala universitet / Institutionen för informationsteknologi (Department of Information Technology)

Author: Sanne Lindqvist; [2021]


Abstract: A new method called contrastive tension for creating semantic sentence representations has recently been developed. The method has previously been tested with three different text collections as source material. This report continues that work and evaluates the method on different kinds of texts. Six text collections are used. The first four collections contain sentences from four different categories. The two remaining collections are based on the complete works of William Shakespeare. The two Shakespeare collections differ in how the raw text is divided into sentences. The goal of this thesis is to investigate three main points. Firstly, how is the performance on the benchmarks affected by using contrastive tension with different texts? Secondly, does using a text collection that is similar to one of the benchmarking subcategories improve the results on that subcategory? For instance, does a corpus of news headlines result in a higher score on the headline subcategory? Thirdly, is there a difference in performance between the two Shakespeare collections? Can this say anything about the impact of sentence length on performance? The two collections that achieved the highest correlation scores overall were the image caption corpus and, surprisingly, one of the Shakespeare collections. The correlation score is a measure of how well the meaning of sentences is represented by the sentence embeddings, and is described further in section 2.5. Using a collection similar to a subcategory sometimes improved the score on that subcategory, but not always. Some collections achieved higher scores than expected on seemingly unrelated subcategories. To conclude, the findings are that the text used as source material does impact the evaluation scores, but not always in the ways that one would expect.
The length of the sentences in a corpus appears to have an impact, as the results on the two Shakespeare collections suggest. However, further study is needed to be able to draw definitive conclusions about the importance of sentence length.
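
As a rough illustration of the kind of correlation score mentioned above, the sketch below compares cosine similarities between sentence embeddings with human similarity ratings. It assumes a Spearman rank correlation, which is common on sentence-similarity benchmarks. The measure actually used in the thesis is the one defined in its section 2.5 and may differ. The embeddings, the ratings, and all function names are hypothetical toy values chosen for illustration, not data from the thesis.

```python
import math

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def ranks(values):
    """1-based ranks; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

# Hypothetical toy embeddings for three sentence pairs (not thesis data).
pairs = [
    ([1.0, 0.0], [1.0, 0.1]),  # nearly the same direction
    ([1.0, 0.0], [0.5, 0.5]),  # partly related
    ([1.0, 0.0], [0.0, 1.0]),  # orthogonal, i.e. unrelated
]
# Hypothetical human similarity ratings for the same pairs.
human_scores = [4.8, 2.5, 0.3]

model_scores = [cosine(u, v) for u, v in pairs]
score = spearman(model_scores, human_scores)
print(f"Spearman correlation: {score:.3f}")
```

A score near 1 means the embeddings order the sentence pairs the same way the human ratings do. A score near 0 means the two orderings are unrelated.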
