Self-Supervised Fine-Tuning of sentence embedding models using a Smooth Inverse Frequency model : Automatic creation of labels with Smooth Inverse Frequency model

Detta är en Master-uppsats från KTH/Skolan för elektroteknik och datavetenskap (EECS)

Sammanfattning: Sentence embedding models play a key role in the field of Natural Language Processing. They can be exploited for the resolution of several tasks like sentence paraphrasing, sentence similarity, and sentence clustering. Fine-tuning pre-trained models for sentence embedding extraction is a common practice that allows it to reach state-of-the-art performance on downstream tasks. Nevertheless, this practice usually requires labeled data sets. This thesis project aims to overcome this issue by introducing a novel technique for the automatic creation of a target set for fine-tuning sentence embedding models for a specific downstream task. The technique is evaluated on three distinct tasks: sentence paraphrasing, sentence similarity, and sentence clustering. The results demonstrate a significant improvement in sentence embedding models when employing the Smooth Inverse Frequency technique for automatic extraction and labeling of sentence pairs. In the paraphrasing task, the proposed technique yields a noteworthy enhancement of 2.3% in terms of F1-score compared to the baseline results. Moreover, it showcases a 0.2% improvement in F1-score when compared to the ideal scenario where real labels are utilized. For the sentence similarity task, the proposed method achieves a Pearson score of 0.71, surpassing the baseline model’s score of 0.476. However, it falls short of the ideal model trained with human annotations, which attains a Pearson score of 0.845. Regarding the clustering task, from a quantitative standpoint, the best model achieves a harmonic mean (calculated using DBCV and cophenetic score) of 0.693, outperforming the baseline score of 0.671. Nevertheless, the qualitative assessment did not demonstrate a substantial improvement for the clustering task, highlighting the need for exploring alternative techniques to enhance performance in this area.

  HÄR KAN DU HÄMTA UPPSATSEN I FULLTEXT. (följ länken till nästa sida)