Task-agnostic knowledge distillation of mBERT to Swedish

Detta är en Master-uppsats från KTH/Skolan för elektroteknik och datavetenskap (EECS)

Sammanfattning: Large transformer models have shown great performance in multiple natural language processing tasks. However, slow inference, strong dependency on powerful hardware, and large energy consumption limit their availability. Furthermore, the best-performing models use high-resource languages such as English, which increases the difficulty of using these models for low-resource languages. Research into compressing large transformer models has been successful, using methods such as knowledge distillation. In this thesis, an existing task-agnostic knowledge distillation method is employed by using Swedish data for distillation of mBERT models further pre-trained on different amounts of Swedish data, in order to obtain a smaller multilingual model with performance in Swedish competitive with a monolingual student model baseline. It is shown that none of the models distilled from a multilingual model outperform the distilled Swedish monolingual model on Swedish named entity recognition and Swedish translated natural language understanding benchmark tasks. It is also shown that further pre-training mBERT does not significantly affect the performance of the multilingual teacher or student models on downstream tasks. The results corroborate previously published results showing that no student model outperforms its teacher.

  HÄR KAN DU HÄMTA UPPSATSEN I FULLTEXT. (följ länken till nästa sida)