Exploit Unlabeled Data with Language Model for Text Classification. Comparison of four unsupervised learning models

This is a Master's thesis from Göteborgs universitet / Institutionen för filosofi, lingvistik och vetenskapsteori (Department of Philosophy, Linguistics and Theory of Science)

Abstract: In a setting where Semi-Supervised Learning (SSL) can exploit unlabeled data, this paper shows that a Language Model (LM) outperforms three other unsupervised learning models for text classification: one based on Term Frequency-Inverse Document Frequency (Tf-idf) and two based on pre-trained word vectors. The experimental results show that the LM outperforms the other three models whether the task is easy or difficult, where the difficult task consists of imbalanced data. To investigate not only why the LM outperforms the other models but also how to maximize its performance with a small quantity of labeled data, this paper suggests two techniques for improving the LM in neural networks: (1) obtaining information from the neural network layers and (2) employing a proper evaluation of the trained neural network models. Finally, this paper explores scenarios where SSL is not available and only Transfer Learning (TL) is accessible to exploit unlabeled data. Using two types of Self-Taught Learning and Multi-Task learning in TL, the experiments show that exploiting a dataset with a wider domain benefits the performance of the LM.
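As an illustration of the general setup the abstract describes (not the thesis's actual pipeline, models, or data), the following sketch shows a Tf-idf baseline exploiting unlabeled data via self-training with scikit-learn, where unlabeled examples are marked with the label -1. All texts and labels here are invented for demonstration.

```python
# Hypothetical sketch: semi-supervised text classification with a
# Tf-idf baseline. Unlabeled documents carry the label -1 and are
# pseudo-labeled during training by SelfTrainingClassifier.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.semi_supervised import SelfTrainingClassifier

# Toy corpus: four labeled documents, two unlabeled (-1).
texts = ["good movie", "great film", "terrible plot", "awful acting",
         "enjoyable watch", "boring scenes"]
labels = [1, 1, 0, 0, -1, -1]

# Tf-idf features feed a probabilistic classifier; self-training
# adds confident pseudo-labels for the unlabeled examples.
model = make_pipeline(
    TfidfVectorizer(),
    SelfTrainingClassifier(LogisticRegression(max_iter=1000),
                           threshold=0.6),
)
model.fit(texts, labels)
preds = model.predict(["great plot", "terrible film"])
print(preds)
```

A language-model-based approach would replace the Tf-idf features with representations taken from a trained LM, but the semi-supervised wrapper around the classifier stays the same.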
