Using Semi-Supervised Learning for Email Classification

This is a Master's thesis from KTH/Mathematics (Division)

Abstract: In this thesis, we investigate the use of self-training, a semi-supervised learning method, to improve binary classification of text documents. Self-training makes use of unlabeled samples, which is valuable because labeled samples can be expensive to generate. More specifically, we want to classify emails received by Skandinaviska Enskilda Banken (SEB). The method is tested on two datasets: the first is IMDB reviews, consisting of both labeled (good or bad) and unlabeled movie reviews; the second is provided by SEB and consists of labeled and unlabeled emails. First, supervised learning was investigated. Three different vectorization methods were included: two bag-of-words models and one doc2vec model. These were tested using five different machine learning classification methods. A comparison of F1-scores showed that doc2vec vectorization and the logistic regression classifier performed well, so they were used in the self-training investigation. We find that self-training on the IMDB dataset only yielded improvement for a low number of labeled samples. For the SEB dataset, we find that by using self-training we can achieve the same F1-score using only around 1000 labeled samples (less than 10% of the labeled dataset) as using supervised methods on the full labeled set. We conclude that self-training can improve classification performance and can also be used indirectly to reduce manual labeling effort.
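The self-training loop the abstract describes can be sketched as follows. This is a minimal illustration, not the thesis implementation: it uses synthetic feature vectors standing in for doc2vec embeddings, a logistic regression classifier as in the thesis, and an assumed confidence threshold of 0.95 for pseudo-labeling.

```python
# Minimal self-training sketch: fit on labeled data, pseudo-label
# confident unlabeled samples, and refit. Data is synthetic; in the
# thesis the feature vectors would come from doc2vec.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Two well-separated synthetic classes (stand-in for document vectors).
X_lab = rng.normal(0, 1, (40, 5)) + np.repeat([[2.0], [-2.0]], 20, axis=0)
y_lab = np.repeat([1, 0], 20)
X_unlab = rng.normal(0, 1, (200, 5)) + np.where(rng.random((200, 1)) < 0.5, 2.0, -2.0)

clf = LogisticRegression()
for _ in range(5):  # self-training iterations
    clf.fit(X_lab, y_lab)
    if len(X_unlab) == 0:
        break
    proba = clf.predict_proba(X_unlab)
    mask = proba.max(axis=1) > 0.95  # pseudo-label only confident predictions
    if not mask.any():
        break
    # Move confidently pseudo-labeled samples into the labeled set.
    X_lab = np.vstack([X_lab, X_unlab[mask]])
    y_lab = np.concatenate([y_lab, proba[mask].argmax(axis=1)])
    X_unlab = X_unlab[~mask]
```

In practice the threshold and the number of iterations are tuning choices; the thesis evaluates the resulting classifier by F1-score rather than accuracy.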
