Identifying Hateful Text on Social Media with Machine Learning Classifiers and Normalization Methods - Using Support Vector Machines and Naive Bayes Algorithm

This is a Bachelor's thesis from Umeå universitet/Institutionen för datavetenskap

Author: Sebastian Sandberg; [2018]

Abstract: Hateful content on social media is a growing problem. In this thesis, machine learning algorithms and pre-processing methods have been combined to train classifiers to identify hateful text on social media. The combinations have been compared in terms of performance, with F-score and classification accuracy as the performance criteria. Training is performed using the Naive Bayes algorithm (NB) and Support Vector Machines (SVM). The pre-processing techniques used are tokenization and normalization. For tokenization, an open-source unigram tokenizer has been used, while a normalization model that normalizes each tweet before classification has been developed in Java. Normalization includes basic clean-up methods such as removing stop words, URLs, and punctuation, as well as altering methods such as emoticon conversion and spell checking. Both binary and multi-class versions of the classifiers have been used on balanced and unbalanced data. Both machine learning algorithms perform reasonably well, with accuracy between 76.70% and 93.55% and an F-score between 0.766 and 0.935. The results suggest that the main purpose of normalization is to reduce noise, that balancing the data is necessary, and that SVM seems to slightly outperform NB.
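
As a rough illustration of the normalization step described in the abstract, the following minimal sketch removes URLs, punctuation, and stop words and converts emoticons to words before producing unigram tokens. It is not the thesis's Java model: it is written in Python, the function name `normalize`, the stop-word list, and the emoticon table are hypothetical placeholders, and spell checking is left out.

```python
import re
import string

# Hypothetical, deliberately tiny resources; the thesis does not list its own.
STOP_WORDS = {"a", "an", "the", "is", "are", "to", "of", "and", "this", "at"}
EMOTICONS = {":)": "happy", ":-)": "happy", ":(": "sad", ":-(": "sad"}

# Matches http(s) links and bare www. links up to the next whitespace.
URL_PATTERN = re.compile(r"(?:https?://|www\.)\S+")

# Translation table that deletes every ASCII punctuation character.
STRIP_PUNCTUATION = str.maketrans("", "", string.punctuation)


def normalize(tweet: str) -> list[str]:
    """Return the normalized unigram tokens of one tweet (assumed pipeline)."""
    # Lowercase first, then drop URLs so they never reach the tokenizer.
    text = URL_PATTERN.sub(" ", tweet.lower())
    tokens = []
    for raw in text.split():
        # Convert emoticons before removing punctuation, or they would vanish.
        if raw in EMOTICONS:
            tokens.append(EMOTICONS[raw])
            continue
        word = raw.translate(STRIP_PUNCTUATION)
        # Skip tokens that were pure punctuation and tokens on the stop list.
        if word and word not in STOP_WORDS:
            tokens.append(word)
    return tokens
```

Calling `normalize` on a raw tweet yields a token list that could then be fed to an NB or SVM classifier. The order of the steps matters in this sketch: emoticon conversion has to run before punctuation removal, since emoticons consist entirely of punctuation characters.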
