A Hybrid Approach to Hate Speech Detection

This is a Master's thesis from Umeå University / Department of Computing Science

Author: Simon Rickardsson; [2023]


Abstract: An interesting question is to what extent background knowledge can help in text classification. More specifically, can a traditional rule-based classifier boost the accuracy of learned models? We explore this question for the detection of hate speech and offensive language in online text. We use two corpora: the first is a dataset of tweets written in slang, and the second is a dataset of Wikipedia comments, which we take to represent more general language. To encode background knowledge, we use simple hand-built dictionaries of offensive words, one associated with each dataset, and integrate them into the learning process. The machine learning approaches we experiment with are Logistic Regression, Naive Bayes, Decision Tree, Random Forest, and Linear SVM. To integrate the hand-built classifier, we add the dictionary of offensive words to the feature matrix by creating one binary feature per word, indicating whether that word is present in the text. This allows the model to take the presence of specific offensive words into account when making predictions. We compare this with an ensemble method that runs the hand-built classifier in parallel with the learned model. In our experiments, integrating a dictionary with traditional machine learning methods significantly improved performance, although the size of the improvement varied between cases. The size and characteristics of the dictionary play a significant role in the resulting performance, and ensemble methods also show great potential in this context. These findings suggest that background knowledge, when combined properly with machine learning approaches, can greatly enhance the effectiveness of text classification. Future work should explore this further, possibly by examining other kinds of models or by applying our approach to other contexts.
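The two integration strategies described in the abstract can be illustrated with a minimal sketch. This is not the thesis implementation: the word list, the tokenizer, the function names, and the rule that the ensemble flags a text when either classifier does are all assumptions made for illustration.

```python
import re

# Hypothetical dictionary of offensive words (placeholders, not taken from the thesis).
OFFENSIVE_WORDS = ["idiot", "moron", "scum"]


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def dictionary_features(text, dictionary=OFFENSIVE_WORDS):
    """Return one binary feature per dictionary word: 1 if present, else 0."""
    tokens = set(tokenize(text))
    return [1 if word in tokens else 0 for word in dictionary]


def augment(base_features, text, dictionary=OFFENSIVE_WORDS):
    """Append the dictionary features to an existing feature row."""
    return list(base_features) + dictionary_features(text, dictionary)


def rule_based_predict(text, dictionary=OFFENSIVE_WORDS):
    """Hand-built classifier: flag the text if any dictionary word occurs."""
    return int(any(dictionary_features(text, dictionary)))


def ensemble_predict(text, learned_predict, dictionary=OFFENSIVE_WORDS):
    """Run both classifiers side by side and flag the text if either one does.

    The OR combination is an assumption; other combination rules are possible.
    `learned_predict` is any callable mapping a text to 0 or 1.
    """
    learned = bool(learned_predict(text))
    rules = bool(rule_based_predict(text, dictionary))
    return int(learned or rules)
```

In the first strategy, the rows produced by `augment` would be passed to a learned model such as Logistic Regression in place of the original feature rows. In the second, `ensemble_predict` wraps an already trained model, so the dictionary can be changed without retraining.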
