Text Classification of Human Resources-related Data with Machine Learning

This is a Master's thesis from KTH/School of Electrical Engineering and Computer Science (EECS)

Author: Christine Rosquist; [2021]


Abstract: Text classification has been an important application and research subject since the advent of digital documents. Today, as more and more data are stored as electronic documents, text classification is even more vital. Various studies apply machine learning methods such as Naive Bayes and Convolutional Neural Networks (CNN) to text classification and sentiment analysis. However, most of these studies do not focus on cross-domain classification, i.e., the setting in which machine learning models trained on a dataset from one context are tested on a dataset from another context. This setting is useful when there is not enough training data for the specific domain in which text is to be classified. This thesis investigates how Naive Bayes and CNN perform when they are trained in one context and then tested in another, slightly different context. The study trains the models on employee-review data and then tests them on both employee-review data and human resources-related data. The aim of the thesis is thus to gain insight into how to develop a system capable of accurate cross-domain classification, and to contribute to text classification research in general. A comparative analysis of Naive Bayes and CNN showed that the two models performed quite similarly when classifying sentences using only the employee-review data for training and testing. However, CNN performed slightly better at multiclass classification on the employee-review data, which indicates that CNN might be the better model in that context. From a cross-domain perspective, Naive Bayes turned out to be the better model, since it performed better on all of the metrics evaluated.
Nevertheless, both models can be used as guidance tools for quickly classifying human resources-related data, even though Naive Bayes performs best in the cross-domain context. The results can possibly be improved with more research and need to be verified with more data. Suggested improvements include, among others, enhancing the hyperparameter optimization, using another approach to handle the data imbalance, and adjusting the preprocessing methods. It is also worth noting that statistical significance could not be confirmed in all test cases, meaning that no definitive conclusions can be drawn, but the results of this thesis still provide an indication of how well the models perform.
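
The cross-domain setup described in the abstract (train in one context, then evaluate both in-domain and on a different context) can be illustrated with a minimal sketch. The abstract does not describe the thesis implementation, so everything below is an assumption made for illustration: a small pure-Python multinomial Naive Bayes with add-one smoothing, hypothetical class and function names, and invented toy sentences that are not taken from the thesis data.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    # Lowercase and split on whitespace; real preprocessing would be richer.
    return text.lower().split()


class MultinomialNB:
    """Minimal multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, texts, labels):
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for text, label in zip(texts, labels):
            self.word_counts[label].update(tokenize(text))
        self.vocab = {w for counts in self.word_counts.values() for w in counts}
        return self

    def predict(self, text):
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            score = math.log(n_docs / total_docs)  # log prior
            total_words = sum(self.word_counts[label].values())
            for word in tokenize(text):
                if word not in self.vocab:
                    continue  # words unseen in training carry no evidence
                count = self.word_counts[label][word]
                score += math.log((count + 1) / (total_words + len(self.vocab)))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


def accuracy(model, texts, labels):
    correct = sum(model.predict(t) == y for t, y in zip(texts, labels))
    return correct / len(labels)


# Invented toy data (NOT from the thesis): source domain = employee reviews.
train_texts = [
    "great team and good salary",
    "good management and great benefits",
    "poor management and low salary",
    "bad culture and poor benefits",
]
train_labels = ["positive", "positive", "negative", "negative"]

# In-domain test set: same kind of text as the training data.
in_domain_texts = ["great benefits and good team", "low salary and bad management"]
in_domain_labels = ["positive", "negative"]

# Cross-domain test set: HR-style sentences with partly unseen vocabulary.
cross_domain_texts = [
    "the onboarding process was great",
    "poor communication from hr",
    "unclear vacation policy",
]
cross_domain_labels = ["positive", "negative", "negative"]

model = MultinomialNB().fit(train_texts, train_labels)
in_domain_acc = accuracy(model, in_domain_texts, in_domain_labels)
cross_domain_acc = accuracy(model, cross_domain_texts, cross_domain_labels)
print(f"in-domain accuracy:    {in_domain_acc:.2f}")
print(f"cross-domain accuracy: {cross_domain_acc:.2f}")
```

In this toy setup the last cross-domain sentence contains no word seen during training, so the model can only fall back on its class prior. That is one simple way in which a shift in vocabulary between domains can lower cross-domain performance relative to in-domain performance.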
