Categorization of Customer Reviews Using Natural Language Processing

This is a Bachelor's thesis from KTH/School of Electrical Engineering and Computer Science (EECS)

Abstract: Databases of user-generated data can quickly become unmanageable. Klarna faced this issue with a database of around 700,000 customer reviews. Ideally, the database would be cleaned of uninteresting reviews and the remaining reviews categorized. Since it was not known in advance what categories might emerge, the idea was to use an unsupervised clustering algorithm to find them. This thesis describes the work carried out to solve this problem and proposes a solution for Klarna that involves artificial neural networks rather than unsupervised clustering. Our implementation is able to categorize reviews as either interesting or uninteresting. We propose a workflow that would make it possible to categorize reviews into more than these two categories. The method revolved around experimentation with clustering algorithms and neural networks. Previous research shows that texts can be clustered; however, the datasets used seem to be vastly different from the Klarna dataset. The Klarna dataset consists of short reviews and contains a large number of uninteresting reviews. Unsupervised clustering yielded unsatisfactory results, as no discernible categories could be found. In some cases, however, the technique created clusters of uninteresting reviews. These clusters were used as training data for an artificial neural network, together with manually labeled interesting reviews. The results from this artificial neural network were satisfactory: it determines whether a review is interesting or not with an accuracy of around 86%. This was achieved using the aforementioned clusters and five feedback loops, in which the reviews from an evaluation dataset that the model predicted incorrectly were fed back to it as training data. We argue that the main reason unsupervised clustering failed is that the reviews are too short. In comparison, other researchers have successfully clustered text data with an average length in the hundreds. Such items carry many more features than the short reviews in the Klarna dataset. We show that an artificial neural network is able to detect these features despite the short length, by virtue of its design. Further research into feature extraction from short text strings could provide a means to cluster this kind of data. If features can be extracted, the clustering can be done on the features rather than on the actual words. Our artificial neural network shows that the arbitrary features "interesting" and "uninteresting" can be extracted, so we are hopeful that future researchers will find ways of extracting more features from short text strings. In theory, this would mean that texts of any length can be clustered without supervision.
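
The feedback loop mentioned in the abstract can be illustrated with a minimal sketch. This is not the thesis's implementation: the abstract does not specify the network, so a single-layer logistic model over bag-of-words counts stands in for the artificial neural network, and the names `TinyClassifier` and `train_with_feedback`, the hyperparameters, and the toy reviews are all hypothetical. Only the idea of feeding incorrectly predicted evaluation reviews back as training data, for up to five rounds, comes from the abstract.

```python
import math
from collections import Counter


def featurize(text):
    """Bag-of-words counts for a lowercased, whitespace-tokenized review."""
    return Counter(text.lower().split())


class TinyClassifier:
    """Single-layer logistic model; a stand-in for the thesis's neural network."""

    def __init__(self):
        self.weights = Counter()  # missing words read as weight 0
        self.bias = 0.0

    def predict_proba(self, text):
        score = self.bias + sum(self.weights[w] * c for w, c in featurize(text).items())
        score = max(-30.0, min(30.0, score))  # clamp to keep exp() well-behaved
        return 1.0 / (1.0 + math.exp(-score))

    def predict(self, text):
        """Return 1 (interesting) or 0 (uninteresting)."""
        return 1 if self.predict_proba(text) >= 0.5 else 0

    def fit(self, samples, epochs=50, lr=0.5):
        """Plain stochastic gradient descent on the logistic loss."""
        for _ in range(epochs):
            for text, label in samples:
                error = label - self.predict_proba(text)
                for word, count in featurize(text).items():
                    self.weights[word] += lr * error * count
                self.bias += lr * error


def train_with_feedback(train, evaluation, rounds=5):
    """Retrain up to `rounds` times, moving misclassified evaluation
    reviews into the training set after each round."""
    train, pool = list(train), list(evaluation)
    model = TinyClassifier()
    model.fit(train)
    for _ in range(rounds):
        wrong = [(t, y) for t, y in pool if model.predict(t) != y]
        if not wrong:
            break  # nothing left to feed back
        train.extend(wrong)
        pool = [s for s in pool if s not in wrong]
        model = TinyClassifier()
        model.fit(train)
    return model, train


# Hypothetical toy data: 1 = interesting, 0 = uninteresting.
train = [
    ("late delivery and broken item", 1),
    ("refund never arrived", 1),
    ("ok", 0),
    ("good", 0),
]
evaluation = [
    ("item arrived broken", 1),
    ("fine", 0),
    ("great", 0),
    ("payment failed twice", 1),
]

model, final_train = train_with_feedback(train, evaluation, rounds=5)
print(len(final_train) - len(train), "evaluation reviews were fed back")
```

In the setup described in the abstract, the uninteresting examples would come from the clusters found by unsupervised clustering and the interesting ones from manual labeling; here both are hand-written placeholders.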
