Data Segmentation Using NLP: Gender and Age

This is a thesis for a professional degree at advanced level from Uppsala universitet/Avdelningen för datalogi

Abstract: Natural language processing (NLP) enables a computer to read, decipher, and interpret human language, and ultimately to use it in ways that further the understanding of interaction and communication between human and computer. When appropriate data is available, NLP makes it possible to determine not only the sentiment of a text but also information about the author behind an online post. Previous studies indicate that NLP can go deeper into such subjective information, enabling author classification from text data. This thesis addresses the lack of demographic insight into online user data by studying language use in texts. It compares four popular yet diverse machine learning algorithms for gender and age segmentation. During the project, the age analysis was abandoned due to insufficient data. The online texts were analysed and quantified into 118 parameters based on linguistic differences. Using supervised learning, gender was predicted correctly in 82% of cases when analysing data from English-speaking online users. It is important to note that the training and test data may be correlated to some degree. Language is complex and, in this case, the more complex methods, SVM and neural networks, performed better than the less complex Naive Bayes and logistic regression.
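
As an illustration of what quantifying a text into linguistic parameters can look like, the following is a minimal Python sketch. The abstract does not enumerate the 118 parameters, so the four features, the word lists, and the function name `extract_features` are all assumptions chosen for illustration, not the parameters used in the thesis.

```python
import re

# Hypothetical word lists: the thesis does not list its 118 parameters,
# so these categories are illustrative assumptions only.
FIRST_PERSON = {"i", "me", "my", "mine", "myself"}
ARTICLES = {"a", "an", "the"}


def extract_features(text: str) -> dict:
    """Quantify one post into a few illustrative linguistic parameters."""
    # Lowercase the text and keep runs of letters and apostrophes as tokens.
    tokens = re.findall(r"[a-z']+", text.lower())
    n = len(tokens) or 1  # avoid division by zero on empty posts
    return {
        "avg_word_length": sum(len(t) for t in tokens) / n,
        "first_person_ratio": sum(t in FIRST_PERSON for t in tokens) / n,
        "article_ratio": sum(t in ARTICLES for t in tokens) / n,
        "exclamations_per_word": text.count("!") / n,
    }


features = extract_features("I love my dog! The dog loves me.")
```

Each post becomes a fixed-length numeric vector of this kind, which a supervised classifier such as Naive Bayes, logistic regression, an SVM, or a neural network could then be trained on together with the known gender labels.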
