Automatic Classification of text regarding Child Sexual Abusive Material

This is a professional degree thesis at advanced level from Uppsala universitet/Avdelningen för systemteknik

Abstract: Sexual abuse is a horrible reality for many children around the world. As technology makes encryption and online anonymity more widely available, the perpetrators of these acts are increasingly hard to track. There have been several recent advances in automating the work of catching these perpetrators, and image recognition in particular has shown great promise. While image recognition is a natural approach, since many abuses are documented and shared between perpetrators, many potential leads go unexplored if the focus is only on images and videos. This study evaluates how supervised machine learning methods based solely on textual data can identify forum posts connected to the distribution of child sexual abuse material. Feature representation techniques such as word vectors, paragraph vectors and the FastText algorithm were used in conjunction with supervised deep learning methods, including multilayer perceptrons, convolutional neural networks and long short-term memory models. The models were trained and evaluated on a dataset of forum posts from a Dark Net leak from last year, and were also evaluated on text collected from websites that had been manually verified by Ecpat. These models were compared to a baseline model based on logistic regression. The deep learning models achieved similar performance to one another, and all outperformed the logistic regression baseline. Further improvements can be achieved if more annotated data becomes available.
