Topical Classification of Images in Wikipedia : Development of topical classification models followed by a study of the visual content of Wikipedia

This is a Master's thesis from Linköping University/Computer Vision

Abstract: With over 53 million articles and 11 million images, Wikipedia is the greatest encyclopedia in history. The number of users is equally significant, with daily views surpassing 1 billion. Such an enormous system needs task automation so that its volunteers can maintain it. For textual data, a machine-learning-based system called ORES automates tasks such as article quality estimation and article topic routing. A visual counterpart also needs to be developed to support tasks such as vandalism detection in images and to give a better understanding of the visual data of Wikipedia. Researchers from the Wikimedia Foundation identified a hindrance to implementing the visual counterpart of ORES: the images of Wikipedia lack topical metadata. Thus, this work aims to develop a deep learning model that classifies images into a set of topics, which have been pre-determined in parallel work. State-of-the-art image classification models are used, together with methods to mitigate the existing class imbalance. The conducted experiments show, among other things, that: training on data that takes the label hierarchy into account performs better; resampling techniques are ineffective at mitigating imbalance due to the high label co-occurrence; sample-weighting improves metrics; and initializing parameters as pre-trained on ImageNet rather than randomly yields better metrics. Moreover, we find interesting outlier labels that, despite having fewer samples, obtain better performance metrics, which is believed to be due either to bias from pre-training or simply to more signal in the label. The distribution of the visual data as predicted by the models is also presented. Finally, some qualitative examples of the model's predictions on selected images are presented, demonstrating the ability of the model to find correct labels that are missing in the ground truth.
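
The abstract reports that sample-weighting helped with class imbalance but does not state the weighting scheme. As a rough illustration only, the sketch below assumes inverse-frequency label weights and averages them per image. The topic names, the toy data, and the helper functions `label_weights` and `sample_weights` are hypothetical and are not taken from the thesis.

```python
from collections import Counter


def label_weights(samples):
    """Inverse-frequency weight per label: n_samples / (n_labels * label_count).

    This "balanced" heuristic is an assumption, not the thesis's scheme.
    """
    counts = Counter(label for labels in samples for label in labels)
    n_samples = len(samples)
    n_labels = len(counts)
    return {label: n_samples / (n_labels * c) for label, c in counts.items()}


def sample_weights(samples, weights):
    """Weight each sample by the mean weight of its labels (1.0 if unlabeled)."""
    result = []
    for labels in samples:
        if labels:
            result.append(sum(weights[l] for l in labels) / len(labels))
        else:
            result.append(1.0)
    return result


# Hypothetical multi-label toy data: each image carries a set of topics.
samples = [
    {"Nature"},
    {"Nature", "Places"},
    {"Nature"},
    {"Sports"},
]

w = label_weights(samples)
sw = sample_weights(samples, w)

for labels, weight in zip(samples, sw):
    print(sorted(labels), round(weight, 3))
```

In this toy set, images carrying only the frequent label receive a smaller weight than images carrying a rare one, so a weighted loss would pay more attention to the rare topics. Because labels co-occur, an image with both a frequent and a rare label ends up in between, which hints at why plain resampling is harder to apply in the multi-label setting.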
