Learning Methods for Improving News Retrieval Systems
Abstract: Content providers require an efficient and accurate way of retrieving relevant content with minimal human aid. News retrieval, for instance, often requires human intervention to recognize which text documents are news articles and which are not. The differences between a factual news article and an opinionated blog piece may be subtle, yet they are critical for providing informative and relevant content to users. This thesis explores the problem of format classification: the task of classifying text documents based on the format in which they are written, such as a news article, blog entry or forum text. More explicitly, the goal of the thesis is to examine how well state-of-the-art supervised text classification techniques work for format classification. We select a number of classifiers that have been shown to perform well in other text classification tasks and evaluate their performance in this unexplored task. Experimental evaluation, performed on a novel dataset created from multiple existing datasets, explores both binary and multi-class classification in a bag-of-words feature space. Based on our experimental results, we have found that state-of-the-art supervised text classification techniques perform acceptably well at format classification. Furthermore, we propose a Gradient Boost model as a candidate classifier for the task of format classification, and provide a discussion of future work.
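To make the phrase "bag-of-words feature space" concrete, the following is a minimal sketch of how documents could be turned into count vectors before being passed to a supervised classifier. The abstract does not describe the tokenization or vocabulary construction used in the thesis, so the function names (`tokenize`, `build_vocabulary`, `bag_of_words`), the regular expression and the two example sentences are all illustrative assumptions.

```python
import re
from collections import Counter


def tokenize(text):
    # Assumed tokenization: lowercase, keep runs of letters, digits and apostrophes.
    return re.findall(r"[a-z0-9']+", text.lower())


def build_vocabulary(documents):
    # Map every distinct token in the corpus to a fixed column index.
    tokens = sorted({tok for doc in documents for tok in tokenize(doc)})
    return {tok: i for i, tok in enumerate(tokens)}


def bag_of_words(text, vocabulary):
    # Count token occurrences; word order is discarded and
    # tokens outside the vocabulary are ignored.
    vector = [0] * len(vocabulary)
    for tok, count in Counter(tokenize(text)).items():
        index = vocabulary.get(tok)
        if index is not None:
            vector[index] = count
    return vector


# Hypothetical examples of two formats: a news-style and a blog-style sentence.
docs = [
    "The council approved the budget on Monday.",
    "I think the budget is a terrible idea, honestly.",
]
vocabulary = build_vocabulary(docs)
vectors = [bag_of_words(doc, vocabulary) for doc in docs]
```

Each document becomes a fixed-length vector of word counts. Vectors of this kind, paired with format labels, are what a classifier such as a gradient boosting model would be trained on.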