Generic Data Harvester

Detta är en Kandidat-uppsats från KTH/Skolan för elektroteknik och datavetenskap (EECS)

Sammanfattning: This report goes through the process of developing a generic article scraper which shall extract relevant information from an arbitrary web article. The extraction is implemented by searching and examining the HTML of the article, by using Python and XPath. The data that shall be extracted is the title, summary, publishing date and body text of the article. As there is no standard way that websites, and in particular news articles, is built, the extraction needs to be adapted for every different structure and language of articles. The resulting program should provide a proof of concept method of extracting the data showing that future development is possible. The thesis host company Acuminor is working with financial crime intelligence and are collecting information through articles and reports. To scale up the data collection and minimize the maintenance of the scraping programs, a general article scraper is needed. There exist an open source alternative called Newspaper, but since this is no longer being maintained and it can be argued is not properly designed, an internal implementation for the company could be beneficial. The program consists of a main class that imports extractor classes that have an API for extracting the data. Each extractor are decoupled from the rest in order to keep the program as modular as possible. The extraction for title, summary and date are similar, with the extractors looking for specific HTML tags that contain some common attribute that most websites implement. The text extraction is implemented using a tree that is built up from the existing text on the page and then searching the tree for the most likely node containing only the body text, using attributes such as amount of text, depth and number of text nodes. The resulting program does not match the performance of Newspaper, but shows promising results on every part of the extraction. The text extraction is very slow and often takes too much text of the article but provides a great blueprint for further improvement at the company. Acuminor will be able to have their in-house article extraction that suits their wants and needs.

  HÄR KAN DU HÄMTA UPPSATSEN I FULLTEXT. (följ länken till nästa sida)