Structuring Data with Keywords: The Development of a Scraping Artifact (original title: Att Strukturera Data Med Nyckelord: Utvecklandet av en Skrapande Artefakt)

This is a Bachelor's thesis from Malmö University, Department of Computer Science and Media Technology (DVMT)

Abstract: Developing methods for processing information has long been a central area of computer science. Being able to structure and compile different types of information can streamline many tasks. In addition, the web keeps growing, and as a result larger amounts of information become accessible, which also makes it harder to find and compile relevant information. This raises two questions: Is a layered architecture suitable for extracting semi-structured data from various web-based documents, such as HTML and PDF, and structuring the content as generically as possible? And how can semi-structured data be found in various forms of documents on the web, based on keywords, so that the data can be saved in tabular form? A review of previous research shows a gap when it comes to processing different levels of structure with the web as the data source. Previous data-processing projects have usually used a layered architecture in which each layer has a specific task, and this architecture was also chosen for the artifact. To create the artifact, the Design and Creation method is applied, including a literature study. This method is common in work where the goal is to create an artifact in order to answer research questions. The method also includes tests of the artifact, which show how well the artifact follows instructions and whether it can answer the research questions. The work has resulted in an artifact that works well and lays a foundation for future work. However, there is room for improvement: the artifact could be made to understand context and find more relevant information, and future research could examine how other software can be integrated to streamline and improve the results.
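
The layered idea described in the abstract can be illustrated with a minimal sketch. This is not the thesis's artifact: the layer names (`parse_layer`, `keyword_layer`, `output_layer`), the restriction to HTML table rows, and the CSV output are all assumptions made for illustration, and the fetching layer and PDF handling are left out entirely. Only the Python standard library is used.

```python
import csv
import io
from html.parser import HTMLParser


class TableRowParser(HTMLParser):
    """Collects the cell text of every <tr> in an HTML document."""

    def __init__(self):
        super().__init__()
        self.rows = []     # finished rows, each a list of cell strings
        self._row = None   # cells of the row currently being read
        self._cell = None  # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            # Join fragments and collapse runs of whitespace.
            self._row.append(" ".join("".join(self._cell).split()))
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None


def parse_layer(html):
    """Hypothetical parsing layer: HTML text in, list of rows out."""
    parser = TableRowParser()
    parser.feed(html)
    parser.close()
    return parser.rows


def keyword_layer(rows, keywords):
    """Hypothetical filtering layer: keep rows where any cell
    contains any keyword (case-insensitive substring match)."""
    wanted = [k.lower() for k in keywords]
    return [
        row for row in rows
        if any(k in cell.lower() for cell in row for k in wanted)
    ]


def output_layer(rows):
    """Hypothetical output layer: serialise rows as CSV text."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(rows)
    return buffer.getvalue()


def scrape(html, keywords):
    """Chains the layers; each layer only sees the previous layer's output."""
    return output_layer(keyword_layer(parse_layer(html), keywords))
```

Because each layer has a single task and a plain data interface (text, rows, rows, text), a layer could in principle be swapped out, for example replacing `parse_layer` with a PDF parser, without touching the others.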
