Knowledge Base Augmentation from Spreadsheet Data : Combining layout inference with multimodal candidate classification

Detta är en Master-uppsats från KTH/Skolan för elektroteknik och datavetenskap (EECS)

Sammanfattning: Spreadsheets compose a valuable and notably large dataset of documents within many enterprise organizations and on the Web. Although spreadsheets are intuitive to use and equipped with powerful functionalities, extraction and transformation of the data remain a cumbersome and mostly manual task. The great flexibility they provide to the user results in data that is arbitrarily structured and hard to process for other applications. In this paper, we propose a novel architecture that combines supervised layout inference and multimodal candidate classification to allow knowledge base augmentation from arbitrary spreadsheets. In our design, we consider the need for repairing misclassifications and allow for verification and ranking of ambiguous candidates. We evaluate the performance of our system on two datasets, one with single-table spreadsheets, another with spreadsheets of arbitrary format. The evaluation result shows that the proposed system achieves similar performance on single-table spreadsheets compared to state-of-the-art rule-based solutions. Additionally, the flexibility of the system allows us to process arbitrary spreadsheet formats, including horizontally and vertically aligned tables, multiple worksheets, and contextualizing metadata. This was not possible with existing purely text-based or table-based solutions. The experiments demonstrate that it can achieve high effectiveness with an F1 score of 95.71 on arbitrary spreadsheets that require the interpretation of surrounding metadata. The precision of the system can be further increased by applying candidate schema-matching based on semantic similarity of column headers.

  HÄR KAN DU HÄMTA UPPSATSEN I FULLTEXT. (följ länken till nästa sida)