Extracting Information From PDF Invoices Using Deep Learning

This is a Master's thesis from KTH/School of Electrical Engineering and Computer Science (EECS)

Author: Diego Leon; [2021]


Abstract: Manually extracting information from invoices can be time-consuming, especially when managing large numbers of documents. Finding a way to extract this information automatically could help businesses save resources. This thesis investigates the extraction of semi-structured data from PDF invoices using deep learning methods and compares them with a rule-based model built as a baseline. More specifically, an object detection approach based on the Faster R-CNN model is compared with a Natural Language Processing (NLP) approach based on BERT. The models were trained to extract 4 different fields, using a dataset of 899 PDF invoices. Each model was tested on how well it extracted each field, and the results were then compared. The NLP approach achieved the highest overall F1 score, 0.911, and attained the highest score in all fields except one. In second place came the rule-based approach, with an overall F1 score of 0.830. In last place came the object detection approach, with an overall F1 score of 0.815. It is concluded that the NLP approach is best suited to the task of information extraction from PDF invoices. Because the dataset was small and Faster R-CNN requires large amounts of data and long training, the object detection approach did not reach its full potential. However, further research is needed to determine whether it could outperform the NLP approach with a larger dataset and longer training.
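
The comparison above rests on per-field and overall F1 scores. As a minimal sketch of how such scores could be computed, the Python below counts exact-match hits per field and micro-averages them into an overall score. The abstract does not state the thesis's actual matching criterion or averaging scheme, so exact string matching and micro-averaging are assumptions, and the function name `field_f1` and the field names `invoice_number` and `total` are hypothetical.

```python
from collections import Counter


def field_f1(predictions, gold):
    """Per-field and micro-averaged overall F1 (assumption: exact string match).

    predictions, gold: lists of dicts, one per invoice, mapping a field name
    to the extracted value, or to None when no value is present.
    """
    tp, fp, fn = Counter(), Counter(), Counter()
    for pred, ref in zip(predictions, gold):
        for field in set(pred) | set(ref):
            p, r = pred.get(field), ref.get(field)
            if p is not None and p == r:
                tp[field] += 1      # correct extraction
            else:
                if p is not None:
                    fp[field] += 1  # a value was extracted, but it is wrong
                if r is not None:
                    fn[field] += 1  # the true value was not recovered

    def f1(t, f_pos, f_neg):
        denom = 2 * t + f_pos + f_neg
        return 2 * t / denom if denom else 0.0

    fields = set(tp) | set(fp) | set(fn)
    per_field = {f: f1(tp[f], fp[f], fn[f]) for f in fields}
    overall = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    return per_field, overall


# Hypothetical toy data, not taken from the thesis.
gold = [
    {"invoice_number": "A1", "total": "100"},
    {"invoice_number": "B2", "total": "200"},
]
pred = [
    {"invoice_number": "A1", "total": "100"},
    {"invoice_number": "B3", "total": None},
]
per_field, overall = field_f1(pred, gold)
```

In this sketch a wrong value counts as both a false positive and a false negative, while a missing value counts only as a false negative. Other conventions exist and would give different scores.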
