Using Natural Language Processing to extract information from receipt text

This is a Master's thesis from KTH/School of Electrical Engineering and Computer Science (EECS)

Author: Marko Lazic; [2020]

Abstract: The ability to automatically read, recognize, and extract different information from unstructured text is of key importance to many areas. Most research in this area has focused on scanned invoices. This thesis investigates the feasibility of using natural language processing to extract information from receipt text. Three different machine learning models, BiLSTM, GCN, and BERT, were trained to extract a total of 7 different data points from a dataset consisting of 790 receipts. In addition, a simple rule-based model was built to serve as a baseline. These four models were then compared on how well they perform on the different data points. The best performing machine learning model was BERT, with an overall F1 score of 0.455. The second best was BiLSTM with an F1 score of 0.278, and GCN had an F1 score of 0.167. These F1 scores are strongly affected by the low performance on the product list, which was observed with all three models. BERT showed promising results on vendor name, date, tax rate, price, and currency. However, the simple rule-based method outperformed the BERT model on all data points except vendor name and tax rate. Receipt images from the dataset were often blurred, rotated, and crumpled, which introduced a high OCR error rate. This error then propagated through all of the steps and was most likely the main reason why the machine learning models, especially BERT, were not able to perform well. It is concluded that there is potential in using natural language processing for the problem of information extraction. However, further research is needed if it is to outperform the rule-based models.
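
The abstract does not describe the rules used by the baseline. As a rough illustration of what a rule-based extractor for a few of the data points (date, price, currency) might look like, here is a minimal Python sketch. The regular expressions, the keyword list, and the function name `extract_fields` are all assumptions made for this example and are not taken from the thesis.

```python
import re
from typing import Dict, Optional

# Hypothetical patterns: the actual rules of the thesis baseline are not
# given in the abstract, so these are illustrative assumptions only.
DATE_RE = re.compile(r"\b(\d{4}-\d{2}-\d{2}|\d{2}[./]\d{2}[./]\d{4})\b")
TOTAL_RE = re.compile(
    r"\b(?:total|summa|att betala)\b[^\d\n]*(\d+[.,]\d{2})",
    re.IGNORECASE,
)
CURRENCY_RE = re.compile(r"\b(SEK|EUR|USD|GBP)\b")


def extract_fields(text: str) -> Dict[str, Optional[str]]:
    """Return the first date, total price, and currency found in OCR text.

    A field is None when no pattern matches, which is common when the
    OCR output is noisy.
    """
    date = DATE_RE.search(text)
    total = TOTAL_RE.search(text)
    currency = CURRENCY_RE.search(text)
    return {
        "date": date.group(1) if date else None,
        # Normalize a decimal comma to a decimal point.
        "price": total.group(1).replace(",", ".") if total else None,
        "currency": currency.group(1) if currency else None,
    }


if __name__ == "__main__":
    sample = "ICA Nara\n2020-03-14 12:01\nMjolk 12,90\nTotal SEK 45,80\n"
    print(extract_fields(sample))
```

Rules of this kind depend on keywords and number formats surviving OCR, so a misread character in "Total" or in the amount makes the field come back empty rather than wrong.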