Exploring Machine Learning Solutions in the Context of OCR Post-Processing of Invoices

This is a first-cycle professional degree thesis from KTH/School of Electrical Engineering and Computer Science (EECS)

Abstract: Large corporations send and receive large volumes of invoices, each containing fields that describe a transaction, such as VAT, due date, and total amount. One common way to automate invoice processing is optical character recognition (OCR), which automatically reads characters from scanned images. Because invoices have no universal layout standard, data from invoices with different layouts is difficult to process reliably. This thesis examines common errors in the output of Azure's Form Recognizer general document model and how machine learning (ML) can help address this problem by providing error detection as a first step: classifying OCR output as correct or incorrect. To this end, common errors were analyzed in the OCR output from 70 real invoices, and a Bidirectional Encoder Representations from Transformers (BERT) model was fine-tuned for invoice classification. The results show that the two most common OCR errors are (i) extra words appearing in a field and (ii) words missing from a field. Together, these two error types account for 51% of OCR errors. For correctness classification, a BERT-type Transformer model achieved an F-score of 0.982 on fabricated data. On real invoice data, the initial model achieved an F-score of 0.596, which rose to 0.832 after additional fine-tuning. These results show that ML, while not entirely reliable, may be a viable first step in the assessment and correction of OCR errors in invoices.
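To make the reported metric concrete, the following is a minimal sketch of how an F-score can be computed for a binary correct/incorrect classifier. It assumes the balanced F1 variant (the harmonic mean of precision and recall); the abstract does not state which variant was used. The function name `f_score` and the label lists are hypothetical and are not taken from the thesis.

```python
def f_score(y_true, y_pred, positive=1):
    """Balanced F-score (F1) for one class, assumed here to be the reported metric."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)  # true positives
    fp = sum(1 for t, p in pairs if t != positive and p == positive)  # false positives
    fn = sum(1 for t, p in pairs if t == positive and p != positive)  # false negatives
    if tp == 0:
        return 0.0  # avoids division by zero when nothing is correctly predicted
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical labels: 1 = OCR field judged correct, 0 = OCR field judged incorrect
y_true = [1, 1, 1, 0, 0, 1, 0, 1]
y_pred = [1, 1, 0, 0, 1, 1, 0, 1]

score = f_score(y_true, y_pred)
print(round(score, 3))
```

With these made-up labels there are four true positives, one false positive, and one false negative, so precision and recall are both 0.8 and the F-score is 0.8 as well.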
