Computer-Aided Optically Scanned Document Information Extraction System

This is a Master's thesis from Mittuniversitetet / Department of Information Systems and Technology

Abstract: This thesis introduces a computer-aided information extraction system for optically scanned documents. It extracts fields such as invoice number, issue date, and buyer from scanned documents to meet the needs of customs declaration companies, and writes the structured output to a relational database. First, a software architecture is designed for extracting information from scanned documents of diverse structure. In this system, each document is classified first; if its template is predefined in the system, the document is routed to template-based extraction to improve extraction performance. Second, an image enhancement method is proposed to improve image classification. The method aims to raise the accuracy of the neural network model by extracting template-related features and actively removing unrelated ones. Finally, the system is implemented in Python, a cross-platform language. It comprises three parts: a classification module, template-based extraction, and non-template extraction, each of which has an API and can be run independently. This makes the system flexible and easy to customize for future requirements. The system was evaluated on 445 real-world customs document images. The results show that non-template extraction provides support for diverse documents and that template-based extraction achieves high overall performance, indicating that the goal was largely achieved.
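The classify-then-route flow described in the abstract can be illustrated with a minimal Python sketch. The abstract does not publish the system's API, so every name here (`extract_document`, `ExtractionResult`, the `classify` callable, the extractor mapping) is hypothetical. The sketch also assumes that a document with no predefined template falls back to non-template extraction, which the abstract implies but does not state outright.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, Optional

# Hypothetical types: the thesis does not specify its interfaces.
Fields = Dict[str, str]
Extractor = Callable[[Any], Fields]


@dataclass
class ExtractionResult:
    route: str                  # "template" or "non-template"
    template_id: Optional[str]  # set only when a predefined template was used
    fields: Fields              # e.g. invoice number, issue date, buyer


def extract_document(
    image: Any,
    classify: Callable[[Any], Optional[str]],
    template_extractors: Dict[str, Extractor],
    fallback_extractor: Extractor,
) -> ExtractionResult:
    """Classify the document first, then pick an extraction route."""
    template_id = classify(image)

    # Use template-based extraction only if this template is predefined.
    extractor = None
    if template_id is not None:
        extractor = template_extractors.get(template_id)

    if extractor is not None:
        return ExtractionResult("template", template_id, extractor(image))

    # Assumed fallback: unknown layouts go to non-template extraction.
    return ExtractionResult("non-template", None, fallback_extractor(image))
```

Because the classifier and both extractors are passed in as plain callables, each part can be replaced or run on its own, which mirrors the independence of the three modules described above.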
