Object Detection via Contextual Information

Detta är en Master-uppsats från Linköpings universitet/Institutionen för systemteknik

Sammanfattning: Using computer vision to automatically process and understand images is becoming increasingly popular. One frequently used technique in this area is object detection, where the goal is to both localize and classify objects in images. Today's detection models are accurate, but there is still room for improvement. Most models process objects independently and do not take any contextual information into account in the classification step. This thesis will therefore investigate if a performance improvement can be achieved by classifying all objects jointly with the use of contextual information. An architecture that has the ability to learn relationships of this type of information is the transformer. To investigate what performance that can be achieved, a new architecture is constructed where the classification step is replaced by a transformer block. The model is trained and evaluated on document images and shows promising results with a mAP score of 87.29. This value is compared to a mAP of 88.19, which was achieved by the object detector, Mask R-CNN, that the new model is built upon.  Although the proposed model did not improve the performance, it comes with some benefits worth exploring further. By using contextual information the proposed model can eliminate the need for Non-Maximum Suppression, which can be seen as a benefit since it removes one hand-crafted process. Another benefit is that the model tends to learn relatively quickly and a single pass over the dataset seems sufficient. The model, however, comes with some drawbacks, including a longer inference time due to the increase in model parameters. The model predictions are also less secure than for Mask R-CNN. With some further investigation and optimization, these drawbacks could be reduced and the performance of the model be improved.

  HÄR KAN DU HÄMTA UPPSATSEN I FULLTEXT. (följ länken till nästa sida)