Self-Supervised Representation Learning for Content Based Image Retrieval

This is a Master's thesis from Linköping University / Statistics and Machine Learning

Abstract: Automotive technologies and fully autonomous driving have grown rapidly in recent years and have benefited from extensive deep learning research. State-of-the-art deep learning methods are largely supervised and require labelled data for training. However, annotating image data is time-consuming and costly in human effort. It is therefore of interest to find informative samples for labelling by Content Based Image Retrieval (CBIR). Generally, a CBIR method takes a query image as input and returns a set of images that are semantically similar to the query image. Retrieval is achieved by transforming images to feature representations in a latent space, where it is possible to reason about image similarity in terms of image content. In this thesis, a self-supervised method is developed to learn feature representations of road scene images. The method learns feature representations by adapting intermediate convolutional features from an existing deep Convolutional Neural Network (CNN). A contrastive approach based on Noise Contrastive Estimation (NCE) is used to train the feature learning model. For complex images like road scenes, where multiple image aspects can occur simultaneously, it is important to embed all the salient image aspects in the feature representation. To achieve this, the output feature representation is obtained as an ensemble of feature embeddings, each learned by focusing on different image aspects. An attention mechanism is incorporated to encourage each ensemble member to focus on a different image aspect. For comparison, a self-supervised model without attention is considered, and a simple dimensionality reduction approach using SVD is treated as the baseline. The methods are evaluated on nine evaluation datasets using CBIR performance metrics. The datasets correspond to different image aspects and concern the images at different spatial levels: global, semi-global and local. The feature representations learned by the self-supervised methods are shown to perform better than the SVD approach. Given that no labelled data is required for training, learning representations for road scene images using self-supervised methods appears to be a promising direction. The use of multiple query images to emphasize a query intention is investigated, and a clear improvement in CBIR performance is observed. It is inconclusive whether the addition of an attention mechanism impacts CBIR performance. The attention method shows some positive signs in qualitative analysis and also performs better than the other methods on one of the evaluation datasets containing a local aspect. This method for learning feature representations is promising but requires further research involving more diverse and complex image aspects.
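The retrieval step described in the abstract, including the use of several query images to emphasize a query intention, can be illustrated with a minimal sketch. Everything specific here is an assumption for illustration only: the three-dimensional toy embeddings, the image names, the use of cosine similarity, and the choice to combine multiple queries by averaging their embeddings. The abstract does not state how the thesis combines queries or which similarity measure it uses.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def retrieve(query_embeddings, database, k=3):
    """Return the names of the k database images most similar to the queries.

    Assumption: multiple queries are combined by averaging their embeddings.
    """
    dim = len(query_embeddings[0])
    count = len(query_embeddings)
    mean_query = [sum(q[i] for q in query_embeddings) / count for i in range(dim)]
    scored = [(cosine(mean_query, emb), name) for name, emb in database.items()]
    scored.sort(reverse=True)  # highest similarity first
    return [name for _, name in scored[:k]]


# Hypothetical embeddings; axes loosely stand for "highway", "city", "night".
database = {
    "highway_day": [0.9, 0.1, 0.0],
    "highway_night": [0.7, 0.0, 0.7],
    "city_day": [0.1, 0.9, 0.0],
    "city_night": [0.0, 0.7, 0.7],
}

# One query image: the result mixes the "highway" and "night" aspects.
single = retrieve([database["highway_night"]], database, k=2)

# Two query images that share only the "night" aspect.
multi = retrieve([database["highway_night"], database["city_night"]], database, k=2)

print(single)
print(multi)
```

In this toy setup, the single query ranks the other highway image second, while the two night-time queries together pull both night scenes to the top. This is the sense in which additional query images can sharpen the intended aspect.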
