Single image scene-depth estimation based on self-supervised deep learning : For perception in autonomous heavy duty vehicles

Detta är en Uppsats för yrkesexamina på avancerad nivå från Uppsala universitet/Avdelningen för visuell information och interaktion

Sammanfattning: Depth information is a vital component for perception of the 3D structure of vehicle's surroundings in the autonomous scenario. Ubiquity and relatively low cost of camera equipment make image-based depth estimation very attractive compared to employment of the specialised sensors. Classical image-based depth estimation approaches typically rely on multi-view geometry, requiring alignment and calibration between multiple image sources, which is both cumbersome and error-prone. In contrast, single images lack both temporal information and multi-view correspondences. Also, depth information is lost in projection from the 3D world to a 2D image during the image formation process, making single image depth estimation problem ill-posed. In recent years, Deep Learning-based approaches have been widely proposed for single image depth estimation. The problem is typically tackled in a supervised manner, requiring access to image data with pixel-wise depth information. Acquisition of large amounts of such data that is both varied and accurate, is a laborious and costly task. As an alternative, a number of self-supervised approaches exist showing that it is possible to train models performing single image depth estimation using synchronized stereo image-pairs or sequences of monocular images instead of depth labeled data. This thesis investigates the self-supervised approach utilizing sequences of monocular images, by training and evaluating one of the state-of-the-art methods on both the popular public KITTI dataset and the data of the host company, Scania. A number of extensions are implemented for the method of choice, namely addition of weak supervision with velocity data, employment of geometry consistency constraints and incorporation of a self-attention mechanism. Resulting models showed good depth estimation performance for major components of the scene, such as nearby roads and buildings, however struggled at further ranges, and with dynamic objects and thin structures. Minor qualitative and quantitative improvements in performance were observed with introduction of geometry consistency loss and mask, as well as the self-attention mechanism. Qualitative improvements included the models' enhanced ability to identify clearer object boundaries and better distinguish objects from their background. Geometry consistency loss also proved to be informative in low-texture regions of the image and resolved artifacting behaviour that was observed when training models on Scania's data. Incorporation of the supervision of predicted translations using velocity data has proved to be effective at enforcing the metric scale of the depth network's predictions. However, a risk of overfitting to such supervision was observed when training on Scania's data. In order to resolve this issue, velocity-supervised fine-tuning procedure is proposed as an alternative to velocity-supervised training from scratch, resolving the observed overfitting issue while still enabling the model to learn the metric scale. Proposed fine-tuning procedure was effective even when training models on the KITTI dataset, where no overfitting was observed, suggesting its general applicability.

  HÄR KAN DU HÄMTA UPPSATSEN I FULLTEXT. (följ länken till nästa sida)