Are deep video architectures (also) biased toward texture rather than shape?

This is a Master's thesis from KTH/School of Electrical Engineering and Computer Science (EECS)

Author: Boyu Li; [2021]


Abstract: Convolutional neural networks (CNNs) have achieved high accuracy on several perceptual tasks, such as object recognition and action recognition. Given the significant impact of CNNs and the need to improve these models, interpretability is required. Geirhos et al. showed that ImageNet-trained CNNs are biased towards learning texture rather than shape, and that a shape-based representation brings the advantage of robustness to previously unseen image distortions. Inspired by their work, we extend it from object recognition to action recognition: we investigate whether the texture bias found for 2D CNNs is similarly present in video CNNs. Through experiments, we compare different models and different training datasets. We find that although Kinetics-trained, UCF101-finetuned I3D outperforms UCF101-trained TSN and UCF101-trained CNN + LSTM in exploiting both shape and texture cues, it shows the largest texture bias of the three models. Recurrent video models and TSN can reduce texture bias, possibly owing to their temporal modeling. We also find that Kinetics-trained Flow I3D exhibits a smaller texture bias than its RGB counterpart. Since model bias depends on the training dataset, a dataset with a small static representation bias can force models to rely more on the shape cue, reducing the tendency towards texture bias.
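The texture-vs-shape comparisons above rest on the cue-conflict evaluation introduced by Geirhos et al., where each stimulus carries the shape of one class and the texture of another, and bias is measured by which cue the model's prediction follows. Below is a minimal sketch of that metric; the function name and the toy labels are illustrative, not taken from the thesis.

```python
import numpy as np

def shape_bias(pred_labels, shape_labels, texture_labels):
    """Fraction of cue-conflict decisions made in favour of shape.

    Following Geirhos et al.'s definition: among predictions that match
    either the shape class or the texture class of a cue-conflict stimulus,
    shape bias = #shape-consistent / (#shape-consistent + #texture-consistent).
    Texture bias is then 1 - shape bias.
    """
    pred = np.asarray(pred_labels)
    shape = np.asarray(shape_labels)
    texture = np.asarray(texture_labels)

    shape_hits = np.sum(pred == shape)      # predictions following the shape cue
    texture_hits = np.sum(pred == texture)  # predictions following the texture cue
    decided = shape_hits + texture_hits
    if decided == 0:
        return float("nan")  # model matched neither cue on any stimulus
    return shape_hits / decided

# Toy example: 6 cue-conflict videos; the model follows texture on 4 of them.
preds    = [0, 1, 2, 2, 3, 3]
shapes   = [0, 1, 5, 5, 6, 6]
textures = [7, 8, 2, 2, 3, 3]
print(f"shape bias: {shape_bias(preds, shapes, textures):.2f}")  # 0.33
```

Restricting the denominator to cue-consistent predictions keeps the metric independent of overall accuracy, so models of different strength (e.g., I3D vs. CNN + LSTM) can be compared on bias alone.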
