A study of the temporal resolution in lipreading using event vision

This is a Bachelor's thesis from KTH/School of Electrical Engineering and Computer Science (EECS)

Authors: Axel Alness Borg; Marcus Enström; [2020]


Abstract: Automatically analysing visual features of the lips to extract spoken words amounts to finding patterns in movements, which is why previous research has applied machine learning to this problem. That research has used conventional frame-based cameras with good results. Classifying visual features is computationally expensive, so capturing just enough information can be important. Event cameras are a type of camera inspired by human vision: they capture only changes in the scene and offer very high temporal resolution. In this report we investigate how important temporal resolution is for lipreading and whether event cameras can be used for the task. As the frame rate increases, accuracy initially rises, peaks at a maximum, and then declines. We therefore conclude that, when using a frame-based representation of event data, increasing the temporal resolution does not necessarily increase classification accuracy. It is difficult to be certain about this conclusion, however, because many other factors could affect accuracy, such as a higher temporal resolution requiring a larger dataset, and the parameters of the neural network used.
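To make the "frame-based representation of event data" concrete, the sketch below (not taken from the thesis; function name, array layout, and units are assumptions for illustration) bins an event stream of (x, y, timestamp, polarity) tuples into fixed-duration frames, so the chosen frame rate directly sets the temporal resolution of the resulting clip.

```python
# Minimal sketch of binning event-camera data into frames at a given frame rate.
# Assumptions: events are (x, y, t_us, polarity) with polarity in {-1, +1} and
# timestamps in microseconds; these details are illustrative, not from the thesis.
import numpy as np

def events_to_frames(events, sensor_size, frame_rate_hz):
    """Accumulate signed event counts into per-frame images."""
    height, width = sensor_size
    t = events[:, 2]
    bin_us = 1e6 / frame_rate_hz                       # frame duration in microseconds
    n_frames = int(np.ceil((t.max() - t.min()) / bin_us)) + 1
    frames = np.zeros((n_frames, height, width), dtype=np.float32)

    frame_idx = ((t - t.min()) / bin_us).astype(int)   # which frame each event falls into
    x = events[:, 0].astype(int)
    y = events[:, 1].astype(int)
    p = events[:, 3]
    np.add.at(frames, (frame_idx, y, x), p)            # sum event polarities per pixel per frame
    return frames
```

A higher frame_rate_hz yields more, sparser frames (finer temporal resolution), while a lower rate aggregates more events per frame; this trade-off is what the accuracy trend described in the abstract is varying.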
