Speech activity detection in videos

This is an advanced-level professional degree thesis from Lunds universitet/Institutionen för reglerteknik

Authors: Viktor Andersson; Nelly Ostréus; [2022]

Keywords: Technology and Engineering

Abstract: Speech is an important means of communication all over the world. Speech information is encoded both aurally and visually. More than 1.5 billion people have hearing loss, and for them the visual information is even more important than for people with normal hearing. Lip reading is therefore an important research topic. In this master's thesis, machine learning algorithms were used to identify speech activity in realistic videos with monologues and dialogues. Each video contained three persons speaking: one performing a monologue and two performing a dialogue. Support vector machines with linear, radial basis function, sigmoid and polynomial kernels were used to classify the audio as either speech or non-speech based on faces from realistic videos. A speech envelope was calculated and resampled to 4 Hz. The ground truth was created by thresholding the envelope, so that each audio data point was labelled as either speech or non-speech. Convolutional neural networks using max-margin object detection were used to extract facial landmarks from the videos. Six video features were calculated and used: the mouth opening distance, the variance of the mouth opening distance, the difference in mouth opening distance between several frames, the mouth area, the variance of the mouth area, and the difference in mouth area between several frames. The mean accuracy for speech activity in the monologues was low. This was probably due to the unbalanced data in the monologues, since most data points in the ground truth were labelled as speech. For the dialogues, the accuracy was slightly higher than that obtained by classifying everything as the most frequent class. The variance of the mouth area was the best-performing feature. Performance varied between the videos, and combining the best mouth opening distance feature with the best mouth area feature for the two best kernels increased the accuracy for the best-performing videos.
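The labelling and feature steps named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the threshold value, the window length, the frame lag and the landmark ordering are all assumptions made for illustration, since the abstract does not state them. The sketch assumes lip landmarks are already available as (x, y) pairs and leaves out the landmark extraction and the support vector machine classifier.

```python
import math
from statistics import pvariance


def label_speech(envelope, threshold):
    """Ground truth by thresholding: 1 (speech) if the envelope sample
    exceeds the threshold, else 0 (non-speech). Threshold is an assumption."""
    return [1 if value > threshold else 0 for value in envelope]


def mouth_opening(upper_lip, lower_lip):
    """Euclidean distance between one upper-lip and one lower-lip landmark."""
    return math.dist(upper_lip, lower_lip)


def mouth_area(points):
    """Area of the polygon spanned by lip landmarks (shoelace formula).
    Assumes the points are ordered around the mouth contour."""
    total = 0.0
    for i in range(len(points)):
        x1, y1 = points[i]
        x2, y2 = points[(i + 1) % len(points)]
        total += x1 * y2 - x2 * y1
    return abs(total) / 2.0


def sliding_variance(values, window):
    """Population variance over a trailing window of up to `window` frames.
    The window length is a hypothetical parameter."""
    return [
        pvariance(values[max(0, i - window + 1): i + 1])
        for i in range(len(values))
    ]


def frame_difference(values, lag):
    """Difference between each value and the value `lag` frames earlier.
    The first `lag` frames have no predecessor and are set to 0.0."""
    return [
        values[i] - values[i - lag] if i >= lag else 0.0
        for i in range(len(values))
    ]


def majority_baseline(labels):
    """Accuracy of always predicting the most frequent class."""
    ones = sum(labels)
    return max(ones, len(labels) - ones) / len(labels)
```

The same `sliding_variance` and `frame_difference` helpers apply to either a sequence of mouth opening distances or a sequence of mouth areas, which yields the six features listed in the abstract. `majority_baseline` gives the reference accuracy against which the dialogue results are compared: on unbalanced labels it is already high, which is why a classifier can look accurate while learning little.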
