Speaker diarization in challenging environments using deep networks: An evaluation of a state-of-the-art system

This is a Master's thesis from KTH/School of Electrical Engineering and Computer Science (EECS)

Author: Mathias Näreaho; [2023]


Abstract: Speaker diarization is the task of determining 'who spoke when' in an audio segment. Since the breakthrough of deep learning, speech technology has improved substantially across a wide range of metrics and fields, and speaker diarization is no exception. This thesis evaluates how a state-of-the-art speaker diarization system, pyannote, performs in more difficult acoustic environments, investigates how that performance can be improved, and discusses which acoustic environments are difficult to diarize. Pyannote initially struggled with heavily reverberant audio and with audio of considerably lower sound quality, such as phone calls. Fine-tuning, combined with a technique for augmenting the training data, greatly improved performance in the most difficult environments while leaving it largely unchanged in the easier ones, suggesting that pyannote is robust and can adapt to significant variations in the audio signal.
