Performance analysis of on-device streaming speech recognition

This is a Master's thesis from KTH/School of Electrical Engineering and Computer Science (EECS)

Abstract: Speech recognition is the task in which a machine transcribes human speech into written text. Recent advances in deep learning research have driven groundbreaking progress in speech recognition, improving both key metrics of the task: accuracy and speed. Traditional speech recognition systems listen to and analyse the full speech utterance before making an output prediction. Streaming speech recognition, on the other hand, makes predictions in real time, word by word, as speech is received. However, the improved speed of streaming speech recognition comes at the cost of reduced accuracy, because the system does not have access to the full speech utterance at all times. In this thesis, we investigate the accuracy of streaming speech recognition systems by implementing models with state-of-the-art Transformer-based architectures. Our results show that two similar models, one non-streaming and the other streaming, trained on a 100-hour subset of LibriSpeech, achieve word error rates of 9.99% and 10.76%, respectively, on test-clean without using a language model. This puts the cost of streaming at a 7.2% relative accuracy degradation. Furthermore, the streaming models can be used “on-device”, which has many benefits, including lower inference time, privacy preservation, and the ability to operate without an internet connection.
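The word error rate and the degradation figure quoted in the abstract can be illustrated with a minimal sketch. It assumes the standard definition of word error rate (word-level edit distance divided by the number of reference words) and assumes that 10.76% belongs to the streaming model. The function name `word_error_rate` and the example sentences are hypothetical and are not taken from the thesis. Under these assumptions, the reported 7.2% appears consistent with taking the gap relative to the streaming word error rate; taking it relative to the non-streaming rate would give roughly 7.7%.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] is the edit distance between the reference prefix seen so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1      # reference word missing from the hypothesis
            insertion = curr[j - 1] + 1  # extra word in the hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# Word error rates quoted in the abstract, in percent.
# Assumption: the higher value belongs to the streaming model.
wer_non_streaming = 9.99
wer_streaming = 10.76

gap = wer_streaming - wer_non_streaming
relative_to_streaming = 100 * gap / wer_streaming
relative_to_non_streaming = 100 * gap / wer_non_streaming

if __name__ == "__main__":
    # Hypothetical sentences: one substitution and one deletion over six reference words.
    print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
    print(round(relative_to_streaming, 1), round(relative_to_non_streaming, 1))
```

The choice of denominator matters when comparing such figures across papers, so it is worth stating explicitly which baseline a relative degradation refers to.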
