Generative Adversarial Networks for Cross-Lingual Voice Conversion

This is a Master's thesis from KTH/School of Electrical Engineering and Computer Science (EECS)

Abstract: Speech synthesis is a technology that increasingly influences our daily lives, in the form of smart assistants, advanced translation systems and similar applications. This thesis explores voice conversion: making one person’s voice sound like someone else’s without altering the linguistic content of the speech. More specifically, a Cycle-Consistent Adversarial Network that has proven to work well in a monolingual setting is evaluated in a multilingual environment. The model is trained to convert voices between native speakers from the Nordic countries. No parallel, transcribed or aligned speech data is used in the experiments, forcing the model to focus on the raw audio signal. The goal of the thesis is to evaluate whether performance degrades in a multilingual environment compared to monolingual voice conversion, and to measure the size of any such performance drop. Performance is measured in terms of naturalness and speaker similarity between the generated speech and the target voice. For evaluation, listening tests are conducted, along with objective comparisons of the synthesized speech. The results show that voice conversion between a Swedish and a Norwegian speaker is possible, and that it can be performed without performance degradation compared to Swedish-to-Swedish conversion. Conversion between Finnish and Swedish speakers, and between Danish and Swedish speakers, on the other hand, shows a performance drop in the generated speech. Despite this decrease, the model produces fluent and clearly articulated converted speech in all experiments. These results are noteworthy, especially since the network is trained on less than 15 minutes of nonparallel speech data for each speaker.
This thesis opens up further areas of research, for instance investigating more languages, evaluating more recent Generative Adversarial Network architectures, and devoting more resources to tuning the hyperparameters to further optimize the model for multilingual voice conversion.
