Grapheme-to-phoneme transcription of English words in Icelandic text

This is a Master's thesis from Uppsala University, Department of Linguistics and Philology

Abstract: Foreign words, such as names, locations or sometimes entire phrases, are a problem for any system that is meant to convert graphemes to phonemes (g2p, i.e. converting written text into phonetic transcription). In this thesis, we investigate both rule-based and neural methods of phonetically transcribing English words found in Icelandic text, taking into account the rules and constraints that govern how foreign phonemes can be mapped into Icelandic phonology. We implement a rule-based system by compiling grammars into finite-state transducers. In deciding which rules to include, and in evaluating their coverage, we use a list of the most frequently found English words in a corpus of Icelandic text. The output of the rule-based system is then manually evaluated, corrected where needed, and subsequently used as data to train a simple bidirectional LSTM g2p model. We train models both with and without length and stress labels included in the gold annotated data. Although neither model's scores are close to the state of the art for either Icelandic or English, both our rule-based system and our LSTM model show promising initial results and improve on the baseline of simply applying an Icelandic g2p model, rule-based or neural, to English words. We find that the greater flexibility of the LSTM model seems to give it an advantage over our rule-based system when it comes to modeling certain phenomena. Most notable is the LSTM’s ability to more accurately transcribe relations between graphemes and phonemes for English vowel sounds. Because little previous work exists on g2p transcription that specifically handles English words within Icelandic phonological constraints, and the task remains unsolved, our findings provide a foundation for further research and contribute to improving g2p systems for Icelandic as a whole.
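To give a rough idea of what a rule-based g2p step looks like, the following minimal sketch applies grapheme-to-phoneme rules left to right, always preferring the longest matching grapheme sequence. It is not the thesis's system: a plain Python dictionary with greedy longest-match lookup stands in for the compiled finite-state transducers, and the rule table, the name `transcribe` and the mappings themselves are hypothetical placeholders chosen only for illustration.

```python
# Hypothetical grapheme-to-phoneme rules, for illustration only.
# They are NOT the rules used in the thesis.
RULES = {
    "sh": "s",
    "ee": "i",
    "oo": "u",
    "w": "v",
}

# Longest grapheme sequence that any rule can match.
MAX_LEN = max(len(grapheme) for grapheme in RULES)


def transcribe(word):
    """Map a word to a list of phoneme symbols using greedy longest match."""
    word = word.lower()
    phonemes = []
    i = 0
    while i < len(word):
        # Try the longest candidate first, then shorter ones.
        for size in range(min(MAX_LEN, len(word) - i), 0, -1):
            chunk = word[i:i + size]
            if chunk in RULES:
                phonemes.append(RULES[chunk])
                i += size
                break
        else:
            # No rule matched: pass the grapheme through unchanged.
            phonemes.append(word[i])
            i += 1
    return phonemes


print(transcribe("week"))   # ['v', 'i', 'k']
print(transcribe("shoot"))  # ['s', 'u', 't']
```

A real transducer-based grammar would additionally handle context-dependent rules and, where annotated, length and stress labels, none of which this sketch attempts.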
