Text to Music Audio Generation using Latent Diffusion Model: A re-engineering of AudioLDM Model

This is a Master's thesis from KTH/School of Electrical Engineering and Computer Science (EECS)

Abstract: In the emerging field of audio generation with diffusion models, this project adapts the AudioLDM framework, originally designed for generating everyday sounds from text, to text-to-music audio generation. This shift addresses a gap in current audio diffusion models, which focus predominantly on everyday sounds. The motivation for this thesis stems from AudioLDM’s strong generative capabilities in producing everyday sounds from text descriptions; its application to music audio generation, however, remains underexplored. The thesis aims to modify AudioLDM’s architecture and training objective to suit the characteristics of musical audio. The re-engineering process involved two primary methods. First, a dataset was constructed by sourcing a variety of music audio samples from FMA: A Dataset For Music Analysis [1] and generating pseudo captions with a Large Language Model specialized in music captioning. This dataset served as the foundation for training the adapted model. Second, the text conditioning of the model’s diffusion backbone, a UNet architecture, was revised to incorporate both the CLAP encoder and the T5 text encoder. This dual-encoding method, coupled with a shift from the traditional noise prediction objective to the V-objective, aimed to improve the model’s ability to generate coherent and musically relevant audio. The effectiveness of these adaptations was validated through both subjective and objective evaluations. Compared to the original AudioLDM model, the adapted version produced higher-quality audio and a higher relevance between text prompts and generated music. This result not only demonstrates the feasibility of adapting AudioLDM for music generation but also opens new avenues for research and application in text-to-music audio synthesis.
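
The abstract does not spell out the V-objective, so the following is only a minimal sketch of the commonly used v-prediction parameterization, not the thesis's own implementation. It assumes a variance-preserving noise schedule with alpha_t^2 + sigma_t^2 = 1, under which the training target changes from the noise eps to v = alpha_t * eps - sigma_t * x0. The function names, the cosine schedule, and the array shapes are hypothetical choices made for illustration.

```python
import numpy as np

def v_target(x0, eps, alpha_t, sigma_t):
    """Training target under v-prediction: v = alpha_t * eps - sigma_t * x0."""
    return alpha_t * eps - sigma_t * x0

def recover_x0(x_t, v, alpha_t, sigma_t):
    """Recover the clean latent from v (valid when alpha_t^2 + sigma_t^2 = 1)."""
    return alpha_t * x_t - sigma_t * v

def recover_eps(x_t, v, alpha_t, sigma_t):
    """Recover the noise from v (valid when alpha_t^2 + sigma_t^2 = 1)."""
    return sigma_t * x_t + alpha_t * v

# Hypothetical example: a random "latent" and a cosine schedule at t = 0.3.
rng = np.random.default_rng(0)
x0 = rng.standard_normal((4, 8))    # stand-in for a clean audio latent
eps = rng.standard_normal((4, 8))   # Gaussian noise
t = 0.3
alpha_t = np.cos(0.5 * np.pi * t)   # assumed schedule, not from the thesis
sigma_t = np.sin(0.5 * np.pi * t)

x_t = alpha_t * x0 + sigma_t * eps  # forward diffusion step
v = v_target(x0, eps, alpha_t, sigma_t)

# With noise prediction the network regresses eps; with the V-objective
# it regresses v, from which both x0 and eps can be recovered exactly.
x0_hat = recover_x0(x_t, v, alpha_t, sigma_t)
eps_hat = recover_eps(x_t, v, alpha_t, sigma_t)
```

In this sketch a network trained with the V-objective would minimize the squared error between its output and `v` instead of `eps`; both the clean latent and the noise remain recoverable from a prediction of `v`.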
