Where to Fuse

This is a Master's thesis from Lund University / Mathematical Statistics

Author: Lukas Petersson; [2024]

Keywords: Technology and Engineering;

Abstract: This thesis investigates fusion techniques in multimodal transformer models, focusing on enhancing the capabilities of large language models in understanding not only text but also other modalities such as images, audio, and sensor data. The study compares late fusion (concatenating modality tokens after separate encoding) and early fusion (concatenating before encoding), examining their respective advantages and disadvantages. It then examines a mid-fusion approach that aims to combine the strengths of both methods. The effectiveness of this approach is evaluated in terms of accuracy and computational cost on the Visual Question Answering (VQA) task. Using a pretrained T5 model, the research injects image tokens (computed by a Vision Transformer, ViT) into the intermediate activations of the model. The findings indicate that standard early fusion techniques underperform with larger decoders, while late fusion with a smaller decoder yields the best results on the VQA task. This conclusion also extends to pooled modality tokens. Additionally, the thesis includes a comprehensive literature study, identifying benchmark datasets for video understanding in multimodal learning and highlighting datasets that demand a robust understanding of all involved modalities. This research contributes to the field by exploring and validating a novel fusion technique in multimodal learning, offering insights into its practical applications and limitations.
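The three fusion strategies described above differ only in *where* the modality tokens are concatenated along the sequence axis. The following is a minimal shape-level sketch of that difference; the identity stand-in for an encoder layer and all dimensions are illustrative assumptions, not details taken from the thesis.

```python
import numpy as np

# Toy dimensions (assumptions, not from the thesis)
d = 8                    # shared embedding width
n_text, n_img = 5, 3     # token counts per modality

rng = np.random.default_rng(0)
text_tokens = rng.random((n_text, d))
image_tokens = rng.random((n_img, d))   # e.g. ViT patch embeddings

def encoder_layer(x):
    # stand-in for a transformer encoder layer (identity here;
    # a real layer would be self-attention + feed-forward)
    return x

# Early fusion: concatenate BEFORE any encoding
early = encoder_layer(np.concatenate([text_tokens, image_tokens], axis=0))

# Late fusion: encode each modality separately, concatenate AFTER
late = np.concatenate([encoder_layer(text_tokens),
                       encoder_layer(image_tokens)], axis=0)

# Mid fusion: encode text through early layers, then inject image
# tokens into the intermediate activations before the later layers
h = encoder_layer(text_tokens)                                   # early text-only layers
mid = encoder_layer(np.concatenate([h, image_tokens], axis=0))   # later joint layers

print(early.shape, late.shape, mid.shape)  # each (8, 8): 5 text + 3 image tokens
```

All three produce a joint sequence of `n_text + n_img` tokens; what varies is how many encoder layers see the modalities together, which drives the accuracy/compute trade-off the thesis measures.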
