Incremental Re-tokenization in BPE-trained SentencePiece Models

This is a bachelor's thesis from Umeå universitet / Institutionen för datavetenskap (Department of Computing Science)

Abstract: This bachelor's thesis in Computer Science explores the efficiency of an incremental re-tokenization algorithm for BPE-trained SentencePiece models used in natural language processing. The thesis begins by underscoring the critical role of tokenization in NLP, particularly the complications that arise when already-tokenized text is modified. It then presents an incremental re-tokenization algorithm, detailing its development and evaluating its performance against full re-tokenization of the text. Experimental results show that the incremental approach is more time-efficient than full re-tokenization, with the advantage most pronounced on large texts. The gain is attributed to the algorithm's localized strategy, which restricts re-tokenization to the region of text around each modification. The thesis concludes that incremental re-tokenization could significantly improve the responsiveness and resource efficiency of text-based applications such as chatbots and virtual assistants. Future work may focus on predictive models that anticipate the impact of text changes on token stability, and on optimizing the algorithm for different text contexts.
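The abstract does not spell out the algorithm itself, but the localized strategy it describes can be sketched. The Python sketch below re-tokenizes only a padded window of tokens around an edit and splices the result into the existing token sequence; it is an illustration, not the thesis's implementation, and the names toy_tokenize, retokenize_incremental, and the pad parameter are all assumptions. In a real setting the tokenize callable would be sentencepiece's SentencePieceProcessor.encode(text, out_type=str), with its "▁" whitespace markers accounted for when mapping tokens to character offsets; here tokens are simply assumed to concatenate back to the text.

    # Minimal sketch of localized (incremental) re-tokenization.
    # Assumes tokens concatenate back to the original text.
    from typing import Callable, List

    def retokenize_incremental(
        tokenize: Callable[[str], List[str]],
        old_text: str,
        old_tokens: List[str],
        edit_pos: int,   # character offset where the edit starts
        deleted: int,    # number of characters removed at edit_pos
        inserted: str,   # replacement string
        pad: int = 8,    # extra context tokens re-tokenized on each side
    ) -> List[str]:
        new_text = old_text[:edit_pos] + inserted + old_text[edit_pos + deleted:]

        # Character span of each old token (tokens are contiguous).
        starts, pos = [], 0
        for tok in old_tokens:
            starts.append(pos)
            pos += len(tok)
        ends = starts[1:] + [len(old_text)]

        # First and last old tokens touching the edited span, widened by
        # `pad` so BPE merges crossing the edit boundary are re-examined.
        lo = next((i for i, e in enumerate(ends) if e > edit_pos),
                  len(old_tokens) - 1)
        hi = max(lo, next((i for i in range(len(old_tokens) - 1, -1, -1)
                           if starts[i] < edit_pos + deleted), lo))
        lo, hi = max(0, lo - pad), min(len(old_tokens) - 1, hi + pad)

        # Re-tokenize only the affected window of the new text, then splice
        # it between the untouched prefix and suffix of the old tokens.
        win_start = starts[lo]
        win_end_new = ends[hi] - deleted + len(inserted)
        return (old_tokens[:lo]
                + tokenize(new_text[win_start:win_end_new])
                + old_tokens[hi + 1:])

    # Stand-in for a trained BPE model: greedy longest match over a tiny vocab.
    VOCAB = sorted(["tok", "en", "iza", "tion", "re-", " "],
                   key=len, reverse=True)

    def toy_tokenize(s: str) -> List[str]:
        out, i = [], 0
        while i < len(s):
            piece = next((v for v in VOCAB if s.startswith(v, i)), s[i])
            out.append(piece)
            i += len(piece)
        return out

    if __name__ == "__main__":
        old = "incremental re-tokenization"
        toks = toy_tokenize(old)
        # Insert "fast " at character 12; only tokens near the edit are redone.
        new_toks = retokenize_incremental(toy_tokenize, old, toks, 12, 0, "fast ")
        assert new_toks == toy_tokenize("incremental fast re-tokenization")

The fixed pad keeps the sketch short; a more careful algorithm would instead grow the window until the new token stream re-synchronizes with the old one at a token boundary, guaranteeing the spliced result matches a full re-tokenization.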
