Duplicate Detection and Text Classification on Simplified Technical English

This is a Master's thesis from Linköping University / Department of Computer and Information Science

Author: Max Lund; [2019]

Keywords: NLP; CNL; transformer models; LSTM; BERT; document embeddings; word embeddings; text classification; text clustering; transfer learning; machine learning

Abstract: This thesis investigates the most effective way to classify text labels and to cluster duplicate texts in technical documentation written in Simplified Technical English. For the classification task, pre-trained transformer language models (BERT) were tested against traditional methods such as tf-idf with cosine similarity (kNN) and SVMs. For duplicate detection, vector representations from pre-trained transformer and LSTM models were tested against tf-idf vectors, using the density-based clustering algorithms DBSCAN and HDBSCAN. The results show that traditional methods perform comparably to pre-trained models for classification, and that tf-idf vectors with a low distance threshold in DBSCAN are preferable for duplicate detection.
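To make the duplicate-detection finding concrete, the following is a minimal sketch of the general idea, not the thesis's implementation. It builds tf-idf vectors and groups texts whose cosine distance falls below a low threshold. It is a simplified stand-in for DBSCAN: with a minimum cluster size of two, DBSCAN's clusters coincide with the connected components of the graph that links texts closer than the threshold, and that is what the code computes. The function names, the whitespace tokenizer, the smoothed idf variant, the threshold of 0.2, and the example sentences are all assumptions made for illustration.

```python
import math
from collections import Counter


def tfidf_vectors(texts):
    """Return one L2-normalised tf-idf vector (dict: term -> weight) per text."""
    tokenized = [t.lower().split() for t in texts]  # naive whitespace tokenizer
    n = len(tokenized)
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        # Smoothed idf: one common variant, assumed here for illustration.
        vec = {term: count * (math.log((1 + n) / (1 + df[term])) + 1)
               for term, count in tf.items()}
        norm = math.sqrt(sum(w * w for w in vec.values())) or 1.0
        vectors.append({term: w / norm for term, w in vec.items()})
    return vectors


def cosine_distance(a, b):
    """Cosine distance between two L2-normalised sparse vectors."""
    if len(a) > len(b):
        a, b = b, a
    return 1.0 - sum(w * b.get(term, 0.0) for term, w in a.items())


def duplicate_groups(texts, eps=0.2):
    """Group indices of texts linked by chains of pairwise distances <= eps."""
    vectors = tfidf_vectors(texts)
    n = len(texts)
    parent = list(range(n))  # union-find forest over text indices

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if cosine_distance(vectors[i], vectors[j]) <= eps:
                parent[find(i)] = find(j)

    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    # Singletons correspond to DBSCAN "noise": texts with no near-duplicate.
    return sorted(g for g in groups.values() if len(g) > 1)


texts = [
    "Remove the access panel.",
    "Remove the access panel.",
    "Do not touch the hot surface.",
    "Install the new filter in the housing.",
]
groups = duplicate_groups(texts, eps=0.2)
print(groups)  # [[0, 1]]
```

A low threshold keeps only near-identical texts together; raising `eps` would start merging texts that merely share vocabulary. The pairwise loop is quadratic in the number of texts, so a real corpus would likely call for a library implementation with an index structure.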


