Using Natural Language Processing to Identify Similar Patent Documents

This is a Master's thesis from Lund University/Department of Computer Science

Abstract: The search for prior art documents is an important but time-consuming task for the patent attorney. Today, these searches are carried out using keywords, which is problematic because inventions are often described in abstract and general terms in patent applications. In addition, synonyms must be taken into account and formulated manually. This creates a risk that relevant documents are overlooked. In this Master's thesis, we investigated the use of natural language processing (NLP) on a large database of patent applications. The aim was to create a tool that finds similar documents by comparing the title and abstract of a provided document with those of existing documents in the database, thus removing the need to extract keywords manually. We investigated several machine learning models that transform text into numerical representations, and applied them to the documents in the database. These models include a number of recent, pre-trained word embeddings and sentence embeddings. We also developed a web application that allows the user to search by patent application number or by a short text describing an invention. Cosine similarity was used to compare the numerical representations of documents. We also investigated clustering as a way to limit the search domain and speed up the process. Patent associates helped us evaluate the different models on a set of test queries. Among the models, Sentence-BERT (SBERT) performed best, reaching a mean average precision (MAP) of 0.7655 at finding relevant or very relevant documents.
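The two measures named in the abstract, cosine similarity for comparing document vectors and mean average precision for evaluating rankings, can be illustrated with a minimal sketch. This is not the thesis implementation: the function names, the toy two-dimensional vectors and the document IDs are hypothetical, and the average precision shown here is normalized by the total number of relevant documents, which is one common convention and may differ from the variant used in the thesis.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # assumption: treat an all-zero vector as dissimilar to everything
    return dot / (norm_u * norm_v)


def rank_documents(query_vec, doc_vecs):
    """Return (doc_id, similarity) pairs sorted from most to least similar."""
    scored = [(doc_id, cosine_similarity(query_vec, vec))
              for doc_id, vec in doc_vecs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


def average_precision(ranked_ids, relevant_ids):
    """Average of precision@k over the ranks k where a relevant document appears."""
    if not relevant_ids:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(relevant_ids)


def mean_average_precision(runs):
    """Mean of average precision over (ranked_ids, relevant_ids) query runs."""
    if not runs:
        return 0.0
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)


# Hypothetical toy embeddings; real sentence embeddings have hundreds of dimensions.
docs = {"A": [1.0, 0.0], "B": [0.0, 1.0], "C": [1.0, 1.0]}
ranking = rank_documents([1.0, 0.0], docs)
ranked_ids = [doc_id for doc_id, _ in ranking]
print(ranked_ids)
print(average_precision(ranked_ids, {"A", "B"}))
```

In a setting like the one described, the vectors would come from an embedding model applied to each title and abstract, and the relevance sets would come from human judgments such as those provided by the patent associates.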
