Clustering SQL-queries using unsupervised machine learning

This is a thesis for a professional degree at advanced level from Luleå tekniska universitet/Institutionen för system- och rymdteknik

Abstract: Decerno has created a business system that uses Microsoft's Entity Framework (EF), an object-database mapper that can automatically generate SQL queries from code written in C#. Some of these queries have started to show a significant increase in response time, which requires further examination. The generated queries vary in length from 3 to around 2500 tokens, which makes it difficult to get an overview of which types of queries are consistently slow. This thesis examines the possibility of using neural networks based on the transformer model, in conjunction with the autoencoder, to create feature-rich embeddings from the SQL queries. The networks presented in this thesis are tasked with capturing the semantics of the SQL queries so that semantically similar queries are mapped close to one another in the latent feature space. To investigate the impact of embedding dimension, several transformer-based networks are constructed that compute embeddings of varying dimension. The dimensionality reduction algorithm UMAP is applied to the higher-dimensional embeddings so that the clustering algorithm DBSCAN can be applied successfully. The results show that unsupervised machine learning can be used to create feature-rich embeddings from SQL queries, but that higher-dimensional embeddings are required, as the models that encoded the SQL queries into embeddings with 5 dimensions or fewer did not yield satisfactory results. Thus, some form of dimensionality reduction algorithm is required when adopting the method proposed in this thesis. Furthermore, the results did not indicate any correlation between semantic similarity and average response times.
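
To illustrate only the final clustering step, the following is a minimal pure-Python sketch of DBSCAN applied to a few made-up 2-D points that stand in for dimensionality-reduced query embeddings. The points and the `eps` and `min_samples` values are assumptions chosen for illustration; the abstract does not state the implementation or parameters used in the thesis, and a library implementation would normally be used in practice.

```python
from math import dist

NOISE = -1  # label for points that belong to no cluster


def region_query(points, i, eps):
    """Return indices of all points within eps of points[i], including i."""
    return [j for j, q in enumerate(points) if dist(points[i], q) <= eps]


def dbscan(points, eps, min_samples):
    """Return one cluster label per point; NOISE (-1) marks outliers."""
    labels = [None] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue  # already assigned to a cluster or marked as noise
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_samples:
            labels[i] = NOISE  # not a core point; may become a border point later
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in neighbors if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: joins the cluster, not expanded
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_samples:
                queue.extend(j_neighbors)  # j is a core point, keep expanding
    return labels


# Hypothetical 2-D embeddings: two tight groups and one isolated point.
embeddings = [
    (0.0, 0.0), (0.1, 0.0), (0.0, 0.1),
    (5.0, 5.0), (5.1, 5.0), (5.0, 5.1),
    (10.0, 0.0),
]
labels = dbscan(embeddings, eps=0.5, min_samples=3)
print(labels)
```

In this toy setting the two tight groups receive separate cluster labels and the isolated point is labelled as noise, which mirrors the idea that semantically similar queries should end up close together in the embedding space.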
