Deep Learning Approaches for Clustering Source Code by Functionality

Detta är en Master-uppsats från KTH/Matematisk statistik

Sammanfattning: With the rise of artificial intelligence, applications for machine learning can be found in nearly everyaspect of modern life, from healthcare and transportation to software services like recommendationsystems. Consequently, there are now more developers engaged in the field than ever - with the numberof implementations rapidly increasing by the day. In order to meet the new demands, it would be usefulto provide services that allow for an easy orchestration of a large number of repositories. Enabling usersto easily share, access and search for source code would be beneficial for both research and industryalike. A first step towards this is to find methods for clustering source code by functionality. The problem of clustering source code has previously been studied in the literature. However, theproposed methods have so far not leveraged the capabilities of deep neural networks (DNN). In thiswork, we investigate the possibility of using DNNs to learn embeddings of source code for the purpose ofclustering by functionality. In particular, we evaluate embeddings from Code2Vec and cuBERT modelsfor this specific purpose. From the results of our work we conclude that both Code2Vec and cuBERT are capable of learningsuch embeddings. Among the different frameworks that we used to fine-tune cuBERT, we found thebest performance for this task when fine-tuning the model under the triplet loss criterion. With thisframework, the model was capable of learning embeddings that yielded the most compact and well-separated clusters. We found that a majority of the cluster assignments were semantically coherent withrespect to the functionalities implemented by the methods. With these results, we have found evidenceindicating that it is possible to learn embeddings of source code that encode the functional similaritiesamong the methods. Future research could therefore aim to further investigate the possible applicationsof the embeddings learned by the different frameworks.

  HÄR KAN DU HÄMTA UPPSATSEN I FULLTEXT. (följ länken till nästa sida)